Why Amazon Sometimes Screws Up (and how to deal with it)

Before I quit my day job, I worked in Information Technology (IT). I got my degree in Computer Science back when terminals were still a fairly new concept, and I worked in IT my entire career. I worked for a bank, a law firm, a startup software company, and an insurance company, and all my jobs involved dealing with users who were having problems.

As a result, I have both more sympathy for Amazon’s screw-ups and a better understanding of what’s involved in fixing them than the average author probably does. Maybe my experience can help other authors understand what’s going on behind the scenes.

Individual problems

The first person you’re going to come in contact with is a Help Desk worker. This is the very bottom of the corporate IT ladder, and it’s no longer a job that requires any technical knowledge whatsoever. Help Desk workers are given a script and a logic tree for troubleshooting issues. So, for example, if you say “I can’t log in” they ask “Is your caps lock on?”

It doesn’t matter if you’ve already explained in great detail what’s going on with your login ID. They have to start where they have to start and go through all the steps on their script in order. The troubleshooting tree is based on percentages. It eliminates the most common problems first, and it assumes you’re the one who caused the problem because (sorry) most of the time that’s true.
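For the programmers in the audience, here is a rough Python sketch of what "based on percentages" means. Everything in it is an assumption made for illustration: the `Step` and `run_script` names, the questions other than the caps lock one, and the percentages are all invented, and nothing here reflects Amazon's actual script.

```python
from dataclasses import dataclass


@dataclass
class Step:
    question: str
    share_of_calls: float  # invented fraction of calls this cause explains


# Hypothetical script entries; the percentages are made up for illustration.
SCRIPT = [
    Step("Has your password expired?", 0.15),
    Step("Is your caps lock on?", 0.40),
    Step("Are you using the right login ID?", 0.25),
]


def run_script(steps, fixed_by=None):
    """Ask every question in order, most common cause first.

    fixed_by is the question that actually resolves the call, or None
    if nothing on the script helps. Returns (questions_asked, outcome).
    """
    asked = []
    for step in sorted(steps, key=lambda s: s.share_of_calls, reverse=True):
        asked.append(step.question)
        if step.question == fixed_by:
            return asked, "closed"
    # Only after the whole checklist is ticked off does the call move up.
    return asked, "escalate to level 2"
```

Calling `run_script(SCRIPT)` with no fix walks all three questions, starting with the caps lock one, before it escalates. That is why explaining your problem up front doesn't let you skip ahead.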

Help Desk workers are evaluated on how quickly they close problems. “Close” doesn’t necessarily mean solve, so their goal is to get you to go away, happy or not. A few tips for surviving this phase:

  1. Call when you have the time, information, and patience to go through all the steps. Otherwise you’ll just have to start all over again next time.
  2. Don’t waste time arguing. Do each instruction as it’s given to you because you can’t get to the next phase of troubleshooting until they tick off everything on their checklist. And you might be surprised to discover that one of their steps actually works, no matter how sure you were that it wouldn’t.
  3. Be patient. This is the job they get paid to do, and this is an appropriate level of knowledge for them to have.

If you reach the end of the Help Desk script, you’ll be put in another queue for the next level of tech help. At this level, you should be dealing with someone who has some actual knowledge. You’ll likely notice the difference in the responses you get. Level 2 rarely works by phone because these are busier, more expensive people and because they need time to research your problem.

Occasionally, your problem may be tricky enough that it has to go up another level, so again, be patient. Each level gets paid according to their knowledge.

System-wide problems

Here’s where it gets thorny: your problem isn’t just your problem. Something major has gone wrong. It’s frustrating to see individual authors be told there’s nothing wrong (or be told that whatever is wrong is their fault or is specific to them) when we can see from the outside that there’s a system-wide problem on Amazon’s side. This is when people get indignant. How dare Amazon cause this problem? How dare they not know they caused it? How dare they not fix it immediately?

I get it, I do. But having been on the other side, I understand that Amazon’s tech issues aren’t unique, unexpected, or the result of not caring about authors. Here’s an illustration of how major bugs get introduced without being recognized:

One day, my partner used a ShopVac to clean up a mess he’d made in the basement. A few days later it rained, and our basement flooded. Would you guess that these two events were related? To plug in the ShopVac, my partner unplugged a ratty extension cord (the ancient kind that’s not even grounded), which, it turned out, was powering the sump pump that lived in a hole clear on the other side of the basement. Not a great design, obviously, and now we know, but when your basement is filling with water, it takes a while to realize that it’s because you unplugged an extension cord three days ago.

That’s how these disasters happen at Amazon (and other major corporations). One group of IT people is working on a project to improve some part of their gigantic system. They do a lot of testing to make sure their change works, BUT it turns out their change is affecting some other part of Amazon’s system in a way no one realized.

When the first calls come in, they’re treated like “I forgot my password” calls. The Help Desk workers taking these calls don’t realize they’ve got a bigger problem going on. Help Desks usually have a dashboard that shows them which systems have changed recently or what known issues are being worked on, but you’re calling about a flood and their dashboard only says something about an unsafe extension cord. And although to us it seems like everyone we know is having this same issue, to the Help Desk there’s only a minor uptick in a certain type of complaint amongst all the other complaints they’re dealing with.

As a user group, we need to help the Help Desk recognize that they’ve got a system-wide problem by mentioning that we know a lot of other people who’re having the same problem. You can also ask the technician to search their database and see if there are other open incidents that sound similar to what you’re reporting. If the technician links your call to other open calls, the priority will be raised. That means it does help for you to call, even if a lot of other people already have. The more calls that are lumped under a single incident number, the more attention that incident gets. Just remember that the first person you deal with doesn’t have the power or knowledge to do more than log a ticket. Be patient and kind.
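Here is a minimal Python sketch of why one more call matters. It assumes a ticketing system that raises an incident's priority as more calls are linked to it. The `IncidentBoard` class, the incident ID, and the thresholds are hypothetical; I don't know what numbers or tools Amazon actually uses.

```python
from collections import defaultdict

# Hypothetical thresholds: linked-call counts at which priority goes up.
PRIORITY_THRESHOLDS = [(50, "critical"), (10, "high"), (3, "medium")]


class IncidentBoard:
    def __init__(self):
        # incident id -> list of callers linked to that incident
        self.calls = defaultdict(list)

    def link_call(self, incident_id, caller):
        """Attach one more call to an incident and return its new priority."""
        self.calls[incident_id].append(caller)
        return self.priority(incident_id)

    def priority(self, incident_id):
        """Priority depends only on how many calls are linked."""
        count = len(self.calls.get(incident_id, []))
        for minimum, label in PRIORITY_THRESHOLDS:
            if count >= minimum:
                return label
        return "low"
```

In this sketch, two linked calls leave an incident at "low" and the third moves it to "medium". A call that gets logged as its own separate ticket moves nothing, which is why it's worth asking the technician to look for similar open incidents.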

Once Amazon has acknowledged that they have a system-wide issue, why isn’t it solved more quickly? Well, realizing that everyone’s basement flooded at the same time is an important first step, but the engineers still have to trace the problem back to that unplugged extension cord. In a system as complex as Amazon’s, that can take time. And then they have to figure out what to do about it. Should the sump pump be running off a two-prong extension cord on a non-GFI outlet halfway across the house? Clearly not. Sometimes the change that caused the problem can’t just be undone. Sometimes a whole new solution has to be crafted. And that takes time.

We have this idea that Amazon is big and should therefore be perfect, but I can tell you from 30 years of experience that big isn’t perfect. Big is SLOW. And big is careful, despite what you may think, which means that fixes have to proceed through that same rigorous testing that should’ve caught the problem before it went live. Amazon can’t just send someone down to the basement to plug the extension cord back in.

No IT department has a hundred percent success rate for rolling out changes without having unexpected impacts, but I promise you they’re in enough trouble for that mistake without us adding fuel to the fire. The people who caused the problem feel awful about it, and the people trying to solve it are working as fast as they can. Our role is to be clear and considerate and remember that sometimes we screw things up too.

One comment

  1. You forgot to say (and I’m *sure* it was an oversight) in HelpDesk Tip 3 that you should only ring the Help Desk if you have a gin and tonic to hand, and no shortage of either in the house. I once ordered an album I had fallen in love with through a Help Line “hold” muzak, after I’d been on hold for an hour. “Noah and the Whale”, since you ask.
    Good article.
