• Tuesday, November 25
Despite the jubilant mood, everyone knows they’re a long way from being fully prepared for the Black Friday promotions. As Maggie said, the plan is to do a trial run against a small subset of customers to test their readiness for the full-scale campaign on Friday—so at eleven a.m. they will conduct a campaign to just one percent of their customers. They’re doing this in the middle of the day, when everyone will already be in the office and able to quickly respond to emergencies. This will help them find vulnerabilities and weaknesses in the process so they can fix them before Friday.
To Maxine, this decision alone shows how much has changed in the organization. A couple of months ago, they would not have conducted any trials. And they would surely have scheduled the campaign to start at midnight, requiring the teams to be in the office throughout the entire night.
At nine a.m. everyone is in the war room furiously dealing with last-minute details in preparation for the one percent mini-launch. The Orca team is still fine-tuning the customer offers. Maxine is a little alarmed to learn that they’re still deciding which one percent of the customer base they’re targeting—but if they aren’t panicking, she won’t panic either. They’ve earned that level of trust over the last several weeks.
Even though they’re sending an email to only one percent of their customer mailing list, the stakes are still huge. They’ll be sending nearly one hundred thousand emails to all the persona profiles, not just the Meticulous Maintainers and the Catastrophic Late Maintainers, to learn how each segment responds.
Countless things could still go wrong. If the response rate isn’t in the same ballpark as in their early experiments, all the hopes and dreams riding on the Unicorn Project will be dashed. If they promote the wrong items, or if those items are not in stock, or if they screw up the fulfillment, they will anger their customers.
This campaign represents many firsts for Parts Unlimited. It is the first time that emails will open up the mobile app if they are being read on a mobile phone. It’s the first time they’re presenting promotions through the app—people with the app installed will get a notification about this limited-time offer, which the Promotions team believes will have higher response rates than even their carefully designed emails.
Over the last week, they’ve been continuously performing experiments in their mobile app, zeroing in on what maximizes conversion rates, such as presenting promoted items differently, using different pictures, picture sizes, typography, and copywriting. Those lessons and learnings were then considered for the email campaigns too.
The results of all these experiments were being poured back into Panther to guide the next round of experiments and trials, along with all customer activity within the app. It was a lot of data, but it had the Analytics team salivating for more. People’s appreciation for the Panther data platform kept growing.
The mobile app team has also been working around the clock to make sure that things display properly and that the buttons actually do what they are supposed to, but they are also trying to streamline the purchasing process as much as possible. Noticing that many customers dropped off when prompted for a credit card, they licensed some technology that lets customers enter this information using their phone camera, and they added different payment options like PayPal and Apple Pay in the hopes that one of these might reduce order abandonment rates.
The big gamble is that all this investment in their mobile app will result in significantly higher sales than just using the mobile phone browser. It’s a gamble, but a well-informed gamble, made by an organization that is obviously and constantly learning.
But preparation and practice time is over; now it’s game time, Maxine thinks. She sees many of the technology teams starting to assemble, but the Narwhal data team is already huddled around their screens, going through checklists and whispering back and forth, making sure that everything can handle the traffic they’re expecting. Over the last week, Brent and his team have been stress-testing the entire system, routinely causing parts of the technology landscape to blow up. And then in a blameless post-mortem, they’d all work together to figure out how to fix things so that they’ll survive the actual launch.
These “Chaos Engineering” exercises led to some surprising things breaking. But everyone has been working diligently, trying to ensure that they are as prepared as they can be for the big launch event. A few days ago, a small test run of the offer generation process kept crashing because they didn’t increase the limits for an external service they used. They had gotten in the habit of scaling everything down to save on costs, and someone had forgotten to scale it up before the test.
We still have so much to learn before we’re experts at this, Maxine thinks.
At times, it’s difficult to know who works on which team, because people are moving so fluidly between them. When everyone knows what the goals are, as Erik predicted, teams will self-organize to best achieve those goals. To Maxine, it’s been amazing to see how people are acting and reacting to each other, especially when compared to the big Phoenix launch two months ago. People across different disciplines—Dev, QA, Ops, Security, and now even Data and Analytics—are working together daily as fellow teammates instead of adversaries. They are working toward a common goal. They realize that they are on a journey of learning and exploration, and that making mistakes is inevitable. Creating ever-safer systems and continual improvement is now viewed as part of daily work.
This is worthy of the Third Ideal of Improvement of Daily Work that Erik painted many weeks ago, Maxine thinks.
Thanks to the pioneering work by Data Hub, code is now being promoted into production multiple times per day, smoothly, quickly, and mostly without incident, with any issues being resolved quickly and without blame or undue crisis. Even now, Maxine sees that there are production deployments happening, as teams are pushing out last-minute changes to ensure the success of the mini-launch.
Twenty minutes ago, someone noticed that one API was returning a bunch of “500” HTTP errors. Apparently, yesterday, someone had committed a code change that accidentally misclassified “400” user-caused errors as “500” server-caused errors. Wes pulled together a huddle, and Maxine was astonished when Wes recommended pushing out a fix, even though it was less than an hour before the mini-launch.
“If we don’t fix it, these errors could potentially hide important signals if we have a real outage,” he said. “We’ve proven repeatedly that we can push out these one-line changes safely.”
The best part was it was a developer who detected the error and who pushed out the fix. We finally trust developers, she thinks. If someone had told her a month ago that Wes would support something like this, she would have never believed it.
And best of all, Maxine’s worst fears about developers going amok and ruining the integrity of the data in the Narwhal platform never materialized. Left to their own devices, development teams will often optimize everything around themselves. This is just the parochial and selfish nature of individual teams. And that’s why you need architects, thinks Maxine.
Because they provided access to the data through versioned APIs, things remained very controlled and teams were able to keep working independently. Maxine is not just relieved—she’s elated. They designed these platforms to optimize for the system as a whole and ensure the safety and security of the entire organization.
“Sending emails and notifications to the mobile apps in 3, 2, 1 … now. Here we go, everyone,” says the marketing launch coordinator in a calm voice. Maxine looks at her watch. It’s 11:12 a.m. Emails and mobile app notifications are now going out to a hundred thousand customers.
The launch is starting twelve minutes late because of a couple unforeseen issues—a configuration problem was found in the Narwhal systems and someone noticed that there were too many email addresses in the campaign, requiring a recalculation and regeneration of the email list in Panther. Maxine gave a thumbs-up to Shannon when the Unicorn teams quickly generated and uploaded the new data in record time.
On the one hand, Maxine is mildly irritated that these details were caught so late in the launch process. But on the other hand, that’s what rehearsals are for and why everyone is assembled in the war room. Everyone needed to make these types of last-minute calls is in the room, and everyone agreed that it made sense. Maggie, Kurt, the team leads, and many others are assembled here, as well as Wes and key Ops people.
Maxine looks around. Again, Sarah is nowhere to be seen. Maxine wonders if she’s the only one who is suspicious that Sarah might be up to no good.
She turns her attention to what everyone else in the room is watching—the large monitor hanging on the wall. Everyone is holding their breath. On the screen are a bunch of graphs, dominated by the number of emails sent and the order funnel: this shows how many people viewed a product page, added a product to the shopping cart, hit the checkout button, had their order processed, and had their order fulfilled. The bottom shows where the most drop-offs are occurring, as well as the number of orders and revenue booked.
Underneath those graphs are the technical performance metrics: CPU loads for all the various compute clusters, number of transactions being processed by the services and databases, network traffic, and much, much more.
She could see several spikes associated with the massive calculations enabled by Panther. But now, most of the graphs are at zero. Several of the CPU graphs are at twenty percent. Those are the services that need to stay warm to make sure they don’t go to sleep. In one of their launch rehearsals, they were horrified when this happened to a key system, requiring six minutes for the system to wake up and scale out.
Nothing happens. One minute goes by. Another minute goes by. Maxine is starting to get worried that the launch was a complete dud. Maybe something terrible has happened in their infrastructure. Or maybe something terrible has happened that prevented the emails from being received. Or maybe their worst fears about terrible recommendations had come true, and they had accidentally sent offers of snow tires to people who don’t live near snow.
Maxine audibly sighs in relief when the graph for product page views suddenly jumps to ten, twenty, fifty … and keeps going up.
Everyone cheers, including Maxine. She is staring at the technical metrics, praying that the infrastructure doesn’t fall over like during the Phoenix release. She’s relieved when the CPU loads are starting to climb across the board, showing that things are actually being processed.
Minutes later, almost five thousand people are in various stages of the order funnel. So far, so good, Maxine thinks, watching the numbers continue to creep up. Again, people cheer as the number of processed orders continues to climb … Ten orders completed, then twenty, and it still continues to climb. To her excitement, the revenue generated from this campaign surges past $1,000.
Everything is working as designed. She smiles as she hears a smattering of applause throughout the room but continues to stare at the graph.
She frowns. The number of completed orders graph has flatlined, stuck at 250. She looks at the other graphs to see if they’re stuck too, but they’re still climbing. Maxine sees a bunch of people crowded around the TV, pointing at the stuck graph.
Something is definitely going wrong.
“Let’s have some quiet here!” Wes hollers out. He remains silent for a couple of moments before he turns around and finally says, “I need people to try ordering products on the web and on mobile and tell me what is actually happening! Something is preventing orders from going through!” Maxine already has the app open on her phone. She hits the Add to Shopping Cart button and blinks in surprise. She calls out, “Mobile app crashes when you add an item to the cart on an iPhone … app crashes and disappears.”
“Dammit,” she hears someone say from the other side of the room. Someone else calls out, “Getting error message on Android. I see a dialog box that says ‘An error has occurred.’”
Right next to her, Shannon hollers out, “Web shopping cart is generating an error—web page renders after you hit submit, but I’m getting a blank webpage! I think something on the back end is erroring out when we query whether items are available to ship.”
Wes says from the front of the room, “Thank you, Shannon. Get all these screenshots into the #launch channel. Okay, everyone, listen up! We’re getting errors on all client platforms—Shannon thinks it’s one of the back-end calls we make: maybe the ‘available to promise’ API call or ‘available to ship.’ Anyone have any ideas?”
Maxine jumps into action, appreciating how great it is that Wes is running the war room. Yeah, he’s cantankerous, she thinks, but he’s handled more outages than everyone else in the room combined. Having that type of experience during this high-stakes launch is a very good thing. We developers are great at what we do, but these types of crises are a part of everyday life for Ops.
It doesn’t take long to confirm that Shannon’s hypothesis is correct—it was a problem in the order entry back-end systems. All the systems in that particular cluster are pegged at one hundred percent CPU usage; unfortunately, the system being hit is part of the main ERP, which handles almost all the core financials of the company. It’s been running for over thirty years, but it’s stuck on a version that is almost fifteen years old. It’s been so customized that it’s been impossible to upgrade. At least it’s put on newer hardware every five years. But there’s no easy way to throw more CPU cores at it to speed it up.
Apparently, even the small one percent promotion is causing it to get backed up. Maxine sees that queries are taking longer and longer to return, and client requests start timing out. All those clients start resending the queries, causing even more requests to overload the back-end database.
“Thundering herd problem,” Wes mutters, referring to when simultaneous client retries end up killing a server. “We can’t do anything on the back end. How do we get all the clients to back off on the retries?”
“We can’t change the mobile apps, but we can get the e-commerce servers to wait longer before they retry,” Brent says. Wes points at Brent and Maxine and says, “Do it.”
Maxine and Brent work with the e-commerce teams to push out new configuration files to every web server. They are able to push out all these changes into production in less than ten minutes.
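The story does not show what the new configuration contained. As a rough illustration only, one common way to make clients "wait longer before they retry" is exponential backoff with random jitter, so that retries spread out instead of arriving together. Everything named below (`backoff_delay`, `call_with_backoff`, the 0.5-second base, the 30-second cap) is a hypothetical sketch, not the Parts Unlimited setup.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a randomized wait in seconds before retry number `attempt` (0-based).

    The upper bound doubles with each attempt and is capped at `cap`; the actual
    wait is drawn uniformly between 0 and that bound so clients do not retry in
    lockstep.
    """
    if rng is None:
        rng = random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      base: float = 0.5, cap: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Optional[random.Random] = None) -> T:
    """Call `fn`, retrying on TimeoutError with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            # Give up after the final attempt; otherwise wait, then try again.
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt, base, cap, rng))
    raise RuntimeError("max_attempts must be at least 1")
```

With a scheme like this, each timed-out request waits progressively longer, and the randomness keeps thousands of clients from hammering the back end at the same instant.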
Luckily, this is enough to stave off disaster. Maxine watches in relief as the database error rates start decreasing and the number of completed orders starts creeping up again. Several other things go wrong over the next two hours, but none of them are as heart-stopping as the ‘available to promise’ server issue that she and Brent had to deal with.
Another forty-five minutes later, they cross over their goal of three thousand completed orders, grossing a quarter million dollars in revenue, and the orders are still coming in strong. Maggie must have snuck out, because two hours later Maxine sees her come back into the room with a bunch of people carrying champagne bottles. Maggie opens one up and starts pouring glasses, handing the first to Maxine.
After everyone has a glass in hand, Maggie raises hers with a big smile. “Holy cow, everyone. What a day! And what an amazing team effort! I want to share with you some of the early results, and, wow, they are great … people are continuing to respond to the promotions, but at this point, almost a third of the people responded to our campaign. This is, without a doubt, the highest conversion rate we’ve ever achieved by at least a factor of five!”
She pulls out her phone and peers at the screen. “Here are some early calculations from the team. Over twenty percent of people who received our offer went to view the products we recommended, and over six percent purchased. We’ve never seen any numbers like this! So thank you to everyone here who helped make this happen.
“And remember, almost all the items we promoted are high-margin items or were sitting on shelves gathering dust. So each sale we made today will have an unusually large effect on profits!” Maggie cheers and drains her glass. Everyone laughs and follows suit.
She says, “Based on these results, the Unicorn promotion campaign to our entire customer base on Black Friday is a GO! If the results are anywhere near what we saw in this test campaign, we are going to have a blowout holiday season …
“Uh, just a reminder, this is insider information. If you use this information to trade Parts Unlimited stock, you can go to prison. Dick Landry, our CFO, told me to tell you that he will assist in your prosecution as per your employment contract,” she says, and then smiles. “But having said that, there’s no doubt we’re going to crush it on Black Friday!”
Everyone cheers loudly again, including Maxine. Maggie motions for everyone to quiet down and asks Kurt and Maxine to speak. Maxine laughs, motioning Kurt to go. He says, “What an amazing effort, everyone! I’m so proud. Maxine?”
Maxine hadn’t wanted to say anything, but being cornered like this, she stands up and raises her glass. “Here’s to the Rebellion showing the ancient, powerful order how kickass engineering work is done!”
Again, everyone cheers and laughs. When that all dies down, Maxine says, “Okay, enough of that. On Black Friday we can safely expect one hundred times the load as today. We’ll surely run into tons of problems we’ve never encountered before, so we’ve got our work cut out for us between now and then. Let’s figure out how to best prepare for it.”
Kurt adds, “I’d like to send as many people home on time tomorrow as possible, given that it’s Thanksgiving on Thursday. So let’s get to work! And we’ll need people in the office early on Friday to support the launch.”
They agree to stagger the emails and mobile app notifications to prevent the systems from being slammed all at once and to better protect those unexpectedly delicate back-end servers. Brent comes up with an idea to reconfigure the load balancers to rate-limit the transactions. This will cause customer errors on the mobile app and e-commerce servers, but everyone agrees this is far better than those back-end systems crashing again.
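How Brent configures the load balancers is not described. One widely used way to rate-limit transactions is a token bucket, sketched here under that assumption; the class name `TokenBucket` and the numbers in the usage note are made up for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow at most `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.clock = clock          # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Return True and spend a token if the request may proceed, else False."""
        now = self.clock()
        # Refill in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A front end holding something like `TokenBucket(rate=200, capacity=400)` would reject requests whenever `allow()` returns False. That is the trade-off the team accepts: some customers see an error, but the fragile back-end systems never receive more than they can handle.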
“We’ll get on it. I think we’ll be in good shape and get everyone out of here in time for Thanksgiving!” Brent says with a big smile. “Happy Thanksgiving, everyone!”
As Brent predicted, all the work is done before five the next day. With just a couple of exceptions, people start heading out. Maxine is making the rounds, trying to shoo the stragglers home. It’s the day before Thanksgiving, and Maxine wants to get out of here by five thirty. She’s proud that she even got Brent to leave.
One team that can’t leave is the data analysts. Now that the one percent test has proved to be a smashing success, they have to finish generating all their recommendations for millions of customers by Friday. The resulting compute loads on Panther keep growing, and they keep updating the promotions data in the Narwhal data platform. Maxine thinks with a grin, We’re racking up a heck of a bill with the cloud computing providers, but absolutely no one in Marketing is complaining because the business benefits are so spectacular.
She swings in to say goodbye to Kurt but freezes mid-stride when she sees Sarah having a heated discussion with him.
“… and I walk around this building after five and there’s barely anyone here. Kurt, I don’t know if you realize this, but the company is on the verge of extinction. We need everyone pulling their weight,” Sarah says, fuming in righteous indignation. “I think we need some mandatory overtime. Buy them more pizza and they’ll be happy to stay and do the work.
“And if that weren’t bad enough,” she continues, “I just saw a bunch of people sitting around reading books! We don’t pay people to read books; we pay them to do work. That should be pretty clear, right, Kurt?” Kurt’s expression remains deadpan.
“You’ll have to bring that up with Chris. Banning books from the workplace is above my paygrade.” She gives him a dirty look and storms out.
Kurt makes a motion to Maxine, indicating that he wants to hang himself. “It’s so strange,” he says. “She thinks we pay developers just to type, instead of paying them to think and achieve business outcomes. And that means we pay them to learn, because that’s how we win. Can you imagine banning books from the workplace?” he says, laughing and shaking his head.
Maxine just stares at Kurt. Sarah’s beliefs are like the antithesis of the Third Ideal of Improvement of Daily Work and the Fourth Ideal of Psychological Safety. Maxine knows that the only way they could have achieved what they had was by creating a culture where people felt safe to experiment, to learn and make mistakes, and where people make time for discovery, innovation, and learning.
“No argument from me, Kurt. Let me know if you convince her,” Maxine says smiling, waving goodbye. “Happy Thanksgiving.”
Maxine has a fantastic Thanksgiving. It’s the first since her father died, and she enjoys having everyone over, even if she is surreptitiously looking at her phone all the time to see how the Black Friday preparations are going.
The highlight of Thanksgiving is when Waffles, now not so little, tipping the scale at forty pounds, grabs a big piece of turkey off the table in front of everyone, to Maxine’s horror. It was the first time he had ever done that, Jake promises everyone.
Everyone pitches in to clean up after, and Maxine goes to bed early.
She needs to be in the office early the next morning.
At three thirty a.m. she’s in the office with the rest of the team. The technical teams had been going through their launch checklist, getting ready for the surge in demand that would start in a couple of hours. They grab another conference room for the extended teams who can’t fit in the first. It’s a larger affair than the one percent test they ran on Tuesday. Each conference room has a similar configuration of big, U-shaped tables with about thirty people seated. She starts her day in the room where the technical teams are assembled.
In the extended war room are the Narwhal and Orca teams, next to the monitoring team, the web front-end teams, the mobile teams, and the numerous back-end service teams responsible for products, pricing, ordering, and fulfillment. There are many more technical teams on standby in the chatroom.
All of these services have to run seamlessly for products to be presented to a customer and for orders to be placed. On the huge TV monitor on the wall are more technical graphs showing the number of visits to the website, stats on the top product pages, as well as health checks and most recent errors from all the services represented in the room.
In the primary war room, they’ve set up a second TV where some of these technical metrics are displayed. And today, they have more representatives from business and technology leadership, the entire Unicorn and Promotions team, and even people from Finance and Accounting. Everyone who matters is here to see how the campaign goes.
At four thirty a.m. Maxine is hanging out with Kurt and Maggie in the primary war room. She is looking for something to help with, but everyone seems to know what they’re doing. At this point, all she can do is get in the way. They are thirty minutes away from the beginning of the campaign launch.
Sarah is here too. As far as Maxine can tell, she appears to be haranguing someone about the pricing and promotional copy for one of the offers.
Maggie is also in the huddle, not looking happy, saying, “Look, I know we want the offers to be perfect, but the time to make changes was yesterday. The risk of making changes in the copy is just too high for something going out to so many people. It might delay the launch by another hour.”
“This may be good enough for you, but it’s certainly not good enough for me. Get this fixed. Now,” Sarah says, eliminating any further discussion.
Maggie sighs and walks away, rejoining Kurt and Maxine. “We’re going to have to make some changes,” she says, rolling her eyes. “Undoubtedly, this is going to push back the launch by at least an hour.”
“I’ll go tell the technical teams next door,” Kurt says, grimacing as he leaves the room.
An hour later, things are again finally ready to go. Maggie asks from the front of the room, “If there are no objections, let’s launch at six a.m. That’s fifteen minutes from now.”
When the launch begins, Maxine is in the business war room watching the large TV monitor like everyone else. Within two minutes, over ten thousand people have hit the website and are going through the order funnel, and the rates of arrivals keep climbing. And again, all the CPU loads start climbing, much higher than in the test launch.
People clap as the number of completed orders passes five hundred. Maxine is amazed at the scale of customers who are being mobilized by this launch.
She holds her breath, hoping that all their hard work hardening their systems will make this launch boring. She watches as the number of orders continues to climb … until they flatline, just like on Tuesday.
“Dammit, dammit,” Maxine mutters. Something is definitely going wrong again. And in the same portion of the order funnel. Something is preventing people from completing the shopping cart checkout.
Wes hollers out, “Someone tell me what’s going wrong with the shopping cart! Who has any relevant data or error messages?”
Shannon is the first to speak up again. Maxine marvels at Shannon’s uncanny ability to be first on scene. “Web shopping cart is generating an error. Fulfillment options aren’t being shown! I’m guessing some fulfillment service is failing. Posting screenshot in the chatroom.”
Someone from the other side of the room hollers out, “iOS mobile app crashing again.” Wes swears. The mobile app Dev manager swears.
Suddenly, Maxine tunes everything out, because in that moment, she’s suddenly afraid that maybe Data Hub is causing the problem. She’s still trying to think this through when she hears someone from the mobile team holler out: “Wes! The app just crashed after I hit the checkout button, right when it should have presented all the transaction details. I think a call to a back-end service is timing out. I thought we fixed all the places where that can happen, but we obviously missed one. We’re trying to figure out which service call is causing the problem.”
“Could that be a call to Data Hub?” Maxine whispers to Tom.
“Not sure,” Tom says, thinking. “I don’t think there’s any direct calls from the mobile app to us …”
On her laptop, Maxine pulls up the logs from the production Data Hub service, looking for anything unusual, grateful that she can do this herself now. She sees a couple of incoming order events, which generate four outgoing calls to other business systems. They all appear to be succeeding.
Seeing nothing, she turns her attention back to the front of the noisy room where Wes, Kurt, and Chris are convening. Seeing that they’re actively in discussion, Maxine joins them. She hears Wes ask, “… so what service is failing?”
Chris and Kurt pow-wow for a bit, and Wes apparently loses patience. He turns to the entire room and hollers over all the commotion, “Listen up, everyone! Something in the transaction path between bringing up the shopping cart and completing an order is failing. Maxine, what are the names of each of these transactions and service calls?”
Although she is surprised at being prompted, she quickly rattles off eleven API calls and services off the top of her head. Brent calls out three more. “Thank you, Maxine and Brent,” Wes says.
Turning to the room, he hollers, “Okay, everyone, prove to me that each one of those services are working!”
Minutes later, they discover the problem. When a customer views the shopping cart, they are presented with the order details, payment options, and shipping options. When all that is correct, the customer hits the place order button.
Apparently, when displaying this page in the mobile app and on the web, a call is made to a back-end service to determine which shipping options are available based on their location, such as next-day air and ground shipment, as well as providers such as UPS and FedEx.
This service calls out to a bunch of external APIs from the shipping providers, and some of those are failing. Brent suspects that they are being rate-limited by one of them, because they’d never had Parts Unlimited servers send so many queries like this before.
Maxine can’t believe that a service that seems so trivial is jeopardizing the entire launch. She smiles and makes a note of this, because she knows that this will likely be the new normal. But for something this mission-critical, there’s no way we should depend on external services, she thinks. We need to gracefully handle the case when they’re down or when they cut us off.
Maxine joins the technology team leaders huddling in the front of the room. She suggests, “When we get shipping API failures, maybe we present just the ground shipment option. We know that this type of shipping is always available … Thoughts?”
The fulfillment service team lead nods, and they quickly work through the details with Wes and Maggie. They decide that, effective immediately, if they can’t get information from all shippers, they’ll just present ground shipment as the only option.
After all, it’s better to take their order and ship it slowly as opposed to giving them an error page.
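The fulfillment team's actual code change is not shown. A minimal sketch of the rule they agree on, assuming each shipping provider is wrapped in a function that returns its available options or raises on failure, might look like this. The names `shipping_options` and `GROUND` are hypothetical.

```python
from typing import Callable, Iterable, List

GROUND = "ground"  # assumed label for the option that is always available


def shipping_options(providers: Iterable[Callable[[], List[str]]]) -> List[str]:
    """Collect options from every provider, or fall back to ground-only.

    If any provider lookup fails, return just ground shipment rather than an
    error, matching the rule "if we can't get information from all shippers".
    """
    options: List[str] = []
    for fetch in providers:
        try:
            options.extend(fetch())
        except Exception:
            # Broad catch is deliberate in this sketch: timeouts, rate-limit
            # responses, and connection errors all get the same fallback.
            return [GROUND]
    # Remove duplicates while keeping order; an empty result also falls back.
    return list(dict.fromkeys(options)) or [GROUND]
```

The design choice is graceful degradation: the checkout page always has something to display, so a failing external API costs the customer some shipping choices instead of the whole order.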
The team lead says, “Give us ten or fifteen minutes to push the code change out. I’ll keep you posted,” and runs from the room.
Ten minutes later, Maxine is pacing, waiting for the fulfillment team to announce that they’re pushing their fix into production. When that happens, everyone will high-five each other and celebrate. She’s still waiting when someone yells out, “Wes! Web server page requests are timing out and front-end servers are crashing! These aren’t ‘404’ errors. Two servers are actually rebooting and clients are starting to get ‘unable to connect’ errors!”
Maxine looks at the dashboards and is shocked at what she sees. The entire web server farm is pegged at a hundred percent CPU utilization, with some of them X’ed out because they have hard-crashed. Page load times have gone from 700 milliseconds to over twenty seconds, which is basically forever, and still climbing.
This means some people going to their webpage won’t see anything at all because the requests for the page are not being fulfilled.
Wes is staring at the graphs too, and then attempts to load the webpage on his phone. “Confirmed. Nothing is loading in my mobile browser. Web server team, what’s going on?” he hollers out.
“They’re in the next room,” Kurt says. “I’ll go find out.” Maxine follows.
In the next ten minutes, they learn just how bad the problem is. A record number of people are hitting the e-commerce site. They had anticipated this, which is why Brent had blasted their site with a homemade bot army, making sure they could actually handle such a heavy load.
But apparently, they missed something important. They hadn’t tested for real customers coming to their site, who were presented product recommendations based on their customer profile. This was a new component they had created in the last week. The component wouldn’t render for bots, only for actual customers who were logged in.
As real customers hit their sites, this component made a bunch of database lookups from the front-end servers, which were never tested at this scale. Now those front-end servers are crashing under the load like a house of cards.
“I need ideas on how to keep those front-end servers alive, I don’t care how crazy!” Wes says from the front of the room. The enormity of the problem is clear to everyone. Seventy percent of all incoming traffic is through the web. The largest portion of their order funnel is still the web, and if that stays down, all Black Friday goals will go down with it.
“Get more servers into rotation?” someone says. Wes responds immediately, “Do it! No, Brent, not you, you stay here. Get someone assigned to it … Other ideas, people?”
More ideas come out, and almost all of them are shot down. Brent says, “The recommendation component is what’s causing the unusual server load. Can we disable it until traffic dies down?”
Maxine groans inwardly. They had worked so hard to get it working, and now they might have to tear it out to keep the site running.
“Interesting. Well, can we or can’t we?” Wes says, asking the room.
A group of managers and technical leads huddle with Maxine and Brent, and they quickly brainstorm ideas. They finally decide to just change the HTML page, commenting out the recommendations component. A brute force approach that Maxine appreciates, because no code changes are needed. The front-end team lead says, “We can change the HTML page and push it out to all the servers within ten minutes.”
“Go!” Wes says.
Maxine watches over the shoulders of two engineers as they carefully modify the HTML file. They’re careful because a mistake in the HTML can break the website as thoroughly as any code change. When they’re done, they review it together, commit the change into version control, and initiate the push into production.
They’re surprised when there is no impact to the front-end performance, even after three minutes. They keep waiting to see a change, but servers keep dying. “What’s going on? What did we miss?” the engineer says, obviously trying to stay calm, confirming over and over that his modified HTML file is being loaded in the browser.
“I can see your changes in the HTML served by the site,” Maxine announces loudly. “There must be another path that displays the recommendations component?”
Wes is watching from behind them. “Everyone, the new HTML file has been pushed into production, but we’re still getting excessive CPU load. I need confirmation that the recommendation component is still getting rendered somewhere. Give me hypotheses and ideas!”
It takes four more minutes for them to discover that there is one more place where the component can be rendered. Maxine watches as they push another HTML file and is relieved to see that sixty seconds later the CPU load drops by thirty percent.
“Congratulations, team,” Wes says, pausing to smile. He continues, “But that’s not enough to keep our servers up. What else can we do to reduce the load, people?”
More ideas are proposed, more ideas are shot down, but some are jumped on immediately. The server load finally dips another fifty percent when the most common graphic images are offloaded from their local web servers and moved to a Content Distribution Network (CDN). This takes almost an hour to fully execute, but it’s enough to prevent the site from going down entirely.
And so it goes for the rest of the day—hundreds of things going wrong, some big, some small, and never just one problem at a time. Like in their post mortems, they learn how imperfectly they understand this incredibly vast and complex system they’ve created and now must keep operating under extreme conditions.
Hours fly by. There are moments of tired smiles and high fives as heroic acts keep everything running. The number of completed orders continues to climb, and Maxine is relieved that incoming order rates peak around three p.m., giving people a reason to hope that the worst might be behind them.
Strangely, Maxine has a brief glimpse of Sarah looking sour on the sidelines—but even this doesn’t bother her. Maxine’s so proud of how well her teams did, handling every crisis that was thrown their way, quickly adapting and learning. And absolutely everyone knows that all this adversity is a great thing, because it is a consequence of the outrageous success of the Black Friday promotion made possible by the Unicorn Project.
By four, it’s clear that the worst is behind them. Order traffic is still incredibly high, but off fifty percent from their peak earlier in the day. The number of failures and near-misses is down to a less bewildering rate, and people are actually starting to relax. As evidence of this, Wes is now wearing a Parts Unlimited baseball hat with a unicorn and large flames emblazoned on the side. He is laughing and joking with the people around him, handing out hats to everybody who walks by.
Shortly before five, Maggie goes to the front of the room, and the champagne bottles and plastic cups are carried in by her staff. When everyone has a cup, she says, “What a day, everyone! We made it!”
Everyone cheers, and Maxine drains her glass. She’s exhausted, but she can’t wait to hear the business results from everything they’ve done today.
“This is the largest digital campaign that this company has ever done,” Maggie says. “We sent more emails today than ever. We pushed out more mobile app notifications than ever. We had the highest response rates. The highest conversion rates. We had higher e-commerce sales today than any other day in the company’s history. We will likely have a higher margin on sales today than on any other single day. How’s that for the Unicorn Project pulling through?”
Maxine laughs uproariously and cheers loudly, along with everyone around her.
Maggie continues, “It’s going to take days to get final numbers, but you can see on the screen behind me that we’ve booked over $29 million in revenue today alone. We blew away last year’s sales record by a mile!”
Maggie looks around the room for a moment, cheers again, and then says slowly, “This is a watershed moment for Parts Unlimited. This is what we’ve been reaching for, for years. This shows that even horses can do unicorn-like things. Trust me, this will turn a lot of heads, and our job now is to start dreaming bigger. We’ve shown what an incredible business and technology team working together can do, and we’ve got to elevate the dreams, goals, and aspirations of our business leadership.
“Bigger and better things are yet to come, everyone,” she says. “But in the meantime, we’ve all earned the right to celebrate. Uh, that is, when Wes says it’s safe for us to celebrate. Kurt and Maxine, get on up here and say a couple of words.”
Kurt joins Maggie, laughing as he beckons to Maxine to join him at the front of the room. “Here’s to a kick-ass technology team that supported the Promotions team! We took a bunch of risks, and we did things that have never been done before in this company. As Maggie just said, we have a chance to make a material difference to the performance of the company.”
Kurt turns to Maxine, obviously expecting her to say something. Maxine looks at everyone for a moment. “I’m so proud to be a part of this effort. Kurt is right in that we all took a bunch of risks to get here, and I think we’ve all learned so much in this journey. I can’t believe how much we’ve gotten done since I was first exiled to the Phoenix Project only a couple of months ago. Working on the Unicorn Project has been one of the most rewarding and fun things I’ve ever done, and I’ve never been as proud as I am today.
“And I can’t wait to celebrate with all of you tonight, because I heard that Kurt is buying drinks at the Dockside. But there is one thing I need to say,” she says, waiting for people’s cheering to quiet down. “As awesome and amazing as what we pulled off today was, we’re a long way from being done. We’re basically Blockbuster, who just figured out how to do paper coupon promotions. If you think that’s enough to save Parts Unlimited, you must be smoking something.
“Maggie’s right. We’re just at the beginning of our real fight. We haven’t blown up the Death Star yet. Not by a long shot. It’s still out there. What we did today was we finally figured out how to fly our X-wings. Our world is still in grave danger,” she continues. “But we finally have the tools, the culture, the technical excellence, and the leadership to win the fight. I can’t wait for the next chapter to prove that we’re not a Blockbuster or Borders, Toys“R”Us or Sears. We’re in it to win it, not to be another casualty of the Retail Apocalypse!”
Having said what she wanted to say, Maxine looks up and sees the shocked looks on everyone’s faces. Oops, Maxine thinks, realizing that she maybe should have saved that speech for a private conversation at the Dockside. Then she hears Maggie say, “Holy cow, Maxine is so right! I’m totally going to use that line with Steve and Sarah. I can’t wait for Round 2!”
Everyone laughs and then people start applauding and cheering, Maggie loudest of them all. Although at the mention of Sarah, Maxine looks around, puzzled. She’s nowhere in sight. This is a very bad sign, Maxine thinks. Usually she’d be here to claim credit. Or pounce on someone if something went badly wrong. But Maxine is feeling too exhilarated to really care.
Kurt and Maxine are the first ones at the Dockside. They push a bunch of tables together and pre-order a bunch of pitchers of beer for the large group that will be assembling there soon. Kurt looks squarely at Maxine. “By the way, this is a great time for me to tell you how much I appreciate everything you’ve done. We couldn’t have done this without you … the Rebellion changed when you arrived on the scene.”
Hearing this, Maxine smiles. “You’re welcome, Kurt! We make a great team. And I’m so grateful that you sucked me into all of this.”
She sits down as people start to file in and takes a sip of her wine, thoroughly enjoying it. She discovered a couple of weeks ago that Erik had instructed the bartender to always serve her from a special stash of wine from a vineyard owned by a friend of his.
She looked into buying a bunch of bottles but balked when she found out how much it cost. Apparently, Erik drastically subsidizes the cost for her here. She bought one bottle for her and her husband to drink on a special occasion.
As if he knew she was thinking about the wine, Erik arrives, grabbing the seat next to her. “Congratulations to you both—you did terrific today. Now, you need to show Steve and Dick how the future requires creating a dynamic, learning organization where experimentation and learning are a part of everyone’s daily work. It’s funny, when Steve was VP of manufacturing, he was very proud that hundreds of plant worker suggestions were put into production to improve safety, to reduce toil, to increase quality, and to increase flow. That too is also a form of continual experimentation. You now need it at a much larger level, liberated from the tyranny of project management and functional silos.
“The Fifth Ideal is about a ruthless Customer Focus, where you are truly striving for what is best for them, instead of the more parochial goals that they don’t care about, whether it’s your internal plans of record or how your functional silos are measured,” he says. “Instead we ask whether our daily actions truly improve the lives of our customers, create value for them, and whether they’d pay for it. And if they don’t, maybe we shouldn’t be doing it at all.”
Erik gets up, and one of the bartenders arrives with a newly opened bottle of wine. Erik takes it and places it in front of Maxine with a wink. “Congratulations, Maxine. Catch y’all later tonight!”
He leaves just as six more of their teammates walk through the door. Maggie turns to Maxine and Kurt and asks, “What was that all about?”
“I’m trying to figure that out myself,” Maxine says. “But it’s nothing that can’t wait until next week. Maybe we can find a moment to talk later tonight … But in the meantime, let’s celebrate!”
The next morning when Maxine wakes up, her head is pounding. On top of the Dockside celebration, she and her husband had a couple more drinks while watching their favorite TV series late into the night. She didn’t actually remember falling asleep, such was her sudden exhaustion.
She wants to go back to sleep on this Saturday morning, but she scans her phone. There is some chatter in the chatrooms about an ongoing issue in the stores. Apparently, store managers are having problems because of overwhelming demand for promoted items. They were completely out of stock, and it was taking them fifteen minutes per customer to create rain check orders, having to key each one into another clunky in-house ordering system.
The in-store app teams were dispatched to the stores to figure out how to speed things up. Someone thinks they could write a simple tablet app to simplify the process. Maxine likes the idea, and she has full confidence that they’ll come up with a fix that will delight the in-store managers and staff.
She smiles, satisfied that this problem could be solved without her. Over the past month, she has grown to trust and respect her teammates and appreciate what they’re doing.
Maxine grins as she looks at the tickets to Comic-Con that Jake bought yesterday for her and the whole family.
She smells bacon and eggs. Jake must be making breakfast, she thinks. Maybe she can go back to sleep after eating. Life keeps getting better.