• Tuesday, September 23
On Tuesday of the following week, Maxine arrives at work to see Kurt beaming. “I got the job,” he says exuberantly.
“Really? The Dev job?” Maxine asks.
“Yes, the Dev job!” he says, as if he can’t quite believe it himself. “It couldn’t have happened without Kirsten’s support. I’m joining the Data Hub team, and you’re coming with me.”
“That’s awesome!” Maxine says, jubilant. “How’d you get Randy to approve my reassignment?”
“Well, he wasn’t happy about losing you. He kept going on about how you’re the best thing to happen to this place since sliced bread, but . . . well, I have my ways,” Kurt says with a sly smile.
Maxine gives him a high five.
He looks around and whispers, “All the managers are talking about something very strange happening. Apparently, the technology executives had an off-site with Steve earlier this week, and one of the things they agreed upon was a one-month feature freeze. Apparently, they’re actually hitting the brakes on feature delivery to pay down all the technical debt we’ve built up over the years!”
“Really?!” Maxine is shocked.
“They realize they need to fix all the crap that’s been built,” he says. “Ops is halting all work not related to the Phoenix Project so they can pay down technical debt and automate things. And Dev and QA will halt all feature work to pay down their technical debt too.
“This is our moment to shine. Here’s our chance to show people what engineering greatness looks like,” Kurt exclaims.
Later that day, an email goes out announcing Kurt’s new role. Maxine doesn’t want to hurt Kurt’s feelings, but she’s pretty sure that the real reason he got the job was that absolutely nobody else in Development wanted it. Data Hub is being widely touted as the “root cause” of the catastrophic crashes during and after the Phoenix release. Chris even called them out by name during one of the meetings Maxine was in, which she thought was quite unfair.
Blaming the Data Hub team for the smoking crater in the ground that was the Phoenix deployment was like blaming an airline crash on the passenger in the back of the plane who didn’t fasten his seatbelt tight enough.
She knows why blaming Data Hub is so easy. It’s one of the least glamorous technology areas in the company. Data Hub is part of a big, boring message bus system, which Maxine already loves because it’s how most of the major applications and systems of record talk to each other: the product database, pricing database, inventory management systems, order fulfillment systems, sales commissioning systems, company financials, and almost a hundred other major systems, many of them decades old.
Maxine has never liked that there are actually three inventory management systems: two for the physical stores (one was inherited through an acquisition and never retired) and another for the e-commerce channel. And there are at least six order entry systems: three supporting the physical stores, one for e-commerce, another for OEM customers, and another for service station channel sales.
Maxine loves byzantine processes like pediatricians love sick children, but even Maxine is taken aback by just how many systems Data Hub has to talk to.
The more Maxine studies up on what Data Hub does, the more perplexed she becomes. Data Hub just doesn’t seem like it should be part of Phoenix at all. After all, the majority of Data Hub was written over twenty years ago, long before Phoenix was even a concept.
Apparently, Data Hub used to be a collection of smaller applications that were scattered across the company. Some resided in finance with the ERP systems, some inside the manufacturing business units, and others within the Development group under Chris.
As the Phoenix juggernaut started rolling, an incredible number of new demands were put on those teams, and those teams just weren’t staffed to deal with it. Tons of new Phoenix functionality was blocked because of competing business priorities in Data Hub, and soon Phoenix features were being delayed, month after month.
Finally, as part of a re-org, all those components were rolled up into a new group called Data Hub and put under the Phoenix Project, making sure Phoenix priorities always came first. And now, everyone is blaming Data Hub for what went wrong.
On Wednesday morning, Maxine and Cranky Dave join Kurt for the first meeting with the Data Hub engineers. Maxine’s surprised to see that Cranky Dave was able to join Data Hub so quickly too. She asks him how he managed to swing that.
Cranky Dave merely smiles, saying, “One of the many benefits of my winning personality: no manager passes on the opportunity to give me to a different team. It allows me to go wherever I want.”
She stands next to Cranky Dave as the other five Data Hub engineers assemble in the central meeting area.
They’re all either her age or fresh out of college, no one in between. She suspects the senior developers have been on the team since the beginning, and the younger engineers will quickly leave for more interesting work and be replaced with other new college grads.
Chris clears his throat and addresses the room. “Good morning, everyone. Please welcome Kurt Reznick, who will be taking over for Peter.”
Kurt seems surprised by the short introduction but says cheerily, “Hello, everyone. As you may know, this is the first Dev team I’ve managed. I believe my job is very simple: listen, do whatever you need me to do to help make you successful, and remove any obstacles in your way.” It’s clear from everyone’s unimpressed looks that they’re well aware of Kurt’s lack of experience.
Kurt continues, “I’ve talked with our numerous internal customers, and they told me how important Data Hub is. But they also told me about how we’re often the bottleneck for changes needed across the enterprise, as well as for the Phoenix Project. And as we all know, when our service goes down, so does Phoenix. I’ve scheduled a session later this week for us to brainstorm how we can make our service more reliable and resilient.”
“Blaming Data Hub and Peter for Phoenix going down is bullshit,” says one of the senior developers.
“I totally agree with you, Tom,” says Kurt. “And rest assured that I’ll be working to correct that perception.”
Kurt continues, “I really appreciate that Peter was willing to meet with me before I started. He told me that he’s been asking for additional headcount for senior developers for years because the business needs have kept growing, especially around the Phoenix integration. He recommended that I keep trying.”
Kurt gestures at Chris. “And I promise you that I’ll keep lobbying Chris for more headcount.”
“And I’ll keep lobbying Steve,” Chris replies with a tight-lipped smile. Kurt laughs. “So, in the meantime, I’ve brought with me two senior developers who have volunteered to join the team. Maxine is the most senior developer from the MRP team, and Dave is a senior developer from the Phoenix back-end server team. They’re two of the developers I trust the most.”
The Data Hub developers look at them, surprised but genuinely happy that she and Cranky Dave are here.
“There will be a directive coming out soon from Chris about a feature freeze, so we can work on fixing defects that impact our customers and fix problematic areas of our code,” Kurt says. “But don’t wait for the announcement. The top priority is to fix things you think should be fixed and, for that matter, anything that you think will help you be more productive or make Data Hub more stable. I’ll handle any complaints that come our way.”
Maxine smiles at the expressions of grudging approval from the Data Hub engineers.
As the new engineers, Cranky Dave and Maxine integrate themselves into the daily rituals of the Data Hub team. They attend the stand-ups and are quick to volunteer to help with things.
Maxine pairs up with Tom, the older developer who had commented on the unfairness of being scapegoated for the Phoenix failure. Tom is in his late forties and wears glasses, jeans, and a T-shirt. She sits at his desk with her laptop open as he explains what he’s currently working on.
As he talks, Maxine sees that Data Hub is a mishmash of technologies built up over the decades, including a big chunk that runs on Java servlets, some Python scripts, and something that she thinks is Delphi. There’s even a PHP web server.
She doesn’t judge or dismiss any of the technology stacks—after all, it’s been successfully serving the enterprise for decades. It may not be the most elegant piece of software she’s seen, but things that have been in production for twenty years rarely are. Software is like a city, constantly undergoing change, needing renovations and repair. She will, however, acknowledge that Data Hub is not the hippest neighborhood. It’s undoubtedly difficult to recruit new college grads who want to learn and use the hottest, most in-demand languages and frameworks.
But at least Data Hub is in much better shape than the Phoenix build systems, which were like uninhabitable, radioactive Superfund sites, or the shelled-out remains of a war zone.
Maxine is sitting at Tom’s desk as he explains what he’s working on, “I’m working on an urgent defect. Data Hub is occasionally generating incorrect message transactions and is crashing under high loads. It sometimes happens when in-store employees mark customer repair work as complete in the service station application,” he says. Looking embarrassed, he continues, “I’ve spent days working on this. I’ve finally created a semi-reproducible test case—it happens about one out of ten times. I’m pretty sure it’s because of a race condition.”
Talk about being thrown in the deep end, Maxine thinks. But she relishes the challenge and is sure that when they solve the problem, it will make a very positive impression on the entire team. After all, race conditions are one of the toughest categories of problems in all of distributed systems and software engineering. If working with the middle-school girls was a yellow belt challenge in karate, what Tom is describing can drive even the most experienced tenth-level black belts to despair and madness.
Maxine is impressed that Tom can even reproduce the problem at all. Someone once called these problems “heisenbugs,” referring to the quantum physics phenomenon where the act of observation changes the nature of reality itself.
This type of work is very different than how coding is portrayed in the movies: a young, male programmer is typing away furiously, wearing a hoodie, of course, but curiously, also wearing sunglasses (which she has never actually seen a developer do in real life). He has many different windows open, text quickly scrolling by in all of them. Behind him, a crowd of people is watching over his shoulder, waiting anxiously. After a couple of seconds, the coder cries out, “I got it!” and everyone cheers. The solution is created, the feature is delivered, or the world is saved. And the scene ends.
But in reality, when developers work, they’re usually staring at the screen, deep in concentration, trying to understand what the code does so they can safely and surgically change it without breaking something else as an unintended side-effect, especially if they’re working on something mission-critical.
Tom walks her through the problem. “When there are multiple repair transactions being processed concurrently, sometimes one of the transactions gets the wrong customer ID, and sometimes Data Hub completely crashes,” he says. “I’ve tried putting a lock around the customer object, but it slowed down the entire application so much, it’s just not an option. We have enough performance problems as it is.”
It is almost impossible to predict how a program will behave if any other part of the program can change data that you’re depending on at any time, Maxine thinks. But she’s pretty sure she knows how to fix this problem.
“Can we walk through the code path again?” Maxine asks. As they do, Maxine goes through a mental checklist to confirm her hypothesis. There’s a thread pool that handles incoming messages. Check. Service records can be handled by multiple concurrent threads. Check. The threads pass around objects, which are being mutated when their methods are called. Check.
Hypothesis confirmed. The problem is almost definitely state mutation going wrong, she thinks. Just like at the middle school.
“You’re right, it’s definitely a race condition,” Maxine says. “And I’m pretty sure we can solve this problem without putting a lock around the entire customer object. Can I show you what I’m thinking?”
When he nods, just as Maxine did with the middle-school girls, she proposes rewriting the code path using functional programming principles. Tom’s test case has a lot of mocks and stubs to simulate the production environment: a configuration server, a database, a message bus, a customer object factory . . . .
She jettisons all of them, because those are not areas of the system she wants to test. Instead, she pushes all that input/output and side-effects to the edges and creates unit tests around how an incoming repair order message is processed, how customer data is transformed, and what outgoing messages are sent.
She has each thread make its own copy of the customer object. They rewrite each object method into a series of pure functions, each one a function whose output is completely dependent upon its inputs, with no side-effects, mutations, or accesses to global state.
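The story never shows the code Maxine and Tom write, so the following is only a minimal sketch of the idea, written in Python rather than in any of Data Hub's actual languages. Every name in it (`Customer`, `RepairOrder`, `complete_repair`, `handle_all`) is hypothetical. It also swaps in immutable values where the text describes per-thread copies: if nothing can be mutated, no thread can hand another thread the wrong customer ID, and no lock is needed.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Customer:
    """Hypothetical customer record; frozen, so it cannot be mutated in place."""
    customer_id: str
    open_repairs: tuple  # a tuple is immutable too


@dataclass(frozen=True)
class RepairOrder:
    """Hypothetical incoming 'repair complete' message."""
    order_id: str
    customer_id: str


def complete_repair(customer, order):
    """Pure function: the result depends only on the two arguments.

    Returns a new Customer plus the outgoing message; the inputs are untouched.
    """
    remaining = tuple(r for r in customer.open_repairs if r != order.order_id)
    updated = replace(customer, open_repairs=remaining)
    message = {
        "type": "repair_completed",
        "order_id": order.order_id,
        "customer_id": updated.customer_id,
    }
    return updated, message


def handle_all(customers, orders, workers=8):
    """Edge of the system: the thread pool only calls the pure function."""
    def handle(order):
        _, message = complete_repair(customers[order.customer_id], order)
        return message

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input orders.
        return list(pool.map(handle, orders))
```

Because `complete_repair` touches no shared state, a unit test can call it directly with plain values, with no mocked database, message bus, or configuration server, which is the property the passage describes.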
When Maxine shows Tom a unit test that reproduces the problem 100% of the time, as well as the completely thread-safe fix that now works 100% of the time, Tom stares at her, eyes wide with wonder. “That’s … that’s … incredible.”
She knows why he’s impressed. Her code is so simple that it’s easy to understand and test for correctness. Eventually, marveling at the screen, he says, “I just can’t believe how much you simplified it. How can that achieve the same thing as the complex mess that we had before?” For the rest of the afternoon, he asks questions, obviously trying to prove to himself that Maxine’s test case captured the problem and that the rewrite is correct. At last he says, “I can’t believe it, but I think you’re right. This will definitely work!”
Maxine grins at Tom’s reaction. Another testament that functional programming principles are better tools to think with. And they’ve already made the code way better than when they found it—it’s definitely safer, easier to test, and way easier to understand. This is so much fun, she thinks. And a great example of the First Ideal of Locality and Simplicity.
“Okay, let’s get this fix merged in!” he says, opening up a terminal window, typing in some commands. He turns to Maxine. “Congratulations! You just fixed your first defect and checked in your first change!”
Maxine gives him a huge high-five, a big grin on her face. Vanquishing a race condition error on her first day is freaking awesome. “That’s great! So, let’s get this thing tested and pushed into production.” Maxine is excited at the thought of a grateful store manager thanking them.
“Errrr … Uhhhh …” Tom says, pausing. “Testing doesn’t start until Monday.”
Maxine feels her heart drop. “We can’t test it ourselves?”
“We used to be able to, before we got re-orged into Phoenix,” he says, wistfully. “The QA group took over the testing. And when they had some problems with different teams using the test environment at the same time, they took everyone’s access away. Now they’re the only ones who can log in, let alone run the tests.”
“Wait,” she says. “We write the tests but can’t run them?”
He laughs. “No, no. They write the tests. They don’t even let us see the test plans anymore.”
Maxine deflates even more, knowing where this is going. “And we can’t push it into production?”
Tom laughs again. “Nope, not any more. We used to be able to do that too. But now someone else deploys it for us. ‘Stay in your lane,’ they told us.” He shrugs his shoulders. Maxine is pretty sure she knows who said, “Stay in your lane.” That’d be Chris.
The joy that Maxine felt all day while working on the problem disappears. After all, fixing the code, especially for features, is just a fraction of the entire job. It’s not done until that customer can use what they’ve written. And even then, it’s probably still a work in progress, because we can always learn more about how to help that customer best achieve their goals.
“Crap,” she mutters. I’m back in the same place I was before, a long way away from the First Ideal. I still can’t actually do anything myself, Maxine thinks. Once again, she is dependent on others to create customer value.
Oblivious, Tom laughs and opens up a new window. “It’s not so bad. We just need to go into the ticketing system and mark this issue as ‘done.’ That lets the QA team know to test it, so it can be promoted into production.”
Tom looks at his watch and turns back to her, “That was great. We got a lot done today. Want to pick another defect to work on?” Maxine forces a smile and nods. This sucks, Maxine thinks. She likes finishing things, not just starting things.
Maxine continues to work with Tom all day, picking the next most urgent defect to be fixed. Tom once again compliments Maxine on how she thinks about problems. He’s impressed at how she writes unit tests that can be run without the need for a complex integration test environment.
But there are limits—Data Hub’s job is to connect systems with each other. There’s only so much you can simulate on a single laptop. It would be nice to rearchitect Data Hub so you could, Maxine thinks wistfully.
Although she enjoys learning about Data Hub and the parts of the business it connects, there’s something about all this work that is deeply unsatisfying to her.
She thinks of Erik’s Second Ideal of Focus, Flow, and Joy. All the joy she felt vaporized when Tom told her that they had completed only a small portion of the work needed to create value. That’s just not good enough for her. In her MRP team, any developer could test their own code and even push code into production themselves. They didn’t have to wait weeks for other people to do that work for them. Being able to test and push code to production is more productive, makes for happier customers, creates accountability of code quality to the people who write it, and also makes the work more joyful and rewarding.
Maxine starts thinking about how to introduce some of the tools being built by the Rebellion. At the very minimum, we need to make standardized Dev environments available, so I can do builds on my laptop, she thinks. More things to talk about at the next Dockside meeting.
She continues to grind away, helping Tom with the work that he’s been assigned. Together they fix two defects and then tackle a crash-priority feature, this time to create some business rules around extended warranty plans, critical enough to be exempted from the feature freeze.
“Why is this so high priority?” Maxine asks Tom as she reads the ticket.
“This is hugely revenue-generating,” Tom explains. “Some of our highest-margin products are these new extended-warranty plans. Customers loved the pilot warranty program, especially for things like tires. Now in-store staff need a way to pull up this information, so they can do the repair work and file the claim with the third-party insurer.”
Tom continues, “Great for the customer, great for us, and a third-party insurer is taking all the financial risk.”
“Cool,” Maxine says, perking up. It’s features like this that support everything that Steve said in the Town Hall. It’s been a long time since Maxine’s done work on the revenue-generating side of the business.
Recommitting herself to feeling relentlessly optimistic, Maxine joins Tom in studying the feature, trying to figure out what is required to enable this important business capability. She tries not to think about how, even if they get it done today, it’ll just sit, waiting for the QA team to test it.
The next morning, Tom and Maxine are at a whiteboard, inventorying all the systems that they’ll need to change in order to enable extended warranties. Two more engineers have joined them as the scope keeps increasing. And then they realize that they’ll need to talk with engineers from two other teams, as well. Maxine guesses that they’ll have to bring in six other teams because of how many business systems this affects.
Maxine is dismayed as the number of teams that need to be involved keeps growing. This is again the opposite of the First Ideal of Locality and Simplicity. Here, the changes that need to be made are not localized. Instead, they’re scattered across many, many different teams. This is not the famous Amazon ideal of the “two-pizza team,” where features can be created by individual teams that can be fed with two pizzas.
We’ll need a whole truckload of pizzas to ship this feature, Maxine thinks, watching as Tom draws another set of boxes on the whiteboard.
Kurt pokes his head into the conference room. “Hey, sorry for the interruption. Someone from Ops and the manager of the channel training management application are on a conference bridge. All their customer logins are failing. They say the connector has stopped working?”
“Not again,” says Tom. “Authentication has been flaky ever since the Phoenix deployment. We’re on it …”
“Roger that,” Kurt says, tapping something on his phone. “I just created a chat channel for all of us, okay?”
Maxine follows Tom back to his desk. As Tom opens up another browser window and types something, a login error appears on his screen.
“Okay, something’s definitely not working right. Let’s see if we can isolate why …” Tom mutters. “I doubt it’s actually a Data Hub connector. More likely it’s the enterprise customer authentication service or a problem in the network.”
Maxine nods, taking notes as more of the Data Hub universe comes into view. Skeptical, she offers, “Can’t we rule out network and authentication right away? If either of those were down, we wouldn’t even be able to get to the website, and authentication being down would take out every service …”
“Good point …” Tom says. “Could definitely still be networking, though … we’ve had a bunch of issues lately. Last week, the networking people accidentally blocked some internal IP addresses that caused us problems.”
“Networking. It’s always the networking people, right?” she says, smiling. “But if it’s always the networking people, why are they calling us?” Maxine asks.
“Yeah, well, all the users know is that they can’t connect to Data Hub,” he says. “We always explain that it’s not us; it’s something we need to connect to. But they don’t care.”
When Maxine sees Tom pull up the Ops ticketing system and create a new ticket, she asks, “What’s this for?”
“We need the production logs for Data Hub and its connectors to see if they’re handling traffic or if they’ve crashed,” he responds, filling out the numerous fields.
“We can’t directly access production logs?” Maxine asks, afraid of the answer.
“Nope. Ops people won’t let us,” he says, typing into the form.
“So, someone has to respond to the ticket and copy the logs off the server for us?” she asks in disbelief.
“Yes,” he says, continuing to type, obviously very practiced at filling it out. He tabs between fields, types, mouses over to hit the drop-down box, hits the submit button, only to find that there’s still another required field that needs to be filled in.
Maxine groans. The Data Hub application that they’re working on might as well be running in outer space or at the bottom of a deep well. They can’t directly access it, they can’t see what it’s doing, and the only way they can understand what’s actually happening is to talk to someone in Operations through the ticketing system.
She wonders whether the ticket will get routed to her friend Derek at the helpdesk.
Tom finally succeeds in submitting the ticket. Satisfied, he says, “Now we wait.”
“How long does it usually take?” Maxine asks.
“For a Sev 2 incident? Not too bad—we’ll probably get it within a half hour. If it’s not related to an outage, it could take days,” Tom says. He looks at the clock. “What should we do while we wait?”
Even in the Data Hub team, she can’t escape the Waiting Place.
Four hours later, after reviewing the production logs, they confirm that the problem isn’t Data Hub. Two hours after that, everyone finally agrees. As Tom had suspected, it was an internal networking change that caused the problem.
Another round of intense finger-pointing ensues between Business Operations, Marketing, and within the technology organization. Eventually, Sarah gets involved and demands that there be severe consequences.
“Uh, oh,” says Tom, watching with Maxine from the far end of the table. “This can’t be good.”
From: Wes Davis (Director, Distributed Operations)
To: All IT Employees
Date: 7:50 p.m., September 25
Effective immediately, Chad Stone in network engineering is no longer with the company. Please direct all emails to his manager, Irene Cooper, or me.
For the love of all that is holy, please stop making mistakes so that I don’t have to write these stupid emails. (And if they fire me, direct your emails to Bill Palmer, VP, IT Operations.)
Finally, the day is over, which means another meeting at the Dockside Bar. They’ve invited the entire Data Hub team to join them. Maxine approves of being over-inclusive rather than accidentally leaving some worthy people out. Tom and three other engineers show up. Maxine is glad they’re here. After the last couple days, she’s eager to brainstorm ways to dramatically improve developer productivity on the Data Hub team.
Seeing everyone having fun, Maxine observes that this is a group of people who love hanging out with each other. Kurt stands up and addresses the group.
“Hello, new Rebellion teammates! Let me introduce everyone,” Kurt says. He introduces all the Rebellion members, as he did for Maxine and Kirsten. “And if you don’t mind, now that you’ve heard about some of the subversive things we’re working on to bring joy back to Parts Unlimited engineers, how about you tell us something that could make your lives a little easier?”
Tom’s two colleagues go first, introducing themselves and sharing their backgrounds. One has been on the Data Hub team, like Tom, for nearly a decade, but he doesn’t come up with anything to complain about, saying, “Life is okay, and I appreciate the invitation for drinks.”
When he clearly doesn’t have anything more to say, Tom starts. “Like my colleague, I’ve been on the Data Hub team for a long time. Back when it used to be called Octopus. We called it that because of how it connected to eight applications. Now it connects to over a hundred.
“I’ve been having a blast pair-programming with Maxine, and I still can’t believe we fixed a race condition bug! I’m delighted at her idea to get Data Hub Dev environments that we can all use,” he continues. “I’m not proud of this, but there have been times when we’ve hired new developers and six months later, they still can’t do a full build on their machines,” he says, shaking his head. “It wasn’t always like this. When I started, it was simpler. But over the years, we’ve hard-coded some things that we shouldn’t have, updated some things here, updated other things there, never quite documenting all of it … and now? It’s a mess.”
Looking up, he smirks at his teammates around the table, saying, “You know the developer joke of ‘it worked on my laptop’? Well, in Data Hub, we can’t even get it running on most people’s laptops.”
Everyone laughs. At one point or another, every developer on the planet has had this problem. It usually happens at the worst possible time, like when something crashes in production but mysteriously works perfectly on the developer’s laptop. Maxine remembers countless times when she’s had to painstakingly figure out what exactly was different between the developer’s laptop and the production environment.
“My pain points?” Tom muses. “It’s our environments. We used to have a good handle on this, but then we got moved into the Phoenix Project and they made us use environments from their centralized environments team.
“It’s crazy. We’re puny compared to the rest of Phoenix. To run Data Hub now, we have to install gigabytes of completely irrelevant dependencies,” he continues. “It takes forever to figure out how to get everything to run, and it’s so easy to break something by accident. No joke: I back up my work laptop every day because I’m so afraid that my builds will stop working and I’ll have to spend weeks figuring out how to fix them.”
Tom laughs. “Ten years ago, I lost my Emacs configuration file and couldn’t find a recent backup. I just didn’t have it in me to recreate it. I finally gave up and switched editors.”
Everyone laughs, adding their own stories of loss, anguish, and grief of having to give up their most treasured tools.
Tom turns to Maxine. “I’d love to spend a couple of days exploring how we can make a Dev environment that all of us could use in our daily work. If we had a virtual machine image or a Docker image, any new team member could do a build on any machine, any time. That would be incredible.”
“You and I are definitely going to get along,” Maxine says, smiling. “We need developers to be able to focus their best energies on building features, not trying to get builds to work. I have a ton of passion for this too and would love your help.”
“That’s terrific,” Kurt says. “We all know how important environments are. For now, feel free to spend half your time on this—I’ll hide it in the timecarding system.”
Later in the evening, Kirsten shows up and pours herself a glass of beer from the pitcher on the table. Smiling, she says, “What did I miss?”
“Just plotting the inevitable toppling of the existing order, of course,” Kurt says. The new Data Hub team members openly stare at Kirsten as she takes a seat.
Kurt asks, “Kirsten, how’s Project Inversion going? The feature freeze? I heard that Bill Palmer convinced Steve to put all feature work on hold so everyone can pay down technical debt.”
“Confirmed,” she says. “Sarah Moulton is going ballistic, complaining how ‘all the idle developers’ are jeopardizing the promises the company has already made to customers and Wall Street. I still can’t believe she doesn’t get how this helps her. But Project Inversion is definitely happening: for thirty days, Ops is not doing anything except things to support Phoenix.”
“They’re not kidding around,” Brent says. “Bill has been awesome. He’s told me in no uncertain terms that I’m to work only on Phoenix-related things. He’s taken me off of pager rotation for basically everything. He’s even taken me off of every mailing list, had me turn off notifications from every chat room, and told me not to answer the phone for anyone. And best of all, he said to absolutely not show up for any outage calls. If I do, he’ll fire me.”
Hearing this, Maxine is shocked. Bill would fire Brent? Thinking of all the people who’ve been fired lately, Maxine can’t figure out why Brent is smiling.
“It’s so fantastic,” Brent says, even appearing to be … tearing up? “Bill told me that he can’t fire the business unit executives or tell them what to do. He said that the only thing he can do is ensure that I’m not wasting time on those things. He said to tell anyone trying to reach me that I’ll be fired if I call them back.”
Brent laughs, obviously elated, finishing his beer and pouring himself another. “He’s assigned Wes to screen all my emails and phone calls and to yell at anyone trying to get ahold of me. Life is fantastic! Seriously, never better.”
Maxine smiles. She has seen how engineers can become the constraint many times in her career. It can be fun to be at the center of everything, but it’s certainly not sustainable. Down that road, only chronic wakeup calls, exhaustion, cynicism, and burnout await.
Kirsten smiles. “It’s working. Brent’s name shows up on more critical action items than anyone, and Bill has told everyone that their goal must be to protect his time.
“On the Development side, Chris promises that for thirty days, for all teams working on anything related to Project Phoenix, no new features,” Kirsten says, reading from her phone. “‘All teams need to be fixing high-priority defects, stabilizing the codebase, and doing whatever rearchitecting is needed to prevent another release disaster.’”
Maxine hears lots of excited murmurs from around the table. Maxine knows something like this is needed—and that this could be a fantastic opportunity for the Rebellion.
“There’s still a lot of disagreement among Chris’s direct reports on how to roll this out,” Kirsten continues. “They’ve spent so much time legislating what should and shouldn’t be worked on that we’ve already lost a week—lots of teams are still working on their features, business as usual. We’re going to need a lot more clarity from leadership on this—at this rate, the entire month will be gone, and we’ll have the same amount of technical debt as before, if not more.”
“I’m surprised no one is talking about all the problems they’re having with environments or automated testing or the lack of production telemetry,” Kurt says. “We’ve built some amazing capabilities that other people can use too. But we can’t be the people with a solution, peddling them to people who don’t know they have a problem.”
Kurt looks stumped. And frustrated.
“I totally want to help with this,” Shannon says, raising her hand. “I’ve worked with a bunch of the Phoenix teams. I could swing by each one tomorrow to start asking them what their constraints are and any ideas they have on how to fix them.”
“Good, good,” Kurt says, writing down some notes in his notebook.
“I’d love to help too, Shannon,” Maxine says. “But Tom and I will be a little tied up on Monday, because Monday is Testing Day. I’m going to finally get my changes tested with the QA folks. Outside of that, I’m yours!” A full tray of beer pitchers and two more glasses of wine appear.
They are soon in deep conversation about technical debt and ideas on how to take advantage of Project Inversion. Maxine turns to see Erik grabbing the seat next to her.
He joins the conversation as if he’s been there all along. “With Project Inversion, you are all at the beginning of a great journey. Every tech giant has nearly been killed by technical debt. You name it: Facebook, Amazon, Netflix, Google, Microsoft, eBay, LinkedIn, Twitter, and so many more. Like the Phoenix Project, they became so encumbered by technical debt they could no longer deliver what their customers demanded,” Erik says. “The consequences would have been fatal—and for every survivor, there are companies like Nokia who fell from the loftiest heights, killed by technical debt.
“Technical debt is a fact of life, like deadlines. Business people understand deadlines, but often are completely oblivious that technical debt even exists. Technical debt is inherently neither good nor bad—it happens because in our daily work, we are always making trade-off decisions,” he says. “To make the date, sometimes we take shortcuts, or skip writing our automated tests, or hard-code something for a very specific case, knowing that it won’t work in the long-term. Sometimes we tolerate daily workarounds, like manually creating an environment or manually performing a deployment. We make a grave mistake when we don’t realize how much this impacts our future productivity.”
Erik looks around the table, pleased that everyone is listening intently to his every word.
“All the tech giants, at some point in their history, have used the feature freeze to massively rearchitect their systems. Consider Microsoft in the early 2000s—that was when computer worms were routinely taking down the internet, most famously CodeRed, Nimda, and of course SQL Slammer, which infected and crashed nearly 100,000 servers around the world in less than ten minutes. CEO Bill Gates was so concerned that he wrote a famous internal memo to every employee, stating that if a developer has to choose between implementing a feature or improving security, they must choose security, because nothing less than the survival of the company was at stake. And thus began the famous security stand-down that affected every product at Microsoft. Interestingly, Satya Nadella, CEO of Microsoft, still has a culture that if a developer ever has a choice between working on a feature or developer productivity, they should always choose developer productivity.
“Back to 2002—that same year, Amazon CEO Jeff Bezos wrote his famous memo to all technologists, stating that they must rearchitect their systems so that all data and functionality are provided through services. Their initial focus was their OBIDOS system, originally written in 1996, which held almost all the business logic, display logic, and functionality that made Amazon.com so famous.
“But over time, it became too complected for teams to be able to work independently. Amazon likely spent over $1 billion over six years rearchitecting all their internal services to be decoupled from each other. The result was astonishing. By 2013 they were performing nearly 136,000 deployments per day. Interesting that these CEOs I mention all have a software background, isn’t it?
“Contrast that with the tragic story of Nokia. When their market was disrupted by Apple and Android, they spent hundreds of millions of dollars hiring developers and investing in rolling out Agile. But they did so without realizing their real problem: technical debt in the form of an architecture where developers could not be productive. They lacked the conviction to rebuild the foundations of their software systems. Just like at Amazon in 2002, every software team at Nokia was unable to build what they needed to because they were hamstrung by the Symbian platform.
“In 2010, Risto Siilasmaa was a board director at Nokia. When he learned that generating a Symbian build took a whole forty-eight hours, he said that it felt like someone hit him in the head with a sledgehammer,” Erik says. “He knew that if it took two days for anyone to determine whether a change worked or would have to be redone, there was a fundamental and fatal flaw in their architecture that doomed their near-term profitability and long-term viability. They could have had twenty times more developers, and it wouldn’t have made them go any faster.”
Erik pauses. “It’s incredible. Sensei Siilasmaa knew that all the hopes and promises made by the engineering organization were a mirage. Even though there were numerous internal efforts to migrate off of Symbian, they were always shot down by the top executives until it was too late.
“Business people can see features or apps, so getting funding for those is easy,” he continues. “But they don’t see the vast architectures underneath that support them, connecting systems, teams, and data to each other. And underneath that is something extraordinarily important: the systems that developers use in their daily work to be productive.
“It’s funny: the tech giants assign their very best engineers to that bottom layer, so that every developer can benefit. But at Parts Unlimited, the very best engineers work on features at that top layer, with no one besides interns on the bottom working on Dev productivity.”
Erik continues, “So your mission is clear. Everyone has been told to pay down technical debt, which will help you realize the First Ideal of Locality and Simplicity and the Second Ideal of Focus, Flow, and Joy. But almost certainly, you will have to master the Third Ideal of Improvement of Daily Work.” Then he gets up and leaves as quickly as he joined them.
Everyone watches him leave. Then Kirsten says, “Is he coming back?”
Cranky Dave throws his hands in the air. “What happened at Nokia is happening here. Two years ago, we could implement a significant feature in two to four weeks. And we delivered a ton of great stuff. I remember those days! If you had a great idea, we could get it done.
“But now? That same class of feature takes twenty to forty weeks. Ten times longer! No wonder everyone’s so pissed off at us,” Cranky Dave yells. “We’ve hired more engineers, but it feels like we’re getting less and less done. And not only are we slower, those changes are incredibly dangerous to make.”
“This makes sense,” Kirsten says. “By almost any measure, productivity is flat or down. Feature due date performance is way down. I did some research since our last meeting—I asked my project managers to sample a couple of features and find out how many teams were required to implement them. The average number of teams required was 4.2, which is shocking. Then they told me that many had to interact with over eight teams,” she says. “We’ve never formally tracked this, but most of my people say that these numbers are definitely higher than they were two years ago.”
Maxine’s jaw drops. Absolutely no one can get anything done if they have to work with eight other teams all the time, she realizes. Just like the extended warranty feature she started working on with Tom.
“Well, Project Inversion is our shot to fix some of these things and to engineer our way out of this,” Kurt says. “Shannon will find out what the Phoenix teams need help on. How about us? If someone gave us the authority, and we were given infinite resources for one month, what would we do?”
Maxine smiles as she hears the suggestions fly fast and furious. They start making a list: Every developer uses a common build environment. Every developer is supported by a continuous build and integration system. Everyone can run their code in production-like environments. Automated test suites are built to replace manual testing, liberating QA people to do higher value work. Architecture is decoupled to liberate feature teams, so developers can deliver value independently. All the data that teams need is put in easily consumed APIs …
Shannon looks over the list they’ve generated, smiling. “I’ll post the updated list when I’m done interviewing the teams tomorrow. This is exciting,” she says. “This is what the developers want, even if they can’t articulate it. And that’s something I can help them with!”
It’s a great list, Maxine thinks. Everyone’s enthusiasm is evident.
“That is indeed a great list, Shannon, which could dramatically change the dynamics of how engineers work,” Erik says, sitting down next to Kirsten once again. Maxine looks around, wondering where he came from. Gesturing at Kirsten, he continues, “But consider the forces arrayed against you. The entire Project Management Office aims to keep projects on-time and on-budget, following the rules and enforcing the promises written long ago. Look at how Chris’s direct reports act—despite Project Inversion, they keep working on the features because they’re afraid of slipping their dates.
“Why? A century ago, when mass production revolutionized industry, the role of the leader was to design and decompose the work and to verify that it was performed correctly by armies of interchangeable workers, who were paid to use their hands, not their heads. Work was atomized, standardized, and optimized. And workers had little ability to improve the system they worked within.
“Which is strange, isn’t it?” Erik muses. “Innovation and learning occur at the edges, not the core. Problems must be solved on the front-lines, where daily work is performed by the world’s foremost experts who confront those problems most often.
“And that’s why the Third Ideal is Improvement of Daily Work. It is the dynamic that allows us to change and improve how we work, informed by learning. As Sensei Dr. Steven Spear said, ‘It is ignorance that is the mother of all problems, and the only thing that can overcome it is learning.’
“The most studied example of a learning organization is Toyota,” he continues. “The famous Andon cord is just one of their many tools that enable learning. When anyone encounters a problem, everyone is expected to ask for help at any time, even if it means stopping the entire assembly line. And they are thanked for doing so, because it is an opportunity to improve daily work.
“And thus problems are quickly seen, swarmed, and solved, and then those learnings are spread far and wide, so all may benefit,” he says. “This is what enables innovation, excellence, and outlearning the competition.
“The opposite of the Third Ideal is someone who values process compliance and TWWADI,” he says with a big smile. “You know, ‘The Way We’ve Always Done It.’ It’s the huge library of rules and regulations, processes and procedures, approvals and stage gates, with new rules being added all the time to prevent the latest disaster from happening again.
“You may recognize them as rigid project plans, inflexible procurement processes, powerful architecture review boards, infrequent release schedules, lengthy approval processes, strict separation of duties …
“Each adds to the coordination cost for everything we do, and drives up our cost of delay. And because the distance between where decisions are made and where work is performed keeps growing, the quality of our outcomes diminishes. As Sensei W. Edwards Deming once observed, ‘a bad system will beat a good person every time.’
“You may have to change old rules that no longer apply, change how you organize your people and architect your systems,” he continues. “For the leader, it no longer means directing and controlling, but guiding, enabling, and removing obstacles. General Stanley McChrystal massively decentralized decision-making authority in the Joint Special Operations Task Force to finally defeat Al Qaeda in Iraq, their much smaller but nimbler adversary. There the cost of delay was not measured in money, but in human lives and the safety of the citizens they were tasked to protect.
“That’s not servant leadership, it’s transformational leadership,” Erik says. “It requires understanding the vision of the organization, the intellectual stimulation to question the basic assumptions of how work is performed, inspirational communication, personal recognition, and supportive leadership.
“Some think it’s about leaders being nice,” Erik guffaws. “Nonsense. It’s about excellence, the ruthless pursuit of perfection, the urgency to achieve the mission, a constant dissatisfaction with the status quo, and a zeal for helping those the organization serves.
“Which brings us to the Fourth Ideal of Psychological Safety. No one will take risks, experiment, or innovate in a culture of fear, where people are afraid to tell the boss bad news,” Erik says, laughing. “In those organizations, novelty is discouraged, and when problems occur, they ask ‘Who caused the problem?’ They name, blame, and shame that person. They create new rules, more approvals, more training, and, if necessary, rid themselves of the ‘bad apple,’ fooling themselves that they’ve solved the problem,” he says.
“The Fourth Ideal asserts that we need psychological safety, where it is safe for anyone to talk about problems. Researchers at Google spent years on Project Aristotle and found that psychological safety was one of the most important factors of great teams: where there was confidence that the team would not embarrass, reject, or punish someone for speaking up.
“When something goes wrong, we ask ‘what caused the problem,’ not ‘who.’ We commit to doing what it takes to make tomorrow better than today. As Sensei John Allspaw says, every incident is a learning opportunity, an unplanned investment that was made without our consent.
“Picture this scenario: You are in an organization where everyone is making decisions, solving important problems every day, and teaching others what they’ve learned,” Erik says. “Your adversary is an organization where only the top leaders make decisions. Who will win? Your victory is inevitable.
“It’s so easy for leaders to talk about the platitudes of creating psychological safety, empowering and giving a voice to the front-line worker,” he says. “But repeating platitudes isn’t enough. The leader must constantly model and coach and positively reinforce these desired behaviors every day. Psychological safety slips away so easily, like when the leader micromanages, can’t say ‘I don’t know,’ or acts like a know-it-all, pompous jackass. And it’s not just leaders, it’s also how one’s peers behave.”
A bartender walks up to Erik and whispers something in his ear. Erik mutters, “Again?” He looks up and says, “I’ll be right back. Something requires my attention,” and walks away with the bartender.
They stare at Erik walking away. Dwayne eventually says, “He’s so right about the Third and Fourth Ideal. What can we do about the culture of fear that’s all around us? Look at what happened to Chad. He tried to do the right thing and got fired. I probably have more reasons to dislike Chad than any of you—those rolling network outages during the day drove me crazy. But firing Chad doesn’t do a damned thing to make those outages less likely in the future.
“I did some asking around to find out what actually happened,” Dwayne continues. “Apparently, Chad had worked four nights in a row, in addition to working his normal daytime hours, to support the store modernization initiative. When I asked why, he told me he didn’t want the store teams to get dinged on their status reports because of him.”
Kirsten raises an eyebrow. Dwayne continues, “His manager kept badgering him to go home, and he finally went home on time on Wednesday. But he was back online at midnight because he didn’t want to let the store launch team down. He was so worried about all the work piling up, in tickets and in the chat rooms, he wasn’t sleeping through the night anymore.
“So he comes into work early on Thursday morning, still tired from all those late nights, and he takes on an urgent internal networking change that needed to be made,” he says. “He opens his laptop, and there’s like thirty terminal windows open from all the things he’s working on. He types a command into the terminal window and hits enter. And it turns out, he typed it into the wrong window.
“Blam! Most of the Tier 2 business systems become inaccessible, including Data Hub,” he says. “The next day, he’s fired. Does that seem right to you? Does that seem fair and just?”
“Oh, my God,” Maxine blurts out, horrified. She knows exactly how this feels. She’s done it several times in her career. You type something, hit enter, and immediately realize you’ve made a huge mistake, but it’s too late. She’s accidentally deleted a customer database table thinking it was the test database. She’s accidentally rebooted the wrong production server, taking down an order entry system for an afternoon. She’s deleted wrong directories, shut down wrong server clusters, and disabled the wrong login credentials.
Each time, it felt like her blood turned to ice, followed by panic. Once, earlier in her career, when she accidentally deleted the production source control repository, she literally wanted to crawl under her desk. Because of the OS it ran on, she knew no one would ever know it was her. But despite being afraid to tell anyone about it, she told her manager. It was one of the scariest things she had done as a young engineer.
“That really, really sucks, Dwayne,” says Brent. “That could have been me … Seriously, every week I’m in situations where I could have made that same mistake.”
Maxine says, “It could have been any of us. Our systems are so tightly coupled around here, even small changes can have a catastrophic impact. And worse, Chad couldn’t ask for help when he obviously needed it. No one can sustain those insanely long working hours. Who wouldn’t start making mistakes if you can’t even sleep anymore?”
“Yes!” Dwayne exclaims. “How did we get into this position where someone is so overworked that they’re working four nights in a row? What sort of expectations are being set when someone can’t take a day off when they need to? And what sort of message are we sending when the reward for caring so much is that we fire you?”
“An excellent point, Dwayne,” Maxine hears Erik say, once again rejoining them at the table. “You’d be surprised how deeply this sense of injustice would resonate with Steve. You’d know that if you’ve spent any time on the manufacturing floor.”
“How so?” Maxine asks. She’s spent plenty of time working with the plant floor personnel.
“Did you know that when Steve signed on as the COO and VP of manufacturing, he made it contingent upon the company publicly targeting zero on-the-job workplace injuries? He was almost laughed out of the room, not just by the board of directors but also by the plant personnel and even the union leadership,” Erik says, smiling. “People thought he was naive, and maybe even a bit addled in the head. Probably because a ‘real business leader’ would want to be measured on profitability or due-date performance. Or perhaps quality. But safety?
“Rumor was that Steve told Bob Strauss, who was CEO at the time, ‘If you can’t depend on the manufacturing workforce to not get hurt on the job, why should you believe anything we say about our quality goals? Or our ability to make you money? Safety is a precondition of work.’”
Erik pauses. “Even in these supposedly enlightened days, leaders rarely talk like that. Steve had closely studied the work of Sensei Paul O’Neill, the legendary CEO of Alcoa in the 1980s and 1990s, who prioritized workplace safety above all else. His board of directors initially thought he was crazy, but in the fifteen years of his tenure as CEO, net income increased from $200 million to $1.5 billion, and Alcoa’s market cap went from $3 billion to $27 billion.
“Despite that impressive economic performance,” Erik continues, “what Sensei O’Neill talks about most is his legacy of safety. For decades, Alcoa has remained the undisputed leader of workplace safety. When he joined, Alcoa was rightly proud of having an above-average safety record. But with two percent of their workforce of 90,000 employees being injured every year, if you worked your entire career at Alcoa, you had a forty percent chance of being hurt on the job.
“Alcoa has far more hazardous working conditions than in your manufacturing plants,” he says. “In the aluminum business, you have to deal with high heat, high pressure, corrosive chemicals, end-products weighing tons that need to be safely transported …
“Sensei O’Neill famously said, ‘Everyone must be responsible for their own safety and the safety of their teammates. If you see something that could hurt someone, you must fix it as quickly as possible.’ He told everyone that fixing safety issues should never be budgeted—just fix it, and they’d figure out how to pay for it later,” Erik continues. “He gave out his home phone number to all plant workers, telling them to call him if they ever saw plant managers not acting quickly enough or not taking safety seriously.
“Sensei O’Neill tells a story about his first workplace fatality,” Erik continues. “In Arizona, an eighteen-year-old boy died. He jumped into an extrusion machine trying to clear a piece of scrap material. But when he did, a boom released, swinging around and killing him instantly.
“This boy had a wife who was six months pregnant,” Erik says. “There were two supervisors there. Sensei O’Neill said they watched him do it, and probably trained the boy to do exactly what he did.
“In the end, Sensei O’Neill stood up in front of the entire plant and told everyone, ‘We killed him. We all killed him. I killed him. Because I clearly didn’t do a good enough job communicating how people must not get hurt on the job. Somehow it was possible that people thought it was okay for people to get hurt. We must all be accountable for keeping ourselves and everyone safe.’
“As he later said, ‘Alcoans were extremely caring people. Every time people were injured, they mourned and there was always lots of regret—but they didn’t understand that they were responsible. It had become a learned condition to tolerate injuries.’”
Erik pauses to wipe a tear from his eye. “One of Steve’s first actions was to incorporate Sensei O’Neill’s True North of zero workplace injuries into every aspect of manufacturing plant operations here at Parts Unlimited. Early on, he also instituted a policy that every workplace injury must be reported to him directly within twenty-four hours, along with remediation plans. What a magnificent example of the Third Ideal of Improvement of Daily Work and the Fourth Ideal of Psychological Safety.”
As Erik stares at the wall for several moments, Maxine suddenly realizes why Steve talks about workplace injuries at every Town Hall. He knows he can’t directly influence everyone’s daily work. However, Steve can reinforce and model his desired values and norms, which he does so effectively, Maxine realizes.
Maxine stares back at Erik. She’s never even talked to Steve. How could she possibly do what Erik suggests?
From: Chris Allers (VP, R&D)
To: All Dev; Bill Palmer (VP, IT Operations)
Date: 11:10 p.m., September 25
Subject: Project Inversion: feature freeze
Effective immediately, as part of Project Inversion, there will be a feature freeze for the Phoenix Project. We will make a maximum effort for thirty days to increase the stability and reliability of Phoenix, as well as all supporting systems.
We will suspend all feature work so that we can fix defects and problematic areas of code and pay down technical debt. By doing this, we will enable higher development productivity and faster feature throughput.
During this period, we will also suspend all Phoenix deployments, except for emergency changes, and our Ops teammates will be working on making deployments faster and safer and increasing the resilience of our production services.
We are confident that doing this will help the company achieve its most important strategic goals. If you have any questions or concerns, please email me.
From: Alan Perez (Operating Partner, Wayne-Yokohama Equity Partners)
To: Sarah Moulton (SVP of Retail Operations)
Date: 3:15 p.m., September 27
Subject: Strategic Options **CONFIDENTIAL**
Sarah—in confidence …
Good meeting yesterday. I’m glad that I had the opportunity to share with you my philosophy of creating shareholder value—in general, we favor “value” and operational discipline over “growth.” Our firm has created outsized returns by investing in companies like Parts Unlimited. My plan would create fantastic and consistent cash flow, at a rate higher than most people even think possible. At other companies, we’ve created considerable wealth for investors (and company executives).
As promised, I’m introducing you to several CEOs in our portfolio of companies whom you may be interested in talking to. Please ask them about how we helped increase shareholder value.
PS: Did I understand correctly that there’s now a “feature freeze” for Phoenix? Doesn’t that put you even further behind? And now what do you do with all those new developers you talked about last time? And what will they work on?