
Chapter 10. Case Study: Redundant Database Schemas

For the first of our two case study chapters, we explore a refactor I carried out with a few other members of my team during my first year at Slack. The project centered on consolidating two redundant database schemas. Both schemas were tightly coupled to our increasingly unwieldy codebase, and we had very few unit tests to rely on. In short, this project is a great example of a realistic, large-scale refactor at a relatively young, high-growth company with a modest number of engineers.

This project was successful primarily because we remained hyperfocused on our ultimate goal of consolidating the redundant database tables. We drafted a simple but effective execution plan (Chapter 4), thoughtfully weighing risk and speed of execution to deliver on our solution promptly. We opted for a lightweight approach to gathering metrics (Chapter 3), choosing a narrow focus on just a few key data points. We proactively communicated our changes widely, across the entirety of the engineering team, whenever we completed a new milestone (Chapter 7). We built tooling to ensure that our changes would persist (Chapter 9). Finally, we successfully demonstrated the value of the refactor by seamlessly shipping a new feature built atop the newly consolidated schema just weeks after its completion. This enabled us to get further buy-in to kick off further refactors (Chapter 5).

Although the refactor yielded the performance improvements we sought, we took a few missteps along the way. Due to significant pressure from our most important customers, we rushed to start making headway; we did not investigate why the schemas had converged, nor commit our plan to writing for other teams to consume easily (Chapter 4). We didn’t seek broader, cross-functional support (Chapter 5), leaving the bulk of the work to our small team. Even then, we struggled to keep up the momentum, and the refactor dragged in its final few weeks (Chapter 8).

Before we dive into the refactor itself, however, it’s imperative to understand what Slack does and the basics of how it works. If you aren’t familiar with the product, I strongly recommend giving this section a thorough read. If you’re a regular Slack user, feel free to skip ahead to “Slack Architecture 101”.

Slack 101

Slack is first and foremost a collaboration tool for companies of all sizes and industries. Typically, a business will set up a Slack workspace and create user accounts for each employee. As an employee, you can download the application (on your desktop machine, your mobile phone, or both) and immediately begin communicating with your teammates.

Slack organizes topics and conversations into channels. Let's say that you're working on a new feature that enables your users to upload files into your application faster. We'll call the project "Faster Uploads." You can create a new channel named #feature-faster-uploads, where you can coordinate development with fellow engineers, your manager, and your product manager. Anyone at the company curious to know how development is going on "Faster Uploads" can navigate to #feature-faster-uploads and read through the recent history, or join the conversation and ask the team a question directly.

You can see a simple example of what the Slack interface looked like during the first half of 2017, around the time of this first case study, in Figure 10-1.

Here, our example user is Matt Kump, an employee of Acme Sites. You can see the name of the workspace we’re currently viewing at the top left, and Matt’s name immediately below it.

The leftmost sidebar contains all of Matt’s channels. We’ll ignore the starred section for now and focus on the Channels section first. We can see from this list that Matt is involved in conversations about accounting costs (#accounting-costs), brainstorming (#brainstorming), business operations (#business-ops), and a handful of others. Each of these channels is public, meaning that anyone with an account at Acme Sites can discover the channel, view its contents, and join it.

You might have noticed that the #design-chat channel has a little lock where the others have the # symbol. This indicates that the channel is private. Only users who are members of the private channel can discover it and view its contents. To join a private channel, you must be invited by someone who is already a member.

Figure 10-1. Slack interface circa January 2017

Farther down the sidebar is Matt’s list of Direct Messages. We can see that he’s in a number of direct, one-on-one conversations with fellow teammates like Brandon, Corey, and Fayaz. He is also in a group conversation with both Lane and Pavel; these work just like direct messages, but with a handful of teammates rather than just one.

Note

Understanding the distinction between public and private channels becomes important when we start discussing some of the key problems this case study refactor sought to solve.

You may have noticed that some of the channels in the sidebar appear bolded in bright white. This indicates that they contain new messages you haven’t read yet. If Matt were to select #brainstorming, he would find some new content to read, and the channel in the sidebar would fade to match the others.

While there’s much, much more to Slack, this covers the basics you’ll need to understand before we dive into the historical context leading up to this case study.

Slack Architecture 101

Now let's explore a few basic components of Slack's architecture that are at the core of our study. It's important to note that some of these components have changed significantly since the refactoring effort outlined in this chapter, so the details provided here do not accurately reflect how Slack is architected today.

Let’s take a look at a simple request to fetch message history for a given channel. I’ll boot up my Slack instance and pop open one of my favorite channels, #core-infra-sourdough (shown in Figure 10-2), where a handful of infrastructure engineers discuss sourdough baking.

Figure 10-2. Reading the latest bread-baking advice in #core-infra-sourdough

If I monitored network traffic, I would see a GET request to the Slack API for channels.history with the channel ID for #core-infra-sourdough. The request would first hit a load balancer to reach an available web server. Next, the server would verify a few things about the request, including confirming that the provided token was valid and that I had access to the channel I wanted to read. If I did, the server would fetch the most recent messages from the appropriate database, format them, and return them to my client. Voilà! In just a few milliseconds, I could fetch the most recent content for the channel I selected.
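To make that flow concrete, here is a rough sketch of what such a request handler could look like, written in the same PHP style as the examples later in this chapter. The helper functions (auth_user_for_token, channel_is_visible_to_user, team_for_user, api_ok, api_error, format_messages), the exact SQL, and the limit are all hypothetical stand-ins, not Slack's actual API implementation; db_fetch_team is the same helper that appears in Example 10-2.

function sketch_channels_history($token, $channel_id) {
    // Authenticate the caller from the provided token.
    $user = auth_user_for_token($token);
    if (!$user) {
        return api_error('invalid_auth');
    }

    // Confirm the caller is allowed to read this channel.
    if (!channel_is_visible_to_user($user, $channel_id)) {
        return api_error('channel_not_found');
    }

    // Fetch the most recent messages from the workspace's database.
    $team = team_for_user($user);
    $sql = 'SELECT * FROM messages WHERE team_id=%team_id AND channel_id=%channel_id
            ORDER BY ts DESC LIMIT 100';
    $messages = db_fetch_team($team, $sql, array(
        'team_id'    => $team['id'],
        'channel_id' => $channel_id,
    ));

    // Format the rows and hand them back to the client.
    return api_ok(array('messages' => format_messages($messages)));
}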

How did the server know which database to reach out to in order to locate the correct messages? Within the product, everything belonged to a single workspace. All messages were contained within channels, and all channels were contained within a workspace. Having everything map to a single, logical unit gave us a convenient way of horizontally distributing our data.

Every workspace was assigned to a single database shard, where all of its relevant information was stored. If a user was a member of a workspace and wanted to get a list of all the public channels available, our servers would make an initial query to find out which shard contained the workspace’s data and then query that shard for the channels.
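One way to picture this routing is as a thin wrapper that resolves a workspace to its shard before running the query, which is roughly the role the db_fetch_team helper plays in the examples later in this chapter. The sketch below is illustrative only; shard_lookup, db_connect_to_shard, and db_query are hypothetical helpers, not Slack's actual implementation.

// Rough sketch of per-workspace query routing.
function sketch_db_fetch_team($team, $sql, $bind) {
    // Look up which shard holds this workspace's data.
    $shard = shard_lookup($team['id']);

    // Connect to that shard and run the query against it.
    $conn = db_connect_to_shard($shard);
    return db_query($conn, $sql, $bind);
}

// Example usage: list a workspace's public channels.
$channels = sketch_db_fetch_team($team,
    'SELECT * FROM teams_channels WHERE team_id=%team_id AND date_delete=0',
    array('team_id' => $team['id']));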

If a large customer grew and began to occupy more space within a shard that it shared with other companies, we redistributed these other companies to different shards, giving the growing customer more wiggle room. If a customer was the sole occupant of their shard and they continued to grow, we upgraded the shard’s hardware to accommodate the growth. All in all, our database structure looked as pictured in Figure 10-3.

Figure 10-3. Some workspaces distributed across shards

Next, we’ll take a peek at how we stored a few key pieces of information in each workspace shard. Specifically, we’ll look at channels and channel membership. At the start of 2017, Slack had a few tables responsible for storing information about channels. We had a table that stored information for public channels, called teams_channels. We had another table, groups, which stored information for private channels and group direct messages (messages among more than one user). Each of these tables contained basic information about the channel, things like the name of the channel, when it was created, and who created it. Figure 10-4 illustrates a few sample rows of two tables we used to store channel information.

Figure 10-4. Simplified table schema for teams_channels and groups

We stored information about members of those channels on teams_channels_members and groups_members, respectively. For each member, we would store a row uniquely identified by the combination of workspace ID, channel ID, and user ID. We additionally stored some key pieces of information regarding that user’s membership such as the date that they joined the channel and the time, as a Unix epoch timestamp, at which they last read content in that channel. Figure 10-5 demonstrates that these two tables were nearly identical.

Figure 10-5. Simplified table schema for teams_channels_members and groups_members

Finally, for direct messages, we had a single table called teams_ims (shown in Figure 10-6) to store information about both the channel itself and its membership.

Figure 10-6. Simplified table schema for teams_ims

In total, we had three distinct tables to store information about channels, and three distinct tables to store information about channel membership. Figure 10-7 illustrates the role of each table as it relates to the kind of channel it dealt with.

Figure 10-7. Chart designating the tables responsible for storing channel and respective membership information, depending on its kind (public, private, group DM, or DM)

Scalability Problems

Now that we have a better understanding of Slack’s basic architecture and, more specifically, how channels and channel membership were represented, we can dive into the problems that arose as a result. We’ll describe three of the most serious problems we encountered, as they were experienced by our largest customer at the time, which we’ll refer to as Very Large Business, or VLB for short, for the remainder of the chapter.

VLB was eager for all of its 350,000 employees to use Slack. It had adopted the product slowly at first but began ramping up its usage aggressively during the first few months of 2017. By April, it had just over 50,000 users on the platform, nearly double that of our second-largest customer. VLB started hitting the limitations of nearly every piece of our product. At the time, I was part of the team responsible for Slack's performance with our biggest customers. For several weeks, our team shared a rotation whereby two of us needed to be at our desks in our San Francisco headquarters at 6:30 a.m. to be ready to respond to any immediate issues during VLB's peak log-in time on the East Coast. As our team scrambled to patch problems left and right, we began to notice that each of them was exacerbated by the fact that we had redundant database tables for storing channel membership.

Booting Up the Slack Client

Every weekday morning, starting at 9 a.m. eastern time, VLB employees would start logging on to Slack. As more people began their workday, more load began to pile up on VLB’s database shard. Our existing instrumentation showed us that the culprit was most likely one of the most crucial APIs we called on startup, rtm.start.

This API returned all the necessary information to populate a user’s sidebar; it fetched all the public and private channels the user was a member of, fetched all the group and direct messages they had open, and determined whether any of those channels contained messages that they hadn’t yet read. The client would then parse the result and populate the interface with a tidy list of bolded and unbolded conversations.

From the server perspective, this was an incredibly expensive process. To determine a user’s memberships, we needed to query three tables: teams_channels_members, groups_members, and teams_ims. From each set of memberships, we extracted the channel_id and fetched the corresponding teams_channels or groups row to display the channel name. We also queried the messages table to fetch the timestamp of its most recent message, which we compared to the user’s last_read timestamp to determine whether they had any unread messages. We executed the vast majority of these queries individually, incurring network roundtrip costs each time.
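Sketched in code, that fan-out might have looked something like the following. This is purely illustrative: the helpers fetch_channel_row and fetch_latest_message_ts are hypothetical, and the sketch assumes db_fetch_team returns an array of rows.

// Illustrative sketch of the per-channel fan-out behind rtm.start.
function sketch_rtm_start_sidebar($team, $user_id) {
    // Three separate membership tables meant three separate lookups.
    $membership_sql = array(
        'SELECT channel_id, last_read FROM teams_channels_members
             WHERE team_id=%team_id AND user_id=%user_id',
        'SELECT group_id AS channel_id, last_read FROM groups_members
             WHERE team_id=%team_id AND user_id=%user_id',
        'SELECT channel_id, last_read FROM teams_ims
             WHERE team_id=%team_id AND user_id=%user_id',
    );

    $bind = array('team_id' => $team['id'], 'user_id' => $user_id);
    $memberships = array();
    foreach ($membership_sql as $sql) {
        $memberships = array_merge($memberships, db_fetch_team($team, $sql, $bind));
    }

    // Then, for each membership, fetch the channel row and its latest message
    // timestamp, mostly as individual queries, each paying a network roundtrip.
    $sidebar = array();
    foreach ($memberships as $m) {
        $channel = fetch_channel_row($team, $m['channel_id']);
        $latest  = fetch_latest_message_ts($team, $m['channel_id']);
        $sidebar[] = array(
            'channel' => $channel,
            'unread'  => ($latest > $m['last_read']),
        );
    }

    return $sidebar;
}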

File Visibility

Sporadically throughout the day, we noticed spikes in expensive queries to the database. Our dashboards surfaced a few potential candidate callsites, including the function responsible for calculating file visibility at the core of most of our files-related APIs. Popping open the target function, we yet again came face to face with a set of complex queries.

When a user uploads a file to Slack, the servers write a new row to the files table denoting the file's name, its location on our remote file server, and a handful of other relevant pieces of information. Whenever a file is shared to a channel, we write a new entry to the files_shares table, denoting the file ID and the ID of the channel to which it was shared. When a file is shared to a public channel, it becomes visible to any user on the workspace and is denoted as publicly discoverable by setting the is_public column to true on its files row. Thus, in the simplest case, the file is public, we can determine that quickly, and we can reveal it to the user.

When a file isn’t public, however, the logic becomes a little bit more complicated. We have to cross-reference all channels that the user is a member of with all the channels where the file was shared. As is the case for rtm.start, to determine a user’s complete set of channel memberships, we had to query three distinct tables. We then combined those results with those from the files_shares table for the target file. If we found a match, we could show the file to the user; if not, we returned an error to the client.
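A simplified sketch of that check follows. It is illustrative only: fetch_all_membership_channel_ids is a hypothetical helper standing in for the three membership-table lookups, and the exact SQL is assumed.

// Illustrative sketch of the file-visibility check.
function sketch_file_is_visible($team, $user_id, $file) {
    // Fast path: public files are visible to anyone on the workspace.
    if ($file['is_public']) {
        return true;
    }

    // Otherwise, cross-reference the user's memberships (spread across
    // three tables at the time) with the channels the file was shared to.
    $member_channel_ids = fetch_all_membership_channel_ids($team, $user_id);

    $shares = db_fetch_team($team,
        'SELECT channel_id FROM files_shares WHERE team_id=%team_id AND file_id=%file_id',
        array('team_id' => $team['id'], 'file_id' => $file['id']));

    foreach ($shares as $share) {
        if (in_array($share['channel_id'], $member_channel_ids)) {
            return true;
        }
    }

    // No overlap: the user can't see the file.
    return false;
}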

Mentions

The query that caused the most consistent load on VLB's shard for the full duration of the workday was the one responsible for determining whether a user (or the topics they subscribe to) was mentioned in a channel and had not yet read those messages. A mention can be any number of things within Slack. It can be a username or a username prefixed with the @ symbol. It can be a highlight word for which the user has enabled notifications within their user preferences. The client would then use the query's results to populate badges with the number of unread mentions to the right of the corresponding channel name in the sidebar. You can see one of the many complex mentions-related queries in its 40-line glory in Example 10-1.

This query, yet again, required fetching a user's memberships across the three membership tables. The tricky part was excluding any memberships for which the associated channel had been deleted or archived, which required us to join the membership results with their corresponding channel rows in either groups or teams_channels.

Example 10-1. Query to determine whether to notify a user of a mention; % symbolizes substitution syntax
SELECT
    tcm.channel_id as channel_id,
    'C' as type,
    tcm.last_read
FROM
    teams_channels tc
    INNER JOIN teams_channels_members tcm ON (
        tc.team_id = tcm.team_id
        AND tc.id = tcm.channel_id
    )
WHERE
    tc.team_id = %TEAM_ID
    AND tc.date_delete = 0
    AND tc.date_archived = 0
    AND tcm.user_id = %USER_ID
UNION ALL
SELECT
    gm.group_id as channel_id,
    'G' as type,
    gm.last_read
FROM
    groups g
    INNER JOIN groups_members gm ON (
        g.team_id = gm.team_id
        AND g.id = gm.group_id
    )
WHERE
    g.team_id = %TEAM_ID
    AND g.date_delete = 0
    AND g.date_archived = 0
    AND gm.user_id = %USER_ID
UNION ALL
SELECT
    channel_id as channel_id,
    'D' as type,
    last_read
FROM
    teams_ims
WHERE
    team_id = %TEAM_ID
    AND user_id = %USER_ID

Consolidating the Tables

Now that we have sufficient background on the problem we aimed to solve, we can begin to discuss the refactor. I wish I could say that consolidating teams_channels_members and groups_members into a single table was a well-planned and smartly executed project, but that would not be true. In fact, the more chaotic portions of the refactor are what inspired and informed a great deal of the ideas in this book. We kicked things off with a sense of urgency, didn’t keep great tabs on progress as we went along, and in the end, although we knew we had decreased the load across most of our database tier, we could only point to a single metric to demonstrate roughly by how much. What ultimately made the project a success was the smart, dedicated set of individuals who helped us cross the finish line. Although our largest customers stood to benefit the most from the refactor, all of our customers ultimately benefited from the project.

We started the project somewhat immediately and without a written plan. Our top priority was to get the consolidation of the tables just to the point where we could migrate the one query that was hammering our database shards the most: the mentions query.

Although we knew that a great many queries would equally benefit from the consolidated table, their migration was strictly secondary. In Chapter 1, I strongly suggested that you not embark on a large-scale refactor unless you are confident that you can finish it. In this case, we certainly intended to finish the table consolidation; we just didn’t know whether other, more pressing performance issues might creep up and need to be prioritized over the refactor. We were willing to take the risk, given the urgency of the problem at hand, fully aware of the consequences if we failed to finish the migration.

First, we created a new table, channels_members. We combined the schemas of the membership tables, complete with the same indices, and introduced a new column to denote whether a row originated from teams_channels_members or groups_members, both to ease the migration and to ensure that we could respect any business-logic dependencies around the initial tables. Figure 10-8 shows our goal state as compared to Figure 10-7, our starting state.

Figure 10-8. Our goal state
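As a rough sketch, the consolidated table might have been defined along the lines below. Everything here beyond the columns already discussed in the text (the workspace, channel, and user IDs, the join date, the last_read timestamp, and the origin column) is an assumption, including the exact column names, the types, and the indices.

// Hypothetical sketch of the consolidated membership table's definition.
$sql = "CREATE TABLE channels_members (
    team_id      BIGINT UNSIGNED NOT NULL,  -- workspace ID
    channel_id   BIGINT UNSIGNED NOT NULL,  -- public or private channel ID
    user_id      BIGINT UNSIGNED NOT NULL,
    channel_type CHAR(1) NOT NULL,          -- origin of the row: public vs. private
    date_joined  INT UNSIGNED NOT NULL,     -- when the user joined the channel
    last_read    BIGINT UNSIGNED NOT NULL,  -- timestamp of last-read content
    PRIMARY KEY (team_id, channel_id, user_id)
)";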

Gathering the Scattered Queries

Rewriting our queries to target a single new table would not be easy. Slack's codebase was written in a very imperative style, with everything from short functions to long functions distributed across hundreds of loosely namespaced files. Its original authors had stuck to what they knew well and steered clear of object-oriented patterns due to performance concerns with PHP. They preferred writing individual queries inline rather than rely on an object-relational mapping library and risk bloating the codebase early.

One-off queries to either teams_channels_members or groups_members were strewn across 126 files. Many of the queries hadn’t been touched since well before the product launched. To top it off, we knew much of the code that contained these queries didn’t have great unit test coverage. To give you a sense of what these might have looked like, I dug up some old code, which you can see in Example 10-2.

Example 10-2. An inlined SQL query to teams_channels_members
function chat_channels_members_get_display_counts(
    $team,
    $user,
    $channel
){
    // Some business logic

    $sql = "SELECT
        COUNT(*) as display_counts,
        SUM(CASE
                WHEN (is_restricted != 0 OR is_ultra_restricted != 0)
                    THEN 1
                ELSE 0
            END) as guest_counts
    FROM
        teams_channels_members AS tcm
        INNER JOIN users AS u ON u.id = tcm.user_id
    WHERE
        tcm.team_id = %team_id
        AND tcm.channel_id = %channel_id
        AND u.deleted = 0";

    $ret = db_fetch_team($team, $sql, array(
            'team_id' => $team['id'],
            'channel_id' => $channel['id']));

    // A bit more business logic

    return $counts;
}

Business logic code surrounding these queries would index directly into the resulting columns, cementing a tight coupling between our database schemas and the code. Whenever we introduced new columns, we had to update the corresponding code to take them into consideration. Say we had a column on the files table called is_public to denote whether the file was public. If we later introduced additional logic that required us to check an additional property to determine whether the file was public, any code that relied on a simple check of if ($file['is_public']) would need to be updated to accommodate that change properly.
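A small helper would have localized that kind of change. The sketch below is hypothetical; a file_is_public helper did not exist in the codebase and is shown only to illustrate the coupling.

// Call sites scattered across the codebase indexed directly into the row:
if ($file['is_public']) {
    // ...show the file
}

// A hypothetical wrapper would have let new visibility rules live in one place:
function file_is_public($file) {
    // Additional checks could be added here without touching every caller.
    return (bool) $file['is_public'];
}

if (file_is_public($file)) {
    // ...show the file
}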

To consolidate teams_channels_members and groups_members into channels_members, we needed to identify all the queries to either table scattered across the codebase. A quick grep of the codebase gave us a list of all the locations where we queried groups_members or teams_channels_members. We plugged the list of files and line numbers directly into a shared Google Sheets file, shown in Figure 10-9.

Figure 10-9. Google Sheets file to track queries to teams_channels_members and groups_members

We decided to create a single file where we could house all the queries related to channel membership. Our effort to revive our struggling membership queries conveniently arose around the same time engineers had begun having conversations about centralizing our queries. We were a growing team trying to execute quickly, and having to remember to update queries in haphazard corners of the codebase every time we altered a table was getting tedious. A few proposals had been shopped around, with engineers in favor of storing all queries to a given table in a single file. Some wanted an approach that would generate queries from a given set of parameters, which would have led us to build a more complex data access layer; others wanted to continue to be able to read the queries inline. We decided that with this project, we'd prototype minimal query generation as a means of limiting the number of individual functions in our new file. We decided to call this new pattern unidata, or ud for short, thus naming our target file ud_channel_membership.php.
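The following sketch shows one way that minimal query generation could look. It is an assumption for illustration, not the actual contents of ud_channel_membership.php (Example 10-3 shows the real shape of a unidata function).

// Hypothetical sketch: one function covers simple membership fetches so we
// don't need a separate function per combination of WHERE-clause columns.
function sketch_ud_channel_membership_get($team, $where) {
    $allowed = array('team_id', 'channel_id', 'user_id');

    $clauses = array();
    $bind = array();
    foreach ($allowed as $column) {
        if (isset($where[$column])) {
            $clauses[] = "$column=%$column";
            $bind[$column] = $where[$column];
        }
    }

    $sql = 'SELECT * FROM channels_members WHERE '.implode(' AND ', $clauses);
    return db_fetch_team($team, $sql, $bind);
}

// Example usage: fetch all of a user's memberships on a workspace.
$rows = sketch_ud_channel_membership_get($team, array(
    'team_id' => $team['id'],
    'user_id' => $user_id,
));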

Developing a Migration Strategy

Now that we had a table and a set of queries to migrate, we could get started. From our initial grep, we needed to identify each of the queries that inserted rows, updated values, or deleted rows. For each query, we created a corresponding function in our unidata library containing a copy of the query. Each function would take a parameter to indicate whether to execute the query on teams_channels_members or groups_members, alongside some logic to conditionally execute the same query against our new table, channels_members. The general idea is shown in Example 10-3.

Example 10-3. Unidata function that deletes a membership row and conditionally double-writes the delete to channels_members
function ud_channel_membership_delete(
    $team,
    $channel_id,
    $user_id,
    $channel_type
){

    if ($channel_type == 'groups'){
        $sql = 'DELETE FROM groups_members WHERE team_id=%team_id AND
                group_id=%channel_id AND user_id=%user_id';
    }else{
        $sql = 'DELETE FROM teams_channels_members WHERE team_id=%team_id AND
                channel_id=%channel_id AND user_id=%user_id';
    }

    $bind = array(
        'team_id'    => $team['id'],
        'channel_id' => $channel_id,
        'user_id'    => $user_id,
    );

    $ret = db_write_team($team, $sql, $bind);

    if (feature_enabled('channel_members_table')){
        $sql = 'DELETE FROM channels_members WHERE team_id=%team_id AND
                channel_id=%channel_id AND user_id=%user_id';
        $double_write_ret = db_write_team($team, $sql, $bind);

        if (not_ok($double_write_ret)){
            log_error("UD_DOUBLE_WRITE_ERR: Failed to delete row for
                channels_members for {$team['id']}-{$channel_id}-{$user_id}");
        }
    }

    return $ret;
}

Once we had successfully moved over all write operations, we wrote a backfill script to copy all existing data from both membership tables onto our new table. Note that we migrated write operations before starting a backfill to ensure that the data in the new table would be accurate. We then backfilled all membership data for our own workspace, followed promptly by VLB during off-hours to prevent any unnecessary load during their workday. We triple-checked that no errant writes to either table remained outside of our new library, but given that the engineering organization was moving quickly, there was a nonzero chance we had missed one or two queries. We had not yet put any mechanisms in place to prevent an engineer on a different team from adding a new query without alerting us. So, to ensure that the backfilled data remained consistent with the live data, we warned our engineering team about our process (see Figure 10-10) and wrote a script we could manually kick off to identify any inconsistencies and optionally patch them if desired.
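That consistency script might be sketched roughly as follows. The helper copy_row_to_channels_members and the exact comparison logic are assumptions, and the sketch again assumes db_fetch_team returns an array of rows.

// Illustrative sketch: compare rows in the legacy membership tables against
// channels_members and optionally patch any drift that is found.
function sketch_check_membership_consistency($team, $patch = false) {
    $legacy_sql = array(
        'SELECT team_id, channel_id, user_id, last_read FROM teams_channels_members
             WHERE team_id=%team_id',
        'SELECT team_id, group_id AS channel_id, user_id, last_read FROM groups_members
             WHERE team_id=%team_id',
    );

    $mismatches = 0;
    foreach ($legacy_sql as $sql) {
        $legacy_rows = db_fetch_team($team, $sql, array('team_id' => $team['id']));

        foreach ($legacy_rows as $row) {
            $new_rows = db_fetch_team($team,
                'SELECT last_read FROM channels_members WHERE team_id=%team_id
                     AND channel_id=%channel_id AND user_id=%user_id',
                array(
                    'team_id'    => $row['team_id'],
                    'channel_id' => $row['channel_id'],
                    'user_id'    => $row['user_id'],
                ));

            // Flag rows that are missing or stale in the new table.
            if (empty($new_rows) || $new_rows[0]['last_read'] != $row['last_read']) {
                $mismatches++;
                if ($patch) {
                    copy_row_to_channels_members($team, $row);
                }
            }
        }
    }

    return $mismatches;
}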

Note

In some of the screenshots included in this chapter, you might see some references to TS. TS is short for Tiny Speck, the previous name of the company before Slack, the product, was launched publicly in 2014. If you see a reference to something being “enabled to TS,” this just means that we’re enabling the change to our own workspace.

Figure 10-10. Announcement that we’ve started double-writing to the new table

After enabling double-writing for VLB, we watched its database health carefully; teams_channels_members and groups_members rows were updated very frequently. Whenever a user read a new message, the client issued a request to the servers to update the user’s last_read timestamp on their membership row. Now, with the addition of channels_members, we were issuing double the number of writes. We spent a day monitoring traffic to gain confidence that the workspace had enough bandwidth to handle the additional load.

Now that our tables were in sync and we were double-writing updates, we could execute on our most important milestone: migrating the mentions query. Whenever we were ready to give something a try in production, we first rolled it out to our own team. This was (and still is) the typical strategy for testing our work in production, whether it’s a new feature, a new piece of infrastructure, or, in our case, a performance enhancement. We typically would have rolled out to free workspaces next, slowly working our way up the payment tiers, leaving our largest, most performance-sensitive customers last; but with this particular endeavor, we wanted to ease the load on those top-tier customers first. So we flipped our strategy on its head.

We enabled optimized mentions to our team. Because we didn’t have much automated testing and our unit testing framework was unable to test the query properly, we relied on folks internally to spot any regressions before we enabled the query to any other customers. We carefully monitored channels where employees typically reported bugs. We later enabled this behavior for VLB.

Quantifying Our Progress

We knew that our databases were overloaded. We measured their health by looking at what percentage of their CPU was idle. Typically, this would hover at about 25 percent but would regularly dip to 10 percent and below. This was troubling because the more time it spent at less than 25 percent idle, the less likely it would be able to handle a sudden increase in load. VLB was putting our product through its paces, and we never knew which part of the product would lead to an unexpected uptick in database usage next.

When we began the consolidation effort, we already had multiple other projects running in parallel to help address the load. Between the range of ongoing workstreams, the added load from double-writing, recurring fluctuations in load, and product engineering continuing to build out new features, we couldn't rely on our database usage data to confirm that the refactor was effective. Besides, our monitoring data disappeared after about a week, so unless we had chosen a quiet day to capture some screenshots and record a series of data points, the data wouldn't have been available to us upon completion to serve as a good baseline.

Instead, we chose to rely primarily on query timings data. We instrumented each query with timing metrics, allowing us to confirm whether the new query was in fact more performant. EXPLAIN plans can be quite insightful, but nothing beats having actual metrics to track the time spent executing a query from the server's perspective. Out of an abundance of caution, instead of enabling the new treatment for all VLB users immediately, we randomly assigned incoming requests to either query: we first verified that the feature flag was enabled for the workspace and then randomly distributed the traffic 50-50. This enabled us to be a little bit more careful with our introduction of the change and to confirm that the new query was in fact more performant with a customer as large as VLB.
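In code, that split might have looked something like the following sketch. The query and timing helpers here are hypothetical; feature_enabled is the same flag check that appears in Example 10-3.

// Illustrative sketch: behind the feature flag, route half of the requests
// to the new consolidated-table query and record timings for both paths.
function sketch_fetch_mention_memberships($team, $user_id) {
    $use_new_query = feature_enabled('channel_members_table') && (mt_rand(0, 1) == 1);

    $start = microtime(true);
    if ($use_new_query) {
        $rows = query_mentions_from_channels_members($team, $user_id);
    } else {
        $rows = query_mentions_from_legacy_tables($team, $user_id);
    }
    $elapsed_ms = (microtime(true) - $start) * 1000;

    // Record which path served the request alongside how long it took.
    record_query_timing(
        $use_new_query ? 'mentions_channels_members' : 'mentions_legacy',
        $elapsed_ms
    );

    return $rows;
}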

We waited a few hours before taking a look at our data. We needed to make sure that the new query was consistently faster, meaning it needed to be faster both when the database was under average load and when it was at peak usage. Thankfully, the data looked promising across the board with a 20 percent speed-up! You can see the original data we pulled in Figure 10-11. The first query joined across both teams_channels_members and groups_members and on average completed in about 4.4 seconds. The second query read from channels_members alone and on average completed in about 3.5 seconds. We managed to shed nearly a second by using the consolidated membership table. (Both queries were too long to show in full, so only the first few lines are visible in the timings chart.)

Figure 10-11. Timings data on one of the mentions queries for VLB

With the confirmation that our refactor did the trick for our most important use case, we could justify moving forward with the remainder of the consolidation. We referred back to our Google Sheet tracker and began divvying up the remainder of the read queries to engineers on our team.

Attempting to Keep the Team Motivated

Unfortunately, it was difficult to get the help we needed to finish the migration. Given so many fires to put out, everyone on our team was parallelized across distinct remediation efforts. It was tough to get anyone else to take a few hours out of their day to carefully extract a handful of queries. To top it off, most of the code surrounding the remaining queries was untested, making what should have been a simple, straightforward change quite dangerous. Spending an afternoon migrating queries was simply not enticing.

I considered reaching out to other teams in the Enterprise engineering team for their help and tapping a handful of other performance-minded developers across the company but, ultimately, decided to keep trudging through on my own, with the occasional help from my immediate teammates. Because the work was risky and not particularly intellectually stimulating, I thought it might be too much of an uphill battle to convince a wider circle of engineers to contribute. In hindsight, I think I could have found a way to make the effort more compelling, distributed the work more evenly, and likely shaved off a few weeks.

When progress slowed to a crawl just a few weeks later, I attempted to bribe the team with cookies, which you can see in Figure 10-12. While there are a number of more traditional options for getting engineers motivated to help out (see Chapter 8), sometimes food is the best incentive of all.

Figure 10-12. An attempt at bribery

Communicating Our Progress

Although our team was widely distributed across a number of projects, we still needed each other’s support. We relied on one another for code reviews, talking through tough bugs, and the occasional gut check. To make sure we could be effective in those roles while remaining highly focused on our own endeavor, we would regularly debug performance problems in public channels (oftentimes our own team channel) and hold in-person weekly meetings to discuss progress and blockers. For me, that meant a regular avenue to call out what percentage of queries were still littered across the codebase and talk through any bugs or inconsistencies I’d spotted in the data.

Whenever we reached a meaningful milestone, like enabling double-writes to our own workspace, or enabling the new mentions query to VLB, we’d announce the change in both our team channel and in a few engineering-wide channels for added visibility. The more engineers that were aware of the changes we were making, the better! It meant that an engineer on another team was less likely to introduce a new query against either table we were actively deprecating without referring to our new library. It also meant that as we triaged incoming customer bugs, any engineer could isolate and solve a related problem much more effectively.

Tidying Up

Once no more entries were left in our tracker, we slowly began enabling all other teams beyond our own (and VLB) to read from the new table. We let the changes sit for two weeks before deciding it was safe to stop double-writing data to the old tables. We wanted to be certain that our database tier responded well to the new table, that its data was consistently correct, and that no new bugs related to the refactor were logged. Had double-writing not been expensive from both a load and monetary perspective, we might have allowed the changes to bake a bit longer, but we were eager to remove the overhead.

Finally, we stopped double-writing, first for our own team, then for VLB, and finally for the remainder of our customers. As with every important step of our refactor, we communicated it broadly, as shown in Figure 10-13. We then quickly tidied up our new library by removing all references to teams_channels_members and groups_members. We wrote some new linter rules preventing engineers from writing new queries against either deprecated table and requiring that all new queries against the channels_members table live in our new centralized library. We wanted to prevent confusion among engineers about how far along we were with the refactor. Not everyone reads every announcement in cross-functional channels, especially if they are out on vacation or leave, so don't rely on those announcements alone to tell engineers across your organization what to do when they come across code that has been changed as part of your refactor.
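Lint rules of this sort can be sketched as a simple scan over a file's contents. The sketch below is illustrative only and is not Slack's actual lint tooling; the text does not describe how the real rules hooked into the build.

// Hypothetical sketch of the two lint rules: no queries against the deprecated
// tables anywhere, and channels_members queries only inside the unidata library.
function sketch_lint_membership_queries($path, $contents) {
    $errors = array();

    foreach (array('teams_channels_members', 'groups_members') as $table) {
        if (preg_match('/\b'.$table.'\b/i', $contents)) {
            $errors[] = "$path: queries against $table are deprecated; "
                      . "use ud_channel_membership.php instead.";
        }
    }

    if (basename($path) != 'ud_channel_membership.php'
            && preg_match('/\bchannels_members\b/i', $contents)) {
        $errors[] = "$path: queries against channels_members belong in "
                  . "ud_channel_membership.php.";
    }

    return $errors;
}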

Figure 10-13. Announcing we were no longer double-writing for our own workspace


Of course, we didn’t forget the most important final step: celebrating! As was tradition for much of the engineering team in San Francisco, we ordered a cake (Figure 10-14) adorned with the name of our new table to commemorate the completion of the project.

Figure 10-14. Funfetti cake to celebrate our refactor!

The project’s complete trajectory is shown in Figure 10-15, highlighting the number of queries executed against each table on a daily basis from May to September 2017.

Figure 10-15. Query volume for teams_channels_members, groups_members, and channels_members throughout the refactor

Lessons Learned

There are a number of lessons to be learned from this case study, both from what went well and what could have gone better. We’ll start with where the project struggled, describing the pitfalls of not having a written execution plan, forgoing understanding of how the code had degraded, skimping on the number of tests we wrote, and failing to motivate teammates. Then we’ll discuss what went well, highlighting our sharp focus on dynamic milestones and a well-defined set of metrics.

Develop a Well-Defined, Well-Communicated Plan

Because the whole project began so fast, we didn't have much of a written plan. Our team was familiar with the process involved in migrating data from one table to another. We knew the mentions query was our top priority and that we would complete only as much of the migration as was necessary to migrate it; we would reevaluate later. The only time the process appeared in written form was when we posted updates in our team channel (rather than in a channel dedicated to the project); even then, these were only pertinent subsets of the overall plan.

The fact that we never deliberately wrote down each of the steps involved from start to finish meant that we were more likely to forget something critical along the way. Perhaps most worrisome of all was the fact that we never shopped our plan around to other teams across the company to ensure that everyone had a chance to verify whether they might be affected by the change and voice their concerns if that was the case. We simply plowed through, on the assumption that performance was the most important thing we could be doing to improve our relationship with our largest customer (and, by extension, the most important thing we could be doing for the company). We also believed that we could implement the change in a way that would disrupt as few other engineering teams as possible.

This assumption proved to be wrong on multiple fronts. First, when a handful of inevitable bugs cropped up and we hadn't adequately socialized the change, the engineers responding to those bugs were unpleasantly surprised. Second, we altogether overlooked a team that would be acutely affected by the change. About a month before we finished migrating the final few membership queries, a teammate reminded me that we should probably warn the data engineering team about the changes we were making. By moving membership onto a new table, and nearing the stage at which we would disable writes to the old tables, we risked disrupting most of their pipelines, including those responsible for calculating important usage metrics. We were fortunate that the data engineering team was quick to respond and update the necessary pipelines, and a serious crisis was averted.

These mishaps show just how important it is to develop and vet a thorough execution plan. We were lucky that we recovered from these oversights quickly, but why leave to chance what could have been addressed more deliberately during the early planning stages? As was highlighted in Chapters 4 and 7, having a concrete plan is crucial to uncovering gaps early and minimizing cross-functional communication gaps.

Understand the Code’s History

I highly recommend that developers begin their code archeology expedition before they start executing their refactoring effort, because the added context can give a different shape and direction to the project. Unfortunately, due to the urgency of our work, we skipped the deliberate process of understanding and empathizing with the existing code and went straight to execution. It was only well after we had begun migrating queries that I started to wonder why we'd made a distinction between teams_channels_members and groups_members in the first place.

As the weeks passed and there were still dozens of queries to migrate, I grew frustrated with the redundant tables and the way our SQL queries were strewn about. The more frustrated I became, the longer the project seemed to take (and the more tempting it became to cut corners in an attempt to reach the finish line faster).

After we had completed the refactor, I contacted a few of our early engineers to get some insight into why these tables had been distinct. I learned that keeping private and public channel information on separate tables isolated them from one another and served as a security precaution. Product history played a role as well; public channels and private channels felt like vastly different concepts in the early days of Slack. As the two concepts gradually converged, so did the table schemas.

Gaining this perspective proved helpful for subsequent refactors, informing how we went about consolidating teams_channels and groups into their own unified table. It gave me a newfound appreciation for decisions made early in Slack’s history, and a more positive attitude toward refactoring as an opportunity to improve something that had probably served us well for some time but no longer could, rather than as an opportunity to improve “bad” code. This experience is precisely why in Chapter 2 I recommend that engineers take the time to understand where the code they seek to improve came from, and how circumstances may have led it to degrade over time. If we have more empathy for the code, we stand to keep a more open mind and be more patient throughout the refactor.

Ensure Adequate Test Coverage

In Chapter 1, I asserted that it’s important to have adequate test coverage before refactoring, to ensure that the application’s behavior is properly maintained at every step. In this project, the vast majority of the code we were modifying had been written early in Slack’s development and, due to the push to get the product to market quickly, much of it lacked adequate tests. The refactor to consolidate the channel membership tables was under significant time pressure as well; performance for our largest customer was a growing concern, so we did our best to make the necessary changes carefully, opting to write tests for only the most critical untested codepaths.

This decision led us to ship a handful of bugs throughout the refactor, each of which could have been prevented had we taken the time to write the requisite tests. We arguably spent more time recovering from the regressions we introduced than we would have writing the tests in the first place. Having adequate test coverage is essential for a smooth refactor, preventing your customers from experiencing bugs and your team from spending time solving them.

Keep Your Team Motivated

Rather than continuing to plow through alone, I should have found a better way to get other engineers involved more seriously at the outset and again when progress slowed a few weeks later. The last 10 percent of queries took about the same amount of time to migrate as the first 50 percent. Once we had successfully improved the mentions query for VLB, we began to lose the sense of urgency we had experienced at the start of the project. With every new bug or inconsistency in our data, we lost a little steam. By the time the project was nearly complete, everything about it felt like pushing a boulder up a mountain.

What we had not considered was soliciting help from engineers outside our own team. We could have been more strategic about asking for help from those on other product engineering teams, asking them to migrate the queries within their own features. We could have sold them on the effort by demonstrating the performance boost they stood to gain. Distributing the work could have allowed us to halve the amount of time it took to complete.

If momentum on your refactor starts to slow, seek ways to give it a boost early, before progress slows further. Slow refactors are more likely to lose priority, leaving behind a significant amount of code stuck between two states, which, as was pointed out in Chapter 1, poses its own set of problems. Chapter 8 covered a number of ways to keep your team motivated; do not hesitate to ask for more support if you need it!

Focus on Strategic Milestones

We had preliminary data in the form of query EXPLAIN plans to support our hypothesis that combining the two membership tables would improve query performance. We needed further confirmation of that hypothesis during the early stages of the refactor so that we could pivot if the consolidation proved insufficient. By focusing on making only the changes necessary to enable the migration of the mentions query for VLB, we secured the confirmation we needed within just a few weeks and successfully alleviated load from the VLB database shard, buying us more time to see the remainder of the refactor through.

Proving your refactor’s effectiveness early ensures that your team does not waste any time continuing to execute a lengthy project that may not yield the desired results. By focusing on strategic milestones, those meant to benefit from the refactor can reap those benefits sooner; this can help your team, further bolstering support for the effort while it is still underway. For more details on how to identify strategic milestones, refer to Chapter 4.

Identify and Rely on Meaningful Metrics

We had a specific set of metrics that enabled us to show conclusively that our project was successful for both our intermediate milestones and, once we’d completed the rollout, to all customers. By collecting EXPLAIN plans for queries before and after the consolidation, we were able to document progress as we migrated each of the more complex membership queries. By instrumenting the mentions query with timings metrics, we could monitor its performance in real time and immediately see the positive impact.

Keeping a close eye on your metrics helps you prove that your refactor is tilting the needle in the right direction throughout its development. If at any point the metrics stop improving (or worse, start regressing), you can dig in immediately, addressing problems as soon as they arise, rather than at the project’s conclusion. Refer to Chapter 3 for suggestions on how to measure your refactor.

Takeaways

Here are the most important takeaways from our refactor to consolidate Slack’s channel membership tables.

  • Develop a thorough written plan and share it broadly.

  • Take the time to understand the code’s history; it might help you see it in a new, more positive light.

  • Ensure that there is adequate test coverage for the code you’re seeking to improve. If there isn’t, commit to writing the missing test cases.

  • Keep your team motivated. If you’re losing momentum, find creative ways to boost it back up.

  • Focus on strategic milestones to prove the impact of your refactor early and often.

  • Identify and rely on meaningful metrics to guide your efforts.