Chapter 11. Case Study: Migrating to a New Database

For the second of our two case study chapters, we’ll explore a refactor carried out by a group of engineers from the product and infrastructure engineering teams at Slack. The project built on the consolidation of our channel membership tables discussed in the previous chapter. If you haven’t read through the first case study yet, I recommend you do so; there’s important context you’ll want to understand to get the most out of this chapter.

Unlike the previous case study, which was primarily motivated by performance, this one was chiefly driven by Slack’s need to enable greater flexibility in the product. Having channel memberships tied to distinct workspace shards made it difficult for us to build more complex features stretching beyond single workspaces. We wanted to enable complex organizations with multiple workspaces to collaborate seamlessly within the same set of channels and facilitate communication between distinct Slack customers, allowing companies to coordinate with their vendors directly within the application. To unlock this ability, we needed to reshard channel membership data by user and channel rather than by workspace. This refactor illustrates the many challenges that come with large-scale database migrations, multi-quarter projects, and heavily cross-functional engineering efforts.

The refactor was successful because we had a strong understanding of the problem we needed to solve and how our evolving product strategy had led us to outgrow past architectural decisions (Chapter 2). We planned the project thoughtfully, choosing to juggle a few more variables than were strictly necessary, knowing it would render the refactor even more worthwhile (Chapter 4). We devised a careful rollout strategy, developing tooling that enabled us to carry it out as reliably as possible (Chapter 8). Finally, throughout the entire effort, we maintained a simple communication strategy.

Although the refactor ultimately gave us the ability to stretch our product in new and interesting ways, it took nearly double the time we had initially estimated to complete. We were too optimistic in our estimates (Chapter 4); it took over a year to finish what we had originally anticipated would take only six months. We underestimated the product implications of the refactor and only learned to leverage the expertise of product engineers after spending several months making little progress (Chapter 6).

As with the previous case study, we’ll start off with some important context, including a brief overview of why the way we distributed our data was becoming a bottleneck, and the motivations behind our adoption of a new database technology, Vitess. Once we’ve established a solid foundation and the motivations for our refactor, we’ll describe our solution and walk through each phase of the project.

Workspace-Sharded Data

To appreciate the problems we sought to solve with this refactor, we need to describe how our data was distributed across our databases in MySQL. Before we kicked off our refactor, the vast majority of our data was sharded by workspace, where a workspace is a single Slack customer. We touched on this in the previous case study under “Slack Architecture 101”; you can see an illustration of how different customers’ data was distributed across different shards in Figure 10-3.

While this worked just fine for a number of years, this sharding scheme grew increasingly inconvenient for two reasons.

First, we struggled to support our biggest workspace shards from an operational perspective. The shards housing our largest, fastest-growing customers suffered from frequent, problematic hotspots. These customers, already occupying isolated shards, were quickly approaching the data size beyond which we would no longer be able to upgrade their hardware. With no simple mechanisms by which we could horizontally split their data, we were stuck.

Second, we were making important changes in our product that were actively leading us to break down the barriers between workspaces we had long upheld, both in the way our code was written and in how our data was structured. We had built features enabling our biggest customers to bridge together multiple workspaces and launched the ability for two distinct Slack customers to communicate directly within a channel they shared.

The mismatch between our product vision and the way our systems were architected meant that our application grew ever more complex. This was a perfect example of code degradation due to a shift in product requirements (as you might recall from Chapter 2!). To illustrate this problem more concretely, in the year leading up to this case study, we sometimes needed to query three distinct database shards to locate a channel and its memberships successfully. This was confusing for our developers, who needed to remember the correct set of steps to fetch and manipulate channel-related data.

To address our operational concerns with MySQL and our difficulty scaling, we started evaluating other storage options. After weighing multiple solutions, the team decided to adopt Vitess, a database clustering system built at YouTube that enables horizontal scaling of MySQL. With the migration to Vitess, we would finally be able to shard our data by something other than workspace, giving us the opportunity to free up space on our busiest shards and distribute our data in a way that made it easier for our engineers to reason about!

Migrating channels_members to Vitess

Given these circumstances, we decided to migrate the channel membership table, channels_members, to Vitess. Because this was one of our highest-traffic tables, resharding it would free up considerable space on our busiest workspace shards and relieve a significant amount of load. The migration would also substantially simplify business logic around fetching memberships for channels that existed across workspace boundaries.

The project was spearheaded by the Vitess infrastructure team, with help from a handful of product engineers who had intimate knowledge of our application query patterns against the channels_members table. We knew it would be a winning combination. The infrastructure engineers would contribute deep knowledge of the database system so that we could avoid any pitfalls during the migration and efficiently debug database-related issues as they arose; because they had the most expertise with table migrations to date, they’d be best suited to lead the project, with Maggie at the helm. The product engineers, including me, would provide crucial insight into the new schema and sharding scheme and help with rewriting application logic to query the migrated data correctly.

We kicked things off in earnest by creating a new channel, #feat-vitess-channels, where we could easily bounce ideas off one another and coordinate workstreams. We invited everyone to join and jumped right into our first task.

Sharding Scheme

Before we could begin migrating channel membership data to Vitess, we needed to decide how it would be distributed (i.e., which keys to use to reshard the table). Here, we had two options:

  • by channel (channel_id), to locate all memberships associated with a channel easily by querying a single shard

  • by user (user_id), to find all of a user’s memberships by querying a single shard
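
To make the trade-off concrete, here is a rough sketch of the single-shard point query each scheme would make cheap (the column lists are illustrative rather than the exact queries we ran, and % marks a substituted parameter):

-- Sharded by channel: all memberships for one channel live on a single shard.
SELECT user_id, user_team_id
FROM channels_members
WHERE channel_id = %CHANNEL_ID

-- Sharded by user: all of one user's memberships live on a single shard.
SELECT channel_id, last_read
FROM channels_members
WHERE user_id = %USER_ID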

Having recently completed the consolidation of our membership tables per our first case study, my impression was that the majority of queries dealt with fetching membership for a given channel rather than for a given user. Many of these queries were crucial to the application, powering important features like Search and the ability to mention everyone in a channel (via @channel or @here).

At the time (and still today), we logged a sample of all database queries to our data warehouse to keep tabs on our MySQL usage across requests to our production systems. To confirm my intuition that most of the traffic to channels_members relied on channel_id, I ran a few queries against this data, looking at sampled membership queries executed over a month-long period, and brought the results to the team. They are shown in Figure 11-1.

Figure 11-1. Number of queries run against channels_members filtered by channel_id
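
The chapter doesn’t show the warehouse schema, but the analysis boiled down to counting how many sampled channels_members queries filtered on channel_id. A rough sketch, using entirely hypothetical table and column names, might look something like this:

-- Hypothetical sketch only: query_logs, query_text, table_name, and event_date
-- are assumed names, not Slack's actual warehouse schema.
SELECT
  COUNT(*) AS total_queries,
  SUM(INSTR(query_text, 'channel_id') > 0) AS filtered_by_channel_id
FROM query_logs
WHERE table_name = 'channels_members'
  AND event_date >= DATE_SUB(CURRENT_DATE, INTERVAL 30 DAY)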

One of the product engineers working with us, who had more experience with Vitess, pointed out that sharding by user might be a better bet. Pulling from the same set of query logs, he showed us the top 10 most frequent queries hitting the table filtered by user_id. The results are shown in Figure 11-2. If we wanted our application to perform well, we would need to account for this behavior.

Figure 11-2. Top 10 most frequent queries against channels_members and whether they filtered the data by user_id

We weighed both options, doing some back-of-the-napkin math to determine the database querying capacity required to support each one. We ultimately decided to compromise, denormalizing the membership data into two tables, one sharded by user and the other sharded by channel, and double-writing to cover both use cases. This way, point queries would be cheap for both.
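
As a minimal sketch of what double-writing implies (the channel-sharded table name matches Example 11-2; the user-sharded table name and the column lists are abbreviated assumptions), every membership write produces one row in each table, each routed by its own sharding column:

-- Write to the user-sharded copy, routed by user_id.
INSERT INTO channels_members (user_id, channel_id, user_team_id, date_joined)
VALUES (%USER_ID, %CHANNEL_ID, %USER_TEAM_ID, %DATE_JOINED)

-- Write to the channel-sharded copy, routed by channel_id.
INSERT INTO channels_members_bychan (user_id, channel_id, user_team_id, date_joined)
VALUES (%USER_ID, %CHANNEL_ID, %USER_TEAM_ID, %DATE_JOINED)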

Developing a New Schema

Next, we needed to take a hard look at our existing workspace-sharded table schema and determine whether we wanted to modify it for both our user- and channel-sharded use cases. Although we could have migrated our existing schema to both sharding schemes, this refactor gave us a unique opportunity to rethink some of the decisions we’d made with the original table design. We’ll take a closer look at the schema we derived for each, starting with the user shard. Example 11-1 shows the schema on the workspace shards, before the migration.

Example 11-1. CREATE TABLE statement showing the existing channels_members table, sharded by workspace
CREATE TABLE `channels_members` (
  `user_id` bigint(20) unsigned NOT NULL,
  `channel_id` bigint(20) unsigned NOT NULL,
  `team_id` bigint(20) unsigned NOT NULL,
  `date_joined` int(10) unsigned NOT NULL,
  `date_deleted` int(10) unsigned NOT NULL,
  `last_read` bigint(20) unsigned NOT NULL,
  ...
  `channel_type` tinyint(3) unsigned NOT NULL,
  `channel_privacy_type` tinyint(4) unsigned NOT NULL,
  ...
  `user_team_id` bigint(20) unsigned NOT NULL,
  PRIMARY KEY (`user_id`,`channel_id`)
)

User-sharded membership table

For the user-sharded case, we decided to maintain the majority of the original schema, with one exception: we made a significant change to how we stored user IDs. To understand the motivations behind this decision, we’ll give a brief overview of the two kinds of user IDs we stored and how they came about.

At the start of the chapter, we briefly mentioned that Slack sought to enable complex businesses, split into multiple workspaces according to department or business unit, to collaborate more easily. Without any centralization, not only did employees have difficulty communicating across departments, it was also difficult for the company to manage each individual workspace properly. To this end, we enabled our biggest customers to bring together their many workspaces under a single umbrella.

Unfortunately, in grouping workspaces, we needed a way to keep users in sync. Let’s illustrate how this works with a simple example.

Acme Corp. is a large corporation. It has a number of departments, each with its own workspace, including its engineering team and its customer experience department. As an employee of Acme Corp., you have a single, organization-level user account. If you happen to be an engineer, you are a member of the Engineering workspace, where you collaborate with your teammates, and of the Customer Experience workspace, where you help the support team troubleshoot customer issues.

What appeared to be a single account at Acme Corp., however, was actually multiple accounts under the hood. At the organization level, a user had a canonical user ID. The same user had distinct local user IDs for each workspace they were a member of. This means that if you were a member of the Engineering and Customer Experience workspaces, you had three unique user IDs, or, to generalize, n + 1 IDs, where n was the number of workspaces of which you were a member.

As you might imagine, translating between these IDs quickly became exceedingly complicated and bug-prone. Within a year of launching this feature, a number of product engineers hatched a plan for replacing all local user IDs with canonical user IDs. Because most of the data stored in Slack’s systems refers to a user ID of some kind (authoring a message, uploading a file, etc.), a high degree of complexity was involved in correctly (and invisibly) rewriting these IDs.

The workspace-sharded channels_members table stored local user IDs in the user_id column. Because a project was already underway to replace all local user IDs with canonical user IDs, we decided to collaborate with that team and ensure that we stored canonical user IDs across all user ID columns.

Channel-sharded table schema

Beyond our concerns with user IDs, we had some unease about the write bandwidth to the secondary, channel-sharded membership table. We examined the queries we planned to send to these shards to try to identify ways we could decrease write traffic. During that process, we noticed that most of the columns on the original table were entirely unused by their consumers, including the ones that were updated most often, like a user’s last read position in the channel. For example, if we queried for all the memberships associated with a given channel, the application logic would usually only use the user_id and user_team_id columns. By omitting these unnecessary columns in our new schema, we could dramatically decrease the write frequency, giving our channel shards a bit more breathing room. Example 11-2 shows the table schema for the channel-sharded membership table.

Example 11-2. CREATE TABLE statement for the second of our new channels_members tables, sharded by channel
CREATE TABLE `channels_members_bychan` (
  `user_id` bigint(20) unsigned NOT NULL,
  `channel_id` bigint(20) unsigned NOT NULL,
  `user_team_id` bigint(20) unsigned NOT NULL,
  `channel_team_id` bigint(20) unsigned NOT NULL, 
  `date_joined` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`channel_id`,`user_id`)
)

Note that the original team_id column was renamed to channel_team_id.

Detangling JOINs

We next needed to update our application logic to accommodate the changes to our schemas and point to the Vitess cluster. Thankfully, most of these changes were straightforward, and before we knew it, we’d updated the majority of our application logic accordingly.

Where the migration became more difficult was with complex queries involving JOINs with other tables in our MySQL cluster. Because we were moving the table to an entirely new cluster, we could no longer support these queries and had to split them up into smaller point queries, performing the JOIN directly in the application code.

We knew at the project’s outset that we would likely need to split up a handful of JOIN queries. What we did not anticipate was that most of them powered core Slack features and had been carefully hand-tuned for performance over a number of years. By splitting up these queries, we risked anything from slowing down notifications, to introducing data leaks, to bringing down Slack entirely. We were pretty nervous, but we needed to push on.

We put the day-to-day migration work on pause and compiled a list of the queries we were most concerned about, of which there were 20. Poring over the set, we worried that we didn’t have the product expertise required to adequately detangle each and every one. We estimated that without any additional help from product engineering, we’d need months to detangle each of the JOINs successfully. Fortunately, a number of product engineers responded to our call for help and together we developed a simple process that we could apply to split up each query safely.

To illustrate each step, we’ll walk through how we split up the query shown in Example 11-3, which was responsible for deciding whether a user had permission to see a specific file.

Example 11-3. A sample JOIN we needed to detangle; % symbolizes substitution syntax
SELECT COUNT(*)
FROM files_shares s
LEFT JOIN channels_members g
  ON g.team_id = s.team_id
  AND g.channel_id = s.channel_id
  AND g.user_id = %USER_ID
  AND g.date_deleted = 0
WHERE
  s.team_id = %TEAM_ID
AND s.file_id = %FILE_ID
LIMIT 1

We first needed to identify the smallest subset of data we could fetch up front; fetching the smaller side of the JOIN first minimized the amount of data we needed to cross-reference in the application.

With the file visibility query, we knew from typical usage patterns that the number of places where a file was shared was usually much smaller than the number of channels that a user was in. (We could also verify this assumption by looking at a query’s cardinality.) So, instead of first querying for a user’s channel memberships and cross-referencing those with the channels where the file was shared, we fetched the locations where the file was shared first and then determined whether the user was in any of these channels. You can see an example of the query split up into its two components in Example 11-4.

Example 11-4. The JOIN with files_shares split into two queries
SELECT DISTINCT channel_id
FROM files_shares
WHERE team_id=%TEAM_ID AND file_id=%FILE_ID

...

SELECT COUNT(*)
FROM channels_members
WHERE
  team_id=%TEAM_ID
  AND user_id=%USER_ID
  AND channel_id IN (%list:CHANNEL_IDS)
LIMIT 1

We then verified that the test coverage was sufficient. If it wasn’t, we would write a few additional test cases to verify the results of the original query. Once we were satisfied, we wrapped the new logic in an experiment to enable a gradual rollout and give us the ability to roll back quickly in an emergency. We ran our tests against both implementations, fixed any bugs that cropped up, and repeated the process until we felt confident with our new logic. Finally, we instrumented both calls with timing metrics to track the execution time of both the JOIN and its detangled version. Example 11-5 provides a rough outline for what the file visibility check looked like with both query implementations and corresponding instrumentation.

Tip

For the riskier query splits (including file visibility), we worked with the quality assurance team to manually verify the change in both our development environments and production before rolling it out to more users. The majority of the JOINs we sought to detangle dealt with critical Slack functionality, so we wanted to be particularly careful that our changes perfectly replicated intended behavior.

Example 11-5. Function for determining whether a user can see a specific file
function file_can_see($team, $user, $file): bool {

  if (experiment_get_user('detangle_files_shares_query')) {
    $start = microtime_float();

    # First, we want to find all of the channels where
    # the file was shared. Because we can share a file to the
    # same channel multiple times, we may find multiple files_shares
    # rows with the same channel ID but different timestamps
    # at which it was shared.
    $channel_ids =
      ud_files_shares_get_distinct_channel_ids(
        $team,
        $file['id']
      );

    # Next, we want to find the intersection of the channels
    # the file was shared in ($channel_ids) and the channels the
    # user is in.
    $membership_counts =
      ud_channels_members_get_counts(
        $team,
        $user['id'],
        $channel_ids
      );

    $end = microtime_float() - $start;

    # If there is at least one membership row, then the user
    # can see the file. If not, the user cannot see the file.
    _files_can_see_unjoined_histogram()->observe($end);
    return ($membership_counts['count'] > 0);
  }

  $start = microtime_float();
  $sql = "SELECT 1 FROM files_shares s
        LEFT JOIN channels_members g
        ON g.team_id = s.team_id
          AND g.channel_id = s.channel_id
          AND g.user_id = %USER_ID
          AND g.date_deleted=0
        WHERE s.team_id = %TEAM_ID
          AND s.file_id = %FILE_ID
          AND (g.user_id > 0) LIMIT 1";

  $bind = [
    'file_id' => $file['id'],
    'user_id' => $user['id'],
    'team_id' => $team['id']
  ];

  $ret = db_fetch_team($team, $sql, $bind);
  $end = microtime_float() - $start;
  _files_can_see_join_histogram()->observe($end);

  return (bool)db_single($ret);
}

We enabled the new implementation on our own internal Slack instance before rolling it out to real customers. This was an important step to confirm that we were properly ingesting timing metrics and further ensure that we had not unintentionally introduced a bug.

Slack’s own workspace has all sorts of quirks, and our usage patterns don’t always match those of our customers. While it often makes for a decent litmus test for catching bugs early, the workspace was not a suitable candidate to help us determine whether the added latencies of the detangled queries were acceptable. For a subset of the JOINs, performance of the detangled queries was notably worse on our own workspace, but as we continued the rollout to free teams, followed by larger paying customers, the metrics stabilized.

We repeated the process for nearly every JOIN, gingerly slicing queries apart, instrumenting them, and gradually rolling them out to customers. The only exceptions were two pesky mentions queries, which we left untouched for several months. Unfortunately, these queries posed a number of unique challenges, including JOINs against tables that were undergoing their own Vitess migration. We decided to defer their migration until all their subcomponents had properly fallen into place. Overall, five of us took about six weeks on and off, with our time split between the refactor and other commitments, to finish migrating the majority of the JOINs.

Tip

It’s often the case that refactors don’t go entirely according to plan; we encounter hurdles that require us to reshuffle priorities or, in some cases, stop partway through a given step in favor of coming back to it later. Although it feels deeply unsatisfying to hit pause and shift gears, it can sometimes make a huge difference in our ability to deliver the overall project in a timely manner.

For this effort, had we waited for the remainder of the migrations we were depending on to land, it would have set the refactor back by several months. By instead choosing to move forward with the vast majority of the channels_members queries we had successfully rewritten, we were able to continue making headway, uncovering issues as they cropped up; when the time finally came to revisit the mentions queries, we were in a much more stable place to do so.

A Difficult Rollout

When we began our migration of channels_members, approximately 15 percent of our total queries per second (QPS) was powered by Vitess. We’d already migrated and resharded critical workloads, such as notifications-related tables, and the teams table responsible for listing each Slack customer instance. We had built reliable techniques and tooling to facilitate nearly 20 migrations, complete with dashboards and a framework for efficiently comparing data sets across the old and new clusters.

The channels_members migration was unique, however, in that it alone accounted for nearly 20 percent of our total query load, nearly doubling the QPS we had learned to manage on Vitess to date. Because of the scale, we were nervous about running into unexpected issues during the migration. That said, we were highly motivated to move these more sizable workloads off of MySQL, because it was struggling under the load of our largest customers. We were stuck between a rock and a hard place.

Our best bet was to lean heavily on the migration tooling we’d built during previous Vitess migrations. We hoped it would be stable enough for this table as well.

The rollout process we had developed for enabling migrations consisted of four high-level modes:

1. Backfill

During this stage, we sent every write to both the new cluster (with the new sharding scheme) and the old cluster. This mode also allowed us to backfill our new cluster with existing data from the old cluster.

2. Dark

This mode sent read traffic to both clusters and compared the results, logging any discrepancies in the data retrieved from the new Vitess cluster. Consumers of the read traffic were provided with results retrieved from the old cluster.

3. Light

This mode sent read traffic to both clusters, again comparing results and logging any discrepancies as they arose. However, instead of returning results from the old cluster, Vitess results were returned to the application.

4. Sunset

During this stage, we continued to double-write to both clusters but sent read requests strictly to the Vitess cluster. This mode allowed us to discontinue the expensive process of reading from two distinct data sources, all the while enabling any downstream consumers to continue to rely on data stored in the old clusters until they were updated to read from Vitess. (This included systems such as our data warehouse.) At this stage, if any problems were uncovered, the only option was to fix forward; there was no easy or safe way to go back to consuming data from the legacy data source.

Fast, simple configuration deployments enabled us to swap easily between modes as well as ramp up and down within a single mode. The system also provided us with rather granular controls, whereby we could swiftly opt tiers of customers and users into distinct modes. We took advantage of being able to tweak these settings to ramp up and back down rapidly when we encountered any issues.

Backfill Mode

Every migration began with Backfill mode. In this mode, there were two primary goals. The first goal was to set the stage for running a complete backfill of the data from the old cluster in preparation for the migration of read queries. For the majority of our previous migrations, this phase was quite simple; the write queries for the new cluster would be identical (or nearly so) to the corresponding write queries to the old cluster. This time, however, because we were actively changing the data model, we ended up having to rewrite many of our application’s SQL queries to conform to the new schema (including propagating the share_type correctly and translating local user IDs to their canonical counterparts). Luckily, thanks to the prior consolidation discussed in Chapter 10, we were able to readily identify each query requiring a rewrite.

The second goal was to unveil any performance problems associated with write load to the new cluster. For most of these migrations, we considered the Backfill and Dark modes to have relatively little (if any) performance impact on the application in production. This was primarily because:

  • We used Hacklang’s async cooperative multitasking mode to send queries to both clusters concurrently. We set a short, one-second (1s) time-out on the query hitting the new cluster in Vitess, so that in the very worst case, the performance penalty for these queries would be 1s minus the time it took to execute the query from the old cluster.

  • We were not yet returning the results to the application from the Vitess cluster! This would occur in Light mode.

Again, our assumptions proved wrong with this migration. The user-sharded Vitess database cluster to which we were moving channels_members was already populated with highly used production data (including saved messages and notifications). As we ramped up Backfill mode, we began saturating database resources on Vitess, leading to time-outs and errors for queries to the critical tables already residing on the cluster. Digging in, we discovered that we had a number of update and delete queries lacking our sharding key (user_id), thus scattering them across every shard in the cluster. We made a configuration change so that these could run more efficiently, and then tentatively kicked off a second gradual ramp-up of Backfill mode. We quickly reached 100 percent and began the next stage, Dark mode!
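
To illustrate the scatter problem (a hedged sketch; the offending queries themselves aren’t reproduced in this chapter): on a keyspace sharded by user_id, a write that filters only on channel_id has to be fanned out to every shard, whereas including the sharding key routes it to exactly one.

-- Scatters across every shard of the user-sharded keyspace: no user_id in the WHERE clause.
UPDATE channels_members
SET date_deleted = %NOW
WHERE channel_id = %CHANNEL_ID

-- Routed to a single shard: the sharding key (user_id) is present.
UPDATE channels_members
SET date_deleted = %NOW
WHERE user_id = %USER_ID AND channel_id = %CHANNEL_ID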

Dark Mode

We entered the Dark mode portion of the refactor in earnest: we had carefully rewritten most of the channels_members queries (including many of the troublesome JOINs) to read from Vitess and had successfully completed the backfill process, all in just over three months. Because our migration system enabled us to opt subsets of queries into different phases (i.e., one query could be in Dark mode while another was in Light mode), in an effort to parallelize as much of the refactor as possible, we began to ramp up Dark mode before we’d rewritten all our queries to read from the Vitess cluster properly.

Dark mode, like Backfill mode, had two primary goals. Once again, one of our objectives was to reveal any potential performance problems associated with the read traffic being sent to the new cluster.

Performance

As we began ramping up traffic to read from Vitess concurrently with our legacy system, we noticed that a handful of queries with high QPS returned an alarming number of rows. The combination of high QPS with the large number of rows returned made the overall rows returned per second the largest in our cluster. Figure 11-3 shows that at peak, we were returning about 9,000 rows per second from a single shard’s channels_members table. In fact, these queries were so frequent and memory-intensive that they caused out-of-memory errors (OOMs) on the database hosts themselves! During the days following our ramp-up, we saw 1/256 of our hosts running out of memory every day.

At first, we believed that our cloud provider was at fault; perhaps something was wrong with the way we had provisioned our largest database cluster. Eventually, we realized that it wasn’t a configuration mishap or random bad luck, and we swiftly ramped down to start isolating the source of the OOMs.

Figure 11-3. Rows returned from channels_members on a single shard

Figure 11-4 shows our surprise at the OOMs in our weekly status update.

Figure 11-4. Weekly project status reporting on OOMs

The refactor was a high-priority project, both within the infrastructure organization and across engineering at large. From a database reliability perspective, moving channels_members to Vitess was an important step in continuing to develop our muscle memory around operating the new system, so when the OOMs proved particularly elusive, we began working with the entire database team at Slack, debugging from all angles directly in the channel we had set up to coordinate the effort, #feat-vitess-channels. We attempted to resize the memory allocation for our MySQL processes, digging into memory fragmentation and allocation at both the MySQL and operating system levels. During this process, we upgraded minor versions of MySQL to gain access to a new setting that allowed us to specify the nonuniform memory access (NUMA) interleave policy for the buffer pool! Meanwhile, we continued to split up more JOINs and began ramping up more Dark mode query load. Each time, we hoped the OOMs would stop, only to be disappointed as they resurfaced with the added load.

At this point, the project had just surpassed the six-month mark, obliterating our initial estimate; the whole team very much felt as though we were consistently taking two steps forward and one step back. After weeks of trial and error, we discovered that other storage systems at Slack (including our monitoring cluster and Search cluster) had hit problems with a restrictively small value for min_free_kbytes, a low-level kernel setting responsible for controlling how aggressively the kernel decides to free memory. The larger the value, the more breathing room the kernel gives itself by shedding more data held in RAM. With the substantial number of queries returning a large number of rows at high QPS, we would sporadically hit spikes of requests that required a sudden allocation of a large amount of RAM, leading to OOMs, because the kernel couldn’t free RAM quickly enough to return results. Bumping min_free_kbytes to a higher value enabled our hosts to better manage the memory pressure associated with these queries and finally resolved our OOMs.

We spent eight whole months in the Dark mode phase; not only did we spend more time in this phase alone than we had initially anticipated spending on the project as a whole, it accounted for nearly two-thirds of the entire endeavor once we’d completed it. What happened?

Data discrepancies

Given our configuration changes, we were comfortable ramping up 100 percent of the traffic to the Vitess cluster without the risk of affecting site-wide performance. At this point, nearly all JOINs were detangled, with all point queries updated to read from the Vitess cluster as well. During this second step, our primary goal was to reveal any discrepancies in the data sets returned from the new queries. We could easily compare the two sets side by side because we concurrently ran our queries against both the new and old clusters and logged diffs as we encountered them (using results from the existing query against our legacy data source as the source of truth). We aggregated discrepancies in a number of ways so that we could get a broad sense of the scope of the problems we needed to address, in addition to logging primary keys whenever a pair returned different results.

We spent a few weeks in this phase, meticulously combing through the diffs. Because our user-sharded schema incorporated more information than the original, workspace-sharded channels_members table, we were juggling many more variables during the rewrite process than we might have otherwise. We sought to improve the developer experience for engineers working with shared channels and Enterprise Grid, which required us to consider tricky product logic thoughtfully with each query we migrated. This meant that the potential for mistakes was much greater than if we had done a one-to-one migration (as was the case with every table we had moved to Vitess to date).

Large portions of the differences in the data sets could be traced back to a single root cause; fixing one problem would often lead to a large reduction in the volume of diffs logged. For example, if on the legacy system we were selecting a different set of columns than on Vitess, every query would return mismatched results, logging a diff. As we reported in Figure 11-5, finding these mismatched columns and fixing the comparison to ignore them decreased the number of diffs logged against the channel-sharded table from 10 percent of all queries to just 0.01 percent.

Figure 11-5. Reducing diffs on the channel-sharded channels_members table
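
As a hedged illustration of this class of diff (not the literal queries involved): if the legacy query selected a column, such as last_read, that the slimmed-down channel-sharded table no longer carried, every comparison would log a mismatch even though the membership rows themselves agreed.

-- Legacy query against the workspace-sharded table.
SELECT user_id, user_team_id, last_read
FROM channels_members
WHERE team_id = %TEAM_ID AND channel_id = %CHANNEL_ID

-- Rewritten query against the channel-sharded Vitess table, which dropped last_read.
SELECT user_id, user_team_id
FROM channels_members_bychan
WHERE channel_id = %CHANNEL_ID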

Alas, not all diffs were as easy to fix. Reading through the differences in data sets, we uncovered a few spots where our logic for shared channels was not quite right, and a few others where we had made mistakes in our backfill. It was tedious work and, due to the product implications, oftentimes required a profound understanding of the inner workings of our application. Although our manipulations were hidden behind feature flags and experiments, the changes we were making had real ramifications for our production systems, and we had to proceed with great caution. Given these factors and the fact that the project was progressing at a slow crawl, we asked for more resources from product engineering.

Bringing new folks onboard brought new life to the project. Those of us who had been involved for many months were eager to get new perspectives on the many problems we’d been facing. We used pairing to ramp up new engineers quickly, joining forces to debug a small set of data discrepancies. It was the perfect context from which to demonstrate the Vitess migration tooling and the phased rollout process, and to talk through the new schemas. The work was tedious, but with a bigger arsenal of engineers at our disposal, we managed to boost our momentum drastically and banish the final few discrepancies. We did not get down to zero diffs, but settled for feeling good at 99.999 percent correctness. Since we knew that each channels_members row could change quite rapidly as a user read messages in Slack, moving their last_read cursor state, we felt comfortable with some amount of discrepancy that could be attributed to rapid read-after-write situations. Digging into the remaining 0.001 percent of differences, when we examined rows directly in the database after a diff occurred, we noticed that the rows would converge to the same state.

Wrapping up the Dark phase was significant. Knowing that 100 percent of channels_members traffic could run in a performant way on Vitess and return correct results was absolutely crucial to the overall success of the refactor. Although we weren’t quite finished yet, being able to close the book on Dark mode was a relief to everyone. Finally, we were ready to ramp up to Light mode for a small subset of beta users within the company.

Light Mode

During Light mode, we wanted to test-drive the data retrieved from executing queries against the Vitess cluster, certifying that swapping over traffic to our new tables would not introduce any user-facing regressions. We were fairly confident that there would be relatively few bugs, in great part because of the work completed during the previous phases to address data discrepancies. However, because channel membership is at the core of Slack, any bugs at all risked being quite serious. So we began our Light mode ramp-up carefully, starting off with a small group of volunteers at Slack, with the eventual goal of enabling it for our entire customer base.

Most things worked fine, but we quickly ran into a problem: sometimes, after joining a channel, users would be unable to send messages. We immediately ramped down the experiment and dug into query logs, which we kept on all database hosts for up to two hours. These logs allowed us to debug easily, grepping for any modifications to the user’s membership row in the given channel and the callers responsible for them.

We quickly identified the culprit: a background process, triggered after any Grid user joined a workspace-level channel they’d previously been a member of, which would locate and replace membership rows that had a canonical user ID with the user’s local user ID. This was a problem because our new database schema in Vitess intentionally used canonical user IDs; after the process had rewritten the user ID, we could no longer locate the user’s membership row, thereby preventing them from sending messages.
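
In rough terms (the actual code isn’t reproduced here, so the query below is an assumption based on the description above), the background process issued an update along these lines, which orphaned the row as far as our canonical-ID-keyed Vitess data was concerned:

-- Sketch of the problematic rewrite: swapping the canonical user ID back to the
-- local, workspace-specific one. Afterward, lookups keyed on the canonical
-- user_id no longer find the membership row.
UPDATE channels_members
SET user_id = %LOCAL_USER_ID
WHERE channel_id = %CHANNEL_ID
  AND user_id = %CANONICAL_USER_ID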

We were puzzled about why this process existed and curious to understand whether we needed to preserve this strange behavior or had uncovered a more nefarious problem. A journey into Slack conversations and git history from years prior revealed that the code was written to paper over a problem specific to an Enterprise Grid feature, where we sometimes wrote placeholder membership rows with canonical user IDs and updated them once users rejoined those channels.

This issue did not manifest itself in the discrepancies we inspected during the Dark mode phase, nor did it appear during several rounds of manual quality assurance (QA) and in the unit tests we wrote, because it only arose under precise, highly uncommon circumstances. Fortunately, we determined that we no longer needed this process and deleted it entirely. Problem solved!

From start to finish, we spent one month ramping up Light mode to all customers. Once we’d gained confidence in the overall correctness of the data in the Vitess cluster with our small set of volunteers, we continued the ramp-up. We began with our own Slack instance and then went on to teams on the free tier, followed by paying customers, and finally our largest Enterprise customers. During the ramp-up, we noticed that our customer with the greatest number of shared channels was seeing time-outs on the API called when viewing a channel (conversations.view). We traced the problem to one of the Vitess channels_members queries executed during the API call, which was timing out. Unfortunately, because the query was relatively low volume, we hadn’t been alerted to the problem during the Dark mode phase. We immediately rolled back Light mode for the customer, fixed the query, and ramped right back up.

Sunset Mode

A mere three days after successfully opting all customers into Light mode, we began the final stage, Sunset mode. During this phase, although we continued to double-write to both data sources, we routed read traffic only to the new Vitess clusters. By enabling Sunset mode for our users, we decreased the query load on our overloaded legacy systems by 22 percent, giving them much-needed breathing room. Figure 11-6 shows the dip in query volume we observed across our workspace shards.

Figure 11-6. Removing read queries from the legacy workspace-sharded clusters

Tidying Up

After Sunset mode, a handful of important tasks remained. Namely, once our data warehouse dependencies had been properly migrated to consume channel membership data from Vitess, we needed to drop the old workspace-sharded channels_members tables. We bade them farewell roughly a month later. We then spent the following weeks tidying the channel membership unidata library, carefully unwinding any feature flags and removing double-writing logic.

Dropping writes from the legacy shards was a huge, timely win. We removed 50 percent of writes and completely eliminated replication lag on the enterprise shard for our largest customer (VLB from Chapter 10), just as it was beginning to struggle under the pressure of the incessant write traffic. In the days leading up to dropping the table, the shard had been experiencing replication lag upward of 20 minutes. Figure 11-7 shows the steep drop in write traffic to VLB’s enterprise shard.

Figure 11-7. Removing writes from VLB’s shard

Figure 11-8 shows a distinct lack of spikes in replication lag following the removal of the write load.

Figure 11-8. No more replication lag!

Unfortunately, just as we were finishing up, the coronavirus was beginning to spread, and our offices around the world shut down, with Slack’s entire workforce transitioning to working from home. With the global shift to remote work, Slack saw a sharp increase in demand; we were acquiring new customers at a breakneck pace, and our existing customers were sending more messages than ever before. The entire infrastructure team, including those of us winding down the channels_members migration, urgently shifted its focus to scaling our systems to unprecedented levels. Although we were relieved to bring the refactor to a close, we were never given the proper opportunity to revel in our achievement.

With the project at a close, other engineers at Slack started scheming about ways to take advantage of the newly resharded table. Prototypes of new features began emerging while we were still in Sunset mode, and several follow-up projects were quickly staffed across multiple teams to take advantage of the new data model and simplify other queries around both Grid and shared channels.

Lessons Learned

As with our previous case study, there are a number of important lessons to be learned from our migration of channels_members to Vitess. We’ll start with ways the project might have gone better, describing how we might have set more realistic estimates and sourced the right teammates sooner. Then we’ll discuss ways it succeeded, detailing our decision to increase project scope carefully at the outset and the merits of our simple communication strategy.

Set Realistic Estimates

By the time we started our migration of the channels_members table to Vitess, we had done a number of Vitess migrations already. We had built and refined tooling to improve the process, making it easier and safer with every iteration. We based our initial estimates on our experience with our most recent migrations, which had been decidedly quicker than the first few. We optimistically assumed that this migration would be no more difficult than the last.

We should have known, however, that channels_members would be a different beast for a number of reasons. First, the query load far exceeded any of our previous migrations. Second, we decided to shard the data across two keys, user and channel, rather than just one. Finally, we chose to use canonical user IDs and make meaningful changes to the schema to improve developer productivity, thereby further increasing the complexity of the project. Our estimates should have reflected these important decisions and their implications.

The team took a morale hit when we surpassed our original estimate, and engineering leadership turned a more watchful eye on the project. Fortunately, we were able to secure more resources and move forward with the refactor, but our estimate certainly did not set the expectations it should have at the start.

Setting unrealistic estimates can have much more serious consequences: the refactor might lose priority, and engineering leadership might lose faith in your ability to drive large software projects. Your career risks taking a hit. Had we taken the time to brainstorm each of the potential pitfalls and leaned on the strategies discussed in Chapter 4, we might have set better expectations for both ourselves and our stakeholders at the start of the refactor.

Source the Teammates You Need

When we started the project, we assumed that the majority of the work would be best handled by infrastructure engineers. We could reach out to product engineers as necessary, asking questions or seeking code review on an ad hoc basis. Only once we ran into difficulties detangling the JOINs did we ask for more significant resourcing from product engineering. It was at that point that we realized that we could work faster by working alongside engineers who were intimately familiar with the queries we were migrating. Their involvement was crucial throughout the lengthy Dark mode phase, during which we debugged a number of data discrepancies that led to strange behaviors in the product. Had they been more present from the beginning, we might have migrated queries more quickly and more correctly (including the JOINs), cutting down on the time spent in later phases.

As discussed in Chapter 5, sometimes the teammates you have are not the ones best suited for the job. Because large-scale refactors have far-reaching impact, they often involve engineers from different teams and disciplines. The team you identify at the start of your project is very rarely set in stone. If you believe your team is no longer the right one, figure out who it is missing and seek out those individuals. If you think you need more resources than you had initially anticipated, ask for them.

Plan Scope Carefully

An important decision we made early in the refactor was to use canonical user IDs for all user ID–related columns in the Vitess channels_members schemas. We knew that Slack was aiming to adopt canonical user IDs throughout, but the first few phases of that effort were unlikely to conclude before our table migration was complete.

By choosing to adopt canonical user IDs, we intentionally increased the scope of the refactor. We could have spent the time canonicalizing user IDs on our legacy workspace-sharded clusters first, only migrating to Vitess once the data had been properly updated. Likewise, we could have migrated the table without canonicalizing the IDs and initiated the process once it had safely landed in Vitess. We believed that by doing both at the same time, we would save both time and effort. (While we had no great way of measuring this, we do believe it turned out to be true!)

In Chapter 4, we learned that keeping a moderate scope is important to ensure that a refactor is completed within a reasonable amount of time and does not affect more surface area than is necessary. However, there are circumstances when adding some additional scope is worthwhile and will ultimately make the effort more successful. Be mindful of these opportunities during the project planning stage and make a deliberate decision to take advantage of them well before the project is in full swing. This way, when you communicate your plan more broadly, stakeholders will have an opportunity to voice an opinion about the additional scope, and everyone’s expectations should be appropriately aligned.

Choose a Single Place for Project Communication

Throughout the refactor, we leaned heavily on our project channel, #feat-vitess-channels, to collaborate, coordinate, and provide important updates. Because it served as our central point of contact, everyone kept up to date with new messages. It was a great place to ask questions or post code for review; you were sure to get a response within a few minutes. On several occasions, teammates would debug issues in threads for others to chime in or catch up on later. During the Light mode portion of the refactor, users who had volunteered to be opted in to the new queries would come to #feat-vitess-channels to report bugs and other strange behavior they’d encountered. If it was related to moving channels_members to Vitess, you could find it in this channel.

Most importantly, #feat-vitess-channels was a place for us to keep each other motivated. As the refactor dragged on, with engineers cycling on and off and Dark mode continuing to throw us a number of curveballs, it became increasingly difficult to stay optimistic about our progress. Engineers from across the company would occasionally pop in with an encouraging “You got this!” or a series of emoji reactions to a weekly status update. Small, thoughtful acts of support can go a long way to boost team morale, and having a convenient place where colleagues could share their encouragement helped make it a common occurrence.

By keeping all communication pertaining to the project in a single place, it’s easy for everyone involved with the refactor to stay on the same page. Teammates can join and leave the effort without extensive knowledge transfers. External stakeholders can check in on the latest progress without pinging you directly. Perhaps most importantly, it can be a place of support and encouragement. For ideas on how to establish good communication habits, refer to Chapter 7.

Design a Thoughtful Rollout Plan

The migration of the channel membership table to Vitess had a well-defined rollout strategy split into four concrete phases. At each stage, we had a strong vision of when we should opt different groups of users into our changes (i.e., users at the company first, followed by customers on the free tier, regular paid customers, and our largest customers last). On top of this procedure, we used highly reliable tooling built explicitly for the Vitess migration use case, which enabled us to quickly ramp up (and down) each of the different modes to distinct slices of users at our preferred pace.

Each of these factors helped us move forward quickly, but perhaps the most effective piece was our ability to roll back immediately if we began to notice a detrimental impact to our users. Having that power at our fingertips meant that we weren’t afraid to move forward aggressively. It was particularly useful when we entered the Light mode phase as we used volunteers within the company to read data from the Vitess cluster.

Even the most thoughtfully planned, meticulously executed refactor will lead to a handful of bugs, and it is often impossible to identify them all before beginning a rollout. If you can control who is opted in to your changes at important milestones, and can roll back swiftly, you’ll be able to make progress much more nimbly, surfacing potentially terrible regressions well before they become a serious incident.

Takeaways

Here are the most important takeaways from our refactor to migrate channels_members from our workspace-sharded clusters to user- and channel-sharded clusters in Vitess.

  • Set realistic estimates. Optimism is great, but missed deadlines can have serious ramifications.

  • Source the teammates you need; the ones available to you or currently on your team may not be the ones best suited for the job. Don’t be afraid to ask for new (or more) resources if you need them.

  • Plan project scope carefully. Any added scope should be accounted for during the planning phase to set expectations appropriately.

  • Choose a single place for project communication and stick to it.

  • Design a thoughtful rollout plan and invest in building the tooling you need to make ramping up (and down) as easy as possible.