Chapter 3. Measuring Our Starting State

Every spring, I take the time to clean out my closet and reevaluate all of the clothing I own. While some opt for a Marie Kondo–like approach to cleaning out their closets, seeing whether each item “sparks joy,” I take a more methodical one. Each year, when I kick off the process, I know that by the end, a number of items will be in the donate pile. What I don’t know is which pieces these will be, because it entirely depends on how all of my clothing works together in the first place.

Before I start packing some bags for Goodwill, I take a comprehensive look at the whole. I organize everything by clothing type: sweaters in one pile, dresses in another, and so on, accounting for the practicality of each item of clothing as I go. Which seasons is this dress good for? How comfortable is it? How often have I worn it in the past year? Next, I approximate how many outfits the item can be integrated with. It’s only once I have a strong sense of everything I own, and understand the role each item of clothing plays in my closet, that I can start to identify the clothing I can comfortably donate.

The same logic applies to large refactoring efforts; only once we have a solid characterization of the surface area we want to improve can we begin to identify the best way to improve it. Unfortunately, finding meaningful ways of measuring the pain points in our code today is much more difficult than categorizing items of clothing in our closets. This chapter discusses a number of techniques for quantifying and qualifying the state of our code before we begin refactoring. We’ll cover a few well-known techniques as well as a few newer, more creative approaches. By the end of the chapter, I hope you’ll have found one (or more) ways to measure the code you want to improve in a way that highlights the problems you want to solve.

Why Is Measuring the Impact of a Refactor Difficult?

There are a number of ways to measure the health of a codebase. Many of these metrics, however, might not move in a positive direction as a result of a large-scale refactor simply because they are orthogonal to the pain points the project aims to address. So, in measuring the starting state of our codebase, we want to choose a metric that we believe will summarize the problem well and accurately highlight the impact of our refactor.

Measuring the impact of any refactoring effort is tricky, primarily because when executed successfully, refactoring should be invisible to users and lead to no behavioral changes whatsoever. This isn’t a new feature or user-facing tweak that we hope will drive adoption. We often put a great deal of effort into monitoring critical pieces of our applications to ensure that our users are getting a reliable experience when using our product, but because these metrics capture behavior that our users are likely to notice, most of them remain unaffected when we’ve refactored correctly. To best characterize the impact of a refactor, we need to identify metrics that measure the precise aspects of the code we want to improve and establish a strong baseline before moving forward.

Large refactoring efforts are particularly difficult to measure because they rarely take place in the span of just a few weeks. More often than not, the work involved from start to finish spans far beyond the typical feature development cycle, and unless product development was completely paused while the refactoring effort was ongoing, it might be difficult to isolate its impact from the work of other developers in the same section of the application. Reliance on a handful of distinct metrics can help you paint a more holistic picture of your progress and better distinguish your changes from those introduced by other developers iterating on the product alongside you.

Measuring Code Complexity

Many of us are motivated to refactor as a means of boosting developer productivity, making it easier for us to continue to maintain our applications and build new features. In practice, this often means simplifying complex, convoluted sections of code. Given that our goal revolves around decreasing code complexity, we need to find a meaningful way of measuring it. Quantifying the code’s complexity gives us a starting point from which we can begin to assess our progress.

Fortunately, measuring software complexity is relatively easy, for two main reasons. First, if our code resides in version history, we can easily travel through time and apply our complexity calculations at any interval. Second, a vast number of open-source libraries and tools are readily available in many programming languages. Generating a report for your entire application can be as simple as installing a package and running a single command.

Here, we’ll discuss three common methods of calculating code complexity.

Halstead Metrics

Maurice Halstead first proposed measuring the complexity of software in 1975 by counting the number of operators and operands in a given computer program. He believed that because programs mainly consisted of these two units, counting their unique instances might give us a meaningful measure of the size of the program and therefore indicate something about its complexity.

Operators are constructs that behave like functions, but differ syntactically or semantically from typical functions. These include arithmetic symbols like - and +, logical operators like &&, comparison operators like >, and assignment operators like =. Take, for instance, a simple function that adds two numbers together, as shown in Example 3-1.

Example 3-1. A short function that adds two numbers together
function add(x, y) {
  return x+y;
}

It contains a single operator, the addition operator, +. Operands, on the other hand, are any entities we operate on, using our set of operators. In our addition example, our operands are x and y.

Given these simple data points, Halstead proposed a set of metrics for calculating several characteristics of a program:

  1. A program’s volume, or how much information the reader of the code has to absorb in order to understand its meaning.

  2. A program’s difficulty, or the amount of mental effort required to re-create the software; also commonly referred to as the Halstead effort metric.

  3. The number of bugs you are likely to find in the system.

To illustrate Halstead’s ideas better, we can apply our operator and operand counting technique to a slightly more complicated function, which calculates an integer’s prime factors, as in Example 3-2. We’ve enumerated each of the unique operators and operands, along with the number of times they occur in the program, in Table 3-1.

Example 3-2. Operators and operands in a short function
function primeFactors(number) {
  function isPrime(number) {
    for (let i = 2; i <= Math.sqrt(number); i++) {
      if (number % i === 0) return false;
    }
    return true;
  }

  const result = [];
  for (let i = 2; i <= number; i++) {
    while (isPrime(i) && number % i === 0) {
      if (!result.includes(i)) result.push(i);
      number /= i;
    }
  }
  return result;
}
Table 3-1. Unique operators and operands, with their frequencies
Operator       Number of occurrences   Operand        Number of occurrences
function       2                       0              2
for            2                       2              2
let            2                       primeFactors   1
=              3                       number         7
<=             2                       isPrime        2
()             4                       i              12
.              3                       Math           1
++ (postfix)   2                       sqrt           1
if             2                       false          1
===            2                       true           1
%              2                       result         4
return         3                       <anonymous>    1
const          1                       includes       1
[]             1                       push           1
while          1
&&             1
! (prefix)     1
/=             1

Unique operators: 18

Total operators: 35

Unique operands: 14

Total operands: 37

Given that our prime factorization program has 18 unique operators (n1), 14 unique operands (n2), and a total operand count of 37 (N2), we can use Halstead’s difficulty measure to calculate the relative difficulty associated with reading the program with the basic equation:

D = (n1 / 2) · (N2 / n2)

Substituting in our values, we obtain an overall difficulty score of 23.78.

D = (18 / 2) · (37 / 14)
D ≈ 23.78

Although 23.78 might not signify much on its own, we can gradually acquire an understanding of how this score maps to our experiences working with individual sections of our code. Over time, through repeated exposure to these values alongside their implementations, we become better able to interpret what a score of 23.78 signifies within the greater context of our application.
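If you want to experiment with the arithmetic yourself, it fits in a few lines of JavaScript. The following is an illustrative sketch only: it assumes you have already tallied the unique operator, unique operand, and total operand counts (as we did in Table 3-1), and the halsteadDifficulty helper is a name of my own choosing rather than part of any particular library.

// A minimal sketch: computes Halstead difficulty from counts tallied
// elsewhere (by hand or by an off-the-shelf complexity tool).
function halsteadDifficulty(uniqueOperators, uniqueOperands, totalOperands) {
  // D = (n1 / 2) · (N2 / n2)
  return (uniqueOperators / 2) * (totalOperands / uniqueOperands);
}

// Values from Table 3-1 for primeFactors
console.log(halsteadDifficulty(18, 14, 37)); // 23.7857..., the ~23.78 quoted above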

Tip

Each of the three distinct metrics described in this section can be generated at different scales; they can quantify the complexity of a single function or a complete module. You can calculate the Halstead difficulty metric for an entire file, for instance, by summing up the difficulties of the individual functions contained within it.

Cyclomatic Complexity

Developed by Thomas McCabe in 1976, cyclomatic complexity is a quantitative measure of the number of linearly independent paths through a program’s source code. It is essentially a count of the number of control flow statements within a program. This includes if statements, while and for loops, and case statements inside switch blocks.

Take, for example, a simple program with no control flow components, as shown in Example 3-3. To calculate its cyclomatic complexity, we first assign 1 for the function declaration, incrementing with every decision point we encounter. Example 3-3 has a cyclomatic complexity of 1 because there is only one path through the function.

Example 3-3. Simple temperature conversion function
function convertToFahrenheit(celsius) {
  return celsius * (9/5) + 32;
}

Let’s look at a more complex example, like our primeFactors function from Example 3-2. In Example 3-4, we reproduce it and enumerate each of its control flow points, yielding a cyclomatic complexity of 6.

Example 3-4. Control flow points in primeFactors
function primeFactors(number) { // 1
  function isPrime(number) {
    for (let i = 2; i <= Math.sqrt(number); i++) { // 2
      if (number % i === 0) return false; // 3
    }
    return true;
  }

  const result = [];
  for (let i = 2; i <= number; i++) { // 4
    while (isPrime(i) && number % i === 0) { // 5
      if (!result.includes(i)) result.push(i); // 6
      number /= i;
    }
  }
  return result;
}

  1. The function declaration is the first control flow point.

  2. The first for loop is our second point.

  3. The first if statement is our third point.

  4. The second for loop is the fourth point.

  5. The while loop is the fifth point.

  6. The second if statement is the sixth point.

When we’re reading a chunk of code, every time there is a branch (an if statement, a for loop, etc.), we have to begin to reason about multiple states with multiple paths of execution. We have to be able to hold more information in our heads to understand what the code does. So, with a cyclomatic complexity of 6, we can infer that primeFactors is probably not too difficult to read and understand.

Counting the number of decision points in a program is a simplification of McCabe’s proposed method of calculating its complexity. Mathematically, we can calculate the cyclomatic complexity of a structured program by generating a directed graph representing its control flow; each node represents a basic block (i.e., a straight-line code sequence with no branches), with an edge linking two blocks if there is a way to pass from one to the other. Given this graph, the complexity, M, is defined by the following equation, where E is the number of edges, N is the number of nodes, and P is the number of connected components (a connected component being a subgraph in which all nodes are reachable from one another).

M = E - N + 2 P

Figure 3-1 shows an example control flow for primeFactors.

Figure 3-1. Control flow graph for primeFactors, with blue nodes signifying nonterminal states and red nodes signifying terminal states. For this example, we have 13 edges, 11 nodes, and 2 connected components.
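Plugging the values from Figure 3-1 into the equation confirms the count we reached by enumerating decision points. Here is a tiny sketch of the arithmetic, with a helper name of my own choosing:

// M = E - N + 2P, where E = edges, N = nodes, P = connected components
function cyclomaticComplexity(edges, nodes, connectedComponents) {
  return edges - nodes + 2 * connectedComponents;
}

// Values from the control flow graph in Figure 3-1
console.log(cyclomaticComplexity(13, 11, 2)); // 6, matching our earlier count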

NPath Complexity

NPath complexity was proposed as an alternative to existing complexity metrics in 1988 by Brian Nejmeh. He argued that focusing on acyclic execution paths did not adequately model the relationship between finite subsets of paths and the set of all possible execution paths. We can observe this limitation in the fact that cyclomatic complexity does not consider the nesting of control flow structures: a function with three for loops in succession yields the same metric as one with three nested for loops. Nesting can influence the psychological complexity of a function, and psychological complexity can have a large impact on our ability to maintain software quality.

McCabe’s metric might be easy to calculate, but it fails to distinguish between different kinds of control flow structures, treating if statements identically to while or for loops. Nejmeh asserts that not all control flow structures are equal; some are more difficult to understand and use properly than others. For example, a while loop might be trickier for a developer to reason about than a switch statement. NPath complexity attempts to address this concern. Unfortunately, this makes it a bit more difficult to calculate, even for small programs, because the calculation is recursive and can quickly balloon. We’ll walk through the calculations for a few examples with if statements to get familiar with how it works. If you’d like to gain a better understanding of how to calculate NPath complexity given a greater range of control flow statements (including nested control flows), I highly recommend reading Nejmeh’s paper.

Tip

Control flow metrics can help you determine the number of test cases your code needs. Cyclomatic complexity offers a lower bound, and NPath complexity provides an upper bound. For instance, with primeFactors, cyclomatic complexity indicates that we would want at least six test cases to exercise each of the decision points.

Our base case for NPath complexity is the same as for our previous temperature converter function in Example 3-3; for a simple program with no decision points, the NPath complexity is 1. To illustrate the multiplicative component of the metric, we’ll take a look at a simple function with a few nested if conditions.

Example 3-5 shows a short function that returns the likelihood of receiving a speeding ticket, given a provided speed. Reading through the function, we reach a first if statement, at which point the given speed can be either less than or greater than 45 km/h. There are then two possible paths: if the speed is less than 45 km/h, we enter the code inside the if block; if not, we simply continue. We next need to check whether the speed is more than 10 km/h over the supplied speed limit, at which point we again have two possible paths through the code. Eventually, we return our calculated risk factor.

Example 3-5. A short function with two sequential if statements, with different sections annotated A, B, C, D, E, and F
function likelihoodOfSpeedingTicket(currentSpeed, limit){
  let risk = 0;  // A

  if (currentSpeed < 45) {
    risk = 1; // B
  } // C

  if (currentSpeed > (limit + 10)) {
    risk = 2; // D
  } // E

  return risk; // F
}

NPath complexity calculates the number of distinct paths through a function. We can enumerate each of these paths by calling likelihoodOfSpeedingTicket with a range of values, exercising each set of conditions. We’ll walk through one input together, highlighting the path we traverse through the function. All other unique paths are labeled in Table 3-2.

Table 3-2. All unique paths through likelihoodOfSpeedingTicket
Inputs    Path
30, 10    A, B, D, F
30, 50    A, B, E, F
90, 50    A, C, D, F
90, 110   A, C, E, F

Unique paths: 4

Say we call likelihoodOfSpeedingTicket with a currentSpeed of 30 and a limit of 10. Our first if statement evaluates to true (30 is less than 45), leading us to B. Our second if statement also evaluates to true (30 is greater than 20), leading us to D. Then we reach our return statement at F. Repeating this pattern for a variety of inputs, we determine that there are four unique paths through the function. Therefore, our NPath score is 4.
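Because sequential, independent decision points multiply, NPath complexity grows much faster than cyclomatic complexity. As a rough illustration (this variant of the function is hypothetical, not taken from Nejmeh’s paper), adding a third independent if statement doubles the NPath score from 4 to 8, while cyclomatic complexity climbs only from 3 to 4:

// A hypothetical variant with a third, independent condition.
// Each sequential if contributes a factor of 2, so NPath = 2 · 2 · 2 = 8,
// while cyclomatic complexity rises from 3 to just 4.
function likelihoodOfSpeedingTicketInSchoolZone(currentSpeed, limit, isSchoolZone) {
  let risk = 0;

  if (currentSpeed < 45) {
    risk = 1;
  }

  if (currentSpeed > (limit + 10)) {
    risk = 2;
  }

  if (isSchoolZone) {
    risk += 1;
  }

  return risk;
}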

Note

Some easy forms of refactoring won’t have any impact on your control flow graph (CFG) metrics. Some complexity is unavoidable simply due to complicated business logic: you have to make each of these checks and iterations to ensure that your application is doing what it needs to do. When the code you want to refactor involves simplifying unnecessarily complicated logic, NPath or cyclomatic complexity are great options. If not, I recommend using a different set of metrics. Do be mindful, however, that even if you are detangling some spaghetti code, NPath or cyclomatic complexity should not be your only metrics; you won’t be able to characterize the impact of your refactoring effort holistically with only a single data point.

Lines of Code

Unfortunately, control flow graph metrics can be difficult (and sometimes expensive) to calculate, particularly for very large codebases (which are precisely the ones we’re looking to improve). This is where program size comes into play. Although it may not be quite as scientific as Halstead’s, McCabe’s, or Nejmeh’s algorithms, program size, combined with other measurements, can help us locate likely pain points in our application. If we’re looking for a pragmatic, low-effort approach to quantifying the complexity of our code, size-based metrics are the way to go.

When measuring code length, we have a few options available to us. Most developers choose to measure only logical lines of code, omitting empty lines and comments entirely. As with our control flow metrics, we can collect this information at a number of resolutions. I’ve found the following few data points to be quite helpful reference points:

LOC (lines of code) per file

Every codebase has the kind of files that look as if you might not reach the end if you started scrolling from the beginning. Measuring the number of lines of code for these would likely accurately capture the psychological overhead required to understand their contents and responsibilities when a developer pops them open in their editor.

Function length

For every endless file, there’s an endless function. (More often than not, the endless functions are found in the endless files.) Measuring the length of functions or methods within your application can be a helpful way of approximating their individual complexities.

Average function length per file, module, or class

Depending on how your application is organized, you may want to keep track of the average function or method length per logical unit. In object-oriented codebases, you likely want to keep track of the average length of each method within a class or package. In an imperative codebase, you might measure the average length of each function within a file or larger module. Whatever the greater organizational unit, knowing the average length of the smaller logical components contained within it can give you an indication of the relative complexity of that unit as a whole.

LOC might vary wildly, depending on the language of a program or programming style, but if we’re comparing apples to apples, we shouldn’t be too concerned. When refactoring at scale, we’re generally concerned with improving code within a single, large codebase. In my experience, the vast majority of developers working with these codebases have invested in establishing style guides, defining a set of best practices, and often enforcing these rules with autoformatters. Some variation is inevitable across teams and components, but broadly speaking, the application as a whole tends to look similar enough that two sets of LOC metrics from distinct sections of the codebase should still be comparable.
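If you don’t already have a line-counting tool wired up, a rough approximation takes only a few lines of Node.js. The sketch below is deliberately crude and illustrative only: it counts nonblank lines that aren’t single-line comments, ignores block comments entirely, and uses primeFactors.js as an example path; a dedicated tool will do a far better job across an entire codebase.

// A crude sketch of a logical-LOC counter for a single JavaScript file.
// Blank lines and single-line comments are skipped; block comments and
// other subtleties are left to a dedicated tool.
const fs = require('fs');

function countLogicalLines(filePath) {
  const lines = fs.readFileSync(filePath, 'utf8').split('\n');
  return lines.filter((line) => {
    const trimmed = line.trim();
    return trimmed.length > 0 && !trimmed.startsWith('//');
  }).length;
}

console.log(countLogicalLines('primeFactors.js'));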

Test Coverage Metrics

When we’re developing new features, there are a few testing philosophies we can adopt. We can opt for a test-driven development (TDD) approach, writing a thorough suite of tests first and then iterating on an implementation until the tests pass; we can write our solution first, followed by the corresponding tests; or we can decide to alternate between the two, incrementally building an implementation, pausing to write a handful of tests with each iteration. Whatever our approach, the desired outcome is the same: a new feature, fully backed by a quality set of tests.

Refactoring is a different beast. When we’re working to improve an existing implementation, whatever the extent of our endeavor, we want to be sure that we’re correctly retaining its behavior. We can safely assert that our new solution continues to work identically to the old by relying on the original implementation’s test suite. Because we are relying on the test coverage to warn us about potential regressions, we need to verify two things before beginning our refactoring effort: first, confirm that the original implementation has test coverage and, second, determine whether that test coverage is adequate.

Say we want to refactor our primeFactors function in Example 3-2. Before we consider making any changes, we need to measure whether it has test coverage and, if it does, whether that test coverage is sufficient. Verifying that the implementation has test coverage is easy. We can just pop open the corresponding test file and take a peek at what it contains. For our example, we find just one test, shown in Example 3-6.

Example 3-6. A simple test for primeFactors
describe('base cases', () => {
  test('0', () => {
    expect(primeFactors(0)).toStrictEqual([]);
  });
});

Determining whether that test coverage is adequate, however, is a trickier task. We can evaluate it in two ways: quantitatively and qualitatively. Quantitatively, we can calculate a percentage representing the proportion of code that is executed when the test suite is run against it. We can collect metrics for both the number of functional lines of code and the number of execution paths tested by our simple unit test, yielding 40 percent and 35.71 percent, respectively. Example 3-7 shows the test output generated with the Jest unit testing framework.

Example 3-7. Jest test coverage output for primeFactors, given our single test case
-----------------|---------|----------|---------|---------|-------------------
File             | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s
-----------------|---------|----------|---------|---------|-------------------
All files        |   35.71 |        0 |      50 |      40 |
 primeFactors.js |   35.71 |        0 |      50 |      40 | 3-6,11-13
-----------------|---------|----------|---------|---------|-------------------
Test Suites: 1 passed, 1 total
Tests:       1 passed, 1 total

Now, we have to decide whether this is adequate test coverage. Neither metric fills me with great confidence that primeFactors is particularly well-tested; after all, this indicates that roughly two-thirds of the function is not being exercised by our current suite. Test coverage is primarily useful in two ways:

  • Helping us identify untested paths in our program

  • Serving as a ballpark measure of whether we have tested enough

Note

If you are looking for strategies for testing legacy software, I recommend picking up a copy of Working Effectively with Legacy Code by Michael Feathers. He discusses a bevy of options for how to introduce unit tests retroactively by capitalizing on seams in the code, strategic places where you can change the behavior of your program without modifying the code itself.

To improve the test coverage for our example, we can add one more test case, as shown in Example 3-8. If we recalculate our coverage (see Example 3-9), we notice that with just one additional test case, we can achieve near-perfect coverage. Does this mean that our test coverage is adequate? Quantitatively it might appear to be sufficient; qualitatively it might not be. Peeking back at our implementation for primeFactors, we can easily identify a few missing test cases, such as providing a negative number, or the number 2.

Example 3-8. Two simple tests for primeFactors
describe('base cases', () => {
  test('0', () => {
    expect(primeFactors(0)).toStrictEqual([]);
  });
});

describe('small non-prime numbers', () => {
  test('20', () => {
    expect(primeFactors(20)).toStrictEqual([2, 5]);
  });
});
Example 3-9. Jest test coverage output for primeFactors, given our two test cases
-----------------|---------|----------|---------|---------|-------------------
File             | % Stmts | % Branch | % Funcs | % Lines | Uncovered Line #s
-----------------|---------|----------|---------|---------|-------------------
All files        |     100 |    83.33 |     100 |     100 |
 primeFactors.js |     100 |    83.33 |     100 |     100 | 12
-----------------|---------|----------|---------|---------|-------------------
Test Suites: 1 passed, 1 total
Tests:       2 passed, 2 total
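Quantitatively we now look well covered, but the qualitative gaps we spotted earlier remain. A sketch of those missing cases might look like the following; the expected values reflect how the current implementation happens to behave (negative inputs skip the main loop and return an empty array), so treat them as illustrative rather than a specification:

describe('negative numbers', () => {
  test('-5', () => {
    // The current implementation never enters its main loop for
    // negative inputs, so it returns an empty array.
    expect(primeFactors(-5)).toStrictEqual([]);
  });
});

describe('small prime numbers', () => {
  test('2', () => {
    expect(primeFactors(2)).toStrictEqual([2]);
  });

  test('7', () => {
    expect(primeFactors(7)).toStrictEqual([7]);
  });
});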

In my experience, thoughtfully written code generally has between 80 and 90 percent test coverage. This shows that the majority of the code is tested. Be forewarned, however, that test coverage alone is not an indication of how well-tested something is. It’s easy to write low-quality unit tests to reach perfect or near-perfect test coverage. If high test coverage is incentivized by management, you will typically find that a significant portion of your unit tests make little effort to assert the corresponding code’s important behavior.

From a qualitative standpoint, determining whether test coverage is sufficient is not so simple. There is a great deal of thoughtful writing about this already, most of which goes beyond the scope of this book, but at a high level, I think suitable test quality has been attained if the following points hold true:

  • The tests are reliable. From one run to the next, they consistently produce passing results when run against unchanged code and catch bugs during development.

  • The tests are resilient. They are not so tightly coupled to implementation that they stifle change.

  • A range of test types exercise the code. Having unit, integration, and end-to-end tests can help us assert that our code is functioning as intended with different levels of fidelity.

If we have determined that the test coverage and test quality are sufficient, then we should be confident in moving forward with our refactoring effort. If tests are lacking either in coverage or quality, we need to spend the requisite time writing more, and better, tests up front. Measuring the test quantity and quality of each of the sections of code we intend to refactor is an important step in helping us determine how much additional work we need to commit to before we begin refactoring.

Documentation

Before we start refactoring something, we should take stock of any existing documentation about it. Reading through that documentation may help us gain valuable additional context on the code. While documentation is not a great source of numerical metrics for measuring our starting state, it is a critical source of evidence we can use to describe the problems we seek to improve. We’ll discuss two forms of documentation worth examining when trying to understand and quantify our starting point in anticipation of a large refactoring effort: formal and informal documentation.

Formal Documentation

Formal documentation is everything you most likely think of as documentation. This doesn’t have to follow any official, industry-level standard (like Unified Modeling Language [UML]). Rather, what makes it formal is that it was deliberately authored (and, in many cases, is actively maintained) to inform the reader about your system. Technical specs, architecture diagrams, style guides, onboarding materials, and postmortems are a few examples of formal documentation.

We can use things like technical specs as evidence that our refactor is necessary or useful by referencing design decisions, assumptions, or other designs considered or rejected. Say, for instance, you work on a subsection of your application responsible for processing all user-related actions within your product. The current implementation requires developers writing new features to remember and enumerate every kind of event that needs to be fired and propagated to sibling subsystems when a user modifies their profile. If your team has a history of writing technical design specs for each of their features, you can locate the original specification document for event propagation. This document describes the current implementation, its limitations, and any alternative approaches.

The limitations section states that while it might be convenient to trigger each required event individually at every location, if the team introduces a substantial number of new events, it might become clumsy and burdensome. Today, your system is experiencing that exact problem. It handles more than a dozen event types and your team is struggling to keep track of the sprawl. With every new feature, your team fears forgetting to trigger a critical event type and potentially introducing a pesky bug. You’ve done your best to assert the desired behavior with tests but decide that refactoring how these events are handled is the best solution to taming the chaos of repetitive logic.

Technical specs can be very helpful in supporting your hypothesis of exactly what needs to be improved and how. Occasionally, these documents outline alternative approaches considered but not ultimately chosen. You may be able to explore one of these suggestions with your refactoring effort.

Maintainers of style guides and onboarding materials can sometimes leave traces of their experiences in the documentation they produce. If they’ve recently made an unexpected discovery about how something works and sought to improve the documentation as a result of that experience, you might be able to catch a glimpse of that in their writing. You might find warnings in large, bolded text of exactly what not to do. It’s also not uncommon to see a disproportionate amount of content devoted to particularly complex pieces of the codebase in these kinds of documents; more people across the company will have devoted more time to trying to steer readers in the right direction, away from the pitfalls they themselves fell into. If the code you want to refactor is documented in these sources and follows these patterns, it might be good evidence that it can be measurably improved. Think about the ideal tone and content of the documentation for your target code and use that as inspiration.

Postmortems can serve as great supporting evidence. If your team follows the PagerDuty incident response process and has been doing so for some time, then you likely have access to dozens of postmortem documents detailing the what, where, when, why, and how of every instance where your application wasn’t behaving as expected.

When building a case for code that is worth refactoring, I search for postmortems summarizing incidents I believe directly involved that code. Then I read through two sections: “Contributing Factors” and “What Didn’t Go So Well?” When I suspect that the complexity of the code had a direct impact on the time to resolution or perhaps even caused the incident in the first place, these two sections will likely confirm it. A count of the number of incidents that list the area you want to refactor as a problem makes a valuable metric.

Note

It’s also important to take note of third-party or publicly facing documentation. While refactoring is not meant to modify the behavior for consumers of your application, this documentation can be particularly useful for bolstering your understanding of the code you’re intending to rewrite.

Informal Documentation

Alongside our formal documentation, we produce a wide range of informal documentation. These are the kinds of written artifacts that we don’t consider to be proper documentation simply because they don’t typically occur in document form. In my experience, I’ve found more useful information speckled throughout informal sources than in any formal documentation.

Finding these sources is all about thinking outside the box. I’ll enumerate a few here but keep your eyes peeled for other sources around you. You just might surprise yourself!

Chat and email transcripts can provide insightful information about the code you’re seeking to refactor. Best of all, these often grant a good deal of context, both historical and organizational pieces of information. Say, for instance, you want to refactor how asynchronous jobs are structured in your application. The job queue system currently accepts a dynamic set of arguments of arbitrary size to maximize flexibility for its consumers. Unfortunately, this has led to quite a bit of confusion around its actual limitations, putting the system at risk of running out of memory when processing jobs with extremely large argument payloads, or crashing abruptly when it is unable to parse malformed inputs.

You want to be certain that your experience with the system’s ambiguity is not unique to you and your team. To measure how troublesome writing new jobs is, you search your company’s Slack (or other messaging solution) for a set of keywords related to job queue arguments. Unsurprisingly, you come across a number of messages where someone was surprised or concerned that their job didn’t work as intended. Developers across the company are asking whether they should provide raw or opaque IDs. Why one over the other? Do we log these job arguments? If so, do we need to be careful about including personally identifiable information? How much data can we send via these arguments? Are we able to serialize entire objects and supply these to the job queue system?

You create a document that points to each of these messages, with a short description of the context around each. (This should be easy to do with a short backscroll through the conversation.) Now you can reference these instances to demonstrate the difficulty that developers are currently running into.

Chat history gives you the unique ability to peek into conversations that occurred long before your arrival. You might be surprised to see people spread across a variety of engineering teams talking about the problems you’re eager to fix months or years before your first day on the job. You might encounter others asking the same question at a regular cadence. When this happens, not only is it extremely validating to your endeavor, but you may get some valuable allies by reaching out to the folks on those teams and asking them about their experience with the code you want to improve. Quantitatively, you can use these conversations to approximate how many engineering hours are lost due to confusion about the code you want to improve and answer questions about it.

Depending on your engineering team’s project management tools of choice, you may be able to gather some important metrics related to the code you want to refactor by searching for related bugs in your bug tracking system. You might also be able to estimate the amount of time other teams or individual developers have spent investigating and fixing bugs or implementing changes related to your target code.

Say the code around a particular feature or feature set has been gaining complexity over time. You want to invest effort in tidying it up so that your team can develop at a quicker pace. If you suspect that your team’s velocity has decreased, you can use your project management software to confirm it. Note that this is a very coarse metric (and as with all of our other metrics, only quantifies a single aspect of the overall problem). You will probably need intimate knowledge of how your team organizes its development cycles and confidently remove outliers in your data to be able to tease out a compelling metric here, but for some teams, it can be an indisputable one!

Note

Technical program managers at some companies can be a great resource for helping you collect, filter, and disseminate these kinds of metrics. They are often whizzes at navigating project management tools and locating hard-to-find documents. Who knows, you might even make a new friend!

At this point, this may all sound like an excessive amount of investigatory work to quantify a given problem. That’s okay! It’s up to you to decide which metrics will have the most impact in communicating the severity of the problem and the potential benefit of fixing it. You may not want or need to spend the time digging through hundreds of tasks or postmortems, but if this information is easy to consume and search, it might be worthwhile. These metrics can especially come in handy when trying to convince management and leadership teams that are highly removed from the code that refactoring is worthwhile.

Version Control

We primarily think of version control as a tool to manage changes to our applications. We use it to move forward incrementally, allowing for the development of multiple features at once, and progressive shipment of those features. Sometimes, we use it to refer to previous versions of our code to track down a bug or locate someone who might know about the section of code we’re reading. We rarely think of version control as a source of information about our team’s development patterns when analyzed in aggregate. Turns out, we can glean quite a bit about the problems our engineering team is facing when we take a look at our commits from a different perspective.

Commit Messages

Although not everyone makes writing descriptive commit messages part of their development method, if you work on a team where a majority of developers do, these short descriptions can provide a glimpse into the issues that they might be running into. We can identify patterns either by searching for a set of keywords or by isolating commit messages associated with changes to a set of files we’re interested in.

Let’s say we’re looking at our job queue system problem from earlier. We know that engineers regularly forget to sanitize their arguments before enqueueing jobs, resulting in logging personally identifiable information (PII). We can search through our commit messages and identify commits where the corresponding messages include words like “job,” “job handler,” or “PII.” From this result set, we might find a substantial set of commits that either introduced a new job responsible for leaking PII or fixed a job already leaking it. Alternatively, if our job handlers are conveniently organized into distinct files, we could narrow our search to include only commits with modifications to these files and comb through the derived set for similar patterns.

Some development teams relate their commits or changesets to their project management tools by highlighting bug or ticket numbers in the commit message or branch name. If this information is available to us, we can link the changeset back to our previous collection of metrics on development velocity and bug count. It all comes full circle!

Commits in Aggregate

In his book Software Design X-Rays, Adam Tornhill proposes a set of techniques for teasing out important development patterns from version history. He hypothesizes that these development behaviors can help you identify which sections of your application you should prioritize when refactoring, illustrate how the complexity of certain functions has changed over time, and highlight any tightly coupled files or modules. I highly recommend reading his research to fully understand the psychology behind why these measurements are so enlightening, but I’ll summarize the basic techniques here so that you might consider them ahead of your next big refactor.

Change frequencies are the number of commits made to each file over the complete version history of your application. You can easily generate these data points by extracting file names from your commit history, aggregating them, and ordering them from most to least frequent. In practice, Tornhill noticed that these frequencies tended to follow a power distribution, where a disproportionate number of changes occur in a small subset of core files. Knowing the files that are committed to most often tells us exactly which files need to be the easiest to understand and navigate for developers and, therefore, which files we should spend the most effort maintaining, from a developer productivity perspective.
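If you want to try this on your own repository, a rough version of the aggregation fits in a short Node.js script. This is a sketch rather than the tooling Tornhill describes: it assumes git is available on your PATH, that you run it from the repository root, and it makes no attempt to filter out autogenerated or vendored files.

// A rough sketch: counts how many commits touched each file by parsing
// `git log --name-only` output. Run from the root of the repository.
const { execSync } = require('child_process');

function changeFrequencies() {
  const output = execSync('git log --name-only --pretty=format:', {
    encoding: 'utf8',
    maxBuffer: 1024 * 1024 * 512, // long histories produce a lot of output
  });

  const counts = new Map();
  for (const line of output.split('\n')) {
    const file = line.trim();
    if (file.length === 0) continue;
    counts.set(file, (counts.get(file) || 0) + 1);
  }

  // Order from most to least frequently changed
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(changeFrequencies().slice(0, 20)); // the 20 most-changed files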

We can apply the same concept of change frequencies at the function level as well. By looking at individual commits, we can attribute changes to the respective functions within individual files, producing total frequency numbers for each of them. By combining this data with one of our earlier complexity metrics, lines of code, we can map complexity changes over time across the entire codebase. This information shows us potential hotspots ripe for improvement. Once we’ve completed our refactor, we can regenerate these metrics to confirm not only that the complexity of these hotspots has decreased, but hopefully that their change frequency has as well.

Tornhill also describes a method for pinpointing tightly coupled modules in your program by looking at sets of files modified within the same commit. To illustrate this idea, let’s say we have three files: superheroes.js, supervillains.js, and sidekicks.js. In a subset of our commits, we have the following changes: commit one modifies both superheroes.js and sidekicks.js; commit two modifies all three files; commit three again modifies superheroes.js and sidekicks.js; and commit four touches only superheroes.js. From this subset of our version history, depicted in Table 3-3, we notice that of four commits, three modified both superheroes.js and sidekicks.js. This suggests that some kind of coupling exists between these two files. Certainly not all coupling is bad (as is the case for changes in source code and the corresponding unit test files), but in some cases these patterns can indicate an erroneous abstraction, copy-pasted code, or both. Once we’ve pinpointed these problems, we can work to fix them and then rerun the analysis sometime later to confirm that they no longer exist.

Table 3-3. Files modified per commit
Commit #   superheroes.js   supervillains.js   sidekicks.js
1          x                                   x
2          x                x                  x
3          x                                   x
4          x

As with each of our quantitative metrics in this chapter, there are some caveats to this kind of measurement. Different developers have different practices around committing changes. Some programmers will make a large quantity of tiny commits; others will make large commits, including dozens of changes across multiple files, into a single changeset. Moreover, it’s entirely likely this analysis will reveal some outliers (configuration files frequently changed or hotspots in autogenerated code). We have to be vigilant about these anomalies when poring over the data to mitigate the risk of finding problems where there might not be any.

Reputation

Whether we’re aware of it or not, each of the many sections of our software systems has a distinct reputation. Some reputations are stronger than others; some are positive, some deeply negative. Whatever the reputation, it is built up slowly over time, spreading across the engineering organization as more and more engineers interact with the code. Word of the most disastrous codebases sometimes even travels outside your company and into the wider industry, discussed over dinner among friends and on internet forums. Whether these reputations continue to hold true or not, they can tell us plenty about some of the most troublesome pieces of our applications and just how desperately they need our attention.

A simple, low-effort means of collecting reputation data is to interview fellow developers. Let’s assume you work on an application that charges customers for a monthly service and you want to improve your application’s billing code. You set up some interviews with developers who fall into two categories: those who work directly with the billing code on a regular basis, and those who have worked with it only on occasion. For each of these two sets, you’ll want to speak to developers with a range of tenures on their current team and within the company; the experiences of someone who has worked integrally with the billing code for years are probably quite different from those of an engineer hired six months ago.

We then derive a set of questions that will help us characterize their experience. We begin with a few questions to frame their background and then delve into their thoughts about the code. A few are suggested in Table 3-4 to get you started.

Given your experience with the billing code, when you were evaluating which files could benefit the most from a thorough refactor, you immediately thought of chargeCustomerCard.js. You decide to ask your interviewees about the file to see what sort of reaction it elicits. If the second you mention chargeCustomerCard.js your interviewee grimaces, whether or not they have intimate knowledge of the inner workings of that file, that’s a strong indication that it could probably use a little bit of love.

If we want to solicit feedback from a larger group of engineers or are tight for time on establishing our starting metrics, we can rephrase our interview questions to fit a standard set of answers. This will make aggregating the responses easier and allow us to derive conclusions from them faster. Be warned, however, that by reducing your fellow developers’ thoughts to a set of scores, you’ll be stripping away some of the nuance that you might have been able to glean from an in-person (or virtual) interview.

From experience, interviews tend to give you more flexibility to explore ideas and topics that bubble up candidly. It’s often the back and forth banter that brings out the best aha! moments. If we sent around a developer survey with long-form interview-like questions, not only would we not be able to ask the respondents in real time to provide more details about their answers, but we would likely get fewer responses. I’m very guilty of opening up a survey, noticing that it is a series of half-a-dozen open-ended questions, and almost immediately setting myself a reminder to do it later. If you want to solicit feedback from engineers in survey form, keep it short; this way, you have a better chance of getting a high response rate.

Table 3-4. Suggested developer interview and survey questions
Interview question: How long have you been working with X code?
Survey question: Select the option that best describes the amount of time you have spent working with X code: < 6 months; 6 months to 1 year; more than 1 year.
Notes: In the survey version, choose time ranges that make the most sense for your engineering organization. At high-growth, younger companies, the ranges are probably on the order of months; at larger, more established companies, the ranges could be on the order of years.

Interview question: If you could change one thing about working with X code, what would it be? Why?
Survey question: If you could choose only one of the listed options to improve your experience working with X code, which one would it be?
Notes: For the survey version, choose some options that you think would make the most impact and optionally provide a write-in field. If the code doesn’t have any tests, add an option that states that the code is fully tested. If a large proportion of the code is contained within a few functions that are hundreds of lines long, add an option that states that the code is split up into small, modular functions.

Interview question: Tell me about a bug you recently had to fix that involved X code. What would have made it easier to solve?
Survey question: Of the Y options listed below, what about X code makes it the most difficult to fix bugs efficiently?

Interview question: Have you strategically avoided working in X code before (i.e., fixing a bug at a level above or below the problem area)? Tell me about that experience.
Survey question: On a scale from 1 to 5, 1 being not likely at all and 5 being very likely, how likely are you to find a way to avoid making changes to X code?

Interview question: How does the complexity of X code hinder your ability to develop new features?
Survey question: With 1 being strongly disagree and 5 being strongly agree, rate the following statement: The complexity of X code is a significant contributor to the time it takes for me to develop new features.

Interview question: How does the complexity of X code hinder your ability to test and/or debug your code?
Survey question: With 1 being strongly disagree and 5 being strongly agree, rate the following statement: The complexity of X code is a significant contributor to the difficulty of testing and/or debugging my code.

Interview question: How does the complexity of X code hinder your ability to review other developers’ changes to the code?
Survey question: With 1 being strongly disagree and 5 being strongly agree, rate the following statement: The complexity of X code is a significant contributor to the time and difficulty involved in reviewing other developers’ changes to the code.

Reputation can also hinder a team’s ability to hire and retain engineers. Say the billing code is known to be particularly treacherous at your company. While the team probably has a handful of developers who are committed to their roles, working in a frustratingly complex codebase can take a toll on morale. Organizations don’t like to admit that they’ve lost engineers due to code quality and development practices, but it happens all the time. If you’re able to collect information on engineers’ reasons for leaving the team and tie those back to code complexity, it can be an incredibly compelling metric for dedicating some much-needed resources to refactoring.

Building a Complete Picture

Now that we’ve familiarized ourselves with a wide range of potential metrics, we have to choose which ones to use. To build the most comprehensive view of the current state of the world, you must identify the metrics that best illustrate the specific problems you want to address. None of these metrics alone can quantify the many unique aspects of a large refactoring effort, but combined, you can build a multifaceted characterization of the problem.

I recommend picking one metric from every category. Approximate code complexity in whatever way makes the most sense given the nature of your problem and the tools you already have available. Generate some test coverage metrics to make sure you start off on the right foot. Identify a source of formal documentation you can use to illustrate the problems your refactor aims to solve; back it up with some informal documentation as well. Gather information about your hotspots and programming patterns by slicing and dicing version control data. Last, consider the code’s reputation by chatting with your colleagues.

If you find that most of these metrics can help you quantify the current state of the code you are aiming to refactor and the impact it has on your organization, consider choosing the subset that has the greatest chance of showing significant improvements. These are the metrics that will make the most compelling case to your teammates and, ultimately, management. In the end, you’ll have to make a convincing argument to those you report to that the time and energy you and your teammates are ready to devote to the refactor will pay off.

We’ve successfully gathered evidence to help us properly characterize the problem we’re experiencing, but setting the stage is only one piece of the puzzle. Next, we have to use the data we’ve collected to assemble a concrete execution plan.