2: A systematic approach to debugging – Debugging Embedded and Real-Time Systems

Abstract

Before we begin to look at specific problems and the tools used to solve them, this chapter discusses a general strategy for finding and fixing problems. Even the best tools need to be set up properly, and the best tool to start with is the engineer.

I provide several examples of correct and incorrect methods of debugging, with a particular focus on students.

Keywords

Occam’s Razor; Differential testing; Software timer loop; DS1232

In this chapter, I am speaking directly to the students and faculty. I would hope that experienced engineers are well versed in the techniques that I’ll be discussing, and new engineers are more at the student’s end of the spectrum. To the student reading this, please excuse the fact that I seem to be directing the content toward your instructor because I am referring to you in the third person.

I thought I’d introduce this chapter with something that Joe Decuir,ᵃ an IEEE Fellow and an affiliate faculty member of my division at the University of Washington Bothell, sent me.

The six stages of debugging

  1. That can’t happen.
  2. That doesn’t happen on my machine.
  3. That shouldn’t happen.
  4. Why does it happen?
  5. Oh, I see.
  6. How did that ever work?

Now, back to the problem at hand. After observing my students struggle time after time trying to find and fix problems with their hardware, their software, or both, I decided to share with them some “best practices” that I’ve learned over the years to approach the problems of:

  1. Observing a bug.
  2. Being able to reproduce the bug.
  3. Hypothesizing the cause of the bug.
  4. Testing the hypothesis.
  5. Achieving high confidence that the hypothesis is correct.
  6. Making the correction.
  7. Retesting to validate the fix.

In short, it didn’t work. If I gave a quiz or question on a test, I’m confident that most students would get the right answer. But suppose that you’re struggling with a design problem, or a circuit that isn’t doing what you designed it to do, would you know where to begin? Would you make matters worse by “shotgunning” the circuit problem?ᵇ

Shotgun debugging is the debugging of a program, hardware, or system problem using the approach of trying several possible solutions at the same time in the hope that one of them will work. This approach may work in some circumstances while sometimes incurring the risk of introducing new and even more serious problems.

In my experience, it never works. Shotgun debugging of software may just be a time sink, but for hardware it is the kiss of death. Why is this? For starters, printed circuit boards are not designed for continual heating and reheating cycles on the pads or traces. Delicate parts are also not designed to withstand repeated heating and reheating cycles.

So, what do students do when confronted with a hardware bug? They start to change parts and hope that will fix it. Most of the time they just make matters worse. They destroy the parts that they only have one of. Ditto with the PC board. Time goes by and they are no closer to solving the problem and deeper in the hole. More telling, they have no idea where their original starting point was, so they can’t get back to it even if they wanted to. In the extreme, they will start over again, hoping that by just redoing the board or rewriting the code, the bug will resolve itself. This doesn’t work either.

So, if they actually know better, why don’t they follow the best practices that I’ve laid out for them? I suspect they’d rather try something quickly and hope for the best than spend the time required to set up a systematic debug process to find the defect.

At this point, I’m going to take a short detour and caution the reader about one debug technique that is fraught with danger. When we teach our incoming EE students their first circuits class, we provide them a lab kit of parts that includes a solderless breadboard. Fig. 2.1 is a photo of a typical solderless breadboard.


Fig. 2.1 A solderless breadboard (https://www.auselectronicsdirect.com.au/arduino-solderless-breadboard-840-points?gclid=EAIaIQobChMI2O6Nt5S-2gIVmnZgCh2f2gLZEAQYASABEgKlqPD_BwE). Vertical rows of five dots (two are highlighted) are connected together. Horizontal rows (one row is highlighted) are connected together. Inserting 22-gauge solid core wire makes the necessary connection. The two blue squares indicate points where the horizontal rows are segmented into eight separate buses. Jumper wires are required to connect them to form four continuous strips for power and ground.

If you were an incoming student taking our Circuits I and II classes, you would build simple DC and low frequency AC circuits with these breadboards and within that context, they sort of work okay. Provided that you are working with signals greater than 10 mV or so and frequencies under 1 kHz, the solderless breadboard will enable you to do rapid prototyping.

Depending on how neatly you do the wiring, you (more or less) can do your lab experiments, get reasonable results, and pass the course. The same holds true for simple digital circuits. As long as the circuits are simple and low frequency, then the logic chips work as promised, although close examination of fast rising and falling edges shows significant overshoot, undershoot, and ringing, even at nearly DC switching speeds.

The situation becomes dangerous when you use the solderless breadboard as your go-to debugging tool for all circuit design problems and for rapid prototyping of higher-sensitivity or higher-frequency circuits. At that point, the solderless breadboard becomes a negative productivity tool. It causes more time to be wasted as the student attempts to figure out why their circuit isn’t working. You may have wired it correctly, but the signal either looks terrible or it is buried in ground loop noise. Here’s how I learned not to use solderless breadboards.

I was a relatively new engineer at the Colorado Springs Division of Hewlett-Packard. I was doing my first real circuit design and I was prototyping part of the circuit on a solderless breadboard. Resistors, capacitors, wires, and transistors were all standing up in a three-dimensional circuit.

A senior engineer peeked over the wall to my cubicle, saw what I was doing, walked in, and pushed down on the circuit with his palm, squashing everything together. Never saying a word, he walked out. That was my lesson.

When is it OK for you to use a solderless breadboard? I won’t freely admit this, except under duress or in this tell-all book, but I will on occasion resort to using a solderless breadboard. Almost all the time, it is when I don’t understand something in the data sheet regarding the part I’m trying to design into the circuit, and I want to confirm some functional feature of the part. Reading the data sheet is often futile, so if I don’t have the simulation model for the part, I’ll try out a sample part on a solderless breadboard in order to confirm that I understand how it works. If I really get stuck, I’ll call an applications engineer at the company that manufactures the part.

Having just warned you of the dangers of using a solderless breadboard, I couldn’t resist adding the cover photo, Fig. 2.2, from Bob Pease’s classic book, Troubleshooting Analog Circuits [2]. As you can see, he does not heed my warnings about the use of solderless breadboards. I suspect that this photo was staged for effect. Here’s what Bob, one of the gurus of circuit design, had to say about solderless breadboards:

I didn’t even think about solderless breadboards when I wrote my seriesᶜ because I see them so rarely at work. They just have too many disadvantages to be good for any serious work. So, if you insist on using these slabs of trouble, you can’t say I didn’t warn you.


Fig. 2.2 Cover photo from the book, Troubleshooting Analog Circuits, by Robert A. Pease. Photo used by permission of the publisher, Elsevier, Ltd.

Who’s at fault here?

I believe that educators are remiss in not teaching how to find and correct errors in design with the same rigor as we teach the basic principles of engineering circuit design or writing software. A survey conducted in 2005 at the Embedded Systems Conference in San Franciscoᵈ indicated that:

…debugging is the most time-consuming and costly phase of the software development lifecycle, with a majority of respondents citing debugging as the most significant problem they encounter.

Therefore, if debugging is the costliest and most time-consuming phase of the development life cycle, why aren’t engineers taught the best ways to debug code (or hardware, or Verilog)? Here are some thoughts:

  •  Debugging is viewed as a negative; a bug is a flaw. It’s better to design code without bugs in the first place.
  •  After you teach the basics of coding, or designing circuits, there isn’t enough time left in the quarter or semester to teach debugging as a discipline worthy of study.
  •  The best debugging tools have yet to be invented because current tools aren’t up to the challenge.
  •  Debugging has not been designated as an academic discipline worthy of study, scholarship, and research, so it just isn’t taught.
  •  Best practices are handed down from senior engineers to junior engineers and are not disseminated outside a tightly knit design team.

A bug of my own

Let’s start now to try to define an overall strategy to approach the identification of the root cause of a defect in hardware (or software) and then how to fix it. In order to create the strategy, we’ll start with a case study. I’m going to describe a simple embedded system that has a bug, one that I designed for my course at the University of Washington Bothell, B EE 425, Introduction to Microprocessor System Design.

This particular design was for the laboratory part of the course. I wanted students in the class to learn how to use a logic analyzer to look at the interplay between hardware and software. The board contained a 4 MHz Z80 CPU, 32 K ROM, 32 K RAM, a single 7-segment display, and some glue logic, including a Dallas Semiconductor (now Maxim) micromonitor IC that handled power-on reset, watchdog timer, and a reset push button. This is the 8-pin IC located just behind the Z80-CPU.

The Z80 was chosen because it is an excellent teaching example of how a microprocessor works. The student can see the signals on the processor's I/O pins in real time while the code is executing. Everything that the processor does is unambiguously visible to the logic analyzer. There is no cache and no input pipeline queue; just the basic clock, address bus, data bus, and status bus. Yes, the Z80 is old and slow, but if you want to give a lecture on a typical timing diagram for a generic microprocessor, a logic analyzer trace of a Z80 bus cycle unambiguously matches the lecture diagram.

The seven-segment display was added in a later revision of the board because students invariably complained that their board was dead. The display and an appropriate test ROM chip would show them that the board was working.

After building and loading a test run of three boards, I wrote test code that would exercise the board and allow me to look at all the bus signals with both a logic analyzer and an oscilloscope probe. The scope showed the fidelity of the signals and the 34-channel logic analyzer allowed me to see almost all the Z80’s address, data, and status pins (Fig. 2.3).


Fig. 2.3 Lab experiment board for a microprocessor course. The 40-pin connector connects the board to a LogicPort Logic Analyzer.

All the test boards turned on. I next wrote the code that ran the display. Against all rules of good design practice, I wrote a software timing loop so that the display would run slowly enough for a human being to observe it. I didn’t have the room to add a separate timer to the board, so the software loop was my only way to slow the system down without adding more hardware. The three boards ran my code well and the displays continuously scrolled b-E-E-4-2-5.

Buoyed with confidence, I ordered 20 more boards and enough parts to stuff them. Because I didn’t want to build 20 boards myself, I organized a Saturday morning soldering class for the EE students who’d never soldered a board before. Before turning them loose on the good boards, I had them watch several YouTube videos on soldering. Next, I gave them some practice PC boards and assorted parts to solder. When the students were able to demonstrate that they “sort of” knew what to do, we turned them loose. Several student assistants with soldering experience circulated and also kept an eye on the soldering work.

When a board was completed, we plugged in the display ROM to see if it worked. It was gratifying that about 15 of the 20 boards turned on right away. The other five did not, so we put those aside. Later, I carefully examined the five that did not turn on. Two had easy-to-fix soldering issues and worked after I repaired them. Three looked like the soldering was good—no bridges, no cold solder joints—but the boards still would not turn on.

Rather than discuss how I found the bug, I’m going to describe the systematic process that I teach students to follow to find and fix the defect. Write down what you know and what you’ve observed; you need a paper trail. This is like placing rocks along a poorly marked hiking trail so you can find your way back. Also, creating a paper trail will help you organize your thoughts and create some hypotheses that you can then test.

  a. Observed: Three Z80 boards out of a batch of 20 failed to turn on using the EPROM test code that activates the display.
  b. Known:
    b.1. The code and EPROMs work in other boards.
    b.2. Visual inspection shows good solder joints and no solder bridges on any boards.
  c. Possible causes:
    c.1. Power to ground short circuit.
    c.2. Manufacturing defects in boards.
    c.3. Bad clock.
    c.4. Shorted or open address, data, or status signals.
    c.5. Overheating pads or traces during soldering.
    c.6. Parts improperly inserted in sockets.
    c.7. Bent IC pins not in socket.
    c.8. Problem in seven-segment display circuit.
    c.9. Other possible causes I haven’t thought of yet.

At this point, let’s take a breath and review where we are. We’ve identified a minimum of nine possible hypotheses to start investigating. Of course, there could be more. We haven’t even considered a software bug, but a hardware defect appears to be the simplest place to start because these are new boards built by inexperienced engineering students loading and soldering their first boards.

The plan of attack will be to check the boards in more detail, starting from the simplest possible cause and working up to full-scale board debugging. If necessary, connect a logic analyzer and observe all the relevant signals.

In my notebook (or your lab notebook), I created a list of each test I was going to perform, what I expected to see, and what I observed.

Tests to perform:

  1. Using an ohmmeter, check for a short between power and ground.
    Expected: Should measure some resistance greater than 0 Ω.
    Measured: 150 Ω.
  2. Check power to board.
    Expected: All Vcc pins have +5 VDC.
    Measured: Every Vcc pin measured 5.25 VDC.
    Comment: Used 5 V wall supply from lab kits, 1 A DC current limit.
  3. Check the parts. Are they properly inserted? Are they overheating? Do a thorough visual inspection, and then remove all parts from the socket, inspect the pins, and reinsert.
    Expected: All parts have been properly oriented and inserted in their respective sockets.
    Observed: All parts were properly inserted and no parts were excessively hot.
  4. Check clock signal.
    Expected: 4 MHz, 0–5 Vpp (Vpp = peak to peak voltage) square wave observed at crystal oscillator output and at Z80 clock input.
    Observed: Clock looked good at the crystal and at the Z80 clock input (pin 6).
    Comment: Used lab oscilloscope grounded to ground test point on the board.
  5. Using a logic analyzer, check for open or shorted address, data, or status traces.
    Expected: Trace listing should agree with source code binary output.
    Comments:
    a. Used state mode.
    b. Triggered logic analyzer trace on exiting reset (~RESET goes high).
    c. Concerned about software timing loop filling buffer.

    Observed: Code executed properly and entered software delay loop, filling buffer. Problem does not appear to be shorted or open traces. Still did not observe display operation.
  6. Check to see if code exits software delay loop.
    Expected: Code will properly exit the loop, indicating problem is somewhere else.
    Comments: Trying to eliminate software delay loop by reconfiguring logic analyzer trigger circuit to sequentially trigger. Set condition “A” to be the first address in the software delay loop and then set condition “B” to be the first address after the software delay loop.
    Observed: LA did not trigger. Software never exits the delay loop.
    Here is the first appearance of a clue. It doesn’t seem to be a hardware bug, but we can’t rule it out just yet. More tests are necessary. After scratching my head for a while, and rereading the help menu for the logic analyzer, I modified the trigger condition so that state “B” was any address outside the range of the software delay loop that the code is failing to properly exit.
    Before going any further, I need to write down what I just observed so that I have a data point to which I can return. Next, I repeated the test with the logic analyzer, but now with the modified trigger conditions. Set trigger sequence so condition “A” is the first address in the delay loop. Condition “B” is any address outside the delay loop.
    Observed: Logic analyzer triggered.
    Comment: Trace listing showed that the RESET went low and the processor began to reexecute the initialization code.
    Now I knew where the problem was; I just didn’t know why it was happening. Once in the delay loop, the processor would reset itself and start over. It never exited the loop to drive a new value to the display.
    The only way to generate a hard RESET signal is to push the RESET button, which I definitely was not doing, or for the RESET signal to be generated by the hardware. The suspect now was the micromonitor IC from Maxim. In addition to the power-on reset, reset button, and power supply monitoring, the micromonitor also has a watchdog timer. Could that be triggering and asserting RESET?
    I thought I understood how the part worked, but I went back to my data sheet and carefully read the section about the operation of the watchdog timer. According to my design, I left an input pin labeled TD, for timer delay, unconnected. When this pin is left unconnected, the watchdog timer will time out in 600 ms if it does not see an input on its STROBE pin. When the timer times out, it generates a RESET signal. I was using one of the status signals from the Z80 to prevent the watchdog timer from asserting RESET, but that signal was not active when the software was in the timing delay loop.
    In an incredible validation of Murphy’s Law, the watchdog timer timeout value and my timing delay loop were just about equal. There was just enough variability in the exact timeout delay in the DS1232 micromonitor chip so that most of the chips worked, but some did not. The simplest way to test this hypothesis was to take a working board and a defective board and swap the micromonitor chips. If the failure followed the chip, I knew the cause of the problem and rereading the data sheet gave me the solution.
  7. Swap the micromonitor chips in a working and nonworking board.
    Observed: The failure follows the DS1232 chip.
    Proposed solution: Rewrite the delay loop code or change the watchdog timer timeout.
  8. Connect a jumper wire from the TD input of the micromonitor chip to Vcc. This should increase the timeout from 600 to 1200 ms.
    Observed: A board with this failure mode now worked properly.
    Once again, I stopped and documented what my tests had shown. I knew what was causing the problem and wrote down what I did to verify that I had indeed found the root cause of the problem. I also wrote down a possible software-only solution that would negate the need to solder a jumper wire on the 20 boards.
  9. Rewrite the software delay loop to continuously output a signal to the STROBE input on the micromonitor chip so that it would not time out and generate any more errant RESET signals.
    Expected: All boards should now work without the need for a hardware jumper wire.
    Observed: All defective boards now worked.
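
The two trigger setups from test 6 can be sketched in a few lines of Python. This is only an illustration of the idea, not the logic analyzer's actual trigger engine: the function name `sequential_trigger`, the loop addresses, and the short address trace are all hypothetical values chosen for the example.

```python
from typing import Callable, Iterable, Optional

def sequential_trigger(addresses: Iterable[int],
                       cond_a: Callable[[int], bool],
                       cond_b: Callable[[int], bool]) -> Optional[int]:
    """Return the index of the first bus cycle that satisfies condition B
    after condition A has been seen, or None if the trigger never fires."""
    armed = False
    for index, address in enumerate(addresses):
        if not armed:
            armed = cond_a(address)      # stage 1: wait for condition A
        elif cond_b(address):            # stage 2: wait for condition B
            return index
    return None

# Hypothetical memory map for the delay loop.
LOOP_START = 0x0040
LOOP_END = 0x0045            # last address inside the loop
AFTER_LOOP = LOOP_END + 1    # first address after the loop

# Hypothetical trace from a failing board: initialization code, three
# passes through the loop, then the reset sends the Z80 back to 0x0000.
loop_pass = list(range(LOOP_START, LOOP_END + 1))
trace = [0x0000, 0x0001, 0x0002, 0x0003] + loop_pass * 3 + [0x0000, 0x0001]

def in_loop(address: int) -> bool:
    return LOOP_START <= address <= LOOP_END

# First attempt: B is the first address after the loop.
first_try = sequential_trigger(trace,
                               lambda a: a == LOOP_START,
                               lambda a: a == AFTER_LOOP)

# Second attempt: B is any address outside the loop.
second_try = sequential_trigger(trace,
                                lambda a: a == LOOP_START,
                                lambda a: not in_loop(a))

print(first_try)    # None: the code never reaches the address after the loop
print(second_try)   # 22: the fetch from 0x0000 that follows the reset
```

With condition B set to the first address after the loop, the trigger never fires because the processor never gets there. Widening B to any address outside the loop catches the fetch from the reset address, which is the clue that pointed to the watchdog timer.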

What are the lessons to take away from this exercise?

  1. Stop, think, and plan before attacking the problem:
    The exercise of first writing down what I knew, what I suspected, and what I thought I should check was critical. It provided a framework for isolating the defect before I started to change anything.
    Also, by writing my plan down, I forced myself to slow down and not jump in. If the problem was much more complex and there were multiple engineers involved, this first step would likely involve a meeting and a brainstorming session to capture our ideas and then structure them into a coherent plan.
    There can be a lot of steps in a plan and, once we start to debug the system, we may have to revise our roadmap. That’s okay and to be expected. As we get deeper into identifying a problem, new clues may present themselves and we would need to look into other areas that were not part of the original plan.
    Another part of the preliminary work is to identify as many variables as you think you are going to need to track. This can easily become a laundry list of compiler versions, makefiles, linker command files, data sheets, etc. This is probably overkill, but when you are under the gun and trying to fix bugs so you can meet your product release deadline, having all the information at hand is priceless.
    Of course, you can keep your notes in electronic form. I’m old school. I’m used to having a lab notebook to capture ideas and make notes.
    If you’ve ever done any repair work on your car, you’ve probably invested in a factory service manual. These manuals are designed by the automotive engineers for the garage or dealer mechanics who have to fix a car. Go to any section in the manual and you will find a flow chart of the recommended process for isolating a mechanical or electronic failure.
    A flow chart is an alternate form of the list of steps, but has the added advantage of allowing you to view your debugging plan as a roadmap and also list what you want to do at decision points. Perhaps it is overkill, and I can honestly say I’ve never had a problem that required this type of approach, but it could be beneficial. Let’s just leave that in the “might work” category.
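
If you prefer electronic notes, the expected-versus-observed entries from the test list could be kept in a structure like the one below. This is a minimal Python sketch, not a tool I used: the names `LogEntry` and `open_entries` are made up for the example, and the third entry is shown in its planned state, before the test was run.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class LogEntry:
    step: int
    action: str
    expected: str
    observed: str = ""   # left empty until the test has been run
    comment: str = ""

def open_entries(log: List[LogEntry]) -> List[LogEntry]:
    """Return the planned tests that have no recorded observation yet."""
    return [entry for entry in log if not entry.observed]

log = [
    LogEntry(1, "Check for a power-to-ground short with an ohmmeter",
             "Resistance greater than 0 ohms", "150 ohms"),
    LogEntry(2, "Check power to board",
             "All Vcc pins at +5 VDC", "Every Vcc pin measured 5.25 VDC",
             "Used 5 V wall supply from lab kit"),
    LogEntry(7, "Swap micromonitor chips between a working and a nonworking board",
             "Failure follows the chip"),
]

for entry in open_entries(log):
    print(f"Step {entry.step} still needs an observation: {entry.action}")
```

Writing the expected result before running the test is the important part; the container it lives in matters much less.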
  2. Thoroughly read and understand the data sheets:
    In the book “Debugging: The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems” [3], one of the key points author David Agans makes is to understand your system. This means thoroughly and carefully reading all the data sheets as well as all the code you are using that you didn’t write yourself, or code you wrote so long ago that you no longer remember how it works.
    Agans further makes the point that you should carefully read all the comments in any software that is linked to the defect, or might be linked, or even has no apparent linkage. He gives one example where he was debugging an embedded processor system and the code was written in assembly language. One of the on-chip registers was getting corrupted. Here’s the author’s description of the problem:

    We were debugging an embedded firmware program that was coded in assembly language. This meant we were dealing directly with the microprocessor registers. We found that the B register was getting clobbered, and we narrowed down the problem to a call into a subroutine. As we looked at the source code for the subroutine, we found the following comment at the top: “/* Caution—this subroutine clobbers the B register. */”


    Agans devotes an entire chapter to this point. You should read everything, particularly the comments in code that you did not write, or you wrote a long time ago (anything older than 2 weeks is a long time). Only then can you eliminate the critical variable of not fully understanding the system. Here’s a summary of his nine indispensable rules for debugging hardware and software bugs. I’ll be returning to these points over and over in this chapter and in the chapters that follow:
    •  Understand the system.
    •  Make it fail.
    •  Quit thinking and look.
    •  Divide and conquer.
    •  Change one thing at a time.
    •  Keep an audit trail.
    •  Check the plug.
    •  Get a fresh view.
    •  If you didn’t fix it, it ain’t fixed.
      While I’m on the subject of the need to completely understand your design, here’s another relevant real-world example. Just prior to my departure from Hewlett-Packard’s Logic Systems Division (LSD) in the mid-1990s, I was managing a project that was in the initial phases of product definition. We had a product idea, but we weren’t given the go ahead to start developing the product. The code name was “Farside.” Not because of the Gary Larson comic strip, but because we were looking beyond conventional tools.
      As we defined it, Farside was going to be the ultimate debugging tool for embedded systems. We were aiming this debugging tool at engineers who needed to understand already working systems that needed to be upgraded. We settled on this type of debugging tool when we were doing customer research at a large telecommunications company on the East Coast. There we met with a team of engineers who were tasked with taking an existing product that they had not developed and making it better.
      All they had to go by was the standard documentation, service manuals, all the software, and all the hardware schematics. What they lacked was any documentation describing the intent of the original designers or their original design philosophy. What impressed me was their ability to read through these reams of documents and extract the design philosophy and the places where improvements could be made. What they were asking from us was a reverse engineering tool or suite of tools that would enable them to see the design at a higher level than just reading through the source code or schematic diagrams.
      These superb engineers defined the way to learn about your system.ᵉ
  3. Trust Occam’s Razor:
    As I’m sure most of you know, Occam’s Razor is the problem-solving principle that, when presented with competing hypothetical answers to a problem, one should select the one that makes the fewest assumptions.ᶠ
    Another way to state this principle is that other things being equal, simpler explanations are generally better than more complex ones.ᶠ
    How does Occam’s Razor apply to debugging? Let’s reconsider my process chart for debugging the microprocessor board. The very first test I did on the nonfunctional boards was to see if power and ground were shorted together. That would certainly explain a dead board. Also, it was plausible that a student may have damaged a board by overheating it and causing an inner layer short. This wasn’t part of my hypothesis; I just wanted to start with simple but significant tests and work my way up to the more subtle tests.
    Note that I did not start with step #5: connect the logic analyzer. I proceeded from simple, global tests and worked my way toward the more complex possible problems as I eliminated the simple ones.
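
The habit of starting with the simple, global checks can be written down as a sorted test plan. In this small Python sketch the effort figures, in minutes, are hypothetical numbers I assigned only to show the ordering; the checks themselves are the ones from the test list.

```python
# (description, hypothetical effort in minutes)
checks = [
    ("Logic analyzer trace of address, data, and status lines", 60),
    ("Scope check of the 4 MHz clock", 15),
    ("Ohmmeter check for a power-to-ground short", 2),
    ("Voltmeter check of +5 VDC on all Vcc pins", 5),
    ("Visual inspection and reseating of socketed parts", 10),
]

def plan_tests(candidates):
    """Return the checks ordered from least to most effort."""
    return sorted(candidates, key=lambda item: item[1])

plan = plan_tests(checks)
for description, minutes in plan:
    print(f"{minutes:>3} min  {description}")
```

The ohmmeter check lands at the top of the plan and the logic analyzer session at the bottom, which matches the order in which I actually worked.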
  4. Do differential testing:
    This means just change one variable at a time. Differential testing is the exact opposite of shotgunning. By changing only one variable, you can see the effect that the change had on the system. Ask the student why they don’t follow this process and the consistent answer you will hear is, “I don’t have the time.”
    If you’re a student reading this, you will probably agree that time is your most precious asset. You never have enough time. Therefore, you would probably sympathize with their comment. Perhaps you’ve been there more times than you want to admit. But if you don’t have time to follow a disciplined process, why are you so willing to tear something apart, rewire it, or rewrite it, without stopping to analyze what could be the real source of the error?
    It’s painful to watch when a neatly laid-out solderless breadboard that obviously took quite a long time to construct is pulled apart and rewired because the student hadn’t taken the time available to properly locate the flaw.
    In this particular case, the student had wired the board correctly and had simply neglected to check the positive and negative power rails. If he had performed this simple test, he would have discovered that half the circuit wasn’t connected to power or ground. The breadboard he was using actually split the power and ground rails in the middle of the rail. The upper four rail segments were connected to power and ground, but the lower four rail segments were unconnected. All that was required to fix the problem was four little jumper wires between the upper and lower segments of the power and ground rails. Referring back to Fig. 2.1, the two blue squares show where the power and ground buses are split. The four segments on the left side of the breadboard are connected together and the four segments on the right side of the breadboard are connected together, but there is no connection between the two halves.
    In the previous example in this chapter, I showed this principle in step #7. Up to this point, I had narrowed the defect down to the possibility that the micromonitor IC was the leading suspect, but I wasn’t certain it was the culprit because I had 17 boards that worked just fine and three that didn’t. I had a pretty good idea that was the problem because I knew that the time delay and watchdog timer reset interval were about the same amount of time.
    Also, after reading the data sheet for the part, I knew there was some variability in the exact time interval of the watchdog timer, so it followed that some parts might work and other parts might not work. Thus, the differential measurement was to replace a suspect part with a part that did not generate the RESET pulses.
    Testing one variable at a time, and only that variable, makes good sense. Every time you simultaneously test more than one variable, you change the number of possible outcome combinations by at least $2^N$, where $N$ is the number of variables you changed.
    Now you might think, “Why is he saying ‘at least’ $2^N$? Isn’t it just $2^N$?” If there were no additional interactions between the $N$ variables, then the number of possible outcome combinations might be just $2^N$ and, to clarify, this statement is saying that if you do see a change in the system with a particular combination of input variable changes, then there are $2^N - 1$ other combinations that might also cause the change.
    However, if two or more of those variables are interdependent, or the problem stops (or a new problem begins) because of the interplay between two variables, then you are exposing your debugging session to even more possible outcomes and exponentially more time to sort them out.
    Even when you do use differential testing, it is still critically important to write down your expected result and what you actually observe. Even if you debug properly, it is easy to lose track of the results of the tens of tests that you are performing.
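
To make the combination count concrete, here is a short Python sketch that enumerates every changed or unchanged assignment for three variables. The three variable names are borrowed from the Z80 board example purely for illustration, and the helper name `change_combinations` is hypothetical.

```python
from itertools import product

def change_combinations(variables):
    """List every changed/unchanged assignment for the given variables."""
    return [dict(zip(variables, flags))
            for flags in product((False, True), repeat=len(variables))]

variables = ["swap micromonitor chip", "jumper TD to Vcc", "rewrite delay loop"]
combos = change_combinations(variables)

# Suppose all three changes were made at once and the symptom went away.
applied = {name: True for name in variables}
others = [combo for combo in combos if combo != applied]

print(len(combos))   # 8, which is 2**3
print(len(others))   # 7, which is 2**3 - 1
```

With only three simultaneous changes there are already seven other combinations to rule out, and that count ignores any interaction between the variables.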
  5. Consider the best solution before moving on:
    It’s tempting to do a quick fix, and there are lots of “good reasons” for one, including:
    a. We’re behind schedule.
    b. The “big show” is next week.
    c. The software team needs units to continue development.
    d. Got to get units out into the field ASAP.
    e. Engineers have to move on to other projects.
      If you are a student reading this, you will have your own set of good reasons to want to do a quick fix and move on. These would likely include:
    a. The project is due at the end of the week.
    b. I have to study for my (name the course) final exam.
    c. I’m leaving town after finals are over.
    d. I’ll lose my scholarship if I don’t get a good grade in this class.
      In order to close the book on this section, we need to return to the example of the errant Z80 boards. We’ve identified the problem and made a software fix in the delay loop that eliminated the watchdog timer timing out and causing a ~RESET to be asserted. Should we just move on and call it fixed?
      The quick fix was to rewrite the code and reprogram the ROMs. However, there was another solution. I could change the timer’s timeout interval to 1.2 s by tying the TD input to the +5 VDC (Vcc) power plane. This involves adding a jumper wire from the TD pin to the +5 VDC input pin. Not a particularly hard modification to do to three boards. It might take 30 min in total. But wait. This is a lurking problem. Shouldn’t I modify all 20 boards to prevent the problem from recurring? That would easily have taken several more hours.
      Alternatively, I could do a redesign of the board and permanently tie TD to Vcc. Now I have to buy a new batch of boards and reload them. Time consuming and costly. Better yet, wouldn’t it be nice if you could select the time interval for the watchdog timer with a permanent jumper selector? If you look back at Fig. 2.2, you can see jumper selectors in the lower left of the board. Why not add another one for the timer?
      At my division of Hewlett-Packard, one of the project managers had a plastic baseball bat with the phrase “WIBNI Killer” on it. WIBNI stands for “Wouldn’t It Be Nice If…” When an engineer suddenly got excited about adding more features or capabilities, the manager would bring out the WIBNI Killer and symbolically beat the engineer into submission.
      Why this aside? Because if we have a window of opportunity to fix the hardware, it is reasonable to ask if we should use the opportunity to also make some design modifications or improvements. If there will need to be a respin of the board to fix our defect, perhaps we should use the opportunity to bring in something that marketing has been requesting.
      The key point of this discussion is that we should not just fix the bug with the most expedient possible fix and move on. Once you know what the problem is, take time to list your possible solutions before taking the easiest path. To review, the possible solutions are:
    a. Change the coding of the software delay loop.
    b. Add a jumper wire from the TD to Vcc.
    c. Respin the board and add a trace from the TD to Vcc.
    d. Respin the board and add a selectable jumper block for three possible timer timeout delays.
      In this example, with 20 boards for a student lab experiment, the simplest solution was the best one. Change the ROM code and eliminate the watchdog timer timeout issue. That’s what I did in this situation, but it wasn’t the only option we could have considered.
      In summary, most of the time we find the problem, fix it, and move on. I’m suggesting that before moving on, you take a moment to consider more possible options. You might be surprised.
  6. Log the defect:
    Keeping track of the defects you’ve found and repaired is also a critical part of the process if you want to improve your design processes. In my simple example, I would list the root cause as two factors:
    a. An incomplete understanding of how the DS1232 worked.
    b. Poor software design by using a software timing loop.
      What is the lesson here? That’s not so easy. I knew that using a software timing loop was not good design practice, but it was a reasonable solution for a simple situation where I needed a way to show the students that a board was working properly, and their problems were elsewhere.
      Would a better understanding of the micromonitor actually have prevented the bug? Maybe. I could have taken the time to calculate the exact elapsed time of the delay loop that I wanted for the display, and then compared it against the possible variability in the timer timeout to see if there was a conflict. Instead, I wrote the delay loop, watched the display, and figured “that’s about right.”
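
The calculation described here can be sketched in a few lines of C: compute how long the delay loop runs, then compare it with the minimum and maximum watchdog timeout for the chosen TD setting. This is only a sketch under stated assumptions. The timeout windows below are values I am assuming for the DS1232 and must be confirmed against the data sheet; the clock frequency, loop count, and T-states per iteration used in the example are hypothetical, and the code assumes the watchdog is not strobed anywhere inside the loop.

```c
#include <assert.h>

/* Watchdog timeout window in milliseconds for one TD pin setting. */
struct wd_window {
    double min_ms;
    double typ_ms;
    double max_ms;
};

/* ASSUMED values: confirm against the DS1232 data sheet before use. */
const struct wd_window WD_TD_GND  = {  62.5,  150.0,  250.0 };
const struct wd_window WD_TD_OPEN = { 250.0,  600.0, 1000.0 };
const struct wd_window WD_TD_VCC  = { 500.0, 1200.0, 2000.0 };

enum wd_risk {
    WD_SAFE,      /* delay is shorter than the minimum timeout       */
    WD_MARGINAL,  /* some parts will reset, others will not          */
    WD_FAIL       /* delay is at least the maximum timeout           */
};

/* Elapsed time of a software delay loop, in milliseconds. */
double delay_loop_ms(unsigned long iterations,
                     unsigned t_states_per_iter,
                     double clock_hz)
{
    assert(clock_hz > 0.0);
    return 1000.0 * (double)iterations * (double)t_states_per_iter / clock_hz;
}

/* Compare a delay against the part-to-part spread of the timeout. */
enum wd_risk classify_delay(double delay_ms, const struct wd_window *w)
{
    if (delay_ms < w->min_ms)
        return WD_SAFE;
    if (delay_ms < w->max_ms)
        return WD_MARGINAL;
    return WD_FAIL;
}
```

For instance, a hypothetical loop of 50,000 iterations at 24 T-states each on a 4 MHz clock runs for 300 ms. Under the assumed windows, `classify_delay` reports that as marginal with TD left open and safe with TD tied to Vcc. A marginal result is exactly the case where some boards work and others do not.
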
      That approach would probably have found the bug. Interestingly, I doubt that this bug would have been caught in a design review. I would have gotten some flak for the software delay loop, but I could have justified it on its merits.
      The hardware flaw (if you can call it a flaw) might have been picked up because the TD input on the DS1232 was left unconnected. Open inputs are always something to check as they can be sources of system instability due to noise on the floating inputs. However, I would have justified it as an intended part of my design. I think it would have taken someone really astute to pick up this defect before I uncovered it.
      Even so, having a written analysis of the bug, what you observed, and how you fixed it can be an incredible time saver at some time in the future. If you simply fix it and move on, you may forget about it and then, fast forward 2 years, there it is again. You’ve got this vague recollection that you’ve seen this before, but you can’t recall what exactly it was and what you did to find and fix it. So you go through the debug process once again. You find it and fix it, but imagine how you could wow the crowd if you opened up your defect logbook (paper or electronic) and found the defect notes you made long before.
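
If you keep the logbook electronically, even a very small record structure is enough. The following C sketch is one possible layout, not a prescribed format; the field names and the `format_defect_entry` helper are hypothetical and simply mirror the items discussed in this chapter: what you saw, what you expected, the root cause, and the fix.

```c
#include <assert.h>
#include <stdio.h>

/* One logbook entry. All field names are illustrative assumptions. */
struct defect_entry {
    const char *symptom;     /* what was seen                         */
    const char *expected;    /* what a working system should do       */
    const char *observed;    /* what the measurements actually showed */
    const char *root_cause;  /* the confirmed cause                   */
    const char *fix;         /* what was changed, and where           */
};

/* Format an entry as plain text into buf.
 * Returns the number of characters needed (as snprintf does), so the
 * caller can detect truncation by comparing the result with len. */
int format_defect_entry(char *buf, size_t len, const struct defect_entry *e)
{
    assert(buf != NULL && e != NULL);
    return snprintf(buf, len,
                    "Symptom:    %s\n"
                    "Expected:   %s\n"
                    "Observed:   %s\n"
                    "Root cause: %s\n"
                    "Fix:        %s\n",
                    e->symptom, e->expected, e->observed,
                    e->root_cause, e->fix);
}
```

For the Z80 boards, the entry would record that three of 20 boards kept resetting, that the root cause was a software delay loop about as long as the watchdog timeout with TD left open, and that the fix was a change to the delay loop in the ROM code.
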

Let’s close this chapter on the generalized process of debugging by summarizing the main points of the discussion.

  1. Stop, think, and plan before attacking the problem:
    Write down what you’ve observed and what the possible causes might be. Revise this plan as necessary. Keep a written record of what you expected to see, assuming the system is working properly, and what you’ve actually observed.
  2. Thoroughly read and understand the data sheets:
    Failure to understand how aspects of your design work is a primary reason that defects can be introduced into an embedded system.
  3. Trust Occam’s Razor:
    Explore the simple possibilities first, then move on to the more subtle problems.
  4. Do differential testing:
    Change only one variable at a time. Never change several variables at once in the hope that The Force will be with you and one of those fixes will be “the one.”
  5. Consider the best solution before moving on:
    It’s tempting to implement the quickest fix to correct the bug, but give it some thought and choose the best overall solution, not the most expedient.
  6. Log the defect:
    You can’t implement continuous improvement unless you can learn from your mistakes. Keep a record. A history of errors and their causes is an invaluable tool.

A final thought.

Another avenue of debugging is to rerun simulations of the circuit and see if they provide a clue to the behavior you are observing. We’ll discuss using simulations in a later chapter when we talk about how to best use the arsenal of debug tools that is available to you.

I’ll end this chapter with another student story. Our EE, CS, and CE students at UWB are primarily focused on earning a BS degree and then joining the workforce. A small fraction will go on to grad school to get an MS degree, and only a very small fraction will choose to go on to earn a Ph.D. Their Capstone Final Exam is the technical interview that they will likely go through with a prospective employer.

Because I was a hiring manager for many years in industry and interviewed many, many new engineers, my students tend to listen when I talk about the hiring process and only “multitask” by viewing their Twitter accounts when I talk about the subject matter of the course. When the students hand in their final project reports, I ask that they do a self-critique of their project and what they could have done better.

Easily one-third to one-half of the students describe how they just ran out of time due to debugging the hardware or the software. A typical response is “I decided to use this Arduino code I found online and …” or “I found this App Note on the manufacturer’s web site….” When I give them a project grade, I write a fairly long and detailed critique of their project and their report.

Often, I’ll suggest that they just redo their project the right way and then write the report about how they followed the development process that we covered in class and used all the best practices that we discussed. No matter how badly their project turned out the first time through, they could still write an excellent and compelling project report that would influence any hiring manager to seriously consider this student.

Picture this: right there in the report is the debug write-up, just like we discussed in class. This is the report that should make it into your portfolio, not the one that was submitted for a grade. I’m reminded of my days as a high school student taking first-year physics lab. We were assigned an experiment to measure acceleration due to gravity. Of course, our experimental data-taking left much to be desired and our data was wildly off, but no worry. We knew where we wanted to end up, so we simply started with the result (32 ft/s/s) and then we worked backward to create the data to fit.