Debugging Hardware

by Dr. Howard Johnson. First printed in EDN magazine, August 16, 2001

Satveer Kaur writes:

I am a recent graduate with no experience in the lab. I am testing a four-port WAN-adapter card. At room temperature, it transmits and receives data perfectly, rarely losing a data packet. In my heat chamber, however, one of the ports starts failing. The only difference between the passing ports and the failing port is the setup time for the Tx data (14 nsec for the failing and 30 nsec for the passing).

I have no idea why this problem is happening. It's driving me crazy!

Dr. Johnson replies:

Debugging new hardware can be difficult and trying. The most common mistakes that new engineers make when first debugging a system are:

Trying to debug too much at once,
Not testing their assumptions, and
Keeping inadequate records.

Too much at once—Jumping the gun to complete functional testing of any new, highly complex system is, as you have discovered, a waste of time. A more experienced engineer would first break the system into pieces for the initial tests, because he would recognize that there are likely to be a multitude of problems. A system with multiple problems often displays a complex array of symptoms that come and go with time, temperature, and test procedure. You can't make progress with a system like that. When your symptoms seem confusing, break the system down into successively smaller pieces until each piece contains at most one design flaw. Only then can you properly determine how to fix it.

For example, find a way to test the WAN card without its associated processor. This step removes software as a possible culprit. You can perhaps eliminate the processor by wiring the data pins on the transmit chip to a pattern generator. Many transceiver chips respond perfectly well to a repeating pattern of bytes, sending the same message over and over. Alternatively, you can test the cable by itself, without the WAN card; it is possible that the cable does not meet its attenuation specifications at an elevated temperature. Or, you can play good serial data into the receiver from a working data source.

As you break the system down into smaller and smaller subsections, you will eventually pin down which parts are actually failing.

Not testing—The second mistake happens when your impatience causes you to abandon the standard scientific method. Don't concentrate on finding a fix. Rather, postulate a theory about what causes your problem, test your theory to determine whether it legitimately causes the problem, and, then, only when you are sure you understand the nature of the problem, allow yourself to dream up a patch to fix it.

In your description, you comment that the setup time for Tx data differs between the good and the bad channels. Given that information, you should test whether that difference in setup timing actually causes a packet-error problem. I suggest that, to perform this test, you should degrade the timing on a good channel to see whether you can make it bad. Degrading the timing on a good channel is a better idea than trying to improve the timing on the bad channel for two reasons: It's probably easier to do, and, if you know the good channel is defect-free, it can unambiguously tell you the extent to which transmit timing affects the packet-error rate. If you instead continue to concentrate on the broken channel, other issues may mask the true effect of improvements in transmit timing.
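As a rough way to size that experiment, the sketch below uses the two setup times from the letter (30 nsec passing, 14 nsec failing) to work out how much extra delay the good channel would need to mimic the bad one. It then converts that delay into a cable length. The velocity factor of 0.66 is an assumption, typical of solid-polyethylene coaxial cable, and the function names are illustrative only.

```python
# Rough sizing of a timing-degradation experiment.
# Assumption (not from the article): the delay is added with coaxial cable
# whose velocity factor is 0.66, typical of solid-polyethylene coax.

C_M_PER_NS = 0.299792458  # speed of light in vacuum, metres per nanosecond


def added_delay_ns(good_setup_ns: float, bad_setup_ns: float) -> float:
    """Delay that shrinks the good channel's setup time to the bad channel's."""
    return good_setup_ns - bad_setup_ns


def coax_length_m(delay_ns: float, velocity_factor: float = 0.66) -> float:
    """Length of cable whose one-way propagation delay equals delay_ns."""
    return delay_ns * velocity_factor * C_M_PER_NS


delay = added_delay_ns(good_setup_ns=30.0, bad_setup_ns=14.0)
length = coax_length_m(delay)
print(f"extra delay needed: {delay:.1f} ns")
print(f"coax length at velocity factor 0.66: {length:.2f} m")
```

Under these assumptions, the 16 nsec difference corresponds to a little over 3 m of cable. A different cable type or a capacitive-load approach would change the numbers but not the method.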

To degrade the timing, you might use a short section of coaxial cable or some big capacitive loads to delay the Tx data. A more thorough approach is a setup-and-hold stress test, which continuously varies the Tx clock phasing to explicitly measure the setup-and-hold margins on each channel. If data-timing differences are causing your problem, the setup-and-hold stress test will pinpoint the timing adjustments needed to fix it.
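The results of such a sweep can be reduced to margins with a few lines of code. The following is a minimal sketch, not a description of any particular test setup. It assumes the sweep yields a packet-error count at each Tx clock phase offset, and that a positive offset means the clock edge arrives later relative to the data (which adds setup time and consumes hold time). The sweep data and the names `passing_window` and `margins` are made up for illustration.

```python
# Sketch of reducing a setup-and-hold sweep to timing margins.
# Assumed convention: offset is the Tx clock phase shift in ns, and a
# positive offset moves the clock edge later relative to the data.
# The sweep data below is invented for illustration.

from typing import Dict, Optional, Tuple


def passing_window(errors_by_offset: Dict[float, int],
                   max_errors: int = 0) -> Optional[Tuple[float, float]]:
    """Return (lowest, highest) offset of the contiguous passing region
    around the nominal phase (offset 0.0), or None if nominal fails."""
    if errors_by_offset.get(0.0, max_errors + 1) > max_errors:
        return None
    offsets = sorted(errors_by_offset)
    lo = hi = offsets.index(0.0)
    while lo > 0 and errors_by_offset[offsets[lo - 1]] <= max_errors:
        lo -= 1
    while (hi < len(offsets) - 1
           and errors_by_offset[offsets[hi + 1]] <= max_errors):
        hi += 1
    return offsets[lo], offsets[hi]


def margins(errors_by_offset: Dict[float, int],
            max_errors: int = 0) -> Optional[Dict[str, float]]:
    """Convert the passing window into setup and hold margins in ns."""
    window = passing_window(errors_by_offset, max_errors)
    if window is None:
        return None
    lowest, highest = window
    # Moving the clock earlier eats setup margin; later eats hold margin.
    return {"setup_ns": -lowest, "hold_ns": highest}


# Hypothetical sweeps in 4 ns steps: packet errors seen at each offset.
good_channel = {-16.0: 40, -12.0: 0, -8.0: 0, -4.0: 0,
                0.0: 0, 4.0: 0, 8.0: 0, 12.0: 3}
suspect_channel = {-16.0: 95, -12.0: 60, -8.0: 12, -4.0: 0,
                   0.0: 0, 4.0: 0, 8.0: 0, 12.0: 7}

print("good channel:", margins(good_channel))
print("suspect channel:", margins(suspect_channel))
```

The margins are only as precise as the sweep step, so a coarse first pass followed by a finer sweep near the edges of the passing window is a reasonable refinement.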

Inadequate recordkeeping—The third mistake, I fear, is the most widespread. In the process of debugging a major design, you will try hundreds of little experiments. For example, you may wish to find out whether the system works when it's cold, hot, idle, interrupted, touched with a probe, or sending all ones or all zeros. Nobody can remember all that data. Keep meticulous written records. Your old test notes, viewed in the light of new knowledge about the true nature of the problem, often reveal nuggets of information heretofore unnoticed. Even more often, careful post analysis of the old test data reveals significant holes in your testing program—holes you must fill with more testing.
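Written records are easier to re-examine later if every experiment is logged with the same fields. The sketch below shows one possible way to do that with a CSV log. The field names, the `log_experiment` helper, and the sample entries are all hypothetical, chosen only to mirror the kinds of conditions mentioned above.

```python
# Minimal experiment log: one CSV row per test, always the same columns.
# Field names and sample values are illustrative, not from the article.

import csv
import io
from datetime import datetime, timezone

FIELDS = ["timestamp", "port", "temperature_c", "data_pattern",
          "change_made", "packets_sent", "packets_lost", "notes"]


def log_experiment(stream, **record):
    """Append one experiment to an open text stream as a CSV row.

    Writes the header first if the stream is empty. Fields that are not
    supplied are left blank; unknown field names raise ValueError
    (the default behaviour of csv.DictWriter).
    """
    record.setdefault(
        "timestamp", datetime.now(timezone.utc).isoformat(timespec="seconds"))
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    if stream.tell() == 0:
        writer.writeheader()
    writer.writerow(record)


# An in-memory stream keeps the example self-contained. For a real log,
# open a file with open("debug_log.csv", "a", newline="") instead.
log = io.StringIO()
log_experiment(log, port=1, temperature_c=25, data_pattern="all ones",
               change_made="none (baseline)", packets_sent=100000,
               packets_lost=0, notes="room temperature reference")
log_experiment(log, port=1, temperature_c=25, data_pattern="all ones",
               change_made="added delay on Tx data", packets_sent=100000,
               packets_lost=42, notes="errors appear only with added delay")
print(log.getvalue())
```

Recording the change made and the conditions in separate columns makes it easier to spot combinations that were never tested when the log is reviewed later.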