Wednesday, November 24, 2010

SQA and the Scientific Method

My son has been learning about the scientific method in his science class. While helping him with his homework, I realized that I use the scientific method whenever I find a bug.

For example, suppose you're testing remote access software installed on a Windows client. You notice that on one system, it keeps losing its connection to the server. This is something I ran into once. Now, if you're a test monkey, you'll write up a bug saying, "Brokey brokey, no worky" and let development figure it out.

However, if you're reading this blog, you like making it easy for developers. So, you'll wind up asking yourself, "Why does this one system have a problem with disconnecting from the server?" At this point, you've just started approaching this from a scientific point of view.

Next, you'll do some research and eliminate variables. What's unique about the one system with the problem? What could cause the connection to drop? Is it the network it's connected to? Is it a bad cable? Does it just not like me?
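One way to answer "what's unique about this system?" is to write down what you know about a working machine and the failing one, then look only at the differences. Here's a minimal sketch in Python. The attribute names and values are made up for illustration; they aren't the actual inventory from my case.

```python
def differing_variables(good, bad):
    """Return {attribute: (good_value, bad_value)} for every attribute that differs."""
    keys = sorted(set(good) | set(bad))
    return {k: (good.get(k), bad.get(k)) for k in keys if good.get(k) != bad.get(k)}


# Hypothetical descriptions of the two machines (assumed values).
good_system = {"os": "Windows", "form_factor": "desktop", "network": "lab A", "speedstep": False}
bad_system = {"os": "Windows", "form_factor": "laptop", "network": "lab B", "speedstep": True}

candidates = differing_variables(good_system, bad_system)
for name, (good_value, bad_value) in candidates.items():
    print(f"{name}: good={good_value!r} bad={bad_value!r}")
```

Anything the two systems have in common drops out, and what's left is your list of candidate variables.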

Once you've decided what could be causing the problem, you'll start with the first hypothesis. Pick the one that's simplest and easiest to test, so maybe it's the network. You'll test the hypothesis by moving the "bad" computer to the same network as the "good" computer. In fact, you could even use the same network cable that the "good" computer used. If it still fails, you've eliminated three variables (network, cable, and port on the switch). If it works, you've narrowed the cause down to those three.

If it still fails, it's back to the hypothesis and experiment loop. You'll want to keep eliminating variables until you find the cause of the problem. Maybe it's faulty hardware. Maybe it's another app. Maybe it's a feature unique to the computer.
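If it helps to see the loop written down, here's a rough sketch in Python. Everything in it is illustrative: the `Hypothesis` class, the cost numbers, and the hard-coded experiment results are assumptions standing in for the real, manual experiments you'd run on the machine.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Hypothesis:
    name: str
    cost: int  # rough effort to test; lower means easier
    experiment: Callable[[], bool]  # True if changing this variable makes the failure go away


def find_cause(hypotheses: List[Hypothesis]) -> Tuple[Optional[str], List[str]]:
    """Test the cheapest hypotheses first; return the confirmed cause and what was ruled out."""
    ruled_out = []
    for h in sorted(hypotheses, key=lambda h: h.cost):
        if h.experiment():
            return h.name, ruled_out
        ruled_out.append(h.name)
    return None, ruled_out


# Simulated results (assumed): swapping the network and closing other apps
# didn't help, but turning off SpeedStep did.
hypotheses = [
    Hypothesis("SpeedStep", cost=3, experiment=lambda: True),
    Hypothesis("network", cost=1, experiment=lambda: False),
    Hypothesis("other app", cost=2, experiment=lambda: False),
]

cause, ruled_out = find_cause(hypotheses)
print("cause:", cause)
print("ruled out:", ruled_out)
```

The list of ruled-out variables is worth keeping, since it belongs in the bug report right next to the cause.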

In my case, the failing system was a laptop. After some experimentation, I traced the problem to the SpeedStep feature. If I turned that off, it worked fine. I entered the bug. When a developer got it, the root cause was found in minutes. It turned out that the API used to time the 60-second keep-alive packet failed if the processor speed changed. When the app launched, the CPU usage was high, so the processor ran at full speed. Once the system went idle, the processor slowed down, which slowed the timer down. As a result, the client missed its keep-alive deadline, and the server assumed the client had disconnected and closed the pipe.
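To see why a clock-speed change can break a timer like that, here's a toy model in Python. It is not the actual API or the product's code. I'm assuming a timer that counts CPU cycles and was calibrated at the launch frequency, and the clock speeds and the 90-second server timeout are made-up numbers.

```python
def seconds_until_keepalive(calibrated_hz, actual_hz, interval_s=60):
    """Wall-clock seconds before a cycle-counting timer fires.

    The timer waits for calibrated_hz * interval_s cycles, but the CPU is
    really delivering actual_hz cycles per second.
    """
    cycles_needed = calibrated_hz * interval_s
    return cycles_needed / actual_hz


def server_drops(calibrated_hz, actual_hz, server_timeout_s=90):
    """True if the keep-alive would arrive after the (assumed) server timeout."""
    return seconds_until_keepalive(calibrated_hz, actual_hz) > server_timeout_s


FULL_SPEED = 2_000_000_000  # hypothetical 2.0 GHz at launch
IDLE_SPEED = 1_000_000_000  # hypothetical 1.0 GHz once throttled

print(seconds_until_keepalive(FULL_SPEED, FULL_SPEED))  # timer fires on schedule
print(seconds_until_keepalive(FULL_SPEED, IDLE_SPEED))  # timer fires late
print(server_drops(FULL_SPEED, IDLE_SPEED))
```

In this model the "60-second" timer takes 120 seconds of real time once the clock speed halves, which is long enough for a server with a 90-second timeout to give up on the client.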

A good bug report starts with a question, followed by some research. After that, it's a cycle of coming up with a hypothesis, testing it, and repeating until you can prove a hypothesis and find the cause. Finally, you report the findings to a developer through a bug report and, hopefully, get the bug fixed.