#2 The Root Cause

Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.

Arthur Conan Doyle

Consultant looking for Root Cause

Satisfaction and a feeling of accomplishment are very important in our daily work. For me, working with software has always provided plenty of occasions for that warm, fuzzy feeling of a job well done. Today I want to write a few words about one experience that is especially close to my heart and that is shared by almost every role at every stage of the software lifecycle:

Bug fixing! Or, more precisely, THE BUG fixing - because not all bugs are created equal. Most programming mishaps are caught early on and have a clear cause, but sometimes a BUG is smart enough to avoid scrutiny by developers and QAs, weird enough to navigate between sets of automated tests, and located somewhere in a narrow corner, making it disappear from the scenarios of the User Acceptance Tests. How do you deal with these peculiar behaviours of the system, the unexpected errors, or unforeseen corner cases?

First: The Issue Description

This is often overlooked, especially when the bug was detected by someone who doesn't work in IT on a daily basis (e.g., end users). A clear and precise description of the issue is the key to finding a solution. Screenshots, timelines, IDs of affected cases, involved users and parties - all of these might prove invaluable when working on a bug.
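One way to make such a description hard to leave half-empty is to give it a fixed shape. The snippet below is only a sketch of that idea: the `IssueReport` class, its field names, and the sample case ID are made up for illustration and do not come from any particular ticketing tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IssueReport:
    """Hypothetical template for the details listed above."""
    summary: str
    expected: str = ""
    actual: str = ""
    timeline: List[str] = field(default_factory=list)
    case_ids: List[str] = field(default_factory=list)
    users: List[str] = field(default_factory=list)
    screenshots: List[str] = field(default_factory=list)

    def missing(self) -> List[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in vars(self).items() if not value]


# Illustrative report: the summary and case ID are invented sample values.
report = IssueReport(summary="Case cannot be closed", case_ids=["C-1042"])
print(report.missing())
```

Anything returned by `missing()` is a question worth sending back to the reporter before the investigation starts.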

Second: The Reproduction Path

There is a reason this particular bug slipped through the other quality gates and ended up where it did. It won't show up in your "typical" run of a case or test scenario. That's why a detailed description is so important - and even then, it will likely require additional resources (e.g., logs, specific entry data, pre-setting starting values) and multiple runs (sometimes with screen sharing) before you pinpoint the exact steps needed to reproduce the issue.

Third: Limit Degrees of Freedom

Now that you know how to reproduce the issue consistently, limit the moving parts within the scenario. Try tweaking only one parameter at a time to see if it has any effect on system behaviour. Switch to a new user, run the case with a different set of initial data, or open it in a different browser. Step by step, your reproduction path will become narrower, eventually leading you to...

Fourth: The Toggle

In my experience, the weirdest bugs are usually caused by very small changes in the logic - but what makes them hard to catch is that they manifest only if several conditions are fulfilled at the same time, or that their cause is far removed (code-wise and time-wise) from the place where they manifest. The moment you find the "if" condition that takes a wrong turn, or the parameter that is nulled at the wrong point, the fuzzy feeling will kick in - but don't be fooled, we're not done yet!
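The third and fourth steps can be sketched together: start from the failing scenario, change exactly one parameter to a known-good value, and note which single changes make the bug disappear. Everything in this snippet is an assumption made for illustration - the parameter names, the `find_toggles` helper, and the `reproduces_bug` function, which stands in for actually running the case against the system.

```python
from typing import Any, Callable, Dict, List


def find_toggles(
    failing: Dict[str, Any],
    passing: Dict[str, Any],
    reproduces_bug: Callable[[Dict[str, Any]], bool],
) -> List[str]:
    """Return the parameters that, changed alone, make the bug disappear."""
    toggles = []
    for name, good_value in passing.items():
        if failing.get(name) == good_value:
            continue  # parameter does not differ, nothing to test
        trial = dict(failing)     # start from the failing scenario
        trial[name] = good_value  # change exactly one parameter
        if not reproduces_bug(trial):
            toggles.append(name)
    return toggles


# Hypothetical stand-in for the system under test: the bug shows up only
# for a legacy user whose case has no region assigned.
def reproduces_bug(scenario: Dict[str, Any]) -> bool:
    return scenario["user_type"] == "legacy" and scenario["region"] is None


failing = {"user_type": "legacy", "region": None, "browser": "firefox"}
passing = {"user_type": "standard", "region": "EU", "browser": "chrome"}

print(find_toggles(failing, passing, reproduces_bug))
# prints ['user_type', 'region']
```

In this made-up scenario the browser turns out to be irrelevant, while the user type and the missing region both matter - a bug that needs two conditions at once.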

Fifth: The Root Cause

Play with the Toggle, change it to see if it fixes your problem, but don't do it blindly. You've delved deep into the system, and now it's time to stop and look around. Why is the condition written this way? Why is the data read from this table, or that service? What is the purpose of this transformation? Only when you understand how all the elements along your entire path fit together (that's why the third step is so important!) can you confidently apply your fix - and let the satisfaction wash over you!

Sixth: Clean Up

When the rush is over, it's time for a bit of cleanup: deployment planning, regression tests, adding missing unit tests, and planning refactoring to avoid future vulnerabilities. Make sure the bug won't raise its head again, or mutate into something completely new!
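A regression test is the cheapest way to pin the fix down: it replays the exact reproduction path and fails loudly if the bug ever comes back. The sketch below assumes a pytest-style runner; `region_label` is a hypothetical fixed function, not code from a real project.

```python
def region_label(case: dict) -> str:
    """Hypothetical fixed function: a missing region used to break it."""
    region = case.get("region")
    return region if region is not None else "unassigned"


def test_legacy_case_without_region():
    # The exact scenario from the reproduction path.
    case = {"user_type": "legacy", "region": None}
    assert region_label(case) == "unassigned"


def test_typical_case_still_works():
    # Guard against the fix breaking the ordinary path.
    case = {"user_type": "standard", "region": "EU"}
    assert region_label(case) == "EU"
```

The second test matters as much as the first: a fix that breaks the typical case is exactly the kind of mutation this step is meant to prevent.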

Consultant fighting the Regression Hydra
