Recently, the following defect made the news and was one of the most widely-shared articles on the New York Times web site. Here's what the article, Computer Snag Limits Insurance Penalties on Smokers said:
A computer glitch involving the new health care law may mean that some smokers won’t bear the full brunt of tobacco-user penalties that would have made their premiums much higher — at least, not for next year.
The Obama administration has quietly notified insurers that a computer system problem will limit penalties that the law says the companies may charge smokers, The Associated Press reported Tuesday. A fix will take at least a year.
Tip of the Iceberg
This defect was entirely avoidable and predictable. Its safe to expect that hundreds (if not thousands) of similar defects related to Obamacare IT projects will emerge in the weeks and months to come. Had testers used straightforward software test design prioritization techniques, bugs like these would have been easily found. Let me explain.
There's no Way to Test Everything
If the developers and/or testers were asked how could this bug could sneak past testing, they might at first say something defensive, along the lines of: "We can't test everything! Do you know how many possible combinations there are?" If you include 40 variables (demographic information, pre-existing conditions, etc.) in the scope of this software application, there would be:
possible scenarios to test. That's not a typo: 41 QUADRILLION possible combinations. As in it would take 13 million years to execute those tests if we could execute 100 tests every second. There's no way we can test all possible combinations. So bugs like these are inevitably going to sneak through testing undetected.
The Wrong Question
When the developers and testers of a system say there is no way they could realistically test all the possible scenarios, they're addressing the wrong challenge. "How long would it take to execute every test we can think of?" is the wrong question. It is interesting but ultimately irrelevant that it would take 13 million years to execute those tests.
The Right Question
A much more important question is "Given the limited time and resources we have available for testing, how can we test this system as thoroughly as possible?" Most teams of developers and software testers are extremely bad at addressing this question. And they don't realize nearly how bad they are. The Dunning Kruger effect often prevents people from understanding the extent of their incompetence; that's a different post for a different day. After documenting a few thousand tests designed to cover all of the significant business rules and requirements they can think of, testers will run out of ideas, shrug their shoulders in the face of the overwhelming number of total possible scenarios and declare their testing strategy to be sufficiently comprehensive. Whenever you're talking about hundreds or thousands of tests, that test selection strategy is a recipe for incredibly inefficient testing that both misses large numbers of easily avoidable defects and wastes time by testing certain things again and again. There's a better way.
The Straightforward, Effective Solution to this Common Testing Challenge: Testers Should Use Intelligent Test Prioritization Strategies
If you create a well-designed test plan using scientific prioritization approaches, you can reduce the number of individual tests to test tremendously. It comes down to testing the system as thoroughly as possible in the time that's available for testing. There are well-proven methods for doing just that.
There are Two Kinds of Software Bugs in the World
Bugs that don't get found by testers sneak into production for one of two main reasons, namely:
"We never thought about testing that" - An example that illustrates this type of defect is one James Bach told me about. Faulty calculations were being caused by an overheated server that got that way because of a blocked vent. You can't really blame a tester who doesn't think of including a test involving a scenario with a blocked vent.
"We tested A; it worked. We tested B; it worked too.... But we never tested A and B together." This type of bug sneaks by testers all too often. Bugs like this should not sneak past testers. They are often very quick and easy to find. And they're so common as to be highly predictable.
Let's revisit the high-profile bug Obamacare bug that will impact millions of people and take more than a year to fix. Here's all that would have been required to find it:
Was the system tested with a scenario involving an applicant who had a relatively high age? I'm assuming it must have been.
Was the system tested with a scenario involving an applicant who smoked? Again, I'm assuming it must have been.
Was the system tested with a scenario involving an applicant who had a relatively high age who also smoked? That's what triggers this important bug; apparently it wasn't found during testing (or found early enough).
If You Have Limited Time, Test All Pairs
Let's revisit the claim of "we can't execute all 13 million-years-worth of tests. Combinations like these are bound to sneak through, untested. How could we be expected to test all 13 million-years-worth of tests?" The second two sentences are preposterous.
"Combinations like these are bound to sneak through, untested." Nonsense. In a system like this, at a minimum, every pair of test inputs should be tested together. Why? The vast majority of defects in production today would be found simply by testing every possible pair of test inputs together at least once.
"How could we be expected to test all 13 million-years-worth of tests?" Wrong question. Start by testing all possible pairs of test inputs you've identified. Time-wise, that's easily achievable; its also a proven way to cover a system quite thoroughly in a very limited amount of time.
Design of Experiments is an Established Field that was Created to Solve Problems Exactly Like This; Testers are Crazy Not to Use Design of Experiments-Based Prioritization Approaches
The almost 100 year-old field of Design of Experiments is focused on finding out as much actionable information as possible in as few experiments as possible. These prioritization approaches have been very widely used with great success in many industries, including advertising, manufacturing, drug development, agriculture, and many more. While Design of Experiments test design techniques (such as pairwise testing and orthogonal array testing / OA testing) are increasingly becoming used by software testing teams but far more teams could benefit from using these smart test prioritization approaches. We've written posts about how Design of Experiments methods are highly applicable to software testing here and here, and put an "Intro to Pairwise Testing" video here. Perhaps the reason this powerful and practical test prioritization strategy remains woefully underutilized by the software testing industry at large is that there are too few real-world examples explaining "this is what inevitably happens when this approach is not used... And here's how easy it would be to avoid this from happening to you in your next project." Hopefully this post helps raise awareness.
Let's Imagine We've Got One Second for Testing, Not 13 Million Years; Which Tests Should We Execute?
Remember how we said it would take 13 million years to execute all of the 41 quadrillion possible tests? That calculation assumed we could execute 100 tests a second. Let's assume we only have one second to execute tests from those 13 million years worth of tests. How should we use that second? Which 100 tests should we execute if our goal is to find as many defects as possible?
If you have a Hexawise account, you can to your Hexawise account to view the test plan details and follow along in this worked example. To create a new account in a few seconds for free, go to hexawise.com/free.
By setting the 40 different parameter values intelligently, we can maximize the testing coverage achieved in a very small number of tests. In fact, in our example, you would only need to execute only 90 tests to cover every single pairwise combination.
The number of total possible combinations (or "tests") that are generated will depend on how many parameters (items/factors) and how many options (parameter values) there are for each parameter. In this case, the number of total possible combinations of parameters and values equal 41 quadrillion.
This screen shot shows a portion of the test conditions that would be included the first 4 tests of the 90 tests that are needed to provide full pairwise coverage. Sometimes people are not clear about what "test every pair" means. To make this more concrete, by way of a few specific examples, pairs of values tested together in the first part of test number 1 include:
Plan Type = A tested together with Deductible Amount = High
Plan Type = tested together with Gender = Male
Plan Type = A tested together with Spouse = Yes
Gender = Male tested together with State = California
Spouse = Yes tested together with Yes (and over 5 years)
And lots of other pairs not listed here
This screen shot shows a portion of the later tests. You'll notice that the values are shown in purple italics. Those values listed in purple italics are not providing new pairwise coverage. You will note in the first tests every single parameter value is providing new pairwise coverage value, toward the end few parameter value settings are providing new pairwise coverage. Once a specific pair has been tested, retesting it doesn't provide additional pairwise coverage. Sets of Hexawise tests are "front loaded for coverage." In other words, if you need to stop testing at any point before the end of the complete set of tests, you will have achieved as much coverage as possible in the limited time you have to execute your tests (whether that is 10 tests or 30 tests or 83). The pairwise coverage chart below makes this point visually; the decreasing number of newly tested pairs of values that appear in each test accounts for the diminishing marginal returns per test.
You Can Even Prioritize Your First "Half Second" of Tests To Cover As Much As Possible!
This graph shows how Hexawise orders the test plan to provide the greatest coverage quickly. So if you get through 37 of the 90 tests needed for full pairwise coverage you have already tested over 90% of all the pairwise test coverage. The implication? Even if just 37 tests were covered, there would be a 90% chance that any given pair of values that you might select at random would be tested together in the same test case by that point.
Was Missing This Serious Defect an Understandable Oversight (Because of Quadrillions of Possible Combinations Exist) or was it Negligent (Because Only 90 Intelligently Selected Tests Would Have Detected it)?
A generous interpretation of this situation would be that it was "unwise" for testers to fail to execute the 90 tests that would have uncovered this defect.
A less generous interpretation would be that it was idiotic not to conduct this kind of testing.
The health care reform act will introduce many such changes as this. At an absolute minimum, health insurance firms should be conducting pairwise tests of their systems. Given the defect finding effectiveness of pairwise testing coverage, testing systems any less thoroughly is patently irresponsible. And for health insurance software testing it is often wiser to expand to test all triples or all quadruples given the interaction between many variables in health insurance software.
Incidentally, to get full 3 way test coverage (using the same example as above) would require 2,090 tests.