We've been discussing the state of modern QA in the office lately. A common theme across some of my recent client engagements is that the team's gut sense doesn't align with the metrics we see in the bug tracker. "We closed 10 out of 12 issues in Jira this sprint," a team will report. Looking at the numbers, this is great news, but we can easily start to believe that there are only two bugs left. In our gut, we know there are more out there, so the question is: can we get a better metric?
I came across a Hacker News discussion entitled "Maybe getting rid of your QA team was bad", which has this comment from a Sage IBM Engineer of Yore:
At the start of my career (late 70s), I worked at IBM (Hursley Park) in Product Assurance (their version of QA). We wrote code and built hardware to test the product that was our area of responsibility (it was a word processing system). We wrote test cases that our code would drive against the system under test. Any issues we would describe in general terms to the development team -- we didn't want them to figure out our testcases -- we wanted them to fix the bugs. Of course, this meant that we would find (say) three bugs in linewrapping of hyphenated words and the use of backspace to delete characters, and then the development team would fix four bugs in that area but only two of the actual bugs that we had found. This meant that you could use fancy statistics to estimate the actual number of bugs left.
We haven't been trying to measure the number of undiscovered defects in our systems, yet we feign surprise when users find things that slipped past our developers and testers.
Maybe, instead of seeking to get the number of defects in our issue tracker down to 0 (unlikely in any case), we should set a bar for the predicted number of defects we are willing to accept in our system, and then track that baseline against other metrics. Note that this applies not just to outright bugs, but also to usability and engagement issues.
In the Sage IBM Engineer of Yore's example, I don't know specifically what "fancy statistics" he's referring to, but I did take an engineering statistics class 20 years ago, and I have ChatGPT, so I'll try it as an exercise here:
We know that:
- The QA team found 3 bugs.
- The development team fixed 4 bugs in the same area.
- Only 2 of the development team's fixes matched bugs QA had actually found.
That means 3 + 4 - 2 = 5 unique bugs are known so far.
Before we move on to the math, ask yourself: how many total bugs do you think exist in this system?
The fanciest statistical technique I personally am aware of is this one, Bayes' theorem:

P(A|B) = P(B|A) × P(A) / P(B)
where P(A|B) is the probability of a given bug being found, given the number of bugs found by QA.
But we don't actually have P(A) (the prior bug density) or P(B|A) (the likelihood of the development team finding the observed bugs, given the total number of bugs and the bugs QA found), so we'll have to estimate for now, though we can eventually track these numbers once we have a better sense for them.
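Just to make the formula concrete, here is a tiny sketch of Bayes' theorem as a function; the inputs plugged in at the bottom are placeholders for illustration, not real measurements.

```python
def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Placeholder values only; as noted above, we don't actually have these numbers yet.
print(posterior(p_b_given_a=0.6, p_a=0.3, p_b=0.4))  # ≈ 0.45
```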
Let's denote:
- pQA: the probability that the QA team finds any given bug
- pDev: the probability that the development team finds any given bug
Given that the QA team found 3 bugs and the development team found 4, but only 2 were common, we can try to estimate pQA and pDev and then use these to estimate the total number of bugs.
Here, assuming the two teams find bugs independently, the probability that a known bug was found by both teams is:

P(found by both | found by at least one team) = (pQA × pDev) / (pQA + pDev - pQA × pDev)
We can use the ratio of the bugs found by both teams to the total unique bugs found as the left-hand side of this equation and solve for pQA and pDev, assuming they are equal (which might not be true). Let's denote this shared probability as p. Then:

p^2 / (2p - p^2) = p / (2 - p) = 2 / 5

which solves to p = 4/7 ≈ 0.571.
So the estimated probability of each individual team finding any given bug would be about 0.571.
Using this approach, which works around the missing pieces of Bayes' theorem by estimating each team's detection probability directly, the probability that a given bug has been found by at least one of the two teams is 1 - (1 - 0.571)^2 ≈ 0.82, so the estimated total number of bugs in the system is approximately 5 / 0.82 ≈ 6.1.
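For anyone who wants to play with the numbers, here is a minimal Python sketch of the calculation above, under the same assumption of equal and independent per-team detection probabilities; the function name is mine, and this is an illustration rather than a production-ready estimator.

```python
def estimate_total_bugs(found_qa: int, found_dev: int, found_both: int) -> float:
    """Estimate the total number of bugs from the overlap between two teams' findings."""
    unique_found = found_qa + found_dev - found_both   # 3 + 4 - 2 = 5

    # Share of the known bugs that both teams found.
    overlap_ratio = found_both / unique_found          # 2 / 5 = 0.4

    # With an equal, independent per-team detection probability p,
    # overlap_ratio = p^2 / (2p - p^2) = p / (2 - p); solve for p.
    p = 2 * overlap_ratio / (1 + overlap_ratio)        # ≈ 0.571

    # Probability that at least one of the two teams finds a given bug.
    p_found_by_either = 1 - (1 - p) ** 2               # ≈ 0.82

    return unique_found / p_found_by_either            # ≈ 6.1


print(round(estimate_total_bugs(3, 4, 2), 1))          # 6.1
```

This is essentially the two-list capture-recapture idea; the classic Lincoln-Petersen estimate, found_qa × found_dev / found_both = 3 × 4 / 2 = 6, gives a similar answer without assuming the two probabilities are equal.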
This is a rough estimate, and it depends on the assumption that both teams have an equal and independent probability of finding any given bug, which may not be true. The point here, however, is that we may be able to get a better picture of how many defects exist in the system, and then make informed delivery decisions based on that, rather than chasing defects post-delivery.
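As a sketch of what such a delivery decision might look like in practice, here is a toy go/no-go check; the threshold is a made-up policy number, not a recommendation.

```python
def ready_to_ship(estimated_total_bugs: float, bugs_found: int,
                  acceptable_undiscovered: float = 2.0) -> bool:
    """Ship only if the predicted number of not-yet-found defects is under our bar."""
    predicted_undiscovered = estimated_total_bugs - bugs_found
    return predicted_undiscovered <= acceptable_undiscovered


# Using the worked example: roughly 6.1 bugs estimated in total, 5 already known.
print(ready_to_ship(6.1, 5))  # True: about 1.1 predicted unknown bugs, under the hypothetical bar of 2
```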
I am a principal consultant at devobsessed. You can find me on LinkedIn and GitHub.