When doing validations and verifications, we statistically estimate different performance characteristics of a method. The values we calculate never give exact information about the method under examination. So how can we know how trustworthy the results really are?
The standard tool for estimating the credibility of a statistical result is the confidence interval (CI), which describes the precision of an estimate. A wide confidence interval points to a lack of information, whether the result is statistically significant or not, and is a warning against over-interpreting results from small studies. Validation Manager™ calculates CI ranges for you automatically to help you interpret your results.
In Validation Manager™ we use a 95% confidence level for calculations, meaning that if you repeated your experiment 100 times, in about 95 of those experiments the calculated confidence interval would contain the true value.
What’s the difference between probability and confidence level?
Probability describes a situation where we know the distribution (essentially the mean value and variance) of the possible results from which individual data points are drawn. When the distribution is known, we know the probability of each possible result.
Confidence level describes a situation where we do not know the true distribution. Even when we know the exact concentrations that we are measuring, we don’t know the exact mean value of the results given by the instrument, because we don’t know the exact amount of bias introduced by our measurement setup. We estimate the distribution by collecting a statistically significant data set to represent the true distribution. It is possible (though rather unlikely, see xkcd comic 882) that our experiment yields such results that the measured CI gives no information about the true value, but at a 95% confidence level the true value lies within the measured confidence interval.
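The coverage idea behind the confidence level can be demonstrated with a small simulation. This is a minimal sketch using Python’s standard library; the normal-approximation interval mean ± 1.96·SD/√N is an illustrative assumption, not necessarily the formula Validation Manager™ uses:

```python
import random
import statistics

def mean_ci_95(sample):
    """Approximate 95% CI for the mean (normal approximation, 1.96 critical value)."""
    m = statistics.mean(sample)
    half = 1.96 * statistics.stdev(sample) / len(sample) ** 0.5
    return m - half, m + half

random.seed(42)
true_mean, true_sd = 10.0, 2.0
trials, covered = 1000, 0
for _ in range(trials):
    # each "experiment" draws 30 replicates from the same true distribution
    sample = [random.gauss(true_mean, true_sd) for _ in range(30)]
    lo, hi = mean_ci_95(sample)
    if lo <= true_mean <= hi:
        covered += 1
print(f"{covered / trials:.0%} of the intervals contained the true mean")
```

Each simulated experiment produces a different interval, but roughly 95% of them contain the true mean, which is exactly what the 95% confidence level promises.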
What determines the width of the confidence interval?
The confidence interval is affected by the following components:
- Selected confidence level. The more confidence you look for, the wider the interval. 95% is the standard choice for a confidence level, as it is widely considered the minimum level for any scientific use. In cases where mistakes could lead to extreme consequences, such as death, higher confidence levels are recommended. The reason 95% confidence is used whenever possible is that with higher confidence levels, the minimum sample size needed for meaningful results quickly grows to amounts that are barely feasible for most purposes (see the third point of this list).
- Variability. Samples with more variability (larger SD) generate wider confidence intervals.
- Sample size. The smaller the sample size (N) of the experiment, the wider the confidence intervals will be. There is an inverse square root relationship between confidence interval width and sample size: if you want to cut your margin of error in half, you need to approximately quadruple your sample size. Note also that, for example, in a probit LoD experiment the sample size means both the replicate count and the number of dilutions.
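The inverse square root relationship can be seen directly in the half-width formula. This sketch again assumes the normal-approximation half-width 1.96·SD/√N and compares two hypothetical sample sizes:

```python
import random
import statistics

def ci_half_width(sample):
    """95% half-width under the normal approximation: 1.96 * SD / sqrt(N)."""
    return 1.96 * statistics.stdev(sample) / len(sample) ** 0.5

random.seed(1)
# two hypothetical experiments on the same material, one with 4x the sample size
small = [random.gauss(100.0, 5.0) for _ in range(25)]
large = [random.gauss(100.0, 5.0) for _ in range(100)]
w_small = ci_half_width(small)
w_large = ci_half_width(large)
print(f"N=25 half-width: {w_small:.2f}, N=100 half-width: {w_large:.2f}")
```

Quadrupling N roughly halves the half-width, because √4 = 2 appears in the denominator.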
What conclusions can be made from your results?
If you have a goal for achieving a certain value for a measured quantity (for example if you are verifying a LoD claim of a test manufacturer), what can you conclude from your results?
- If the desirable value falls below the measured CI, the experiment shows with 95% confidence level that the true value lies above your goal. For example if a LoD claim is below the 95% CI, the goal has not been met.
- If the desirable value falls within the measured CI, you cannot conclude whether the goal is met, but with a 95% confidence level the true value lies within the confidence interval. Depending on your CI you might either want to grow the sample population to achieve more precise results, or you might be happy with the result (if it is close enough to your goal).
- If the desirable value is above the measured CI, the experiment shows with 95% confidence level that the true value lies below your goal. For example if a LoD claim is above the 95% CI, the goal has been met.
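The three cases above can be encoded as a simple decision rule. The function name and the example CI values are hypothetical, for illustration only:

```python
def compare_claim_to_ci(claim, ci_low, ci_high):
    """Interpret a target value against a measured 95% CI (the three rules above)."""
    if claim < ci_low:
        return "above: with 95% confidence the true value lies above the claim"
    if claim > ci_high:
        return "below: with 95% confidence the true value lies below the claim"
    return "inconclusive: the claim falls within the CI"

# hypothetical LoD verification: measured 95% CI is (120, 180) copies/mL
print(compare_claim_to_ci(100, 120, 180))  # manufacturer's claim too optimistic
print(compare_claim_to_ci(200, 120, 180))  # claim is met
print(compare_claim_to_ci(150, 120, 180))  # cannot conclude either way
```

For a LoD claim, the first case means the goal has not been met and the second means it has; in the third case you may need more data.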
When comparing two quantitative methods or instruments with each other, we are interested in whether there is bias between the results. Again, we cannot determine a value for bias only by averaging the differences of individual samples; we must also look at the CI.
- If both ends of the CI of your bias are positive, then with 95% confidence level there is bias giving higher results than the comparative method or instrument. Depending on your CI you might or might not be able to estimate the magnitude of the bias. (To make a better estimation you could grow the sample population used in the experiment.)
- If zero (the x-axis in a difference plot) falls within the CI, you cannot conclude that there is bias, but with a 95% confidence level any bias would not be greater than what fits within the confidence interval. Depending on your CI you might either be happy with the result (if the possible bias is so small that you can neglect it) or want to grow the sample population to achieve more precise results.
- If both ends of the CI of your bias are negative, then with 95% confidence level there is bias giving lower results than the comparative method or instrument. Depending on your CI you might or might not be able to estimate the magnitude of the bias. (To make a better estimation you could grow the sample number used in the experiment.)
- Please note also that bias is rarely constant over the whole measurement range. For example it is possible that small values show negative bias and large values show positive bias. Sometimes part of the measurement range shows acceptable bias and part of it shows unacceptable levels of bias.
If you don’t have a stated goal to compare to, CI gives you information about the credibility of the calculated value. For example if you measure precision, with 95% confidence level the true precision in similar operating conditions lies within the CI. When evaluating whether the measured result is acceptable, you should consider the whole range given by CI, not only the calculated standard deviation or coefficient of variation. The true value can be anywhere within the CI, and you cannot state that some value within the CI would be more likely true than the others.
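When the quantity of interest is a standard deviation rather than a mean, a CI can still be attached to it. A minimal sketch using a percentile bootstrap (the classical alternative would be a chi-square-based interval; the replicate data here are simulated):

```python
import random
import statistics

def sd_ci_bootstrap(sample, reps=2000, seed=0):
    """Percentile-bootstrap 95% CI for the standard deviation."""
    rng = random.Random(seed)
    sds = sorted(
        statistics.stdev(rng.choices(sample, k=len(sample))) for _ in range(reps)
    )
    return sds[int(0.025 * reps)], sds[int(0.975 * reps)]

# hypothetical precision experiment: 20 replicates of one control material
rng = random.Random(7)
replicates = [rng.gauss(50.0, 1.5) for _ in range(20)]
lo, hi = sd_ci_bootstrap(replicates)
print(f"SD estimate {statistics.stdev(replicates):.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Following the pessimistic reading recommended below, the upper limit of this interval, not the point estimate, is what you would treat as the precision of the method.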
You could think of the calculated result a little bit like predicting when childbirth starts. The calculated value is the exact day when 40 weeks of pregnancy are completed, with a confidence interval extending from the 38th week to the 42nd week. Most babies are born between 38 and 42 weeks, and only a few on the day when exactly 40 weeks have passed. So even though we like to operate with definite values, we must understand that they are only estimates, and that the measured confidence interval represents the examined quantity better than any single value we may have as an estimate.
So what would be acceptable for confidence intervals?
Unfortunately, there is no unambiguous answer to the question of when CIs are acceptable. As discussed in the previous section, the CI tells you whether your results are statistically significant, i.e. whether you have reason to state that a bias exists or that a goal has or has not been met. When the value you are comparing against lies outside the measured CI, you can draw these conclusions (with 95% confidence) regardless of the size of the CI.
When you don’t have a value to compare against, or the compared value falls within the measured CI, you need to consider whether the CI is acceptable. What is acceptable depends on what you are validating. The only universal rule of thumb is to make as pessimistic an interpretation of the CI as possible. For example, when establishing precision, you should consider the upper limit of the CI as the true precision. When establishing bias, you should consider both the upper and lower limits of the CI to evaluate whether that much uncertainty is acceptable. How likely would the differences between these outermost values be to affect how a patient is treated, and if they would, how dramatic could the consequences be? What ranges of results would give definite answers about a patient’s condition, and is that enough for clinical use? These are the profound questions guiding validations, and they need to be addressed case by case.
If you have doubts, growing the sample population adds confidence.