Assessment of analysis-of-variance-based methods to quantify the random variations of observers in medical imaging measurements: guidelines to the investigator.

Research paper by William F. A. Klein Zeggelink, Augustinus A. M. Hart, Kenneth G. A. Gilhuijs

Indexed on: 13 Aug '04
Published on: 13 Aug '04
Published in: Medical Physics


The random variations of observers in medical imaging measurements negatively affect the outcome of cancer treatment and should be taken into account during treatment by applying safety margins derived from estimates of these random variations. Analysis-of-variance (ANOVA) based methods are the preferred techniques for assessing the true individual random variations of observers, but the number of observers and the number of cases must be taken into account to achieve meaningful results. Our aim in this study is twofold: first, to evaluate three representative ANOVA-based methods for typical numbers of observers and typical numbers of cases; second, to establish guidelines that help the investigator determine which method, how many observers, and how many cases are required to obtain an a priori chosen performance. The ANOVA-based methods evaluated in this study are an established technique (pairwise differences method: PWD), a new approach providing additional statistics (residuals method: RES), and a generic technique that uses restricted maximum likelihood (REML) estimation. Monte Carlo simulations were performed to assess the performance of the ANOVA-based methods, expressed by their accuracy (closeness of the estimates to the truth), their precision (standard error of the estimates), and the reliability of their statistical test for the significance of a difference in the random variation of an observer between two groups of cases. The highest accuracy is achieved using REML estimation, but for datasets of at least 50 cases or arrangements with 6 or more observers, the differences between the methods are negligible, with deviations from the truth well below ±3%.
For datasets of up to 100 cases, the precision of the estimated random variations is improved most effectively by increasing the number of cases, whereas for datasets of more than 100 cases, precision is improved most efficiently by increasing the number of observers. For datasets of at least 50 cases, the standard error ranges from 30% or less with 3 observers down to 10% or less with 8 observers, and the differences in precision between the methods are negligible. The F test (PWD) is very anticonservative and should not be used, while the t test (RES) is reliable for datasets of at least 2 × 50 cases evaluated by 4 or more observers. The likelihood-ratio test (REML estimation) consistently indicates the significance of a difference in the random variation of an observer between two groups of cases, regardless of the number of cases and regardless of the number of observers. If a statistical package that performs REML estimation is available, and the investigator feels confident using it, REML estimation is the preferred method for studies that involve fewer than 50 cases evaluated by fewer than 6 observers. Otherwise, the RES method is an excellent alternative because of its straightforward implementation, its completeness with respect to the provided statistics, and its overall sufficient accuracy, precision, and reliability of the provided statistical test. If neither the RES method nor REML estimation can provide sufficient performance, either more observers or more cases must be included.
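
To make the accuracy and precision criteria concrete, the following is a minimal Python sketch of a Monte Carlo experiment in the spirit of the pairwise-differences idea. It is not the authors' implementation of PWD, RES, or REML. It assumes a simple measurement model (not stated in the abstract) in which observer i measures case k as x_ik = t_k + b_i + e_ik, with an independent random error e_ik of variance sigma_i^2. Under that assumption, the variance across cases of the difference between observers i and j estimates sigma_i^2 + sigma_j^2, and with three or more observers the individual variances follow from a least-squares solve. All function names, the case-to-case spread, the bias spread, and the number of runs are hypothetical choices made for illustration, and accuracy and precision are computed here on variances rather than on standard deviations.

```python
from itertools import combinations

import numpy as np


def simulate(n_cases, sigmas, rng):
    """Simulate x[k, i] = case truth t_k + observer bias b_i + random error e_ik."""
    n_obs = len(sigmas)
    truth = rng.normal(0.0, 5.0, size=(n_cases, 1))   # case-to-case spread (assumed)
    bias = rng.normal(0.0, 1.0, size=(1, n_obs))      # systematic offsets (assumed)
    noise = rng.normal(0.0, 1.0, size=(n_cases, n_obs)) * np.asarray(sigmas)
    return truth + bias + noise


def pairwise_variances(x):
    """Estimate each observer's random-error variance from pairwise differences.

    Requires at least 3 observers; with 2 the linear system is underdetermined.
    """
    n_obs = x.shape[1]
    pairs = list(combinations(range(n_obs), 2))
    design = np.zeros((len(pairs), n_obs))
    pair_var = np.empty(len(pairs))
    for row, (i, j) in enumerate(pairs):
        design[row, [i, j]] = 1.0
        # Case truth and observer bias cancel or drop out of this variance,
        # so it estimates sigma_i^2 + sigma_j^2.
        pair_var[row] = np.var(x[:, i] - x[:, j], ddof=1)
    est, *_ = np.linalg.lstsq(design, pair_var, rcond=None)
    return est  # note: individual estimates can be negative in small samples


def monte_carlo(n_cases, sigmas, n_runs=2000, seed=0):
    """Return (relative deviation from truth, relative standard error) per observer."""
    rng = np.random.default_rng(seed)
    true_var = np.asarray(sigmas, dtype=float) ** 2
    est = np.array([
        pairwise_variances(simulate(n_cases, sigmas, rng)) for _ in range(n_runs)
    ])
    accuracy = est.mean(axis=0) / true_var - 1.0      # closeness to the truth
    precision = est.std(axis=0, ddof=1) / true_var    # spread of the estimates
    return accuracy, precision


if __name__ == "__main__":
    for n_cases in (25, 50, 100, 200):
        acc, prec = monte_carlo(n_cases, sigmas=[1.0, 1.5, 2.0])
        print(n_cases, np.round(acc, 3), np.round(prec, 3))
```

Running the loop for several numbers of cases and observers gives a rough feel for how the spread of the estimates shrinks as either is increased. The numbers it prints depend on the assumed model and should not be read as reproducing the percentages reported in the abstract.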