The interpretation of screening tests such as mammograms typically requires a radiologist's subjective visual assessment of images, which often results in substantial differences between radiologists' classifications of subjects' test results. In clinical screening studies that assess agreement between experts, multiple raters are often recruited to classify subjects' test results on an ordinal classification scale. However, applying traditional measures of agreement in some studies is challenging because of the presence of many raters, the use of an ordinal classification scale, and unbalanced data. We evaluate and compare the performance of existing measures of agreement and association, as well as a new model-based measure of agreement, in three large-scale clinical screening studies involving ordinal classifications by many raters. We also conduct a simulation study to demonstrate the main properties of the summary measures. The assessed agreement and association varied according to the choice of summary measure. Some measures were influenced by the underlying prevalence of disease and by the raters' marginal distributions, and/or were limited to balanced data sets in which every rater classifies every subject. Our simulation study showed that popular measures of agreement and association are sensitive to the underlying disease prevalence. Model-based measures offer a flexible approach to calculating agreement and association and are robust to missing and unbalanced data as well as to the underlying disease prevalence.