# Measures of Data Quality

Refinement (R) values in crystallography measure the agreement between observed and predicted data [D1]. The newer \(CC\) correlation coefficients estimate the correlation between an observed data set and the "true" signal, providing a better indication of statistical validity, especially in the high-resolution limit [D2].

## \(R\) factors

Three different \(R\) factors are computed in OpenHKL. In the definitions below, \(I_{hl}\) denotes the intensity of the \(l\)-th observation of reflection \(h\).

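\(R_\mathrm{merge}\):

\[
R_\mathrm{merge} = \frac{\sum_h \sum_l \left| I_{hl} - \langle I_h \rangle \right|}{\sum_h \sum_l \langle I_h \rangle}
\]

Here \(\langle I_h \rangle\) is the mean of the observations \(I_{hl}\) of reflection \(h\). This is the traditional measure of internal consistency. It has the flaw of increasing with data multiplicity, even though the merged data improve when more observations are averaged.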

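\(R_\mathrm{meas}\):

\[
R_\mathrm{meas} = \frac{\sum_h \sqrt{\frac{n_h}{n_h - 1}} \sum_l \left| I_{hl} - \langle I_h \rangle \right|}{\sum_h \sum_l \langle I_h \rangle}
\]

This is the multiplicity-weighted \(R\) factor, in which \(n_h\) is the number of observations of reflection \(h\).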

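\(R_\mathrm{pim}\):

\[
R_\mathrm{pim} = \frac{\sum_h \sqrt{\frac{1}{n_h - 1}} \sum_l \left| I_{hl} - \langle I_h \rangle \right|}{\sum_h \sum_l \langle I_h \rangle}
\]

This is the precision-indicating \(R\) factor, which estimates data quality after merging.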
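As an illustration, the three \(R\) factors can be computed in a few lines of Python. This is a minimal sketch, not the OpenHKL implementation. It assumes the standard definitions from [D1], in which each reflection's summed absolute deviation from its mean is weighted by \(1\), \(\sqrt{n_h/(n_h-1)}\) and \(\sqrt{1/(n_h-1)}\) respectively. The function name `r_factors`, the dictionary layout, and the choice to skip singly measured reflections are assumptions made for this example.

```python
import math
from collections.abc import Mapping, Sequence


def r_factors(observations: Mapping[tuple, Sequence[float]]) -> dict[str, float]:
    """Compute R_merge, R_meas and R_pim from intensities grouped by reflection.

    `observations` maps a reflection index h (hypothetical layout) to the
    list of its measured intensities I_hl.
    """
    num_merge = num_meas = num_pim = denom = 0.0
    for intensities in observations.values():
        n_h = len(intensities)
        if n_h < 2:
            # Assumption: skip singly measured reflections, for which the
            # 1 / (n_h - 1) weights are undefined.
            continue
        mean_h = sum(intensities) / n_h
        abs_dev = sum(abs(i_hl - mean_h) for i_hl in intensities)
        num_merge += abs_dev
        num_meas += math.sqrt(n_h / (n_h - 1)) * abs_dev
        num_pim += math.sqrt(1.0 / (n_h - 1)) * abs_dev
        denom += n_h * mean_h  # sum over l of <I_h>
    if denom == 0.0:
        raise ValueError("no multiply measured reflections with non-zero mean")
    return {
        "R_merge": num_merge / denom,
        "R_meas": num_meas / denom,
        "R_pim": num_pim / denom,
    }


# Two reflections with multiplicities 2 and 3 (made-up intensities).
example = {(1, 0, 0): [10.0, 12.0], (0, 1, 0): [20.0, 20.0, 23.0]}
print(r_factors(example))
```

Because the three statistics share the same denominator and differ only in the per-reflection weight, they can be accumulated in a single pass over the data.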

## The correlation coefficients \(CC_{1/2}\) and \(CC^\ast\)

The statistic \(CC_{1/2}\) was introduced in [D2] and is defined as follows. Randomly divide the unmerged dataset into two subsets. For each symmetry-equivalence class \([hkl]\), we have a merged intensity \(x_{hkl}\) from the first subset and \(y_{hkl}\) from the second. \(CC_{1/2}\) is defined as the Pearson correlation coefficient of the joint measurements \((x_{hkl}, y_{hkl})\):

\[
CC_{1/2} := \rho(x, y),
\]

where \(\rho\) denotes the Pearson correlation coefficient. Note that this depends on how the unmerged dataset is divided into two subsets, so \(CC_{1/2}\) is itself a random variable. (However, under some assumptions, one can check that its variance is small.)
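The procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the OpenHKL code: the random division is performed separately within each symmetry-equivalence class, each half is merged by a plain unweighted mean, and classes with fewer than two observations are skipped. The names `pearson` and `cc_half` and the dictionary layout are hypothetical.

```python
import math
import random
from collections.abc import Mapping, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cc_half(observations: Mapping[tuple, Sequence[float]],
            rng: random.Random) -> float:
    """Estimate CC_1/2 from unmerged intensities grouped by equivalence class."""
    xs, ys = [], []
    for intensities in observations.values():
        if len(intensities) < 2:
            continue  # need at least one observation in each half
        shuffled = list(intensities)
        rng.shuffle(shuffled)  # random assignment to the two subsets
        split = len(shuffled) // 2
        first, second = shuffled[:split], shuffled[split:]
        xs.append(sum(first) / len(first))    # merged intensity x_hkl
        ys.append(sum(second) / len(second))  # merged intensity y_hkl
    return pearson(xs, ys)


# Synthetic data: "true" intensity h plus Gaussian noise, four observations each.
noise = random.Random(1)
data = {(h, 0, 0): [h + noise.gauss(0.0, 0.5) for _ in range(4)]
        for h in range(1, 50)}
print(cc_half(data, random.Random(0)))
```

Passing different `random.Random` seeds to `cc_half` gives slightly different values, which reflects the fact that the statistic depends on the choice of division.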

Let \(J_{hkl}\) denote the true intensity (we use \(J\) instead of \(I\) to distinguish this from our measured and/or merged intensities), and define the random variables \(\xi := x - J\) and \(\eta := y - J\). We make the following assumptions: \(\xi\) and \(\eta\) are independent with mean zero, \(\sigma_\xi = \sigma_\eta\), and \(\xi\) and \(\eta\) are uncorrelated with \(J\).

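Since \(\xi\) and \(\eta\) are uncorrelated with \(J\), and \(\sigma_\xi = \sigma_\eta\),

\[
\sigma_x^2 = \sigma_J^2 + \sigma_\xi^2, \qquad \sigma_y^2 = \sigma_J^2 + \sigma_\eta^2 = \sigma_J^2 + \sigma_\xi^2.
\]

Then, because \(\xi\) and \(\eta\) are also independent of each other,

\[
\mathrm{Cov}(x, y) = \mathrm{Cov}(J + \xi, J + \eta) = \sigma_J^2 + \mathrm{Cov}(J, \eta) + \mathrm{Cov}(\xi, J) + \mathrm{Cov}(\xi, \eta) = \sigma_J^2.
\]

Thus we have

\[
CC_{1/2} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} = \frac{\sigma_J^2}{\sigma_J^2 + \sigma_\xi^2}.
\]

This expression will be useful in the following subsection.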
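The relation \(CC_{1/2} = \sigma_J^2 / (\sigma_J^2 + \sigma_\xi^2)\), which follows from the assumptions above, can be checked numerically. The sketch below draws synthetic values satisfying those assumptions and compares the sample correlation of \(x\) and \(y\) with the predicted value. The Gaussian distributions, the parameter values, and the helper name `pearson` are choices made purely for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
sigma_j, sigma_xi = 2.0, 1.0  # spread of true intensities and of the errors
n_reflections = 20000

# True intensities J, and two half-dataset measurements x = J + xi, y = J + eta
# with independent, zero-mean errors of equal standard deviation.
true_j = [rng.gauss(100.0, sigma_j) for _ in range(n_reflections)]
x = [j + rng.gauss(0.0, sigma_xi) for j in true_j]
y = [j + rng.gauss(0.0, sigma_xi) for j in true_j]

measured = pearson(x, y)
predicted = sigma_j**2 / (sigma_j**2 + sigma_xi**2)
print(f"measured CC_1/2 = {measured:.3f}, predicted = {predicted:.3f}")
```

With these parameters the predicted value is \(4/5\), and the sample value should agree with it up to statistical fluctuations that shrink as the number of reflections grows.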

## \(CC_\mathrm{true}\)

Let \(x, y, \xi, \eta, J\) be as in the previous subsection, and let

\[
I := \frac{x + y}{2}
\]

denote the merged intensities of the entire dataset. Then \(CC_\mathrm{true}\) is defined to be the Pearson correlation coefficient of \(I\) and the true intensities \(J\):

\[
CC_\mathrm{true} := \rho(I, J).
\]

Since in most cases we do not know the true intensities, this definition is not directly useful.

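Making the same assumptions about measurement error as in the previous subsection, and writing \(I = J + (\xi + \eta)/2\), we have

\[
\mathrm{Cov}(I, J) = \mathrm{Cov}\!\left(J + \frac{\xi + \eta}{2},\, J\right) = \sigma_J^2,
\]

and furthermore,

\[
\sigma_I^2 = \sigma_J^2 + \frac{\sigma_\xi^2 + \sigma_\eta^2}{4} = \sigma_J^2 + \frac{\sigma_\xi^2}{2}.
\]

Therefore,

\[
CC_\mathrm{true} = \frac{\mathrm{Cov}(I, J)}{\sigma_I \sigma_J} = \frac{\sigma_J}{\sqrt{\sigma_J^2 + \sigma_\xi^2 / 2}}.
\]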

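From the simplified expression for \(CC_{1/2}\) in the previous subsection, \(CC_{1/2} = \sigma_J^2 / (\sigma_J^2 + \sigma_\xi^2)\), we can express \(\sigma^2_\xi\) as \(\sigma^2_J(1/CC_{1/2}-1)\). Putting this into the above expression for \(CC_\mathrm{true}\), we have

\[
CC_\mathrm{true} = \sqrt{\frac{2\, CC_{1/2}}{1 + CC_{1/2}}},
\]

which, remarkably, is a function of \(CC_{1/2}\) only. We therefore define

\[
CC^\ast := \sqrt{\frac{2\, CC_{1/2}}{1 + CC_{1/2}}}
\]

to be an estimate of \(CC_\mathrm{true}\) that can be calculated directly from the data. This statistic was introduced in [D2].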
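The estimate \(CC^\ast = \sqrt{2\,CC_{1/2}/(1 + CC_{1/2})}\) from [D2] is a one-line function of \(CC_{1/2}\). The sketch below implements it and compares it with \(CC_\mathrm{true}\) on synthetic data, where the true intensities are known by construction. The function names, the Gaussian distributions, the parameter values, and the decision to reject negative \(CC_{1/2}\) (for which the square root may be undefined) are assumptions of this example, not part of OpenHKL.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cc_star(cc_half: float) -> float:
    """Estimate CC_true from CC_1/2 via sqrt(2 CC_1/2 / (1 + CC_1/2))."""
    if cc_half < 0.0:
        # Assumption of this sketch: treat negative CC_1/2 as an error.
        raise ValueError("CC* is not defined here for negative CC_1/2")
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


rng = random.Random(0)
sigma_j, sigma_xi = 2.0, 1.0
n_reflections = 20000

true_j = [rng.gauss(100.0, sigma_j) for _ in range(n_reflections)]
x = [j + rng.gauss(0.0, sigma_xi) for j in true_j]
y = [j + rng.gauss(0.0, sigma_xi) for j in true_j]
merged = [(a + b) / 2.0 for a, b in zip(x, y)]  # I = (x + y) / 2

cc_true = pearson(merged, true_j)  # needs the true intensities
estimate = cc_star(pearson(x, y))  # needs only the measured data
print(f"CC_true = {cc_true:.3f}, CC* = {estimate:.3f}")
```

On real data only the second quantity is available, which is the point of the statistic: it estimates the correlation with the unknown true signal from the two half-datasets alone.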

[D1] P. R. Evans. An introduction to data reduction: space-group determination, scaling and intensity statistics. *Acta Crystallographica Section D Biological Crystallography*, 67:282–292, 2011. doi:10.1107/s090744491003982x.

[D2] P. A. Karplus and K. Diederichs. Linking crystallographic model and data quality. *Science*, 336:1030–1033, 2012. doi:10.1126/science.1218231.
