Measures of Data Quality

Refinement (R) values in crystallography measure the agreement between observed and predicted data [D1]. The newer \(CC\) correlation coefficients estimate the correlation between an observed data set and the "true" signal, providing a better indication of statistical validity, especially at the high-resolution limit [D2].

\(R\) factors

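Three different \(R\) factors are computed in OpenHKL. In the formulas below, \(I_{hl}\) is the intensity of the \(l\)-th observation of reflection \(h\), and \(\langle I_h \rangle\) is the mean intensity over all observations of that reflection.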

  1. \(R_\mathrm{merge}\):

    The traditional measure of internal consistency. Its flaw is that it increases with data multiplicity, even though averaging more observations improves the merged data.

\[R_\mathrm{merge} = \frac{\sum_h\sum_l |I_{hl} - \langle I_h \rangle |} {\sum_h \sum_l \langle I_h \rangle}\]

  2. \(R_\mathrm{meas}\):

    This is the multiplicity-weighted \(R\) factor, in which \(n_h\) is the number of observations of reflection \(h\).

\[R_\mathrm{meas} = \frac{\sum_h\sum_l \left( \frac{n_h}{n_h - 1} \right)^{1/2} |I_{hl} - \langle I_h \rangle |} {\sum_h \sum_l \langle I_h \rangle}\]

  3. \(R_\mathrm{pim}\):

    This is the precision-indicating \(R\) factor, which estimates data quality after merging.

\[R_\mathrm{pim} = \frac{\sum_h\sum_l \left( \frac{1}{n_h - 1} \right)^{1/2} |I_{hl} - \langle I_h \rangle |} {\sum_h \sum_l \langle I_h \rangle}\]
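The three definitions differ only in the per-reflection weight applied to the absolute deviations, which a short Python sketch can make concrete. This is an illustration of the formulas, not OpenHKL's implementation: the function name `r_factors`, the dictionary layout of `observations`, and the choice to skip reflections with a single observation are all assumptions made here.

```python
import math


def r_factors(observations):
    """Return (R_merge, R_meas, R_pim) for unmerged intensities.

    `observations` maps a reflection label h to the list of its measured
    intensities I_hl. Reflections with fewer than two observations are
    skipped (an assumption of this sketch), because the weights
    n_h / (n_h - 1) and 1 / (n_h - 1) are undefined for n_h = 1.
    """
    num_merge = num_meas = num_pim = denom = 0.0
    for intensities in observations.values():
        n = len(intensities)
        if n < 2:
            continue
        mean = sum(intensities) / n
        # Sum over l of |I_hl - <I_h>| for this reflection.
        abs_dev = sum(abs(i - mean) for i in intensities)
        num_merge += abs_dev
        num_meas += math.sqrt(n / (n - 1)) * abs_dev
        num_pim += math.sqrt(1 / (n - 1)) * abs_dev
        # Sum over l of <I_h> is n_h * <I_h>.
        denom += n * mean
    return num_merge / denom, num_meas / denom, num_pim / denom


# Made-up intensities for two reflections.
example = {
    (1, 0, 0): [100.0, 110.0, 90.0, 100.0],
    (0, 1, 0): [50.0, 54.0],
}
r_merge, r_meas, r_pim = r_factors(example)
```

For any data set with at least one reflection observed more than twice, the weights guarantee \(R_\mathrm{pim} < R_\mathrm{merge} < R_\mathrm{meas}\).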

The correlation coefficients \(CC_{1/2}\) and \(CC\ast\)

The statistic \(CC_{1/2}\), introduced in [D2], is defined as follows. Randomly divide the unmerged dataset into two subsets. For each symmetry-equivalence class \([hkl]\), we have a merged intensity \(x_{hkl}\) from the first subset and \(y_{hkl}\) from the second. \(CC_{1/2}\) is then the Pearson correlation coefficient of the paired measurements \((x_{hkl}, y_{hkl})\):

\[CC_{1/2} := \rho(x, y) = \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y},\]

where \(\rho\) denotes the Pearson correlation coefficient. Note that this depends on the choice of division of the unmerged datasets into two subsets, so it is itself a random variable. (However, under some assumptions, one can check that its variance should be small.)
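The procedure can be sketched in Python as follows. This is a minimal illustration rather than OpenHKL's code: the names `pearson` and `cc_half`, the dictionary layout of `observations`, and the synthetic data are assumptions. The sketch also splits the observations of each reflection separately so that both halves are populated, which is one possible way to realise the random division described above.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cc_half(observations, rng):
    """CC_1/2 for one random split of the unmerged observations."""
    xs, ys = [], []
    for intensities in observations.values():
        if len(intensities) < 2:
            continue  # cannot populate both halves
        shuffled = list(intensities)
        rng.shuffle(shuffled)
        middle = len(shuffled) // 2
        first, second = shuffled[:middle], shuffled[middle:]
        xs.append(sum(first) / len(first))    # merged intensity x_hkl
        ys.append(sum(second) / len(second))  # merged intensity y_hkl
    return pearson(xs, ys)


# Synthetic data: 500 reflections, 6 noisy observations each.
data_rng = random.Random(0)
observations = {}
for h in range(500):
    true_intensity = data_rng.expovariate(1 / 100.0)
    observations[h] = [true_intensity + data_rng.gauss(0.0, 20.0) for _ in range(6)]

# Two different random splits of the same data.
value_a = cc_half(observations, random.Random(1))
value_b = cc_half(observations, random.Random(2))
```

Running the function with two different seeds gives two slightly different values, which illustrates that \(CC_{1/2}\) is itself a random variable.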

Let \(J_{hkl}\) denote the true intensity (we use \(J\) instead of \(I\) to distinguish it from the measured and merged intensities). Define the random variables \(\xi := x - J\) and \(\eta := y - J\). We assume that \(\xi\) and \(\eta\) are independent with mean zero, that \(\sigma_\xi = \sigma_\eta\), and that \(\xi\) and \(\eta\) are uncorrelated with \(J\).

Since \(\xi,\eta\) are uncorrelated with \(J\),

\[\begin{split}\begin{aligned} \sigma^2_x &= \sigma^2_J + \sigma^2_\xi \\ \sigma^2_y &= \sigma^2_J + \sigma^2_\eta = \sigma^2_J + \sigma^2_\xi\end{aligned}\end{split}\]


\[\begin{split}\begin{aligned} \rho(x,y) &= \frac{\mathrm{Cov}(x, y)}{\sigma_x \sigma_y} \\ &= \frac{\mathrm{Cov}(J + \xi, J + \eta)}{\sigma_x \sigma_y} \\ &= \frac{\sigma^2_J + \mathrm{Cov}(\xi, J) + \mathrm{Cov}(\eta, J) + \mathrm{Cov}(\xi, \eta)}{\sigma_x \sigma_y} \\ &= \frac{\sigma^2_J}{\sigma^2_J + \sigma^2_\xi}\end{aligned}\end{split}\]

Thus we have

\[\label{cc-half-simplified} CC_{1/2} = \sigma^2_J / \left(\sigma^2_J + \sigma^2_\xi \right)\]

This expression will be useful in the following subsection.
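The relation \(CC_{1/2} = \sigma^2_J / (\sigma^2_J + \sigma^2_\xi)\) can be checked numerically under the stated assumptions. The following Python sketch draws hypothetical true intensities and independent Gaussian errors; the values of `sigma_j`, `sigma_xi`, and the sample size are arbitrary choices for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)
sigma_j, sigma_xi = 2.0, 1.0  # hypothetical standard deviations
n = 20000

# True intensities J and two half-dataset estimates x = J + xi, y = J + eta.
true_j = [rng.gauss(10.0, sigma_j) for _ in range(n)]
x = [j + rng.gauss(0.0, sigma_xi) for j in true_j]
y = [j + rng.gauss(0.0, sigma_xi) for j in true_j]

simulated = pearson(x, y)
predicted = sigma_j**2 / (sigma_j**2 + sigma_xi**2)
```

With these parameters the predicted value is 0.8, and the simulated correlation should agree with it up to sampling noise.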


Let \(x, y, \xi, \eta, J\) be as in the previous subsection, and let

\[I = \frac{x+y}{2}\]

denote the merged intensities of the entire dataset. Then \(CC_\mathrm{true}\) is defined to be the Pearson correlation coefficient of \(I\) and the true intensities \(J\):

\[\label{cc-true-definition} CC_\mathrm{true} = \rho(I, J) = \frac{\mathrm{Cov}(I, J)}{\sigma_I \sigma_J}\]

Since in most cases we do not know the true intensities, this definition is not directly useful.

Making the same assumptions about measurement error as in the previous subsection, we have

\[\begin{split}\begin{aligned} \sigma^2_I &= \frac{1}{4} \sigma^2_x + \frac{1}{4}\sigma^2_y + \frac{1}{2} \mathrm{Cov}(x, y) \\ &= \sigma_J^2 + \frac{1}{2} \sigma_\xi^2\end{aligned}\end{split}\]

and furthermore,

\[\mathrm{Cov}(I, J) = \mathrm{Cov}(J + \frac{\xi+\eta}{2}, J) = \sigma^2_J.\]


Substituting both results into the definition of \(CC_\mathrm{true}\) gives

\[CC_\mathrm{true} = \frac{\sigma_J}{\sqrt{\sigma^2_J + \frac{1}{2}\sigma^2_\xi}}.\]

From the simplified expression \(CC_{1/2} = \sigma^2_J / (\sigma^2_J + \sigma^2_\xi)\), we can express \(\sigma^2_\xi\) as \(\sigma^2_J(1/CC_{1/2}-1)\). Substituting this into the expression for \(CC_\mathrm{true}\), we have

\[\begin{split}\begin{aligned} CC_\mathrm{true} &= \frac{\sigma_J}{\sqrt{\sigma_J^2 + \frac{1}{2}\sigma^2_\xi}} = \frac{\sigma_J}{\sqrt{\sigma_J^2 + \frac{1}{2}\sigma^2_J(1/CC_{1/2}-1)}} \\ &= \frac{1}{\sqrt{\frac{1}{2}+\frac{1}{2 CC_{1/2}}}} = \sqrt{\frac{2 CC_{1/2}}{1+CC_{1/2}}},\end{aligned}\end{split}\]

which, remarkably, is a function of \(CC_{1/2}\) only. We therefore define

\[\label{cc-star-definition} CC\ast := \sqrt{\frac{2 CC_{1/2}}{1+CC_{1/2}}},\]

to be an estimate of \(CC_\mathrm{true}\), which can be calculated directly from the data. The statistic was introduced in [D2].
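As a numerical illustration, the Python sketch below evaluates \(CC\ast\) from a simulated \(CC_{1/2}\) and compares it with \(CC_\mathrm{true}\), which is available here only because the synthetic true intensities are known. The function name `cc_star`, the simulation parameters, and the decision to raise an error for non-positive \(CC_{1/2}\) are assumptions of this sketch, not a description of OpenHKL's behaviour.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cc_star(cc_half):
    """CC* = sqrt(2 CC_1/2 / (1 + CC_1/2)), for 0 < CC_1/2 <= 1."""
    if not 0.0 < cc_half <= 1.0:
        # Assumption of this sketch: reject values where the estimate
        # is not meaningful instead of returning a number.
        raise ValueError("cc_half must lie in (0, 1]")
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


rng = random.Random(2)
sigma_j, sigma_xi = 2.0, 1.0  # hypothetical standard deviations
n = 20000

true_j = [rng.gauss(10.0, sigma_j) for _ in range(n)]
x = [j + rng.gauss(0.0, sigma_xi) for j in true_j]
y = [j + rng.gauss(0.0, sigma_xi) for j in true_j]
merged = [(a + b) / 2.0 for a, b in zip(x, y)]  # I = (x + y) / 2

estimate = cc_star(pearson(x, y))  # computable from the data alone
cc_true = pearson(merged, true_j)  # needs the true intensities J
```

In this simulation both `estimate` and `cc_true` should lie close to \(\sigma_J / \sqrt{\sigma^2_J + \sigma^2_\xi / 2} \approx 0.943\), up to sampling noise.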


[D1]

P. R. Evans. An introduction to data reduction: space-group determination, scaling and intensity statistics. Acta Crystallographica Section D Biological Crystallography, 67:282–292, 2011. doi:10.1107/s090744491003982x.

[D2]

A. P. Karplus and K. Diederichs. Linking crystallographic model and data quality. Science, 336:1030–1033, 2012. doi:10.1126/science.1218231.
