Tutorial¶
In this tutorial, we will look at some real (but relatively unproblematic) experimental data: trypsin with bound aniline.
Download and extract the archive containing the raw image files from the detector: trypsin.tar.gz (485 MB) .
Loading the raw data¶
The data set consists of 167 images with a resolution of 2500 times 900
pixels, and the sample is rotated in increments of 0.4 degrees. The instrument
used in this case is BioDiff [T1]. Click on Create new experiment
from the
Home
widget of OpenHKL, and select the BioDiff2500
instrument from the
list in the resulting dialogue.
Next, we need to add the data set, a set of .raw
image files from the
detector. From the main menu, select Data > Add data set > Add raw/tiff data
.
Select all of the image files, then click Open
. This will open the image
metadata dialogue.
In this case, this information is parsed from a .readme
file in the directory
containing the images, but this will not normally be the case. In this
experiment, we have an angular increment of 0.4 degrees and a wavelength of 2.67
Å. The table in the Home
widget now shows that the data is loaded.
Adding masks¶
There are artifacts on detector image whose position is fixed regardless of the
sample rotation angle, for example the beam stop and imperfections on the
detector such as the seam between detector plates. We can ensure that peaks
intersecting these artifacts are excluded from the final results by applying
masks. A mask can be applied on any image, and will apply across the whole
rotation range. Click on the Experiment
icon on the sidebar, and click on the
Masks
tab. When the Add detector image masks
box is checked, clicking and
dragging on the detector image will add either a rectangular or elliptical mask
depending on the state of the Mask type
box.
Fine adjustments can be made to the mask bounds using the spin boxes on the left panel, and any mistakes can be selected and deleted.
In this case we have masked a width of 250 pixels on the left and right of the image to remove backscattered peaks and the beam stops, the central beam stop, and the seam between detector plates between 1729 and 1736 pixels.
Finding peaks¶
The first step in data reduction is to find the detector spots in the images.
Unlike with many data reduction programmes, this must be done for all frames at
once, instead of on a per-frame basis. This is because OpenHKL constructs a
real-space 3D model of the peaks, so when processing a frame, it also requires
information on adjacent frames. Click on the Find peaks
icon on the sidebar.
The most important peak finder parameter is the threshold, which determines the
minimum number of counts for a pixel to be in a peak region; in this case, we
have decreased it from the default 80 to 30 in order to have more peaks to
construct a shape model. If our detector images had too few peaks to generate a
convincing shape model, we could decrease it further, and also decrease the
Minimum size
. The danger in doing this is, of course, that including very
weak peaks will degrade the quality of our shape model later on. Minimum
size
and Maximum size
specify the minimum and maximum number of counts in
a blob (i.e. a peak in real space); the latter is important because a blob with
too many counts might be a heterogeneous background feature. Finally Maximum
width
defines the maximum number of frames over which a blob can extend.
Leaving these parameters at their current values, click on Find peaks
, and
wait for the processing to finish (this may take a few minutes).
When the peak finding algorithm has finished, we should have identified 15316
peaks (this may vary slightly due to numerical imprecision and masking). At this
point, we need to integrate the peaks. The default integration parameters are a
good guess for most situations, but check the Compute gradient
box so that
the mean gradient in the background region is computed (this can be used later
on to discard peaks with strongly inhomogeneous backgrounds). The shape of a
blob can be characterised by an covariance (or inertia) matrix, and we simply
rescale this matrix to determine the integration region. 3 is a good guess for
the extent of the peak region (peak end
) because we expect to find 99.54%
of all counts within three standard deviations of the peak centre. After
clicking on the integrate
button and waiting, the intensity, \(\sigma\)
and strength columns in the table of peaks will be populated. Note that we don’t
want to make peak end
too large, since we want to avoid overlapping
intensity regions. The extent of the background region is less critical, since
overlaps will not affect the results; however we want it to be large enough to
get good statistics.
Note that after integration some peaks are marked as invalid; specifically,
approximately 16802 out of 25318 peaks are valid (Note that this number might
vary if your masks are not exactly the same as in the screenshot). We can check
the reason for rejection
column in the table to see why they were rejected.
In the first few frames and last few frames of the data set, the reason is
usually because the peak extends outside the sample rotation range and is
therefore incomplete. Also note how peaks close to the rotation axis are often
rejected, since they intersect the Ewald sphere on many images, and frequently
extend outside the sample rotation range as a result. In any case, we have a
good number of peaks, which we can use to both determine the unit cell and
generate a good shape model. Finally, click on create peak collection
and
name the collection found
.
Autoindexing¶
We now need to determine the unit cell, or to be precise, the rotation matrix of
the unit cell, since the unit cell is generally known at this stage. Click on
the Autoindex
icon on the sidebar.
It is absolutely critical that the direct beam position is accurate to within a
pixel or two at this point; if this is not the case, indexing will be very
difficult, if not impossible. Click on the set initial direct beam position
checkbox; this will allow you to drag and drop a crosshair, the size of which
can be adjusted, on the detector image. The direct beam is by default at the
exact centre of the image, but the air scattering halo in this data set is a
better indicator of the “real” direct beam position. Adjust the crosshair size
so that it can be used as a guide centre the direct beam, and reposition it as
well as possible.
Here, the direct beam position has been moved by about one and a half pixels in the x and y directions, but even this tiny change has made a precipitous improvement in the ease with which we can index this data set (try it without setting the crosshair!).
Since we’ve done the most important task of correcting the direct beam position,
we don’t need to adjust the indexer parameters too much. We only use peaks from
frames 0 to 10 to determine the cell (in fact, it is possible to do it with only
one frame in general, but we need to specify a range, since OpenHKL generates a
full 3D model). Adjusting strength
and d range
allows us to exclude weak
and high resolution peaks that may hinder the indexing algorithm. The indexer
generates a number of q-vectors over the unit sphere as candidate reciprocal
basis vectors; the number of vectors is specified by Q vertices
(10 000 is
more than enough). For more problematic unit cells, it may be necessary to
change Subdivisions
, which controls the number of histogram bins used in the
Fourier transform algorithm. Click on Find unit cells
, then switch from the
detector
tab to the solutions
tab.
A list of candidate unit cells is shown. If you were careful when adjusting the
direct beam, it should be quite easy to find the correct unit cell. In this
case, the cell parameters are 54.97 58.53 67.58 90 90 90
. The space group in
this case is \(P2_12_12_1\), which corresponds body-centred orthorhombic
Bravais lattice (oP
in the table). Again, the ability to find the correct
Bravais lattice and centring is crucial, and strongly dependent on the direct
beam position. In the screen shot, we can see that the hightlighted solution has
a quality of 100%, meaning that 100% of peaks in this collection can be
indexed using this unit cell. Select the best solution and click
Assign selected unit cell
to open the following dialogue.
This allows us to assign the chosen cell (indexed
) to the peak collection we
generated from the image analysis step above.
Shape model¶
Before generating an exhaustive set of predicted peaks, we need to construct a
preliminary shape model. We’re not too interested in optimising the model at this
stage, but want to be sure that the predicted peak shapes are reasonable so
that it’s possible to refine them. The most important parameter at this stage is
minimum I/sigma
, excluding weak peaks from the model. Set this to 3.0, then
click build shape model
and wait for the integration to complete. In the
example, around 16436 shapes should be generated, i.e. the model consists of the
shapes of 16436 strong peaks. Click on save shape model
and save the model.
A shape generated at specific detector coordinates can be previewed using the
shape preview widget. Here I have chosen the coordinates of a strong peak at
coordinates (1501, 267, 18)
, and compute a mean shape within a radius of 200
pixels and 10 frames.
The shape is previewed in the two panels between the table and detector views. On the left is a sequence of images depicting the peak, pixel-for-pixel, as it appears in the detector image. On the right is a plot of the 3D profile corresponding to that peak, i.e. the predicted value \(p_i\) for each pixel in the bounding box defined by the peak.
Note that clicking the peak on the detector image also superimposes the integration region for that single peak with the parameters (peak end, background begin, background end) specified in the parameters widget on the left. The peak region is in yellow, and the background region in green.
Predict peaks¶
Click the predict
icon on the sidebar. In this section we will use the unit
cell we found in the autoindexing step to predict an exhaustive set of peaks
within a given resolution cutoff. However, now that we have a unit cell we can
automatically refine the direct beam position. Set the number of batches to 30;
this will divide the found
peaks into 30 batches with equal number of peaks,
contiguous in sample rotation angle. This means after refinement, we will have
30 different values for the direct beam position, varying as a function of
rotation angle.
Set the value of d min
to 1.5 Å. peaks will be predicted up to this maximum
resolution. Click the predict
button, and wait for the prediction to finish.
Up to a resolution of 1.5 Å, we have 58229 predicted peaks, and it can be seen
in the screenshot below that the agreement for the peak centre positions is
quite good, if not perfect, and that it is worse at higher resolution. Now we
can apply our preliminary shape model to the predicted peaks by clicking
apply shape model
.
We’re not too concerned with the parameters at this point so
we’ll just use the defaults. Looking at a section of the detector image before
and after applying the shape model, we can see that the peaks correspond to the
image significantly better afterwards.
This is because when the shape of a peak changes, the ellipse formed at the
intersection of the ellipsoid and the detector plane changes, and therefore the
centre of the detector spot on the detector image changes, and serves to
highlight the importance of the peak shape even at this stage in the data
reduction. Finally, click create peak collection
and save the predicted
peaks as predicted
.
Refine¶
At this point, we need to ensure that our predicted peaks coincide with the
spots on the detector images, to ensure that the integration is accurate. This
can be achieved by refining the unit cell and instrument parameters (namely the
incident wavevector, the detector position and the sample position and
orientation). Click the Refine
icon on the sidebar.
Here, we will use a non-linear least squares algorithm to minimise the difference between Miller indices of the predicted peaks and the non-integer Miller indices of the peaks found via the image analysis step, by allowing the unit cell and aforementioned instrument parameters to vary.
The most important refinement parameter is number of batches
. Since the
instrument parameters and unit cell may change over the course of an experiment
due to thermal fluctuations, physical imperfections etc., it is likely that
these parameters will vary from frame to frame. To mitigate such errors, we
perform the refinement in batches, that is to say, we divide a peak collection
into n batches of equal numbers of peaks, and assign the peaks within that
rotation range the unit cell and instrument parameters from that batch. There
are 169 images in this data set, so we will divide the experiment into 30
batches, resulting approximately 5 images per batch (although this will vary,
since the peaks are not evenly distributed over the rotation range). Note also
that there is an “overlap” region between batches, such that the batch 2, for
example, will contain some peaks from batch 1.
We start with the unit cell found in the Autoindexing section. The use refined
cells
checkbox indicates, when checked, that the refiner will use the batch
unit cells instead of the single unit cell indicated in the unit cell
box as
the starting point (it should be unchecked for the moment, and the indexed unit
cell selected). For this data set, the optimisation is relatively
straightforward, so we can refine all parameters simultaneously, so all check
boxes under parameters to refine
should be checked. Now click refine
.
The tables should now be populated with the per-image values for unit cell,
sample position/orientation, detector position and incident wavevector. In the
screenshot below, I have plotted the change in the a/b/c cell parameters as a
function of image (i.e. sample rotation angle).
We have adjusted the instrument parameters and
unit cells, but we will need to update the predicted unit cells to see what
physical effect this has. In case we want to compare the predicted peaks before
and after refinement later on, let’s clone the predicted peaks. From the menu,
select peaks > clone peak collection
and in the resulting dialogue, clone
the predicted
peak collection to a new collection to be named refined
.
From the update predictions
section in the controls, select the new
collection refined
in the predicted peaks
box, then click on Update
.
After waiting a couple of minutes, you should be able to see something like the
following under the detector
tab.
The change should be quite significant. You can see how the unit cells and
instrument parameters change numerically as a function of rotation angle under
the tables
tab. We can try refining again, using the newly created “batch”
unit cells as a starting point by clicking the use refined cells
checkbox,
then clicking refine
again, followed by update
. From the screenshot
below, we can see not much has changed, but there are a few higher resolution
peaks that have moved significantly.
This is good enough for now, but later on we might want to run the refiner again when we’ve applied a better shape model.
Integrate¶
Now that we have a decent agreement between the positions of our predicted peaks
and detector spots, we can compute the intensities and variances of the
predicted peaks by clicking on the integrate
icon on the sidebar.
First though, it would be helpful to visualise the integration region, so we can
check that they are sane, and we don’t have too many overlaps between adjacent
peak regions. Expand the show/hide peaks
section in the controls, and check
the integration regions
checkbox.
The integration regions look good to the naked eye here, in general. The
detector spots are in general almost exactly coincident with the peak regions of
the predicted peaks, and there are no overlaps. If there were overlaps, we could
try decreasing the size of the peak regions. In fact, we can visualise the
effect of changing the integration region limits without changing their actual
values by checking the preview integration region
box, then adjusting the
three values below. Note in the screenshot the presence of at least one detector
spot (in the red box) that does not coincide as well as the others with the
integration region of its associated predicted peaks. Although it is off-centre,
it still looks as though the peak intensity is contained within the yellow
region. To check we can visualise the integration region for this peak one all
of the images on which it appears by double-clicking on it in the detector
image.
The detector spot is well-centred in the middle image, but is less so on adjacent images. This is evidence of a poor shape model: the major axis of the ellipsoid we generated earlier is skewed with respect to the blob formed by the detector spots. This is not a problem for integration of this specific reflection, but may be indicative of a systematic problem with affects peaks in high resolution areas of the detector image. For now, however, we will proceed with the integration.
The crucial parameters for integration are the integration region bounds (peak
end
, background begin
and background end
). Here we choose the “fixed
ellipsoid” integration region type, which means the ellipsoids defining
integration regions have a fixed \(r_2\) metric given in pixels by the
peak end
value of 5.5. The background begin
and background end
values are scaling factors, such that the background region will be an ellipsoid
scaled by the given values of 1.3 and 2.3 respectively. Check the Compute
gradient
box and integrate the peaks by clicking integrate peaks
.
Now the intensity
and sigma
columns in the table have been populated,
and we can see that 54914/58412 peaks were successfully integrated,
approximately 94%. For the purposes of this tutorial, this is acceptable, but
could be improved. Now we can merge the predicted peaks and get an idea of the
quality of the integration.
There are two things to note here. Firstly, that peak that intersect masks are rejected, but it also seems that a few peaks whose integration regions are just outside masks are rejected because they intersect masks (this can be confirmed by hovering the cursor over the peak and checking the “Reason for rejection” column in the table. Even though a peak does not intersect the mask on this image, the same peak may intersect a mask at a different rotation angle (i.e. in a different image). Secondly, peaks close to the rotation axis (namely peaks in a vertical line going through the central beam stop) are frequently rejected. This is because such peaks intersect the Ewald sphere over a large rotation range (possibly the whole rotation range), and are thus very difficult to build a good shape model for. They also often extend beyond the rotation range of the experiment.
Merge¶
Merging peaks that are related by space group symmetry operations gives us a
measure of the quality of the data reduction. These symmetry-related peaks
should have identical intensities, and the R-factors and CC-values computed in
the Merge
section give a statistical measure of the extent to which this is
true.
The reciprocal space in the resolution range 1.5 Å to 50 Å (in this instance) is divided into 10 concentric shells of equal reciprocal space volume, and the quality metrics computed for each individual shell, and for the overall volume. We also have the option of omitting the frames at the beginning and end of the rotation range, since the peaks here cannot generally be interpolated or are incomplete; however this is not usually necessary since such peaks are flagged as invalid.
There are many different quality metrics, but for this tutorial, we will just look at \(R_\mathrm{pim}\). It is close to zero at low resolutions (0.025 between 3.23 and 50.0 Å), and monotonically increases as the resolution increases, up to 0.45 in the 1.5 to 1.55 Å regime. Crucially, we should note that the completeness is good, averaging 72% over all resolution shells. The completeness is the fraction of the total number of peaks predicted that have been integrated successfully, and can effectively be indefinitely improved just by rejecting more and more peaks; however, this would defeat the purpose of our data reduction, so we would like to preserve as many peaks as possible by ensuring that our predicted spot positions and shape model are good.
Finally, we can inspect the lists of merged and unmerged peaks by clicking on their respective tabs. The unmerged peak table is the essential goal of the data reduction process, and can be exported in various different formats to use in refinement of structural models; it is simply a list of peaks with their Miller indices, coordinates, intensities and sigmas. The merged peaks tab gives the redundancy, the number of symmetry-equivalent peaks for a given index, together with a “goodness of fit” parameter in the form of a \(\chi^2\) value.
Profile integration¶
We can improve the quality of the integration at high resolutions by using profile integration for weak peaks (see 3D Profile integration). Instead of subtracting a mean local background from each pixel and summing together the resulting values in a given region, it is usually more accurate to weight each pixel by a factor corresponding to the value of a pixel in a mean profile. That is, we use strong peaks within a given radius of our reference weak peaks to construct a mean profile, and use the resulting normalised shape to weight pixel values of a peak with a poorly defined shape.
There are a few parameters that strongly affect the quality of profile integration; the first of these is the number of histogram bins used in constructing the mean profiles. The first of these is in the construction of the shape model: the default grid is 20 x 20 x 10, corresponding to the x, y and frame number coordinates respectively. Empirically, a lower value of the number of frame bins seems to work better, so construct a new shape model using a 20 x 20 x 6 grid (the parameters are pictured below).
This new model should be set on the integration screen, where the remaining parameters are, notably the radius for neighbour searching and the minimum number of neighbours.
Choosing the correct neighbour search parameters is a delicate balancing act.
Ideally we want a good statistical model for our weak peak, which means using as
many peaks as possible to generate the profile. However, choosing an excessively
large radius will degrade the model quality because of the spatial dependence of
peak shapes, i.e. if our radius is large, we are more likely to include strong
peaks that are far enough away to have radically different shapes. Moreover, we
do not want to reintegrate strong peaks, since their shape is already well
defined, and any attempt to scale the pixel values will simply make the
integration quality worse. Therefore, we check the Maximum strength for
profile integration box
and set the upper limit to profile integrate a peak to
a strength of 1. We can motivate this choice by looking at the peak statistics
in the “reject” widget.
We can use this tool to plot a histogram of sum-integrated peak strengths. The vast majority of peaks have a strength lower than 10. Adjusting x-axis range on the plot filters the detector image to show the peak in that range in green, and any outside that range in red. By experimenting with this view, it should be possible to convince yourself that the peaks that are essentially invisible to the naked eye on the image have strengths below 1, although this value could probably be optimised somewhat.
Note that this is the sum integrated strength, and relies on having already integrated the predicted peaks using pixel sum integration. Set the profile integration parameters as above, then integrate the predicted peaks again.
Note that the profile integrated columns in the table are now populated. We can now switch to the merge widget, and compare the integration metrics. For this choice of parameters, we have improved \(R_\mathcal{pim}\), but it may be possible to improve it even more.
Further work¶
We can attempt to improve the quality of the data reduction, but this is left as an exercise for the reader. Here are some hints:
Improve the shape model. In the first instance, we used peaks found in the image analysis step to generate a model. Since we have a lot of peaks in this model, perhaps we could achieve better fits by using only stronger peaks, by increasing the minimum strength. When the shape model is applied to a collection of predicted peaks, the highest resolution predicted peaks (at around 1.5 Å in this case) are invalid because there are not enough neighbouring “found” peaks (i.e. strong detector spots) to generate a reliable mean shape. One way of dealing with this is to increase the neighbour search radius (
radius
for pixel radius andframes
for angular radius in theshape model
controls); however, the risk is that we include peaks from too far away are qualitatively different to local peaks. We can overcome this by using different shape interpolation schemes (interpolation type
): “inverse distance” gives peaks closer to the selected peak a higher weight when averaging, and “intensity” gives stronger peaks a higher weight. Finally, we can try to “bootstrap” our calculation by using the shapes of the predicted peaks to generate a new shape model.Better refinement. It may be possible to optimise the number of batches used in the refinement. Additionally, it is possible that refining all parameters simultaneously might trap the algorithm in a local minimum. This can be avoided by optimising parameters individually, or using an alternative residual. Finally, if the shape model changes, then the per-image peak centres will also change, so it will be necessary to perform the refinement process again.
Better integration. Results might also be improved with a better choice of integration region. One possibility is modifying the bounds of the integration region (
peak end
,background begin
andbackground end
) to ensure that even in cases where the predicted peak is slightly off centre, all of the peak intensity pixels are encompassed by the integration region. It may also be possible to improve the results using a “fixed ellipsoid” as theintegration region type
. In this case, an ellipsoid with the same r2 metric (notionally radius) is used to construct the integration region for all peaks, regardless of their location on the detector. This is in general better than having an integration region whose size varies with \(\sigma\) as in the case of the “variable ellipsoid” integration region, since the count predictions for large integration regions tend to be somewhat noisier. In this tutorial, we have not been very careful about avoiding peak overlaps, namely the clashing of the yellow intensity part of integration regions. Clashing peaks can have a extremely negative effect on the quality of the data reduction, so as well as being careful with our choice of integration bounds, we can also reject overlapping peaks.Outlier rejection. The
Reject
widget accessible from the sidebar allows the generation of peak statistical models (as opposed to pixel statistics), and rejection of peaks that do not fit that model, for example, peaks with outlying background gradient values can be identified and rejected.
A. Ostermann and T. Schrader. Biodiff: diffractometer for large unit cells. Journal of large-scale research facilities JLSRF, 1:2, 2015. doi:10.17815/jlsrf-1-19.
W. Kabsch. Evaluation of single-crystal x-ray diffraction data from a position-sensitive detector. Journal of Applied Crystallography, 21:916–924, 1988. doi:10.1107/S0021889888007903.
W. Kabsch. Integration, scaling, space-group assignment and post-refinement. Acta Crystallographica Section D Biological Crystallography, 66:133–144, 2010. doi:10.1107/s0907444909047374.
R. Diamond. Profile analysis in single crystal diffractometry. Acta Crystalograhica, 25:43–55, 1969. doi:10.1107/S0567739469000064.
W. Kabsch. Xds. Acta Crystallographica Section D Biological Crystallography, 66:125–132, 2010. doi:10.1107/s0907444909047337.
C. Wilkinson, H. W. Khamis, R. F. D. Stansfield, and G. J. McIntyre. Integration of single-crystal reflections using area multidetectors. Journal of Applied Crystallography, 21:471–478, 1988. doi:10.1107/S0021889888005400.
E. Prince, C. Wilkinson, and G. J. McIntyre. Comparison of the sigma(\it I)/\it I and least-squares methods for integration of bragg reflections. Journal of Applied Crystallography, 30:133–137, 1997. doi:10.1107/S0021889896012824.