Manual

This manual focuses on practical use of DATCMP and assumes basic familiarity with statistical testing concepts like hypothesis testing and p‑values.

The command-line version of DATCMP described in this manual is intended primarily for scripted or batch workflows. For interactive use with immediate visual feedback, it is strongly recommended to use the graphical wizards available in PRIMUS (Processing -> Compare), which invoke DATCMP internally.

The following sections briefly describe the method implemented in DATCMP, how to run DATCMP from the command line on any of the supported platforms, and the required input files.

Introduction

DATCMP compares SAS data to identify significant differences between experimental data frames, or between experimental data and calculated scattering.

Common uses for multiple experimental SAS data (.dat) files include detecting radiation damage across frames before averaging, or validating that replicate measurements are consistent, e.g. buffers before and after samples.

Further, if a single regularised SAS data (.out) file is provided as an argument, DATCMP calculates the fit of the reconstructed scattering from the $p(r)$ to the experimental data. If a single fit to experimental data (.fir) or fit to calculated data (.fit) file is provided, DATCMP compares the contained experimental data and the model fit.

DATCMP implements multiple statistical tests to assess the similarity of two or more data sets, experimental or calculated.

Standardized Residuals

All tests implemented in DATCMP use residuals, the difference between two data sets, in one way or another. Standardized residuals are defined as:

\[r(s) = \frac{I_{1}(s) - I_{2}(s)}{\sqrt{\sigma_1(s)^2 + \sigma_2(s)^2}}\]

where $I_k(s)$ and $\sigma_k(s)$, $k = 1, 2$, are the intensities and their error estimates, respectively, of the experimental data or model.

Figure 1 Left: pointwise residuals; right: histogram of the same residuals

If the two data sets $I_1(s)$ and $I_2(s)$ are not significantly dissimilar, the standardized residuals should follow a standard normal distribution: they should be centered on zero, roughly symmetric, and lie almost entirely within the range [-3;+3].

Deviations from these expectations indicate a violation of the assumption of similarity of the data.
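The definition above can be sketched in a few lines of Python. This is a minimal illustration, not DATCMP's implementation: the function names `standardized_residuals` and `summarize` are hypothetical, and both data sets are assumed to be sampled on the same $s$ grid.

```python
import math

def standardized_residuals(i1, sigma1, i2, sigma2):
    """Return r(s) = (I1 - I2) / sqrt(sigma1^2 + sigma2^2) for each point.

    All four sequences are assumed to share the same s grid (an assumption
    of this sketch, not a statement about DATCMP's input handling).
    """
    if not (len(i1) == len(sigma1) == len(i2) == len(sigma2)):
        raise ValueError("all inputs must have the same length")
    return [
        (a - b) / math.sqrt(sa * sa + sb * sb)
        for a, sa, b, sb in zip(i1, sigma1, i2, sigma2)
    ]

def summarize(residuals):
    """Return (mean, fraction of residuals within [-3, +3])."""
    n = len(residuals)
    mean = sum(residuals) / n
    inside = sum(1 for r in residuals if -3.0 <= r <= 3.0) / n
    return mean, inside

# Two made-up points, for illustration only.
r = standardized_residuals([10.0, 8.0], [3.0, 0.6], [5.0, 9.0], [4.0, 0.8])
print(r)
print(summarize(r))
```

For similar data, the mean should be close to zero and nearly all residuals should fall inside [-3;+3].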

Statistical Tests

The following statistical tests are implemented in DATCMP:

CorMap test

Purpose: Randomness of residuals
Uses: Sign of residuals
Test Value: $C$
Properties:
- does not require error estimates and is robust when errors are missing or unreliable
- sensitive to broad, correlated deviations in curve shape, not point-by-point noise
- longer curves make long runs more likely to occur by chance, so the same $C$ implies different p-values depending on $n$
- with the CorMap test, it is possible to assess the correctness of error estimates
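The idea behind a sign-based run test can be sketched as follows. The sketch assumes that $C$ is the length of the longest run of consecutive residuals with the same sign, and that under the hypothesis the signs behave like independent fair coin tosses. The function names are hypothetical, and whether DATCMP reports the probability of a run of at least $C$ or of more than $C$ is not stated in this manual.

```python
def longest_run(residuals):
    """Length of the longest run of consecutive residuals with the same sign.

    Zeros are counted as non-positive, a simplification of this sketch.
    """
    best = current = 0
    previous = None
    for r in residuals:
        sign = r > 0
        current = current + 1 if sign == previous else 1
        previous = sign
        best = max(best, current)
    return best

def prob_run_at_least(n, c):
    """P(longest run of equal signs >= c) for n independent fair signs.

    Sequences whose runs are all shorter than c are counted exactly: each
    is a choice of the first sign (2 ways) combined with a composition of
    n into run lengths between 1 and c - 1.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    k = max(c - 1, 0)
    comps = [1] + [0] * n  # comps[m]: compositions of m with parts <= k
    for m in range(1, n + 1):
        comps[m] = sum(comps[max(0, m - k):m])
    # Exact integer numerator avoids cancellation for small probabilities.
    return (2 ** n - 2 * comps[n]) / 2 ** n

print(longest_run([0.3, 1.2, -0.4, -0.1, -2.0, 0.7]))
for n in (100, 500, 2000):
    print(n, prob_run_at_least(n, 12))
```

The loop shows how the probability of observing a run of 12 or more grows with the number of points $n$.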

red. $\chi^2$ test

Purpose: Randomness of standardized residuals
Uses: standardized residuals
Test Value: reduced $\chi^2$
Properties:
- Requires accurate error estimates
- the $\text{red.}\,\chi^2$ value depends heavily on the error estimates; incorrect errors will lead to incorrect conclusions
- with correct error estimates, a $\text{red.}\,\chi^2$ value “close to 1.0” is generally considered a good fit; “close to 1.0” depends on the number of data points $n$
  - $n$ small (e.g. $n=50$): a $\text{red.}\,\chi^2$ value in the range 0.5-1.5 can be considered “close to 1.0”
  - $n$ large (e.g. $n=2000$): a $\text{red.}\,\chi^2$ value in the range 0.9-1.1 can be considered “close to 1.0”
  - the exact ranges may be calculated from the $\chi_n^2$-distribution for any $n$
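The dependence of “close to 1.0” on $n$ can be estimated without special libraries. The sketch below uses the Wilson-Hilferty normal approximation to the $\chi_n^2$ quantiles rather than an exact quantile function. It normalises by $n$, which is an assumption, since the exact normalisation used by DATCMP is not stated here. The function names are hypothetical.

```python
from statistics import NormalDist

def reduced_chi2(residuals):
    """Mean of the squared standardized residuals (normalised by n here)."""
    return sum(r * r for r in residuals) / len(residuals)

def reduced_chi2_range(n, alpha=0.01):
    """Approximate central (1 - alpha) range of chi2_n / n.

    Wilson-Hilferty: (chi2_n / n)^(1/3) is approximately normal with
    mean 1 - 2/(9n) and variance 2/(9n).
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    spread = z * (2.0 / (9.0 * n)) ** 0.5
    centre = 1.0 - 2.0 / (9.0 * n)
    return (centre - spread) ** 3, (centre + spread) ** 3

for n in (50, 2000):
    low, high = reduced_chi2_range(n)
    print(f"n={n}: {low:.2f} to {high:.2f}")
```

The printed ranges can be compared with the rule-of-thumb intervals given above; they narrow as $n$ grows.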

Anderson-Darling test

Purpose: Standard Normal distribution of standardized residuals
Uses: ordered standardized residuals
Test Value: $A^2$
Properties:
- Requires accurate error estimates
- more sensitive to deviations in the tails than reduced $\chi^2$
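For illustration, the textbook Anderson-Darling statistic against a fully specified standard normal distribution can be computed as below. This is a sketch under that assumption, with a hypothetical function name; any p-value calculation or corrections applied by DATCMP are not covered.

```python
import math
from statistics import NormalDist

def anderson_darling_std_normal(residuals):
    """A^2 statistic of the residuals against the standard normal N(0, 1)."""
    phi = NormalDist().cdf
    y = sorted(residuals)  # the test uses the ordered residuals
    n = len(y)
    tiny = 1e-300  # guards against log(0) for extreme residuals
    total = 0.0
    for i in range(1, n + 1):
        lower = max(phi(y[i - 1]), tiny)   # F(y_i)
        upper = max(phi(-y[n - i]), tiny)  # 1 - F(y_{n+1-i})
        total += (2 * i - 1) * (math.log(lower) + math.log(upper))
    return -n - total / n

# Idealised normal sample versus the same sample shifted by 0.5.
grid = [NormalDist().inv_cdf((i + 0.5) / 200) for i in range(200)]
print(anderson_darling_std_normal(grid))
print(anderson_darling_std_normal([g + 0.5 for g in grid]))
```

The shifted sample yields a much larger $A^2$ than the idealised one, reflecting its departure from the standard normal distribution.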

Running datcmp

Usage:

$ datcmp [OPTIONS] <SASDATA(S)>

The OPTIONS known by DATCMP are described in the next section, and the required SASDATA arguments in the section on input files.

Command-Line Arguments and Options

DATCMP requires the following command line arguments:

Argument	Description
SASDATA	Two or more experimental SAS data (.dat) files, or a single regularised SAS data (.out), fit to experimental data (.fir) or fit to calculated data (.fit) file.

Both absolute and relative paths to data files are accepted. At most one of the input files may be given as ‘-’; in this case, input is read from stdin in experimental SAS data (.dat) format.

DATCMP recognizes the following command-line options:

Short Option	Long Option	Description
	--mode <NAME>	Comparison mode: PAIRWISE (default) compares all unique pairs; INDEPENDENT compares file pairs (1 vs 2, 3 vs 4, …).
	--test <NAME>	Test to apply: CORMAP (default), CHI-SQUARE, or ANDERSON-DARLING.
	--adjust <NAME>	Multiple-testing adjustment: FWER (Bonferroni, default) or FDR (Benjamini-Hochberg).
	--alpha <VALUE>	Significance level used for marking results and clique search (default: 0.01).
-f	--format <FMT>	Output format: FULL (default) or CSV.
	--prefix <PATH>	If present, write the internal data representation to files with this prefix.
-v	--version	Print version information and exit.
-h	--help	Print a summary of arguments, options, and exit.

Runtime Output

As always in statistical testing, the threshold for what is considered significant must be set before running the test. Choosing it after seeing the results biases the conclusions. Here we shall assume a significance level of 0.01.

$ datcmp --test=chi-square img_0003_0000*
Hypothesis: all data sets are similar
Alternative: at least one data set is different

Pair-wise Reduced Chi^2 test with correction for Familywise Error Rate (Bonferroni)
                                                     Chi^2   Pr(>Chi^2)adj Pr(>Chi^2)
     1 vs.    2                                   0.984495     0.714182     0.714182 

       1*      img_0003_00001.dat
       2*      img_0003_00002.dat

The first two lines of the output restate the hypothesis and its alternative: the hypothesis is that all frames are similar, and the alternative is that at least one frame in the provided set is different.

The next line states the test applied and the adjustment for multiple testing. It is followed by one line per comparison, stating the test value, the $p$-value and the adjusted $p$-value.

The * next to the file index indicates that the file is part of the largest group of similar data.
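One way to picture the “largest group of similar data” is as the largest clique in a graph whose nodes are frames and whose edges connect pairs that are not significantly different. The sketch below is an illustration of that idea using a plain Bron-Kerbosch search. It is not DATCMP's algorithm, and both the function name and the rule "adjusted p-value of at least alpha means similar" are assumptions.

```python
def largest_similar_group(n, adj_p, alpha=0.01):
    """Return the largest set of frames that are all pairwise similar.

    adj_p maps 1-based index pairs (i, j) to adjusted p-values; a pair
    is treated as similar when its adjusted p-value is at least alpha.
    """
    neighbours = {i: set() for i in range(1, n + 1)}
    for (i, j), p in adj_p.items():
        if p >= alpha:
            neighbours[i].add(j)
            neighbours[j].add(i)

    best = []

    def expand(clique, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            # clique is maximal; keep it if it is the largest so far
            if len(clique) > len(best):
                best = sorted(clique)
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & neighbours[v],
                   excluded & neighbours[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(neighbours), set())
    return best

# Made-up adjusted p-values for four frames; frame 4 differs from the rest.
example = {(1, 2): 0.5, (1, 3): 0.4, (2, 3): 0.9,
           (1, 4): 0.0, (2, 4): 0.0, (3, 4): 0.0}
print(largest_similar_group(4, example))
```

The search is exponential in the worst case, which is acceptable for the small numbers of frames used in this illustration.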

For automated processing, use --format=csv, which provides the same information in tabular format:

$ datcmp --test=chi-square --format csv img_0003_0000* 
Reduced Chi^2 test,      1,      2,       0.9845,       0.7142,       0.7142
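The CSV line above can be read with a few lines of Python. The sketch assumes six fields per line in the order test name, first index, second index, test value, $p$-value and adjusted $p$-value, mirroring the column order of the FULL output. The type and function names are hypothetical.

```python
import csv
import io
from typing import NamedTuple

class Comparison(NamedTuple):
    test: str
    first: int
    second: int
    value: float
    p: float
    adj_p: float

def parse_datcmp_csv(text):
    """Parse datcmp CSV output into a list of Comparison records."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        fields = [f.strip() for f in fields]  # values are space-padded
        if len(fields) != 6:
            continue  # skip blank or unexpected lines
        name, first, second, value, p, adj_p = fields
        rows.append(Comparison(name, int(first), int(second),
                               float(value), float(p), float(adj_p)))
    return rows

sample = ("Reduced Chi^2 test,      1,      2,"
          "       0.9845,       0.7142,       0.7142\n")
for row in parse_datcmp_csv(sample):
    print(row.first, row.second, row.value, row.adj_p)
```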

datcmp input files

DATCMP accepts experimental SAS data (.dat), regularised SAS data (.out), fit to experimental data (.fir) and fit to calculated data (.fit) files.

Examples

Comparing Experimental Data

This example compares a data set of 20 frames, all of the same water blank.

% bin/datcmp --test=chi-square img_0021_000*
Hypothesis: all data sets are similar
Alternative: at least one data set is different

Pair-wise Reduced Chi^2 test with correction for Familywise Error Rate (Bonferroni)
                                                     Chi^2   Pr(>Chi^2)adj Pr(>Chi^2)
     1 vs.    2                                   1.034621     0.109783     1.000000 
     1 vs.    3                                   1.039952     0.078535     1.000000 
     1 vs.    4                                   1.036327     0.098927     1.000000 
     1 vs.    5                                   1.055624     0.024884     1.000000 
     1 vs.    6                                   1.041984     0.068610     1.000000 
     1 vs.    7                                   1.010677     0.352573     1.000000 
     1 vs.    8                                   1.011242     0.345093     1.000000 
     1 vs.    9                                   1.032325     0.125729     1.000000 
     1 vs.   10                                   1.008866     0.376908     1.000000 
     1 vs.   11                                   1.053483     0.029543     1.000000 
     1 vs.   12                                   1.063295     0.012950     1.000000 
     1 vs.   13                                   1.038081     0.088616     1.000000 
     1 vs.   14                                   1.025040     0.186882     1.000000 
     1 vs.   15                                   1.579417     0.000000     0.000000*
     1 vs.   16                                 953.013187     0.000000     0.000000*
     1 vs.   17                                 301.964068     0.000000     0.000000*
     1 vs.   18                                1051.522440     0.000000     0.000000*
     1 vs.   19                                1059.348329     0.000000     0.000000*
     1 vs.   20                                1060.874763     0.000000     0.000000*

[cut for brevity]

    19 vs.   20                                   1.014102     0.308287     1.000000 

       1*      img_0021_00001.dat
       2*      img_0021_00002.dat
       3*      img_0021_00003.dat
       4*      img_0021_00004.dat
       5*      img_0021_00005.dat
       6*      img_0021_00006.dat
       7*      img_0021_00007.dat
       8*      img_0021_00008.dat
       9*      img_0021_00009.dat
      10*      img_0021_00010.dat
      11*      img_0021_00011.dat
      12*      img_0021_00012.dat
      13*      img_0021_00013.dat
      14*      img_0021_00014.dat
      15       img_0021_00015.dat
      16       img_0021_00016.dat
      17       img_0021_00017.dat
      18       img_0021_00018.dat
      19       img_0021_00019.dat
      20       img_0021_00020.dat

It can be concluded that the first 14 frames are similar to each other, while the last six frames differ from the initial 14.
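The effect of the multiple-testing adjustment on such a run can be illustrated with the standard textbook forms of the two corrections named in the options table. This is a sketch with hypothetical function names, and it does not claim to reproduce DATCMP's internals.

```python
def bonferroni(p_values):
    """Familywise error rate control: multiply by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """False discovery rate control (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# 20 frames give 20 * 19 / 2 = 190 unique pairs.
pairs = 20 * 19 // 2
print(pairs, min(1.0, 0.012950 * pairs))  # prints: 190 1.0
```

With 190 pairwise comparisons, a raw $p$-value of 0.012950 (1 vs. 12) multiplied by 190 exceeds 1 and is capped at 1.000000, which is consistent with the adjusted values shown in the output above.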

Assessing Error Estimates of Experimental Data

1. Take >= 6 consecutive frames, e.g. of water
2. Compare pairwise with --test=cormap to verify similarity
3. Compare independent pairs of frames, 1-2, 3-4, 5-6, … with --test=chi-square
4. If all observed $\text{red.}\,\chi^2$ values are in [0.9 – 1.1], errors are likely ok
5. If there is one outside this range, repeat steps 1. to 4. once
6. If there are two or more outside range (counting repeat), error estimates are likely incorrect

Strictly speaking, the cutoff interval of [0.9 – 1.1] depends on the number of data points in the experimental data, but it works as a rule of thumb for common detector sizes. More accurate values can be calculated from the $\chi_n^2$-distribution for any $n$.
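The decision rule of this procedure can be written down as a small helper. The function names and the returned strings are hypothetical, and the default interval is the rule-of-thumb range quoted above.

```python
def count_outside(chi2_values, low=0.9, high=1.1):
    """Number of reduced chi-square values outside [low, high]."""
    return sum(1 for v in chi2_values if not (low <= v <= high))

def assess_error_estimates(first_round, repeat_round=None,
                           low=0.9, high=1.1):
    """Apply the rule of thumb to one or two rounds of comparisons."""
    outside = count_outside(first_round, low, high)
    if outside == 0:
        return "errors likely ok"
    if outside == 1 and repeat_round is None:
        return "repeat with new frames"
    if repeat_round is not None:
        # outliers of the repeat are counted together with the first round
        outside += count_outside(repeat_round, low, high)
    return "errors likely incorrect" if outside >= 2 else "errors likely ok"

# Made-up values for three independent pairs (1-2, 3-4, 5-6).
print(assess_error_estimates([1.02, 0.97, 1.05]))
print(assess_error_estimates([1.02, 1.30, 1.05]))
print(assess_error_estimates([1.02, 1.30, 1.05], [0.98, 1.21, 1.01]))
```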

Assessing Model Fit

To (re-)calculate the fit of a model to experimental data in fit to experimental data (.fir) and fit to calculated data (.fit) files, DATCMP may be employed:

% datcmp ly01.fir 
Hypothesis: all data sets are similar
Alternative: at least one data set is different

Pair-wise Correlation Map test with correction for Familywise Error Rate (Bonferroni)
                                                         C       Pr(>C)   adj Pr(>C)
     1 vs.    2                                  12.000000     0.044513     0.044513 

Assuming a cutoff $\alpha$ of 0.01, the Pr(>C) = 0.044 is greater than $\alpha$; hence the hypothesis that the model scattering fits the experimental data cannot be rejected. In everyday terms: “the model fits the data”.