datcmp
Manual
This manual focuses on practical use of DATCMP and assumes basic familiarity with statistical testing concepts like hypothesis testing and p‑values.
The command-line version of DATCMP described in this manual is intended primarily for scripted or batch workflows. For interactive use with immediate visual feedback, it is strongly recommended to use the graphical wizards available in PRIMUS (Processing -> Compare), which invoke DATCMP internally.
The following sections briefly describe the method implemented in DATCMP, how to run DATCMP from the command line on any of the supported platforms, the required input files, and usage examples.
Introduction
DATCMP compares SAS data to identify significant differences between experimental data frames, or between experimental data and calculated scattering.
Common uses for multiple experimental SAS data (.dat) files include detecting radiation damage across frames before averaging, or validating that replicate measurements are consistent, e.g. buffers before and after samples.
Further, if a single regularised SAS data (.out) file is provided as an argument, DATCMP calculates the fit of the reconstructed scattering from the \(p(r)\) to the experimental data. If a single fit to experimental data (.fir) or fit to calculated data (.fit) file is provided, DATCMP compares the contained experimental data and the model fit.
DATCMP implements multiple statistical tests to assess the similarity of two or more data sets, experimental or calculated.
Standardized Residuals
All tests implemented in DATCMP use residuals, the difference between two data sets, in one way or another. Standardized residuals are defined as:
\[r(s) = \frac{I_{1}(s) - I_{2}(s)}{\sqrt{\sigma_1(s)^2 + \sigma_2(s)^2}}\]
where \(I_k(s)\) and \(\sigma_k(s)\) are the intensities and error estimates, respectively, of the experimental data or model \(k\).
Figure 1 Left: pointwise residuals; right: histogram of the same residuals
If the two data sets \(I_1(s)\) and \(I_2(s)\) are not significantly dissimilar, the standardized residuals should follow a standard normal distribution: centered on zero, roughly symmetric, and almost entirely within the range [-3, +3].
Deviations from these expectations indicate a violation of the assumption of similarity of the data.
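As an illustration of the definition above, the following minimal Python sketch computes standardized residuals for two curves sampled on the same \(s\) grid. The function name and the sample values are made up for this example; it is not DATCMP's own code.

```python
import math


def standardized_residuals(i1, sigma1, i2, sigma2):
    """Return r(s) = (I1 - I2) / sqrt(sigma1^2 + sigma2^2) point by point.

    All four sequences are assumed to share the same s grid.
    """
    if not (len(i1) == len(sigma1) == len(i2) == len(sigma2)):
        raise ValueError("all inputs must have the same length")
    return [
        (a - b) / math.sqrt(sa * sa + sb * sb)
        for a, sa, b, sb in zip(i1, sigma1, i2, sigma2)
    ]


# Made-up illustrative values, not measured data.
r = standardized_residuals(
    [10.0, 8.0, 5.0], [0.3, 0.3, 0.4],
    [10.5, 8.0, 4.5], [0.4, 0.3, 0.3],
)
print([round(x, 3) for x in r])
```

For similar data, most of the resulting values are expected to lie within about three units of zero.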
Statistical Tests
The following statistical tests are implemented in DATCMP:
CorMap test
- Purpose: Randomness of residuals
- Uses: Sign of residuals
- Test Value: \(C\)
- Properties:
  - does not require error estimates and is robust when errors are missing or unreliable
  - sensitive to broad, correlated deviations in curve shape, not point-by-point noise
  - longer curves make long runs more significant, so the same \(C\) can imply different p-values depending on \(n\)
  - with the CorMap test, it is possible to assess the correctness of error estimates
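The dependence on run length and \(n\) can be sketched in Python. This sketch assumes that \(C\) corresponds to the longest run of residuals with the same sign, and that under the null hypothesis the signs behave like independent fair coin tosses. Both function names are hypothetical, and the probabilities need not match the values DATCMP reports.

```python
def longest_same_sign_run(residuals):
    """Length of the longest run of consecutive residuals with equal sign.

    A residual of exactly zero has no sign and ends the current run.
    """
    longest = current = 0
    previous = 0
    for r in residuals:
        sign = (r > 0) - (r < 0)
        if sign != 0 and sign == previous:
            current += 1
        else:
            current = 1 if sign != 0 else 0
        previous = sign
        longest = max(longest, current)
    return longest


def prob_run_at_least(n, c):
    """P(longest run of identical outcomes >= c) in n fair coin tosses."""
    if n < 1 or c < 1:
        raise ValueError("n and c must be positive")
    if c == 1:
        return 1.0
    if c > n:
        return 0.0
    # p[k]: probability that no run has reached length c so far and the
    # current run has length k + 1.
    p = [0.0] * (c - 1)
    p[0] = 1.0
    for _ in range(n - 1):
        switch = 0.5 * sum(p)                      # sign flips, run restarts
        p = [switch] + [0.5 * x for x in p[:-1]]   # sign repeats, run grows
    return 1.0 - sum(p)


print(longest_same_sign_run([0.4, 1.2, -0.3, -0.8, -0.1, 0.9]))
print(prob_run_at_least(100, 12), prob_run_at_least(1000, 12))
```

The second line of output shows that the same run length is less surprising on a longer curve than on a shorter one.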
red. \(\chi^2\) test
- Purpose: Randomness of standardized residuals
- Uses: standardized residuals
- Test Value: reduced \(\chi^2\)
- Properties:
  - requires accurate error estimates
  - the \(\text{red.}\,\chi^2\) value heavily depends on the error estimates; incorrect errors will lead to incorrect conclusions
  - with correct error estimates, a \(\text{red.}\,\chi^2\) value “close to 1.0” is generally considered a good fit; “close to 1.0” depends on the number of data points \(n\)
    - \(n\) small (e.g. \(n=50\)): a \(\text{red.}\,\chi^2\) value in the range 0.5-1.5 can be considered “close to 1.0”
    - \(n\) large (e.g. \(n=2000\)): a \(\text{red.}\,\chi^2\) value in the range 0.9-1.1 can be considered “close to 1.0”
    - the exact ranges may be calculated from the \(\chi_n^2\)-distribution for any \(n\)
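How such ranges scale with \(n\) can be estimated with a short Python sketch. It uses the Wilson-Hilferty approximation to the \(\chi_n^2\) quantiles rather than the exact distribution, takes the number of degrees of freedom to be \(n\), and assumes a two-sided level of 0.01. These choices and the function name are assumptions made for illustration only.

```python
import math
from statistics import NormalDist


def reduced_chi2_range(n, alpha=0.01):
    """Approximate two-sided acceptance range for reduced chi^2.

    Wilson-Hilferty: chi2_n / n is approximately
    (1 - 2/(9n) + z * sqrt(2/(9n)))**3 for a standard normal z.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    a = 2.0 / (9.0 * n)
    lo = (1.0 - a - z * math.sqrt(a)) ** 3
    hi = (1.0 - a + z * math.sqrt(a)) ** 3
    return lo, hi


for n in (50, 2000):
    lo, hi = reduced_chi2_range(n)
    print(n, round(lo, 2), round(hi, 2))
```

Under these assumptions the ranges are broadly in line with the rule-of-thumb values listed above and narrow as \(n\) grows.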
Anderson-Darling test
- Purpose: Standard Normal distribution of standardized residuals
- Uses: ordered standardized residuals
- Test Value: \(A^2\)
- Properties:
  - requires accurate error estimates
  - more sensitive to deviations in the tails than reduced \(\chi^2\)
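For reference, the textbook Anderson-Darling statistic against a fully specified standard normal distribution can be written as a short Python sketch. This is the standard formula, \(A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln\Phi(x_{(i)}) + \ln\left(1-\Phi(x_{(n+1-i)})\right)\right]\), and it is an assumption that DATCMP evaluates \(A^2\) in exactly this form; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def anderson_darling_a2(residuals):
    """A^2 of the residuals against the standard normal distribution."""
    x = sorted(residuals)
    n = len(x)
    if n == 0:
        raise ValueError("at least one residual is required")
    cdf = NormalDist().cdf
    eps = 1e-300  # guards log(0) for extreme residuals
    total = 0.0
    for i in range(1, n + 1):
        lower = max(cdf(x[i - 1]), eps)        # Phi(x_(i))
        upper = max(1.0 - cdf(x[n - i]), eps)  # 1 - Phi(x_(n+1-i))
        total += (2 * i - 1) * (math.log(lower) + math.log(upper))
    return -n - total / n


# Idealised residuals placed on standard normal quantiles, and a shifted copy.
ideal = [NormalDist().inv_cdf((i - 0.5) / 50) for i in range(1, 51)]
shifted = [v + 1.0 for v in ideal]
print(anderson_darling_a2(ideal), anderson_darling_a2(shifted))
```

The shifted sample yields a much larger \(A^2\) than the idealised one, which is the behaviour the test relies on.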
Running datcmp
Usage:
$ datcmp [OPTIONS] <SASDATA(S)>
The OPTIONS known by DATCMP are described in the next section, the required SASDATA arguments in the section on input files.
Command-Line Arguments and Options
DATCMP requires the following command line arguments:
| Argument | Description |
|---|---|
| SASDATA | Two or more experimental SAS data (.dat) files, or a single regularised SAS data (.out), fit to experimental data (.fir) or fit to calculated data (.fit) file. |
Absolute as well as relative paths to data files are accepted. At most one of the input files may be given as '-'; in this case, input is read from stdin in experimental SAS data (.dat) format.
DATCMP recognizes the following command-line options:
| Short Option | Long Option | Description |
|---|---|---|
| | --mode <NAME> | Comparison mode: PAIRWISE (default) compares all unique pairs; INDEPENDENT compares file pairs (1 vs 2, 3 vs 4, …). |
| | --test <NAME> | Test to apply: CORMAP (default), CHI-SQUARE, or ANDERSON-DARLING. |
| | --adjust <NAME> | Multiple-testing adjustment: FWER (Bonferroni, default) or FDR (Benjamini-Hochberg). |
| | --alpha <VALUE> | Significance level used for marking results and clique search (default: 0.01). |
| -f | --format <FMT> | Output format: FULL (default) or CSV. |
| | --prefix <PATH> | If present, write the internal data representation to files with this prefix. |
| -v | --version | Print version information and exit. |
| -h | --help | Print a summary of arguments, options, and exit. |
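For scripted workflows, the options above can be assembled programmatically. The following Python sketch only builds an argument list from the long options listed in the table; the helper name is hypothetical, and the `--option=value` spelling follows the usage shown in this manual's example invocations.

```python
def build_datcmp_command(files, mode=None, test=None, adjust=None,
                         alpha=None, fmt=None):
    """Build an argument list for datcmp from the documented long options.

    Options left as None are omitted so that DATCMP's defaults apply.
    """
    cmd = ["datcmp"]
    if mode is not None:
        cmd.append(f"--mode={mode}")
    if test is not None:
        cmd.append(f"--test={test}")
    if adjust is not None:
        cmd.append(f"--adjust={adjust}")
    if alpha is not None:
        cmd.append(f"--alpha={alpha}")
    if fmt is not None:
        cmd.append(f"--format={fmt}")
    cmd.extend(files)
    return cmd


cmd = build_datcmp_command(
    ["img_0003_00001.dat", "img_0003_00002.dat"],
    test="chi-square", fmt="csv",
)
print(" ".join(cmd))
# The list could then be passed to subprocess.run(cmd, capture_output=True,
# text=True) on a system where datcmp is installed.
```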
Runtime Output
As always in statistical testing, the threshold of what is considered significant must be set before running the test. Choosing it after seeing the results will bias the conclusions. Here we assume a significance level of 0.01.
$ datcmp --test=chi-square img_0003_0000*
Hypothesis: all data sets are similar
Alternative: at least one data set is different
Pair-wise Reduced Chi^2 test with correction for Familywise Error Rate (Bonferroni)
Chi^2 Pr(>Chi^2) adj Pr(>Chi^2)
1 vs. 2 0.984495 0.714182 0.714182
1* img_0003_00001.dat
2* img_0003_00002.dat
The first two lines of the output restate the hypothesis and its alternative: the hypothesis is that all frames are similar, the alternative that at least one frame among the provided set is different.
The next line states the evaluated test and the applied adjustment for multiple testing. It is followed by one line per comparison stating the test value, the \(p\)-value and the adjusted \(p\)-value.
The * next to a file index indicates that the file is present in the largest group of similar data.
For automated processing, use --format=csv, which provides the same information in tabular format:
$ datcmp --test=chi-square --format csv img_0003_0000*
Reduced Chi^2 test, 1, 2, 0.9845, 0.7142, 0.7142
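A line of this CSV output can be split into fields with a few lines of Python. The sketch below assumes the column order seen in the example line (test name, first index, second index, test value, \(p\)-value, adjusted \(p\)-value, matching the order of the FULL output); the type and function names are hypothetical.

```python
from typing import NamedTuple


class Comparison(NamedTuple):
    test: str
    first: int
    second: int
    value: float
    p_value: float
    adj_p_value: float


def parse_csv_line(line):
    """Parse one comma-separated result line into a Comparison.

    The field order is an assumption based on the example output.
    """
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return Comparison(
        fields[0], int(fields[1]), int(fields[2]),
        float(fields[3]), float(fields[4]), float(fields[5]),
    )


row = parse_csv_line("Reduced Chi^2 test, 1, 2, 0.9845, 0.7142, 0.7142")
print(row.first, row.second, row.adj_p_value >= 0.01)
```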
datcmp input files
DATCMP accepts experimental SAS data (.dat), regularised SAS data (.out), fit to experimental data (.fir) and fit to calculated data (.fit) files.
Examples
Comparing Experimental Data
Comparing a set of 20 frames, here all of the same water blank:
% bin/datcmp --test=chi-square img_0021_000*
Hypothesis: all data sets are similar
Alternative: at least one data set is different
Pair-wise Reduced Chi^2 test with correction for Familywise Error Rate (Bonferroni)
Chi^2 Pr(>Chi^2) adj Pr(>Chi^2)
1 vs. 2 1.034621 0.109783 1.000000
1 vs. 3 1.039952 0.078535 1.000000
1 vs. 4 1.036327 0.098927 1.000000
1 vs. 5 1.055624 0.024884 1.000000
1 vs. 6 1.041984 0.068610 1.000000
1 vs. 7 1.010677 0.352573 1.000000
1 vs. 8 1.011242 0.345093 1.000000
1 vs. 9 1.032325 0.125729 1.000000
1 vs. 10 1.008866 0.376908 1.000000
1 vs. 11 1.053483 0.029543 1.000000
1 vs. 12 1.063295 0.012950 1.000000
1 vs. 13 1.038081 0.088616 1.000000
1 vs. 14 1.025040 0.186882 1.000000
1 vs. 15 1.579417 0.000000 0.000000*
1 vs. 16 953.013187 0.000000 0.000000*
1 vs. 17 301.964068 0.000000 0.000000*
1 vs. 18 1051.522440 0.000000 0.000000*
1 vs. 19 1059.348329 0.000000 0.000000*
1 vs. 20 1060.874763 0.000000 0.000000*
[cut for brevity]
19 vs. 20 1.014102 0.308287 1.000000
1* img_0021_00001.dat
2* img_0021_00002.dat
3* img_0021_00003.dat
4* img_0021_00004.dat
5* img_0021_00005.dat
6* img_0021_00006.dat
7* img_0021_00007.dat
8* img_0021_00008.dat
9* img_0021_00009.dat
10* img_0021_00010.dat
11* img_0021_00011.dat
12* img_0021_00012.dat
13* img_0021_00013.dat
14* img_0021_00014.dat
15 img_0021_00015.dat
16 img_0021_00016.dat
17 img_0021_00017.dat
18 img_0021_00018.dat
19 img_0021_00019.dat
20 img_0021_00020.dat
It can be concluded that the first 14 frames are similar to each other, while the last six frames differ from the initial 14.
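The adjusted \(p\)-values in the listing are consistent with a Bonferroni correction over all \(20 \cdot 19 / 2 = 190\) pairwise comparisons, i.e. each \(p\)-value multiplied by the number of tests and capped at 1. The Python sketch below illustrates this arithmetic; that DATCMP computes the adjustment in exactly this form is an assumption, and the function names are made up.

```python
def n_unique_pairs(n_files):
    """Number of unique pairs among n_files data sets."""
    return n_files * (n_files - 1) // 2


def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value: p * m, capped at 1."""
    return min(1.0, p_value * n_tests)


m = n_unique_pairs(20)
alpha = 0.01
# p-values taken from the rows "1 vs. 12" and "1 vs. 15" above.
for label, p in (("1 vs. 12", 0.012950), ("1 vs. 15", 0.0)):
    adj = bonferroni(p, m)
    print(label, adj, "different" if adj < alpha else "similar")
```

A raw \(p\)-value of 0.012950 would be close to the 0.01 threshold on its own, but after adjusting for 190 comparisons it is no longer remarkable.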
Assessing Error Estimates of Experimental Data
1. Take >= 6 consecutive frames, e.g. of water
2. Compare pairwise with --test=cormap to verify similarity
3. Compare independent pairs of frames, 1-2, 3-4, 5-6, … with --test=chi-square
4. If all observed \(\text{red.}\,\chi^2\) values are in [0.9 – 1.1], errors are likely ok
5. If there is one outside this range, repeat steps 1. to 4. once
6. If there are two or more outside range (counting repeat), error estimates are likely incorrect
The cutoff-interval of [0.9 – 1.1] depends strictly on the number of data points in the experimental data, but works as a rule of thumb for common detector sizes. More accurate values can be calculated from the \(\chi_n^2\)-distribution for any \(n\).
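The decision part of this procedure can be sketched in Python as follows. The function name and return strings are hypothetical, and treating a single out-of-range value across both runs as "likely ok" is an interpretation of the steps above rather than a documented rule.

```python
def assess_error_estimates(first_run, repeat_run=None, lo=0.9, hi=1.1):
    """Classify error estimates from reduced chi^2 values of independent pairs.

    first_run / repeat_run: reduced chi^2 values from --test=chi-square runs.
    lo, hi: rule-of-thumb acceptance range (depends on the number of points).
    """
    def count_outside(values):
        return sum(1 for v in values if not lo <= v <= hi)

    outside = count_outside(first_run)
    if repeat_run is None:
        if outside == 0:
            return "likely ok"
        if outside == 1:
            return "repeat"
        return "likely incorrect"
    outside += count_outside(repeat_run)
    return "likely incorrect" if outside >= 2 else "likely ok"


# Made-up values for three independent pairs.
print(assess_error_estimates([0.98, 1.03, 1.15]))
print(assess_error_estimates([0.98, 1.03, 1.15], [1.01, 0.97, 1.20]))
```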
Assessing Model Fit
To (re-)calculate the fit of a model to experimental data in fit to experimental data (.fir) and fit to calculated data (.fit) files, DATCMP may be employed:
% datcmp ly01.fir
Hypothesis: all data sets are similar
Alternative: at least one data set is different
Pair-wise Correlation Map test with correction for Familywise Error Rate (Bonferroni)
C Pr(>C) adj Pr(>C)
1 vs. 2 12.000000 0.044513 0.044513
Assuming a significance level \(\alpha\) of 0.01, Pr(>C) = 0.044 is greater than \(\alpha\), hence the hypothesis that the model scattering fits the experimental data cannot be rejected, i.e. “the model fits the data”.