gajoe
Manual
The following sections briefly describe the method implemented in GAJOE, how to run GAJOE from the command line, the required input files, and the output files it produces.
Introduction
GAJOE employs a genetic algorithm to select an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data. GAJOE runs on the comma-separated value files of intensities and size/statistics generated by FFMAKER. The input models describing the pool of conformations may therefore come from, e.g., RANCH or any other program that writes CIF/PDB format.
Please note: when running GAJOE on short peptides it is recommended to use an ensemble of fixed size with 50 curves per ensemble and to disallow repetitions.
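To make the selection idea concrete, the following is a minimal toy sketch of a genetic algorithm that picks a fixed-size ensemble, without repetitions, from a pool of curves. It is not GAJOE's implementation: the function names (`ensemble_average`, `chi2`, `select_ensemble`), the mutation scheme, the reduced chi-square definition, the constant errors, and the synthetic Guinier-like pool are all assumptions made for illustration only.

```python
import math
import random

def ensemble_average(pool, members):
    """Average the theoretical curves of the chosen pool members, point by point."""
    n_points = len(pool[0])
    return [sum(pool[i][k] for i in members) / len(members) for k in range(n_points)]

def chi2(pool, members, experimental, errors):
    """Reduced chi-square between the ensemble average and the data (assumed form)."""
    avg = ensemble_average(pool, members)
    total = sum(((e - a) / s) ** 2 for e, a, s in zip(experimental, avg, errors))
    return total / (len(experimental) - 1)

def select_ensemble(pool, experimental, errors, size,
                    generations=100, population=20, seed=0):
    """Toy GA: keep the best half each generation, refill with mutated copies."""
    rng = random.Random(seed)
    indices = range(len(pool))
    # Each candidate ensemble is a list of distinct pool indices (no repetitions).
    candidates = [rng.sample(indices, size) for _ in range(population)]

    def score(members):
        return chi2(pool, members, experimental, errors)

    for _ in range(generations):
        candidates.sort(key=score)
        survivors = candidates[: population // 2]
        children = []
        for parent in survivors:
            child = list(parent)
            # Mutation: swap one member for a pool curve not yet in the ensemble.
            unused = [i for i in indices if i not in child]
            child[rng.randrange(size)] = rng.choice(unused)
            children.append(child)
        candidates = survivors + children
    return min(candidates, key=score)

# Synthetic pool: Guinier-like curves exp(-(q*Rg)^2 / 3) for ten hypothetical Rg values.
q_values = [0.01 * k for k in range(1, 6)]
pool = [[math.exp(-((q * rg) ** 2) / 3) for q in q_values] for rg in range(10, 30, 2)]
# Synthetic "experimental" data: the exact average of pool curves 2 and 7.
experimental = ensemble_average(pool, [2, 7])
errors = [0.01] * len(q_values)

best = select_ensemble(pool, experimental, errors, size=2)
print(sorted(best), chi2(pool, best, experimental, errors))
```

Because the best half of the candidates always survives, the lowest chi-square in the population never gets worse from one generation to the next. A real run differs in scale and detail; this sketch only shows the select, mutate, and re-score loop.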
Running GAJOE
Command-Line
Usage:
$ gajoe [OPTIONS] [DATAFILE]
GAJOE accepts absolute as well as relative paths.
The OPTIONS known by GAJOE are described in the next section.
Arguments and Options
GAJOE requires the following command line arguments:
| Argument | Description |
|---|---|
| DATAFILE | Exactly one experimental SAS data (.dat) file. |
Absolute as well as relative paths to data files are accepted.
GAJOE recognizes the following command-line options. Mandatory arguments to long options are mandatory for short options too.
| Short option | Long option | Description |
|---|---|---|
| -i | --intensity=<FILE> | pool of theoretical intensity curves (generated by FFMAKER) |
| -m | --multipool=<N> | this option is deprecated. Users are advised to use FFMAKER to generate pool files that include intensities of all models |
| -t | --times=<INT> | number of times to repeat the search (default: 100) |
| -s | --size=<FILE> | pool size/statistics list file (generated by FFMAKER) |
| -g | --generation=<N> | number of generations (default: 1000) |
| -e | --ensemble=<N> | number of ensembles per generation (default: 50) |
| -c | --no-constant | Disable constant subtraction (default: enabled) |
| -o | --no-repeated | Disable curve repetition in the ensemble (default: disabled) |
| -w | --work-files=<ARG> | Enable writing workfiles to track the GA runs (default: disabled) |
| -a | --minimum=<N> | minimum number of conformers in selected ensemble (default: 1) |
| -z | --maximum=<N> | maximum number of conformers in selected ensemble (default: 50) |
| | --seed | Set the seed for the random number generator. |
| -v | --version | Print version information and exit. |
| -h | --help | Print a summary of arguments, options, and exit. |
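When GAJOE is driven from a script, it can help to assemble the argument list programmatically. The sketch below uses only long options listed in the table above; the helper name `build_gajoe_command` and its keyword arguments are hypothetical, and the check that the minimum does not exceed the maximum is an assumption about sensible input rather than documented GAJOE behaviour.

```python
import shlex

def build_gajoe_command(datafile, intensity, size=None, times=None,
                        minimum=None, maximum=None, no_repeated=False):
    """Assemble a gajoe argument list from documented long options (hypothetical helper)."""
    if minimum is not None and maximum is not None and minimum > maximum:
        raise ValueError("minimum ensemble size must not exceed maximum")
    cmd = ["gajoe", "--intensity", str(intensity)]
    if size is not None:
        cmd += ["--size", str(size)]
    if times is not None:
        cmd += ["--times", str(times)]
    if no_repeated:
        cmd.append("--no-repeated")
    if maximum is not None:
        cmd += ["--maximum", str(maximum)]
    if minimum is not None:
        cmd += ["--minimum", str(minimum)]
    # The experimental data file is the single positional argument and goes last.
    cmd.append(str(datafile))
    return cmd

cmd = build_gajoe_command("datafile.dat", "intensities.csv",
                          size="size_statistics.csv",
                          minimum=50, maximum=50, no_repeated=True)
# Print the command for inspection instead of executing it.
print(shlex.join(cmd))
```

On a machine with ATSAS installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`; printing it first makes it easy to verify the options before starting a long run.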
Runtime Output
During runtime, a progress output of the calculations is provided.
Graphical Interface
As an alternative to usage from the command-line, GAJOE may also be run through the EOM wizard from the ATSAS Application Launcher.
This wizard allows convenient generation and selection of pools, as well as their processing and analysis with either GAJOE or NNLSJOE. See EOM for more details.
GAJOE Input Files
GAJOE requires exactly one experimental SAS data (.dat) file.
In addition, at least the intensity file and ideally also the size/statistics file, both generated by FFMAKER from the pool of models, should be provided.
GAJOE Output Files
Once completed, GAJOE creates a subfolder in the working directory containing all files resulting from the computation. The subfolders are named GAnum, where num is the sequential number of each independent run (e.g. GA001, GA002, etc.). In each subfolder the following files and folders are written:
| File Name | Description |
|---|---|
| GA00n/curve_m/ | Folder containing the result of the genetic algorithm for each experimental data set m (curve_1, curve_2, … curve_m). |
| GA00n/curve_m/logFile_00n_m.log | Log file containing the configuration of the genetic algorithm for experimental data set m. |
| GA00n/curve_m/profiles_00n_m.fit | Fit file containing the fit of the best ensemble for curve_m. |
| GA00n/curve_m/Rg_distr_00n_m.dat | \(R_g\) distribution file containing the \(R_g\) distribution of the selected ensemble for curve_m and the \(R_g\) distribution of the pool. |
| GA00n/curve_m/Size_distr_00n_m.dat | Size distribution file containing the distribution of maximum model dimensions (\(D_{\max}\)) of the selected ensemble for curve_m and the size distribution of the pool. |
| GA00n/curve_m/CaCa_distr_00n_m.dat | End-to-end CA-CA size distribution file. |
| GA00n/curve_m/Volume_distr_00n_m.dat | Model volume distribution file. |
| GA00n/curve_m/pdbs | Subfolder containing the models of the selected ensemble produced in the cycle with the lowest \(\chi^2\) value for curve_m. |
NOTE: the PDB files in the pdbs folder are NOT the structure of the flexible system but serve as descriptors of the behaviour of the system in solution and are used to generate the \(R_g\) and \(D_{\max}\) distributions and flexibility metrics.
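For post-processing scripts, the naming scheme in the table above can be turned into paths. The following sketch assumes that the run number is zero-padded to three digits (inferred from GA001, GA002) and that the curve number is written unpadded; the helper name `expected_outputs` and the dictionary keys are hypothetical.

```python
from pathlib import Path

def expected_outputs(workdir, run, curve):
    """Return the output paths for one run and one curve (hypothetical helper).

    Assumes three-digit zero padding of the run number, as in GA001.
    """
    run_tag = f"{run:03d}"
    folder = Path(workdir) / f"GA{run_tag}" / f"curve_{curve}"
    suffix = f"{run_tag}_{curve}"
    return {
        "log": folder / f"logFile_{suffix}.log",
        "fit": folder / f"profiles_{suffix}.fit",
        "rg": folder / f"Rg_distr_{suffix}.dat",
        "size": folder / f"Size_distr_{suffix}.dat",
        "caca": folder / f"CaCa_distr_{suffix}.dat",
        "volume": folder / f"Volume_distr_{suffix}.dat",
        "pdbs": folder / "pdbs",
    }

outputs = expected_outputs("work", run=1, curve=1)
# Report which of the expected files are missing, e.g. after an interrupted run.
missing = [name for name, path in outputs.items() if not path.exists()]
print(outputs["fit"].as_posix(), missing)
```

Checking `path.exists()` before parsing gives a clear error when a run did not finish, instead of a failure deep inside the analysis code.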
Examples
GAJOE for ensemble selection
Use GAJOE to perform selection from a pool of models:
$ gajoe --intensity intensities.csv --size size_statistics.csv --no-repeated --maximum 50 --minimum 50 datafile.dat
This command runs GAJOE 100 times (the default for --times) against the file datafile.dat, using the pool intensity file intensities.csv and the size/statistics file size_statistics.csv. No models are repeated and the ensemble size is fixed at 50 members.