Manual

The following sections briefly describe the method implemented in GAJOE, how to run GAJOE from the command line, and the required input and produced output files.

Introduction

GAJOE employs a genetic algorithm to select an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data. GAJOE can be run using comma-separated value (CSV) files of intensities and size/statistics generated by FFMAKER. The input models describing the pool of conformations may therefore be derived from e.g. RANCH or any other program that writes CIF/PDB format.

Please note: when running GAJOE on short peptides it is recommended to use an ensemble of fixed size with 50 curves per ensemble and to disallow repetitions.

Running GAJOE

Command-Line

Usage:

$ gajoe [OPTIONS] [DATAFILE]

GAJOE accepts absolute as well as relative paths.

The OPTIONS known by GAJOE are described in the next section.

Arguments and Options

GAJOE requires the following command line arguments:

Argument  Description
DATAFILE  Exactly one experimental SAS data (.dat) file.

Absolute as well as relative paths to data files are accepted.

GAJOE recognizes the following command-line options. Mandatory arguments to long options are mandatory for short options too.

Short option  Long option            Description
-i            --intensity=<FILE>     Pool of theoretical intensity curves (generated by FFMAKER)
-m            --multipool=<N>        Deprecated. Users are advised to use FFMAKER to generate pool files that include the intensities of all models
-t            --times=<INT>          Number of times to repeat the search (default: 100)
-s            --size=<FILE>          Pool size/statistics list file (generated by FFMAKER)
-g            --generation=<N>       Number of generations (default: 1000)
-e            --ensemble=<N>         Number of ensembles per generation (default: 50)
-c            --no-constant          Disable constant subtraction (default: enabled)
-o            --no-repeated          Disable curve repetition in the ensemble (default: disabled)
-w            --work-files=<ARG>     Enable writing workfiles to track the GA runs (default: disabled)
-a            --minimum=<N>          Minimum number of conformers in selected ensemble (default: 1)
-z            --maximum=<N>          Maximum number of conformers in selected ensemble (default: 50)
              --seed                 Set the seed for the random number generator.
-v            --version              Print version information and exit.
-h            --help                 Print a summary of arguments, options, and exit.
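
When GAJOE is launched from a script, it can help to assemble the argument list from the options listed above rather than concatenating strings by hand. The following is a minimal sketch, not part of GAJOE: the helper name build_gajoe_command and its keyword names are hypothetical, it only uses long options documented in the table, and it assumes a gajoe executable is available on the PATH. It builds and prints the command without running it.

```python
import shlex


def build_gajoe_command(datafile, intensity, size=None, times=None,
                        generations=None, ensembles=None,
                        minimum=None, maximum=None,
                        no_repeated=False, no_constant=False):
    """Return the argument list for one GAJOE run (hypothetical helper)."""
    cmd = ["gajoe", f"--intensity={intensity}"]
    if size is not None:
        cmd.append(f"--size={size}")
    # Numeric options are added only when set, so GAJOE's own defaults
    # apply to everything left as None.
    for flag, value in (("--times", times), ("--generation", generations),
                        ("--ensemble", ensembles), ("--minimum", minimum),
                        ("--maximum", maximum)):
        if value is not None:
            cmd.append(f"{flag}={int(value)}")
    if no_repeated:
        cmd.append("--no-repeated")
    if no_constant:
        cmd.append("--no-constant")
    cmd.append(str(datafile))  # the single experimental data file goes last
    return cmd


# Fixed-size ensemble of 50 curves without repetitions.
cmd = build_gajoe_command("datafile.dat", "intensities.csv",
                          size="size_statistics.csv",
                          minimum=50, maximum=50, no_repeated=True)
print(shlex.join(cmd))
# To actually start the run: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string avoids quoting problems when file names contain spaces; shlex.join is used only to display the command.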

Runtime Output

While running, GAJOE prints the progress of the calculations.

Graphical Interface

As an alternative to usage from the command-line, GAJOE may also be run through the EOM wizard from the ATSAS Application Launcher.

This wizard allows convenient generation and selection of pools, as well as their processing and analysis with either GAJOE or NNLSJOE. See EOM for more details.

GAJOE Input Files

GAJOE requires exactly one experimental SAS data (.dat) file.

In addition, at least the intensity file and ideally also the size/statistics file, both generated by FFMAKER from the pool of models, should be provided.
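
Before starting a long run it may be worth confirming that these files are actually present. The sketch below is only an illustration: check_gajoe_inputs is a hypothetical helper, the file names are placeholders, and it checks nothing beyond the existence of the files (it does not validate their contents).

```python
from pathlib import Path


def check_gajoe_inputs(datafile, intensity=None, size=None):
    """Return a list of problems; an empty list means the inputs look usable."""
    problems = []
    if not Path(datafile).is_file():
        problems.append(f"experimental data file not found: {datafile}")
    if intensity is None:
        problems.append("no intensity file given (expected from FFMAKER)")
    elif not Path(intensity).is_file():
        problems.append(f"intensity file not found: {intensity}")
    # The size/statistics file is optional, so it is checked only when given.
    if size is not None and not Path(size).is_file():
        problems.append(f"size/statistics file not found: {size}")
    return problems


# Placeholder file names; replace them with real paths.
for problem in check_gajoe_inputs("datafile.dat", "intensities.csv",
                                  "size_statistics.csv"):
    print("warning:", problem)
```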

GAJOE Output Files

Once completed, GAJOE creates a subfolder in the working directory containing all files resulting from the computation. The subfolders are named GAnum, where num is the sequential number of each independent run (e.g. GA001, GA002, etc.). In each subfolder the following files/folders are written:

File Name                             Description
GA00n/curve_m/                        Folder containing the result of the genetic algorithm for each experimental data set m (curve_1, curve_2, … curve_m).
GA00n/curve_m/logFile_00n_m.log       Log file. Contains the configuration of the genetic algorithm for experimental data set m.
GA00n/curve_m/profiles_00n_m.fit      Fit file. Contains the fit of the best ensemble for curve m.
GA00n/curve_m/Rg_distr_00n_m.dat      \(R_g\) distribution file. Contains the \(R_g\) distribution of the selected ensemble for curve m and the \(R_g\) distribution of the pool.
GA00n/curve_m/Size_distr_00n_m.dat    Size distribution file. Contains the distribution of maximum model dimensions (\(D_{\max}\)) of the selected ensemble for curve m and the size distribution of the pool.
GA00n/curve_m/CaCa_distr_00n_m.dat    End-to-end CA-CA size distribution file.
GA00n/curve_m/Volume_distr_00n_m.dat  Model volume distribution file.
GA00n/curve_m/pdbs                    Subfolder containing the models of the selected ensemble produced in the cycle with the lowest \(\chi^2\) value for curve m.

NOTE: the PDB files in the pdbs folder are NOT the structure of the flexible system but serve as descriptors of the behaviour of the system in solution and are used to generate the \(R_g\) and \(D_{\max}\) distributions and flexibility metrics.
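
For post-processing scripts, the naming scheme above can be turned into concrete paths. The following is a minimal sketch under stated assumptions: expected_outputs and its dictionary labels are hypothetical, and the run number is assumed to be zero-padded to three digits in both the folder name and the file names, as suggested by GA001 and GA002. It only constructs paths and does not check that the files exist.

```python
from pathlib import PurePosixPath


def expected_outputs(run, curve, workdir="."):
    """Map a short label to each path listed in the output table."""
    # Assumption: "00n" stands for the run number zero-padded to 3 digits.
    tag = f"{run:03d}_{curve}"
    base = PurePosixPath(workdir) / f"GA{run:03d}" / f"curve_{curve}"
    return {
        "log": base / f"logFile_{tag}.log",
        "fit": base / f"profiles_{tag}.fit",
        "rg": base / f"Rg_distr_{tag}.dat",
        "size": base / f"Size_distr_{tag}.dat",
        "caca": base / f"CaCa_distr_{tag}.dat",
        "volume": base / f"Volume_distr_{tag}.dat",
        "pdbs": base / "pdbs",
    }


# List the paths expected for the first run and the first data set.
for label, path in expected_outputs(run=1, curve=1).items():
    print(f"{label:8s}{path}")
```

PurePosixPath is used so that the printed paths match the forward-slash notation of the table on every platform; pathlib.Path would be the natural choice when the files are actually opened.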

Examples

GAJOE for ensemble selection

Use GAJOE to perform selection from a pool of models:

$ gajoe --intensity intensities.csv --size size_statistics.csv --no-repeated --maximum 50 --minimum 50 datafile.dat

This command repeats the search 100 times (the default for --times) against the file datafile.dat, using the pool intensities file intensities.csv and the size/statistics file size_statistics.csv. No models are repeated and the ensemble size is fixed at 50 members.