Written by H. Mertens, D. Franke, P. Markov, G. Tria, M. Petoukhov & D.I. Svergun.

Post all your questions about EOM to the ATSAS Community Forum.

Copyright (c) 2024 European Molecular Biology Laboratory (EMBL), Hamburg Unit and BIOSAXS GmbH, Hamburg, Germany

This is the manual for the program suite EOM 3.0 (Ensemble Optimisation Method), which seeks to describe experimental SAXS data using an ensemble representation of atomic models. The following sections briefly describe the components of EOM 3.0, detail the steps required to run the suite, and explain file input and output.

NOTE that the latest version of the program uses an updated modular protocol: model generation (RANCH), intensity calculation (FFMAKER) and a choice of selection algorithms (GAJOE/NNLSJOE). It is no longer possible to run the deprecated EOM executable. If model generation is required, it is recommended to run RANCH (or obtain models from some other program) and then use these models as input for the selection algorithms GAJOE or NNLSJOE.

If you use results from EOM in your own publication, please cite:

Tria, G., Mertens, H. D. T., Kachala, M. & Svergun, D. I. (2015) Advanced ensemble modelling of flexible macromolecules using X-ray solution scattering. _IUCrJ_ 2, 207-217.

Bernado, P., Mylonas, E., Petoukhov, M.V., Blackledge, M., Svergun, D.I. (2007) Structural Characterization of Flexible Proteins Using Small-Angle X-ray Scattering. _J. Am. Chem. Soc._ 129(17), 5656-5664.

EOM 3.0

Introduction

EOM 3.0 is a suite of programs that fits an averaged theoretical scattering intensity, derived from an ensemble of conformations, to experimental SAXS data. A pool of n independent models based upon sequence and structural information is first generated (e.g. using the updated program RANCH). For multi-domain proteins where high-resolution structures for individual subunits/domains are available, these structures and distance/orientation information derived from them can be used as rigid bodies and/or constraints in EOM model generation. For proteins expected to be intrinsically unfolded, no rigid bodies are required as input, and completely random configurations of the alpha-carbon trace are created based upon the sequence alone. Crystallographic symmetry, if required, must be defined by the user as an appropriately arranged set of input rigid bodies (CIF or PDB format, with the user applying the fixed flag to maintain the desired orientation of such bodies). RANCH will not apply symmetry operations. Inter-domain/subunit contacts can be imposed to generate homo/hetero oligomers and complexes by providing distance constraints.

Once pool generation is complete, the user can compute the theoretical scattering intensities of the models in the pool using FFMAKER. FFMAKER generates the input to be passed to the ensemble selection methods: a genetic algorithm (GAJOE) or a non-negative linear least-squares algorithm (NNLSJOE). The selection algorithm compares the averaged theoretical scattering intensity from n independent ensembles of conformations against the scattering data. The ensemble that best describes the experimental SAXS data is selected.
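The comparison of an ensemble-averaged curve with experimental data can be illustrated with a minimal Python sketch. It is not the EOM implementation: `guinier_curve` is a hypothetical stand-in for intensities that FFMAKER would compute, the data are synthetic, and the discrepancy is a plain scaled chi^2 without the constant subtraction that GAJOE can apply.

```python
import math

def ensemble_average(curves):
    """Point-by-point mean of intensity curves sampled on the same s grid."""
    return [sum(values) / len(curves) for values in zip(*curves)]

def reduced_chi_squared(i_exp, sigma, i_calc):
    """Chi^2 per point after least-squares scaling of i_calc onto i_exp."""
    weights = [1.0 / (s * s) for s in sigma]
    # Scale factor minimising sum(w * (i_exp - scale * i_calc)^2)
    scale = (sum(w * e * c for w, e, c in zip(weights, i_exp, i_calc))
             / sum(w * c * c for w, c in zip(weights, i_calc)))
    total = sum(w * (e - scale * c) ** 2 for w, e, c in zip(weights, i_exp, i_calc))
    return total / (len(i_exp) - 1)

def guinier_curve(rg, s_values):
    """Hypothetical toy model curve: I(s) = exp(-(s * Rg)^2 / 3)."""
    return [math.exp(-((s * rg) ** 2) / 3.0) for s in s_values]

s_values = [0.005 * k for k in range(1, 41)]
compact = guinier_curve(20.0, s_values)
extended = guinier_curve(40.0, s_values)

# Synthetic "experimental" data: a 50/50 mixture on an arbitrary scale
i_exp = [2.5 * v for v in ensemble_average([compact, extended])]
sigma = [0.01] * len(s_values)

chi2_single = reduced_chi_squared(i_exp, sigma, compact)
chi2_ensemble = reduced_chi_squared(i_exp, sigma, ensemble_average([compact, extended]))
print(f"single model: {chi2_single:.3f}, two-model ensemble: {chi2_ensemble:.3f}")
```

In this toy case only the two-model average reproduces the mixed data, which is the situation the ensemble approach is designed for.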

Metrics for quantitative assessment of system flexibility

The distributions of R~g~ and D~max~ generated by EOM (specifically the GAJOE module) can be represented as probability density functions. This allows for a quantitative estimation of the flexibility of the system using the concept of information entropy. For example, an ensemble/pool of structural parameters for a protein showing a broad Gaussian-like distribution (where it is assumed the disordered regions move randomly in solution) can be viewed as a carrier of high uncertainty. Conversely, an ensemble/pool of parameters for a protein with a narrow size distribution (a scenario where the particle exhibits limited flexibility) provides low uncertainty.

Useful metrics for the quantitative description of uncertainty (flexibility), introduced with EOM 2.0, are:

$$R_{flex} = -H_b(S), \quad \text{where } H_b(S) = -\sum_{i=1}^{n} p(x_i) \log_b[p(x_i)], \quad \text{with } \log_b[p(x_i)] = 0 \text{ if } p(x_i) = 0$$

(for further detail refer to the EOM 2.0 paper). Rflex is a metric for the degree of flexibility of the selected ensemble and that of the pool. Rflex = 100% for a fully flexible system, Rflex = 0% for a fully rigid system.

Rsigma = standard_deviation(ensemble) / standard_deviation(pool)

Rsigma is a metric for evaluating the variance of the distribution of the selected ensemble relative to that of the pool, defined as the ratio of the two standard deviations. Rsigma approaches 1.0 for a fully flexible system, where the selected ensemble is as broad as the pool, and Rsigma < 1.0 for systems whose selected ensemble is narrower than the pool.

For example, the following output from EOM/GAJOE facilitates assessment of the flexibility of the system:

Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62

Rflex of the selected ensemble is ~67%, compared to ~91% for the pool, suggesting that this system is significantly less flexible than the pool. Rsigma is much less than 1.0, which supports this interpretation: the selected distribution is markedly narrower than that of the pool.

N.B. If Rflex of the ensemble is significantly smaller than that of the pool, but Rsigma > 1.0, this may indicate a problem with the experimental data and further investigation is required.
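As a rough illustration of the two metrics, the sketch below computes an entropy-based flexibility percentage and a standard-deviation ratio from lists of R~g~ values. Several choices are assumptions and not taken from GAJOE: the histogram binning (`n_bins`, `lo`, `hi`), the log base b being set to the number of bins so that the entropy falls between 0 and 1, and reporting the result as a positive percentage.

```python
import math
import statistics

def histogram_probabilities(values, lo, hi, n_bins):
    """Bin values into n_bins equal-width bins on [lo, hi]; return bin probabilities."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # put v == hi into the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]

def normalised_entropy(probs):
    """Shannon entropy with log base b = number of bins (assumption), range 0 to 1."""
    b = len(probs)
    # Terms with p = 0 contribute nothing, matching log_b[p] = 0 if p = 0
    return -sum(p * math.log(p, b) for p in probs if p > 0.0)

def r_flex_percent(values, lo, hi, n_bins=20):
    """Entropy of the binned distribution, expressed as a percentage."""
    return 100.0 * normalised_entropy(histogram_probabilities(values, lo, hi, n_bins))

def r_sigma(ensemble_values, pool_values):
    """Ratio of the standard deviations of the selected ensemble and the pool."""
    return statistics.pstdev(ensemble_values) / statistics.pstdev(pool_values)

# Hypothetical Rg values (angstrom): a broad pool and a narrower selected ensemble
pool_rg = [10.5 + i for i in range(20)]
ensemble_rg = [18.5 + 0.2 * i for i in range(20)]
print(r_flex_percent(ensemble_rg, 10.0, 30.0), r_flex_percent(pool_rg, 10.0, 30.0),
      r_sigma(ensemble_rg, pool_rg))
```

A uniform distribution over all bins gives the maximum of 100%, and a distribution confined to a single bin gives 0%, mirroring the limits quoted above.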

RANCH (RANdom CHains)

Manual

Introduction

RANCH is a program that generates a pool of n independent models based upon sequence and structural information. For multi-domain proteins where high-resolution structures for individual domains are available, such files (e.g. CIF/PDB) can be used as rigid-body domains/subunits. For proteins expected to be intrinsically unfolded, no rigid bodies are used and random configurations of the alpha-carbon trace are created based upon the sequence. Crystallographic symmetry can be constructed by fixing input rigid bodies at the required positions on input; symmetry operations (P1, P2, ... Pn) are not applied by RANCH. Contacts between rigid bodies and unstructured regions of the sequence can be introduced via a set of user-defined distance constraints.

Running RANCH

Usage:

$ ranch [OPTIONS] [COORDINATE FILE(S)]

RANCH accepts absolute as well as relative paths to the input SEQUENCE, ASSIGNMENT and atomic coordinate FILE(s). If no path is provided, RANCH reads from stdin. In all cases the coordinate input may be in either PDB or mmCIF format. The OPTIONS known by RANCH are described in the next section.

RANCH input files

Command-Line Arguments and Options

RANCH requires the following command line arguments:

| Argument | Description |
|---|---|
| SEQUENCE | Required. The amino-acid sequence of the protein/peptide(s) in FASTA format, in a single file. If no FILE path is specified, input is read from stdin. |
| ASSIGNMENT | Required. Domain assignments: the chain ID and residue numbering corresponding to structured and unstructured sequence. Sequence regions corresponding to input CIF/PDB files, as well as user-defined stretches of ideal strand and helix, are defined here. |
| FILE | Optional. The atomic coordinate files of any input rigid bodies in PDB or mmCIF format. |

RANCH recognizes the following command-line options.

| Short option | Long option | Description |
|---|---|---|
| -p | --prefix=<ARG> | Output filename prefix (default: ranch) |
| | --model-format=<FMT> | Format of 3D models, one of: cif, pdb (default: cif) |
| | --offset=<ARG> | Output file numbering offset (default: 0) |
| | --repetitions=<ARG> | Number of output (CIF) model files (default: 10000) |
| | --database=<FILE> | Quasi-Ramachandran database file (dihedral map). NOTE that three designations in the ASSIGNMENT file can be used to define the dihedral angles used: disordered (for intrinsically disordered and unstructured regions), denatured (for chemically denatured proteins/peptides) and compact (for compact structure). |
| | --database-threshold=<ARG> | Probabilities from the Quasi-Ramachandran dihedral map less than this threshold will be set to 0.0 (default: 0.0025) |
| | --distance-constraints=<FILE> | File listing distance constraints between specified sequence positions/amino-acids |
| | --seed=<INT> | Set the seed for the random number generator |
| -v | --version | Print version information and exit. |
| -h | --help | Print a summary of arguments, options, and exit. |

Runtime Output

RANCH does not have any runtime output.

RANCH Input Files

RANCH accepts atomic coordinate data in PDB or mmCIF format as input, and a single sequence file in FASTA format. Each input may be given as a relative or absolute file path; if no path is given, data are read from stdin.

RANCH Output Files

RANCH writes atomic coordinate data in PDB or mmCIF format on output. By default the coordinate files are written to the current directory, or a directory may be specified as part of the prefix.

Examples

RANCH for generation of a pool of unstructured peptides

Use RANCH to generate a pool of 10000 models based only on the amino-acid sequence in sequence.fasta and write the models to the directory pool:

$ ranch --repetitions 10000 --prefix pool/pep_ assignment.txt sequence.fasta

Example of the FASTA sequence file format:


> A
DSHAKRHHGYKRKFHEKHHSHRGYADSHAKRHHGYKRKFHEKHHSHRGYA
AAAAAAAAAAARKFHEKHHSHRGYADSHAKRHHGYKRKFHEKHHSHRGYA

In this case a single chain (A), 100 residues long, is defined. Additional chains can be appended to the file following this format.

Example of the assignment file format:

A 1 100 disordered

In this case, coordinates are generated for residues 1 to 100 of a single chain (A), using the disordered designation of the Quasi-Ramachandran database for dihedral angles.
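Before starting a long RANCH run it can be useful to confirm that the assignment file matches the sequence file. The Python sketch below is a hypothetical pre-flight check, not part of ATSAS. It assumes that residues are numbered from 1, that the FASTA header carries the chain ID, that lines beginning with # are comments, and that the last assigned residue of each chain should equal the sequence length.

```python
def read_fasta(text):
    """Return {chain_id: sequence} from FASTA text with headers such as '> A'."""
    chains = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].strip()
            chains[current] = ""
        elif current is not None:
            chains[current] += line
    return chains

def assignment_matches_sequence(chains, assignment_text):
    """True if, for every chain, the highest assigned residue equals the sequence length."""
    last = {}
    for line in assignment_text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines
        chain, end = fields[0], int(fields[2])
        last[chain] = max(last.get(chain, 0), end)
    return all(last.get(chain) == len(seq) for chain, seq in chains.items())

# Short made-up sequence, split over two lines as FASTA allows
chains = read_fasta("> A\nDSHAK\nRHHGY\n")
print(chains, assignment_matches_sequence(chains, "A 1 10 disordered\n"))
```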

RANCH for generation of a pool of peptides with user defined regions of secondary structure

Use RANCH to generate a pool of 10000 models with stretches of ideal secondary structure:

$ ranch --repetitions 10000 --prefix pool/pep_ assignment_ss.txt sequence.fasta

Example of the assignment file format:


#assignment_ss.txt
A 1 10 disordered
A 11 22 helix 
A 22 26 disordered
A 27 37 strand
A 38 100 disordered

In this case, for a single chain (A), coordinates are generated for unstructured residues 1-10, 22-26 & 38-100 using the disordered Quasi-Ramachandran database for dihedral angles. Dihedral angles from the helical and beta-strand regions of the Quasi-Ramachandran database are used for residues 11-22 and 27-37, respectively.

RANCH for generation of a pool of protein homo-oligomers with user defined coordinates for several domains

Use RANCH to generate a pool of 10000 multi-chain models with an interface defined by input PDB/CIF orientation:

$ ranch --repetitions 10000 --prefix pool/complex_ assignment.txt sequence.seq dom1.pdb dom2a.pdb dom2b.pdb

Example of the assignment file format:


#assignment.txt
A   1 218 structure fixed
A 219 228 disordered
A 229 387 structure
B   1 218 structure fixed
B 219 228 disordered
B 229 387 structure

In this case a two-domain protein forms a dimer via its first domain (dom1.pdb). The two copies of the second domain (dom2a.pdb and dom2b.pdb) are connected to the first domain by unstructured regions (219-228). The interface is defined by the user input coordinate file (dom1.pdb) and this pre-oriented coordinate file is fixed in position. RANCH will allow the unstructured regions to undergo conformational sampling while the interface is maintained.

Instead of using a fixed dimeric interface for the first domain, one may split dom1.pdb into two files and apply distance constraints to define the interface, using the following assignment.txt and distances.txt files:


# assignment.txt
A   1 218 structure
A 219 228 disordered
A 229 387 structure
B   1 218 structure
B 219 228 disordered
B 229 387 structure


# distances.txt
A 140 145 B 140 145 15

In the above case a 15 angstrom upper limit distance is defined between residues 140-145 of chain A and residues 140-145 of chain B.
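The two file layouts shown above can be read and cross-checked with a short Python sketch. The field meanings are inferred from the examples in this manual (chain, first residue, last residue, type, optional fixed flag; and chain/range pairs followed by a distance limit), and treating lines that start with # as comments is an assumption. `Segment`, `DistanceConstraint` and `covering_segment` are hypothetical names, not part of RANCH.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    chain: str
    first: int
    last: int
    kind: str            # e.g. "structure", "disordered", "helix", "strand"
    fixed: bool = False

@dataclass
class DistanceConstraint:
    chain_a: str
    first_a: int
    last_a: int
    chain_b: str
    first_b: int
    last_b: int
    limit: float         # upper distance limit in angstrom

def _data_lines(text):
    """Yield whitespace-split fields of non-empty, non-comment lines."""
    for line in text.splitlines():
        fields = line.split()
        if fields and not fields[0].startswith("#"):
            yield fields

def parse_assignment(text):
    return [Segment(f[0], int(f[1]), int(f[2]), f[3], "fixed" in f[4:])
            for f in _data_lines(text)]

def parse_distances(text):
    return [DistanceConstraint(f[0], int(f[1]), int(f[2]),
                               f[3], int(f[4]), int(f[5]), float(f[6]))
            for f in _data_lines(text)]

def covering_segment(segments, chain, first, last):
    """Return the segment that fully contains residues first..last, or None."""
    for seg in segments:
        if seg.chain == chain and seg.first <= first and last <= seg.last:
            return seg
    return None

segments = parse_assignment("# assignment.txt\nA   1 218 structure\nA 219 228 disordered\n"
                            "A 229 387 structure\nB   1 218 structure\n")
constraint = parse_distances("# distances.txt\nA 140 145 B 140 145 15\n")[0]
print(covering_segment(segments, constraint.chain_a, constraint.first_a, constraint.last_a))
```

A check of this kind catches constraints that refer to residues outside any assigned segment before model generation begins.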

GAJOE (Genetic Algorithm Judging Optimisation of Ensembles)

Manual

Introduction

GAJOE is a program that uses a genetic algorithm to select an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data. GAJOE is run using tabular files of intensities and size/statistics generated by FFMAKER. Thus the input models describing a pool of conformations may be derived from e.g. RANCH or any other program that provides CIF/PDB format.
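The general idea of genetic-algorithm ensemble selection can be sketched in a few lines of Python. This is a toy illustration only and should not be read as GAJOE's actual algorithm: the mutation and crossover operators, the elitist survivor selection, the parameter defaults and the Guinier-type pool curves are all assumptions chosen for brevity, and repeated curves are allowed within an ensemble.

```python
import math
import random

def reduced_chi2(i_exp, sigma, curves, members):
    """Chi^2 per point of the scaled, averaged curves[members] against the data."""
    avg = [sum(curves[m][k] for m in members) / len(members) for k in range(len(i_exp))]
    scale = (sum(e * a / s**2 for e, a, s in zip(i_exp, avg, sigma))
             / sum(a * a / s**2 for a, s in zip(avg, sigma)))
    return sum(((e - scale * a) / s) ** 2
               for e, a, s in zip(i_exp, avg, sigma)) / (len(i_exp) - 1)

def select_ensemble(i_exp, sigma, curves, size=4, n_ensembles=20, generations=40, seed=1):
    """Return (best_members, best_chi2_per_generation) from a toy genetic algorithm."""
    rng = random.Random(seed)
    n_pool = len(curves)

    def score(members):
        return reduced_chi2(i_exp, sigma, curves, members)

    # Each ensemble is a list of pool indices
    population = [[rng.randrange(n_pool) for _ in range(size)] for _ in range(n_ensembles)]
    history = []
    for _ in range(generations):
        offspring = []
        for parent in population:
            # Mutation: replace one member with a random curve from the pool
            mutant = list(parent)
            mutant[rng.randrange(size)] = rng.randrange(n_pool)
            offspring.append(mutant)
            # Crossover: splice the parent with another ensemble at a random cut
            mate = rng.choice(population)
            cut = rng.randrange(1, size)
            offspring.append(parent[:cut] + mate[cut:])
        # Elitism: keep the best n_ensembles of parents and offspring
        population = sorted(population + offspring, key=score)[:n_ensembles]
        history.append(score(population[0]))
    return population[0], history

# Toy pool: Guinier-type curves for Rg = 10..59 angstrom (hypothetical data)
s_values = [0.005 * k for k in range(1, 31)]
pool = [[math.exp(-((s * rg) ** 2) / 3.0) for s in s_values] for rg in range(10, 60)]
i_exp = [(a + b) / 2.0 for a, b in zip(pool[5], pool[40])]  # mixture of Rg 15 and 50
sigma = [0.01] * len(s_values)

best, history = select_ensemble(i_exp, sigma, pool)
print("selected Rg values:", sorted(10 + m for m in best))
print("chi^2 in first and last generation:", history[0], history[-1])
```

Because the best ensembles always survive to the next generation, the best chi^2 in this sketch can never get worse from one generation to the next.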

Running GAJOE

Usage:

$ gajoe [OPTIONS] [DATAFILE]

The OPTIONS known by GAJOE are described in the next section.

GAJOE input files

Command-Line Arguments and Options

GAJOE requires the following command line arguments:

| Argument | Description |
|---|---|
| DATAFILE | Required. Scattering data file in 3 column format (s, intensities, errors). |
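For scripting around GAJOE, a 3-column data file of the kind described above can be read with a small Python helper. This is a sketch under assumptions: any line that does not begin with three numeric fields is treated as header or footer text and skipped, and commas are accepted as separators. The parsing rules used by GAJOE itself may differ.

```python
def read_saxs_data(text):
    """Parse 's intensity error' rows; skip lines that do not start with three numbers."""
    s_values, intensities, errors = [], [], []
    for line in text.splitlines():
        fields = line.replace(",", " ").split()
        if len(fields) < 3:
            continue
        try:
            s, intensity, error = (float(f) for f in fields[:3])
        except ValueError:
            continue  # non-numeric line, e.g. a header
        s_values.append(s)
        intensities.append(intensity)
        errors.append(error)
    return s_values, intensities, errors

example = "Sample: test\n0.01 100.0 1.0\n0.02 95.5 1.1\n\n"
s_values, intensities, errors = read_saxs_data(example)
print(len(s_values), "data points read")
```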

GAJOE recognizes the following command-line options.

| Short option | Long option | Description |
|---|---|---|
| -i | --intensity=<FILE> | Pool of theoretical intensity curves (generated by FFMAKER) |
| -m | --multipool=<N> | This option is deprecated. Users are advised to use FFMAKER to generate pool files that include intensities of all models |
| -t | --times=<INT> | Number of times to repeat the search (default: 100) |
| -s | --size=<FILE> | Pool size/statistics list file (generated by FFMAKER) |
| -g | --generation=<N> | Number of generations (default: 1000) |
| -e | --ensemble=<N> | Number of ensembles per generation (default: 50) |
| -c | --no-constant | Disable constant subtraction (default: enabled) |
| -o | --no-repeated | Disable curve repetition in the ensemble (default: disabled) |
| -w | --work-files=<ARG> | Enable writing workfiles to track the GA runs (default: disabled) |
| -a | --minimum=<N> | Minimum number of conformers in selected ensemble (default: 1) |
| -z | --maximum=<N> | Maximum number of conformers in selected ensemble (default: 50) |
| | --seed | Set the seed for the random number generator. |
| -v | --version | Print version information and exit. |
| -h | --help | Print a summary of arguments, options, and exit. |

Runtime Output

On runtime, the following lines of output will be written to standard output:


*******  ------------------------------------------------------  *******
*******     GAJOE - version 3.0                                  *******
*******     Copyright (c) ATSAS Team                             *******
*******     EMBL, Hamburg Outstation, 2007 - 2022                *******
*******                                                          *******
*******     For doubts/questions please visit SAXIER forum:      *******
*******     http://www.saxier.org/forum/viewforum.php?f=10       *******
*******                                                          *******
*******     In case of bugs please refer to:                     *******
*******     H. Mertens, D.I. Svergun, EMBL BioSAXS group         *******
*******     atsas@embl-hamburg.de                                *******
*******  ------------------------------------------------------  *******
 Experimental data file name ............................ : datafile.dat
 Intensities file name .................................. : intensities.csv
 Number of cycles of the genetic algorithm to run (min. 1): 100
 Random number generator has not been initialised; using current time
 Random seed is:    861619947286592639
 Curve: datafile.dat - Loading values and configuration ...
   Number of theoretical curves :       10000
 Starting the Genetic Algorithm ...
 CYCLE:   1
    Chi^2:  0.985
    No. unique models: 50
    Ensemble size: 50
...
...
...
 CYCLE: 100
    Chi^2:  0.991
    Ensemble size: 50
 ... finished the Genetic Algorithm!
 Rflex (random) / Rsigma: ~ 66.6% (~ 91.2%) / 0.62
 Re-making selected structures ...
[ 20%] [ 40%] [ 60%] [ 80%] [100%]  ... completed

Examples

GAJOE for ensemble selection from a pool of intensities generated by FFMAKER

Use GAJOE to perform selection from a pool of models:

$ gajoe --intensity intensities.csv --size size_statistics.csv --no-repeated --maximum 50 --minimum 50 datafile.dat

This command runs the GAJOE search 100 times (the default for --times) against the file datafile.dat, using the intensity pool file intensities.csv and the size/statistics file size_statistics.csv. No models are repeated, and setting both --minimum and --maximum to 50 keeps the ensemble size fixed at 50 members.
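When GAJOE is driven from a script, the argument list can be assembled programmatically. The Python sketch below only builds the command; it uses the options listed in the table above, and the helper name `gajoe_command` and its parameters are hypothetical. Passing the resulting list to `subprocess.run` would start the program, assuming gajoe is installed and on the PATH.

```python
import shlex

def gajoe_command(datafile, intensity, size, ensemble_size=None,
                  no_repeated=False, times=None):
    """Build an argument list for gajoe from options documented in this manual."""
    cmd = ["gajoe", "--intensity", intensity, "--size", size]
    if no_repeated:
        cmd.append("--no-repeated")
    if ensemble_size is not None:
        # Equal minimum and maximum give a fixed ensemble size
        cmd += ["--maximum", str(ensemble_size), "--minimum", str(ensemble_size)]
    if times is not None:
        cmd += ["--times", str(times)]
    cmd.append(datafile)
    return cmd

cmd = gajoe_command("datafile.dat", "intensities.csv", "size_statistics.csv",
                    ensemble_size=50, no_repeated=True)
print(shlex.join(cmd))
# To execute: import subprocess; subprocess.run(cmd, check=True)
```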

GAJOE output files

Once completed, GAJOE creates a subfolder in the working directory containing all files resulting from the computation. The subfolders are named in the form GAnum, where num is the sequential number of each independent run (e.g. GA001, GA002, etc.). In each subfolder the following files/folders are written:

| File Name | Description |
|---|---|
| GA00n/curve_m/ | Folder containing the result of the genetic algorithm for each experimental data set _m_ (curve_1, curve_2, ... curve_m) |
| GA00n/curve_m/logFile_00n_m.log | Log file. File containing the configuration for the genetic algorithm for the experimental data set _m_. |
| GA00n/curve_m/profiles_00n_m.fit | Fit file. File containing the fit for the best ensemble for the curve _m_. It can be opened directly in SASPLOT/PRIMUS. Detailed information (e.g. the discrepancy, CHI^2) is contained in the header of the file and can be viewed with a text editor. |
| GA00n/curve_m/Rg_distr_00n_m.dat | R~g~ distribution file. File containing the R~g~ distribution of the selected ensemble for the curve _m_ and the R~g~ distribution for the pool. The file can be directly opened using SASPLOT/PRIMUS and selecting View > AbsY:X (linear scale). Detailed information on the average R~g~ values of the pool and the selected structures is listed in the file header and can be viewed with a text editor. |
| GA00n/curve_m/Size_distr_00n_m.dat | Size distribution file. File containing the size distribution of maximum model dimensions (D~max~) of the selected ensemble for the curve _m_, and the size distribution for the pool. The file can be directly opened using SASPLOT/PRIMUS and selecting View > AbsY:X (linear scale). Detailed information on the average D~max~ values of the pool and the selected structures is listed in the file header and can be viewed with a text editor. |
| GA00n/curve_m/CaCa_distr_00n_m.dat | End-to-end CA-CA size distribution file. |
| GA00n/curve_m/Volume_distr_00n_m.dat | Model volume distribution file. |
| GA00n/curve_m/pdbs | Subfolder containing the models from the selected ensemble produced in the cycle with the lowest CHI^2 value for the curve _m_. Please note that the PDB files in this folder are NOT the structure of the flexible system but serve as descriptors of the behaviour of the system in solution and are used to generate the R~g~/D~max~ distributions and flexibility metrics. |

GAJOE on short peptides

Warning: When running GAJOE on short peptides it is recommended to use a fixed-size ensemble with 50 curves per ensemble and to disallow repetitions. See paper.

NNLSJOE (Non-Negative Linear Least-Squares algorithm Judging Optimisation of Ensembles)

Manual

Introduction

NNLSJOE is an alternative program for selecting an ensemble of models whose combined theoretical scattering intensity best describes the experimental SAXS data. NNLSJOE is run using tabular files of intensities and size/statistics generated by FFMAKER. Thus the input models describing a pool of conformations may be derived from e.g. RANCH or any other program that provides CIF/PDB format.
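Non-negative least squares seeks weights w_j >= 0 such that the weighted sum of pool curves is as close as possible to the data. The Python sketch below solves this with simple cyclic coordinate descent. It is a stand-in chosen for brevity: this manual does not describe the solver NNLSJOE uses, and the sketch ignores experimental errors and works on made-up three-point "curves".

```python
def nnls_coordinate_descent(curves, target, sweeps=500):
    """Minimise ||sum_j w_j * curves[j] - target||^2 subject to w_j >= 0."""
    weights = [0.0] * len(curves)
    residual = list(target)  # target minus model; the model is zero at the start
    norms = [sum(v * v for v in curve) for curve in curves]
    for _ in range(sweeps):
        for j, curve in enumerate(curves):
            if norms[j] == 0.0:
                continue
            # Best value of w_j with all other weights held fixed, clipped at zero
            step = sum(r * v for r, v in zip(residual, curve)) / norms[j]
            new_weight = max(0.0, weights[j] + step)
            delta = new_weight - weights[j]
            if delta != 0.0:
                residual = [r - delta * v for r, v in zip(residual, curve)]
                weights[j] = new_weight
    return weights

# Made-up pool of three curves; the target is 0.3 * curve 0 + 0.7 * curve 2
curves = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
target = [1.0, 0.7, 0.7]
weights = nnls_coordinate_descent(curves, target)
print([round(w, 4) for w in weights])
```

Curves that receive a zero weight are effectively excluded, so the size of the selected ensemble follows from the fit and is not set in advance.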

Running NNLSJOE

Usage:

$ nnlsjoe [OPTIONS] [DATAFILE]

The OPTIONS known by NNLSJOE are described in the next section.

NNLSJOE input files

Command-Line Arguments and Options

NNLSJOE requires the following command line arguments:

| Argument | Description |
|---|---|
| DATAFILE | Required. Scattering data file in 3 column format (s, intensities, errors). |

NNLSJOE recognizes the following command-line options.

| Short option | Long option | Description |
|---|---|---|
| -i | --intensity=<FILE> | Pool of theoretical intensity curves (dat, csv, txt) |
| -s | --size=<FILE> | Pool size/statistics list file (dat, csv, txt) |
| | --poolsize=<N> | Number of form-factors to use. |
| | --fit=<ARG> | Name of output fit file (*.dat, *.csv, *.txt) |
| -v | --version | Print version information and exit. |
| -h | --help | Print a summary of arguments, options, and exit. |

Examples

NNLSJOE for ensemble selection from a pool of intensities generated by FFMAKER

Use NNLSJOE to perform selection from a pool of models:

$ nnlsjoe --intensity intensities.csv --size size_statistics.csv datafile.dat

This command runs NNLSJOE against the file datafile.dat, using the intensity pool file intensities.csv and the size/statistics file size_statistics.csv. All models and repeats are considered, and the optimum ensemble size is determined.

NNLSJOE output files

Once completed, NNLSJOE writes a file describing the fit of the selected intensities to the experimental data, and reports the models selected and statistics to stdout.

| File Name | Description |
|---|---|
| user defined *.fit filename | File containing the fit of the selected intensities to the experimental data. |