Atom and residue selection

Logical selection operators
Atom selections

Residue selections
name comparisons
- X-PLOR type wildcards
- regular expressions

The selection mechanism is a powerful way to define sets of atoms and residues as arguments to commands. A selection works as a logical expression: for each residue or atom that has been read into MolScript, the program tests whether that residue or atom matches the entire expression. All atoms or residues matching the expression are selected as the argument for the command. This is very similar to how query statements work in the SQL language for relational databases.

If an expression selects no atoms or residues, then there is generally no error; that command simply does not do anything. The exception to this is the position vector specification.

Logical operators

The logical operations 'not', 'and' and 'or' can be used in a nested fashion to any depth. It is therefore possible to build quite complex statements, which select precisely the desired atoms or residues for a command.

Note that atom selections and residue selections cannot be freely mixed in 'and' and 'or' expressions. The selection expressions are strongly typed: all terms in one 'and' or 'or' expression must be of the same type, either atom or residue. However, there are operators that convert an atom selection into the corresponding residues (contains) and vice versa (in).

require exp1, exp2, exp3 ... and expn

The 'and' operator means that the expressions exp1, exp2, exp3, ..., expn must all be true for an atom or residue to be selected. All the expressions must be of one type: either atom or residue selection.

Note the comma ',' character: it is required between the expressions, except before the keyword and, where it must not occur.

either exp1, exp2, exp3, ... or expn

The 'or' operator means that if any single one of the expressions exp1, exp2, exp3, ..., expn is true for an atom or residue, then that atom or residue is selected. All the expressions must be of one type: either atom or residue selection.

Note the comma ',' character: it is required between the expressions, except before the keyword or, where it must not occur.

not exp

This operator simply inverts the truth value of exp for each atom or residue.
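The typing rule for the logical operators can be illustrated with a small Python sketch. This is only a model of the semantics described above, not MolScript input and not MolScript's implementation. The Selection class, the dictionary layout of atoms and residues, and the helper names (require, either, not_, in_, contains, atom, type_) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Selection:
    kind: str                     # "atom" or "residue"
    test: Callable[[dict], bool]  # true if one atom/residue record matches


def _same_kind(exprs):
    # All terms of one 'and'/'or' expression must have the same type.
    kinds = {e.kind for e in exprs}
    if len(kinds) != 1:
        raise TypeError("terms must be all atom or all residue selections")
    return kinds.pop()


def require(*exprs):  # models 'require e1, e2, ... and en'
    return Selection(_same_kind(exprs), lambda x: all(e.test(x) for e in exprs))


def either(*exprs):   # models 'either e1, e2, ... or en'
    return Selection(_same_kind(exprs), lambda x: any(e.test(x) for e in exprs))


def not_(expr):       # models 'not e'
    return Selection(expr.kind, lambda x: not expr.test(x))


def in_(res_sel):     # residue selection -> atom selection
    if res_sel.kind != "residue":
        raise TypeError("'in' needs a residue selection")
    return Selection("atom", lambda a: res_sel.test(a["residue"]))


def contains(atom_sel):  # atom selection -> residue selection
    if atom_sel.kind != "atom":
        raise TypeError("'contains' needs an atom selection")
    return Selection("residue", lambda r: any(atom_sel.test(a) for a in r["atoms"]))


def atom(name):       # exact name match only; wildcards are left out here
    return Selection("atom", lambda a: a["name"] == name)


def type_(rtype):
    return Selection("residue", lambda r: r["type"] == rtype)


def make_residue(name, rtype, atom_names):
    res = {"name": name, "type": rtype, "atoms": []}
    res["atoms"] = [{"name": n, "residue": res} for n in atom_names]
    return res


his = make_residue("A45", "HIS", ["N", "CA", "NE2"])
gly = make_residue("A46", "GLY", ["N", "CA"])
atoms = his["atoms"] + gly["atoms"]

# require in type HIS, not atom N and either atom NE2 or atom CA
sel = require(in_(type_("HIS")), not_(atom("N")), either(atom("NE2"), atom("CA")))
picked = [a["name"] for a in atoms if sel.test(a)]
print(picked)  # ['CA', 'NE2']

residues_with_ne2 = [r["name"] for r in (his, gly) if contains(atom("NE2")).test(r)]
print(residues_with_ne2)  # ['A45']
```

In this sketch, mixing the two types directly, as in require(type_("HIS"), atom("NE2")), raises a TypeError. Wrapping the residue term with in_ makes both terms atom selections.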

Atom selections

atom string

Selects all atoms with the given name. The name may contain X-PLOR type wildcards or be a regular expression.

res-atom string string

Selects all atoms within residues of the given name (the first argument) and with the given atom name (the second argument). The names may contain X-PLOR type wildcards or be regular expressions. This expression is actually just shorthand for

  require in residue first_argument and atom second_argument

Note that although the res-atom expression is most often used to select one single atom, it will select all atoms that fit the arguments.

occupancy number number

Selects all atoms with an occupancy value within the given range.

b-factor number number

Selects all atoms with a B-factor value within the given range.

in residue-selection

Selects all atoms within the selected residue(s). This is an expression often used for the commands ball-and-stick and cpk, which need an atom selection as argument.

sphere vector number

Selects all atoms within a sphere with its centre at the given vector and with the given radius.

close atom-selection number

Selects all atoms closer than the given distance to any of the given atoms. The atoms given as argument are not part of the finally selected set. That is, this expression specifies only neighbours to certain atoms, excluding the atoms themselves.
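The exclusion of the given atoms from the result can be made concrete with a short Python sketch. It only models the behaviour described above. The function name close, the dictionary layout of the atoms, and the coordinates are made up for illustration.

```python
import math


def close(atoms, given, distance):
    """Atoms closer than `distance` to any atom in `given`, excluding `given`."""
    given_ids = {id(a) for a in given}
    return [
        a for a in atoms
        if id(a) not in given_ids
        and any(math.dist(a["xyz"], g["xyz"]) < distance for g in given)
    ]


fe = {"name": "FE", "xyz": (0.0, 0.0, 0.0)}
n = {"name": "N", "xyz": (2.0, 0.0, 0.0)}
o = {"name": "O", "xyz": (5.0, 0.0, 0.0)}

neighbours = close([fe, n, o], [fe], 2.5)
print([a["name"] for a in neighbours])  # ['N']
```

FE itself is at distance zero from the given atom, yet it is not part of the result. Only the neighbour N is returned.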

backbone

This atom selection is shorthand for the following expression:

   either
     require in amino-acids
         and either atom N, atom CA, atom C or atom O
   or
     require not in amino-acids
         and either atom *', atom O%P or atom P

That is, if a residue is an amino acid, then its N, CA, C and O atoms are selected. If it is not an amino acid, then the atoms with names appropriate for the nucleic acid residue phosphate and (deoxy)ribose groups are selected. In the latter case an expression that selects all primed atoms is used.

peptide

This atom selection is shorthand for the following expression:

   require in amino-acids
       and either atom N, atom C or atom O

For all amino acid residues, the peptide atoms (N, C and O) are selected.

hydrogens

This atom selection is shorthand for the following expression:

   either atom H*, atom 1H*, atom 2H* or atom 3H*

That is, all atoms having the names commonly given to hydrogen atoms in a PDB file are selected.

Note that this selection is currently not based on the element specified for the atom in the new (v2.0) PDB file format. It may be in a future version.

element string

Selects the atoms of the given element type. The element type string may contain one or two characters. The comparison is different from that usually used for strings: it is not case-sensitive, and no wildcards can be used.

The element types of the atoms are set when the coordinate file is read.
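The difference from ordinary name comparison can be sketched in Python as follows. This is a model of the rule stated above, with a hypothetical function name. How MolScript stores the element type internally is not specified here.

```python
def element_matches(given, atom_element):
    """Case-insensitive, literal comparison of element symbols (no wildcards)."""
    if not 1 <= len(given) <= 2:
        raise ValueError("element type must be one or two characters")
    return given.upper() == atom_element.upper()


print(element_matches("fe", "FE"))  # True
print(element_matches("C*", "CA"))  # False
```

The second call is False because '*' is compared as a literal character rather than treated as a wildcard.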

Residue selections

molecule string

Selects all residues within the given molecule. The molecule name is that given when the coordinate file was read. The name may contain X-PLOR type wildcards or be a regular expression.

model integer

Selects the model with the given number.

In the new (v2.0) PDB coordinate file format, the different coordinate sets from an NMR structure determination are given sequential model numbers, starting with 1. This is determined by the MODEL keyword in the PDB coordinate file.

Molecules read from a coordinate file with no MODEL keywords (e.g. an X-ray diffraction structure) will have the model number 0.

from string to string

Selects the stretch of residues between and including the given residues. The names may contain X-PLOR type wildcards or be regular expressions. The two names may not denote the same residue.

If there is more than one stretch of residues that match, then all such stretches are selected. For example, if a coordinate file contains amino acids from 1 to 100, and waters also numbered 1 to 57 (as may occur in PDB files), then a sequence specification "from 5 to 15" will pick both the stretch of amino-acid residues from 5 to 15, and the waters from 5 to 15.

This is usually not a problem in connection with commands such as helix or coil, since any selected non-amino acid residues are simply ignored by these. The behaviour can be advantageous when dealing with symmetrical subunits. The name comparison feature can then be used to pick both strands (or whatever) in both chains with one single command.

As a special case, if the first residue in the coordinates that match the 'from' part is an amino-acid residue, then all other first residues (if any) must also be amino-acid residues. This solves a problem that occurs in some PDB files where some amino-acid residues and ligands (hetero groups) have the same name, and the ligands are interspersed between several chains of amino-acid residues.

If a stretch of residues is not finished when the last residue in the currently loaded coordinates is reached, then MolScript issues a warning, but does not produce an error. An error should arguably be the proper response, but there are PDB files where the residue names are such that it is difficult to avoid this.
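The way one 'from ... to' expression can pick several stretches can be sketched in Python. This is a simplified model under stated assumptions: names are compared exactly (no wildcards or regular expressions), the amino-acid special case is left out, and an unfinished stretch is simply dropped, whereas the text above only says that MolScript warns in that case. The function name select_stretches is hypothetical.

```python
def select_stretches(residue_names, first, last):
    """Return (start, end) index pairs, inclusive, for every stretch first..last."""
    stretches = []
    start = None
    for i, name in enumerate(residue_names):
        if start is None and name == first:
            start = i                      # a new stretch begins here
        elif start is not None and name == last:
            stretches.append((start, i))   # stretch is complete
            start = None
    # Assumption of this sketch: a stretch still open at the end is dropped.
    return stretches


# Amino acids numbered 1 to 100, followed by waters numbered 1 to 57.
names = [str(i) for i in range(1, 101)] + [str(i) for i in range(1, 58)]

stretches = select_stretches(names, "5", "15")
print(stretches)  # [(4, 14), (104, 114)]
```

The first pair covers amino-acid residues 5 to 15 and the second covers waters 5 to 15, matching the example in the text.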

residue string

Selects the residues with the given name (or number). The name may contain X-PLOR type wildcards or be a regular expression.

Note that the residue name was left-shifted and its blanks were squeezed out when the coordinate file was read. This means that the chain identifier and insertion code, if any, are part of the residue name, even if they were separate in the input coordinate file.
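As an illustration, the following Python sketch builds such a residue name from the separate fields of a PDB ATOM record. It assumes the fields are joined in the order they appear in the PDB format (chain identifier, residue sequence number, insertion code). That order and the function name residue_name are assumptions for this example, not taken from the MolScript source.

```python
def residue_name(chain_id, res_seq, insertion_code):
    """Join the fields and squeeze out all blanks, e.g. 'A', '  45', ' ' -> 'A45'."""
    return "".join((chain_id + res_seq + insertion_code).split())


print(residue_name("A", "  45", " "))   # A45
print(residue_name(" ", " 100", "B"))   # 100B
```

Under this assumption, residue 45 of chain A is selected with the name A45 rather than 45.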

type string

Selects the residues with the given type. The type may contain X-PLOR type wildcards or be a regular expression.

chain string

Selects the residues with the given chain identifier. Note that this identifier is just a single character, if it is at all present. The segment identifier in the new (v2.0) PDB format can now be used in the residue selection segid.

contains atom-selection

Selects the residues that contain the given atoms.

amino-acids

This residue selection is shorthand for the following selection expression:

   either type ALA, type SER, type THR, type GLY, type PRO,
          type CPR, type ASN, type GLN, type ASP, type GLU,
          type ASX, type GLX, type ARG, type LYS, type HIS,
          type PHE, type TYR, type TRP, type TRY, type VAL,
          type ILE, type LEU, type MET, type CYS, type CSH,
          type CYH or type CSM

All standard three-letter codes for amino acid residues are recognized, as well as some non-standard ones: CPR for cis-proline, ASX for undetermined ASN or ASP, GLX for undetermined GLN or GLU, TRY for tryptophan, and CSH, CYH and CSM for cysteine.

waters

This residue selection is shorthand for the following selection expression:

   either type H2O, type HHO, type OHH, type HOH,
          type OH2, type SOL or type WAT

At least some of the commonly occurring residue type designations for water molecules are covered by this expression.

nucleotides

This residue selection is shorthand for the following selection expression:

  either residue A, residue +A, residue C, residue +C,
         residue I, residue +I, residue G, residue +G,
         residue T, residue +T, residue U or residue +U

This covers the common nucleotide bases as well as modified variants of these bases designated according to the PDB conventions.

ligands

This residue selection is shorthand for the following selection expression:

   not either amino-acids, waters or nucleotides

All residues which are neither amino acids, waters nor nucleotides are selected by this expression.

segid string

Selects the residues in the given segment (or chain). The segment identifier string must contain exactly four (4) characters. The comparison is different from that usually used for strings: no wildcards may be used. It is case sensitive.

This selection is useful only for molecules read from new (v2.0) PDB format files.

Name comparisons

Comparisons of the atom names, residue types and names, and molecule names given in the various selection expressions with those present in the coordinate data follow certain rules:

- The comparison is case-sensitive; Tyr is not equal to TYR.
- All strings have been left-shifted when read from the coordinate file. All blanks have been squeezed out of the strings.
- If the value of the parameter regularexpression is off, then MolScript allows using X-PLOR (Brünger 1992) type wildcard characters in the given strings. If the value is on, then the given string is viewed as a proper regular expression.

X-PLOR type wildcards

It is possible to use wildcard characters in the comparison: '*' means any string (zero or more characters), '%' means any single character, '#' means any number (zero or more digits), and '+' means any single digit. Some examples:

   atom *    all atoms
   atom N*   all nitrogen atoms (and sodium, neon, niobium,...)
   atom %G*  all gamma (G) atoms; CG, OG, OG1, SG (and possibly others)
   type T*   residue types THR, TRP and TYR (and possibly others)
   type T%R  residue types THR and TYR

If the coordinate file contains '*' in atom names (nucleic acids in PDB files) then these are converted into single-quotes ''' while reading the file. If your coordinate file contains '*' in residue names or types, or '%', '#' or '+' characters anywhere, then you must use a proper regular expression.
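The four wildcard characters can be modelled by translating a pattern into a Python regular expression. This is a sketch of the matching rules listed above, not MolScript's own code. The function names xplor_to_regex and name_matches are hypothetical, and the whole name is assumed to have to match the pattern.

```python
import re

# X-PLOR type wildcard -> equivalent Python regular expression fragment
_WILDCARDS = {
    "*": ".*",      # any string (zero or more characters)
    "%": ".",       # any single character
    "#": "[0-9]*",  # any number (zero or more digits)
    "+": "[0-9]",   # any single digit
}


def xplor_to_regex(pattern):
    """Translate wildcards; every other character is matched literally."""
    return "".join(_WILDCARDS.get(ch, re.escape(ch)) for ch in pattern)


def name_matches(pattern, name):
    """Case-sensitive match of the whole name against the pattern."""
    return re.fullmatch(xplor_to_regex(pattern), name) is not None


print(name_matches("N*", "NE2"))    # True
print(name_matches("%G*", "OG1"))   # True
print(name_matches("T%R", "THR"))   # True
print(name_matches("T%R", "TRP"))   # False
print(name_matches("TYR", "Tyr"))   # False
print(name_matches("*'", "O5'"))    # True
```

The last call corresponds to the primed atom names that result from converting '*' to a single quote while reading nucleic acid coordinates.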

Regular expressions

The regular expressions have the same syntax as in the UNIX utility regexp (except not having the "r{m,n}" feature):

      ^           beginning of line
      $           end of line
      .           any character
      \<          beginning of word
      \>          end of word
      [str]       any character in str
      [^str]      any character not in str
      [x-y]       any character between x and y (ASCII order)
      *           any number of the preceding expression
      c           the character c, where c is not special
      \(r\)       the regular expression r

Caveat: The above description may contain errors, since the source code used for this feature was not very well documented. Also, it hasn't been tested properly.
