e3fp.fingerprint.fprinter module

Tools for generating E3FP fingerprints.

Author: Seth Axen E-mail: seth.axen@gmail.com

class Fingerprinter(bits=4294967296, level=5, radius_multiplier=1.718, stereo=True, counts=False, include_disconnected=True, rdkit_invariants=False, exclude_floating=True, remove_duplicate_substructs=True)

Bases: object

E3FP fingerprint generator.

Parameters:
  • bits (int or None, optional) – Maximum number of bits to which to fold returned fingerprint. A power of 2 is strongly recommended.

  • level (int or None, optional) – Maximum number of iterations for fingerprint generation. If None or -1, run until no new substructures are identified. Because this could produce a different final level number for each conformer, it is recommended to manually specify a level.

  • radius_multiplier (float, optional) – Multiple by which to increase shell size. At iteration 0, the shell radius is 0*`radius_multiplier`; at iteration 2, it is 2*`radius_multiplier`; and so on.

  • counts (bool, optional) – Instead of simple bit-based Fingerprint object, generate CountFingerprint that tracks number of times each bit appears in a fingerprint.

  • stereo (bool, optional) – Differentiate based on stereography. Resulting fingerprints are not comparable to non-stereo fingerprints.

  • remove_duplicate_substructs (bool, optional) – If a substructure arises that corresponds to an identifier already in the fingerprint, then the identifier for the duplicate substructure is not added to fingerprint.

  • include_disconnected (bool, optional) – Include disconnected atoms in hashes and substructures. E3FP’s advantage over ECFP relies on disconnected atoms, so the option to turn this off is present only for testing/comparison.

  • rdkit_invariants (bool, optional) – Use the atom invariants used by RDKit for its Morgan fingerprint.

  • exclude_floating (bool, optional) – Mask atoms with no bonds (usually floating ions) from the fingerprint. These are often placed arbitrarily and can confound the fingerprint.

Variables:
  • current_level (int) – The maximum level/iteration to which the fingerprinter has been run on the current conformer.

  • level_shells (dict) – Dict matching level to set of all shells accepted at that level.
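
The relationship between `level` and `radius_multiplier` described in the parameters can be sketched in a few lines. This is only an illustration of the documented rule, not e3fp's implementation; the helper name `shell_radii` is hypothetical.

```python
def shell_radii(level, radius_multiplier=1.718):
    """Return the shell radius used at each iteration from 0 through level."""
    # Iteration i uses a shell of radius i * radius_multiplier,
    # so iteration 0 has radius zero.
    return [i * radius_multiplier for i in range(level + 1)]


radii = shell_radii(2, radius_multiplier=2.0)
print(radii)
```

A larger `radius_multiplier` makes each iteration reach further from the center atom, so fewer levels are needed to cover the same distance.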

property current_level
get_fingerprint_at_level(level=-1, bits=None, exact=False, atom_mask={})

Get the fingerprint at the specified level.

Parameters:
  • level (int or None, optional) – Level/iteration

  • bits (int or None, optional) – Return fingerprints folded to this number of bits. If unspecified, defaults to bits set when instantiated.

  • exact (bool, optional) – Exact level

  • atom_mask (int or set of int, optional) – Don’t return shells whose substructures contain these atoms.

Returns:

Fingerprint at level

Return type:

Fingerprint
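
Folding to `bits` can be pictured as mapping each on-bit index into a smaller range. The sketch below assumes folding by modulo, a common choice for hashed fingerprints; this page does not state how e3fp folds, and `fold_indices` is a hypothetical helper rather than part of the module.

```python
def fold_indices(indices, bits):
    """Fold a collection of on-bit indices into the range [0, bits)."""
    if bits <= 0:
        raise ValueError("bits must be a positive integer")
    # Indices that collide after folding collapse into a single on bit.
    return sorted({index % bits for index in indices})


folded = fold_indices([5, 1029, 4294967295], bits=1024)
print(folded)
```

Because collisions merge bits, a folded fingerprint can have fewer on bits than the unfolded one.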

get_shells_at_level(level=-1, exact=False, atom_mask={})

Get set of shells at the specified level.

Parameters:
  • level (int or None, optional) – Level/iteration

  • exact (bool, optional) – Exact level

  • atom_mask (int or set of int, optional) – Don’t return shells whose substructures contain these atoms.

Returns:

Shells at level

Return type:

set of Shell
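
The effect of `atom_mask` can be illustrated with plain sets. In this sketch each shell is reduced to a hypothetical `(center, substruct_atoms)` pair; the real `Shell` class is not modeled, and the function name `filter_shells` is an assumption made for illustration. Accepting either an int or a collection mirrors the documented parameter type.

```python
def filter_shells(shells, atom_mask=()):
    """Drop shells whose substructure contains any masked atom."""
    # Accept a single atom id as well as a collection of ids.
    if isinstance(atom_mask, int):
        atom_mask = {atom_mask}
    masked = set(atom_mask)
    return [
        (center, atoms)
        for center, atoms in shells
        if not masked & set(atoms)
    ]


shells = [(0, {0, 1}), (1, {0, 1, 2}), (3, {3})]
kept = filter_shells(shells, atom_mask=2)
print(kept)
```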

initialize_conformer(conf)

Retrieve atom coordinates and instantiate shells generator.

Parameters:

conf (RDKit Conformer) – Conformer to fingerprint

initialize_identifiers()

Set initial identifiers for atoms.

initialize_mol(mol)

Set general properties of mol that apply to all its conformers.

Parameters:

mol (RDKit Mol) – Input molecule Mol object.

next()

Run next iteration of fingerprinting.

reset()

Clear all variables associated with the last run.

reset_conf()

Clear only conformer-specific variables.

reset_mol()

Clear all variables associated with the molecule.

run(conf=None, mol=None)

Generate fingerprint from provided conformer or mol and conf id.

Parameters:
  • conf (RDKit Conformer or int, optional) – Input conformer, or ID of a conformer in mol.

  • mol (RDKit Mol, optional) – Input molecule object, with at least one conformer. If conf not specified, first conformer is used.

substructs_to_pdb(level=None, bits=None, out_dir='substructs', reorient=True, exact=False)

Save all accepted substructs from current level to PDB.

Parameters:
  • level (int or None, optional) – Level of fingerprinting/number of iterations

  • bits (int or None, optional) – Folding level of identifiers

  • out_dir (str, optional) – Directory to which to save PDB files.

  • reorient (bool, optional) – Reorient substructure to match stereo quadrants.

class ShellsGenerator(conf, atoms, radius_multiplier=0.5, include_disconnected=True, atom_coords=None, bound_atoms_dict=None)

Bases: object

Generate nested Shell objects from molecule upon request.

back()

Back up one iteration.

get_match_atoms(rad)

Get atoms within shell at radius rad.

Parameters:

rad (float) – Radius of shell.

Returns:

Dict matching atom id to set of ids for other atoms within shell

Return type:

dict
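
The returned structure can be illustrated with a brute-force distance check. This is a minimal sketch, assuming coordinates are given as a dict of atom id to `(x, y, z)` and that atoms exactly at distance `rad` count as inside the shell; the name `match_atoms` is hypothetical and the method's actual implementation is not shown here.

```python
import math


def match_atoms(atom_coords, rad):
    """Map each atom id to the ids of the other atoms within distance rad."""
    matches = {atom: set() for atom in atom_coords}
    atoms = sorted(atom_coords)
    for i, a in enumerate(atoms):
        for b in atoms[i + 1:]:
            # Distances are symmetric, so record the pair in both directions.
            if math.dist(atom_coords[a], atom_coords[b]) <= rad:
                matches[a].add(b)
                matches[b].add(a)
    return matches


coords = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0), 2: (3.0, 0.0, 0.0)}
nearby = match_atoms(coords, rad=1.5)
print(nearby)
```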

get_shells_at_level(level)

Get dict of atom shells at specified level/iteration.

If not run to level, raises IndexError.

Parameters:

level (int) – Level/iteration from which to retrieve shells dict.

Returns:

Dict matching atom ids to that atom’s Shell at that level.

Return type:

dict
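
The "generate on request, raise IndexError if the level has not been reached" behavior can be pictured with a toy cache. The class `LevelCache` and its placeholder payload are hypothetical; this sketch models only the level bookkeeping, not shell construction.

```python
class LevelCache:
    """Toy stand-in that builds one level per call to next()."""

    def __init__(self):
        self.levels = []  # levels[i] holds the result for iteration i

    def next(self):
        # Placeholder payload; a real generator would build Shell objects.
        self.levels.append({"level": len(self.levels)})
        return self.levels[-1]

    def get_at_level(self, level):
        # Refuse levels that have not been generated yet.
        if not 0 <= level < len(self.levels):
            raise IndexError("level {} has not been generated".format(level))
        return self.levels[level]


cache = LevelCache()
cache.next()
cache.next()
print(cache.get_at_level(1))
```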

next()

Get next iteration’s dict of atom shells.

atom_tuples_from_shell(shell, atom_coords, connectivity, stereo)

Generate sorted atom tuples for neighboring atoms.

Parameters:
  • shell (Shell) – Shell for which to build atom tuples

  • atom_coords (dict) – Dict matching atom ids to coords.

  • connectivity (dict) – Dict matching atom id pair tuples to their bond order (5 for unbound).

  • stereo (bool) – Add stereo indicators to tuples

bound_atoms_from_mol(mol, atoms)

Build dict matching atom id to ids of bound atoms.

Bound atoms not in atoms are ignored.

Parameters:
  • mol (RDKit Mol) – Input mol

  • atoms (list of int) – List of atom IDs

Returns:

Dict matching atom id to set of bound atom ids.

Return type:

dict

coords_from_atoms(atoms, conf)

Build dict matching atom id to coordinates.

Parameters:
  • atoms (list of int) – Atom ids

  • conf (RDKit Conformer) – Conformer from which to fetch coordinates

Returns:

Dict matching atom id to 1-D array of coordinates.

Return type:

dict

get_first_unique_tuple_inds(tuples_list, num_ret, ignore=[], assume_sorted=True)

Return indices of first num_ret unique tuples in a list.

Only first 2 values of each tuple are considered.

Parameters:
  • tuples_list (list of tuple) – List of tuples. Only the first two values of each tuple are considered when testing for uniqueness.

  • num_ret (int) – Maximum number of first unique tuples to return.

  • ignore (list, optional) – Indices for tuples not to be considered as unique.

  • assume_sorted (bool, optional) – If True, assume list is already sorted by tuples.

Returns:

List of at most num_ret ints indicating index of unique tuples in list.

Return type:

tuple of int
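
One way to read this description is sketched below. It assumes that a tuple is "unique" when no other tuple in the list shares its first two values, and that tuples sharing a key with an ignored index are excluded; both are interpretations, not confirmed behavior. The `assume_sorted` option is omitted, and the name `first_unique_tuple_inds` is hypothetical.

```python
from collections import Counter


def first_unique_tuple_inds(tuples_list, num_ret, ignore=()):
    """Return indices of the first num_ret tuples whose key occurs once."""
    # Only the first two values of each tuple form the comparison key.
    keys = [tup[:2] for tup in tuples_list]
    counts = Counter(keys)
    # Keys of ignored tuples are never treated as unique.
    ignored_keys = {keys[i] for i in ignore}
    unique = [
        i for i, key in enumerate(keys)
        if counts[key] == 1 and key not in ignored_keys
    ]
    return tuple(unique[:num_ret])


tuples_list = [(1, 2, "a"), (1, 2, "b"), (1, 3, "c"), (2, 1, "d")]
inds = first_unique_tuple_inds(tuples_list, num_ret=2)
print(inds)
```

The first two tuples share the key `(1, 2)`, so neither counts as unique under this reading.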

hash_int64_array(array, seed=0)

Hash an int64 array into a 32-bit integer.

Parameters:
  • array (ndarray of int64) – Numpy array containing integers

  • seed (any, optional) – Seed for MurmurHash3.

Returns:

32-bit integer

Return type:

int

identifier_from_shell(shell, atom_coords, connectivity, level, stereo)

Determine new identifier for a shell at a specific level.

Parameters:
  • shell (Shell) – Shell for which to determine identifier

  • atom_coords (dict) – Dict matching atom ids to coords.

  • connectivity (dict) – Dict matching atom id pair tuples to their bond order (5 for unbound).

  • level (int) – Level/iteration

  • stereo (bool) – Add stereo indicators

identifiers_from_invariants(mol, atoms, rdkit_invariants=False)

Initialize ids according to Daylight invariants.

Parameters:
  • mol (RDKit Mol) – Input molecule

  • atoms (list of int) – IDs for atoms in mol for which to generate identifiers.

  • rdkit_invariants (bool, optional) – Use the atom invariants used by RDKit for its Morgan fingerprint.

Returns:

Initial identifiers for atoms

Return type:

ndarray of int64

invariants_from_atom(atom)

Get seven invariants from atom.

Invariants used are the six Daylight invariants, plus an indicator of whether the atom is in a ring, as detailed in [1].

References

  1. D Rogers, M Hahn. J. Chem. Inf. Model., 2010, 50 (5), pp 742-754 https://doi.org/10.1021/ci100050t

Parameters:

atom (RDKit Atom) – Input atom

Returns:

Array of 7 invariants

Return type:

1-D array of int64

pick_y(atom_tuples, cent_coords, y_precision=0.1)

Pick a y-coordinate from atom tuples or mean coordinate.

Parameters:
  • atom_tuples (list of tuple) – Sorted list of atom tuples

  • cent_coords (Nx3 array of float) – Coordinates of atoms with center atom at origin.

  • y_precision (float, optional) – For mean to be chosen for y-coordinate, it must be at least this distance from the origin. Useful when atoms are symmetrical around the center atom, where a slight shift in any atom results in a very different y.

Returns:

  • 1x3 array of float or None (y-coordinate)

  • int or None (index to y-atom, if y was chosen from the atoms.)

pick_z(connectivity, identifiers, cent_coords, y, long_angle, z_precision=0.01)

Pick a z-coordinate orthogonal to y.

Parameters:
  • connectivity (dict) – Dict matching atom id pair tuples to their bond order (5 for unbound).

  • identifiers (iterable of int) – Atom identifiers

  • cent_coords (Nx3 array of float) – Coordinates of atoms with center atom at origin.

  • y (1x3 array of float) – y-coordinate

  • long_angle (Nx1 array of float) – Absolute angle of atoms from orthogonal to y.

  • z_precision (float, optional) – Minimum difference in long_angle between two potential z-atoms. Used as a tie breaker to prevent small shift in one atom resulting in very different z.

Returns:

z-coordinate

Return type:

1x3 array of float or None

quad_indicators_from_coords(cent_coords, y, y_ind, z, long_sign)

Create angle indicators for four quadrants in each hemisphere.

Parameters:
  • cent_coords (Nx3 array of float) – Array of centered coordinates.

  • y (1-D array of float) – Vector lying along y-axis.

  • y_ind (int) – Index of cent_coords corresponding to y.

  • z (1-D array of float) – Vector lying along z-axis

  • long_sign (Nx1 array of int) – Array of signs of vectors in cent_coords indicating whether they are above (+1) or below (-1) the xz-plane.

Returns:

Quadrant indicators. Clockwise from z around y, indicators are 2, 3, 4, 5 for vectors above the xz-plane and -2, -3, -4, -5 for vectors below the xz-plane.

Return type:

Nx1 array of int

rdkit_invariants_from_atom(atom)

Get the 6 atom invariants RDKit uses for its Morgan fingerprints.

Parameters:

atom (RDKit Atom) – Input atom

Returns:

Array of 6 invariants

Return type:

1-D array of int64

signed_to_unsigned_int(a, bits=4294967296)

Convert int between +/-bits to an int between 0 and bits.

Parameters:
  • a (int or ndarray of int) – Integer

  • bits (int, optional) – Maximum size of int. E.g. 32-bit is 2^32.

Returns:

Unsigned integer

Return type:

int
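
The conversion can be sketched with Python's modulo operator. This is a minimal scalar-only illustration under the assumption that negative values wrap around to the top of the range; the name `to_unsigned` is hypothetical and ndarray inputs are not handled here.

```python
def to_unsigned(a, bits=2**32):
    """Map an int in (-bits, bits) onto the range [0, bits)."""
    # Python's % returns a non-negative result for a positive modulus,
    # so negative inputs wrap around to the top of the range.
    return a % bits


print(to_unsigned(-1))
print(to_unsigned(7))
```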

stereo_indicators_from_shell(shell, atom_tuples, atom_coords_dict, add_transform_to_shell=True)

Get list of int indicating location of atoms on unit sphere.

Parameters:
  • shell (Shell) – Shell for which to get stereo indicators.

  • atom_tuples (list of tuple) – List of atom tuples.

  • atom_coords_dict (dict) – Dict matching atom ids to coords.

  • add_transform_to_shell (bool, optional) – Calculate transformation matrix to align coordinates to unit sphere, and add to shell.

Returns:

Stereo indicators for atoms in atom_tuples.

Return type:

list of int