e3fp.fingerprint.fprinter module

Tools for generating E3FP fingerprints.

Author: Seth Axen E-mail: seth.axen@gmail.com

class Fingerprinter(bits=4294967296, level=5, radius_multiplier=1.718, stereo=True, counts=False, include_disconnected=True, rdkit_invariants=False, exclude_floating=True, remove_duplicate_substructs=True)

Bases: object

E3FP fingerprint generator.

Parameters
  • bits (int or None, optional) – Maximum number of bits to which to fold returned fingerprint. Multiple of 2 is strongly recommended.

  • level (int or None, optional) – Maximum number of iterations for fingerprint generation. If None or -1, run until no new substructures are identified. Because this could produce a different final level number for each conformer, it is recommended to manually specify a level.

  • radius_multiplier (float, optional) – Multiple by which to increase shell size. At iteration 0, shell radius is 0*`radius_multiplier`; at iteration 2, radius is 2*`radius_multiplier`, etc.

  • counts (bool, optional) – Instead of simple bit-based Fingerprint object, generate CountFingerprint that tracks number of times each bit appears in a fingerprint.

  • stereo (bool, optional) – Differentiate based on stereochemistry. Resulting fingerprints are not comparable to non-stereo fingerprints.

  • remove_duplicate_substructs (bool, optional) – If a substructure arises that corresponds to an identifier already in the fingerprint, then the identifier for the duplicate substructure is not added to fingerprint.

  • include_disconnected (bool, optional) – Include disconnected atoms in hashes and substructure. E3FP’s advantage over ECFP relies on disconnected atoms, so the option to turn this off is present only for testing/comparison.

  • rdkit_invariants (bool, optional) – Use the atom invariants used by RDKit for its Morgan fingerprint.

  • exclude_floating (bool, optional) – Mask atoms with no bonds (usually floating ions) from the fingerprint. These are often placed arbitrarily and can confound the fingerprint.
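
The rule stated for radius_multiplier can be illustrated with a minimal sketch. The helper name `shell_radii` is hypothetical and not part of e3fp; it only reproduces the documented relationship that the shell radius at iteration i is i * radius_multiplier.

```python
def shell_radii(level, radius_multiplier=1.718):
    """Return the shell radius used at each iteration from 0 to `level`.

    Hypothetical helper (not part of e3fp): follows the documented rule
    radius = iteration * radius_multiplier.
    """
    return [i * radius_multiplier for i in range(level + 1)]


radii = shell_radii(level=5)
print(radii)
```

With the default level of 5 this gives six radii, starting at 0 for the atom-only iteration.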

Variables
  • current_level (int) – The maximum level/iteration to which the fingerprinter has been run on the current conformer.

  • level_shells (dict) – Dict matching level to set of all shells accepted at that level.

property current_level
get_fingerprint_at_level(level=-1, bits=None, exact=False, atom_mask={})

Get the fingerprint at the specified level.

Parameters
  • level (int or None, optional) – Level/iteration

  • bits (int or None, optional) – Return fingerprints folded to this number of bits. If unspecified, defaults to bits set when instantiated.

  • exact (bool, optional) – Exact level

  • atom_mask (int or set of int, optional) – Don’t return shells whose substructures contain these atoms.

Returns

Fingerprint at level

Return type

Fingerprint
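
Folding to a given number of bits can be pictured as mapping each sparse index into a smaller range. The sketch below assumes folding by modulo, a common scheme for hashed fingerprints; `fold_indices` and the sample indices are illustrative and not e3fp API. It also shows how collisions produce counts greater than one, which is the information a count-based fingerprint retains.

```python
from collections import Counter


def fold_indices(indices, bits):
    """Fold sparse fingerprint indices into the range [0, bits).

    Assumption: folding is done with a modulo, so indices that differ by a
    multiple of `bits` collide on the same folded position.
    """
    return Counter(i % bits for i in indices)


# Made-up 32-bit identifiers, for illustration only.
indices = [5, 1029, 2048, 4294967295]
folded = fold_indices(indices, bits=1024)

on_bits = sorted(folded)   # bit-style view: which positions are set
counts = dict(folded)      # count-style view: how often each position is hit
print(on_bits, counts)
```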

get_shells_at_level(level=-1, exact=False, atom_mask={})

Get set of shells at the specified level.

Parameters
  • level (int or None, optional) – Level/iteration

  • exact (bool, optional) – Exact level

  • atom_mask (int or set of int, optional) – Don’t return shells whose substructures contain these atoms.

Returns

Shells at level

Return type

set of Shell
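
The effect of atom_mask can be sketched with plain sets. This is an illustration under assumptions, not the library's code: `filter_shells` is a hypothetical name, and each shell is reduced to the set of atom ids in its substructure.

```python
def filter_shells(shells, atom_mask=()):
    """Drop shells whose substructure contains any masked atom.

    `shells` maps a hypothetical shell label to the set of atom ids in its
    substructure; `atom_mask` may be a single int or an iterable of ints.
    """
    if isinstance(atom_mask, int):
        atom_mask = {atom_mask}
    mask = set(atom_mask)
    return {name: atoms for name, atoms in shells.items()
            if atoms.isdisjoint(mask)}


shells = {"a": {0, 1}, "b": {1, 2, 3}, "c": {4}}
kept = filter_shells(shells, atom_mask=1)
print(sorted(kept))
```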

initialize_conformer(conf)

Retrieve atom coordinates and instantiate shells generator.

Parameters

conf (RDKit Conformer) – Conformer to fingerprint

initialize_identifiers()

Set initial identifiers for atoms.

initialize_mol(mol)

Set general properties of mol that apply to all its conformers.

Parameters

mol (RDKit Mol) – Input molecule Mol object.

next()

Run next iteration of fingerprinting.

reset()

Clear all variables associated with the last run.

reset_conf()

Clear only conformer-specific variables.

reset_mol()

Clear all variables associated with the molecule.

run(conf=None, mol=None, return_substruct=False)

Generate fingerprint from provided conformer or mol and conf id.

Parameters
  • conf (RDKit Conformer or int, optional) – Input conformer, or ID of a conformer in mol.

  • mol (RDKit Mol, optional) – Input molecule object, with at least one conformer. If conf not specified, first conformer is used.

  • return_substruct (bool, optional) – Return dict mapping fingerprint indices to substructures. Keys are indices, values are lists of substructures, each represented as a tuple of atom indices where the first index is the central atom and the remaining indices (within the sphere) are sorted.
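
The substructure representation described for return_substruct can be sketched as follows. The helper `substruct_tuple`, the fingerprint indices, and the atom indices are all made up for illustration; only the tuple layout (central atom first, remaining atoms sorted) comes from the description above.

```python
def substruct_tuple(center, members):
    """Represent a substructure as (center, *sorted other atom indices)."""
    others = sorted(set(members) - {center})
    return (center,) + tuple(others)


# Hypothetical mapping from fingerprint index to the substructures that set it.
index_to_substructs = {
    1017: [substruct_tuple(3, {5, 3, 1})],
    42: [substruct_tuple(0, {0}), substruct_tuple(7, {9, 8, 7})],
}
print(index_to_substructs)
```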

substructs_to_pdb(level=None, bits=None, out_dir='substructs', reorient=True, exact=False)

Save all accepted substructs from current level to PDB.

Parameters
  • level (int or None, optional) – Level of fingerprinting/number of iterations

  • bits (int or None, optional) – Folding level of identifiers

  • out_dir (str, optional) – Directory to which to save PDB files.

  • reorient (bool, optional) – Reorient substructure to match stereo quadrants.

class ShellsGenerator(conf, atoms, radius_multiplier=0.5, include_disconnected=True, atom_coords=None, bound_atoms_dict=None)

Bases: object

Generate nested Shell objects from molecule upon request.

back()

Back up one iteration.

get_match_atoms(rad)

Get atoms within shell at radius rad.

Parameters

rad (float) – Radius of shell.

Returns

Dict matching atom id to set of ids for other atoms within shell

Return type

dict
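
A minimal sketch of the returned structure, using plain coordinate tuples instead of an RDKit conformer. It assumes that "within shell" means Euclidean distance less than or equal to rad; the function name `match_atoms` and the coordinates are illustrative.

```python
import math


def match_atoms(atom_coords, rad):
    """Map each atom id to the set of other atom ids within distance `rad`.

    Assumption: "within" means Euclidean distance <= rad.
    """
    matches = {atom: set() for atom in atom_coords}
    ids = list(atom_coords)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if math.dist(atom_coords[a], atom_coords[b]) <= rad:
                matches[a].add(b)
                matches[b].add(a)
    return matches


coords = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0), 2: (0.0, 3.0, 0.0)}
near = match_atoms(coords, rad=1.718)
print(near)
```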

get_shells_at_level(level)

Get dict of atom shells at specified level/iteration.

If not run to level, raises IndexError.

Parameters

level (int) – Level/iteration from which to retrieve shells dict.

Returns

Dict matching atom ids to that atom’s Shell at that level.

Return type

dict

next()

Get next iteration’s dict of atom shells.

atom_tuples_from_shell(shell, atom_coords, connectivity, stereo)

Generate sorted atom tuples for neighboring atoms.

Parameters
  • shell (Shell) – Shell for which to build atom tuples

  • atom_coords (dict) – Dict matching atom ids to coords.

  • connectivity (dict) – Dict matching atom id pair tuples to their bond order (5 for unbound).

  • stereo (bool) – Add stereo indicators to tuples

bound_atoms_from_mol(mol, atoms)

Build dict matching atom id to ids of bound atoms.

Bound atoms not in atoms are ignored.

Parameters
  • mol (RDKit Mol) – Input mol

  • atoms (list of int) – List of atom IDs

Returns

Dict matching atom id to set of bound atom ids.

Return type

dict
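
The shape of the result can be sketched without RDKit by standing in a list of bonded atom-id pairs for the Mol. The name `bound_atoms_from_bonds` and the bond list are assumptions made for illustration.

```python
def bound_atoms_from_bonds(bonds, atoms):
    """Map each atom id in `atoms` to the set of its bound atom ids.

    `bonds` is an iterable of (begin_id, end_id) pairs, standing in for the
    bonds of an RDKit Mol. Bonds that touch an atom outside `atoms` are skipped.
    """
    allowed = set(atoms)
    bound = {atom: set() for atom in atoms}
    for a, b in bonds:
        if a in allowed and b in allowed:
            bound[a].add(b)
            bound[b].add(a)
    return bound


bonds = [(0, 1), (1, 2), (2, 3)]
bound = bound_atoms_from_bonds(bonds, atoms=[0, 1, 2])
print(bound)
```

Atom 3 is absent from atoms, so the bond between atoms 2 and 3 leaves no trace in the result.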

coords_from_atoms(atoms, conf)

Build dict matching atom id to coordinates.

Parameters
  • atoms (list of int) – Atom ids

  • conf (RDKit Conformer) – Conformer from which to fetch coordinates

Returns

Dict matching atom id to 1-D array of coordinates.

Return type

dict

get_first_unique_tuple_inds(tuples_list, num_ret, ignore=[], assume_sorted=True)

Return indices of first num_ret unique tuples in a list.

Only first 2 values of each tuple are considered.

Parameters
  • tuples_list (list of tuple) – List of tuples. Only first two unique values are considered.

  • num_ret (int) – Maximum number of first unique tuples to return.

  • ignore (list, optional) – Indices for tuples not to be considered as unique.

  • assume_sorted (bool, optional) – If True, assume list is already sorted by tuples.

Returns

List of at most num_ret ints indicating index of unique tuples in list.

Return type

tuple of int
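
One plausible reading of this behavior is sketched below. It assumes that a tuple is "unique" when no other tuple shares its first two values, and that any key found at an index in ignore never counts as unique. The function name is hypothetical, and the sketch does not model assume_sorted.

```python
from collections import Counter


def first_unique_tuple_inds(tuples_list, num_ret, ignore=()):
    """Return indices of the first `num_ret` tuples whose first two values
    occur only once in `tuples_list`.

    Assumptions: "unique" means no other tuple shares the same first two
    values, and any key found at an index in `ignore` never counts as unique.
    """
    keys = [tup[:2] for tup in tuples_list]
    seen = Counter(keys)
    blocked = {keys[i] for i in ignore}
    unique = [i for i, key in enumerate(keys)
              if seen[key] == 1 and key not in blocked]
    return tuple(unique[:num_ret])


tuples_list = [(1, 7, "a"), (1, 7, "b"), (2, 3, "c"), (2, 4, "d"), (5, 5, "e")]
inds = first_unique_tuple_inds(tuples_list, num_ret=2)
print(inds)
```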

hash_int64_array(array, seed=0)

Hash an int64 array into a 32-bit integer.

Parameters
  • array (ndarray of int64) – Numpy array containing integers

  • seed (any, optional) – Seed for MurmurHash3.

Returns

32-bit integer

Return type

int
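
The general idea of reducing an int64 array to a 32-bit value can be sketched with the standard library alone. Note the substitution: CRC-32 (`zlib.crc32`) is used here in place of MurmurHash3, so the numbers will not match anything e3fp produces. The function name `hash_int64_sequence` is hypothetical.

```python
import struct
import zlib


def hash_int64_sequence(values, seed=0):
    """Hash a sequence of signed 64-bit integers into an unsigned 32-bit int.

    Stand-in only: CRC-32 from the standard library replaces MurmurHash3, so
    the values will not match those produced by e3fp.
    """
    data = struct.pack("<%dq" % len(values), *values)  # little-endian int64
    return zlib.crc32(data, seed) & 0xFFFFFFFF


h = hash_int64_sequence([1, -2, 3])
print(h)
```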

identifier_from_shell(shell, atom_coords, connectivity, level, stereo)

Determine new identifier for a shell at a specific level.

Parameters
  • shell (Shell) – Shell for which to determine identifier

  • atom_coords (dict) – Dict matching atom ids to coords.

  • connectivity (dict) – Dict matching atom id pair tuples to their bond order (5 for unbound).

  • level (int) – Level/iteration

  • stereo (bool) – Add stereo indicators

identifiers_from_invariants(mol, atoms, rdkit_invariants=False)

Initialize ids according to Daylight invariants.

Parameters
  • mol (RDKit Mol) – Input molecule

  • atoms (list of int) – IDs for atoms in mol for which to generate identifiers.

  • rdkit_invariants (bool, optional) – Use the atom invariants used by RDKit for its Morgan fingerprint.

Returns

Initial identifiers for atoms

Return type

ndarray of int64

invariants_from_atom(atom)

Get seven invariants from atom.

Invariants used are the six Daylight invariants, plus an indicator of whether the atom is in a ring, as detailed in [1].

References

  1. D Rogers, M Hahn. J. Chem. Inf. Model., 2010, 50 (5), pp 742-754 https://doi.org/10.1021/ci100050t

Parameters

atom (RDKit Atom) – Input atom

Returns

Array of 7 invariants

Return type

1-D array of int64

pick_y(atom_tuples, cent_coords, y_precision=0.1)

Pick a y-coordinate from atom tuples or mean coordinate.

Parameters
  • atom_tuples (list of tuple) – Sorted list of atom tuples

  • cent_coords (Nx3 array of float) – Coordinates of atoms with center atom at origin.

  • y_precision (float, optional) – For mean to be chosen for y-coordinate, it must be at least this distance from the origin. Useful when atoms are symmetrical around the center atom where a slight shift in any atom results in a very different y.

Returns

  • 1x3 array of float or None (y-coordinate)

  • int or None (index to y-atom, if y was chosen from the atoms.)
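
The role of y_precision can be illustrated in isolation. The sketch below models only the distance check described for that parameter, not the full selection logic of pick_y; `mean_as_y` is a hypothetical name and the coordinates are made up.

```python
import math


def mean_as_y(cent_coords, y_precision=0.1):
    """Return the mean of centered coordinates if it is far enough from the
    origin to serve as a y-vector, else None.

    Hypothetical helper: only the distance check described for `y_precision`
    is modeled here, not the full selection logic.
    """
    n = len(cent_coords)
    mean = tuple(sum(c[k] for c in cent_coords) / n for k in range(3))
    if math.hypot(*mean) < y_precision:
        return None  # too close to the origin; direction would be unstable
    return mean


symmetric = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
lopsided = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
print(mean_as_y(symmetric), mean_as_y(lopsided))
```

For the symmetric pair the mean sits at the origin and is rejected, which is the case the parameter description warns about.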

pick_z(connectivity, identifiers, cent_coords, y, long_angle, z_precision=0.01)

Pick a z-coordinate orthogonal to y.

Parameters
  • connectivity (dict) – Dict matching atom id pair tuples to their bond order (5 for unbound).

  • identifiers (iterable of int) – Atom identifiers

  • cent_coords (Nx3 array of float) – Coordinates of atoms with center atom at origin.

  • y (1x3 array of float) – y-coordinate

  • long_angle (Nx1 array of float) – Absolute angle of atoms from orthogonal to y.

  • z_precision (float, optional) – Minimum difference in long_angle between two potential z-atoms. Used as a tie breaker to prevent small shift in one atom resulting in very different z.

Returns

z-coordinate

Return type

1x3 array of float or None

quad_indicators_from_coords(cent_coords, y, y_ind, z, long_sign)

Create angle indicators for four quadrants in each hemisphere.

Parameters
  • cent_coords (Nx3 array of float) – Array of centered coordinates.

  • y (1-D array of float) – Vector lying along y-axis.

  • y_ind (int) – Index of cent_coords corresponding to y.

  • z (1-D array of float) – Vector lying along z-axis

  • long_sign (Nx1 array of int) – Array of signs of vectors in cent_coords indicating whether they are above (+1) or below (-1) the xz-plane.

Returns

Quadrant indicators. Clockwise from z around y, indicators are 2, 3, 4, 5 for vectors above the xz-plane and -2, -3, -4, -5 for vectors below the xz-plane.

Return type

Nx1 array of int

rdkit_invariants_from_atom(atom)

Get the 6 atom invariants RDKit uses for its Morgan fingerprints.

Parameters

atom (RDKit Atom) – Input atom

Returns

Array of 6 invariants

Return type

1-D array of int64

signed_to_unsigned_int(a, bits=4294967296)

Convert int between +/-bits to an int between 0 and bits.

Parameters
  • a (int or ndarray of int) – Integer

  • bits (int, optional) – Maximum size of int. E.g. 32-bit is 2^32.

Returns

Unsigned integer

Return type

int
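
A minimal sketch of such a conversion, assuming that negative values wrap around with a modulo as in two's complement. The name `to_unsigned` is hypothetical and this is not presented as the library's code.

```python
def to_unsigned(a, bits=2**32):
    """Map an int in (-bits, bits) to the range [0, bits).

    Assumption: the conversion wraps negative values with a modulo, as in
    two's complement; this is a sketch, not the library's code.
    """
    return a % bits


print(to_unsigned(-1), to_unsigned(5))
```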

stereo_indicators_from_shell(shell, atom_tuples, atom_coords_dict, add_transform_to_shell=True)

Get list of int indicating location of atoms on unit sphere.

Parameters
  • shell (Shell) – Shell for which to get stereo indicators.

  • atom_tuples (list of tuple) – List of atom tuples.

  • atom_coords_dict (dict) – Dict matching atom ids to coords.

  • add_transform_to_shell (bool, optional) – Calculate transformation matrix to align coordinates to unit sphere, and add to shell.

Returns

Stereo indicators for atoms in atom_tuples.

Return type

list of int