e3fp.fingerprint.fprinter module
Tools for generating E3FP fingerprints.
Author: Seth Axen E-mail: seth.axen@gmail.com
- class Fingerprinter(bits=4294967296, level=5, radius_multiplier=1.718, stereo=True, counts=False, include_disconnected=True, rdkit_invariants=False, exclude_floating=True, remove_duplicate_substructs=True)
Bases: object
E3FP fingerprint generator.
- Parameters
bits (int or None, optional) – Maximum number of bits to which to fold returned fingerprint. Multiple of 2 is strongly recommended.
level (int or None, optional) – Maximum number of iterations for fingerprint generation. If None or -1, run until no new substructures are identified. Because this could produce a different final level number for each conformer, it is recommended to manually specify a level.
radius_multiplier (float, optional) – Multiple by which to increase shell size. At iteration 0, shell radius is 0*`radius_multiplier`, at iteration 2, radius is 2*`radius_multiplier`, etc.
counts (bool, optional) – Instead of a simple bit-based Fingerprint object, generate a CountFingerprint that tracks the number of times each bit appears in a fingerprint.
stereo (bool, optional) – Differentiate based on stereography. Resulting fingerprints are not comparable to non-stereo fingerprints.
remove_duplicate_substructs (bool, optional) – If a substructure arises that corresponds to an identifier already in the fingerprint, then the identifier for the duplicate substructure is not added to fingerprint.
include_disconnected (bool, optional) – Include disconnected atoms in hashes and substructures. E3FP’s advantage over ECFP relies on disconnected atoms, so the option to turn this off is present only for testing/comparison.
rdkit_invariants (bool, optional) – Use the atom invariants used by RDKit for its Morgan fingerprint.
exclude_floating (bool, optional) – Mask atoms with no bonds (usually floating ions) from the fingerprint. These are often placed arbitrarily and can confound the fingerprint.
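As a rough illustration of how level and radius_multiplier interact, the sketch below lists the shell radius at each iteration, assuming the radius at iteration i is simply i * radius_multiplier as described for radius_multiplier above. The helper name shell_radii is hypothetical and is not part of e3fp.

```python
def shell_radii(level, radius_multiplier=1.718):
    """Return the shell radius used at each iteration from 0 to level."""
    # Iteration i uses radius i * radius_multiplier, so iteration 0 has radius 0.
    return [i * radius_multiplier for i in range(level + 1)]


radii = shell_radii(5)
for iteration, radius in enumerate(radii):
    print(f"iteration {iteration}: radius {radius:.3f}")
```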
- Variables
- property current_level
- get_fingerprint_at_level(level=-1, bits=None, exact=False, atom_mask={})
Get the fingerprint at the specified level.
- Parameters
level (int or None, optional) – Level/iteration
bits (int or None, optional) – Return fingerprints folded to this number of bits. If unspecified, defaults to bits set when instantiated.
exact (bool, optional) – Exact level
atom_mask (int or set of int, optional) – Don’t return shells whose substructures contain these atoms.
- Returns
Fingerprint at level
- Return type
Fingerprint
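Folding to a given number of bits can be pictured as mapping each on-bit index into a smaller range. The sketch below assumes folding takes each index modulo bits, which is a common scheme but is an assumption here rather than something this page states. fold_indices is a hypothetical helper, not an e3fp function.

```python
def fold_indices(indices, bits):
    """Fold sparse on-bit indices into the range [0, bits) by modulo."""
    # Distinct indices may collide after folding, so the result can shrink.
    return sorted({index % bits for index in indices})


unfolded = [5, 1029, 2048, 4294967295]
print(fold_indices(unfolded, 1024))
```

In this sketch, indices 5 and 1029 land on the same folded bit, which is the kind of collision that becomes more frequent as bits decreases.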
- get_shells_at_level(level=-1, exact=False, atom_mask={})
Get set of shells at the specified level.
- Parameters
level (int or None, optional) – Level/iteration
exact (bool, optional) – Exact level
atom_mask (int or set of int, optional) – Don’t return shells whose substructures contain these atoms.
- Returns
Shells at level
- Return type
set of Shell
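The effect of atom_mask can be sketched with plain sets. Here each shell is reduced to a hypothetical label mapped to the atom ids in its substructure; the real Shell objects are richer, and filter_shells is an illustrative name only.

```python
def filter_shells(shells, atom_mask=frozenset()):
    """Drop shells whose substructure contains any masked atom.

    `shells` maps a hypothetical shell label to the set of atom ids
    in that shell's substructure.
    """
    # Accept a single atom id as well as a set of ids.
    if isinstance(atom_mask, int):
        atom_mask = {atom_mask}
    mask = set(atom_mask)
    return {
        label: atoms
        for label, atoms in shells.items()
        if not (set(atoms) & mask)
    }


shells = {"a": {0, 1, 2}, "b": {2, 3}, "c": {4}}
print(filter_shells(shells, atom_mask=2))
```

The sketch uses an immutable default for atom_mask so that the default cannot be modified between calls.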
- initialize_conformer(conf)
Retrieve atom coordinates and instantiate shells generator.
- Parameters
conf (RDKit Conformer) – Conformer to fingerprint
- initialize_mol(mol)
Set general properties of mol that apply to all its conformers.
- Parameters
mol (RDKit Mol) – Input molecule Mol object.
- next()
Run next iteration of fingerprinting.
- run(conf=None, mol=None, return_substruct=False)
Generate fingerprint from provided conformer or mol and conf id.
- Parameters
conf (RDKit Conformer or int, optional) – Input conformer, or id of a conformer in mol.
mol (RDKit Mol, optional) – Input molecule object, with at least one conformer. If conf is not specified, the first conformer is used.
return_substruct (bool, optional) – Return dict mapping fingerprint indices to substructures. Keys are indices, values are lists of substructures, each represented as a tuple of atom indices where the first index is the central atom and the remaining indices (within the sphere) are sorted.
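The shape of the mapping described for return_substruct can be sketched as follows. The helper substruct_tuple and the example indices are hypothetical and only illustrate the "central atom first, remaining atoms sorted" convention.

```python
def substruct_tuple(center, members):
    """Represent a substructure as (center, *sorted other atom ids)."""
    others = sorted(set(members) - {center})
    return (center, *others)


# Hypothetical mapping in the shape described above:
# fingerprint index -> list of substructure tuples.
index_to_substructs = {
    1023: [substruct_tuple(3, [7, 3, 1])],
    17: [substruct_tuple(0, [0]), substruct_tuple(5, [2, 9])],
}
print(index_to_substructs)
```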
- substructs_to_pdb(level=None, bits=None, out_dir='substructs', reorient=True, exact=False)
Save all accepted substructs from current level to PDB.
- Parameters
level (int or None, optional) – Level of fingerprinting/number of iterations
bits (int or None, optional) – Folding level of identifiers
out_dir (str, optional) – Directory to which to save PDB files.
reorient (bool, optional) – Reorient substructure to match stereo quadrants.
- class ShellsGenerator(conf, atoms, radius_multiplier=0.5, include_disconnected=True, atom_coords=None, bound_atoms_dict=None)
Bases: object
Generate nested Shell objects from molecule upon request.
- get_match_atoms(rad)
Get atoms within shell at radius rad.
- Parameters
rad (float) – Radius of shell.
- Returns
Dict matching atom id to set of ids for other atoms within shell
- Return type
dict
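A minimal sketch of the kind of lookup this method describes, using a brute-force distance check over a plain dict of coordinates. The function name match_atoms_within, the input format, and the inclusive boundary are assumptions of this sketch; the actual implementation may differ.

```python
import math


def match_atoms_within(coords, rad):
    """Map each atom id to the ids of other atoms within distance rad."""
    matches = {atom: set() for atom in coords}
    ids = sorted(coords)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            # Treating the boundary as inclusive is an assumption here.
            if math.dist(coords[a], coords[b]) <= rad:
                matches[a].add(b)
                matches[b].add(a)
    return matches


coords = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0), 2: (0.0, 3.0, 0.0)}
print(match_atoms_within(coords, rad=1.5))
```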
- get_shells_at_level(level)
Get dict of atom shells at specified level/iteration.
If not run to level, raises IndexError.
- Parameters
level (int) – Level/iteration from which to retrieve shells dict.
- Returns
Dict matching atom ids to that atom’s Shell at that level.
- Return type
dict
- next()
Get next iteration’s dict of atom shells.
- atom_tuples_from_shell(shell, atom_coords, connectivity, stereo)
Generate sorted atom tuples for neighboring atoms.
- Parameters
shell (Shell) – Shell for which to build atom tuples
atom_coords (dict) – Dict matching atom ids to coords.
connectivity (dict) – Dict matching atom id pair tuples to their bond order (5 for unbound).
stereo (bool) – Add stereo indicators to tuples
- bound_atoms_from_mol(mol, atoms)
Build dict matching atom id to ids of bound atoms.
Bound atoms not in atoms are ignored.
- Parameters
mol (RDKit Mol) – Input mol
atoms (list of int) – List of atom IDs
- Returns
Dict matching atom id to set of bound atom ids.
- Return type
dict
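The behavior described above can be sketched without RDKit by starting from a plain list of bonded atom-id pairs. The function name bound_atoms_from_bonds and its bond-list input are hypothetical stand-ins for the RDKit Mol.

```python
def bound_atoms_from_bonds(bonds, atoms):
    """Map each atom id in atoms to the set of bound atom ids also in atoms."""
    keep = set(atoms)
    bound = {atom: set() for atom in atoms}
    for a, b in bonds:
        # Bonds to atoms outside `atoms` are ignored, as described above.
        if a in keep and b in keep:
            bound[a].add(b)
            bound[b].add(a)
    return bound


bonds = [(0, 1), (1, 2), (2, 3)]
print(bound_atoms_from_bonds(bonds, atoms=[0, 1, 2]))
```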
- coords_from_atoms(atoms, conf)
Build dict matching atom id to coordinates.
- Parameters
atoms (list of int) – Atom ids
conf (RDKit Conformer) – Conformer from which to fetch coordinates
- Returns
Dict matching atom id to 1-D array of coordinates.
- Return type
dict
- get_first_unique_tuple_inds(tuples_list, num_ret, ignore=[], assume_sorted=True)
Return indices of first num_ret unique tuples in a list.
Only first 2 values of each tuple are considered.
- Parameters
tuples_list (list of tuple) – List of tuples. Only first two unique values are considered.
num_ret (int) – Maximum number of first unique tuples to return.
ignore (list, optional) – Indices of tuples not to be considered as unique.
assume_sorted (bool, optional) – If True, assume list is already sorted by tuples.
- Returns
List of at most num_ret ints indicating index of unique tuples in list.
- Return type
tuple of int
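One plausible reading of this function is sketched below: a tuple counts as unique when no other tuple in the list shares its first two values, and tuples listed in ignore are never counted. That interpretation, and the name first_unique_tuple_inds, are assumptions of the sketch; it also does not use the assume_sorted shortcut.

```python
from collections import Counter


def first_unique_tuple_inds(tuples_list, num_ret, ignore=()):
    """Return indices of the first num_ret tuples with a unique 2-value prefix."""
    prefix_counts = Counter(tup[:2] for tup in tuples_list)
    # Prefixes of ignored tuples are never treated as unique.
    ignored_prefixes = {tuples_list[i][:2] for i in ignore}
    unique_inds = [
        i
        for i, tup in enumerate(tuples_list)
        if prefix_counts[tup[:2]] == 1 and tup[:2] not in ignored_prefixes
    ]
    return tuple(unique_inds[:num_ret])


tuples_list = [(1, 2, 10), (1, 2, 11), (1, 3, 12), (2, 1, 13), (2, 4, 14)]
print(first_unique_tuple_inds(tuples_list, num_ret=2))
```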
- hash_int64_array(array, seed=0)
Hash an int64 array into a 32-bit integer.
- Parameters
array (ndarray of int64) – Numpy array containing integers
seed (any, optional) – Seed for MurmurHash3.
- Returns
32-bit integer
- Return type
int
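To make the hashing step concrete, here is a pure-Python sketch of the 32-bit MurmurHash3 (x86 variant) applied to the bytes of a sequence of 64-bit integers. The little-endian byte layout, the integer seed, and the conversion to a signed result are assumptions of this sketch; murmur3_32 and hash_int64_values are hypothetical names, and the library itself may rely on a compiled MurmurHash3 implementation instead.

```python
import struct

MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data, seed=0):
    """Unsigned 32-bit MurmurHash3 (x86 variant) of a bytes object."""
    h = seed & MASK32
    n_full = len(data) - len(data) % 4
    # Mix each complete 4-byte little-endian block into the hash state.
    for i in range(0, n_full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = _rotl32((k * 0xCC9E2D51) & MASK32, 15)
        k = (k * 0x1B873593) & MASK32
        h = _rotl32(h ^ k, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Mix in the remaining 1 to 3 bytes, if any.
    tail = data[n_full:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = _rotl32((k * 0xCC9E2D51) & MASK32, 15)
        k = (k * 0x1B873593) & MASK32
        h ^= k
    # Finalization: fold in the length, then avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def hash_int64_values(values, seed=0):
    """Hash a sequence of int64 values to a signed 32-bit integer."""
    # Assumption: values are laid out as little-endian signed 64-bit ints.
    data = struct.pack(f"<{len(values)}q", *values)
    unsigned = murmur3_32(data, seed)
    # Assumption: the result is reported as a signed 32-bit integer.
    return unsigned - (1 << 32) if unsigned >= (1 << 31) else unsigned


print(hash_int64_values([1, 2, 3]))
```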
- identifier_from_shell(shell, atom_coords, connectivity, level, stereo)
Determine new identifier for a shell at a specific level.
- Parameters
shell (Shell) – Shell for which to determine identifier
atom_coords (dict) – Dict matching atom ids to coords.
connectivity (dict) – Dict matching atom id pair tuples to their bond order (5 for unbound).
level (int) – Level/iteration
stereo (bool) – Add stereo indicators
- identifiers_from_invariants(mol, atoms, rdkit_invariants=False)
Initialize ids according to Daylight invariants.
- Parameters
mol (RDKit Mol) – Input molecule
atoms (list of int) – IDs for atoms in mol for which to generate identifiers.
rdkit_invariants (bool, optional) – Use the atom invariants used by RDKit for its Morgan fingerprint.
- Returns
Initial identifiers for atoms
- Return type
ndarray of int64
- invariants_from_atom(atom)
Get seven invariants from atom.
Invariants used are the six Daylight invariants, plus an indicator of whether the atom is in a ring, as detailed in [1].
References
[1] D Rogers, M Hahn. J. Chem. Inf. Model., 2010, 50 (5), pp 742-754. https://doi.org/10.1021/ci100050t
- Parameters
atom (RDKit Atom) – Input atom
- Returns
Array of 7 invariants
- Return type
1-D array of int64
- pick_y(atom_tuples, cent_coords, y_precision=0.1)
Pick a y-coordinate from atom tuples or mean coordinate.
- Parameters
atom_tuples (list of tuple) – Sorted list of atom tuples
cent_coords (Nx3 array of float) – Coordinates of atoms with center atom at origin.
y_precision (float, optional) – For the mean to be chosen as the y-coordinate, it must be at least this distance from the origin. Useful when atoms are symmetrical around the center atom, where a slight shift in any atom results in a very different y.
- Returns
1x3 array of float or None (y-coordinate)
int or None (index to y-atom, if y was chosen from the atoms.)
- pick_z(connectivity, identifiers, cent_coords, y, long_angle, z_precision=0.01)
Pick a z-coordinate orthogonal to y.
- Parameters
connectivity (dict) – Dict matching atom id pair tuples to their bond order (5 for unbound).
identifiers (iterable of int) – Atom identifiers
cent_coords (Nx3 array of float) – Coordinates of atoms with center atom at origin.
y (1x3 array of float) – y-coordinate
long_angle (Nx1 array of float) – Absolute angle of atoms from orthogonal to y.
z_precision (float, optional) – Minimum difference in long_angle between two potential z-atoms. Used as a tie breaker to prevent small shift in one atom resulting in very different z.
- Returns
z-coordinate
- Return type
1x3 array of float or None
- quad_indicators_from_coords(cent_coords, y, y_ind, z, long_sign)
Create angle indicators for four quadrants in each hemisphere.
- Parameters
cent_coords (Nx3 array of float) – Array of centered coordinates.
y (1-D array of float) – Vector lying along y-axis.
y_ind (int) – Index of cent_coords corresponding to y.
z (1-D array of float) – Vector lying along z-axis
long_sign (Nx1 array of int) – Array of signs of vectors in cent_coords indicating whether they are above (+1) or below (-1) the xz-plane.
- Returns
Quadrant indicators. Clockwise from z around y, indicators are 2, 3, 4, 5 for vectors above the xz-plane and -2, -3, -4, -5 for vectors below the xz-plane.
- Return type
Nx1 array of int
- rdkit_invariants_from_atom(atom)
Get the 6 atom invariants RDKit uses for its Morgan fingerprints.
- Parameters
atom (RDKit Atom) – Input atom
- Returns
Array of 6 invariants
- Return type
1-D array of int64
- signed_to_unsigned_int(a, bits=4294967296)
Convert int between +/-bits to an int between 0 and bits.
- Parameters
a (int or ndarray of int) – Integer
bits (int, optional) – Maximum size of int. E.g. 32-bit is 2^32.
- Returns
Unsigned integer
- Return type
int
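The conversion can be sketched with modular arithmetic, assuming the intent is to wrap negative values into the range [0, bits). The name to_unsigned is hypothetical, and the library's own implementation may differ in detail.

```python
def to_unsigned(a, bits=2**32):
    """Map an int in (-bits, bits) to the range [0, bits)."""
    # Python's % returns a non-negative result for a positive modulus,
    # so negative inputs wrap around to the top of the range.
    return a % bits


print(to_unsigned(-1))
```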
- stereo_indicators_from_shell(shell, atom_tuples, atom_coords_dict, add_transform_to_shell=True)
Get list of int indicating location of atoms on unit sphere.
- Parameters
shell (Shell) – Shell for which to get stereo indicators.
atom_tuples (list of tuple) – List of atom tuples.
atom_coords_dict (dict) – Dict matching atom ids to coords.
add_transform_to_shell (bool, optional) – Calculate transformation matrix to align coordinates to unit sphere, and add to shell.
- Returns
Stereo indicators for atoms in atom_tuples.
- Return type
list of int