e3fp.fingerprint.fprint module
Classes and methods for chemical fingerprint storage and comparison.
Author: Seth Axen E-mail: seth.axen@gmail.com
- class CountFingerprint(indices=None, counts=None, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]
Bases: Fingerprint
A fingerprint that stores number of occurrences of each index.
- Parameters:
indices (array_like of int, optional) – log2(bits)-bit indices in a sparse vector, corresponding to positions with counts greater than 0. If not provided, counts must be provided.
counts (dict, optional) – Dict matching each index in indices to number of counts. All counts default to 1 if not provided.
bits (int, optional) – Number of bits in bitvector.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Variables:
bits (int) – Number of bits in bitvector, length of fingerprint.
counts (dict) – Dict matching each index in indices to number of counts.
indices (numpy.ndarray of int) – Indices of fingerprint with counts greater than 0.
level (int) – Level of fingerprint, corresponding to fingerprinting iterations.
mol (RDKit Mol) – Mol to which fingerprint corresponds (stored in props).
name (str or None) – Name of fingerprint (stored in props).
props (dict) – Custom properties of fingerprint, consisting of a string keyword and some value.
vector_dtype (numpy.dtype) – NumPy data type associated with fingerprint values (e.g. bits)
See also
Fingerprint – A fingerprint that stores indices of “on” bits
FloatFingerprint – A fingerprint that stores float counts
Examples
>>> import e3fp.fingerprint.fprint as fp
>>> from e3fp.fingerprint.metrics import soergel
>>> import numpy as np
>>> np.random.seed(1)
>>> bits = 1024
>>> indices = np.random.randint(0, bits, 30)
>>> print(indices)
[ 37 235 908 72 767 905 715 645 847 960 144 129 972 583 749 508 390 281 178 276 254 357 914 468 907 252 490 668 925 398]
>>> counts = dict(zip(indices,
...                   np.random.randint(1, 100, indices.shape[0])))
>>> f = fp.CountFingerprint(indices, counts=counts, bits=bits, level=0)
>>> sorted(f.counts.items())
[(np.int64(37), 51), (np.int64(72), 88), ..., (np.int64(960), 8), (np.int64(972), 23)]
>>> f_folded = f.fold(bits=32)
>>> print(sorted(f_folded.counts.items()))
[(np.int64(0), 8), (np.int64(1), 62), ..., (np.int64(30), 14), (np.int64(31), 95)]
>>> print(f_folded.to_vector(sparse=False, dtype=int))
[ 8 62 0 0 0 113 61 58 88 97 71 228 111 2 58 10 64 0 82 0 120 0 0 0 0 82 0 0 27 50 14 95]
>>> fp.Fingerprint.from_fingerprint(f_folded)
Fingerprint(indices=array([0, 1, ...]), level=0, bits=32, name=None)
>>> indices2 = np.random.randint(0, bits, 30)
>>> counts2 = dict(zip(indices2,
...                    np.random.randint(1, 100, indices.shape[0])))
>>> f_folded2 = fp.CountFingerprint.from_indices(indices2, counts=counts2,
...                                              bits=bits).fold(bits=32)
>>> sorted(f_folded2.counts.items())
[(np.int64(0), 93), (np.int64(2), 33), ..., (np.int64(26), 89), (np.int64(30), 53)]
>>> print(soergel(f_folded, f_folded2))
0.17492946392...
- property counts
- fold(*args, **kwargs)[source]
Fold fingerprint while considering counts.
Optionally, provide a function to reduce colliding counts.
- Parameters:
bits (int, optional) – Length of new bitvector, ideally multiple of 2.
method ({0, 1}, optional) – Method to use for folding.
- 0: partitioning (array is divided into equal sized arrays of length bits which are bitwise combined with counts_method)
- 1: compression (adjacent bits pairs are combined with counts_method until length is bits)
linked (bool, optional) – Link folded and unfolded fingerprints for easy referencing. Set to False if intending to save and want to reduce file size.
counts_method (function, optional) – Function for combining counts. Default is summation.
- Returns:
Fingerprint of folded vector
- Return type:
CountFingerprint
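As a rough illustration of the two folding methods described above, the following sketch folds a plain dict of counts. It is not e3fp's implementation: the helper name fold_counts and its old_bits/new_bits arguments are hypothetical, and mapping partitioning to a modulo and compression to an integer division are assumptions made for the example.

```python
from collections import defaultdict


def fold_counts(counts, old_bits, new_bits, method=0, counts_method=sum):
    """Fold a sparse {index: count} mapping from old_bits to new_bits positions.

    Hypothetical helper for illustration only; not part of e3fp.
    """
    if old_bits % new_bits != 0:
        raise ValueError("new_bits must evenly divide old_bits")
    factor = old_bits // new_bits
    colliding = defaultdict(list)
    for index, count in counts.items():
        if method == 0:
            # Partitioning (assumed): stack equal-sized chunks of length new_bits.
            folded = index % new_bits
        elif method == 1:
            # Compression (assumed): merge each run of `factor` adjacent positions.
            folded = index // factor
        else:
            raise ValueError("method must be 0 or 1")
        colliding[folded].append(count)
    # Reduce every group of colliding counts with counts_method.
    return {i: counts_method(c) for i, c in sorted(colliding.items())}


counts = {3: 2, 35: 5, 40: 1, 1000: 7}
print(fold_counts(counts, 1024, 32))                     # summed collisions
print(fold_counts(counts, 1024, 32, counts_method=max))  # keep largest count
print(fold_counts(counts, 1024, 32, method=1))           # compression
```

Passing a different reducer such as max shows how counts_method changes only the way colliding counts are combined, not which positions collide.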
- classmethod from_counts(counts, bits=4294967296, level=-1, **kwargs)[source]
Initialize from a dictionary mapping indices to counts.
- Parameters:
counts (dict) – Dictionary mapping sparse indices to counts.
bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns:
fingerprint
- Return type:
- classmethod from_fingerprint(fp, **kwargs)[source]
Initialize by copying existing fingerprint.
- Parameters:
fp (Fingerprint) – Existing fingerprint.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns:
fingerprint
- Return type:
- classmethod from_indices(indices, counts=None, bits=4294967296, level=-1, **kwargs)[source]
Initialize from an array of indices.
- Parameters:
indices (array_like of int, optional) – Indices in a sparse bitvector of length bits which correspond to 1.
counts (dict, optional) – Dictionary mapping sparse indices to counts.
bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns:
fingerprint
- Return type:
- get_count(index)[source]
Return count of index in fingerprint.
- Returns:
Count of index in fingerprint
- Return type:
int
- std()[source]
Return standard deviation of fingerprint.
- Returns:
Standard deviation
- Return type:
float
- vector_dtype
alias of uint16
- class Fingerprint(indices, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]
Bases: object
A fingerprint that stores indices of “on” bits.
- Parameters:
indices (array_like of int, optional) – log2(bits)-bit indices in a sparse bitvector of length bits which correspond to 1.
bits (int, optional) – Number of bits in bitvector.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Variables:
bits (int) – Number of bits in bitvector, length of fingerprint.
counts (dict) – Dict matching each index in indices to number of counts (1 for bits).
indices (numpy.ndarray of int) – Indices of “on” bits
level (int) – Level of fingerprint, corresponding to fingerprinting iterations.
mol (RDKit Mol) – Mol to which fingerprint corresponds (stored in props).
props (dict) – Custom properties of fingerprint, consisting of a string keyword and some value.
vector_dtype (numpy.dtype) – NumPy data type associated with fingerprint values (e.g. bits)
See also
CountFingerprint – A fingerprint that stores number of occurrences of each index
FloatFingerprint – A fingerprint that stores float counts
e3fp.fingerprint.db.FingerprintDatabase – Efficiently store fingerprints
Examples
>>> import e3fp.fingerprint.fprint as fp
>>> from e3fp.fingerprint.metrics import tanimoto
>>> import numpy as np
>>> np.random.seed(0)
>>> bits = 1024
>>> indices = np.random.randint(0, bits, 30)
>>> print(indices)
[684 559 629 192 835 763 707 359 9 723 277 754 804 599 70 472 600 396 314 705 486 551 87 174 600 849 677 537 845 72]
>>> f = fp.Fingerprint(indices, bits=bits, level=0)
>>> f_folded = f.fold(bits=32)
>>> print(f_folded.indices)
[ 0 1 3 4 5 6 7 8 9 12 13 14 15 17 18 19 21 23 24 25 26 27]
>>> print(f_folded.to_vector(sparse=False, dtype=int))
[1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 0 0]
>>> print(f_folded.to_bitstring())
11011111110011110111010111110000
>>> print(f_folded.to_rdkit())
<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...>
>>> f_folded2 = fp.Fingerprint.from_indices(np.random.randint(0, bits, 30),
...                                         bits=bits).fold(bits=32)
>>> print(f_folded2.indices)
[ 0 1 3 5 7 9 10 14 15 16 17 18 19 20 23 24 25 29 30 31]
>>> print(tanimoto(f_folded, f_folded2))
0.5
- property bit_count
- property bits
- property counts
- property density
- fold(bits=1024, method=0, linked=True)[source]
Return fingerprint for bitvector folded to size bits.
- Parameters:
bits (int, optional) – Length of new bitvector, ideally multiple of 2.
method ({0, 1}, optional) – Method to use for folding.
linked (bool, optional) – Link folded and unfolded fingerprints for easy referencing. Set to False if intending to save and want to reduce file size.
- Returns:
Fingerprint of folded bitvector
- Return type:
Fingerprint
- classmethod from_bitstring(bitstring, level=-1, **kwargs)[source]
Initialize from bitstring (e.g. ‘10010011’).
- Parameters:
bitstring (str) – String of 1s and 0s.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns:
fingerprint
- Return type:
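To make the bitstring convention concrete, here is a minimal sketch of converting between a string of 1s and 0s and a list of "on" indices. The helper names are hypothetical and this is not e3fp's code. It assumes that the character at position i corresponds to index i, which matches the to_bitstring output shown in the Fingerprint example.

```python
def bitstring_to_indices(bitstring):
    """Return (indices of '1' characters, total length) for a 0/1 string.

    Hypothetical helper for illustration only; not part of e3fp.
    """
    if set(bitstring) - {"0", "1"}:
        raise ValueError("bitstring may only contain '0' and '1'")
    indices = [i for i, char in enumerate(bitstring) if char == "1"]
    return indices, len(bitstring)


def indices_to_bitstring(indices, bits):
    """Inverse operation: render the 'on' indices as a 0/1 string."""
    on = set(indices)
    return "".join("1" if i in on else "0" for i in range(bits))


indices, bits = bitstring_to_indices("10010011")
print(indices, bits)
print(indices_to_bitstring(indices, bits))
```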
- classmethod from_fingerprint(fp, **kwargs)[source]
Initialize by copying existing fingerprint.
- Parameters:
fp (Fingerprint) – Existing fingerprint.
- Returns:
fingerprint
- Return type:
- classmethod from_indices(indices, bits=4294967296, level=-1, **kwargs)[source]
Initialize from an array of indices.
- Parameters:
indices (array_like of int) – Indices in a sparse bitvector of length bits which correspond to 1.
bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns:
fingerprint
- Return type:
- classmethod from_rdkit(rdkit_fprint, **kwargs)[source]
Initialize from RDKit fingerprint.
If provided fingerprint is of length 2^32 - 1, assumes real fingerprint is of length 2^32.
- Parameters:
rdkit_fprint (RDKit ExplicitBitVect or SparseBitVect) – Existing RDKit fingerprint.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns:
fingerprint
- Return type:
- classmethod from_vector(vector, level=-1, **kwargs)[source]
Initialize from vector.
- Parameters:
vector (numpy.ndarray or scipy.sparse.csr_matrix) – Array of bits/counts/floats
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns:
fingerprint
- Return type:
- get_count(index)[source]
Return count of index in fingerprint.
Defaults to 1 if index in self.indices.
- Returns:
Count of bit in fingerprint
- Return type:
int
- get_folding_index_map()[source]
Get map of sparse indices to folded indices.
- Returns:
Map of sparse index (keys) to corresponding folded index.
- Return type:
dict
- get_unfolding_index_map()[source]
Get map of sparse indices to unfolded indices.
- Returns:
Map of sparse index (keys) to set of corresponding unfolded indices.
- Return type:
dict
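The relationship between the two index maps can be sketched with plain dicts. The helpers below are hypothetical and assume a partitioning fold in which each index maps to index modulo the folded length. That assumption is consistent with the Fingerprint example, where index 684 folded to 32 bits lands at position 12, but it is not taken from e3fp's source.

```python
def folding_index_map(indices, folded_bits):
    """Map each unfolded index to its folded position (assumed: modulo)."""
    return {index: index % folded_bits for index in indices}


def unfolding_index_map(indices, folded_bits):
    """Map each folded position to the set of unfolded indices that hit it."""
    unfolding = {}
    for index, folded in folding_index_map(indices, folded_bits).items():
        unfolding.setdefault(folded, set()).add(index)
    return unfolding


unfolded = [684, 559, 72, 9, 600, 396]
print(folding_index_map(unfolded, 32))
print(unfolding_index_map(unfolded, 32))
```

Because several unfolded indices can collide on one folded position (684 and 396 both land on 12 here), the unfolding map holds sets rather than single values.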
- property index_id_map
- property indices
- property level
- mean()[source]
Return mean, i.e. proportion of “on” bits in fingerprint.
- Returns:
Mean
- Return type:
float
- property mol
- property name
- property props
- std()[source]
Return standard deviation of fingerprint.
- Returns:
Standard deviation
- Return type:
float
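For a bit fingerprint, both statistics follow from the number of "on" bits alone. The sketch below uses hypothetical helper names and assumes the population standard deviation of the dense 0/1 vector, which for a proportion p of "on" bits is sqrt(p * (1 - p)). Whether the library uses the population or the sample form is not stated here.

```python
import math


def bit_mean(indices, bits):
    """Proportion of "on" positions in a bitvector of length `bits`."""
    return len(set(indices)) / bits


def bit_std(indices, bits):
    """Population standard deviation of the same 0/1 vector (assumed form)."""
    p = bit_mean(indices, bits)
    return math.sqrt(p * (1.0 - p))


on_bits = [0, 1, 3, 4, 5, 6, 7, 8]
print(bit_mean(on_bits, 32))  # 8 of 32 bits are on: 0.25
print(bit_std(on_bits, 32))
```

Working from the sparse indices avoids materialising the full vector, which matters when the fingerprint has 2^32 positions.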
- to_bitvector(sparse=True)[source]
Get full bitvector.
- Returns:
Bitvector
- Return type:
numpy.ndarray or scipy.sparse.csr_matrix of bool
- to_rdkit()[source]
Convert to RDKit fingerprint.
If number of bits exceeds 2^31 - 1, fingerprint will be folded to length 2^31 - 1 before conversion.
- Returns:
rdkit_fprint – Convert to bitvector used for RDKit fingerprints. If self.bits is less than 10^5, ExplicitBitVect is used. Otherwise, SparseBitVect is used.
- Return type:
RDKit ExplicitBitVect or SparseBitVect
- to_vector(sparse=True, dtype=None)[source]
Get vector of bits/counts/floats.
- Returns:
Vector of bits/counts/floats
- Return type:
- unfold()[source]
Return unfolded parent fingerprint for bitvector.
- Returns:
Fingerprint of unfolded bitvector. If None, return None.
- Return type:
Fingerprint
- class FloatFingerprint(indices=None, counts=None, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]
Bases: CountFingerprint
A Fingerprint that stores float counts.
Nearly identical to CountFingerprint. Mainly a naming convention, but count values are stored as floats.
See also
Fingerprint – A fingerprint that stores indices of “on” bits
CountFingerprint – A fingerprint that stores number of occurrences of each index
- property counts
- vector_dtype
alias of float64
- add(fprints, weights=None)[source]
Add fingerprints by count to new CountFingerprint.
If any of the fingerprints are FloatFingerprint, resulting fingerprint is likewise a FloatFingerprint. Otherwise, resulting fingerprint is CountFingerprint.
- Parameters:
fprints (iterable of Fingerprint) – Fingerprints to be added by count.
weights (iterable of float) – Weights for weighted sum. Results in FloatFingerprint output.
- Returns:
Fingerprint with counts as sum of counts in fprints.
- Return type:
- coerce_to_valid_dtype(dtype)[source]
Coerce provided NumPy data type to closest fingerprint data type.
If provided dtype cannot be read, default corresponding to bit Fingerprint is returned.
- Parameters:
dtype (numpy.dtype or str) – Input NumPy data type.
- Returns:
Output NumPy data type.
- Return type:
- diff_counts_dict(fp1, fp2, only_positive=False)[source]
Given two fingerprints, returns difference of their counts dicts.
- Parameters:
fp1, fp2 (Fingerprint) – Fingerprint objects, fp2 subtracted from fp1.
only_positive (bool, optional) – Return only positive counts, negative being thresholded to 0.
- Returns:
counts_diff – Count indices in either fp1 or fp2 with value as diff of counts.
- Return type:
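The difference of two counts dicts can be sketched on plain dicts as follows. The helper name is hypothetical and this is not e3fp's code. In particular, dropping every non-positive entry when only_positive is set is an assumption about how the thresholding described above shows up in the result.

```python
def diff_counts(counts1, counts2, only_positive=False):
    """Subtract counts2 from counts1 over the union of their indices.

    Hypothetical helper for illustration only; not part of e3fp.
    """
    diff = {}
    for index in sorted(set(counts1) | set(counts2)):
        # A missing index counts as 0 in that fingerprint.
        diff[index] = counts1.get(index, 0) - counts2.get(index, 0)
    if only_positive:
        # Assumed behaviour: keep only indices where counts1 exceeds counts2.
        diff = {index: value for index, value in diff.items() if value > 0}
    return diff


a = {1: 3, 2: 1, 5: 4}
b = {2: 4, 5: 4, 7: 2}
print(diff_counts(a, b))
print(diff_counts(a, b, only_positive=True))
```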
- dtype_from_fptype(fp_type)[source]
Get NumPy data type from fingerprint type.
- Parameters:
fp_type (class or Fingerprint) – Class of fingerprint
- Returns:
NumPy data type
- Return type:
- fptype_from_dtype(dtype)[source]
Get corresponding fingerprint type from NumPy data type.
- Parameters:
dtype (numpy.dtype or str) – NumPy data type.
- Returns:
class – Class of fingerprint
- Return type:
{Fingerprint, CountFingerprint, FloatFingerprint}
- load(f, update_structure=True)[source]
Load Fingerprint object from file.
- Parameters:
f (str or File) – File name or file-like object to load file from.
update_structure (bool, optional) – Attempt to update the class structure by initializing a new, shiny fingerprint from each fingerprint in the file. Useful for guaranteeing that old, dusty fingerprints are always upgradeable.
- Returns:
Pickled fingerprint.
- Return type:
Fingerprint
- loadz(f, update_structure=True)[source]
Load Fingerprint objects from file.
- Parameters:
f (str or File) – File name or file-like object to load file from.
update_structure (bool, optional) – Attempt to update the class structure by initializing a new, shiny fingerprint from each fingerprint in the file. Useful for guaranteeing that old, dusty fingerprints are always upgradeable. If this doesn’t work, falls back to the original saved fingerprint.
- Returns:
Fingerprints in pickle.
- Return type:
list of Fingerprint
- mean(fprints, weights=None)[source]
Average fingerprints to generate FloatFingerprint.
- Parameters:
fprints (iterable of Fingerprint) – Fingerprints to be added by count.
weights (array_like of float, optional) – Weights for weighted mean. Weights are normalized to a sum of 1.
- Returns:
Fingerprint with float counts as average of counts in fprints.
- Return type:
FloatFingerprint
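A weighted mean with normalized weights can be sketched on plain counts dicts. The helper below is hypothetical and only illustrates the arithmetic: each weight is divided by the sum of all weights before the counts are accumulated, and an index missing from a dict contributes 0.

```python
def mean_counts(counts_dicts, weights=None):
    """Average several {index: count} dicts, normalizing weights to sum to 1.

    Hypothetical helper for illustration only; not part of e3fp.
    """
    counts_dicts = list(counts_dicts)
    if not counts_dicts:
        raise ValueError("at least one counts dict is required")
    if weights is None:
        weights = [1.0] * len(counts_dicts)
    weights = [float(w) for w in weights]
    if len(weights) != len(counts_dicts):
        raise ValueError("weights must match the number of counts dicts")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    averaged = {}
    for counts, weight in zip(counts_dicts, weights):
        for index, value in counts.items():
            averaged[index] = averaged.get(index, 0.0) + (weight / total) * value
    return averaged


a = {1: 2, 4: 1}
b = {4: 3}
print(mean_counts([a, b]))                  # equal weights
print(mean_counts([a, b], weights=[3, 1]))  # normalized to 0.75 and 0.25
```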
- save(f, fp, **kwargs)[source]
Save Fingerprint object to file.
- Parameters:
f (str or File) – filename str or file-like object to save file to
fp (Fingerprint) – Fingerprint to save to file
protocol ({0, 1, 2, None}, optional) – Pickle protocol to use. If None, highest available protocol is used. This will not affect fingerprint loading.
- Returns:
Success or fail
- Return type:
bool
- savez(f, *fps, **kwargs)[source]
Save multiple Fingerprint objects to file.
- Parameters:
f (str or File) – filename str or file-like object to save file to
fps (list of Fingerprint) – List of Fingerprints to save to file
protocol ({0, 1, 2, None}, optional) – Pickle protocol to use. If None, highest available protocol is used. This will not affect fingerprint loading.
- Returns:
Success or fail
- Return type:
bool
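The protocol note above reflects how Python's pickle module behaves in general: the protocol is chosen when writing, and the reader detects it from the stream. The sketch below shows that round trip with the standard library only. The helper names save_obj and load_obj and the dict standing in for a fingerprint are hypothetical, and nothing here is e3fp's own save or load code.

```python
import io
import pickle


def save_obj(fileobj, obj, protocol=None):
    """Pickle obj to a binary file-like object; return True on success."""
    if protocol is None:
        # Mirror the documented default: use the highest available protocol.
        protocol = pickle.HIGHEST_PROTOCOL
    try:
        pickle.dump(obj, fileobj, protocol)
    except (pickle.PicklingError, OSError):
        return False
    return True


def load_obj(fileobj):
    """Unpickle one object; the protocol is detected from the stream."""
    return pickle.load(fileobj)


# A plain dict stands in for a fingerprint in this illustration.
record = {"indices": [0, 3, 6, 7], "bits": 8, "level": 0}
for protocol in (0, 2, None):
    buffer = io.BytesIO()
    saved = save_obj(buffer, record, protocol=protocol)
    buffer.seek(0)
    print(protocol, saved, load_obj(buffer) == record)
```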
- sum_counts_dict(*fprints, **kwargs)[source]
Given fingerprints, return sum of their counts dicts.
If an optional weights iterable of the same length as fprints is provided, the weighted sum is returned.
- Parameters:
*fprints – One or more Fingerprint objects
weights (iterable of float, optional) – Weights for weighted mean. Weights are normalized to a sum of 1.
- Returns:
Dict of non-zero count indices in any of the fprints with value as sum of counts.
- Return type:
dict
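A weighted sum of counts dicts can be sketched as below. The helper name is hypothetical and the code is not e3fp's. It applies the weights exactly as given, without normalizing them, and drops indices whose total is zero; both choices are assumptions made for the example.

```python
def sum_counts(*counts_dicts, weights=None):
    """Sum several {index: count} dicts, optionally weighting each dict.

    Hypothetical helper for illustration only; not part of e3fp.
    """
    if weights is None:
        weights = [1] * len(counts_dicts)
    weights = list(weights)
    if len(weights) != len(counts_dicts):
        raise ValueError("weights must match the number of counts dicts")
    total = {}
    for counts, weight in zip(counts_dicts, weights):
        for index, value in counts.items():
            total[index] = total.get(index, 0) + weight * value
    # Keep only indices with a non-zero total (assumed behaviour).
    return {index: value for index, value in sorted(total.items()) if value != 0}


a = {1: 2, 4: 1}
b = {4: 3, 9: 5}
print(sum_counts(a, b))
print(sum_counts(a, b, weights=[0.5, 2]))
```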