e3fp.fingerprint.fprint module¶
Classes and methods for chemical fingerprint storage and comparison.
Author: Seth Axen
E-mail: seth.axen@gmail.com
- class CountFingerprint(indices=None, counts=None, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]¶
Bases:
Fingerprint
A fingerprint that stores number of occurrences of each index.
- Parameters
indices (array_like of int, optional) – log2(bits)-bit indices in a sparse vector, corresponding to positions with counts greater than 0. If not provided, counts must be provided.
counts (dict, optional) – Dict matching each index in indices to number of counts. All counts default to 1 if not provided.
bits (int, optional) – Number of bits in bitvector.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Variables
bits (int) – Number of bits in bitvector, length of fingerprint.
counts (dict) – Dict matching each index in indices to number of counts.
indices (numpy.ndarray of int) – Indices of fingerprint with counts greater than 0.
level (int) – Level of fingerprint, corresponding to fingerprinting iterations.
mol (RDKit Mol) – Mol to which fingerprint corresponds (stored in props).
name (str or None) – Name of fingerprint (stored in props).
props (dict) – Custom properties of fingerprint, consisting of a string keyword and some value.
vector_dtype (numpy.dtype) – NumPy data type associated with fingerprint values (e.g. bits)
See also
Fingerprint
A fingerprint that stores indices of “on” bits
FloatFingerprint
A fingerprint that stores float counts
Examples
>>> import e3fp.fingerprint.fprint as fp
>>> from e3fp.fingerprint.metrics import soergel
>>> import numpy as np
>>> np.random.seed(1)
>>> bits = 1024
>>> indices = np.random.randint(0, bits, 30)
>>> print(indices)
[ 37 235 908 72 767 905 715 645 847 960 144 129 972 583 749 508 390 281 178 276 254 357 914 468 907 252 490 668 925 398]
>>> counts = dict(zip(indices,
...                   np.random.randint(1, 100, indices.shape[0])))
>>> print(sorted(counts.items()))
[(37, 51), (72, 88), (129, 62), ..., (925, 50), (960, 8), (972, 23)]
>>> f = fp.CountFingerprint(indices, counts=counts, bits=bits, level=0)
>>> f_folded = f.fold(bits=32)
>>> print(sorted(f_folded.counts.items()))
[(0, 8), (1, 62), (5, 113), ..., (29, 50), (30, 14), (31, 95)]
>>> print(f_folded.to_vector(sparse=False, dtype=int))
[ 8 62 0 0 0 113 61 58 88 97 71 228 111 2 58 10 64 0 82 0 120 0 0 0 0 82 0 0 27 50 14 95]
>>> fp.Fingerprint.from_fingerprint(f_folded)
Fingerprint(indices=array([0, 1, ...]), level=0, bits=32, name=None)
>>> indices2 = np.random.randint(0, bits, 30)
>>> counts2 = dict(zip(indices2,
...                    np.random.randint(1, 100, indices.shape[0])))
>>> f_folded2 = fp.CountFingerprint.from_indices(indices2, counts=counts2,
...                                              bits=bits).fold(bits=32)
>>> print(sorted(f_folded2.counts.items()))
[(0, 93), (2, 33), (3, 106), ..., (25, 129), (26, 89), (30, 53)]
>>> print(soergel(f_folded, f_folded2))
0.17492946392...
- property counts¶
- fold(*args, **kwargs)[source]¶
Fold fingerprint while considering counts.
Optionally, provide a function to reduce colliding counts.
- Parameters
bits (int, optional) – Length of new bitvector, ideally multiple of 2.
method ({0, 1}, optional) – Method to use for folding.
- 0: partitioning (array is divided into equal sized arrays of length bits which are bitwise combined with counts_method)
- 1: compression (adjacent bits pairs are combined with counts_method until length is bits)
linked (bool, optional) – Link folded and unfolded fingerprints for easy referencing. Set to False if intending to save and want to reduce file size.
counts_method (function, optional) – Function for combining counts. Default is summation.
- Returns
Fingerprint of folded vector
- Return type
CountFingerprint
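The two folding methods and the role of counts_method can be illustrated without e3fp. The sketch below is a pure-Python approximation, not the library's implementation. It assumes that method 0 (partitioning) maps each sparse index to index % bits, which agrees with the entries printed in the class example (index 960 with count 8 folds to index 0), and that method 1 (compression) halves indices until the vector has length bits. The helper names fold_counts_partition and fold_counts_compress are hypothetical.

```python
from operator import add


def fold_counts_partition(counts, bits, counts_method=add):
    """Fold a {index: count} dict to `bits` positions, assuming index % bits."""
    folded = {}
    for index, count in counts.items():
        new_index = index % bits
        if new_index in folded:
            # Colliding indices are reduced with counts_method (sum by default).
            folded[new_index] = counts_method(folded[new_index], count)
        else:
            folded[new_index] = count
    return folded


def fold_counts_compress(counts, old_bits, bits, counts_method=add):
    """Fold by merging adjacent index pairs (index // 2) until length is `bits`."""
    folded = dict(counts)
    length = old_bits
    while length > bits:
        merged = {}
        for index, count in folded.items():
            new_index = index // 2
            if new_index in merged:
                merged[new_index] = counts_method(merged[new_index], count)
            else:
                merged[new_index] = count
        folded = merged
        length //= 2
    return folded


# Counts for 960, 129, 925 and 37 come from the class example;
# the count for 645 is invented to show a collision at folded index 5.
counts = {960: 8, 129: 62, 925: 50, 37: 51, 645: 3}
print(sorted(fold_counts_partition(counts, 32).items()))
print(sorted(fold_counts_partition(counts, 32, counts_method=max).items()))
print(sorted(fold_counts_compress(counts, 1024, 32).items()))
```

Passing max instead of the default summation keeps the largest colliding count, which is one way a custom counts_method might be used.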
- classmethod from_counts(counts, bits=4294967296, level=-1, **kwargs)[source]¶
Initialize from a dictionary mapping indices to counts.
- Parameters
counts (dict) – Dictionary mapping sparse indices to counts.
bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns
fingerprint
- Return type
- classmethod from_fingerprint(fp, **kwargs)[source]¶
Initialize by copying existing fingerprint.
- Parameters
fp (Fingerprint) – Existing fingerprint.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns
fingerprint
- Return type
- classmethod from_indices(indices, counts=None, bits=4294967296, level=-1, **kwargs)[source]¶
Initialize from an array of indices.
- Parameters
indices (array_like of int, optional) – Indices in a sparse bitvector of length bits which correspond to 1.
counts (dict, optional) – Dictionary mapping sparse indices to counts.
bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns
fingerprint
- Return type
- get_count(index)[source]¶
Return count of index in fingerprint.
- Returns
Count of index in fingerprint
- Return type
int
- std()[source]¶
Return standard deviation of fingerprint.
- Returns
Standard deviation
- Return type
float
- vector_dtype¶
alias of
uint16
- class Fingerprint(indices, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]¶
Bases:
object
A fingerprint that stores indices of “on” bits.
- Parameters
indices (array_like of int, optional) – log2(bits)-bit indices in a sparse bitvector of length bits which correspond to 1.
bits (int, optional) – Number of bits in bitvector.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Variables
bits (int) – Number of bits in bitvector, length of fingerprint.
counts (dict) – Dict matching each index in indices to number of counts (1 for bits).
indices (numpy.ndarray of int) – Indices of “on” bits
level (int) – Level of fingerprint, corresponding to fingerprinting iterations.
mol (RDKit Mol) – Mol to which fingerprint corresponds (stored in props).
props (dict) – Custom properties of fingerprint, consisting of a string keyword and some value.
vector_dtype (numpy.dtype) – NumPy data type associated with fingerprint values (e.g. bits)
See also
CountFingerprint
A fingerprint that stores number of occurrences of each index
FloatFingerprint
A fingerprint that stores float counts
e3fp.fingerprint.db.FingerprintDatabase
Efficiently store fingerprints
Examples
>>> import e3fp.fingerprint.fprint as fp
>>> from e3fp.fingerprint.metrics import tanimoto
>>> import numpy as np
>>> np.random.seed(0)
>>> bits = 1024
>>> indices = np.random.randint(0, bits, 30)
>>> print(indices)
[684 559 629 192 835 763 707 359 9 723 277 754 804 599 70 472 600 396 314 705 486 551 87 174 600 849 677 537 845 72]
>>> f = fp.Fingerprint(indices, bits=bits, level=0)
>>> f_folded = f.fold(bits=32)
>>> print(f_folded.indices)
[ 0 1 3 4 5 6 7 8 9 12 13 14 15 17 18 19 21 23 24 25 26 27]
>>> print(f_folded.to_vector(sparse=False, dtype=int))
[1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 0 0]
>>> print(f_folded.to_bitstring())
11011111110011110111010111110000
>>> print(f_folded.to_rdkit())
<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...>
>>> f_folded2 = fp.Fingerprint.from_indices(np.random.randint(0, bits, 30),
...                                         bits=bits).fold(bits=32)
>>> print(f_folded2.indices)
[ 0 1 3 5 7 9 10 14 15 16 17 18 19 20 23 24 25 29 30 31]
>>> print(tanimoto(f_folded, f_folded2))
0.5
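The final value in the example can be checked by hand from the two printed index lists. The sketch below computes the Tanimoto coefficient as the size of the intersection over the size of the union of the "on" indices. It is a standalone illustration that may differ from how e3fp.fingerprint.metrics.tanimoto is implemented; the helper name tanimoto_from_indices and the choice to return 0.0 for two empty fingerprints are assumptions.

```python
def tanimoto_from_indices(indices_a, indices_b):
    """Tanimoto coefficient |A & B| / |A | B| of two sets of "on" indices."""
    a, b = set(indices_a), set(indices_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch only.
        return 0.0
    return len(a & b) / len(union)


# Folded indices as printed in the example.
folded = [0, 1, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15,
          17, 18, 19, 21, 23, 24, 25, 26, 27]
folded2 = [0, 1, 3, 5, 7, 9, 10, 14, 15, 16, 17, 18, 19,
           20, 23, 24, 25, 29, 30, 31]

# 14 shared indices out of 28 distinct indices.
print(tanimoto_from_indices(folded, folded2))  # 0.5
```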
- property bit_count¶
- property bits¶
- property counts¶
- property density¶
- fold(bits=1024, method=0, linked=True)[source]¶
Return fingerprint for bitvector folded to size bits.
- Parameters
bits (int, optional) – Length of new bitvector, ideally multiple of 2.
method ({0, 1}, optional) – Method to use for folding.
linked (bool, optional) – Link folded and unfolded fingerprints for easy referencing. Set to False if intending to save and want to reduce file size.
- Returns
Fingerprint of folded bitvector
- Return type
Fingerprint
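For bit fingerprints, folding can be pictured as mapping every sparse index into the shorter vector and keeping each resulting position once. The sketch below assumes the default method maps an index to index % bits; this is an assumption about the behavior, not a copy of the library code, although it does reproduce the folded indices printed in the class example. The helper name fold_indices is hypothetical.

```python
def fold_indices(indices, bits):
    """Return sorted unique folded indices, assuming folded = index % bits."""
    return sorted({index % bits for index in indices})


# The 30 random indices printed in the class example (bits=1024).
indices = [684, 559, 629, 192, 835, 763, 707, 359, 9, 723,
           277, 754, 804, 599, 70, 472, 600, 396, 314, 705,
           486, 551, 87, 174, 600, 849, 677, 537, 845, 72]

folded = fold_indices(indices, 32)
print(folded)
print(len(folded))  # 22 distinct positions remain after collisions
```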
- classmethod from_bitstring(bitstring, level=-1, **kwargs)[source]¶
Initialize from bitstring (e.g. ‘10010011’).
- Parameters
bitstring (str) – String of 1s and 0s.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns
fingerprint
- Return type
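The relationship between a bitstring and sparse indices can be sketched as follows. The sketch assumes the leftmost character corresponds to index 0, which matches the to_bitstring output and folded indices shown in the class example. The helper names are hypothetical and the code does not call e3fp.

```python
def indices_from_bitstring(bitstring):
    """Return positions of '1' characters, with the leftmost character at 0."""
    return [i for i, char in enumerate(bitstring) if char == "1"]


def bitstring_from_indices(indices, bits):
    """Inverse of indices_from_bitstring for a vector of length `bits`."""
    on = set(indices)
    return "".join("1" if i in on else "0" for i in range(bits))


# Bitstring printed by to_bitstring() in the class example.
bitstring = "11011111110011110111010111110000"
indices = indices_from_bitstring(bitstring)
print(indices)
print(bitstring_from_indices(indices, len(bitstring)) == bitstring)  # True
```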
- classmethod from_fingerprint(fp, **kwargs)[source]¶
Initialize by copying existing fingerprint.
- Parameters
fp (Fingerprint) – Existing fingerprint.
- Returns
fingerprint
- Return type
- classmethod from_indices(indices, bits=4294967296, level=-1, **kwargs)[source]¶
Initialize from an array of indices.
- Parameters
indices (array_like of int) – Indices in a sparse bitvector of length bits which correspond to 1.
bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns
fingerprint
- Return type
- classmethod from_rdkit(rdkit_fprint, **kwargs)[source]¶
Initialize from RDKit fingerprint.
If provided fingerprint is of length 2^32 - 1, assumes real fingerprint is of length 2^32.
- Parameters
rdkit_fprint (RDKit ExplicitBitVect or SparseBitVect) – Existing RDKit fingerprint.
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns
fingerprint
- Return type
- classmethod from_vector(vector, level=-1, **kwargs)[source]¶
Initialize from vector.
- Parameters
vector (numpy.ndarray or scipy.sparse.csr_matrix) – Array of bits/counts/floats
level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.
name (str, optional) – Name of fingerprint.
props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.
- Returns
fingerprint
- Return type
- get_count(index)[source]¶
Return count of index in fingerprint.
Defaults to 1 if index in self.indices
- Returns
Count of bit in fingerprint
- Return type
int
- get_folding_index_map()[source]¶
Get map of sparse indices to folded indices.
- Returns
Map of sparse index (keys) to corresponding folded index.
- Return type
dict
- get_unfolding_index_map()[source]¶
Get map of sparse indices to unfolded indices.
- Returns
Map of sparse index (keys) to set of corresponding unfolded indices.
- Return type
dict
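The shape of the two maps can be sketched in plain Python. The example assumes folding by index % bits and builds both directions from a list of unfolded indices. How e3fp stores these maps on linked fingerprints is not shown here, and the helper names are hypothetical.

```python
from collections import defaultdict


def folding_index_map(indices, bits):
    """Map each unfolded index to its folded index (assumes index % bits)."""
    return {index: index % bits for index in indices}


def unfolding_index_map(indices, bits):
    """Map each folded index to the set of unfolded indices that land on it."""
    unfolded = defaultdict(set)
    for index in indices:
        unfolded[index % bits].add(index)
    return dict(unfolded)


indices = [684, 396, 707, 9]
print(folding_index_map(indices, 32))
print(unfolding_index_map(indices, 32))
```

Several unfolded indices can share one folded index, which is why the unfolding map holds sets rather than single values.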
- property index_id_map¶
- property indices¶
- property level¶
- mean()[source]¶
Return mean, i.e. proportion of “on” bits in fingerprint.
- Returns
Mean
- Return type
float
- property mol¶
- property name¶
- property props¶
- std()[source]¶
Return standard deviation of fingerprint.
- Returns
Standard deviation
- Return type
float
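For a bit fingerprint, the mean is the fraction of "on" bits, and the standard deviation of a 0/1 vector with mean p is sqrt(p * (1 - p)). The sketch below applies this to the folded fingerprint from the class example. It assumes the population standard deviation over the full bitvector; whether e3fp uses the same convention is not stated on this page. The helper names are hypothetical.

```python
import math


def bit_mean(indices, bits):
    """Proportion of "on" bits in a bitvector of length `bits`."""
    return len(set(indices)) / bits


def bit_std(indices, bits):
    """Population standard deviation of the 0/1 vector: sqrt(p * (1 - p))."""
    p = bit_mean(indices, bits)
    return math.sqrt(p * (1.0 - p))


# Folded indices printed in the class example (bits=32).
folded = [0, 1, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15,
          17, 18, 19, 21, 23, 24, 25, 26, 27]
print(bit_mean(folded, 32))  # 22 / 32 = 0.6875
print(bit_std(folded, 32))
```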
- to_bitvector(sparse=True)[source]¶
Get full bitvector.
- Returns
Bitvector
- Return type
numpy.ndarray or scipy.sparse.csr_matrix of bool
- to_rdkit()[source]¶
Convert to RDKit fingerprint.
If number of bits exceeds 2^31 - 1, fingerprint will be folded to length 2^31 - 1 before conversion.
- Returns
rdkit_fprint – Bitvector used for RDKit fingerprints. If self.bits is less than 10^5, ExplicitBitVect is used. Otherwise, SparseBitVect is used.
- Return type
RDKit ExplicitBitVect or SparseBitVect
- to_vector(sparse=True, dtype=None)[source]¶
Get vector of bits/counts/floats.
- Returns
Vector of bits/counts/floats
- Return type
- unfold()[source]¶
Return unfolded parent fingerprint for bitvector.
- Returns
Fingerprint of unfolded bitvector. If there is none, None is returned.
- Return type
Fingerprint
- class FloatFingerprint(indices=None, counts=None, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]¶
Bases:
CountFingerprint
A Fingerprint that stores float counts.
Nearly identical to CountFingerprint. Mainly a naming convention, but count values are stored as floats.
See also
Fingerprint
A fingerprint that stores indices of “on” bits
CountFingerprint
A fingerprint that stores number of occurrences of each index
- property counts¶
- vector_dtype¶
alias of
float64
- add(fprints, weights=None)[source]¶
Add fingerprints by count to new CountFingerprint.
If any of the fingerprints are FloatFingerprint, resulting fingerprint is likewise a FloatFingerprint. Otherwise, resulting fingerprint is CountFingerprint.
- Parameters
fprints (iterable of Fingerprint) – Fingerprints to be added by count.
weights (iterable of float) – Weights for weighted sum. Results in FloatFingerprint output.
- Returns
Fingerprint with counts as sum of counts in fprints.
- Return type
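The rule for the result type of add can be sketched with plain dicts. In this illustration each fingerprint is represented by a hypothetical (kind, counts) pair, where kind is "bit", "count" or "float". The result is "float" if any input is "float" or if weights are supplied, and "count" otherwise. The representation and the helper name add_counts are assumptions made for the sketch and are not part of e3fp.

```python
def add_counts(fprints, weights=None):
    """Sum (kind, counts) pairs and report the kind of the result."""
    fprints = list(fprints)
    if weights is None:
        use_float = any(kind == "float" for kind, _ in fprints)
        weights = [1] * len(fprints)
    else:
        # Supplying weights always yields a float result.
        weights = list(weights)
        use_float = True
    if len(weights) != len(fprints):
        raise ValueError("weights must have the same length as fprints")
    total = {}
    for (_, counts), weight in zip(fprints, weights):
        for index, count in counts.items():
            total[index] = total.get(index, 0) + weight * count
    if use_float:
        return "float", {index: float(value) for index, value in total.items()}
    return "count", total


print(add_counts([("bit", {1: 1, 2: 1}), ("count", {2: 3})]))
print(add_counts([("count", {2: 3}), ("float", {2: 0.5})]))
print(add_counts([("bit", {1: 1, 2: 1}), ("bit", {2: 1})], weights=[2.0, 1.0]))
```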
- coerce_to_valid_dtype(dtype)[source]¶
Coerce provided NumPy data type to closest fingerprint data type.
If provided dtype cannot be read, default corresponding to bit Fingerprint is returned.
- Parameters
dtype (numpy.dtype or str) – Input NumPy data type.
- Returns
Output NumPy data type.
- Return type
- diff_counts_dict(fp1, fp2, only_positive=False)[source]¶
Given two fingerprints, returns difference of their counts dicts.
- Parameters
fp1, fp2 (Fingerprint) – Fingerprint objects, fp2 subtracted from fp1.
only_positive (bool, optional) – Return only positive counts, negative being thresholded to 0.
- Returns
counts_diff – Count indices in either fp1 or fp2 with value as diff of counts.
- Return type
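The difference of two counts dicts, including the only_positive option, might look like the sketch below. It works on plain dicts rather than Fingerprint objects and assumes that indices whose difference is 0 are left out of the result. The helper name diff_counts is hypothetical.

```python
def diff_counts(counts1, counts2, only_positive=False):
    """Return counts1 - counts2 per index, omitting zero differences."""
    diff = {}
    for index in set(counts1) | set(counts2):
        value = counts1.get(index, 0) - counts2.get(index, 0)
        if only_positive and value < 0:
            # Negative differences are thresholded to 0 and then dropped.
            value = 0
        if value != 0:
            diff[index] = value
    return diff


a = {1: 3, 2: 1}
b = {2: 4, 5: 2}
print(sorted(diff_counts(a, b).items()))
print(sorted(diff_counts(a, b, only_positive=True).items()))
```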
- dtype_from_fptype(fp_type)[source]¶
Get NumPy data type from fingerprint type.
- Parameters
fp_type (class or Fingerprint) – Class of fingerprint
- Returns
NumPy data type
- Return type
- fptype_from_dtype(dtype)[source]¶
Get corresponding fingerprint type from NumPy data type.
- Parameters
dtype (numpy.dtype or str) – NumPy data type.
- Returns
class – Class of fingerprint
- Return type
{Fingerprint, CountFingerprint, FloatFingerprint}
- load(f, update_structure=True)[source]¶
Load Fingerprint object from file.
- Parameters
f (str or File) – File name or file-like object to load file from.
update_structure (bool, optional) – Attempt to update the class structure by initializing a new, shiny fingerprint from each fingerprint in the file. Useful for guaranteeing that old, dusty fingerprints are always upgradeable.
- Returns
Pickled fingerprint.
- Return type
Fingerprint
- loadz(f, update_structure=True)[source]¶
Load Fingerprint objects from file.
- Parameters
f (str or File) – File name or file-like object to load file from.
update_structure (bool, optional) – Attempt to update the class structure by initializing a new, shiny fingerprint from each fingerprint in the file. Useful for guaranteeing that old, dusty fingerprints are always upgradeable. If this doesn’t work, falls back to the original saved fingerprint.
- Returns
Fingerprints in pickle.
- Return type
list of Fingerprint
- mean(fprints, weights=None)[source]¶
Average fingerprints to generate FloatFingerprint.
- Parameters
fprints (iterable of Fingerprint) – Fingerprints to be added by count.
weights (array_like of float, optional) – Weights for weighted mean. Weights are normalized to a sum of 1.
- Returns
Fingerprint with float counts as average of counts in fprints.
- Return type
FloatFingerprint
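A weighted mean with weights normalized to a sum of 1 can be sketched on plain counts dicts. This is an illustration of the arithmetic only, using a hypothetical helper mean_counts; it does not construct a FloatFingerprint and is not taken from the library source.

```python
def mean_counts(counts_list, weights=None):
    """Average {index: count} dicts; weights are normalized to sum to 1."""
    counts_list = list(counts_list)
    if weights is None:
        weights = [1.0] * len(counts_list)
    weights = list(weights)
    if len(weights) != len(counts_list):
        raise ValueError("weights must have the same length as counts_list")
    total = float(sum(weights))
    if total == 0:
        raise ValueError("weights must not sum to zero")
    result = {}
    for counts, weight in zip(counts_list, weights):
        for index, count in counts.items():
            result[index] = result.get(index, 0.0) + (weight / total) * count
    return result


print(mean_counts([{1: 2, 3: 4}, {1: 4}]))
print(mean_counts([{1: 2, 3: 4}, {1: 4}], weights=[3, 1]))
```

An index missing from one fingerprint contributes a count of 0 to the average, so index 3 above averages to 2.0 in the unweighted case.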
- save(f, fp, **kwargs)[source]¶
Save Fingerprint object to file.
- Parameters
f (str or File) – filename str or file-like object to save file to
fp (Fingerprint) – Fingerprint to save to file
protocol ({0, 1, 2, None}, optional) – Pickle protocol to use. If None, highest available protocol is used. This will not affect fingerprint loading.
- Returns
Success or fail
- Return type
bool
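The note that the protocol does not affect loading reflects how pickle works in general: pickle.load detects the protocol from the data. The sketch below shows a round trip with the standard library only. It is not the e3fp implementation; the helpers save_object and load_object are hypothetical, and a plain dict stands in for a fingerprint.

```python
import os
import pickle
import tempfile


def save_object(path, obj, protocol=None):
    """Pickle `obj` to `path`; return True on success and False on failure."""
    if protocol is None:
        protocol = pickle.HIGHEST_PROTOCOL
    try:
        with open(path, "wb") as handle:
            pickle.dump(obj, handle, protocol)
    except (OSError, pickle.PicklingError):
        return False
    return True


def load_object(path):
    """Load a pickled object; the protocol is detected automatically."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a fingerprint, used only for this illustration.
payload = {"indices": [0, 1, 3], "bits": 32, "level": 0}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fprint.pkl")
    for protocol in (0, 2, None):
        saved = save_object(path, payload, protocol=protocol)
        print(protocol, saved, load_object(path) == payload)
```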
- savez(f, *fps, **kwargs)[source]¶
Save multiple Fingerprint objects to file.
- Parameters
f (str or File) – filename str or file-like object to save file to
fps (list of Fingerprint) – List of Fingerprints to save to file
protocol ({0, 1, 2, None}, optional) – Pickle protocol to use. If None, highest available protocol is used. This will not affect fingerprint loading.
- Returns
Success or fail
- Return type
bool
- sum_counts_dict(*fprints, **kwargs)[source]¶
Given fingerprints, return sum of their counts dicts.
If an optional weights iterable of the same length as fprints is provided, the weighted sum is returned.
- Parameters
*fprints – One or more Fingerprint objects
weights (iterable of float, optional) – Weights for weighted mean. Weights are normalized to a sum of 1.
- Returns
Dict of non-zero count indices in any of the fprints with value as sum of counts.
- Return type
dict