Fingerprints
The simplest interface to molecular fingerprints is through three classes in e3fp.fingerprint.fprint:
Fingerprint
a fingerprint with “on” bits
CountFingerprint
a fingerprint with counts for each “on” bit
FloatFingerprint
a fingerprint with float values for each “on” bit, generated for example by averaging conformer fingerprints.
In addition to storing “on” indices and, for the latter two, corresponding values, they store fingerprint properties such as name, level, and any arbitrary property. They also provide simple interfaces for fingerprint comparison, some basic processing, and algebra.
Note
Many of these operations are more efficient when operating on a FingerprintDatabase. See Fingerprint Storage for more information.
In the examples below, we focus on Fingerprint and CountFingerprint. First, we execute the necessary imports.
>>> from e3fp.fingerprint.fprint import Fingerprint, CountFingerprint
>>> import numpy as np
Creation and Conversion
Here we create a bit-fingerprint with random “on” indices.
>>> bits = 2**32
>>> indices = np.sort(np.random.randint(0, bits, 30))
>>> indices
array([ 243580376, 305097549, ..., 3975407269, 4138900056])
>>> fp1 = Fingerprint(indices, bits=bits, level=0)
>>> fp1
Fingerprint(indices=array([243580376, ..., 4138900056]), level=0, bits=4294967296, name=None)
This fingerprint is extremely sparse:
>>> fp1.bit_count
30
>>> fp1.density
6.984919309616089e-09
We can therefore “fold” the fingerprint through a series of bitwise “OR” operations on halves of the sparse vector until it is of a specified length, with minimal collision of bits.
>>> fp_folded = fp1.fold(1024)
>>> fp_folded
Fingerprint(indices=array([9, 70, ..., 845, 849]), level=0, bits=1024, name=None)
>>> fp_folded.bit_count
29
>>> fp_folded.density
0.0283203125
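The folding described above can be sketched without e3fp. This is a minimal illustration, not e3fp's implementation: the helper names `fold_indices` and `fold_by_halving` are hypothetical, and the sketch assumes both lengths are powers of two. Under that assumption, repeatedly OR-ing the upper half of the vector onto the lower half sends index `i` to `i % folded_bits`, and indices that land on the same position merge into one “on” bit.

```python
def fold_indices(indices, bits, folded_bits):
    """Fold "on" indices of a `bits`-long vector to `folded_bits` (hypothetical helper)."""
    if bits % folded_bits != 0:
        raise ValueError("folded_bits must evenly divide bits")
    # A set models the OR: colliding indices collapse into a single "on" bit.
    return sorted({i % folded_bits for i in indices})


def fold_by_halving(indices, bits, folded_bits):
    """Same result, computed literally by OR-ing the upper half onto the lower half."""
    on = set(indices)
    while bits > folded_bits:
        bits //= 2
        # Indices in the upper half are shifted down onto the lower half.
        on = {i - bits if i >= bits else i for i in on}
    return sorted(on)


example = [3, 1027, 2051, 5000]
folded = fold_indices(example, bits=2**32, folded_bits=1024)
halved = fold_by_halving(example, bits=2**32, folded_bits=1024)
print(folded, halved)
```

Three of the four example indices collide at position 3, which is why a folded fingerprint can have a lower bit count than the original.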
A CountFingerprint may be created by providing a dictionary that maps each index with a nonzero count to its count.
>>> indices2 = np.sort(np.random.randint(0, bits, 60))
>>> counts = dict(zip(indices2, np.random.randint(1, 10, indices2.size)))
>>> counts
{80701568: 8, 580757632: 7, ..., 800291326: 5, 4057322111: 7}
>>> cfp1 = CountFingerprint(counts=counts, bits=bits, level=0)
>>> cfp1
CountFingerprint(counts={80701568: 8, 580757632: 7, ..., 3342157822: 2, 4057322111: 7}, level=0, bits=4294967296, name=None)
Unlike folding a bit fingerprint, by default, folding a count fingerprint performs a “SUM” operation on colliding counts.
>>> cfp1.bit_count
60
>>> cfp_folded = cfp1.fold(1024)
>>> cfp_folded
CountFingerprint(counts={128: 15, 257: 4, ..., 1022: 2, 639: 7}, level=0, bits=1024, name=None)
>>> cfp_folded.bit_count
57
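The “SUM” behavior on collisions can be sketched with a plain dictionary. This is an illustration only, assuming the folded position of an index is its remainder modulo the folded length (true for power-of-two lengths); the helper name `fold_counts` is hypothetical and is not part of e3fp.

```python
def fold_counts(counts, folded_bits):
    """Fold an index-to-count mapping, summing counts that collide (hypothetical helper)."""
    folded = {}
    for index, count in counts.items():
        position = index % folded_bits
        # Colliding indices add their counts instead of merging into one bit.
        folded[position] = folded.get(position, 0) + count
    return folded


sparse_counts = {5: 2, 1029: 3, 7: 1}
folded_counts = fold_counts(sparse_counts, folded_bits=1024)
print(folded_counts)
```

Summing keeps the total count unchanged by folding, even though the number of distinct “on” positions may drop.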
It is trivial to interconvert the fingerprints.
>>> cfp_folded2 = CountFingerprint.from_fingerprint(fp_folded)
>>> cfp_folded2
CountFingerprint(counts={9: 1, 87: 1, ..., 629: 1, 763: 1}, level=0, bits=1024, name=None)
>>> cfp_folded2.indices[:5]
array([ 9, 70, 72, 87, 174])
>>> fp_folded.indices[:5]
array([ 9, 70, 72, 87, 174])
RDKit Morgan fingerprints (analogous to ECFP) may easily be converted to a Fingerprint.
>>> from rdkit import Chem
>>> from rdkit.Chem import AllChem
>>> mol = Chem.MolFromSmiles('Cc1ccccc1')
>>> mfp = AllChem.GetMorganFingerprintAsBitVect(mol, 2)
>>> mfp
<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...>
>>> Fingerprint.from_rdkit(mfp)
Fingerprint(indices=array([389, 1055, ..., 1873, 1920]), level=-1, bits=2048, name=None)
Likewise, a Fingerprint can easily be converted to a NumPy ndarray or SciPy sparse matrix.
>>> fp_folded.to_vector()
<1x1024 sparse matrix of type '<type 'numpy.bool_'>'
...with 29 stored elements in Compressed Sparse Row format>
>>> fp_folded.to_vector(sparse=False)
array([False, False, False, ..., False, False, False], dtype=bool)
>>> np.where(fp_folded.to_vector(sparse=False))[0]
array([ 9, 70, 72, 87, ...])
>>> cfp_folded.to_vector(sparse=False)
array([0, 0, 0, ..., 0, 2, 0], dtype=uint16)
>>> cfp_folded.to_vector(sparse=False).sum()
252
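The relationship between the stored “on” indices and a dense vector can be sketched in plain Python. The helpers `indices_to_dense` and `dense_to_indices` are hypothetical names used for illustration; they are not e3fp functions, and a list of booleans stands in for the NumPy array.

```python
def indices_to_dense(indices, bits):
    """Expand "on" indices into a dense boolean list of length `bits` (hypothetical helper)."""
    dense = [False] * bits
    for i in indices:
        if not 0 <= i < bits:
            raise ValueError("index out of range for fingerprint length")
        dense[i] = True
    return dense


def dense_to_indices(dense):
    """Recover the sorted "on" indices from a dense boolean list."""
    return [i for i, on in enumerate(dense) if on]


dense = indices_to_dense([1, 4, 6], bits=8)
recovered = dense_to_indices(dense)
print(dense, recovered)
```

A dense vector of length 2**32 would be impractical, which is one reason to fold a fingerprint before converting it.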
Algebra
Basic algebraic functions may be performed on fingerprints. If either fingerprint is a bit fingerprint, all algebraic functions are bit-wise. The following bit-wise operations are supported:
- Equality
>>> fp1 = Fingerprint([0, 1, 6, 8, 12], bits=16)
>>> fp2 = Fingerprint([1, 2, 4, 8, 11, 12], bits=16)
>>> fp1 == fp2
False
>>> fp1_copy = Fingerprint.from_fingerprint(fp1)
>>> fp1 == fp1_copy
True
>>> fp1_copy.level = 5
>>> fp1 == fp1_copy
False
- Union/OR
>>> fp1 + fp2
Fingerprint(indices=array([0, 1, 2, 4, 6, 8, 11, 12]), level=-1, bits=16, name=None)
>>> fp1 | fp2
Fingerprint(indices=array([0, 1, 2, 4, 6, 8, 11, 12]), level=-1, bits=16, name=None)
- Intersection/AND
>>> fp1 & fp2
Fingerprint(indices=array([1, 8, 12]), level=-1, bits=16, name=None)
- Difference/AND NOT
>>> fp1 - fp2
Fingerprint(indices=array([0, 6]), level=-1, bits=16, name=None)
>>> fp2 - fp1
Fingerprint(indices=array([2, 4, 11]), level=-1, bits=16, name=None)
- XOR
>>> fp1 ^ fp2
Fingerprint(indices=array([0, 2, 4, 6, 11]), level=-1, bits=16, name=None)
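Because a bit fingerprint is defined by its “on” indices, the bit-wise operations listed above correspond to ordinary set operations on those indices. The following sketch uses Python sets only, with the same indices as the two 16-bit fingerprints above; it illustrates the semantics and is not how e3fp implements them.

```python
# "On" indices of the two 16-bit fingerprints used in the list above.
on1 = {0, 1, 6, 8, 12}
on2 = {1, 2, 4, 8, 11, 12}

union = sorted(on1 | on2)          # OR: on in either fingerprint
intersection = sorted(on1 & on2)   # AND: on in both fingerprints
difference = sorted(on1 - on2)     # AND NOT: on in the first only
xor = sorted(on1 ^ on2)            # XOR: on in exactly one fingerprint

print(union, intersection, difference, xor)
```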
With count or float fingerprints, bit-wise operations are still possible, but algebraic operations are applied to counts.
>>> fp1 = CountFingerprint(counts={0: 3, 1: 2, 5: 1, 9: 3}, bits=16)
>>> fp2 = CountFingerprint(counts={1: 2, 5: 2, 7: 3, 10: 7}, bits=16)
>>> fp1 + fp2
CountFingerprint(counts={0: 3, 1: 4, 5: 3, 7: 3, 9: 3, 10: 7}, level=-1, bits=16, name=None)
>>> fp1 - fp2
CountFingerprint(counts={0: 3, 1: 0, 5: -1, 7: -3, 9: 3, 10: -7}, level=-1, bits=16, name=None)
>>> fp1 * 3
CountFingerprint(counts={0: 9, 1: 6, 5: 3, 9: 9}, level=-1, bits=16, name=None)
>>> fp1 / 2
FloatFingerprint(counts={0: 1.5, 1: 1.0, 5: 0.5, 9: 1.5}, level=-1, bits=16, name=None)
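The count arithmetic shown above can be reproduced with plain dictionaries, treating a missing index as a count of zero. This is a sketch of the semantics only; `add_counts` and `sub_counts` are hypothetical helpers, not e3fp functions.

```python
def add_counts(a, b):
    """Element-wise sum of two index-to-count mappings (hypothetical helper)."""
    return {k: a.get(k, 0) + b.get(k, 0) for k in sorted(set(a) | set(b))}


def sub_counts(a, b):
    """Element-wise difference; zero and negative results are kept."""
    return {k: a.get(k, 0) - b.get(k, 0) for k in sorted(set(a) | set(b))}


c1 = {0: 3, 1: 2, 5: 1, 9: 3}
c2 = {1: 2, 5: 2, 7: 3, 10: 7}

total = add_counts(c1, c2)
diff = sub_counts(c1, c2)
scaled = {k: v * 3 for k, v in c1.items()}   # multiplication by a scalar
halved = {k: v / 2 for k, v in c1.items()}   # true division yields floats

print(total, diff, scaled, halved)
```

Division produces non-integer values, which is consistent with the result above being a FloatFingerprint rather than a CountFingerprint.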
Finally, fingerprints may be batch added and averaged, producing either a count or float fingerprint when sensible.
>>> from e3fp.fingerprint.fprint import add, mean
>>> fps = [Fingerprint(np.random.randint(0, 32, 8), bits=32) for i in range(100)]
>>> add(fps)
CountFingerprint(counts={0: 23, 1: 23, ..., 30: 20, 31: 14}, level=-1, bits=32, name=None)
>>> mean(fps)
FloatFingerprint(counts={0: 0.23, 1: 0.23, ..., 30: 0.2, 31: 0.14}, level=-1, bits=32, name=None)
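Conceptually, batch addition counts how many fingerprints have each bit “on”, and the mean divides those counts by the number of fingerprints. The sketch below illustrates this with plain Python; `add_bit_sets` and `mean_bit_sets` are hypothetical helpers, and the sketch assumes each input contributes at most 1 per index.

```python
from collections import Counter


def add_bit_sets(fps):
    """Count, per index, how many fingerprints have that bit on (hypothetical helper)."""
    totals = Counter()
    for on_indices in fps:
        # set() ensures a repeated index within one fingerprint counts once.
        totals.update(set(on_indices))
    return dict(totals)


def mean_bit_sets(fps):
    """Fraction of fingerprints with each bit on."""
    if not fps:
        raise ValueError("need at least one fingerprint")
    n = len(fps)
    return {k: v / n for k, v in add_bit_sets(fps).items()}


toy_fps = [[0, 1, 2], [1, 2], [2, 5], [2, 2, 5]]
summed = add_bit_sets(toy_fps)
averaged = mean_bit_sets(toy_fps)
print(summed, averaged)
```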