Fingerprint Storage
The most efficient way to store and interact with fingerprints is through the
FingerprintDatabase class. This class wraps a matrix with
sparse rows (scipy.sparse.csr_matrix), where each row is a
fingerprint. This enables rapid I/O of the database while also minimizing the
memory footprint. Accessing the underlying sparse representation with the
FingerprintDatabase.array attribute is convenient for machine learning
purposes, while the database class itself provides several useful functions.
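As a rough illustration of the storage layout described above, the following sketch builds a row-sparse matrix directly with SciPy, without using e3fp. The toy indices and the variable names (`bits`, `on_indices`, `array`) are hypothetical; the actual FingerprintDatabase constructs and manages its matrix internally.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy data: three fingerprints in a 2**20-bit space,
# each given as a sorted list of "on" bit indices.
bits = 2**20
on_indices = [[5, 1000, 70000], [5, 42], [999999]]

# CSR layout: `cols` holds the "on" column indices of all rows back to back,
# and `indptr[i]:indptr[i + 1]` marks the slice that belongs to row i.
cols = np.concatenate(on_indices)
indptr = np.cumsum([0] + [len(x) for x in on_indices])
data = np.ones(len(cols), dtype=bool)
array = csr_matrix((data, cols, indptr), shape=(len(on_indices), bits))

print(array.shape)  # (3, 1048576)
print(array.nnz)    # 6 stored entries, regardless of the number of columns
```

Only the "on" bits are stored, so memory use scales with the number of set bits rather than with the fingerprint length.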
Note
We strongly recommend upgrading to at least SciPy v1.0.0 when working with large fingerprint databases, as older versions are much slower and have several bugs in database loading.
Database I/O and Indexing
See the full FingerprintDatabase documentation for a
description of basic database usage, attributes, and methods. Below, several
additional use cases are documented.
Batch Database Operations
Because the underlying data structure is sparse, an unfolded database (one
that stores unfolded fingerprints) does not use significantly more disk space
than a database of folded fingerprints. However, it is usually necessary to
fold fingerprints for machine learning tasks. FingerprintDatabase.fold folds
every fingerprint in the database in a single, fast operation.
>>> from e3fp.fingerprint.db import FingerprintDatabase
>>> from e3fp.fingerprint.fprint import Fingerprint
>>> import numpy as np
>>> db = FingerprintDatabase(fp_type=Fingerprint, name="TestDB")
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: None, fp_num: 0]
>>> on_inds = [np.random.uniform(0, 2**32, size=30) for i in range(5)]
>>> fps = [Fingerprint(x, bits=2**32) for x in on_inds]
>>> db.add_fingerprints(fps)
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 5]
>>> db.get_density()
6.984919309616089e-09
>>> fold_db = db.fold(1024)
>>> print(fold_db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 5]
>>> fold_db.get_density()
0.0287109375
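The rise in density after folding can be reproduced with a small NumPy sketch. It assumes that folding maps each "on" index to its value modulo the new length and merges collisions with a logical OR; this is an illustration of the idea, not necessarily how e3fp implements it. The helpers `fold_indices` and `density` are hypothetical names.

```python
import numpy as np

def fold_indices(indices, bits):
    """Fold "on" indices into `bits` positions (assumed rule: index modulo bits)."""
    folded = np.asarray(indices, dtype=np.int64) % bits
    # Indices that collide after folding collapse into a single "on" bit.
    return np.unique(folded)

def density(rows, bits):
    """Fraction of "on" bits across all rows."""
    return sum(len(r) for r in rows) / (len(rows) * bits)

# Toy fingerprints in a 2**32-bit space.
rows = [np.array([3, 1027, 2**31 + 5]), np.array([10, 2058])]
folded = [fold_indices(r, 1024) for r in rows]

print(folded[0])              # [3 5]: 3 and 1027 collide at position 3
print(folded[1])              # [10]: 10 and 2058 collide at position 10
print(density(folded, 1024))  # 0.00146484375
```

Under this assumption, collisions explain why the folded density is slightly below the number of "on" bits per fingerprint divided by the folded length.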
A database can be converted to a different fingerprint type:
>>> from e3fp.fingerprint.fprint import CountFingerprint
>>> count_db = db.as_type(CountFingerprint)
>>> print(count_db)
FingerprintDatabase[name: TestDB, fp_type: CountFingerprint, level: -1, bits: 4294967296, fp_num: 5]
>>> count_db[0]
CountFingerprint(counts={2977004690: 1, ..., 3041471738: 1}, level=-1, bits=4294967296, name=None)
The e3fp.fingerprint.db.concat() function efficiently joins multiple
databases into one.
>>> from e3fp.fingerprint.db import concat
>>> dbs = []
>>> for i in range(10):
...     db = FingerprintDatabase(fp_type=Fingerprint)
...     on_inds = [np.random.uniform(0, 1024, size=30) for j in range(5)]
...     fps = [Fingerprint(x, bits=2**32, name="Mol{}".format(i)) for x in on_inds]
...     db.add_fingerprints(fps)
...     dbs.append(db)
>>> dbs[0][0]
Fingerprint(indices=array([94, 97, ..., 988, 994]), level=-1, bits=4294967296, name=Mol0)
>>> print(dbs[0])
FingerprintDatabase[name: None, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 5]
>>> merge_db = concat(dbs)
>>> print(merge_db)
FingerprintDatabase[name: None, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 50]
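At the level of the underlying matrices, joining databases amounts to stacking row-sparse blocks. The sketch below shows this with `scipy.sparse.vstack` on randomly generated toy blocks; it is not a description of what `concat()` does internally, and `random_block` is a hypothetical helper.

```python
import numpy as np
from scipy.sparse import csr_matrix, vstack

bits = 1024
rng = np.random.default_rng(0)

def random_block(n_rows, n_on=30):
    """Build a toy CSR block with exactly `n_on` distinct "on" bits per row."""
    rows = np.repeat(np.arange(n_rows), n_on)
    cols = np.concatenate(
        [rng.choice(bits, size=n_on, replace=False) for _ in range(n_rows)]
    )
    data = np.ones(rows.size, dtype=bool)
    return csr_matrix((data, (rows, cols)), shape=(n_rows, bits))

# Ten blocks of five fingerprints each, stacked into one 50-row matrix.
blocks = [random_block(5) for _ in range(10)]
merged = vstack(blocks, format="csr")

print(merged.shape)  # (50, 1024)
print(merged.nnz)    # 1500
```

Stacking preserves row order, so rows 5 through 9 of `merged` are the rows of the second block.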
Database Comparison
Two databases may be compared using the metrics in
e3fp.fingerprint.metrics. All fingerprints in a database may also be
compared to one another by providing only that database.
See Fingerprint Comparison for more details.
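To make the idea of a pairwise comparison concrete, here is a minimal sketch of a Tanimoto similarity computed directly on sparse binary matrices with SciPy. The function `tanimoto_sparse` is a hypothetical helper written for this example; it is not the e3fp.fingerprint.metrics implementation, which should be preferred in practice.

```python
import numpy as np
from scipy.sparse import csr_matrix

def tanimoto_sparse(a, b):
    """Pairwise Tanimoto similarity between the rows of two binary sparse matrices."""
    a = csr_matrix(a, dtype=np.float64)
    b = csr_matrix(b, dtype=np.float64)
    # Shared "on" bits for every pair of rows, via a sparse matrix product.
    inter = (a @ b.T).toarray()
    a_counts = np.asarray(a.sum(axis=1))    # shape (n_a, 1)
    b_counts = np.asarray(b.sum(axis=1)).T  # shape (1, n_b)
    union = a_counts + b_counts - inter
    # Pairs of empty fingerprints have an empty union; report 0 for them.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, inter / union, 0.0)

x = csr_matrix(np.array([[1, 1, 0, 0],
                         [0, 1, 1, 1]]))

# Passing the same matrix twice compares every fingerprint to every other.
sims = tanimoto_sparse(x, x)
print(sims)
```

Here the two rows share one "on" bit out of four distinct ones, so the off-diagonal entries are 0.25 and the diagonal entries are 1.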
Performing Machine Learning on the Database
The underlying sparse matrix may be passed directly to machine learning tools in any package that is compatible with SciPy sparse matrices, such as scikit-learn.
>>> from sklearn.naive_bayes import BernoulliNB
>>> clf = BernoulliNB()
>>> clf.fit(db.array, ypred)
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
>>> clf.predict(db2.array)
...