Fingerprint Storage

The most efficient way to store and interact with fingerprints is through the FingerprintDatabase class. This class wraps a matrix with sparse rows (scipy.sparse.csr_matrix), where each row is a fingerprint. This enables rapid I/O of the database while also minimizing the memory footprint. Accessing the underlying sparse representation with the FingerprintDatabase.array attribute is convenient for machine learning purposes, while the database class itself provides several useful functions.

Note

We strongly recommend upgrading to at least SciPy v1.0.0 when working with large fingerprint databases, as old versions are much slower and have several bugs for database loading.

Database I/O and Indexing

See the full FingerprintDatabase documentation for a description of basic database usage, attributes, and methods. Below, several additional use cases are documented.

Batch Database Operations

Due to the sparse representation of the underlying data structure, an un- folded database, a database with unfolded fingerprints does not use significantly more disk space than a database with folded fingerprints. However, it is usually necessary to fold fingerprints for machine learning tasks. The FingerprintDatabase does this very quickly.

>>> from e3fp.fingerprint.db import FingerprintDatabase
>>> from e3fp.fingerprint.fprint import Fingerprint
>>> import numpy as np
>>> db = FingerprintDatabase(fp_type=Fingerprint, name="TestDB")
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: None, fp_num: 0]
>>> on_inds = [np.random.uniform(0, 2**32, size=30) for i in range(5)]
>>> fps = [Fingerprint(x, bits=2**32) for x in on_inds]
>>> db.add_fingerprints(fps)
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 5]
>>> db.get_density()
6.984919309616089e-09
>>> fold_db = db.fold(1024)
>>> print(fold_db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 5]
>>> fold_db.get_density()
0.0287109375

A database can be converted to a different fingerprint type:

>>> from e3fp.fingerprint.fprint import CountFingerprint
>>> count_db = db.as_type(CountFingerprint)
>>> print(count_db)
FingerprintDatabase[name: TestDB, fp_type: CountFingerprint, level: -1, bits: 4294967296, fp_num: 5]
>>> count_db[0]
CountFingerprint(counts={2977004690: 1, ..., 3041471738: 1}, level=-1, bits=4294967296, name=None)

The e3fp.fingerprint.db.concat() method allows efficient joining of multiple databases.

>>> from e3fp.fingerprint.db import concat
>>> dbs = []
>>> for i in range(10):
...     db = FingerprintDatabase(fp_type=Fingerprint)
...     on_inds = [np.random.uniform(0, 1024, size=30) for j in range(5)]
...     fps = [Fingerprint(x, bits=2**32, name="Mol{}".format(i)) for x in on_inds]
...     db.add_fingerprints(fps)
...     dbs.append(db)
>>> dbs[0][0]
Fingerprint(indices=array([94, 97, ..., 988, 994]), level=-1, bits=4294967296, name=Mol0)
>>> print(dbs[0])
FingerprintDatabase[name: None, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 5]
>>> merge_db = concat(dbs)
>>> print(merge_db)
FingerprintDatabase[name: None, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 50]

Database Comparison

Two databases may be compared using various metrics in e3fp.fingerprint.metrics. Additionally, all fingerprints in a database may be compared to each other simply by only providing a single database. See Fingerprint Comparison for more details.

Performing Machine Learning on the Database

The underlying sparse matrix may be passed directly to machine learning tools in any package that is compatible with SciPy sparse matrices, such as scikit-learn.

>>> from sklearn.naive_bayes import BernoulliNB
>>> clf = BernoulliNB()
>>> clf.fit(db.array, ypred)  
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
>>> clf.predict(db2.array)   
...