Fingerprint Storage =================== The most efficient way to store and interact with fingerprints is through the :py:class:`.FingerprintDatabase` class. This class wraps a matrix with sparse rows (:py:class:`scipy.sparse.csr_matrix`), where each row is a fingerprint. This enables rapid I/O of the database while also minimizing the memory footprint. Accessing the underlying sparse representation with the :py:attr:`.FingerprintDatabase.array` attribute is convenient for machine learning purposes, while the database class itself provides several useful functions. .. note:: We strongly recommend upgrading to at least SciPy v1.0.0 when working with large fingerprint databases, as old versions are much slower and have several bugs for database loading. Database I/O and Indexing ------------------------- See the full :py:class:`.FingerprintDatabase` documentation for a description of basic database usage, attributes, and methods. Below, several additional use cases are documented. Batch Database Operations ------------------------- Due to the sparse representation of the underlying data structure, an un- folded database, a database with unfolded fingerprints does not use significantly more disk space than a database with folded fingerprints. However, it is usually necessary to fold fingerprints for machine learning tasks. The :py:class:`.FingerprintDatabase` does this very quickly. .. testsetup:: import numpy as np np.random.seed(3) .. doctest:: >>> from e3fp.fingerprint.db import FingerprintDatabase >>> from e3fp.fingerprint.fprint import Fingerprint >>> import numpy as np >>> db = FingerprintDatabase(fp_type=Fingerprint, name="TestDB") >>> print(db) FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: None, fp_num: 0] >>> on_inds = [np.random.uniform(0, 2**32, size=30) for i in range(5)] >>> fps = [Fingerprint(x, bits=2**32) for x in on_inds] >>> db.add_fingerprints(fps) >>> print(db) FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 5] >>> db.get_density() 6.984919309616089e-09 >>> fold_db = db.fold(1024) >>> print(fold_db) FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 5] >>> fold_db.get_density() 0.0287109375 A database can be converted to a different fingerprint type: >>> from e3fp.fingerprint.fprint import CountFingerprint >>> count_db = db.as_type(CountFingerprint) >>> print(count_db) FingerprintDatabase[name: TestDB, fp_type: CountFingerprint, level: -1, bits: 4294967296, fp_num: 5] >>> count_db[0] CountFingerprint(counts={2977004690: 1, ..., 3041471738: 1}, level=-1, bits=4294967296, name=None) The :py:func:`e3fp.fingerprint.db.concat` method allows efficient joining of multiple databases. >>> from e3fp.fingerprint.db import concat >>> dbs = [] >>> for i in range(10): ... db = FingerprintDatabase(fp_type=Fingerprint) ... on_inds = [np.random.uniform(0, 1024, size=30) for j in range(5)] ... fps = [Fingerprint(x, bits=2**32, name="Mol{}".format(i)) for x in on_inds] ... db.add_fingerprints(fps) ... dbs.append(db) >>> dbs[0][0] Fingerprint(indices=array([94, 97, ..., 988, 994]), level=-1, bits=4294967296, name=Mol0) >>> print(dbs[0]) FingerprintDatabase[name: None, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 5] >>> merge_db = concat(dbs) >>> print(merge_db) FingerprintDatabase[name: None, fp_type: Fingerprint, level: -1, bits: 4294967296, fp_num: 50] Database Comparison ------------------- Two databases may be compared using various metrics in :py:mod:`e3fp.fingerprint.metrics`. Additionally, all fingerprints in a database may be compared to each other simply by only providing a single database. See :ref:`usage/fingerprints/comparison:Fingerprint Comparison` for more details. Performing Machine Learning on the Database ------------------------------------------- The underlying sparse matrix may be passed directly to machine learning tools in any package that is compatible with SciPy sparse matrices, such as `scikit-learn `_. >>> from sklearn.naive_bayes import BernoulliNB >>> clf = BernoulliNB() >>> clf.fit(db.array, ypred) # doctest: +SKIP BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True) >>> clf.predict(db2.array) # doctest: +SKIP ...