e3fp.fingerprint.db module
Database for accessing and serializing fingerprints.
Author: Seth Axen
E-mail: seth.axen@gmail.com
- class FingerprintDatabase(fp_type=<class 'e3fp.fingerprint.fprint.Fingerprint'>, level=-1, name=None)
Bases: object
Efficiently build, access, compare, and save fingerprints.
All fingerprints in a database must have the same values of bits and level. Additionally, all fingerprints are cast to the type of fingerprint passed to the database upon instantiation.
- Parameters
fp_type (type, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).
level (int, optional) – Level, or number of iterations used during fingerprinting.
name (str, optional) – Name of database.
- Variables
array (scipy.sparse.csr_matrix) – Sparse matrix with dimensions N x M, where M is bits and N is fp_num.
bits (int) – Number of bits (length) of fingerprints.
fp_names_to_indices (dict) – Map from fingerprint name to row indices of array.
fp_num (int) – Number of fingerprints in database.
fp_type (type) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).
level (int) – Level, or number of iterations used during fingerprinting.
name (str) – Name of database.
props (dict) – Dict with keys specifying names of fingerprint properties and values corresponding to array of values.
Notes
Since most fingerprints are very sparse length-wise, FingerprintDatabase is implemented as a wrapper around a scipy.sparse.csr_matrix for efficient memory usage. This provides easy access to the underlying data for tight integration with NumPy/SciPy and machine learning packages, while also providing several fingerprint-specific features.
See also
e3fp.fingerprint.fprint.Fingerprint
A fingerprint that stores indices of “on” bits.
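To make the memory argument in the Notes concrete, the following pure-Python sketch lays out a few bit vectors in compressed sparse row (CSR) form. It is an illustration only and uses neither scipy nor e3fp; the helper name `to_csr` is hypothetical.

```python
def to_csr(rows):
    """Convert dense 0/1 rows to CSR-style (indices, indptr) lists.

    Hypothetical helper for illustration. A scipy.sparse.csr_matrix keeps
    the same two arrays, plus a data array holding the nonzero values.
    """
    indices = []  # column index of every "on" bit, row after row
    indptr = [0]  # indices[indptr[i]:indptr[i + 1]] holds row i
    for row in rows:
        indices.extend(col for col, bit in enumerate(row) if bit)
        indptr.append(len(indices))
    return indices, indptr


dense = [
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0, 0],
]
indices, indptr = to_csr(dense)
print(indices)  # [1, 7, 0, 3]
print(indptr)   # [0, 2, 2, 4]

# Recover the "on" bits of row 2 without scanning the other rows.
row2 = indices[indptr[2]:indptr[3]]
print(row2)     # [0, 3]
```

Only the positions of "on" bits are stored, so the cost grows with the number of set bits rather than with N x M.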
Examples
>>> from e3fp.fingerprint.db import FingerprintDatabase
>>> from e3fp.fingerprint.fprint import Fingerprint
>>> import numpy as np
>>> np.random.seed(2)
>>> db = FingerprintDatabase(fp_type=Fingerprint, name="TestDB")
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: None, fp_num: 0]
>>> bvs = (np.random.uniform(size=(3, 1024)) > .9).astype(bool)
>>> fps = [Fingerprint.from_vector(bvs[i, :], name="fp" + str(i))
...        for i in range(bvs.shape[0])]
>>> db.add_fingerprints(fps)
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 3]
The contained fingerprints may be accessed by index or name.
>>> db[0]
Fingerprint(indices=array([40, ..., 1012]), level=-1, bits=1024, name=fp0)
>>> db['fp2']
[Fingerprint(indices=array([0, ..., 1013]), level=-1, bits=1024, name=fp2)]
Alternatively, the underlying scipy.sparse.csr_matrix may be accessed.
>>> db.array
<3x1024 sparse matrix of type '<... 'numpy.bool_'>'
...with 327 stored elements in Compressed Sparse Row format>
>>> db.array.toarray()
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False]])
Fingerprint properties may be stored in the database.
>>> db.set_prop("prop", np.arange(3))
The database can be efficiently stored and loaded.
>>> db.savez("/tmp/test_db.fpz")
>>> db = FingerprintDatabase.load("/tmp/test_db.fpz")
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 3]
Various comparison metrics in e3fp.fingerprint.metrics can operate efficiently directly on databases.
>>> from e3fp.fingerprint.metrics import tanimoto, dice, cosine
>>> tanimoto(db, db)
array([[1.        , 0.0591133 , 0.04245283],
       [0.0591133 , 1.        , 0.0531401 ],
       [0.04245283, 0.0531401 , 1.        ]])
>>> dice(db, db)
array([[1.        , 0.11162791, 0.08144796],
       [0.11162791, 1.        , 0.10091743],
       [0.08144796, 0.10091743, 1.        ]])
>>> cosine(db, db)
array([[1.        , 0.11163878, 0.08145547],
       [0.11163878, 1.        , 0.10095568],
       [0.08145547, 0.10095568, 1.        ]])
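For readers unfamiliar with these metrics, the sketch below computes the standard Tanimoto, Dice, and cosine coefficients for two bit fingerprints represented as sets of "on" indices. The function names (`set_tanimoto`, `set_dice`, `set_cosine`) are hypothetical and pure Python; this is not how e3fp.fingerprint.metrics is implemented, and returning 0.0 for empty inputs is an assumption made here.

```python
import math


def set_tanimoto(a, b):
    """Shared "on" bits divided by bits that are "on" in either set."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0  # 0.0 for empty: assumption


def set_dice(a, b):
    """Twice the shared "on" bits divided by the total "on" bits."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def set_cosine(a, b):
    """Shared "on" bits divided by the geometric mean of the set sizes."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


fp_a = {1, 5, 9, 40}
fp_b = {5, 9, 100}
print(set_tanimoto(fp_a, fp_b))          # 0.4
print(round(set_dice(fp_a, fp_b), 4))    # 0.5714
print(round(set_cosine(fp_a, fp_b), 4))  # 0.5774
```

Each metric equals 1 when a fingerprint is compared with itself, which matches the diagonals of the matrices shown above.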
- add_fingerprints(fprints)
Add fingerprints to database.
- Parameters
fprints (iterable of Fingerprint) – Fingerprints to add to database.
- as_type(fp_type, copy=False)
Get database with fingerprint type fp_type.
- Parameters
fp_type (type) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).
copy (bool, optional) – Force copy of database. If False and the database is already of the requested type, no copy is made.
- Returns
Database coerced to fingerprint type of fp_type.
- Return type
FingerprintDatabase
- property bits
- fold(bits, fp_type=None, name=None)
Get copy of database folded to specified bit length.
- Parameters
bits (int) – Number of bits to which to fold database.
fp_type (type or None, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint). Defaults to same type.
name (str, optional) – Name of database
- Returns
Database folded to specified length.
- Return type
FingerprintDatabase
- Raises
BitsValueError – If bits is greater than the length of the database or the database cannot be evenly folded to length bits.
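As a rough illustration of folding, the sketch below folds a list of "on" bit indices to a shorter length. It assumes the common scheme in which index i maps to i % bits and in which folding is allowed only when the new length divides the old one evenly; neither detail is confirmed by the text above. `fold_indices` is a hypothetical helper, and ValueError stands in for BitsValueError.

```python
def fold_indices(indices, orig_bits, bits):
    """Fold "on" bit indices from orig_bits down to bits (assumed scheme)."""
    if bits > orig_bits:
        raise ValueError("bits is greater than the fingerprint length")
    if orig_bits % bits != 0:
        # Assumption: "evenly folded" means bits divides orig_bits.
        raise ValueError("fingerprint cannot be evenly folded to bits")
    # Indices that collide after the modulo collapse into a single bit.
    return sorted({i % bits for i in indices})


print(fold_indices([3, 40, 1012, 1027], orig_bits=2048, bits=1024))
# [3, 40, 1012]
```

Index 1027 collides with index 3 after folding, so the folded bit fingerprint has one fewer "on" bit than the original.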
- property fp_num
- classmethod from_array(array, fp_names, fp_type=None, level=-1, name=None, props={})
Instantiate from array.
- Parameters
array (numpy.ndarray or scipy.sparse.csr_matrix) – Sparse matrix with dimensions N x M, where M is the number of bits in the fingerprints.
fp_names (list of str) – N names of fingerprints in array.
fp_type (type, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).
level (int, optional) – Level, or number of iterations used during fingerprinting.
name (str or None, optional) – Name of database.
props (dict, optional) – Dict with keys specifying names of fingerprint properties and values corresponding to length N array of values.
- Returns
Database containing fingerprints in array.
- Return type
FingerprintDatabase
- get_density(index=None)
Get percentage of fingerprints with ‘on’ bit at position.
- Parameters
index (int or None, optional) – Index to bit for which to return positional density. If None, density for whole database is returned.
- Returns
Density of ‘on’ position in database
- Return type
float
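The two modes of the density calculation can be sketched in pure Python as follows. The function `density` is a hypothetical stand-in that takes fingerprints as sets of "on" indices, and it reports the result as a fraction between 0 and 1; whether the library scales the value differently is not stated above.

```python
def density(rows, bits, index=None):
    """Density of "on" bits; rows is a list of sets of "on" indices."""
    if not rows:
        return 0.0  # assumption: an empty database has zero density
    if index is None:
        # Fraction of all N x bits cells that are "on".
        return sum(len(row) for row in rows) / (len(rows) * bits)
    # Fraction of fingerprints whose bit at `index` is "on".
    return sum(1 for row in rows if index in row) / len(rows)


rows = [{0, 3}, {3}, {1, 2, 3}, set()]
print(density(rows, bits=8))           # 0.1875
print(density(rows, bits=8, index=3))  # 0.75
```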
- get_subset(fp_names, name=None)
Get database with subset of fingerprints.
- Parameters
fp_names (list of str) – List of fingerprint names to include in new db.
name (str, optional) – Name of database
- classmethod load(fn)
Load database from file.
The extension is used to determine how the database was serialized (save vs savez).
- Parameters
fn (str) – Filename
- Returns
Database
- Return type
FingerprintDatabase
- save(**kwargs)
Save database to file.
- Parameters
fn (str, optional) – Filename or basename if extension does not include ‘.fps’
- savetxt(fn, with_names=True)
Save bitstring representation to text file.
Only implemented for fp_type of Fingerprint. This should not be attempted for large numbers of bits.
- Parameters
fn (str or filehandle) – Out file. Extension is automatically parsed to determine whether compression is used.
with_names (bool, optional) – Include name of fingerprint in same row after bitstring.
- Raises
E3FPInvalidFingerprintError – If fp_type is not Fingerprint.
E3FPEfficiencyWarning – If bits is over 2^14 = 16384.
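The sketch below shows what a bitstring text export might look like and why it scales poorly with the number of bits. The exact file layout used by savetxt is not given above, so the single-space separator between bitstring and name is an assumption, and `write_bitstrings` is a hypothetical helper.

```python
import io


def write_bitstrings(handle, rows, names, bits, with_names=True):
    """Write one '0'/'1' string per fingerprint (illustrative layout)."""
    for on_bits, name in zip(rows, names):
        # Every row costs `bits` characters, however few bits are "on".
        bitstring = "".join("1" if i in on_bits else "0" for i in range(bits))
        line = f"{bitstring} {name}" if with_names else bitstring
        handle.write(line + "\n")


buffer = io.StringIO()
write_bitstrings(buffer, [{0, 3}, {7}], ["fp0", "fp1"], bits=8)
print(buffer.getvalue(), end="")
# 10010000 fp0
# 00000001 fp1
```

Unlike the sparse representation, the text form stores every "off" bit explicitly, which is consistent with the efficiency warning for large values of bits.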
- savez(fn='fingerprints.fpz')
Save database to file.
Database is serialized using numpy.savez_compressed.
- Parameters
fn (str, optional) – Filename or basename if extension is not ‘.fpz’
- set_prop(key, vals, check_length=True)
Set values of property for fingerprints.
- Parameters
key (str) – Name of property
vals (array_like) – Values of property.
check_length (bool, optional) – Check that the number of property values matches the number of fingerprints already in the database. This should only be set to False for temporary iterative updating.
- update_names_map(new_names=None, offset=0)
Update map of fingerprint names to row indices of self.array.
- Parameters
new_names (iterable of str, optional) – Names to add to map. If None, map is completely rebuilt.
offset (int, optional) – Number of rows before new rows.
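The role of offset can be illustrated with a small pure-Python sketch of a name-to-row-indices map. The helper `extend_names_map` is hypothetical, and storing a list of rows per name is an assumption based on name lookups returning a list in the class examples.

```python
def extend_names_map(names_map, new_names, offset=0):
    """Add names to a name -> list-of-row-indices map (illustrative)."""
    for i, name in enumerate(new_names):
        # `offset` is the number of rows already present before these names.
        names_map.setdefault(name, []).append(offset + i)
    return names_map


names_map = {}
extend_names_map(names_map, ["fp0", "fp1"])            # rows 0 and 1
extend_names_map(names_map, ["fp1", "fp2"], offset=2)  # rows 2 and 3
print(names_map)  # {'fp0': [0], 'fp1': [1, 2], 'fp2': [3]}
```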
- update_props(props_dict, append=False, check_length=True)
Set multiple properties at once.
- Parameters
props_dict (dict) – Dict of properties. Values must be array-like of length fp_num.
append (bool, optional) – Append values to those already in database. By default, properties are overwritten if already present.
check_length (bool, optional) – Check that the number of property values matches the number of fingerprints already in the database. This should only be set to False for temporary iterative updating.
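One way to read the append and check_length options is sketched below with plain lists. This is an interpretation under stated assumptions, not the library's code: `merge_props` is a hypothetical helper, the length check is assumed to run on the values after any appending, and ValueError is a stand-in for whatever error the library raises.

```python
def merge_props(props, props_dict, fp_num, append=False, check_length=True):
    """Overwrite or append per-fingerprint property values (illustrative)."""
    for key, vals in props_dict.items():
        vals = list(vals)
        if append and key in props:
            vals = props[key] + vals  # extend the existing values
        if check_length and len(vals) != fp_num:
            raise ValueError(f"{key!r} has {len(vals)} values, expected {fp_num}")
        props[key] = vals
    return props


props = {"mass": [1.0, 2.0]}
merge_props(props, {"mass": [3.0]}, fp_num=3, append=True)
merge_props(props, {"charge": [0, 1, -1]}, fp_num=3)
print(props)  # {'mass': [1.0, 2.0, 3.0], 'charge': [0, 1, -1]}
```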
- concat(dbs)
Efficiently concatenate FingerprintDatabase objects.
The databases must be of the same type with the same number of bits, level, and property names.
- Parameters
dbs (iterable of FingerprintDatabase) – Fingerprint databases
- Returns
Database with all fingerprints from provided databases.
- Return type
FingerprintDatabase
See also
Examples
>>> from e3fp.fingerprint.db import FingerprintDatabase, concat
>>> from e3fp.fingerprint.fprint import Fingerprint
>>> import numpy as np
>>> np.random.seed(2)
>>> db1 = FingerprintDatabase(fp_type=Fingerprint, name="TestDB1", level=5)
>>> db2 = FingerprintDatabase(fp_type=Fingerprint, name="TestDB2", level=5)
>>> bvs = (np.random.uniform(size=(6, 1024)) > .9).astype(bool)
>>> fps = [Fingerprint.from_vector(bvs[i, :], name="fp" + str(i), level=5)
...        for i in range(bvs.shape[0])]
>>> db1.add_fingerprints(fps[:3])
>>> db2.add_fingerprints(fps[3:])
>>> print(concat([db1, db2]))
FingerprintDatabase[name: None, fp_type: Fingerprint, level: 5, bits: 1024, fp_num: 6]