e3fp.fingerprint.db module
Database for accessing and serializing fingerprints.
Author: Seth Axen E-mail: seth.axen@gmail.com
- class FingerprintDatabase(fp_type=<class 'e3fp.fingerprint.fprint.Fingerprint'>, level=-1, name=None)[source]
Bases: object
Efficiently build, access, compare, and save fingerprints.
Fingerprints must have the same values of bits and level. Additionally, all fingerprints will be cast to the type of fingerprint passed to the database upon instantiation.
- Parameters:
fp_type (type, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).
level (int, optional) – Level, or number of iterations used during fingerprinting.
name (str, optional) – Name of database.
- Variables:
array (scipy.sparse.csr_matrix) – Sparse matrix with dimensions N x M, where N is fp_num and M is bits.
bits (int) – Number of bits (length) of fingerprints.
fp_names_to_indices (dict) – Map from fingerprint name to row indices of array.
fp_num (int) – Number of fingerprints in database.
fp_type (type) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).
level (int) – Level, or number of iterations used during fingerprinting.
name (str) – Name of database.
props (dict) – Dict with keys specifying names of fingerprint properties and values corresponding to array of values.
Notes
Since most fingerprints are very sparse length-wise, FingerprintDatabase is implemented as a wrapper around a scipy.sparse.csr_matrix for efficient memory usage. This provides easy access to underlying data for tight integration with NumPy/SciPy and machine learning packages while simultaneously providing several fingerprint-specific features.
See also
e3fp.fingerprint.fprint.Fingerprint
A fingerprint that stores indices of “on” bits
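To make the memory argument in the Notes concrete, here is a minimal sketch of a CSR-style layout in plain Python. It is a toy illustration, not the scipy.sparse.csr_matrix class itself; the helper names to_csr and get_row and the example bit indices are hypothetical.

```python
# Illustrative sketch only: a toy CSR-style layout in plain Python,
# not the scipy.sparse.csr_matrix class that FingerprintDatabase wraps.

def to_csr(rows_on_bits):
    """Pack per-fingerprint lists of 'on' bit indices into two flat lists."""
    indptr = [0]   # indptr[i]:indptr[i + 1] delimits row i inside `indices`
    indices = []   # bit positions of all stored elements, row after row
    for on_bits in rows_on_bits:
        indices.extend(sorted(on_bits))
        indptr.append(len(indices))
    return indptr, indices


def get_row(indptr, indices, i):
    """Recover the 'on' bit indices of fingerprint i."""
    return indices[indptr[i]:indptr[i + 1]]


# Three hypothetical 1024-bit fingerprints with only a few bits set.
rows = [[40, 1012], [7, 512, 900], [0, 1013]]
indptr, indices = to_csr(rows)

stored = len(indices)       # 7 stored elements
dense = len(rows) * 1024    # 3072 cells in the dense equivalent
```

Only the positions of "on" bits are stored, so the cost grows with the number of set bits rather than with N x M.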
Examples
>>> from e3fp.fingerprint.db import FingerprintDatabase
>>> from e3fp.fingerprint.fprint import Fingerprint
>>> import numpy as np
>>> np.random.seed(2)
>>> db = FingerprintDatabase(fp_type=Fingerprint, name="TestDB")
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: None, fp_num: 0]
>>> bvs = (np.random.uniform(size=(3, 1024)) > .9).astype(bool)
>>> fps = [Fingerprint.from_vector(bvs[i, :], name="fp" + str(i))
...        for i in range(bvs.shape[0])]
>>> db.add_fingerprints(fps)
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 3]
The contained fingerprints may be accessed by index or name.
>>> db[0]
Fingerprint(indices=array([40, ..., 1012]), level=-1, bits=1024, name=fp0)
>>> db['fp2']
[Fingerprint(indices=array([0, ..., 1013]), level=-1, bits=1024, name=fp2)]
Alternatively, the underlying scipy.sparse.csr_matrix may be accessed.
>>> db.array
<...sparse matrix...with 327 stored elements...>
>>> db.array.toarray()
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False]])
Fingerprint properties may be stored in the database.
>>> db.set_prop("prop", np.arange(3))
The database can be efficiently stored and loaded.
>>> db.savez("/tmp/test_db.fpz")
>>> db = FingerprintDatabase.load("/tmp/test_db.fpz")
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 3]
Various comparison metrics in e3fp.fingerprint.metrics can operate efficiently directly on databases.
>>> from e3fp.fingerprint.metrics import tanimoto, dice, cosine
>>> tanimoto(db, db)
array([[1.        , 0.0591133 , 0.04245283],
       [0.0591133 , 1.        , 0.0531401 ],
       [0.04245283, 0.0531401 , 1.        ]])
>>> dice(db, db)
array([[1.        , 0.11162791, 0.08144796],
       [0.11162791, 1.        , 0.10091743],
       [0.08144796, 0.10091743, 1.        ]])
>>> cosine(db, db)
array([[1.        , 0.11163878, 0.08145547],
       [0.11163878, 1.        , 0.10095568],
       [0.08145547, 0.10095568, 1.        ]])
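For readers who want to see what these three scores mean for bit fingerprints, the following sketch computes them on sets of "on" bit indices using the standard definitions. It is not the e3fp.fingerprint.metrics implementation, which operates on the sparse matrix directly; the function names ending in _sets are hypothetical, and the inputs are assumed to be non-empty.

```python
import math

# Standard set-based definitions for binary fingerprints.
# Assumption: both inputs contain at least one "on" bit.

def tanimoto_sets(a, b):
    """|A & B| / |A | B|"""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def dice_sets(a, b):
    """2 |A & B| / (|A| + |B|)"""
    a, b = set(a), set(b)
    return 2 * len(a & b) / (len(a) + len(b))


def cosine_sets(a, b):
    """|A & B| / sqrt(|A| |B|)"""
    a, b = set(a), set(b)
    return len(a & b) / math.sqrt(len(a) * len(b))


fp_a = {1, 2, 3, 4}   # hypothetical "on" bit indices
fp_b = {3, 4, 5, 6}

t = tanimoto_sets(fp_a, fp_b)   # 2 shared bits / 6 distinct bits = 1/3
d = dice_sets(fp_a, fp_b)       # 2 * 2 / (4 + 4) = 0.5
c = cosine_sets(fp_a, fp_b)     # 2 / sqrt(4 * 4) = 0.5
```

Dice and Tanimoto are related by d = 2t / (1 + t), which also holds for the matrices printed above (for example, 2 * 0.0591133 / 1.0591133 ≈ 0.1116279).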
- add_fingerprints(fprints)[source]
Add fingerprints to database.
- Parameters:
fprints (iterable of Fingerprint) – Fingerprints to add to database
- as_type(fp_type, copy=False)[source]
Get database with fingerprint type fp_type.
- Parameters:
fp_type (type) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).
copy (bool, optional) – Force copy of database. If False and the database is already of the requested type, no copy is made.
- Returns:
Database coerced to fingerprint type of fp_type.
- Return type:
FingerprintDatabase
- property bits
- fold(bits, fp_type=None, name=None)[source]
Get copy of database folded to specified bit length.
- Parameters:
bits (int) – Number of bits to which to fold database.
fp_type (type or None, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint). Defaults to same type.
name (str, optional) – Name of database
- Returns:
Database folded to specified length.
- Return type:
FingerprintDatabase
- Raises:
BitsValueError – If bits is greater than the length of the database or the database cannot be evenly folded to length bits.
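The two error conditions above can be illustrated with a small sketch of folding on a single list of "on" bit indices. It assumes modulo folding, a common scheme that the text here does not confirm for this library, and it raises a plain ValueError in place of BitsValueError. The name fold_indices is hypothetical.

```python
def fold_indices(on_bits, length, bits):
    """Fold 'on' bit indices from `length` bits down to `bits` bits.

    Assumption: folding maps index i to i % bits.
    """
    if bits > length:
        raise ValueError("cannot fold to more bits than the current length")
    if length % bits != 0:
        raise ValueError("current length is not evenly divisible by bits")
    # A set merges indices that collide after folding.
    return sorted({i % bits for i in on_bits})


# 3 and 35 collide at position 3 when folding 1024 bits down to 32.
folded = fold_indices([3, 35, 40, 1012], length=1024, bits=32)
# folded == [3, 8, 20]
```

Collisions lose information, so shorter folded lengths trade discriminating power for memory.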
- property fp_num
- classmethod from_array(array, fp_names, fp_type=None, level=-1, name=None, props={})[source]
Instantiate from array.
- Parameters:
array (numpy.ndarray or scipy.sparse.csr_matrix) – Sparse matrix with dimensions N x M, where M is the number of bits in the fingerprints.
fp_names (list of str) – N names of fingerprints in array.
fp_type (type, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).
level (int, optional) – Level, or number of iterations used during fingerprinting.
name (str or None, optional) – Name of database.
props (dict, optional) – Dict with keys specifying names of fingerprint properties and values corresponding to length N array of values.
- Returns:
Database containing fingerprints in array.
- Return type:
FingerprintDatabase
- get_density(index=None)[source]
Get percentage of fingerprints with ‘on’ bit at position.
- Parameters:
index (int or None, optional) – Index to bit for which to return positional density. If None, density for whole database is returned.
- Returns:
Density of ‘on’ position in database
- Return type:
float
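As a rough illustration of the two cases (a single bit position versus the whole database), the sketch below computes both densities from per-fingerprint sets of "on" bits. The function names are hypothetical, and the values are returned as fractions between 0 and 1; whether the library scales its result differently is not stated here.

```python
def positional_density(rows_on_bits, index):
    """Fraction of fingerprints whose bit at `index` is 'on'."""
    hits = sum(1 for row in rows_on_bits if index in row)
    return hits / len(rows_on_bits)


def overall_density(rows_on_bits, bits):
    """Fraction of all N x M cells that are 'on'."""
    stored = sum(len(set(row)) for row in rows_on_bits)
    return stored / (len(rows_on_bits) * bits)


# Four hypothetical 4-bit fingerprints.
rows = [{0, 3}, {3}, {1, 3}, {0}]

d0 = positional_density(rows, 0)      # 2 of 4 fingerprints -> 0.5
d3 = positional_density(rows, 3)      # 3 of 4 fingerprints -> 0.75
d_all = overall_density(rows, bits=4) # 6 of 16 cells -> 0.375
```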
- get_subset(fp_names, name=None)[source]
Get database with subset of fingerprints.
- Parameters:
fp_names (list of str) – List of fingerprint names to include in new db.
name (str, optional) – Name of database
- classmethod load(fn)[source]
Load database from file.
The extension is used to determine how database was serialized (save vs savez).
- Parameters:
fn (str) – Filename
- Returns:
Database
- Return type:
FingerprintDatabase
- save(**kwargs)
Save database to file.
- Parameters:
fn (str, optional) – Filename or basename if extension does not include ‘.fps’
Deprecated since version 1.2: Use savez() instead.
- savetxt(fn, with_names=True)[source]
Save bitstring representation to text file.
Only implemented for fp_type of Fingerprint. This should not be attempted for large numbers of bits.
- Parameters:
fn (str or filehandle) – Out file. Extension is automatically parsed to determine whether compression is used.
with_names (bool, optional) – Include name of fingerprint in same row after bitstring.
- Raises:
E3FPInvalidFingerprintError – If fp_type is not Fingerprint.
E3FPEfficiencyWarning – If bits is over 2^14 = 16384.
- savez(fn='fingerprints.fpz')[source]
Save database to file.
Database is serialized using numpy.savez_compressed.
- Parameters:
fn (str, optional) – Filename or basename if extension is not ‘.fpz’
- set_prop(key, vals, check_length=True)[source]
Set values of property for fingerprints.
- Parameters:
key (str) – Name of property
vals (array_like) – Values of property.
check_length (bool, optional) – Check to ensure number of properties match number of fingerprints already in database. This should only be set to False for temporary iterative updating.
- update_names_map(new_names=None, offset=0)[source]
Update map of fingerprint names to row indices of self.array.
- Parameters:
new_names (iterable of str, optional) – Names to add to map. If None, map is completely rebuilt.
offset (int, optional) – Number of rows before new rows.
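The role of offset can be sketched with a plain dict that maps each name to a list of row indices. This is an illustration under assumptions, not the library's code: the helper name extend_names_map is hypothetical, and the list-valued map is inferred from the class example, where indexing by name returns a list.

```python
def extend_names_map(names_map, new_names, offset=0):
    """Record row indices for newly added names.

    `offset` is the number of rows already present, so the i-th new
    name lands on row offset + i.
    """
    for i, name in enumerate(new_names):
        names_map.setdefault(name, []).append(offset + i)
    return names_map


names_map = {}
extend_names_map(names_map, ["fp0", "fp1", "fp0"])       # rows 0, 1, 2
extend_names_map(names_map, ["fp2", "fp1"], offset=3)    # rows 3, 4
# names_map == {"fp0": [0, 2], "fp1": [1, 4], "fp2": [3]}
```

Passing only the new names with an offset avoids rescanning every existing row, which is presumably why a full rebuild is reserved for the new_names=None case.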
- update_props(props_dict, append=False, check_length=True)[source]
Set multiple properties at once.
- Parameters:
props_dict (dict) – Dict of properties. Values must be array-like of length fp_num.
append (bool, optional) – Append values to those already in database. By default, properties are overwritten if already present.
check_length (bool, optional) – Check to ensure number of properties match number of fingerprints already in database. This should only be set to False for temporary iterative updating.
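The interaction of append and check_length can be sketched on ordinary dicts and lists. This is a simplified model, not the library's implementation: the helper name merge_props is hypothetical, a plain ValueError stands in for whatever the library raises, and checking the length after appending is an assumption.

```python
def merge_props(props, props_dict, fp_num, append=False, check_length=True):
    """Overwrite or extend per-fingerprint property values."""
    for key, vals in props_dict.items():
        vals = list(vals)
        if append and key in props:
            vals = props[key] + vals   # extend the existing values
        # Assumption: the length is validated after any appending.
        if check_length and len(vals) != fp_num:
            raise ValueError(
                "property %r has %d values for %d fingerprints"
                % (key, len(vals), fp_num)
            )
        props[key] = vals
    return props


props = {"mw": [1.0, 2.0]}
merge_props(props, {"mw": [3.0]}, fp_num=3, append=True)
appended = list(props["mw"])          # [1.0, 2.0, 3.0]
merge_props(props, {"mw": [9.0, 8.0, 7.0]}, fp_num=3)
overwritten = list(props["mw"])       # [9.0, 8.0, 7.0]
```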
- concat(dbs)[source]
Efficiently concatenate FingerprintDatabase objects.
The databases must be of the same type with the same number of bits, level, and property names.
- Parameters:
dbs (iterable of FingerprintDatabase) – Fingerprint databases
- Returns:
Database with all fingerprints from provided databases.
- Return type:
FingerprintDatabase
See also
Examples
>>> from e3fp.fingerprint.db import FingerprintDatabase, concat
>>> from e3fp.fingerprint.fprint import Fingerprint
>>> import numpy as np
>>> np.random.seed(2)
>>> db1 = FingerprintDatabase(fp_type=Fingerprint, name="TestDB1", level=5)
>>> db2 = FingerprintDatabase(fp_type=Fingerprint, name="TestDB2", level=5)
>>> bvs = (np.random.uniform(size=(6, 1024)) > .9).astype(bool)
>>> fps = [Fingerprint.from_vector(bvs[i, :], name="fp" + str(i), level=5)
...        for i in range(bvs.shape[0])]
>>> db1.add_fingerprints(fps[:3])
>>> db2.add_fingerprints(fps[3:])
>>> print(concat([db1, db2]))
FingerprintDatabase[name: None, fp_type: Fingerprint, level: 5, bits: 1024, fp_num: 6]
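One plausible reason row-wise concatenation of CSR-backed databases is cheap is that the flat index lists can simply be joined, with only the row pointers shifted. The sketch below shows this on toy (indptr, indices) pairs in plain Python; it is an assumption about the general technique, not a description of how concat is implemented, and the name concat_csr is hypothetical.

```python
def concat_csr(parts):
    """Stack CSR-style (indptr, indices) pairs row-wise.

    All parts are assumed to describe fingerprints of the same bit length.
    """
    indptr, indices = [0], []
    for part_indptr, part_indices in parts:
        shift = len(indices)            # elements stored before this part
        indices.extend(part_indices)
        # Skip the leading 0 of each part and shift the remaining pointers.
        indptr.extend(p + shift for p in part_indptr[1:])
    return indptr, indices


db_a = ([0, 2, 3], [1, 5, 2])     # two rows: [1, 5] and [2]
db_b = ([0, 1, 3], [0, 4, 7])     # two rows: [0] and [4, 7]

indptr, indices = concat_csr([db_a, db_b])
# indptr == [0, 2, 3, 4, 6]; indices == [1, 5, 2, 0, 4, 7]
```

No row needs to be expanded to a dense vector, so the work is proportional to the number of stored elements.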