e3fp.fingerprint.db module

Database for accessing and serializing fingerprints.

Author: Seth Axen E-mail: seth.axen@gmail.com

class FingerprintDatabase(fp_type=<class 'e3fp.fingerprint.fprint.Fingerprint'>, level=-1, name=None)[source]

Bases: object

Efficiently build, access, compare, and save fingerprints.

Fingerprints must have the same values of bits and level. Additionally, all fingerprints will be cast to the type of fingerprint passed to the database upon instantiation.

Parameters:
  • fp_type (type, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).

  • level (int, optional) – Level, or number of iterations used during fingerprinting.

  • name (str, optional) – Name of database.

Variables:
  • array (scipy.sparse.csr_matrix) – Sparse matrix with dimensions N x M, where N is fp_num and M is bits.

  • bits (int) – Number of bits (length) of fingerprints.

  • fp_names (list of str) – Names of fingerprints.

  • fp_names_to_indices (dict) – Map from fingerprint name to row indices of array.

  • fp_num (int) – Number of fingerprints in database.

  • fp_type (type) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint)

  • level (int) – Level, or number of iterations used during fingerprinting.

  • name (str) – Name of database

  • props (dict) – Dict with keys specifying names of fingerprint properties and values corresponding to array of values.

Notes

Since most fingerprints are very sparse length-wise, FingerprintDatabase is implemented as a wrapper around a scipy.sparse.csr_matrix for efficient memory usage. This provides easy access to underlying data for tight integration with NumPy/SciPy and machine learning packages while simultaneously providing several fingerprint-specific features.
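To make the memory argument concrete, here is a minimal pure-Python sketch of the compressed sparse row idea for binary fingerprints. It is an illustration only, not e3fp's or SciPy's implementation; the helper names to_csr and row_on_bits are hypothetical.

```python
def to_csr(rows, bits):
    """Pack lists of 'on' bit indices into CSR-style index arrays.

    Hypothetical helper for illustration. Values are implicitly True,
    so only column indices and row boundaries are stored.
    """
    indptr = [0]   # indptr[i]:indptr[i + 1] delimits row i in `indices`
    indices = []   # column (bit) positions of all stored elements
    for on_bits in rows:
        cols = sorted(set(on_bits))
        if cols and not (0 <= cols[0] and cols[-1] < bits):
            raise ValueError("bit index out of range")
        indices.extend(cols)
        indptr.append(len(indices))
    return indptr, indices


def row_on_bits(indptr, indices, i):
    """Recover the 'on' bit indices of row i."""
    return indices[indptr[i]:indptr[i + 1]]


# Three 1024-bit fingerprints with a handful of 'on' bits each.
rows = [[40, 1012], [7, 512, 900], [0, 1013]]
indptr, indices = to_csr(rows, bits=1024)
stored = len(indices) + len(indptr)   # integers actually kept
dense = len(rows) * 1024              # cells in the dense N x M array
```

In this toy case 11 integers stand in for 3072 dense cells, and the gap widens as fingerprints get longer and sparser.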

See also

e3fp.fingerprint.fprint.Fingerprint

A fingerprint that stores indices of “on” bits

Examples

>>> from e3fp.fingerprint.db import FingerprintDatabase
>>> from e3fp.fingerprint.fprint import Fingerprint
>>> import numpy as np
>>> np.random.seed(2)
>>> db = FingerprintDatabase(fp_type=Fingerprint, name="TestDB")
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: None, fp_num: 0]
>>> bvs = (np.random.uniform(size=(3, 1024)) > .9).astype(bool)
>>> fps = [Fingerprint.from_vector(bvs[i, :], name="fp" + str(i))
...        for i in range(bvs.shape[0])]
>>> db.add_fingerprints(fps)
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 3]

The contained fingerprints may be accessed by index or name.

>>> db[0]
Fingerprint(indices=array([40, ..., 1012]), level=-1, bits=1024, name=fp0)
>>> db['fp2']
[Fingerprint(indices=array([0, ..., 1013]), level=-1, bits=1024, name=fp2)]

Alternatively, the underlying scipy.sparse.csr_matrix may be accessed.

>>> db.array  
<...sparse matrix...with 327 stored elements...>
>>> db.array.toarray()
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False]])

Fingerprint properties may be stored in the database.

>>> db.set_prop("prop", np.arange(3))

The database can be efficiently stored and loaded.

>>> db.savez("/tmp/test_db.fpz")
>>> db = FingerprintDatabase.load("/tmp/test_db.fpz")
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 3]

Various comparison metrics in e3fp.fingerprint.metrics can operate efficiently and directly on databases:

>>> from e3fp.fingerprint.metrics import tanimoto, dice, cosine
>>> tanimoto(db, db)
array([[1.        , 0.0591133 , 0.04245283],
       [0.0591133 , 1.        , 0.0531401 ],
       [0.04245283, 0.0531401 , 1.        ]])
>>> dice(db, db)
array([[1.        , 0.11162791, 0.08144796],
       [0.11162791, 1.        , 0.10091743],
       [0.08144796, 0.10091743, 1.        ]])
>>> cosine(db, db)
array([[1.        , 0.11163878, 0.08145547],
       [0.11163878, 1.        , 0.10095568],
       [0.08145547, 0.10095568, 1.        ]])
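
For reference, on binary fingerprints the three metrics above reduce to simple set arithmetic. The sketch below uses the standard textbook definitions on sets of 'on' bit indices. The _set-suffixed function names are hypothetical, returning 0.0 for empty inputs is an assumption, and this is not how e3fp.fingerprint.metrics is implemented.

```python
import math


def tanimoto_set(a, b):
    """Shared 'on' bits divided by distinct 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice_set(a, b):
    """Twice the shared 'on' bits divided by the total 'on' bits."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine_set(a, b):
    """Shared 'on' bits divided by the geometric mean of the sizes."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


fp_a = {1, 5, 9, 40}
fp_b = {5, 9, 77}
t = tanimoto_set(fp_a, fp_b)   # 2 shared bits out of 5 distinct bits
d = dice_set(fp_a, fp_b)       # 2 * 2 shared bits out of 7 total
c = cosine_set(fp_a, fp_b)     # 2 shared bits over sqrt(4 * 3)
```

For binary fingerprints, Dice and Tanimoto are related by D = 2T / (1 + T), which is consistent with the matrices shown above (for example, 2 x 0.0591133 / 1.0591133 ≈ 0.1116279).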

add_fingerprints(fprints)[source]

Add fingerprints to database.

Parameters:

fprints (iterable of Fingerprint) – Fingerprints to add to database

as_type(fp_type, copy=False)[source]

Get database with fingerprint type fp_type.

Parameters:
  • fp_type (type) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint)

  • copy (bool, optional) – Force copy of database. If False and the database is already of the requested type, no copy is made.

Returns:

Database coerced to fingerprint type of fp_type.

Return type:

FingerprintDatabase

property bits
fold(bits, fp_type=None, name=None)[source]

Get copy of database folded to specified bit length.

Parameters:
  • bits (int) – Number of bits to which to fold database.

  • fp_type (type or None, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint). Defaults to same type.

  • name (str, optional) – Name of database

Returns:

Database folded to specified length.

Return type:

FingerprintDatabase

Raises:

BitsValueError – If bits is greater than the length of the database or database cannot be evenly folded to length bits.
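
As a rough illustration of what folding does to a single binary fingerprint, the sketch below maps each 'on' index to its remainder modulo the new length. Both the modulo scheme and the simple divisibility check are assumptions made for illustration; the exact rule e3fp applies for "evenly folded" is not stated here. fold_indices is a hypothetical helper, and ValueError stands in for BitsValueError.

```python
def fold_indices(on_bits, bits, new_bits):
    """Fold 'on' bit indices from length `bits` to `new_bits`.

    Assumes modulo folding (index -> index % new_bits). ValueError
    stands in for the library's BitsValueError.
    """
    if new_bits > bits:
        raise ValueError("cannot fold to a longer length")
    if bits % new_bits != 0:
        raise ValueError("length is not evenly divisible")
    # Colliding indices merge into one 'on' bit for a binary fingerprint.
    return sorted({i % new_bits for i in on_bits})


# Indices 3, 1027 and 2051 all collide at position 3 after folding.
folded = fold_indices([3, 40, 1027, 2051], bits=4096, new_bits=1024)
```

Collisions like the one above are why folding loses information: a shorter fingerprint is denser and can no longer distinguish the merged positions.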

property fp_num
classmethod from_array(array, fp_names, fp_type=None, level=-1, name=None, props={})[source]

Instantiate from array.

Parameters:
  • array (numpy.ndarray or scipy.sparse.csr_matrix) – Sparse matrix with dimensions N x M, where M is the number of bits in the fingerprints.

  • fp_names (list of str) – N names of fingerprints in array.

  • fp_type (type, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).

  • level (int, optional) – Level, or number of iterations used during fingerprinting.

  • name (str or None, optional) – Name of database.

  • props (dict, optional) – Dict with keys specifying names of fingerprint properties and values corresponding to length N array of values.

Returns:

Database containing fingerprints in array.

Return type:

FingerprintDatabase

get_density(index=None)[source]

Get percentage of fingerprints with ‘on’ bit at position.

Parameters:

index (int or None, optional) – Index to bit for which to return positional density. If None, density for whole database is returned.

Returns:

Density of ‘on’ position in database

Return type:

float
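
The two modes of get_density can be pictured with a small pure-Python sketch over binary fingerprints. It assumes density is reported as a fraction between 0 and 1 and that the whole-database value is the share of 'on' cells in the N x M array; the density function below is a hypothetical stand-in, not the library's code.

```python
def density(rows, bits, index=None):
    """Fraction of 'on' cells, overall or at one bit position.

    `rows` is a list of sets of 'on' bit indices (binary fingerprints).
    """
    if not rows:
        return 0.0
    if index is None:
        # Whole database: 'on' bits over all N x M cells.
        return sum(len(r) for r in rows) / (len(rows) * bits)
    # One position: share of fingerprints with that bit set.
    return sum(index in r for r in rows) / len(rows)


rows = [{0, 3}, {3}, {1, 3}, {2}]
overall = density(rows, bits=8)             # 6 'on' bits / 32 cells
at_three = density(rows, bits=8, index=3)   # 3 of 4 fingerprints
```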

get_prop(key)[source]

Get property.

Raises:

KeyError – If key not in props.

get_subset(fp_names, name=None)[source]

Get database with subset of fingerprints.

Parameters:
  • fp_names (list of str) – List of fingerprint names to include in new db.

  • name (str, optional) – Name of database

classmethod load(fn)[source]

Load database from file.

The file extension is used to determine how the database was serialized (save vs savez).

Parameters:

fn (str) – Filename

Returns:

Database

Return type:

FingerprintDatabase

save(**kwargs)

Save database to file.

Parameters:

fn (str, optional) – Filename or basename if extension does not include ‘.fps’

Deprecated since version 1.2: Use savez() instead.

savetxt(fn, with_names=True)[source]

Save bitstring representation to text file.

Only implemented for fp_type of Fingerprint. This should not be attempted for large numbers of bits.

Parameters:
  • fn (str or filehandle) – Out file. Extension is automatically parsed to determine whether compression is used.

  • with_names (bool, optional) – Include name of fingerprint in same row after bitstring.

Raises:
savez(fn='fingerprints.fpz')[source]

Save database to file.

Database is serialized using numpy.savez_compressed.

Parameters:

fn (str, optional) – Filename or basename if extension is not ‘.fpz’

set_prop(key, vals, check_length=True)[source]

Set values of property for fingerprints.

Parameters:
  • key (str) – Name of property

  • vals (array_like) – Values of property.

  • check_length (bool, optional) – Check that the number of property values matches the number of fingerprints already in the database. This should only be set to False for temporary iterative updating.

update_names_map(new_names=None, offset=0)[source]

Update map of fingerprint names to row indices of self.array.

Parameters:
  • new_names (iterable of str, optional) – Names to add to map. If None, map is completely rebuilt.

  • offset (int, optional) – Number of rows before new rows.
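
The role of offset is easiest to see in a small sketch. The following assumes the map stores a list of row indices per name, which matches the plural "row indices" above and the list returned by db['fp2'] in the class examples; add_names is a hypothetical helper, not the method itself.

```python
def add_names(names_map, new_names, offset=0):
    """Add names to a name -> list-of-row-indices map, in place.

    `offset` is the number of rows already present before the new rows.
    """
    for i, name in enumerate(new_names):
        names_map.setdefault(name, []).append(i + offset)
    return names_map


names_map = {}
add_names(names_map, ["fp0", "fp1", "fp2"])
# Two more rows appended after the first three. "fp2" is repeated,
# so both of its rows are kept under the same key.
add_names(names_map, ["fp2", "fp3"], offset=3)
```

Keeping a list per name means several rows can share one name without overwriting each other.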

update_props(props_dict, append=False, check_length=True)[source]

Set multiple properties at once.

Parameters:
  • props_dict (dict) – Dict of properties. Values must be array-like of length fp_num.

  • append (bool, optional) – Append values to those already in database. By default, properties are overwritten if already present.

  • check_length (bool, optional) – Check that the number of property values matches the number of fingerprints already in the database. This should only be set to False for temporary iterative updating.

concat(dbs)[source]

Efficiently concatenate FingerprintDatabase objects.

The databases must be of the same type with the same number of bits, level, and property names.

Parameters:

dbs (iterable of FingerprintDatabase) – Fingerprint databases

Returns:

Database with all fingerprints from provided databases.

Return type:

FingerprintDatabase
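
The compatibility requirements can be sketched with toy databases represented as plain dicts. This is only a model of the checks described above (it omits the fingerprint type check), not the library's implementation; concat_toy and the dict keys are hypothetical.

```python
def concat_toy(dbs):
    """Concatenate toy databases given as dicts with keys
    'bits', 'level', 'rows', 'names' and 'props'."""
    dbs = list(dbs)
    if not dbs:
        raise ValueError("at least one database is required")
    first = dbs[0]
    for db in dbs[1:]:
        if (db["bits"], db["level"]) != (first["bits"], first["level"]):
            raise ValueError("bits and level must match")
        if set(db["props"]) != set(first["props"]):
            raise ValueError("property names must match")
    return {
        "bits": first["bits"],
        "level": first["level"],
        # Rows, names and property values are joined in input order.
        "rows": [r for db in dbs for r in db["rows"]],
        "names": [n for db in dbs for n in db["names"]],
        "props": {k: [v for db in dbs for v in db["props"][k]]
                  for k in first["props"]},
    }


db_a = {"bits": 1024, "level": 5, "rows": [{1, 9}], "names": ["fp0"],
        "props": {"score": [0.5]}}
db_b = {"bits": 1024, "level": 5, "rows": [{4}, {7, 8}],
        "names": ["fp1", "fp2"], "props": {"score": [0.1, 0.9]}}
merged = concat_toy([db_a, db_b])
```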

Examples

>>> from e3fp.fingerprint.db import FingerprintDatabase, concat
>>> from e3fp.fingerprint.fprint import Fingerprint
>>> import numpy as np
>>> np.random.seed(2)
>>> db1 = FingerprintDatabase(fp_type=Fingerprint, name="TestDB1", level=5)
>>> db2 = FingerprintDatabase(fp_type=Fingerprint, name="TestDB2", level=5)
>>> bvs = (np.random.uniform(size=(6, 1024)) > .9).astype(bool)
>>> fps = [Fingerprint.from_vector(bvs[i, :], name="fp" + str(i), level=5)
...        for i in range(bvs.shape[0])]
>>> db1.add_fingerprints(fps[:3])
>>> db2.add_fingerprints(fps[3:])
>>> print(concat([db1, db2]))
FingerprintDatabase[name: None, fp_type: Fingerprint, level: 5, bits: 1024, fp_num: 6]