e3fp.fingerprint.db module

Database for accessing and serializing fingerprints.

Author: Seth Axen
E-mail: seth.axen@gmail.com

class FingerprintDatabase(fp_type=<class 'e3fp.fingerprint.fprint.Fingerprint'>, level=-1, name=None)[source]

Bases: object

Efficiently build, access, compare, and save fingerprints.

Fingerprints must have the same values of bits and level. Additionally, all fingerprints will be cast to the type of fingerprint passed to the database upon instantiation.

Parameters
  • fp_type (type, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).

  • level (int, optional) – Level, or number of iterations used during fingerprinting.

  • name (str, optional) – Name of database.

Variables
  • array (scipy.sparse.csr_matrix) – Sparse matrix with dimensions N x M, where N is fp_num and M is bits.

  • bits (int) – Number of bits (length) of fingerprints.

  • fp_names (list of str) – Names of fingerprints.

  • fp_names_to_indices (dict) – Map from fingerprint name to row indices of array.

  • fp_num (int) – Number of fingerprints in database.

  • fp_type (type) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint)

  • level (int) – Level, or number of iterations used during fingerprinting.

  • name (str) – Name of database

  • props (dict) – Dict with keys specifying names of fingerprint properties and values corresponding to array of values.

Notes

Since most fingerprints are very sparse length-wise, FingerprintDatabase is implemented as a wrapper around a scipy.sparse.csr_matrix for efficient memory usage. This provides easy access to underlying data for tight integration with NumPy/SciPy and machine learning packages while simultaneously providing several fingerprint-specific features.
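To make the memory argument concrete, here is a minimal pure-Python sketch of the compressed sparse row idea: only the positions of "on" bits are kept, together with row boundaries. The helpers to_csr and row_on_bits are hypothetical names used for illustration; they are not part of e3fp or SciPy, and the real matrix also carries a data array that is omitted here.

```python
def to_csr(rows, bits):
    """Pack per-fingerprint 'on' bit indices into CSR-style index arrays."""
    indptr = [0]   # indptr[r]:indptr[r + 1] delimits row r inside `indices`
    indices = []   # column (bit) positions of all stored elements, row by row
    for on_bits in rows:
        for i in sorted(set(on_bits)):
            if not 0 <= i < bits:
                raise ValueError("bit index %d outside [0, %d)" % (i, bits))
            indices.append(i)
        indptr.append(len(indices))
    return indptr, indices


def row_on_bits(indptr, indices, row):
    """Recover the 'on' bit indices of one fingerprint (one matrix row)."""
    return indices[indptr[row]:indptr[row + 1]]


rows = [[40, 1012], [3, 7, 512], [0, 1013]]
indptr, indices = to_csr(rows, bits=1024)
stored = len(indices)            # positions actually kept in memory
dense_cells = len(rows) * 1024   # cells a dense 3 x 1024 array would need
```

In this toy case 7 positions are stored instead of 3072 dense cells, which is the kind of saving the wrapper relies on for sparse fingerprints.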

See also

e3fp.fingerprint.fprint.Fingerprint

A fingerprint that stores indices of “on” bits

Examples

>>> from e3fp.fingerprint.db import FingerprintDatabase
>>> from e3fp.fingerprint.fprint import Fingerprint
>>> import numpy as np
>>> np.random.seed(2)
>>> db = FingerprintDatabase(fp_type=Fingerprint, name="TestDB")
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: None, fp_num: 0]
>>> bvs = (np.random.uniform(size=(3, 1024)) > .9).astype(bool)
>>> fps = [Fingerprint.from_vector(bvs[i, :], name="fp" + str(i))
...        for i in range(bvs.shape[0])]
>>> db.add_fingerprints(fps)
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 3]

The contained fingerprints may be accessed by index or name.

>>> db[0]
Fingerprint(indices=array([40, ..., 1012]), level=-1, bits=1024, name=fp0)
>>> db['fp2']
[Fingerprint(indices=array([0, ..., 1013]), level=-1, bits=1024, name=fp2)]

Alternatively, the underlying scipy.sparse.csr_matrix may be accessed.

>>> db.array
<3x1024 sparse matrix of type '<... 'numpy.bool_'>'
...with 327 stored elements in Compressed Sparse Row format>
>>> db.array.toarray()
array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [ True, False, False, ..., False, False, False]])

Fingerprint properties may be stored in the database.

>>> db.set_prop("prop", np.arange(3))

The database can be efficiently stored and loaded.

>>> db.savez("/tmp/test_db.fpz")
>>> db = FingerprintDatabase.load("/tmp/test_db.fpz")
>>> print(db)
FingerprintDatabase[name: TestDB, fp_type: Fingerprint, level: -1, bits: 1024, fp_num: 3]

Various comparison metrics in e3fp.fingerprint.metrics can operate directly and efficiently on databases.

>>> from e3fp.fingerprint.metrics import tanimoto, dice, cosine
>>> tanimoto(db, db)
array([[1.        , 0.0591133 , 0.04245283],
       [0.0591133 , 1.        , 0.0531401 ],
       [0.04245283, 0.0531401 , 1.        ]])
>>> dice(db, db)
array([[1.        , 0.11162791, 0.08144796],
       [0.11162791, 1.        , 0.10091743],
       [0.08144796, 0.10091743, 1.        ]])
>>> cosine(db, db)
array([[1.        , 0.11163878, 0.08145547],
       [0.11163878, 1.        , 0.10095568],
       [0.08145547, 0.10095568, 1.        ]])
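For readers who want to check such matrices by hand, the following sketch computes the three coefficients for a single pair of bit fingerprints, each represented as a set of "on" indices. The function names are hypothetical and this is the textbook definition for binary data, not e3fp's vectorized implementation. Returning 0.0 for two empty fingerprints is an assumption made here to avoid division by zero.

```python
import math


def tanimoto_sets(a, b):
    """Tanimoto (Jaccard) coefficient: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0  # assumption: empty pair -> 0.0


def dice_sets(a, b):
    """Dice coefficient: 2 |A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine_sets(a, b):
    """Cosine similarity of binary vectors: |A & B| / sqrt(|A| |B|)."""
    a, b = set(a), set(b)
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


fp_a = {1, 2, 3, 4}
fp_b = {3, 4, 5, 6}
t = tanimoto_sets(fp_a, fp_b)  # 2 shared bits out of 6 distinct bits
d = dice_sets(fp_a, fp_b)
c = cosine_sets(fp_a, fp_b)
```

For binary fingerprints Dice and Tanimoto are related by D = 2T / (1 + T), which also holds for the off-diagonal values printed above (for example 0.0591133 and 0.11162791).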
add_fingerprints(fprints)[source]

Add fingerprints to database.

Parameters

fprints (iterable of Fingerprint) – Fingerprints to add to database

as_type(fp_type, copy=False)[source]

Get database with fingerprint type fp_type.

Parameters
  • fp_type (type) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint)

  • copy (bool, optional) – Force copy of database. If False and the database is already of the requested type, no copy is made.

Returns

Database coerced to fingerprint type of fp_type.

Return type

FingerprintDatabase

property bits

fold(bits, fp_type=None, name=None)[source]

Get copy of database folded to specified bit length.

Parameters
  • bits (int) – Number of bits to which to fold database.

  • fp_type (type or None, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint). Defaults to same type.

  • name (str, optional) – Name of database

Returns

Database folded to specified length.

Return type

FingerprintDatabase

Raises

BitsValueError – If bits is greater than the length of the database or database cannot be evenly folded to length bits.
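The folding rule itself is not spelled out in this section. A common scheme, assumed here, maps each "on" index to its remainder modulo the new length, which is also why the new length has to divide the old one evenly. The helper fold_indices is hypothetical and uses ValueError as a stand-in for BitsValueError; it covers only bit fingerprints.

```python
def fold_indices(on_bits, bits, new_bits):
    """Fold 'on' bit indices from length `bits` down to `new_bits`.

    Assumes modulo folding: bit i lands on position i % new_bits.
    """
    if new_bits > bits or bits % new_bits != 0:
        # Stand-in for BitsValueError in this sketch.
        raise ValueError(
            "cannot fold %d bits evenly to %d bits" % (bits, new_bits)
        )
    # Colliding indices collapse into a single 'on' bit.
    return sorted({i % new_bits for i in on_bits})


folded = fold_indices([40, 552, 1012], bits=1024, new_bits=512)
```

Here indices 40 and 552 collide on position 40, so three "on" bits become two after folding.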

property fp_num

classmethod from_array(array, fp_names, fp_type=None, level=-1, name=None, props={})[source]

Instantiate from array.

Parameters
  • array (numpy.ndarray or scipy.sparse.csr_matrix) – Sparse matrix with dimensions N x M, where N is the number of fingerprints and M is the number of bits in the fingerprints.

  • fp_names (list of str) – N names of fingerprints in array.

  • fp_type (type, optional) – Type of fingerprint (Fingerprint, CountFingerprint, FloatFingerprint).

  • level (int, optional) – Level, or number of iterations used during fingerprinting.

  • name (str or None, optional) – Name of database.

  • props (dict, optional) – Dict with keys specifying names of fingerprint properties and values corresponding to length N array of values.

Returns

Database containing fingerprints in array.

Return type

FingerprintDatabase

get_density(index=None)[source]

Get percentage of fingerprints with ‘on’ bit at position.

Parameters

index (int or None, optional) – Index to bit for which to return positional density. If None, density for whole database is returned.

Returns

Density of ‘on’ position in database

Return type

float
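As a rough illustration of the two cases, the sketch below computes density for bit fingerprints given as lists of "on" indices. It assumes the result is a fraction between 0 and 1; the scale e3fp actually returns is not stated here. The function name density is hypothetical.

```python
def density(rows, bits, index=None):
    """Fraction of 'on' bits, overall or at a single position."""
    if not rows:
        return 0.0  # assumption: an empty database has zero density
    if index is None:
        # Whole database: 'on' bits divided by all cells (N x bits).
        on_total = sum(len(set(r)) for r in rows)
        return on_total / (len(rows) * bits)
    # Single position: share of fingerprints with that bit set.
    return sum(1 for r in rows if index in r) / len(rows)


rows = [[0, 2], [2], [1, 2, 3]]
overall = density(rows, bits=4)           # 6 'on' bits in 12 cells
at_bit_2 = density(rows, bits=4, index=2)  # set in every fingerprint
at_bit_0 = density(rows, bits=4, index=0)  # set in one of three
```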

get_prop(key)[source]

Get property.

Raises

KeyError – If key not in props.

get_subset(fp_names, name=None)[source]

Get database with subset of fingerprints.

Parameters
  • fp_names (list of str) – List of fingerprint names to include in new db.

  • name (str, optional) – Name of database

classmethod load(fn)[source]

Load database from file.

The extension is used to determine how the database was serialized (save vs savez).

Parameters

fn (str) – Filename

Returns

Database

Return type

FingerprintDatabase

save(**kwargs)

Note

Deprecated in e3fp 1.2. save will be removed in e3fp 1.3. Use savez instead.

Save database to file.

Parameters

fn (str, optional) – Filename or basename if extension does not include ‘.fps’

savetxt(fn, with_names=True)[source]

Save bitstring representation to text file.

Only implemented for fp_type of Fingerprint. This should not be attempted for large numbers of bits.

Parameters
  • fn (str or filehandle) – Out file. Extension is automatically parsed to determine whether compression is used.

  • with_names (bool, optional) – Include name of fingerprint in same row after bitstring.
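A small sketch of what a bitstring row could look like, and why this format scales poorly with the number of bits: every fingerprint is written as one character per bit. The helper write_bitstrings is hypothetical, and the space separator and left-to-right bit order are assumptions; the exact layout e3fp writes is not described here.

```python
import io


def write_bitstrings(rows, names, bits, out, with_names=True):
    """Write one '0'/'1' string per fingerprint, optionally followed by its name."""
    for on_bits, name in zip(rows, names):
        on = set(on_bits)
        # One character per bit, so each line is `bits` characters long.
        line = "".join("1" if i in on else "0" for i in range(bits))
        if with_names:
            line += " " + name  # assumed separator
        out.write(line + "\n")


buf = io.StringIO()
write_bitstrings([[0, 3], [1]], ["fp0", "fp1"], bits=4, out=buf)
text = buf.getvalue()
```

With 1024-bit fingerprints each row already takes 1024 characters regardless of how few bits are set, in contrast to the sparse representation used in memory.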

Raises

savez(fn='fingerprints.fpz')[source]

Save database to file.

Database is serialized using numpy.savez_compressed.

Parameters

fn (str, optional) – Filename or basename if extension is not ‘.fpz’

set_prop(key, vals, check_length=True)[source]

Set values of property for fingerprints.

Parameters
  • key (str) – Name of property

  • vals (array_like) – Values of property.

  • check_length (bool, optional) – Check to ensure the number of property values matches the number of fingerprints already in the database. This should only be set to False for temporary iterative updating.

update_names_map(new_names=None, offset=0)[source]

Update map of fingerprint names to row indices of self.array.

Parameters
  • new_names (iterable of str, optional) – Names to add to map. If None, map is completely rebuilt.

  • offset (int, optional) – Number of rows before new rows.
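The role of offset is easiest to see in a toy version of the map. The sketch below assumes that a name may occur more than once, so each name maps to a list of row indices (consistent with name lookup returning a list in the class examples). The helper add_names is hypothetical and only covers the incremental case, not a full rebuild.

```python
from collections import defaultdict


def add_names(names_map, new_names, offset=0):
    """Register names for rows appended after the first `offset` rows."""
    for row, name in enumerate(new_names, start=offset):
        names_map[name].append(row)
    return names_map


names_map = defaultdict(list)
add_names(names_map, ["fp0", "fp1", "fp0"])        # rows 0, 1, 2
add_names(names_map, ["fp2", "fp1"], offset=3)     # rows 3, 4
```

Passing the number of existing rows as offset keeps the indices of newly added names aligned with their rows in the array.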

update_props(props_dict, append=False, check_length=True)[source]

Set multiple properties at once.

Parameters
  • props_dict (dict) – Dict of properties. Values must be array-like of length fp_num.

  • append (bool, optional) – Append values to those already in database. By default, properties are overwritten if already present.

  • check_length (bool, optional) – Check to ensure the number of property values matches the number of fingerprints already in the database. This should only be set to False for temporary iterative updating.
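The interplay of append and check_length can be sketched with plain lists. The function set_many below is a hypothetical stand-in, not e3fp code, and ValueError is an assumed error type, since the exception raised on a length mismatch is not documented here.

```python
def set_many(props, props_dict, fp_num, append=False, check_length=True):
    """Overwrite or extend per-fingerprint property values."""
    for key, vals in props_dict.items():
        vals = list(vals)
        if append and key in props:
            vals = props[key] + vals  # extend the existing values
        if check_length and len(vals) != fp_num:
            raise ValueError(
                "property %r has %d values for %d fingerprints"
                % (key, len(vals), fp_num)
            )
        props[key] = vals
    return props


props = {"energy": [1.0, 2.0]}
# A third fingerprint was added, so append its value.
set_many(props, {"energy": [3.0]}, fp_num=3, append=True)
appended = list(props["energy"])
# Default behaviour: the existing values are replaced.
set_many(props, {"energy": [9.0, 8.0, 7.0]}, fp_num=3)
overwritten = list(props["energy"])
```

Disabling the length check is only useful while fingerprints and properties are being added in alternating steps, when the two counts are temporarily out of sync.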

concat(dbs)[source]

Efficiently concatenate FingerprintDatabase objects.

The databases must be of the same type with the same number of bits, level, and property names.

Parameters

dbs (iterable of FingerprintDatabase) – Fingerprint databases

Returns

Database with all fingerprints from provided databases.

Return type

FingerprintDatabase

Examples

>>> from e3fp.fingerprint.db import FingerprintDatabase, concat
>>> from e3fp.fingerprint.fprint import Fingerprint
>>> import numpy as np
>>> np.random.seed(2)
>>> db1 = FingerprintDatabase(fp_type=Fingerprint, name="TestDB1", level=5)
>>> db2 = FingerprintDatabase(fp_type=Fingerprint, name="TestDB2", level=5)
>>> bvs = (np.random.uniform(size=(6, 1024)) > .9).astype(bool)
>>> fps = [Fingerprint.from_vector(bvs[i, :], name="fp" + str(i), level=5)
...        for i in range(bvs.shape[0])]
>>> db1.add_fingerprints(fps[:3])
>>> db2.add_fingerprints(fps[3:])
>>> print(concat([db1, db2]))
FingerprintDatabase[name: None, fp_type: Fingerprint, level: 5, bits: 1024, fp_num: 6]
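The compatibility requirements can be summarized in a small model where each database is a plain dict. This is only a sketch of the checks described above, under the assumption that a mismatch is an error; concat_dbs and the dict layout are hypothetical, and the fingerprint type check is omitted.

```python
def concat_dbs(dbs):
    """Concatenate dict-based toy databases after checking compatibility."""
    dbs = list(dbs)
    if not dbs:
        raise ValueError("no databases to concatenate")
    first = dbs[0]
    for db in dbs[1:]:
        if (db["bits"], db["level"]) != (first["bits"], first["level"]):
            raise ValueError("bits and level must match")
        if set(db["props"]) != set(first["props"]):
            raise ValueError("property names must match")
    return {
        "bits": first["bits"],
        "level": first["level"],
        # Rows, names, and property values are stacked in input order.
        "rows": [r for db in dbs for r in db["rows"]],
        "names": [n for db in dbs for n in db["names"]],
        "props": {
            key: [v for db in dbs for v in db["props"][key]]
            for key in first["props"]
        },
    }


db1 = {"bits": 8, "level": 5, "rows": [[0, 3]], "names": ["fp0"],
       "props": {"energy": [1.5]}}
db2 = {"bits": 8, "level": 5, "rows": [[1], [2, 7]], "names": ["fp1", "fp2"],
       "props": {"energy": [2.5, 3.5]}}
merged = concat_dbs([db1, db2])
```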