e3fp.fingerprint.fprint module

Classes and methods for chemical fingerprint storage and comparison.

Author: Seth Axen E-mail: seth.axen@gmail.com

class CountFingerprint(indices=None, counts=None, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]

Bases: Fingerprint

A fingerprint that stores number of occurrences of each index.

Parameters:
  • indices (array_like of int, optional) – log2(bits)-bit indices in a sparse vector, corresponding to positions with counts greater than 0. If not provided, counts must be provided.

  • counts (dict, optional) – Dict matching each index in indices to number of counts. All counts default to 1 if not provided.

  • bits (int, optional) – Number of bits in bitvector.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Variables:
  • bits (int) – Number of bits in bitvector, length of fingerprint.

  • counts (dict) – Dict matching each index in indices to number of counts.

  • indices (numpy.ndarray of int) – Indices of fingerprint with counts greater than 0.

  • level (int) – Level of fingerprint, corresponding to fingerprinting iterations.

  • mol (RDKit Mol) – Mol to which fingerprint corresponds (stored in props).

  • name (str or None) – Name of fingerprint (stored in props).

  • props (dict) – Custom properties of fingerprint, consisting of a string keyword and some value.

  • vector_dtype (numpy.dtype) – NumPy data type associated with fingerprint values (e.g. bits)

See also

Fingerprint

A fingerprint that stores indices of “on” bits

FloatFingerprint

A fingerprint that stores float counts

Examples

>>> import e3fp.fingerprint.fprint as fp
>>> from e3fp.fingerprint.metrics import soergel
>>> import numpy as np
>>> np.random.seed(1)
>>> bits = 1024
>>> indices = np.random.randint(0, bits, 30)
>>> print(indices)
[ 37 235 908  72 767 905 715 645 847 960 144 129 972 583 749 508 390 281
 178 276 254 357 914 468 907 252 490 668 925 398]
>>> counts = dict(zip(indices,
...                   np.random.randint(1, 100, indices.shape[0])))
>>> f = fp.CountFingerprint(indices, counts=counts, bits=bits, level=0)
>>> sorted(f.counts.items())
[(np.int64(37), 51), (np.int64(72), 88), ..., (np.int64(960), 8), (np.int64(972), 23)]
>>> f_folded = f.fold(bits=32)
>>> print(sorted(f_folded.counts.items()))
[(np.int64(0), 8), (np.int64(1), 62), ..., (np.int64(30), 14), (np.int64(31), 95)]
>>> print(f_folded.to_vector(sparse=False, dtype=int))
[  8  62   0   0   0 113  61  58  88  97  71 228 111   2  58  10  64   0
  82   0 120   0   0   0   0  82   0   0  27  50  14  95]
>>> fp.Fingerprint.from_fingerprint(f_folded)
Fingerprint(indices=array([0, 1, ...]), level=0, bits=32, name=None)
>>> indices2 = np.random.randint(0, bits, 30)
>>> counts2 = dict(zip(indices2,
...                    np.random.randint(1, 100, indices.shape[0])))
>>> f_folded2 = fp.CountFingerprint.from_indices(indices2, counts=counts2,
...                                              bits=bits).fold(bits=32)
>>> sorted(f_folded2.counts.items())
[(np.int64(0), 93), (np.int64(2), 33), ..., (np.int64(26), 89), (np.int64(30), 53)]
>>> print(soergel(f_folded, f_folded2))
0.17492946392...
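The comparison in the last step can be sketched in plain Python on two counts dicts. This is an illustration only, assuming that soergel denotes the similarity form (sum of per-index minima divided by sum of per-index maxima); the function name `soergel_similarity` and the sample dicts are hypothetical and are not part of e3fp.

```python
def soergel_similarity(counts_a, counts_b):
    """Sum of per-index minima divided by sum of per-index maxima."""
    keys = set(counts_a) | set(counts_b)
    numerator = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    denominator = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    # Assumption: two empty fingerprints are given a similarity of 0.0.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts dicts (index -> count), not taken from the example above.
a = {0: 3, 1: 1, 5: 2}
b = {0: 1, 5: 4, 7: 2}
similarity = soergel_similarity(a, b)
print(similarity)  # 0.3
```

For dicts whose counts are all 1, this reduces to the ratio of shared indices to all indices.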
property counts
fold(*args, **kwargs)[source]

Fold fingerprint while considering counts.

Optionally, provide a function to reduce colliding counts.

Parameters:
  • bits (int, optional) – Length of new bitvector, ideally a power of 2.

  • method ({0, 1}, optional) – Method to use for folding.

    0

    partitioning (array is divided into equal sized arrays of length bits which are bitwise combined with counts_method)

    1

    compression (adjacent bits pairs are combined with counts_method until length is bits)

  • linked (bool, optional) – Link folded and unfolded fingerprints for easy referencing. Set to False if intending to save and want to reduce file size.

  • counts_method (function, optional) – Function for combining counts. Default is summation.

Returns:

Fingerprint of folded vector

Return type:

CountFingerprint
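A minimal sketch of partition folding with counts, written against plain dicts rather than the e3fp classes. It assumes that partitioning sends index i to i % bits and that the reducing function receives the list of colliding counts; both the helper name `fold_counts` and that call convention are assumptions made for illustration.

```python
from collections import defaultdict


def fold_counts(counts, bits, counts_method=sum):
    """Fold a counts dict to `bits` positions, reducing collisions."""
    colliding = defaultdict(list)
    for index, count in counts.items():
        # Partitioning: every block of length `bits` is overlaid on the first.
        colliding[index % bits].append(count)
    # Reduce all counts that landed on the same folded index.
    return {i: counts_method(c) for i, c in sorted(colliding.items())}


# Hypothetical counts: indices 3 and 35 collide when folded to 32 positions.
counts = {3: 2, 35: 5, 40: 1, 64: 7}
summed = fold_counts(counts, bits=32)
largest = fold_counts(counts, bits=32, counts_method=max)
print(summed)   # {0: 7, 3: 7, 8: 1}
print(largest)  # {0: 7, 3: 5, 8: 1}
```

Summation preserves the total count, while a reducer such as max keeps the largest contribution at each folded position.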

classmethod from_counts(counts, bits=4294967296, level=-1, **kwargs)[source]

Initialize from a dict mapping indices to counts.

Parameters:
  • counts (dict) – Dictionary mapping sparse indices to counts.

  • bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns:

fingerprint

Return type:

CountFingerprint

classmethod from_fingerprint(fp, **kwargs)[source]

Initialize by copying existing fingerprint.

Parameters:
  • fp (Fingerprint) – Existing fingerprint.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns:

fingerprint

Return type:

Fingerprint

classmethod from_indices(indices, counts=None, bits=4294967296, level=-1, **kwargs)[source]

Initialize from an array of indices.

Parameters:
  • indices (array_like of int, optional) – Indices in a sparse bitvector of length bits which correspond to 1.

  • counts (dict, optional) – Dictionary mapping sparse indices to counts.

  • bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns:

fingerprint

Return type:

CountFingerprint

get_count(index)[source]

Return count of index in fingerprint.

Returns:

Count of index in fingerprint

Return type:

int

mean()[source]

Return mean of counts.

Returns:

Mean

Return type:

float

reset(*args, **kwargs)[source]

Reset all values.

std()[source]

Return standard deviation of fingerprint.

Returns:

Standard deviation

Return type:

float

vector_dtype

alias of uint16

class Fingerprint(indices, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]

Bases: object

A fingerprint that stores indices of “on” bits.

Parameters:
  • indices (array_like of int, optional) – log2(bits)-bit indices in a sparse bitvector of bits which correspond to 1.

  • bits (int, optional) – Number of bits in bitvector.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Variables:
  • bits (int) – Number of bits in bitvector, length of fingerprint.

  • counts (dict) – Dict matching each index in indices to number of counts (1 for bits).

  • indices (numpy.ndarray of int) – Indices of “on” bits

  • level (int) – Level of fingerprint, corresponding to fingerprinting iterations.

  • mol (RDKit Mol) – Mol to which fingerprint corresponds (stored in props).

  • name (str or None) – Name of fingerprint (stored in props).

  • props (dict) – Custom properties of fingerprint, consisting of a string keyword and some value.

  • vector_dtype (numpy.dtype) – NumPy data type associated with fingerprint values (e.g. bits)

See also

CountFingerprint

A fingerprint that stores number of occurrences of each index

FloatFingerprint

A fingerprint that stores float counts

e3fp.fingerprint.db.FingerprintDatabase

Efficiently store fingerprints

Examples

>>> import e3fp.fingerprint.fprint as fp
>>> from e3fp.fingerprint.metrics import tanimoto
>>> import numpy as np
>>> np.random.seed(0)
>>> bits = 1024
>>> indices = np.random.randint(0, bits, 30)
>>> print(indices)
[684 559 629 192 835 763 707 359   9 723 277 754 804 599  70 472 600 396
 314 705 486 551  87 174 600 849 677 537 845  72]
>>> f = fp.Fingerprint(indices, bits=bits, level=0)
>>> f_folded = f.fold(bits=32)
>>> print(f_folded.indices)
[ 0  1  3  4  5  6  7  8  9 12 13 14 15 17 18 19 21 23 24 25 26 27]
>>> print(f_folded.to_vector(sparse=False, dtype=int))
[1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 0 0]
>>> print(f_folded.to_bitstring())
11011111110011110111010111110000
>>> print(f_folded.to_rdkit())
<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...>
>>> f_folded2 = fp.Fingerprint.from_indices(np.random.randint(0, bits, 30),
...                                         bits=bits).fold(bits=32)
>>> print(f_folded2.indices)
[ 0  1  3  5  7  9 10 14 15 16 17 18 19 20 23 24 25 29 30 31]
>>> print(tanimoto(f_folded, f_folded2))
0.5
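The Tanimoto value printed above can be checked by hand from the two folded index lists, treating each as a set of "on" bits. The helper name `tanimoto_from_indices` is hypothetical; this is a sketch of the coefficient itself, not of the e3fp implementation.

```python
def tanimoto_from_indices(a, b):
    """Shared "on" bits divided by "on" bits present in either set."""
    union = len(a | b)
    # Assumption: two empty fingerprints are given a similarity of 0.0.
    return len(a & b) / union if union else 0.0


# Folded indices printed in the example above.
f1 = {0, 1, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 17, 18, 19,
      21, 23, 24, 25, 26, 27}
f2 = {0, 1, 3, 5, 7, 9, 10, 14, 15, 16, 17, 18, 19, 20, 23, 24,
      25, 29, 30, 31}
value = tanimoto_from_indices(f1, f2)
print(value)  # 0.5
```

Here 14 indices are shared and 28 appear in at least one fingerprint, giving 14 / 28.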
property bit_count
property bits
clear()[source]

Clear temporary (and possibly large) values.

property counts
property density
fold(bits=1024, method=0, linked=True)[source]

Return fingerprint for bitvector folded to size bits.

Parameters:
  • bits (int, optional) – Length of new bitvector, ideally a power of 2.

  • method ({0, 1}, optional) – Method to use for folding.

    0

    partitioning (array is divided into equal sized arrays of length bits which are bitwise combined with OR)

    1

    compression (adjacent bits pairs are combined with OR until length is bits)

  • linked (bool, optional) – Link folded and unfolded fingerprints for easy referencing. Set to False if intending to save and want to reduce file size.

Returns:

Fingerprint of folded bitvector

Return type:

Fingerprint
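The two folding methods can be sketched on a plain list of indices. The helpers `fold_partition` and `fold_compression` are hypothetical names, and the arithmetic is an interpretation of the descriptions above, assuming the old length is the new length times a power of 2. The partition result reproduces the folded indices printed in the Examples section.

```python
def fold_partition(indices, bits):
    """Partitioning: index i of the long vector lands on i % bits."""
    return sorted({i % bits for i in indices})


def fold_compression(indices, old_bits, bits):
    """Compression: each run of old_bits // bits adjacent bits becomes one."""
    factor = old_bits // bits
    return sorted({i // factor for i in indices})


# Indices printed in the Examples section (bits=1024).
indices = [684, 559, 629, 192, 835, 763, 707, 359, 9, 723, 277, 754, 804,
           599, 70, 472, 600, 396, 314, 705, 486, 551, 87, 174, 600, 849,
           677, 537, 845, 72]
folded = fold_partition(indices, 32)
print(folded)
# [0, 1, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 17, 18, 19, 21, 23, 24, 25, 26, 27]
```

Using a set implements the OR: a folded bit is on if any of the bits mapped to it was on.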

classmethod from_bitstring(bitstring, level=-1, **kwargs)[source]

Initialize from bitstring (e.g. ‘10010011’).

Parameters:
  • bitstring (str) – String of 1s and 0s.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns:

fingerprint

Return type:

Fingerprint

classmethod from_fingerprint(fp, **kwargs)[source]

Initialize by copying existing fingerprint.

Parameters:

fp (Fingerprint) – Existing fingerprint.

Returns:

fingerprint

Return type:

Fingerprint

classmethod from_indices(indices, bits=4294967296, level=-1, **kwargs)[source]

Initialize from an array of indices.

Parameters:
  • indices (array_like of int) – Indices in a sparse bitvector of length bits which correspond to 1.

  • bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns:

fingerprint

Return type:

Fingerprint

classmethod from_rdkit(rdkit_fprint, **kwargs)[source]

Initialize from RDKit fingerprint.

If provided fingerprint is of length 2^32 - 1, assumes real fingerprint is of length 2^32.

Parameters:
  • rdkit_fprint (RDKit ExplicitBitVect or SparseBitVect) – Existing RDKit fingerprint.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns:

fingerprint

Return type:

Fingerprint

classmethod from_vector(vector, level=-1, **kwargs)[source]

Initialize from vector.

Parameters:
  • vector (numpy.ndarray or scipy.sparse.csr_matrix) – Array of bits/counts/floats

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns:

fingerprint

Return type:

Fingerprint

get_count(index)[source]

Return count of index in fingerprint.

Defaults to 1 if index in self.indices

Returns:

Count of bit in fingerprint

Return type:

int

get_folding_index_map()[source]

Get map of sparse indices to folded indices.

Returns:

Map of sparse index (keys) to corresponding folded index.

Return type:

dict

get_prop(key)[source]

Get property. If not set, raise KeyError.

get_unfolding_index_map()[source]

Get map of sparse indices to unfolded indices.

Returns:

Map of sparse index (keys) to set of corresponding unfolded indices.

Return type:

dict
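The shape of the folding and unfolding maps can be illustrated with plain dicts. This sketch assumes partition folding (index i maps to i % bits); the helper names and the way the maps are built here are hypothetical and say nothing about how e3fp stores the link between folded and unfolded fingerprints.

```python
def folding_index_map(indices, bits):
    """Map each unfolded index to the folded index it lands on."""
    return {i: i % bits for i in indices}


def unfolding_index_map(indices, bits):
    """Map each folded index to the set of unfolded indices behind it."""
    unfolded = {}
    for i in indices:
        unfolded.setdefault(i % bits, set()).add(i)
    return unfolded


# A few of the indices from the Examples section (bits=1024, folded to 32).
indices = [684, 396, 600, 472, 9]
fold_map = folding_index_map(indices, 32)
unfold_map = unfolding_index_map(indices, 32)
print(fold_map)    # {684: 12, 396: 12, 600: 24, 472: 24, 9: 9}
print(unfold_map[12] == {684, 396})  # True
```

The unfolding map holds sets because several unfolded indices can collide on one folded index.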

property index_id_map
property indices
property level
mean()[source]

Return mean, i.e. proportion of “on” bits in fingerprint.

Returns:

Mean

Return type:

float
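As a worked check of the definition above, the mean of a bit fingerprint is the number of "on" bits divided by the length. The sketch below uses the folded indices from the Examples section. The standard deviation line assumes the population formula for a 0/1 vector, sqrt(p * (1 - p)); whether the library uses that convention is not stated here.

```python
import math

bits = 32
# Folded indices printed in the Examples section.
indices = [0, 1, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 17, 18, 19,
           21, 23, 24, 25, 26, 27]

# Proportion of "on" bits.
mean = len(indices) / bits
# Assumption: population standard deviation of a vector of 0s and 1s.
std = math.sqrt(mean * (1.0 - mean))
print(mean)  # 0.6875
```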

property mol
property name
property props
reset()[source]

Reset all values.

set_prop(key, val)[source]

Set property.

std()[source]

Return standard deviation of fingerprint.

Returns:

Standard deviation

Return type:

float

to_bitstring()[source]

Get bitstring as string of 1s and 0s.

Returns:

bitstring

Return type:

str
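The bitstring layout can be read off the Examples section: character i of the string is "1" when index i is on, counting from the left. A small sketch of that mapping in both directions, using hypothetical helper names that are not part of e3fp:

```python
def indices_to_bitstring(indices, bits):
    """Write "1" at each on index and "0" elsewhere, left to right."""
    on = set(indices)
    return "".join("1" if i in on else "0" for i in range(bits))


def bitstring_to_indices(bitstring):
    """Recover the on indices from a string of 1s and 0s."""
    return [i for i, char in enumerate(bitstring) if char == "1"]


# Folded indices printed in the Examples section (bits=32).
indices = [0, 1, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 17, 18, 19,
           21, 23, 24, 25, 26, 27]
bitstring = indices_to_bitstring(indices, 32)
print(bitstring)  # 11011111110011110111010111110000
```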

to_bitvector(sparse=True)[source]

Get full bitvector.

Returns:

Bitvector

Return type:

numpy.ndarray or scipy.sparse.csr_matrix of bool

to_rdkit()[source]

Convert to RDKit fingerprint.

If number of bits exceeds 2^31 - 1, fingerprint will be folded to length 2^31 - 1 before conversion.

Returns:

rdkit_fprint – Bitvector used for RDKit fingerprints. If self.bits is less than 10^5, an ExplicitBitVect is returned. Otherwise, a SparseBitVect is returned.

Return type:

RDKit ExplicitBitVect or SparseBitVect

to_vector(sparse=True, dtype=None)[source]

Get vector of bits/counts/floats.

Returns:

Vector of bits/counts/floats

Return type:

numpy.ndarray or scipy.sparse.csr_matrix

unfold()[source]

Return unfolded parent fingerprint for bitvector.

Returns:

Fingerprint of unfolded bitvector. If None, return None.

Return type:

Fingerprint

update_props(props_dict)[source]

Set multiple properties at once.

vector_dtype

alias of bool

class FloatFingerprint(indices=None, counts=None, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]

Bases: CountFingerprint

A Fingerprint that stores float counts.

Nearly identical to CountFingerprint. Mainly a naming convention, but count values are stored as floats.

See also

Fingerprint

A fingerprint that stores indices of “on” bits

CountFingerprint

A fingerprint that stores number of occurrences of each index

property counts
vector_dtype

alias of float64

add(fprints, weights=None)[source]

Add fingerprints by count to new CountFingerprint.

If any of the fingerprints are FloatFingerprint, resulting fingerprint is likewise a FloatFingerprint. Otherwise, resulting fingerprint is CountFingerprint.

Parameters:
  • fprints (iterable of Fingerprint) – Fingerprints to be added by count.

  • weights (iterable of float) – Weights for weighted sum. Results in FloatFingerprint output.

Returns:

Fingerprint with counts as sum of counts in fprints.

Return type:

CountFingerprint or FloatFingerprint

See also

mean

coerce_to_valid_dtype(dtype)[source]

Coerce provided NumPy data type to closest fingerprint data type.

If provided dtype cannot be read, default corresponding to bit Fingerprint is returned.

Parameters:

dtype (numpy.dtype or str) – Input NumPy data type.

Returns:

Output NumPy data type.

Return type:

numpy.dtype
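One way such a coercion could work is sketched below. It assumes the three target types are the vector_dtype values listed on this page (bool, uint16, float64) and that "closest" means: floating input maps to float64, integer input maps to uint16, anything else maps to bool. The function name `coerce_dtype_sketch` is hypothetical and the real rules may differ.

```python
import numpy as np


def coerce_dtype_sketch(dtype):
    """Map an arbitrary dtype spec to bool, uint16 or float64."""
    try:
        dtype = np.dtype(dtype)
    except TypeError:
        # Unreadable input falls back to the bit fingerprint type.
        return np.dtype(bool)
    if np.issubdtype(dtype, np.floating):
        return np.dtype(np.float64)
    if np.issubdtype(dtype, np.integer):
        return np.dtype(np.uint16)
    return np.dtype(bool)


print(coerce_dtype_sketch("float32"))      # float64
print(coerce_dtype_sketch("int64"))        # uint16
print(coerce_dtype_sketch("not-a-dtype"))  # bool
```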

diff_counts_dict(fp1, fp2, only_positive=False)[source]

Given two fingerprints, returns difference of their counts dicts.

Parameters:
  • fp1, fp2 (Fingerprint) – Fingerprint objects, fp2 subtracted from fp1.

  • only_positive (bool, optional) – Return only positive counts, negative being thresholded to 0.

Returns:

counts_diff – Count indices in either fp1 or fp2 with value as diff of counts.

Return type:

dict

See also

sum_counts_dict
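A minimal sketch of the difference on plain counts dicts. It assumes that an index missing from one dict counts as 0 there and that entries whose difference is 0 are dropped from the result; the second point is an assumption, since the description above does not say how zeros are handled. The name `diff_counts` is hypothetical.

```python
def diff_counts(counts1, counts2, only_positive=False):
    """Subtract counts2 from counts1 index by index."""
    diff = {}
    for index in set(counts1) | set(counts2):
        value = counts1.get(index, 0) - counts2.get(index, 0)
        if only_positive:
            # Negative differences are thresholded to 0.
            value = max(value, 0)
        if value != 0:  # assumption: zero entries are not kept
            diff[index] = value
    return diff


# Hypothetical counts dicts.
c1 = {1: 3, 2: 1}
c2 = {2: 4, 5: 2}
full = diff_counts(c1, c2)
positive = diff_counts(c1, c2, only_positive=True)
print(sorted(full.items()))  # [(1, 3), (2, -3), (5, -2)]
print(positive)              # {1: 3}
```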

dtype_from_fptype(fp_type)[source]

Get NumPy data type from fingerprint type.

Parameters:

fp_type (class or Fingerprint) – Class of fingerprint

Returns:

NumPy data type

Return type:

numpy.dtype

fptype_from_dtype(dtype)[source]

Get corresponding fingerprint type from NumPy data type.

Parameters:

dtype (numpy.dtype or str) – NumPy data type.

Returns:

class – Class of fingerprint

Return type:

{Fingerprint, CountFingerprint, FloatFingerprint}

load(f, update_structure=True)[source]

Load Fingerprint object from file.

Parameters:
  • f (str or File) – File name or file-like object to load file from.

  • update_structure (bool, optional) – Attempt to update the class structure by initializing a new, shiny fingerprint from each fingerprint in the file. Useful for guaranteeing that old, dusty fingerprints are always upgradeable.

Returns:

Pickled fingerprint.

Return type:

Fingerprint

See also

loadz, save

loadz(f, update_structure=True)[source]

Load Fingerprint objects from file.

Parameters:
  • f (str or File) – File name or file-like object to load file from.

  • update_structure (bool, optional) – Attempt to update the class structure by initializing a new, shiny fingerprint from each fingerprint in the file. Useful for guaranteeing that old, dusty fingerprints are always upgradeable. If this doesn’t work, falls back to the original saved fingerprint.

Returns:

Fingerprints in pickle.

Return type:

list of Fingerprint

See also

load, savez

mean(fprints, weights=None)[source]

Average fingerprints to generate FloatFingerprint.

Parameters:
  • fprints (iterable of Fingerprint) – Fingerprints to be averaged.

  • weights (array_like of float, optional) – Weights for weighted mean. Weights are normalized to a sum of 1.

Returns:

Fingerprint with float counts as average of counts in fprints.

Return type:

FloatFingerprint
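The averaging can be sketched on plain counts dicts: each weight is divided by the sum of the weights, and an index missing from a fingerprint contributes 0. The helper `mean_counts` is a hypothetical name used for illustration and operates on dicts, not on Fingerprint objects.

```python
def mean_counts(count_dicts, weights=None):
    """Weighted average of counts dicts, weights normalized to sum to 1."""
    if weights is None:
        weights = [1.0] * len(count_dicts)
    total = float(sum(weights))
    mean = {}
    for counts, weight in zip(count_dicts, weights):
        for index, count in counts.items():
            mean[index] = mean.get(index, 0.0) + count * weight / total
    return mean


# Hypothetical counts dicts for two fingerprints.
dicts = [{0: 2, 3: 1}, {0: 4}]
plain = mean_counts(dicts)
weighted = mean_counts(dicts, weights=[3, 1])
print(plain)     # {0: 3.0, 3: 0.5}
print(weighted)  # {0: 2.5, 3: 0.75}
```

Because the weights are normalized, [3, 1] and [0.75, 0.25] give the same result.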

save(f, fp, **kwargs)[source]

Save Fingerprint object to file.

Parameters:
  • f (str or File) – filename str or file-like object to save file to

  • fp (Fingerprint) – Fingerprint to save to file

  • protocol ({0, 1, 2, None}, optional) – Pickle protocol to use. If None, highest available protocol is used. This will not affect fingerprint loading.

Returns:

Success or fail

Return type:

bool

See also

savez, load

savez(f, *fps, **kwargs)[source]

Save multiple Fingerprint objects to file.

Parameters:
  • f (str or File) – filename str or file-like object to save file to

  • fps (list of Fingerprint) – List of Fingerprints to save to file

  • protocol ({0, 1, 2, None}, optional) – Pickle protocol to use. If None, highest available protocol is used. This will not affect fingerprint loading.

Returns:

Success or fail

Return type:

bool

See also

save, loadz

sum_counts_dict(*fprints, **kwargs)[source]

Given fingerprints, return sum of their counts dicts.

If an optional weights iterable of the same length as fprints is provided, the weighted sum is returned.

Parameters:
  • *fprints – One or more Fingerprint objects

  • weights (iterable of float, optional) – Weights for weighted mean. Weights are normalized to a sum of 1.

Returns:

Dict of non-zero count indices in any of the fprints with value as sum of counts.

Return type:

dict

See also

diff_counts_dict
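A minimal sketch of the summation on plain counts dicts, with an optional weight per input. It applies the weights as given, without normalizing them, which is an assumption made for this illustration; the helper name `sum_counts` is hypothetical and takes dicts rather than Fingerprint objects.

```python
def sum_counts(*count_dicts, weights=None):
    """Add counts index by index, optionally scaling each input."""
    if weights is None:
        weights = [1] * len(count_dicts)
    total = {}
    for counts, weight in zip(count_dicts, weights):
        for index, count in counts.items():
            total[index] = total.get(index, 0) + weight * count
    # Keep only indices with a non-zero summed count.
    return {index: count for index, count in total.items() if count != 0}


# Hypothetical counts dicts for two fingerprints.
c1 = {1: 2, 4: 1}
c2 = {1: 1, 9: 3}
plain = sum_counts(c1, c2)
weighted = sum_counts(c1, c2, weights=[0.5, 2.0])
print(plain)     # {1: 3, 4: 1, 9: 3}
print(weighted)  # {1: 3.0, 4: 0.5, 9: 6.0}
```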