e3fp.fingerprint.fprint module

Classes and methods for chemical fingerprint storage and comparison.

Author: Seth Axen
E-mail: seth.axen@gmail.com

class CountFingerprint(indices=None, counts=None, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]

Bases: Fingerprint

A fingerprint that stores number of occurrences of each index.

Parameters
  • indices (array_like of int, optional) – log2(bits)-bit indices in a sparse vector, corresponding to positions with counts greater than 0. If not provided, counts must be provided.

  • counts (dict, optional) – Dict matching each index in indices to number of counts. All counts default to 1 if not provided.

  • bits (int, optional) – Number of bits in bitvector.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Variables
  • bits (int) – Number of bits in bitvector, length of fingerprint.

  • counts (dict) – Dict matching each index in indices to number of counts.

  • indices (numpy.ndarray of int) – Indices of fingerprint with counts greater than 0.

  • level (int) – Level of fingerprint, corresponding to fingerprinting iterations.

  • mol (RDKit Mol) – Mol to which fingerprint corresponds (stored in props).

  • name (str or None) – Name of fingerprint (stored in props).

  • props (dict) – Custom properties of fingerprint, consisting of a string keyword and some value.

  • vector_dtype (numpy.dtype) – NumPy data type associated with fingerprint values (e.g. bits)

See also

Fingerprint

A fingerprint that stores indices of “on” bits

FloatFingerprint

A fingerprint that stores float counts

Examples

>>> import e3fp.fingerprint.fprint as fp
>>> from e3fp.fingerprint.metrics import soergel
>>> import numpy as np
>>> np.random.seed(1)
>>> bits = 1024
>>> indices = np.random.randint(0, bits, 30)
>>> print(indices)
[ 37 235 908  72 767 905 715 645 847 960 144 129 972 583 749 508 390 281
 178 276 254 357 914 468 907 252 490 668 925 398]
>>> counts = dict(zip(indices,
...                   np.random.randint(1, 100, indices.shape[0])))
>>> print(sorted(counts.items()))
[(37, 51), (72, 88), (129, 62), ..., (925, 50), (960, 8), (972, 23)]
>>> f = fp.CountFingerprint(indices, counts=counts, bits=bits, level=0)
>>> f_folded = f.fold(bits=32)
>>> print(sorted(f_folded.counts.items()))
[(0, 8), (1, 62), (5, 113), ..., (29, 50), (30, 14), (31, 95)]
>>> print(f_folded.to_vector(sparse=False, dtype=int))
[  8  62   0   0   0 113  61  58  88  97  71 228 111   2  58  10  64   0
  82   0 120   0   0   0   0  82   0   0  27  50  14  95]
>>> fp.Fingerprint.from_fingerprint(f_folded)
Fingerprint(indices=array([0, 1, ...]), level=0, bits=32, name=None)
>>> indices2 = np.random.randint(0, bits, 30)
>>> counts2 = dict(zip(indices2,
...                    np.random.randint(1, 100, indices.shape[0])))
>>> f_folded2 = fp.CountFingerprint.from_indices(indices2, counts=counts2,
...                                              bits=bits).fold(bits=32)
>>> print(sorted(f_folded2.counts.items()))
[(0, 93), (2, 33), (3, 106), ..., (25, 129), (26, 89), (30, 53)]
>>> print(soergel(f_folded, f_folded2))
0.17492946392...
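
The value printed by soergel above can be illustrated without e3fp. The following is a minimal sketch that assumes the metric is the Soergel similarity (sum of element-wise minima over sum of element-wise maxima) rather than the Soergel distance. The helper name soergel_similarity and the small count dicts are hypothetical and are not part of e3fp.

```python
def soergel_similarity(counts_a, counts_b):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    keys = set(counts_a) | set(counts_b)
    numerator = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    denominator = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    # Two empty count dicts are treated as having similarity 0 (an assumption).
    return numerator / denominator if denominator else 0.0


# Hypothetical folded counts, not the elided values from the example above.
counts_a = {0: 8, 1: 62, 5: 113}
counts_b = {0: 93, 2: 33, 5: 20}
similarity = soergel_similarity(counts_a, counts_b)  # 28 / 301
```

For fingerprints whose counts are all 1, this ratio reduces to the Tanimoto coefficient on the sets of indices.
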
property counts
fold(*args, **kwargs)[source]

Fold fingerprint while considering counts.

Optionally, provide a function to reduce colliding counts.

Parameters
  • bits (int, optional) – Length of new bitvector, ideally a power of 2.

  • method ({0, 1}, optional) – Method to use for folding.

    0

    partitioning (array is divided into equal sized arrays of length bits which are bitwise combined with counts_method)

    1

    compression (adjacent bits pairs are combined with counts_method until length is bits)

  • linked (bool, optional) – Link folded and unfolded fingerprints for easy referencing. Set to False if intending to save and want to reduce file size.

  • counts_method (function, optional) – Function for combining counts. Default is summation.

Returns

Fingerprint of folded vector

Return type

CountFingerprint
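
Folding by partitioning with a reducer for colliding counts can be sketched in plain Python. This is an illustration under stated assumptions, not e3fp's implementation: it assumes partitioning sends index i to i % bits, and that counts_method receives the list of colliding counts. The helper name fold_counts and the index 69 are hypothetical.

```python
def fold_counts(counts, folded_bits, counts_method=sum):
    """Fold a sparse {index: count} dict to folded_bits positions."""
    grouped = {}
    for index, count in counts.items():
        # Partitioning: every index lands on its remainder modulo folded_bits.
        grouped.setdefault(index % folded_bits, []).append(count)
    # Colliding counts are reduced with counts_method (summation by default).
    return {index: counts_method(values) for index, values in grouped.items()}


counts = {37: 51, 69: 10, 129: 62, 960: 8}  # 37 and 69 collide at position 5
folded_sum = fold_counts(counts, 32)
folded_max = fold_counts(counts, 32, counts_method=max)
```

With summation the two colliding counts add up at position 5, while max keeps only the larger of the two.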

classmethod from_counts(counts, bits=4294967296, level=-1, **kwargs)[source]

Initialize from a dict mapping indices to counts.

Parameters
  • counts (dict) – Dictionary mapping sparse indices to counts.

  • bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns

fingerprint

Return type

CountFingerprint

classmethod from_fingerprint(fp, **kwargs)[source]

Initialize by copying existing fingerprint.

Parameters
  • fp (Fingerprint) – Existing fingerprint.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns

fingerprint

Return type

Fingerprint

classmethod from_indices(indices, counts=None, bits=4294967296, level=-1, **kwargs)[source]

Initialize from an array of indices.

Parameters
  • indices (array_like of int) – Indices in a sparse vector of length bits which have counts greater than 0.

  • counts (dict, optional) – Dictionary mapping sparse indices to counts.

  • bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns

fingerprint

Return type

CountFingerprint

get_count(index)[source]

Return count of index in fingerprint.

Returns

Count of index in fingerprint

Return type

int

mean()[source]

Return mean of counts.

Returns

Mean

Return type

float

reset(*args, **kwargs)[source]

Reset all values.

std()[source]

Return standard deviation of fingerprint.

Returns

Standard deviation

Return type

float

vector_dtype

alias of uint16

class Fingerprint(indices, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]

Bases: object

A fingerprint that stores indices of “on” bits.

Parameters
  • indices (array_like of int) – log2(bits)-bit indices in a sparse bitvector of length bits which correspond to 1.

  • bits (int, optional) – Number of bits in bitvector.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Variables
  • bits (int) – Number of bits in bitvector, length of fingerprint.

  • counts (dict) – Dict matching each index in indices to number of counts (1 for bits).

  • indices (numpy.ndarray of int) – Indices of “on” bits

  • level (int) – Level of fingerprint, corresponding to fingerprinting iterations.

  • mol (RDKit Mol) – Mol to which fingerprint corresponds (stored in props).

  • name (str or None) – Name of fingerprint (stored in props).

  • props (dict) – Custom properties of fingerprint, consisting of a string keyword and some value.

  • vector_dtype (numpy.dtype) – NumPy data type associated with fingerprint values (e.g. bits)

See also

CountFingerprint

A fingerprint that stores number of occurrences of each index

FloatFingerprint

A fingerprint that stores float counts

e3fp.fingerprint.db.FingerprintDatabase

Efficiently store fingerprints

Examples

>>> import e3fp.fingerprint.fprint as fp
>>> from e3fp.fingerprint.metrics import tanimoto
>>> import numpy as np
>>> np.random.seed(0)
>>> bits = 1024
>>> indices = np.random.randint(0, bits, 30)
>>> print(indices)
[684 559 629 192 835 763 707 359   9 723 277 754 804 599  70 472 600 396
 314 705 486 551  87 174 600 849 677 537 845  72]
>>> f = fp.Fingerprint(indices, bits=bits, level=0)
>>> f_folded = f.fold(bits=32)
>>> print(f_folded.indices)
[ 0  1  3  4  5  6  7  8  9 12 13 14 15 17 18 19 21 23 24 25 26 27]
>>> print(f_folded.to_vector(sparse=False, dtype=int))
[1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 0 0]
>>> print(f_folded.to_bitstring())
11011111110011110111010111110000
>>> print(f_folded.to_rdkit())
<rdkit.DataStructs.cDataStructs.ExplicitBitVect object at 0x...>
>>> f_folded2 = fp.Fingerprint.from_indices(np.random.randint(0, bits, 30),
...                                         bits=bits).fold(bits=32)
>>> print(f_folded2.indices)
[ 0  1  3  5  7  9 10 14 15 16 17 18 19 20 23 24 25 29 30 31]
>>> print(tanimoto(f_folded, f_folded2))
0.5
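
The Tanimoto value above can be reproduced from the two printed index lists. This is a minimal sketch using Python sets; the helper name tanimoto_sets is hypothetical and is not the e3fp.fingerprint.metrics function.

```python
def tanimoto_sets(indices_a, indices_b):
    """Size of the intersection of "on" bits divided by size of the union."""
    a, b = set(indices_a), set(indices_b)
    union = a | b
    # Two empty fingerprints are treated as having similarity 0 (an assumption).
    return len(a & b) / len(union) if union else 0.0


# Index lists copied from the folded fingerprints printed above.
folded = [0, 1, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15,
          17, 18, 19, 21, 23, 24, 25, 26, 27]
folded2 = [0, 1, 3, 5, 7, 9, 10, 14, 15, 16, 17, 18, 19,
           20, 23, 24, 25, 29, 30, 31]
print(tanimoto_sets(folded, folded2))  # 14 shared bits / 28 bits in the union = 0.5
```
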
property bit_count
property bits
clear()[source]

Clear temporary (and possibly large) values.

property counts
property density
fold(bits=1024, method=0, linked=True)[source]

Return fingerprint for bitvector folded to size bits.

Parameters
  • bits (int, optional) – Length of new bitvector, ideally a power of 2.

  • method ({0, 1}, optional) – Method to use for folding.

    0

    partitioning (array is divided into equal sized arrays of length bits which are bitwise combined with OR)

    1

    compression (adjacent bits pairs are combined with OR until length is bits)

  • linked (bool, optional) – Link folded and unfolded fingerprints for easy referencing. Set to False if intending to save and want to reduce file size.

Returns

Fingerprint of folded bitvector

Return type

Fingerprint
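
The two folding methods can be sketched on sparse indices alone. This is an illustration rather than e3fp's code: it assumes the original length is the folded length times a power of 2, so that partitioning maps index i to i % bits and repeated pairwise compression maps i to i // (original_bits // bits). The helper names are hypothetical.

```python
def fold_partition(indices, folded_bits):
    """Method 0 sketch: equal-sized chunks are ORed, so i lands on i % folded_bits."""
    return sorted({i % folded_bits for i in indices})


def fold_compress(indices, bits, folded_bits):
    """Method 1 sketch: each run of bits // folded_bits adjacent bits is ORed."""
    block = bits // folded_bits
    return sorted({i // block for i in indices})


indices = [684, 559, 629, 192, 835]  # first five indices from the example above
print(fold_partition(indices, 32))       # [0, 3, 12, 15, 21]
print(fold_compress(indices, 1024, 32))  # [6, 17, 19, 21, 26]
```

Using a set before sorting is what implements the OR: two indices that land on the same folded position yield a single "on" bit.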

classmethod from_bitstring(bitstring, level=-1, **kwargs)[source]

Initialize from bitstring (e.g. ‘10010011’).

Parameters
  • bitstring (str) – String of 1s and 0s.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns

fingerprint

Return type

Fingerprint

classmethod from_fingerprint(fp, **kwargs)[source]

Initialize by copying existing fingerprint.

Parameters

fp (Fingerprint) – Existing fingerprint.

Returns

fingerprint

Return type

Fingerprint

classmethod from_indices(indices, bits=4294967296, level=-1, **kwargs)[source]

Initialize from an array of indices.

Parameters
  • indices (array_like of int) – Indices in a sparse bitvector of length bits which correspond to 1.

  • bits (int, optional) – Number of bits in array. Indices will be log2(bits)-bit integers.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns

fingerprint

Return type

Fingerprint

classmethod from_rdkit(rdkit_fprint, **kwargs)[source]

Initialize from RDKit fingerprint.

If provided fingerprint is of length 2^32 - 1, assumes real fingerprint is of length 2^32.

Parameters
  • rdkit_fprint (RDKit ExplicitBitVect or SparseBitVect) – Existing RDKit fingerprint.

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns

fingerprint

Return type

Fingerprint

classmethod from_vector(vector, level=-1, **kwargs)[source]

Initialize from vector.

Parameters
  • vector (numpy.ndarray or scipy.sparse.csr_matrix) – Array of bits/counts/floats

  • level (int, optional) – Level of fingerprint, corresponding to fingerprinting iterations.

  • name (str, optional) – Name of fingerprint.

  • props (dict, optional) – Custom properties of fingerprint, consisting of a string keyword and some value.

Returns

fingerprint

Return type

Fingerprint

get_count(index)[source]

Return count of index in fingerprint.

Defaults to 1 if index is in self.indices.

Returns

Count of bit in fingerprint

Return type

int

get_folding_index_map()[source]

Get map of sparse indices to folded indices.

Returns

Map of sparse index (keys) to corresponding folded index.

Return type

dict

get_prop(key)[source]

Get property. If not set, raise KeyError.

get_unfolding_index_map()[source]

Get map of sparse indices to unfolded indices.

Returns

Map of sparse index (keys) to set of corresponding unfolded indices.

Return type

dict
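
The shape of the two index maps can be sketched as follows. The sketch assumes folding by partitioning (index i maps to i % bits); the helper names are hypothetical stand-ins, not the methods documented above.

```python
def folding_index_map(indices, folded_bits):
    """Map each unfolded sparse index to its folded index."""
    return {i: i % folded_bits for i in indices}


def unfolding_index_map(indices, folded_bits):
    """Map each folded index to the set of unfolded indices that produced it."""
    unfolded = {}
    for i in indices:
        unfolded.setdefault(i % folded_bits, set()).add(i)
    return unfolded


indices = [684, 396, 559]  # 684 and 396 both fold to position 12
fold_map = folding_index_map(indices, 32)
unfold_map = unfolding_index_map(indices, 32)
```

The folding map is many-to-one, which is why the unfolding map has to hold a set of indices per key.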

property index_id_map
property indices
property level
mean()[source]

Return mean, i.e. proportion of “on” bits in fingerprint.

Returns

Mean

Return type

float

property mol
property name
property props
reset()[source]

Reset all values.

set_prop(key, val)[source]

Set property.

std()[source]

Return standard deviation of fingerprint.

Returns

Standard deviation

Return type

float
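
For a bit fingerprint, the mean and standard deviation follow directly from the number of "on" bits. The sketch below assumes the population standard deviation (no degrees-of-freedom correction, as in NumPy's default); whether e3fp uses exactly this convention is an assumption. The helper names are hypothetical.

```python
import math


def bit_mean(bit_count, bits):
    """Proportion of "on" bits in a bitvector of length bits."""
    return bit_count / bits


def bit_std(bit_count, bits):
    """Population standard deviation of a 0/1 vector: sqrt(p * (1 - p))."""
    p = bit_count / bits
    return math.sqrt(p * (1 - p))


# A 32-bit fingerprint with 22 "on" bits.
print(bit_mean(22, 32))  # 0.6875
std = bit_std(22, 32)    # sqrt(0.6875 * 0.3125)
```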

to_bitstring()[source]

Get bitstring as string of 1s and 0s.

Returns

bitstring

Return type

str
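
The relationship between sparse indices and the bitstring can be sketched in plain Python. The ordering (index 0 is the leftmost character) is read off the class example, where the printed vector and bitstring agree position by position. The helper names are hypothetical.

```python
def indices_to_bitstring(indices, bits):
    """Write "1" at every "on" index and "0" elsewhere, index 0 first."""
    on_bits = set(indices)
    return "".join("1" if i in on_bits else "0" for i in range(bits))


def bitstring_to_indices(bitstring):
    """Recover the sorted "on" indices from a string of 1s and 0s."""
    return [i for i, char in enumerate(bitstring) if char == "1"]


print(indices_to_bitstring([0, 3, 6, 7], 8))  # 10010011
print(bitstring_to_indices("10010011"))       # [0, 3, 6, 7]
```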

to_bitvector(sparse=True)[source]

Get full bitvector.

Returns

Bitvector

Return type

numpy.ndarray or scipy.sparse.csr_matrix of bool

to_rdkit()[source]

Convert to RDKit fingerprint.

If number of bits exceeds 2^31 - 1, fingerprint will be folded to length 2^31 - 1 before conversion.

Returns

rdkit_fprint – Convert to bitvector used for RDKit fingerprints. If self.bits is less than 10^5, ExplicitBitVect is used. Otherwise, SparseBitVect is used.

Return type

RDKit ExplicitBitVect or SparseBitVect

to_vector(sparse=True, dtype=None)[source]

Get vector of bits/counts/floats.

Returns

Vector of bits/counts/floats

Return type

numpy.ndarray or scipy.sparse.csr_matrix

unfold()[source]

Return unfolded parent fingerprint for bitvector.

Returns

Fingerprint of unfolded bitvector. If None, return None.

Return type

Fingerprint

update_props(props_dict)[source]

Set multiple properties at once.

vector_dtype

alias of bool_

class FloatFingerprint(indices=None, counts=None, bits=4294967296, level=-1, name=None, props={}, **kwargs)[source]

Bases: CountFingerprint

A Fingerprint that stores float counts.

Nearly identical to CountFingerprint. Mainly a naming convention, but count values are stored as floats.

See also

Fingerprint

A fingerprint that stores indices of “on” bits

CountFingerprint

A fingerprint that stores number of occurrences of each index

property counts
vector_dtype

alias of float64

add(fprints, weights=None)[source]

Add fingerprints by count to new CountFingerprint.

If any of the fingerprints are FloatFingerprint, resulting fingerprint is likewise a FloatFingerprint. Otherwise, resulting fingerprint is CountFingerprint.

Parameters
  • fprints (iterable of Fingerprint) – Fingerprints to be added by count.

  • weights (iterable of float, optional) – Weights for weighted sum. If provided, the result is a FloatFingerprint.

Returns

Fingerprint with counts as sum of counts in fprints.

Return type

CountFingerprint or FloatFingerprint

See also

mean

coerce_to_valid_dtype(dtype)[source]

Coerce provided NumPy data type to closest fingerprint data type.

If provided dtype cannot be read, default corresponding to bit Fingerprint is returned.

Parameters

dtype (numpy.dtype or str) – Input NumPy data type.

Returns

Output NumPy data type.

Return type

numpy.dtype

diff_counts_dict(fp1, fp2, only_positive=False)[source]

Given two fingerprints, returns difference of their counts dicts.

Parameters
  • fp1, fp2 (Fingerprint) – Fingerprint objects, fp2 subtracted from fp1.

  • only_positive (bool, optional) – Return only positive counts, negative being thresholded to 0.

Returns

counts_diff – Count indices in either fp1 or fp2 with value as diff of counts.

Return type

dict

See also

sum_counts_dict
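
The difference of two counts dicts can be sketched on plain dicts. This is an illustration, not e3fp's implementation: it assumes that indices whose resulting value is 0 are dropped, which the description above does not state. The helper name diff_counts is hypothetical.

```python
def diff_counts(counts1, counts2, only_positive=False):
    """Subtract counts2 from counts1 over the union of their indices."""
    diff = {}
    for index in set(counts1) | set(counts2):
        value = counts1.get(index, 0) - counts2.get(index, 0)
        if only_positive:
            value = max(value, 0)  # threshold negative differences to 0
        if value != 0:             # assumption: zero entries are not kept
            diff[index] = value
    return diff


counts1 = {1: 3, 2: 5}
counts2 = {2: 7, 4: 1}
signed = diff_counts(counts1, counts2)
positive = diff_counts(counts1, counts2, only_positive=True)
```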

dtype_from_fptype(fp_type)[source]

Get NumPy data type from fingerprint type.

Parameters

fp_type (class or Fingerprint) – Class of fingerprint

Returns

NumPy data type

Return type

numpy.dtype

fptype_from_dtype(dtype)[source]

Get corresponding fingerprint type from NumPy data type.

Parameters

dtype (numpy.dtype or str) – NumPy data type.

Returns

class – Class of fingerprint

Return type

{Fingerprint, CountFingerprint, FloatFingerprint}

load(f, update_structure=True)[source]

Load Fingerprint object from file.

Parameters
  • f (str or File) – File name or file-like object to load file from.

  • update_structure (bool, optional) – Attempt to update the class structure by initializing a new, shiny fingerprint from each fingerprint in the file. Useful for guaranteeing that old, dusty fingerprints are always upgradeable.

Returns

Pickled fingerprint.

Return type

Fingerprint

See also

loadz, save

loadz(f, update_structure=True)[source]

Load Fingerprint objects from file.

Parameters
  • f (str or File) – File name or file-like object to load file from.

  • update_structure (bool, optional) – Attempt to update the class structure by initializing a new, shiny fingerprint from each fingerprint in the file. Useful for guaranteeing that old, dusty fingerprints are always upgradeable. If this doesn’t work, falls back to the original saved fingerprint.

Returns

Fingerprints in pickle.

Return type

list of Fingerprint

See also

load, savez

mean(fprints, weights=None)[source]

Average fingerprints to generate FloatFingerprint.

Parameters
  • fprints (iterable of Fingerprint) – Fingerprints to be averaged by count.

  • weights (array_like of float, optional) – Weights for weighted mean. Weights are normalized to a sum of 1.

Returns

Fingerprint with float counts as average of counts in fprints.

Return type

FloatFingerprint
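
Averaging with normalized weights can be sketched on plain counts dicts. This is a minimal illustration under the stated rule that weights are normalized to a sum of 1; it operates on dicts rather than Fingerprint objects, and the helper name mean_counts is hypothetical.

```python
def mean_counts(counts_dicts, weights=None):
    """Weighted average of {index: count} dicts; weights are normalized to sum to 1."""
    counts_dicts = list(counts_dicts)
    if weights is None:
        weights = [1.0] * len(counts_dicts)  # unweighted mean
    total = float(sum(weights))
    averaged = {}
    for counts, weight in zip(counts_dicts, weights):
        for index, count in counts.items():
            # An index missing from a dict contributes a count of 0.
            averaged[index] = averaged.get(index, 0.0) + count * weight / total
    return averaged


dicts = [{1: 2, 3: 4}, {1: 4}]
plain = mean_counts(dicts)
weighted = mean_counts(dicts, weights=[3, 1])
```

Because the weights are normalized, weights of [3, 1] and [0.75, 0.25] give the same result.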

save(f, fp, **kwargs)[source]

Save Fingerprint object to file.

Parameters
  • f (str or File) – filename str or file-like object to save file to

  • fp (Fingerprint) – Fingerprint to save to file

  • protocol ({0, 1, 2, None}, optional) – Pickle protocol to use. If None, highest available protocol is used. This will not affect fingerprint loading.

Returns

Success or fail

Return type

bool

See also

savez, load
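
The protocol handling described above can be sketched with the standard pickle module. This is an illustration that assumes the file is a plain pickle; the actual on-disk layout used by e3fp is not shown here. Note that pickle.dump itself uses the default protocol when given None, so the sketch maps None to pickle.HIGHEST_PROTOCOL explicitly. The helper names and the dict standing in for a fingerprint are hypothetical.

```python
import os
import pickle
import tempfile


def save_pickle(path, obj, protocol=None):
    """Pickle obj to path; return True on success and False on failure."""
    if protocol is None:
        protocol = pickle.HIGHEST_PROTOCOL
    try:
        with open(path, "wb") as handle:
            pickle.dump(obj, handle, protocol=protocol)
    except (OSError, pickle.PicklingError):
        return False
    return True


def load_pickle(path):
    """Unpickle a single object; the protocol is detected automatically."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


stand_in = {"indices": [0, 3, 6, 7], "bits": 8}  # stand-in for a fingerprint
path = os.path.join(tempfile.mkdtemp(), "fingerprint.pkl")
saved = save_pickle(path, stand_in, protocol=2)
restored = load_pickle(path)
```

The protocol only matters when writing, which matches the note that it does not affect loading.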

savez(f, *fps, **kwargs)[source]

Save multiple Fingerprint objects to file.

Parameters
  • f (str or File) – filename str or file-like object to save file to

  • fps (list of Fingerprint) – List of Fingerprints to save to file

  • protocol ({0, 1, 2, None}, optional) – Pickle protocol to use. If None, highest available protocol is used. This will not affect fingerprint loading.

Returns

Success or fail

Return type

bool

See also

save, loadz

sum_counts_dict(*fprints, **kwargs)[source]

Given fingerprints, return sum of their counts dicts.

If an optional weights iterable of the same length as fprints is provided, the weighted sum is returned.

Parameters
  • *fprints – One or more Fingerprint objects

  • weights (iterable of float, optional) – Weights for weighted mean. Weights are normalized to a sum of 1.

Returns

Dict of non-zero count indices in any of the fprints with value as sum of counts.

Return type

dict

See also

diff_counts_dict
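
Summing counts dicts, optionally with weights, can be sketched on plain dicts. This is an illustration, not e3fp's implementation: the weights are applied as given, without normalization, and the helper name sum_counts is hypothetical.

```python
def sum_counts(*counts_dicts, weights=None):
    """Sum {index: count} dicts, optionally scaling each dict by a weight."""
    if weights is None:
        weights = [1] * len(counts_dicts)
    total = {}
    for counts, weight in zip(counts_dicts, weights):
        for index, count in counts.items():
            total[index] = total.get(index, 0) + count * weight
    # Keep only indices with a non-zero summed count.
    return {index: count for index, count in total.items() if count != 0}


counts1 = {1: 3, 2: 5}
counts2 = {2: 7, 4: 1}
summed = sum_counts(counts1, counts2)
weighted = sum_counts(counts1, counts2, weights=[0.5, 2])
```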