Average Nucleotide Identity#

Sketch#

class pyfastani.Sketch#

An index computing minimizers over the reference genomes.

Use this class to add reference genomes with the add_genome or add_draft methods, then call the index method to obtain a Mapper that can be used to map query genomes.

minimizers#

A view over the minimizers currently recorded in the sketch.

Type:

Minimizers
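As a rough illustration of what a minimizer is, the sketch below keeps, for every run of window_size consecutive k-mers, the k-mer with the smallest hash. It is a simplified model and not the library's implementation: the function names kmer_hash and minimizers are hypothetical, CRC32 stands in for the real hash function, and the reverse strand is ignored.

```python
import zlib


def kmer_hash(kmer: str) -> int:
    # CRC32 is used only because it is deterministic and in the standard
    # library; it is a stand-in, not the hash used by the library.
    return zlib.crc32(kmer.encode("ascii"))


def minimizers(sequence: str, k: int, window_size: int) -> list[tuple[int, int]]:
    """Return the distinct (hash, position) minimizers of `sequence`.

    For every run of `window_size` consecutive k-mers, the k-mer with the
    smallest hash is kept (the leftmost one wins ties).
    """
    # Hash every k-mer of the sequence together with its start position.
    hashes = [
        (kmer_hash(sequence[i:i + k]), i)
        for i in range(len(sequence) - k + 1)
    ]
    selected = []
    for start in range(len(hashes) - window_size + 1):
        best = min(hashes[start:start + window_size])
        # Consecutive windows often share their minimizer: record it once.
        if not selected or selected[-1] != best:
            selected.append(best)
    return selected
```

In this model, a sequence with fewer than window_size k-mers yields no minimizers at all, which mirrors the hints about short sequences given for add_draft and add_genome.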

__init__(*, k=16, fragment_length=3000, minimum_fraction=0.2, p_value=0.001, percentage_identity=80, reference_size=5000000000.0, protein=False)#

Create a new FastANI sequence sketch.

Keyword Arguments:
  • k (int) – The size of the k-mers. FastANI authors recommend a size of at most 16, but any positive number up to pyfastani.MAX_KMER_SIZE will work.

  • fragment_length (int) – The length of the blocks into which the query is split. Queries shorter than this number won’t be processed.

  • minimum_fraction (float) – The minimum fraction of the genome that must be shared for a hit to be reported. If the reference and query genome sizes differ, the smaller of the two is considered.

  • p_value (float) – The p-value cutoff. Used to determine the recommended window size.

  • percentage_identity (int) – An identity percentage above which ANI values between two sequences can be trusted. Used to determine the recommended window size.

  • reference_size (int) – An estimate of the reference length. Used to determine the recommended window size.

  • protein (bool) – Whether or not protein sequences are expected. If True, the alphabet size is changed from 4 to 20, minimizers are not computed on the “reverse” strand, and the window size is set to 1.

add_draft(name, contigs)#

Add a reference draft genome to the sketcher.

Using this method is fine even when the genome has a single contig, although Sketch.add_genome is easier to use in that case.

Parameters:
  • name (object) – The name of the genome to add. When a query genome matches this reference genome, name will be exposed as the Hit.name attribute of the corresponding hit.

  • contigs (iterable of str or bytes) – The contigs of the genome.

Returns:

Sketch – the object itself, for method chaining.

Hint

Contigs shorter than the window size or the k-mer size will be skipped.

add_genome(name, sequence)#

Add a reference genome to the sketcher.

This method is a shortcut for Sketch.add_draft when a genome is complete (i.e. only contains a single contig).

Parameters:
  • name (object) – The name of the genome to add. When a query genome matches this reference genome, name will be exposed as the Hit.name attribute of the corresponding hit.

  • sequence (str or bytes) – The sequence of the genome.

Returns:

Sketch – the object itself, for method chaining.

Hint

The sequence must be longer than the window size and the k-mer size to be sketched, otherwise no minimizers will be computed.
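The choice between add_genome and add_draft can be automated with a small helper such as the one below. This is a minimal sketch: load_references is a hypothetical name, and the helper only relies on the add_genome and add_draft signatures documented above, so any object exposing them (for instance a pyfastani.Sketch) can be passed in.

```python
from typing import Mapping, Sequence, Union

Contig = Union[str, bytes]


def load_references(sketch, genomes: Mapping[object, Sequence[Contig]]):
    """Add every genome in `genomes` to `sketch` and return the sketch.

    `sketch` is expected to expose the documented `add_genome` and
    `add_draft` methods; `genomes` maps a genome name to its contigs.
    """
    for name, contigs in genomes.items():
        if len(contigs) == 1:
            # Single contig: use the shortcut for complete genomes.
            sketch.add_genome(name, contigs[0])
        else:
            # Several contigs: register the genome as a draft.
            sketch.add_draft(name, contigs)
    return sketch
```

Because the sketch is returned, the call can be followed directly by index(), for example load_references(pyfastani.Sketch(), genomes).index().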

clear()#

Reset the Sketch, removing any reference genome it may contain.

Returns:

Sketch – the object itself, for method chaining.

index()#

Index the reference genomes for fast lookups using the minimizers.

Once all the reference sequences have been added to the Sketch, use this method to create an efficient mapper, dropping the most common minimizers among the reference sequences.

Returns:

Mapper – An indexed mapper that can be used for fast querying.

Note

Calling this method will effectively transfer ownership of the data to the Mapper, and reset the internals of this Sketch. It will be essentially cleared, but should remain usable.

fragment_length#

The minimum read length to use for mapping.

Type:

int
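To make the role of the fragment length concrete, the sketch below cuts a query into consecutive blocks of that length. It is an illustration under stated assumptions rather than the library's code: split_fragments is a hypothetical name, the blocks are assumed not to overlap, and a trailing remainder shorter than one block is assumed to be dropped.

```python
def split_fragments(sequence: str, fragment_length: int = 3000) -> list[str]:
    """Cut `sequence` into consecutive, non-overlapping blocks.

    Assumption: a trailing remainder shorter than `fragment_length` is
    dropped, so a sequence shorter than one block yields no fragments.
    """
    count = len(sequence) // fragment_length
    return [
        sequence[i * fragment_length:(i + 1) * fragment_length]
        for i in range(count)
    ]
```

Under these assumptions a 7000-base query gives two 3000-base fragments, and a 2999-base query gives none, which is consistent with queries shorter than the fragment length not being processed.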

k#

The k-mer size used for sketching.

Type:

int

minimum_fraction#

The minimum genome fraction required to trust ANI values.

Type:

float
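The filter described by this threshold can be pictured as follows. This is a hedged sketch, not the library's logic: passes_minimum_fraction is a hypothetical name, lengths are assumed to be expressed in bases, and the comparison is assumed to be inclusive.

```python
def passes_minimum_fraction(
    shared_length: int,
    query_length: int,
    reference_length: int,
    minimum_fraction: float = 0.2,
) -> bool:
    """Tell whether enough of the smaller genome is shared to report a hit."""
    # The smaller of the two genomes sets the bar, as described for the
    # `minimum_fraction` keyword argument.
    smaller = min(query_length, reference_length)
    return shared_length >= minimum_fraction * smaller
```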

names#

The names of the sequences currently sketched.

Type:

list of str

occurences_threshold#

The occurrence threshold above which minimizers are ignored.

Type:

int
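The effect of such a threshold can be sketched with a simple counter. The helper below is illustrative only: drop_frequent is a hypothetical name, minimizers are represented by plain integer hashes, and a minimizer is assumed to be ignored when it occurs strictly more often than the threshold.

```python
from collections import Counter


def drop_frequent(minimizer_hashes: list[int], threshold: int) -> list[int]:
    """Keep only the hashes that occur at most `threshold` times."""
    counts = Counter(minimizer_hashes)
    # Hashes seen more than `threshold` times are too common to be
    # informative and are removed; the original order is preserved.
    return [h for h in minimizer_hashes if counts[h] <= threshold]
```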

p_value#

The p-value threshold for similarity when estimating hits.

Type:

float

percentage_identity#

The identity threshold for similarity when estimating hits.

Type:

float

protein#

Whether or not the object expects peptides or nucleotides.

Type:

bool

window_size#

The window size used for sketching.

Type:

int

Mapper#

class pyfastani.Mapper#

A genome mapper using Murmur3 hashes and k-mers to compute ANI.

minimizers#

A view over the minimizers recorded in the mapper.

Type:

Minimizers

query_draft(contigs, threads=0)#

Query the mapper for a draft genome.

Parameters:
  • contigs (iterable of str or bytes) – The contigs of the genome to query the mapper with.

  • threads (int) – The number of threads to use to run the fragment mapping in parallel. Pass 0 (the default) to auto-detect the number of threads on the local machine.

Returns:

list of Hit – The hits found for the query.

Hint

Sequence must be larger than the window size, the k-mer size, and the fragment length to be mapped, otherwise an empty list of hits will be returned.

Note

This method is reentrant and releases the GIL when hashing the blocks, which allows querying the mapper in parallel for several individual genomes.

Added in version 0.4.0: The threads argument.

query_genome(sequence, threads=0)#

Query the mapper for a complete genome.

Parameters:
  • sequence (str or bytes) – The closed genome to query the mapper with.

  • threads (int) – The number of threads to use to run the fragment mapping in parallel. Pass 0 (the default) to auto-detect the number of threads on the local machine.

Returns:

list of Hit – The hits found for the query.

Hint

Sequence must be larger than the window size, the k-mer size, and the fragment length to be mapped, otherwise an empty list of hits will be returned.

Note

This method is reentrant and releases the GIL when hashing the blocks, which allows querying the mapper in parallel for several individual genomes.

Added in version 0.4.0: The threads argument.
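Since the method is reentrant and releases the GIL, several genomes can be queried from a thread pool. The helper below is a minimal sketch of that pattern: query_many is a hypothetical name, and it only assumes an object exposing the documented query_genome(sequence, threads=...) method, such as a Mapper from version 0.4.0 or later.

```python
from concurrent.futures import ThreadPoolExecutor


def query_many(mapper, genomes, workers=4):
    """Query `mapper` with several genomes at once.

    `genomes` maps a label to a sequence; the result maps each label to
    the list of hits returned by `mapper.query_genome`.
    """
    def run(item):
        label, sequence = item
        # One thread per query: the pool provides the parallelism.
        return label, mapper.query_genome(sequence, threads=1)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run, genomes.items()))
```

Passing threads=1 to each query keeps the pool and the per-query fragment mapping from competing for the same cores; with only a few large genomes, a single call with threads=0 may be the simpler choice.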

fragment_length#

The minimum read length to use for mapping.

Type:

int

k#

The k-mer size used for sketching.

Type:

int

lookup_index#

The index of initial minimizer positions.

This table is used to retrieve at which positions the minimizers appear in the reference genomes.

Type:

MinimizerLookupIndex

minimum_fraction#

The minimum genome fraction required to trust ANI values.

Type:

float

p_value#

The p-value threshold for similarity when estimating hits.

Type:

float

percentage_identity#

The identity threshold for similarity when estimating hits.

Type:

float

protein#

Whether or not the object expects peptides or nucleotides.

Type:

bool

window_size#

The window size used for sketching.

Type:

int