Reference documentation¶

Documentation for module functions (for developers)

assign.py¶

poppunk_assign main function

PopPUNK.assign.assign_query(dbFuncs, ref_db, q_files, output, qc_dict, update_db, write_references, distances, serial, stable, threads, overwrite, plot_fit, graph_weights, model_dir, strand_preserved, previous_clustering, external_clustering, core, accessory, gpu_sketch, gpu_dist, gpu_graph, deviceid, save_partial_query_graph, use_full_network)[source]¶: Code for assign query mode for CLI

PopPUNK.assign.assign_query_hdf5(dbFuncs, ref_db, qNames, output, qc_dict, update_db, write_references, distances, serial, stable, threads, overwrite, plot_fit, graph_weights, model_dir, strand_preserved, previous_clustering, external_clustering, core, accessory, gpu_dist, gpu_graph, save_partial_query_graph, use_full_network)[source]¶: Code for assign query mode taking hdf5 as input. Written as a separate function so it can be called by web APIs

PopPUNK.assign.main()[source]¶: Main function. Parses cmd line args and runs in the specified mode.

bgmm.py¶

Functions used to fit the mixture model to a database. Access using BGMMFit.

BGMM using sklearn

PopPUNK.bgmm.findBetweenLabel_bgmm(means, assignments)[source]¶

Identify between-strain links

Finds the component with the largest number of points assigned to it

Args:

means (numpy.array): K x 2 array of mixture component means
assignments (numpy.array): Sample cluster assignments

Returns:

between_label (int): The cluster label with the most points assigned to it

PopPUNK.bgmm.findWithinLabel(means, assignments, rank=0)[source]¶

Identify within-strain links

Finds the component with mean closest to the origin and also makes sure some samples are assigned to it (in the case of small weighted components with a Dirichlet prior some components are unused)

Args:

means (numpy.array): K x 2 array of mixture component means
assignments (numpy.array): Sample cluster assignments
rank (int): Which label to find, ordered by distance from origin. 0-indexed. (default = 0)

Returns:

within_label (int): The cluster label for the within-strain assignments
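
The selection rule described for findWithinLabel() can be sketched in plain Python. This is an illustration only, not the PopPUNK implementation: the helper name `closest_used_component` is hypothetical, and plain tuples and lists stand in for the numpy arrays the real function takes.

```python
import math

def closest_used_component(means, assignments, rank=0):
    """Return the label of the used component whose mean is rank-th closest to the origin."""
    used = set(assignments)  # components with at least one sample assigned
    # Order the used component labels by Euclidean distance of their mean from (0, 0)
    ordered = sorted(
        (label for label in range(len(means)) if label in used),
        key=lambda label: math.hypot(means[label][0], means[label][1]),
    )
    return ordered[rank]

# Component 2 has the mean closest to the origin, but no samples are assigned to it
means = [(0.02, 0.30), (0.001, 0.01), (0.0005, 0.005)]
assignments = [0, 0, 1, 1, 1, 0]
within_label = closest_used_component(means, assignments)
```

Here the unused component is skipped, so the label returned is that of the closest component that actually has samples.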

PopPUNK.bgmm.fit2dMultiGaussian(X, dpgmm_max_K=2)[source]¶

Main function to fit BGMM model, called from fit()

Fits the mixture model specified, saves model parameters to a file, and assigns the samples to a component. Writes fit summary stats to STDERR.

Args:

X (np.array): n x 2 array of core and accessory distances for n samples. This should be subsampled to 100000 samples.
dpgmm_max_K (int): Maximum number of components to use with the EM fit. (default = 2)

Returns:

dpgmm (sklearn.mixture.BayesianGaussianMixture): Fitted bgmm model

PopPUNK.bgmm.log_likelihood(X, weights, means, covars, scale)[source]¶

Modified sklearn GMM function predicting distribution membership

Returns the mixture LL for points X. Used by assign_samples() and plot_contours()

Args:

X (numpy.array): n x 2 array of core and accessory distances for n samples
weights (numpy.array): Component weights from fit2dMultiGaussian()
means (numpy.array): Component means from fit2dMultiGaussian()
covars (numpy.array): Component covariances from fit2dMultiGaussian()
scale (numpy.array): Scaling of core and accessory distances from fit2dMultiGaussian()

Returns:

logprob (numpy.array): The log of the probabilities under the mixture model
lpr (numpy.array): The components of the log probability from each mixture component
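
To make the relationship between the two return values concrete, here is a minimal sketch of a mixture log-likelihood for a single 2D point. It is a simplified illustration, not PopPUNK's code: the function names are hypothetical, the scale argument is omitted, and it handles one point rather than an n x 2 array.

```python
import math

def gaussian_logpdf_2d(x, mean, cov):
    """Log density of a 2D normal with a full 2 x 2 covariance at point x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean), using the closed-form 2 x 2 inverse
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)

def mixture_log_likelihood(x, weights, means, covars):
    """Return (logprob, lpr): the mixture log-likelihood and the per-component terms."""
    lpr = [
        math.log(w) + gaussian_logpdf_2d(x, m, c)
        for w, m, c in zip(weights, means, covars)
    ]
    # Log-sum-exp over components, shifted by the maximum for numerical stability
    top = max(lpr)
    logprob = top + math.log(sum(math.exp(v - top) for v in lpr))
    return logprob, lpr

weights = [0.5, 0.5]
means = [(0.0, 0.0), (1.0, 1.0)]
covars = [((1.0, 0.0), (0.0, 1.0))] * 2
logprob, lpr = mixture_log_likelihood((0.0, 0.0), weights, means, covars)
```

Each entry of `lpr` is the weighted log density under one component, and `logprob` combines them, which mirrors the two values listed under Returns.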

PopPUNK.bgmm.log_multivariate_normal_density(X, means, covars, min_covar=1e-07)[source]¶

Log likelihood of multivariate normal density distribution

Used to calculate per component Gaussian likelihood in assign_samples()

Args:

X (numpy.array): n x 2 array of core and accessory distances for n samples
means (numpy.array): Component means from fit2dMultiGaussian()
covars (numpy.array): Component covariances from fit2dMultiGaussian()
min_covar (float): Minimum covariance, added when Cholesky decomposition fails due to too few observations (default = 1.e-7)

Returns:

log_prob (numpy.array): An n-vector with the log-likelihoods for each sample being in this component

dbscan.py¶

Functions used to fit DBSCAN to a database. Access using DBSCANFit.

DBSCAN using hdbscan

PopPUNK.dbscan.evaluate_dbscan_clusters(model)[source]¶

Evaluate whether fitted dbscan model contains non-overlapping clusters

Args:

model (DBSCANFit): Fitted model from fit()

Returns:

indistinct (bool): Boolean indicating whether putative within- and between-strain clusters of points overlap

PopPUNK.dbscan.findBetweenLabel(assignments, within_cluster)[source]¶

Identify between-strain links from a DBSCAN model

Finds the component containing the largest number of between-strain links, excluding the cluster identified as containing within-strain links.

Args:

assignments (numpy.array): Sample cluster assignments
within_cluster (int): Cluster ID assigned to within-strain assignments, from findWithinLabel()

Returns:

between_cluster (int): The cluster label for the between-strain assignments
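
A rough sketch of this selection, assuming noise points carry the label -1 (the HDBSCAN convention) and are ignored. Whether PopPUNK excludes noise in exactly this way is an assumption, and `largest_other_cluster` is a hypothetical name.

```python
from collections import Counter

def largest_other_cluster(assignments, within_cluster, noise_label=-1):
    """Return the most populated cluster label, ignoring the within-strain cluster and noise."""
    counts = Counter(
        label
        for label in assignments
        if label != within_cluster and label != noise_label
    )
    return counts.most_common(1)[0][0]

# Cluster 0 is the within-strain cluster; -1 marks noise (assumed convention)
assignments = [0, 0, 1, 1, 1, 1, 2, -1, -1, -1, -1, -1]
between_cluster = largest_other_cluster(assignments, within_cluster=0)
```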

PopPUNK.dbscan.fitDbScan(X, min_samples, min_cluster_size, cache_out, use_gpu=False)[source]¶

Function to fit DBSCAN model as an alternative to the Gaussian

Fits the DBSCAN model to the distances using hdbscan

Args:

X (np.array): n x 2 array of core and accessory distances for n samples
min_samples (int): Parameter for DBSCAN clustering ‘conservativeness’
min_cluster_size (int): Minimum number of points in a cluster for HDBSCAN
cache_out (str): Prefix for DBSCAN cache used for refitting
use_gpu (bool): Whether GPU algorithms should be used in DBSCAN fitting

Returns:

hdb (hdbscan.HDBSCAN or cuml.cluster.HDBSCAN): Fitted HDBSCAN to subsampled data
labels (list): Cluster assignments of each sample
n_clusters (int): Number of clusters used

mandrake.py¶

PopPUNK.mandrake.generate_embedding(seqLabels, accMat, perplexity, outPrefix, overwrite, kNN=50, maxIter=10000000, n_threads=1, use_gpu=False, device_id=0)[source]¶

Generate t-SNE projection using accessory distances

Writes a plot of t-SNE clustering of accessory distances (.dot)

Args:

seqLabels (list): Processed names of sequences being analysed.
accMat (numpy.array): n x n array of accessory distances for n samples.
perplexity (int): Perplexity parameter passed to t-SNE
outPrefix (str): Prefix for all generated output files, which will be placed in outPrefix subdirectory
overwrite (bool): Overwrite existing output if present (default = False)
kNN (int): Number of neighbours to use with SCE (cannot be > n_samples) (default = 50)
maxIter (int): Number of iterations to run (default = 10000000)
n_threads (int): Number of CPU threads to use (default = 1)
use_gpu (bool): Whether to use GPU libraries
device_id (int): Device ID of GPU to be used (default = 0)

Returns:

mandrake_filename (str): Filename with .dot of embedding

models.py¶

network.py¶

Functions used to construct the network, and update with new queries. Main entry point is constructNetwork() for new reference databases, and findQueryLinksToNetwork() for querying databases.

refine.py¶

Functions used to refine an existing model. Access using RefineFit.

plot.py¶

Plots of GMM results, k-mer fits, and microreact output

PopPUNK.plot.createMicroreact(prefix, microreact_files, api_key=None, info_csv=None)[source]¶

Creates a .microreact file, and an instance via the API

Args:

prefix (str): Prefix for output file
microreact_files (list): List of Microreact files [clusters, dot, tree, mst_tree]
api_key (str): API key for your account
info_csv (str): CSV file containing additional information for Microreact

PopPUNK.plot.distHistogram(dists, rank, outPrefix)[source]¶

Plot a histogram of distances (1D)

Args:

dists (np.array): Distance vector
rank (int): Rank (used for name and title)
outPrefix (str): Full path prefix for plot file

PopPUNK.plot.drawMST(mst, outPrefix, isolate_clustering, clustering_name, overwrite)[source]¶

Plot a layout of the minimum spanning tree

Args:

mst (graph_tool.Graph): A minimum spanning tree
outPrefix (str): Output prefix for save files
isolate_clustering (dict): Dictionary of ID: cluster, used for colouring vertices
clustering_name (str): Name of clustering scheme to be used for colouring
overwrite (bool): Overwrite existing output files

PopPUNK.plot.get_grid(minimum, maximum, resolution)[source]¶

Get a square grid of points to evaluate a function across

Used for plot_scatter() and plot_contours()

Args:

minimum (float): Minimum value for grid
maximum (float): Maximum value for grid
resolution (int): Number of points along each axis

Returns:

xx (numpy.array): x values across n x n grid
yy (numpy.array): y values across n x n grid
xy (numpy.array): n x 2 pairs of x, y values grid is over
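
The three return values can be pictured with a small pure-Python grid. This sketch assumes a meshgrid-style layout with both endpoints included, which may not match get_grid() exactly; `square_grid` is a hypothetical name and nested lists stand in for numpy arrays.

```python
def square_grid(minimum, maximum, resolution):
    """Build x values, y values and (x, y) pairs over a resolution x resolution grid."""
    step = (maximum - minimum) / (resolution - 1)  # requires resolution >= 2
    axis = [minimum + i * step for i in range(resolution)]
    xx = [list(axis) for _ in range(resolution)]  # x varies along each row
    yy = [[v] * resolution for v in axis]         # y is constant along each row
    # One (x, y) pair per grid point, in row-major order
    xy = [(x, y) for row_x, row_y in zip(xx, yy) for x, y in zip(row_x, row_y)]
    return xx, yy, xy

xx, yy, xy = square_grid(0.0, 1.0, 3)
```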

PopPUNK.plot.outputsForCytoscape(G, G_mst, isolate_names, clustering, outPrefix, epiCsv, queryList=None, suffix=None, writeCsv=True, use_partial_query_graph=None)[source]¶

Write outputs for cytoscape. A graphml of the network, and CSV with metadata

Args:

G (graph): The network to write
G_mst (graph): The minimum spanning tree of G
isolate_names (list): Ordered list of sequence names
clustering (dict): Dictionary of cluster assignments (keys are nodeNames).
outPrefix (str): Prefix for files to be written
epiCsv (str): Optional CSV of epi data to paste in the output in addition to the clusters.
queryList (list): Optional list of isolates that have been added as a query. (default = None)
suffix (string): String to append to network file name. (default = None)
writeCsv (bool): Whether to print CSV file to accompany network
use_partial_query_graph (str): File listing sequences to be included in output graph

PopPUNK.plot.outputsForGrapetree(combined_list, clustering, nj_tree, mst_tree, outPrefix, epiCsv, queryList=None, overwrite=False)[source]¶

Generate files for Grapetree

Write a neighbour joining tree (.nwk) from core distances and cluster assignment (.csv)

Args:

combined_list (list): Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
clustering (dict or dict of dicts): List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
nj_tree (str or None): String representation of a Newick-formatted NJ tree
mst_tree (str or None): String representation of a Newick-formatted minimum-spanning tree
outPrefix (str): Prefix for all generated output files, which will be placed in outPrefix subdirectory.
epiCsv (str): A CSV containing other information, to include with the CSV of clusters
queryList (list): Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
overwrite (bool): Overwrite existing output if present (default = False).

PopPUNK.plot.outputsForMicroreact(combined_list, clustering, nj_tree, mst_tree, accMat, perplexity, maxIter, outPrefix, epiCsv, queryList=None, overwrite=False, n_threads=1, use_gpu=False, device_id=0)[source]¶

Generate files for microreact

Output a neighbour joining tree (.nwk) from core distances, a plot of t-SNE clustering of accessory distances (.dot) and cluster assignment (.csv)

Args:

combined_list (list): Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
clustering (dict or dict of dicts): List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
nj_tree (str or None): String representation of a Newick-formatted NJ tree
mst_tree (str or None): String representation of a Newick-formatted minimum-spanning tree
accMat (numpy.array): n x n array of accessory distances for n samples.
perplexity (int): Perplexity parameter passed to mandrake
maxIter (int): Maximum iterations for mandrake
outPrefix (str): Prefix for all generated output files, which will be placed in outPrefix subdirectory
epiCsv (str): A CSV containing other information, to include with the CSV of clusters
queryList (list): Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
overwrite (bool): Overwrite existing output if present (default = False)
n_threads (int): Number of CPU threads to use (default = 1)
use_gpu (bool): Whether to use a GPU for t-SNE generation
device_id (int): Device ID of GPU to be used (default = 0)

Returns:

outfiles (list): List of output files created

PopPUNK.plot.outputsForPhandango(combined_list, clustering, nj_tree, mst_tree, outPrefix, epiCsv, queryList=None, overwrite=False)[source]¶

Generate files for Phandango

Write a neighbour joining tree (.tree) from core distances and cluster assignment (.csv)

Args:

combined_list (list): Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
clustering (dict or dict of dicts): List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
nj_tree (str or None): String representation of a Newick-formatted NJ tree
mst_tree (str or None): String representation of a Newick-formatted minimum-spanning tree
outPrefix (str): Prefix for all generated output files, which will be placed in outPrefix subdirectory
epiCsv (str): A CSV containing other information, to include with the CSV of clusters
queryList (list): Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
overwrite (bool): Overwrite existing output if present (default = False)
threads (int): Number of threads to use with rapidnj

PopPUNK.plot.plot_contours(model, assignments, title, out_prefix)[source]¶

Draw contours of mixture model assignments

Will draw the decision boundary for between/within in red

Args:

model (BGMMFit): Model we are plotting from
assignments (numpy.array): n-vectors of cluster assignments for model
title (str): The title to display above the plot
out_prefix (str): Prefix for output plot file (.pdf will be appended)

PopPUNK.plot.plot_database_evaluations(prefix, genome_lengths, ambiguous_bases)[source]¶

Plot histograms of sequence characteristics for database evaluation.

Args:

prefix (str): Prefix for output files
genome_lengths (list): Lengths of genomes in database
ambiguous_bases (list): Counts of ambiguous bases in genomes in database

PopPUNK.plot.plot_dbscan_results(X, y, n_clusters, out_prefix, use_gpu)[source]¶

Draw a scatter plot (png) to show the DBSCAN model fit

A scatter plot of core and accessory distances, coloured by component membership. Black is noise

Args:

X (numpy.array): n x 2 array of core and accessory distances for n samples.
y (numpy.array): n x 1 array of cluster assignments for n samples.
n_clusters (int): Number of clusters used (excluding noise)
out_prefix (str): Prefix for output file (.png will be appended)
use_gpu (bool): Whether model was fitted with GPU-enabled code

PopPUNK.plot.plot_evaluation_histogram(input_data, n_bins=100, prefix='hist', suffix='', plt_title='histogram', xlab='x')[source]¶

Plot histograms of sequence characteristics for database evaluation.

Args:

input_data (list): Input data (list of numbers)
n_bins (int): Number of bins to use for the histogram
prefix (str): Prefix of database
suffix (str): Suffix specifying plot type
plt_title (str): Title for plot
xlab (str): Title for the horizontal axis

PopPUNK.plot.plot_fit(klist, raw_matching, raw_fit, corrected_matching, corrected_fit, out_prefix, title)[source]¶

Draw a scatter plot (pdf) of k-mer sizes vs match probability, and the fit used to assign core and accessory distance

K-mer sizes on x-axis, log(pr(match)) on y - expect a straight line fit with intercept representing accessory distance and slope core distance

Args:

klist (list): List of k-mer sizes
raw_matching (list): Proportion of matching k-mers at each klist value
raw_fit (numpy.array): Fit to klist and raw_matching from fitKmerCurve()
corrected_matching (list): Corrected proportion of matching k-mers at each klist value
corrected_fit (numpy.array): Fit to klist and corrected_matching from fitKmerCurve()
out_prefix (str): Prefix for output plot file (.pdf will be appended)
title (str): The title to display above the plot

PopPUNK.plot.plot_refined_results(X, Y, x_boundary, y_boundary, core_boundary, accessory_boundary, mean0, mean1, min_move, max_move, scale, threshold, indiv_boundaries, unconstrained, title, out_prefix)[source]¶

Draw a scatter plot (png) to show the refined model fit

A scatter plot of core and accessory distances, coloured by component membership. The triangular decision boundary is also shown

Args:

X (numpy.array): n x 2 array of core and accessory distances for n samples.
Y (numpy.array): n x 1 array of cluster assignments for n samples.
x_boundary (float): Intercept of boundary with x-axis, from RefineFit
y_boundary (float): Intercept of boundary with y-axis, from RefineFit
core_boundary (float): Intercept of 1D (core) boundary with x-axis, from RefineFit
accessory_boundary (float): Intercept of 1D (accessory) boundary with y-axis, from RefineFit
mean0 (numpy.array): Centre of within-strain distribution
mean1 (numpy.array): Centre of between-strain distribution
min_move (float): Minimum s range
max_move (float): Maximum s range
scale (numpy.array): Scaling factor from RefineFit
threshold (bool): If fit was just from a simple thresholding
indiv_boundaries (bool): Whether to draw lines for core and accessory refinement
title (str): The title to display above the plot
out_prefix (str): Prefix for output plot file (.png will be appended)

PopPUNK.plot.plot_results(X, Y, means, covariances, scale, title, out_prefix)[source]¶

Draw a scatter plot (png) to show the BGMM model fit

A scatter plot of core and accessory distances, coloured by component membership. Also shown are ellipses for each component (centre: means; axes: covariances).

This is based on the example in the sklearn documentation.

Args:

X (numpy.array): n x 2 array of core and accessory distances for n samples.
Y (numpy.array): n x 1 array of cluster assignments for n samples.
means (numpy.array): Component means from BGMMFit
covariances (numpy.array): Component covariances from BGMMFit
scale (numpy.array): Scaling factor from BGMMFit
out_prefix (str): Prefix for output plot file (.png will be appended)
title (str): The title to display above the plot

PopPUNK.plot.plot_scatter(X, out_prefix, title, kde=True)[source]¶

Draws a 2D scatter plot (png) of the core and accessory distances

Also draws contours of kernel density estimate

Args:

X (numpy.array): n x 2 array of core and accessory distances for n samples.
out_prefix (str): Prefix for output plot file (.png will be appended)
title (str): The title to display above the plot
kde (bool): Whether to draw kernel density estimate contours (default = True)

PopPUNK.plot.update_maps_timelines(micoreact_sample_json, info_csv=None)[source]¶

Update the maps and timelines in the Microreact JSON file. Removes maps and timelines if the required columns are not present in the info CSV.

Args:

micoreact_sample_json (dict): Microreact JSON file
info_csv (str): CSV file containing additional information for Microreact

PopPUNK.plot.writeClusterCsv(outfile, nodeNames, nodeLabels, clustering, output_format='microreact', epiCsv=None, queryNames=None, suffix='_Cluster')[source]¶

Print CSV file of clustering and optionally epi data

Writes CSV output of clusters which can be used as input to microreact and cytoscape. Uses pandas to deal with CSV reading and writing nicely.

The epiCsv, if provided, should have the node labels in the first column.

Args:

outfile (str): File to write the CSV to.
nodeNames (list): Names of sequences in clustering (includes path).
nodeLabels (list): Names of sequences to write in CSV (usually has path removed).
clustering (dict or dict of dicts): Dictionary of cluster assignments (keys are nodeNames). Pass a dict with depth two to include multiple possible clusterings.
output_format (str): Software for which CSV should be formatted (microreact, phandango, grapetree and cytoscape are accepted)
epiCsv (str): Optional CSV of epi data to paste in the output in addition to the clusters (default = None).
queryNames (list): Optional list of isolates that have been added as a query. (default = None)

sparse_mst.py¶

sketchlib.py¶

Sketchlib functions for database construction

PopPUNK.sketchlib.addRandom(oPrefix, sequence_names, klist, strand_preserved=False, overwrite=False, threads=1)[source]¶

Add chance of random match to a HDF5 sketch DB

Args:

oPrefix (str): Sketch database prefix
sequence_names (list): Names of sequences to include in calculation
klist (list): List of k-mer sizes to sketch
strand_preserved (bool): Set true to ignore rc k-mers
overwrite (bool): Set true to overwrite existing random match chances
threads (int): Number of threads to use (default = 1)

PopPUNK.sketchlib.checkSketchlibLibrary()[source]¶

Gets the location of the sketchlib library

Returns:

lib (str): Location of sketchlib .so/.dyld

PopPUNK.sketchlib.checkSketchlibVersion()[source]¶

Checks that sketchlib can be run, and returns version

Returns:

version (str): Version string

PopPUNK.sketchlib.constructDatabase(assemblyList, klist, sketch_size, oPrefix, threads, overwrite, strand_preserved, min_count, use_exact, calc_random=True, codon_phased=False, use_gpu=False, deviceid=0)[source]¶

Sketch the input assemblies at the requested k-mer lengths

A multithreaded wrapper around runSketch(). Threads are used to run multiple sketch processes for each klist value.

Also calculates random match probability based on length of first genome in assemblyList.

Args:

assemblyList (str): File with locations of assembly files to be sketched
klist (list): List of k-mer sizes to sketch
sketch_size (int): Size of sketch (-s option)
oPrefix (str): Output prefix for resulting sketch files
threads (int): Number of threads to use (default = 1)
overwrite (bool): Whether to overwrite sketch DBs, if they already exist. (default = False)
strand_preserved (bool): Ignore reverse complement k-mers (default = False)
min_count (int): Minimum count of k-mer in reads to include (default = 0)
use_exact (bool): Use exact count of k-mer appearance in reads (default = False)
calc_random (bool): Add random match chances to DB (turn off for queries)
codon_phased (bool): Use codon phased seeds (default = False)
use_gpu (bool): Use GPU for read sketching (default = False)
deviceid (int): GPU device id (default = 0)

Returns:

names (list): List of names included in the database (from rfile)

PopPUNK.sketchlib.createDatabaseDir(outPrefix, kmers)[source]¶

Creates the directory to write sketches to, removing old files if necessary

Args:

outPrefix (str): output db prefix
kmers (list): k-mer sizes in db

PopPUNK.sketchlib.fitKmerCurve(pairwise, klist, jacobian)[source]¶

Fit the function \(pr = (1-a)(1-c)^k\)

Supply jacobian = -np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(-1, 1)))

Args:

pairwise (numpy.array): Proportion of shared k-mers at k-mer values in klist
klist (list): k-mer sizes used
jacobian (numpy.array): Should be set as above (set once to try and save memory)

Returns:

transformed_params (numpy.array): Column with core and accessory distance
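
Taking logs of the fitted function gives a straight line in k: \(\log(pr) = \log(1-a) + k \log(1-c)\), so the intercept carries the accessory distance and the slope the core distance. The sketch below recovers both with an ordinary, unconstrained least-squares line. This is a simplification for illustration: the jacobian argument suggests the real fit is done differently, and `fit_core_accessory` is a hypothetical name.

```python
import math

def fit_core_accessory(klist, matching):
    """Least-squares line through (k, log(pr)); returns (core, accessory) distances."""
    ys = [math.log(p) for p in matching]
    n = len(klist)
    k_mean = sum(klist) / n
    y_mean = sum(ys) / n
    slope = sum((k - k_mean) * (y - y_mean) for k, y in zip(klist, ys)) / sum(
        (k - k_mean) ** 2 for k in klist
    )
    intercept = y_mean - slope * k_mean
    core = 1 - math.exp(slope)           # slope = log(1 - c)
    accessory = 1 - math.exp(intercept)  # intercept = log(1 - a)
    return core, accessory

# Synthetic, noise-free match proportions generated from known distances
klist = [13, 17, 21, 25, 29]
true_core, true_acc = 0.01, 0.10
matching = [(1 - true_acc) * (1 - true_core) ** k for k in klist]
core, accessory = fit_core_accessory(klist, matching)
```

With noise-free input the fit returns the distances used to generate the data; real sketches are noisy, which is where a constrained fit becomes useful.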

PopPUNK.sketchlib.getKmersFromReferenceDatabase(dbPrefix)[source]¶

Get k-mer lengths from existing database

Args:

dbPrefix (str): Prefix for sketch DB files

Returns:

kmers (list): List of k-mer lengths used in database

PopPUNK.sketchlib.getSeqsInDb(dbname)[source]¶

Return an array with the sequences in the passed database

Args:

dbname (str): Sketches database filename

Returns:

seqs (list): List of sequence names in sketch DB

PopPUNK.sketchlib.getSketchSize(dbPrefix)[source]¶

Determines the sketch size, and ensures it is consistent across the whole database

sys.exit(1) is called if DBs have different sketch sizes

Args:

dbPrefix (str): Prefix for databases

Returns:

sketchSize (int): sketch size (64x C++ definition)
codonPhased (bool): whether the DB used codon phased seeds

PopPUNK.sketchlib.get_database_statistics(prefix)[source]¶

Extract statistics for evaluating databases.

Args:

prefix (str): Prefix of database

PopPUNK.sketchlib.joinDBs(db1, db2, output, update_random=None, full_names=False)[source]¶

Join two sketch databases with the low-level HDF5 copy interface

Args:

db1 (str): Prefix for db1
db2 (str): Prefix for db2
output (str): Prefix for joined output
update_random (dict): Whether to re-calculate the random object. May contain control arguments strand_preserved and threads (see addRandom())
full_names (bool): If True, db1, db2 and output are the full paths to h5 files

PopPUNK.sketchlib.queryDatabase(rNames, qNames, dbPrefix, queryPrefix, klist, self=True, number_plot_fits=0, threads=1, use_gpu=False, deviceid=0)[source]¶

Calculate core and accessory distances between query sequences and a sketched database

For a reference database, runs the query against itself to find all pairwise core and accessory distances.

Uses the relation \(pr(a, b) = (1-a)(1-c)^k\)

To get the ref and query name for each row of the returned distances, call the iterator iterDistRows() with the returned refList and queryList

Args:

rNames (list): Names of references to query
qNames (list): Names of queries
dbPrefix (str): Prefix for reference sketch database created by constructDatabase()
queryPrefix (str): Prefix for query sketch database created by constructDatabase()
klist (list): K-mer sizes to use in the calculation
self (bool): Set true if query = ref (default = True)
number_plot_fits (int): If > 0, the number of k-mer length fits to plot (saved as pdfs). Takes random pairs of comparisons and calls plot_fit() (default = 0)
threads (int): Number of threads to use in the process (default = 1)
use_gpu (bool): Use a GPU for querying (default = False)
deviceid (int): Index of the CUDA GPU device to use (default = 0)

Returns:

distMat (numpy.array): Core distances (column 0) and accessory distances (column 1) between refList and queryList

PopPUNK.sketchlib.readDBParams(dbPrefix)[source]¶

Get k-mer lengths and sketch sizes from existing database

Calls getKmersFromReferenceDatabase() and getSketchSize(). Uses passed values if db missing.

Args:

dbPrefix (str): Prefix for sketch DB files

Returns:

kmers (list): List of k-mer lengths used in database
sketch_sizes (list): List of sketch sizes used in database
codonPhased (bool): whether the DB used codon phased seeds

PopPUNK.sketchlib.removeFromDB(db_name, out_name, removeSeqs, full_names=False)[source]¶

Remove sketches from the DB using the low-level HDF5 copy interface

Args:

db_name (str): Prefix for hdf database
out_name (str): Prefix for output (pruned) database
removeSeqs (list): Names of sequences to remove from database
full_names (bool): If True, db_name and out_name are the full paths to h5 files

utils.py¶

General utility functions for data read/writing/manipulation in PopPUNK

PopPUNK.utils.check_and_set_gpu(use_gpu, gpu_lib, quit_on_fail=False)[source]¶

Check GPU libraries can be loaded and set managed memory.

Args:

use_gpu (bool): Whether GPU packages have been requested
gpu_lib (bool): Whether GPU packages are available

Returns:

use_gpu (bool): Whether GPU packages can be used

PopPUNK.utils.decisionBoundary(intercept, gradient, adj=0.0)[source]¶

Returns the co-ordinates where the triangle the decision boundary forms meets the x- and y-axes.

Args:

intercept (numpy.array): Cartesian co-ordinates of point along line (transformLine()) which intercepts the boundary
gradient (float): Gradient of the line
adj (float): Fraction by which to shift the intercept up the y axis

Returns:

x (float): The x-axis intercept
y (float): The y-axis intercept

PopPUNK.utils.get_match_search_depth(rlist, rank_list)[source]¶

Return a default search depth for lineage model fitting.

Args:

rlist (list): List of sequences in database
rank_list (list): List of ranks to be used to fit lineage models

Returns:

max_search_depth (int): Maximum kNN used for lineage model fitting

PopPUNK.utils.isolateNameToLabel(names)[source]¶

Function to process isolate names to labels appropriate for visualisation.

Args:

names (list): List of isolate names.

Returns:

labels (list): List of isolate labels.

PopPUNK.utils.iterDistRows(refSeqs, querySeqs, self=True)[source]¶

Gets the ref and query ID for each row of the distance matrix

Returns an iterable with ref and query ID pairs by row.

Args:

refSeqs (list): List of reference sequence names.
querySeqs (list): List of query sequence names.
self (bool): Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs (default = True)

Returns:

ref, query (str, str): Iterable of tuples with ref and query names for each distMat row.
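
The idea of pairing each distance row with a (ref, query) name pair can be sketched as below. The row order, and the order of names within each pair, are assumptions made for illustration and should be checked against the source before relying on them; `iter_pairs` is a hypothetical name.

```python
from itertools import combinations, product

def iter_pairs(ref_seqs, query_seqs, self_comparison=True):
    """Yield one (ref, query) name pair per row of a long-form distance matrix."""
    if self_comparison:
        if ref_seqs != query_seqs:
            raise ValueError("self comparison requires ref_seqs == query_seqs")
        # Assumed ordering: each unordered pair exactly once
        yield from combinations(ref_seqs, 2)
    else:
        # Assumed ordering: every reference for the first query, then the next query
        for query, ref in product(query_seqs, ref_seqs):
            yield ref, query

refs = ["r1", "r2", "r3"]
self_rows = list(iter_pairs(refs, refs))
query_rows = list(iter_pairs(refs, ["q1", "q2"], self_comparison=False))
```

A self-comparison of n sequences gives n(n-1)/2 rows, while a query against a reference set gives one row per (query, reference) combination.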

PopPUNK.utils.joinClusterDicts(d1, d2)[source]¶

Join two dictionaries returned by readIsolateTypeFromCsv() with return_dict = True. Useful for concatenating ref and query assignments

Args:

d1 (dict of dicts): First dictionary to concat
d2 (dict of dicts): Second dictionary to concat

Returns:

d1 (dict of dicts): d1 with d2 appended

PopPUNK.utils.listDistInts(refSeqs, querySeqs, self=True)[source]¶

Gets the ref and query ID for each row of the distance matrix

Returns an iterable with ref and query ID pairs by row.

Args:

refSeqs (list): List of reference sequence names.
querySeqs (list): List of query sequence names.
self (bool): Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs (default = True)

Returns:

ref, query (str, str): Iterable of tuples with ref and query names for each distMat row.

PopPUNK.utils.readIsolateTypeFromCsv(clustCSV, mode='clusters', return_dict=False)[source]¶

Read cluster definitions from CSV file.

Args:

clustCSV (str): File name of CSV with isolate assignments
mode (str): Type of file to read ‘clusters’, ‘lineages’, or ‘external’
return_dict (bool): If True, return a dict with sample->cluster instead of sets [default = False]

Returns:

clusters (dict): Dictionary of cluster assignments (keys are cluster names, values are sets containing samples in the cluster). Or if return_dict is set keys are sample names, values are cluster assignments.
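
The two return shapes can be illustrated with a small stand-alone reader. This is a sketch under assumptions: the column names `Taxon` and `Cluster`, the helper `read_clusters`, and the single-clustering layout are all hypothetical, and the real function may return a nested structure.

```python
import csv
import io

# Hypothetical two-column cluster CSV; the real column names may differ
csv_text = "Taxon,Cluster\ns1,1\ns2,1\ns3,2\n"

def read_clusters(handle, return_dict=False):
    """Read cluster assignments as cluster -> set of samples, or sample -> cluster."""
    clusters = {}
    for row in csv.DictReader(handle):
        sample, cluster = row["Taxon"], row["Cluster"]
        if return_dict:
            clusters[sample] = cluster
        else:
            clusters.setdefault(cluster, set()).add(sample)
    return clusters

by_cluster = read_clusters(io.StringIO(csv_text))
by_sample = read_clusters(io.StringIO(csv_text), return_dict=True)
```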

PopPUNK.utils.readPickle(pklName, enforce_self=False, distances=True)[source]¶

Loads core and accessory distances saved by storePickle()

Called during --fit-model

Args:

pklName (str): Prefix for saved files
enforce_self (bool): Error if self == False [default = False]
distances (bool): Read the distance matrix [default = True]

Returns:

rlist (list): List of reference sequence names (for iterDistRows())
qlist (list): List of query sequence names (for iterDistRows())
self (bool): Whether an all-vs-all self DB (for iterDistRows())
X (numpy.array): n x 2 array of core and accessory distances

PopPUNK.utils.readRfile(rFile, oneSeq=False)[source]¶

Reads in files for sketching. Names and sequence, tab separated

Args:

rFile (str): File with locations of assembly files to be sketched
oneSeq (bool): Return only the first sequence listed, rather than a list (used with mash)

Returns:

names (list): Array of sequence names
sequences (list of lists): Array of sequence files
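
A minimal sketch of parsing such a tab-separated file, assuming one sample per line with the name first and one or more sequence files after it. That multiple files per sample are allowed is inferred from the "list of lists" return type, and `parse_rfile` is a hypothetical name.

```python
import io

def parse_rfile(handle):
    """Parse lines of 'name<TAB>file[<TAB>file...]' into names and lists of files."""
    names, sequences = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines in this sketch
        names.append(fields[0])
        sequences.append(fields[1:])
    return names, sequences

# In-memory stand-in for a file on disk; the paths are made up for the example
example = io.StringIO("s1\t/data/s1.fa\ns2\t/data/s2_1.fq.gz\t/data/s2_2.fq.gz\n")
names, sequences = parse_rfile(example)
```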

PopPUNK.utils.read_rlist_from_distance_pickle(fn, allow_non_self=True, include_queries=False)[source]¶

Return the list of reference sequences from a distance pickle.

Args:

fn (str): Name of distance pickle
allow_non_self (bool): Whether non-self distance datasets are permissible
include_queries (bool): Whether queries should be included in the rlist

Returns:

rlist (list): List of reference sequence names

PopPUNK.utils.set_env(**environ)[source]¶

Temporarily set the process environment variables.

>>> with set_env(PLUGINS_DIR=u'test/plugins'):
...     "PLUGINS_DIR" in os.environ
True
>>> "PLUGINS_DIR" in os.environ
False
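
The behaviour described for set_env() follows a common context-manager pattern, sketched here with the standard library. This is not taken from the PopPUNK source; `temporary_env` is a hypothetical name used to keep the two apart.

```python
import contextlib
import os

@contextlib.contextmanager
def temporary_env(**environ):
    """Set environment variables inside the with-block, then restore the old state."""
    old_environ = dict(os.environ)
    os.environ.update(environ)
    try:
        yield
    finally:
        # Restore the environment exactly as it was, even if the block raised
        os.environ.clear()
        os.environ.update(old_environ)

os.environ.pop("PLUGINS_DIR", None)  # ensure a clean starting state for the demo
with temporary_env(PLUGINS_DIR="test/plugins"):
    inside = "PLUGINS_DIR" in os.environ
outside = "PLUGINS_DIR" in os.environ
```

Restoring in a `finally` clause means the variables are reset even when the body of the with-block fails.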

PopPUNK.utils.setupDBFuncs(args)[source]¶

Wraps common database access functions from sketchlib and mash, to try and make their API more similar

Args:

args (argparse.opts): Parsed command line options
qc_dict (dict): Table of parameters for QC function

Returns:

dbFuncs (dict): Functions with consistent arguments to use as the database API

PopPUNK.utils.stderr_redirected(to='/dev/null')[source]¶

import os

with stdout_redirected(to=filename):
    print("from Python")
    os.system("echo non-Python applications are also supported")

PopPUNK.utils.storePickle(rlist, qlist, self, X, pklName)[source]¶

Saves core and accessory distances in a .npy file, names in a .pkl

Called during --create-db

Args:

rlist (list): List of reference sequence names (for iterDistRows())
qlist (list): List of query sequence names (for iterDistRows())
self (bool): Whether an all-vs-all self DB (for iterDistRows())
X (numpy.array): n x 2 array of core and accessory distances. If None, do not save
pklName (str): Prefix for output files

PopPUNK.utils.transformLine(s, mean0, mean1)[source]¶

Return x and y co-ordinates for traversing along a line between mean0 and mean1, parameterised by a single scalar distance s from the start point mean0.

Args:

s (float): Distance along line from mean0
mean0 (numpy.array): Start position of line (x0, y0)
mean1 (numpy.array): End position of line (x1, y1)

Returns:

x (float): The Cartesian x-coordinate
y (float): The Cartesian y-coordinate
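
A short sketch of this parameterisation, assuming s is an absolute distance measured from mean0 in the same units as the coordinates. The helper name `point_along_line` is hypothetical.

```python
import math

def point_along_line(s, mean0, mean1):
    """Return (x, y) at distance s from mean0, moving towards mean1."""
    dx, dy = mean1[0] - mean0[0], mean1[1] - mean0[1]
    length = math.hypot(dx, dy)  # distance between the two means
    # Step s units along the unit vector pointing from mean0 to mean1
    return mean0[0] + s * dx / length, mean0[1] + s * dy / length

# Halfway along a line of length 5 from (0, 0) to (3, 4)
x, y = point_along_line(2.5, (0.0, 0.0), (3.0, 4.0))
```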

PopPUNK.utils.update_distance_matrices(refList, distMat, queryList=None, query_ref_distMat=None, query_query_distMat=None, threads=1)[source]¶

Convert distances from long form (1 matrix with n_comparisons rows and 2 columns) to a square form (2 NxN matrices), with merging of query distances if necessary.

Args:

refList (list): List of references
distMat (numpy.array): Two column long form list of core and accessory distances for pairwise comparisons between reference db sequences
queryList (list): List of queries
query_ref_distMat (numpy.array): Two column long form list of core and accessory distances for pairwise comparisons between queries and reference db sequences
query_query_distMat (numpy.array): Two column long form list of core and accessory distances for pairwise comparisons between query sequences
threads (int): Number of threads to use

Returns:

seqLabels (list): Combined list of reference and query sequences
coreMat (numpy.array): NxN array of core distances for N sequences
accMat (numpy.array): NxN array of accessory distances for N sequences
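
The long-form to square-form conversion for a reference-only comparison can be sketched as follows. It assumes the rows list each unordered pair once, in upper-triangle order, which may differ from the ordering PopPUNK uses. `long_to_square` is a hypothetical name, nested lists stand in for numpy arrays, and merging of query distances is not covered.

```python
from itertools import combinations

def long_to_square(n_seqs, dist_rows):
    """Expand rows of (core, accessory) distances into two symmetric n x n matrices."""
    core = [[0.0] * n_seqs for _ in range(n_seqs)]
    acc = [[0.0] * n_seqs for _ in range(n_seqs)]
    # Assumed row order: pairs (0, 1), (0, 2), ..., (n-2, n-1)
    for (i, j), (core_d, acc_d) in zip(combinations(range(n_seqs), 2), dist_rows):
        core[i][j] = core[j][i] = core_d
        acc[i][j] = acc[j][i] = acc_d
    return core, acc

# Three sequences give three pairwise comparisons
dist_rows = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
core_mat, acc_mat = long_to_square(3, dist_rows)
```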

visualise.py¶

poppunk_visualise main function

PopPUNK.visualise.main()[source]¶: Main function. Parses cmd line args and runs in the specified mode.

web.py¶

Functions used by the web API to convert a sketch to an h5 database, then generate visualisations and post results to PopPUNK-web.

PopPUNK.web.api(query, ref_db)[source]¶: Post cluster and tree information to microreact

PopPUNK.web.calc_prevalence(cluster, cluster_list, num_samples)[source]¶: Cluster prevalences for Plotly.js

PopPUNK.web.graphml_to_json(network_dir)[source]¶: Converts full GraphML file to JSON subgraph

PopPUNK.web.highlight_cluster(query, cluster)[source]¶: Colour assigned cluster in Microreact output

PopPUNK.web.sketch_to_hdf5(sketches_dict, output)[source]¶: Convert dict of JSON sketches to query hdf5 database

PopPUNK.web.summarise_clusters(output, species, species_db, qNames)[source]¶: Retrieve assigned query and all cluster prevalences. Write list of all isolates in cluster for tree subsetting