Reference documentation¶
Documentation for module functions (for developers)
assign.py¶
poppunk_assign
main function
- PopPUNK.assign.assign_query(dbFuncs, ref_db, q_files, output, qc_dict, update_db, write_references, distances, serial, stable, threads, overwrite, plot_fit, graph_weights, model_dir, strand_preserved, previous_clustering, external_clustering, core, accessory, gpu_sketch, gpu_dist, gpu_graph, deviceid, save_partial_query_graph, use_full_network)[source]¶
Code for assign query mode for CLI
- PopPUNK.assign.assign_query_hdf5(dbFuncs, ref_db, qNames, output, qc_dict, update_db, write_references, distances, serial, stable, threads, overwrite, plot_fit, graph_weights, model_dir, strand_preserved, previous_clustering, external_clustering, core, accessory, gpu_dist, gpu_graph, save_partial_query_graph, use_full_network)[source]¶
Code for assign query mode taking hdf5 as input. Written as a separate function so it can be called by web APIs
bgmm.py¶
Functions used to fit the mixture model to a database. Access using BGMMFit.
BGMM using sklearn
- PopPUNK.bgmm.findBetweenLabel_bgmm(means, assignments)[source]¶
Identify between-strain links
Finds the component with the largest number of points assigned to it
- Args:
- means (numpy.array)
K x 2 array of mixture component means
- assignments (numpy.array)
Sample cluster assignments
- Returns:
- between_label (int)
The cluster label with the most points assigned to it
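The selection rule described above can be sketched with the standard library alone. This is an illustrative sketch, not PopPUNK's implementation: the helper name `most_populated_label` is hypothetical, and the `means` argument is left out because the description only relies on the assignment counts.

```python
from collections import Counter


def most_populated_label(assignments):
    """Return the label with the most samples assigned to it (hypothetical helper)."""
    counts = Counter(assignments)
    # most_common(1) returns a one-element list of (label, count)
    label, _ = counts.most_common(1)[0]
    return label


assignments = [0, 2, 2, 1, 2, 0]
between_label = most_populated_label(assignments)
print(between_label)
```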
- PopPUNK.bgmm.findWithinLabel(means, assignments, rank=0)[source]¶
Identify within-strain links
Finds the component with mean closest to the origin and also makes sure some samples are assigned to it (in the case of small weighted components with a Dirichlet prior some components are unused)
- Args:
- means (numpy.array)
K x 2 array of mixture component means
- assignments (numpy.array)
Sample cluster assignments
- rank (int)
Which label to find, ordered by distance from origin. 0-indexed. (default = 0)
- Returns:
- within_label (int)
The cluster label for the within-strain assignments
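As a rough illustration of the rule described above, the sketch below orders components by the distance of their mean from the origin, drops components with no samples assigned, and returns the entry at position `rank`. The helper name `within_label_sketch` is hypothetical, and whether PopPUNK applies `rank` before or after dropping unused components is an assumption here.

```python
import math
from collections import Counter


def within_label_sketch(means, assignments, rank=0):
    """Pick the rank-th closest used component to the origin (hypothetical helper)."""
    counts = Counter(assignments)
    # Order component indices by Euclidean distance of their mean from (0, 0)
    by_distance = sorted(
        range(len(means)),
        key=lambda k: math.hypot(means[k][0], means[k][1]),
    )
    # Assumption: unused components are removed before rank is applied
    used = [k for k in by_distance if counts.get(k, 0) > 0]
    return used[rank]


means = [(0.30, 0.40), (0.01, 0.02), (0.05, 0.05)]
assignments = [0, 0, 2, 2, 2, 0]  # component 1 is closest to the origin but unused
within_label = within_label_sketch(means, assignments)
```

Here component 1 has the mean closest to the origin, but no samples are assigned to it, so the sketch falls through to component 2.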
- PopPUNK.bgmm.fit2dMultiGaussian(X, dpgmm_max_K=2)[source]¶
Main function to fit BGMM model, called from fit()
Fits the mixture model specified, saves model parameters to a file, and assigns the samples to a component. Writes fit summary stats to STDERR.
- Args:
- X (np.array)
n x 2 array of core and accessory distances for n samples. This should be subsampled to 100000 samples.
- dpgmm_max_K (int)
Maximum number of components to use with the EM fit. (default = 2)
- Returns:
- dpgmm (sklearn.mixture.BayesianGaussianMixture)
Fitted bgmm model
- PopPUNK.bgmm.log_likelihood(X, weights, means, covars, scale)[source]¶
Modified sklearn GMM function predicting distribution membership
Returns the mixture LL for points X. Used by assign_samples() and plot_contours()
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples
- weights (numpy.array)
Component weights from
fit2dMultiGaussian()
- means (numpy.array)
Component means from
fit2dMultiGaussian()
- covars (numpy.array)
Component covariances from
fit2dMultiGaussian()
- scale (numpy.array)
Scaling of core and accessory distances from
fit2dMultiGaussian()
- Returns:
- logprob (numpy.array)
The log of the probabilities under the mixture model
- lpr (numpy.array)
The components of the log probability from each mixture component
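The relationship between the two return values can be sketched as a log-sum-exp over components. This is a minimal sketch for a single sample, assuming each entry of `lpr` is the log weight plus the log density of one component; the helper name `mixture_log_prob` is hypothetical and the scaling step is not modelled.

```python
import math


def mixture_log_prob(lpr_row):
    """Log-sum-exp over per-component log probabilities for one sample."""
    m = max(lpr_row)
    # Subtracting the maximum first avoids underflow in exp()
    return m + math.log(sum(math.exp(v - m) for v in lpr_row))


# Two components: log(weight) + log(density) for one sample (made-up values)
lpr_row = [math.log(0.7) + math.log(0.2), math.log(0.3) + math.log(0.5)]
logprob = mixture_log_prob(lpr_row)
```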
- PopPUNK.bgmm.log_multivariate_normal_density(X, means, covars, min_covar=1e-07)[source]¶
Log likelihood of multivariate normal density distribution
Used to calculate per component Gaussian likelihood in
assign_samples()
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples
- means (numpy.array)
Component means from
fit2dMultiGaussian()
- covars (numpy.array)
Component covariances from
fit2dMultiGaussian()
- min_covar (float)
Minimum covariance, added when Cholesky decomposition fails due to too few observations (default = 1.e-7)
- Returns:
- log_prob (numpy.array)
An n-vector with the log-likelihoods for each sample being in this component
dbscan.py¶
Functions used to fit DBSCAN to a database. Access using DBSCANFit.
DBSCAN using hdbscan
- PopPUNK.dbscan.evaluate_dbscan_clusters(model)[source]¶
Evaluate whether fitted dbscan model contains non-overlapping clusters
- Args:
- model (DBSCANFit)
Fitted model from
fit()
- Returns:
- indistinct (bool)
Boolean indicating whether putative within- and between-strain clusters of points overlap
- PopPUNK.dbscan.findBetweenLabel(assignments, within_cluster)[source]¶
Identify between-strain links from a DBSCAN model
Finds the component containing the largest number of between-strain links, excluding the cluster identified as containing within-strain links.
- Args:
- assignments (numpy.array)
Sample cluster assignments
- within_cluster (int)
Cluster ID assigned to within-strain assignments, from
findWithinLabel()
- Returns:
- between_cluster (int)
The cluster label for the between-strain assignments
- PopPUNK.dbscan.fitDbScan(X, min_samples, min_cluster_size, cache_out, use_gpu=False)[source]¶
Function to fit DBSCAN model as an alternative to the Gaussian mixture model
Fits the DBSCAN model to the distances using hdbscan
- Args:
- X (np.array)
n x 2 array of core and accessory distances for n samples
- min_samples (int)
Parameter for DBSCAN clustering ‘conservativeness’
- min_cluster_size (int)
Minimum number of points in a cluster for HDBSCAN
- cache_out (str)
Prefix for DBSCAN cache used for refitting
- use_gpu (bool)
Whether GPU algorithms should be used in DBSCAN fitting
- Returns:
- hdb (hdbscan.HDBSCAN or cuml.cluster.HDBSCAN)
Fitted HDBSCAN to subsampled data
- labels (list)
Cluster assignments of each sample
- n_clusters (int)
Number of clusters used
mandrake.py¶
- PopPUNK.mandrake.generate_embedding(seqLabels, accMat, perplexity, outPrefix, overwrite, kNN=50, maxIter=10000000, n_threads=1, use_gpu=False, device_id=0)[source]¶
Generate t-SNE projection using accessory distances
Writes a plot of t-SNE clustering of accessory distances (.dot)
- Args:
- seqLabels (list)
Processed names of sequences being analysed.
- accMat (numpy.array)
n x n array of accessory distances for n samples.
- perplexity (int)
Perplexity parameter passed to t-SNE
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
- overwrite (bool)
Overwrite existing output if present (default = False)
- kNN (int)
Number of neighbours to use with SCE (cannot be > n_samples) (default = 50)
- maxIter (int)
Number of iterations to run (default = 10000000)
- n_threads (int)
Number of CPU threads to use (default = 1)
- use_gpu (bool)
Whether to use GPU libraries
- device_id (int)
Device ID of GPU to be used (default = 0)
- Returns:
- mandrake_filename (str)
Filename with .dot of embedding
models.py¶
network.py¶
Functions used to construct the network, and update with new queries. Main entry point is constructNetwork() for new reference databases, and findQueryLinksToNetwork() for querying databases.
refine.py¶
Functions used to refine an existing model. Access using RefineFit.
plot.py¶
Plots of GMM results, k-mer fits, and microreact output
- PopPUNK.plot.createMicroreact(prefix, microreact_files, api_key=None, info_csv=None)[source]¶
Creates a .microreact file, and an instance via the API
- Args:
- prefix (str)
Prefix for output file
- microreact_files (str)
List of Microreact files [clusters, dot, tree, mst_tree]
- api_key (str)
API key for your account
- info_csv (str)
CSV file containing additional information for Microreact
- PopPUNK.plot.distHistogram(dists, rank, outPrefix)[source]¶
Plot a histogram of distances (1D)
- Args:
- dists (np.array)
Distance vector
- rank (int)
Rank (used for name and title)
- outPrefix (str)
Full path prefix for plot file
- PopPUNK.plot.drawMST(mst, outPrefix, isolate_clustering, clustering_name, overwrite)[source]¶
Plot a layout of the minimum spanning tree
- Args:
- mst (graph_tool.Graph)
A minimum spanning tree
- outPrefix (str)
Output prefix for save files
- isolate_clustering (dict)
Dictionary of ID: cluster, used for colouring vertices
- clustering_name (str)
Name of clustering scheme to be used for colouring
- overwrite (bool)
Overwrite existing output files
- PopPUNK.plot.get_grid(minimum, maximum, resolution)[source]¶
Get a square grid of points to evaluate a function across
Used for plot_scatter() and plot_contours()
- Args:
- minimum (float)
Minimum value for grid
- maximum (float)
Maximum value for grid
- resolution (int)
Number of points along each axis
- Returns:
- xx (numpy.array)
x values across n x n grid
- yy (numpy.array)
y values across n x n grid
- xy (numpy.array)
n x 2 pairs of x, y values grid is over
- PopPUNK.plot.outputsForCytoscape(G, G_mst, isolate_names, clustering, outPrefix, epiCsv, queryList=None, suffix=None, writeCsv=True, use_partial_query_graph=None)[source]¶
Write outputs for cytoscape. A graphml of the network, and CSV with metadata
- Args:
- G (graph)
The network to write
- G_mst (graph)
The minimum spanning tree of G
- isolate_names (list)
Ordered list of sequence names
- clustering (dict)
Dictionary of cluster assignments (keys are nodeNames).
- outPrefix (str)
Prefix for files to be written
- epiCsv (str)
Optional CSV of epi data to paste in the output in addition to the clusters.
- queryList (list)
Optional list of isolates that have been added as a query. (default = None)
- suffix (string)
String to append to network file name. (default = None)
- writeCsv (bool)
Whether to print CSV file to accompany network
- use_partial_query_graph (str)
File listing sequences to be included in output graph
- PopPUNK.plot.outputsForGrapetree(combined_list, clustering, nj_tree, mst_tree, outPrefix, epiCsv, queryList=None, overwrite=False)[source]¶
Generate files for Grapetree
Write a neighbour joining tree (.nwk) from core distances and cluster assignment (.csv)
- Args:
- combined_list (list)
Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
- clustering (dict or dict of dicts)
List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
- nj_tree (str or None)
String representation of a Newick-formatted NJ tree
- mst_tree (str or None)
String representation of a Newick-formatted minimum-spanning tree
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory.
- epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
- queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
- overwrite (bool)
Overwrite existing output if present (default = False).
- PopPUNK.plot.outputsForMicroreact(combined_list, clustering, nj_tree, mst_tree, accMat, perplexity, maxIter, outPrefix, epiCsv, queryList=None, overwrite=False, n_threads=1, use_gpu=False, device_id=0)[source]¶
Generate files for microreact
Output a neighbour joining tree (.nwk) from core distances, a plot of t-SNE clustering of accessory distances (.dot) and cluster assignment (.csv)
- Args:
- combined_list (list)
Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
- clustering (dict or dict of dicts)
List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
- nj_tree (str or None)
String representation of a Newick-formatted NJ tree
- mst_tree (str or None)
String representation of a Newick-formatted minimum-spanning tree
- accMat (numpy.array)
n x n array of accessory distances for n samples.
- perplexity (int)
Perplexity parameter passed to mandrake
- maxIter (int)
Maximum iterations for mandrake
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
- epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
- queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
- overwrite (bool)
Overwrite existing output if present (default = False)
- n_threads (int)
Number of CPU threads to use (default = 1)
- use_gpu (bool)
Whether to use a GPU for t-SNE generation
- device_id (int)
Device ID of GPU to be used (default = 0)
- Returns:
- outfiles (list)
List of output files created
- PopPUNK.plot.outputsForPhandango(combined_list, clustering, nj_tree, mst_tree, outPrefix, epiCsv, queryList=None, overwrite=False)[source]¶
Generate files for Phandango
Write a neighbour joining tree (.tree) from core distances and cluster assignment (.csv)
- Args:
- combined_list (list)
Name of sequences being analysed. The part of the name before the first ‘.’ will be shown in the output
- clustering (dict or dict of dicts)
List of cluster assignments from printClusters(). Further clusterings (e.g. 1D core only) can be included by passing these as a dict.
- nj_tree (str or None)
String representation of a Newick-formatted NJ tree
- mst_tree (str or None)
String representation of a Newick-formatted minimum-spanning tree
- outPrefix (str)
Prefix for all generated output files, which will be placed in outPrefix subdirectory
- epiCsv (str)
A CSV containing other information, to include with the CSV of clusters
- queryList (list)
Optional list of isolates that have been added as a query for colouring in the CSV. (default = None)
- overwrite (bool)
Overwrite existing output if present (default = False)
- threads (int)
Number of threads to use with rapidnj
- PopPUNK.plot.plot_contours(model, assignments, title, out_prefix)[source]¶
Draw contours of mixture model assignments
Will draw the decision boundary for between/within in red
- Args:
- model (BGMMFit)
Model we are plotting from
- assignments (numpy.array)
n-vectors of cluster assignments for model
- title (str)
The title to display above the plot
- out_prefix (str)
Prefix for output plot file (.pdf will be appended)
- PopPUNK.plot.plot_database_evaluations(prefix, genome_lengths, ambiguous_bases)[source]¶
Plot histograms of sequence characteristics for database evaluation.
- Args:
- prefix (str)
Prefix for output files
- genome_lengths (list)
Lengths of genomes in database
- ambiguous_bases (list)
Counts of ambiguous bases in genomes in database
- PopPUNK.plot.plot_dbscan_results(X, y, n_clusters, out_prefix, use_gpu)[source]¶
Draw a scatter plot (png) to show the DBSCAN model fit
A scatter plot of core and accessory distances, coloured by component membership. Black is noise
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- y (numpy.array)
n x 1 array of cluster assignments for n samples.
- n_clusters (int)
Number of clusters used (excluding noise)
- out_prefix (str)
Prefix for output file (.png will be appended)
- use_gpu (bool)
Whether model was fitted with GPU-enabled code
- PopPUNK.plot.plot_evaluation_histogram(input_data, n_bins=100, prefix='hist', suffix='', plt_title='histogram', xlab='x')[source]¶
Plot histograms of sequence characteristics for database evaluation.
- Args:
- input_data (list)
Input data (list of numbers)
- n_bins (int)
Number of bins to use for the histogram
- prefix (str)
Prefix of database
- suffix (str)
Suffix specifying plot type
- plt_title (str)
Title for plot
- xlab (str)
Title for the horizontal axis
- PopPUNK.plot.plot_fit(klist, raw_matching, raw_fit, corrected_matching, corrected_fit, out_prefix, title)[source]¶
Draw a scatter plot (pdf) of k-mer sizes vs match probability, and the fit used to assign core and accessory distance
K-mer sizes on x-axis, log(pr(match)) on y - expect a straight line fit with intercept representing accessory distance and slope core distance
- Args:
- klist (list)
List of k-mer sizes
- raw_matching (list)
Proportion of matching k-mers at each klist value
- raw_fit (numpy.array)
Fit to klist and raw_matching from
fitKmerCurve()
- corrected_matching (list)
Corrected proportion of matching k-mers at each klist value
- corrected_fit (numpy.array)
Fit to klist and corrected_matching from
fitKmerCurve()
- out_prefix (str)
Prefix for output plot file (.pdf will be appended)
- title (str)
The title to display above the plot
- PopPUNK.plot.plot_refined_results(X, Y, x_boundary, y_boundary, core_boundary, accessory_boundary, mean0, mean1, min_move, max_move, scale, threshold, indiv_boundaries, unconstrained, title, out_prefix)[source]¶
Draw a scatter plot (png) to show the refined model fit
A scatter plot of core and accessory distances, coloured by component membership. The triangular decision boundary is also shown
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- Y (numpy.array)
n x 1 array of cluster assignments for n samples.
- x_boundary (float)
Intercept of boundary with x-axis, from
RefineFit
- y_boundary (float)
Intercept of boundary with y-axis, from
RefineFit
- core_boundary (float)
Intercept of 1D (core) boundary with x-axis, from
RefineFit
- accessory_boundary (float)
Intercept of 1D (accessory) boundary with y-axis, from
RefineFit
- mean0 (numpy.array)
Centre of within-strain distribution
- mean1 (numpy.array)
Centre of between-strain distribution
- min_move (float)
Minimum s range
- max_move (float)
Maximum s range
- scale (numpy.array)
Scaling factor from
RefineFit
- threshold (bool)
If fit was just from a simple thresholding
- indiv_boundaries (bool)
Whether to draw lines for core and accessory refinement
- title (str)
The title to display above the plot
- out_prefix (str)
Prefix for output plot file (.png will be appended)
- PopPUNK.plot.plot_results(X, Y, means, covariances, scale, title, out_prefix)[source]¶
Draw a scatter plot (png) to show the BGMM model fit
A scatter plot of core and accessory distances, coloured by component membership. Also shown are ellipses for each component (centre: means axes: covariances).
This is based on the example in the sklearn documentation.
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- Y (numpy.array)
n x 1 array of cluster assignments for n samples.
- means (numpy.array)
Component means from
BGMMFit
- covars (numpy.array)
Component covariances from
BGMMFit
- scale (numpy.array)
Scaling factor from
BGMMFit
- out_prefix (str)
Prefix for output plot file (.png will be appended)
- title (str)
The title to display above the plot
- PopPUNK.plot.plot_scatter(X, out_prefix, title, kde=True)[source]¶
Draws a 2D scatter plot (png) of the core and accessory distances
Also draws contours of kernel density estimate
- Args:
- X (numpy.array)
n x 2 array of core and accessory distances for n samples.
- out_prefix (str)
Prefix for output plot file (.png will be appended)
- title (str)
The title to display above the plot
- kde (bool)
Whether to draw kernel density estimate contours
(default = True)
- PopPUNK.plot.update_maps_timelines(micoreact_sample_json, info_csv=None)[source]¶
Update the maps and timelines in the Microreact JSON file. Removes maps and timelines if the required columns are not present in the info CSV.
- Args:
- micoreact_sample_json (dict)
Microreact JSON file
- info_csv (str)
CSV file containing additional information for Microreact
- PopPUNK.plot.writeClusterCsv(outfile, nodeNames, nodeLabels, clustering, output_format='microreact', epiCsv=None, queryNames=None, suffix='_Cluster')[source]¶
Print CSV file of clustering and optionally epi data
Writes CSV output of clusters which can be used as input to microreact and cytoscape. Uses pandas to deal with CSV reading and writing nicely.
The epiCsv, if provided, should have the node labels in the first column.
- Args:
- outfile (str)
File to write the CSV to.
- nodeNames (list)
Names of sequences in clustering (includes path).
- nodeLabels (list)
Names of sequences to write in CSV (usually has path removed).
- clustering (dict or dict of dicts)
Dictionary of cluster assignments (keys are nodeNames). Pass a dict with depth two to include multiple possible clusterings.
- output_format (str)
Software for which CSV should be formatted (microreact, phandango, grapetree and cytoscape are accepted)
- epiCsv (str)
Optional CSV of epi data to paste in the output in addition to the clusters (default = None).
- queryNames (list)
Optional list of isolates that have been added as a query.
(default = None)
sparse_mst.py¶
sketchlib.py¶
Sketchlib functions for database construction
- PopPUNK.sketchlib.addRandom(oPrefix, sequence_names, klist, strand_preserved=False, overwrite=False, threads=1)[source]¶
Add chance of random match to a HDF5 sketch DB
- Args:
- oPrefix (str)
Sketch database prefix
- sequence_names (list)
Names of sequences to include in calculation
- klist (list)
List of k-mer sizes to sketch
- strand_preserved (bool)
Set true to ignore rc k-mers
- overwrite (bool)
Set true to overwrite existing random match chances
- threads (int)
Number of threads to use (default = 1)
- PopPUNK.sketchlib.checkSketchlibLibrary()[source]¶
Gets the location of the sketchlib library
- Returns:
- lib (str)
Location of sketchlib .so/.dyld
- PopPUNK.sketchlib.checkSketchlibVersion()[source]¶
Checks that sketchlib can be run, and returns version
- Returns:
- version (str)
Version string
- PopPUNK.sketchlib.constructDatabase(assemblyList, klist, sketch_size, oPrefix, threads, overwrite, strand_preserved, min_count, use_exact, calc_random=True, codon_phased=False, use_gpu=False, deviceid=0)[source]¶
Sketch the input assemblies at the requested k-mer lengths
A multithread wrapper around runSketch(). Threads are used to run multiple sketch processes for each klist value. Also calculates random match probability based on length of first genome in assemblyList.
- Args:
- assemblyList (str)
File with locations of assembly files to be sketched
- klist (list)
List of k-mer sizes to sketch
- sketch_size (int)
Size of sketch (-s option)
- oPrefix (str)
Output prefix for resulting sketch files
- threads (int)
Number of threads to use (default = 1)
- overwrite (bool)
Whether to overwrite sketch DBs, if they already exist. (default = False)
- strand_preserved (bool)
Ignore reverse complement k-mers (default = False)
- min_count (int)
Minimum count of k-mer in reads to include (default = 0)
- use_exact (bool)
Use exact count of k-mer appearance in reads (default = False)
- calc_random (bool)
Add random match chances to DB (turn off for queries)
- codon_phased (bool)
Use codon phased seeds (default = False)
- use_gpu (bool)
Use GPU for read sketching (default = False)
- deviceid (int)
GPU device id (default = 0)
- Returns:
- names (list)
List of names included in the database (from rfile)
- PopPUNK.sketchlib.createDatabaseDir(outPrefix, kmers)[source]¶
Creates the directory to write sketches to, removing old files if necessary
- Args:
- outPrefix (str)
output db prefix
- kmers (list)
k-mer sizes in db
- PopPUNK.sketchlib.fitKmerCurve(pairwise, klist, jacobian)[source]¶
Fit the function \(pr = (1-a)(1-c)^k\)
Supply
jacobian = -np.hstack((np.ones((klist.shape[0], 1)), klist.reshape(-1, 1)))
- Args:
- pairwise (numpy.array)
Proportion of shared k-mers at k-mer values in klist
- klist (list)
k-mer sizes used
- jacobian (numpy.array)
Should be set as above (set once to try and save memory)
- Returns:
- transformed_params (numpy.array)
Column with core and accessory distance
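Taking logs of the function above gives a straight line in k: \(\log(pr) = \log(1-a) + k\log(1-c)\), so the intercept carries the accessory distance a and the slope the core distance c. The sketch below recovers both with ordinary least squares using only the standard library. It is an unconstrained simplification for illustration; the jacobian argument suggests PopPUNK uses a different optimiser, and the helper name `fit_kmer_line` is hypothetical.

```python
import math


def fit_kmer_line(klist, matching):
    """Ordinary least squares fit of log(pr) against k (illustrative only)."""
    n = len(klist)
    y = [math.log(p) for p in matching]
    k_mean = sum(klist) / n
    y_mean = sum(y) / n
    sxx = sum((k - k_mean) ** 2 for k in klist)
    sxy = sum((k - k_mean) * (v - y_mean) for k, v in zip(klist, y))
    slope = sxy / sxx
    intercept = y_mean - slope * k_mean
    # Invert the log transform: slope = log(1 - c), intercept = log(1 - a)
    core = 1.0 - math.exp(slope)
    accessory = 1.0 - math.exp(intercept)
    return core, accessory


# Synthetic, noise-free match proportions generated from known distances
klist = [13, 17, 21, 25, 29]
true_core, true_acc = 0.01, 0.1
matching = [(1 - true_acc) * (1 - true_core) ** k for k in klist]
core, accessory = fit_kmer_line(klist, matching)
```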
- PopPUNK.sketchlib.getKmersFromReferenceDatabase(dbPrefix)[source]¶
Get k-mer lengths from existing database
- Args:
- dbPrefix (str)
Prefix for sketch DB files
- Returns:
- kmers (list)
List of k-mer lengths used in database
- PopPUNK.sketchlib.getSeqsInDb(dbname)[source]¶
Return an array with the sequences in the passed database
- Args:
- dbname (str)
Sketches database filename
- Returns:
- seqs (list)
List of sequence names in sketch DB
- PopPUNK.sketchlib.getSketchSize(dbPrefix)[source]¶
Determines sketch size, and ensures it is consistent across the whole database
sys.exit(1) is called if DBs have different sketch sizes
- Args:
- dbPrefix (str)
Prefix for databases
- Returns:
- sketchSize (int)
sketch size (64x C++ definition)
- codonPhased (bool)
whether the DB used codon phased seeds
- PopPUNK.sketchlib.get_database_statistics(prefix)[source]¶
Extract statistics for evaluating databases.
- Args:
- prefix (str)
Prefix of database
- PopPUNK.sketchlib.joinDBs(db1, db2, output, update_random=None, full_names=False)[source]¶
Join two sketch databases with the low-level HDF5 copy interface
- Args:
- db1 (str)
Prefix for db1
- db2 (str)
Prefix for db2
- output (str)
Prefix for joined output
- update_random (dict)
Whether to re-calculate the random object. May contain control arguments strand_preserved and threads (see addRandom())
- full_names (bool)
If True, db1, db2 and output are the full paths to h5 files
- PopPUNK.sketchlib.queryDatabase(rNames, qNames, dbPrefix, queryPrefix, klist, self=True, number_plot_fits=0, threads=1, use_gpu=False, deviceid=0)[source]¶
Calculate core and accessory distances between query sequences and a sketched database
For a reference database, runs the query against itself to find all pairwise core and accessory distances.
Uses the relation \(pr(a, b) = (1-a)(1-c)^k\)
To get the ref and query name for each row of the returned distances, call the iterator iterDistRows() with the returned refList and queryList
- Args:
- rNames (list)
Names of references to query
- qNames (list)
Names of queries
- dbPrefix (str)
Prefix for reference sketch database created by
constructDatabase()
- queryPrefix (str)
Prefix for query sketch database created by
constructDatabase()
- klist (list)
K-mer sizes to use in the calculation
- self (bool)
Set true if query = ref (default = True)
- number_plot_fits (int)
If > 0, the number of k-mer length fits to plot (saved as pdfs). Takes random pairs of comparisons and calls plot_fit() (default = 0)
- threads (int)
Number of threads to use in the process (default = 1)
- use_gpu (bool)
Use a GPU for querying (default = False)
- deviceid (int)
Index of the CUDA GPU device to use (default = 0)
- Returns:
- distMat (numpy.array)
Core distances (column 0) and accessory distances (column 1) between refList and queryList
- PopPUNK.sketchlib.readDBParams(dbPrefix)[source]¶
Get k-mer lengths and sketch sizes from existing database
Calls getKmersFromReferenceDatabase() and getSketchSize(). Uses passed values if db missing
- Args:
- dbPrefix (str)
Prefix for sketch DB files
- Returns:
- kmers (list)
List of k-mer lengths used in database
- sketch_sizes (list)
List of sketch sizes used in database
- codonPhased (bool)
whether the DB used codon phased seeds
- PopPUNK.sketchlib.removeFromDB(db_name, out_name, removeSeqs, full_names=False)[source]¶
Remove sketches from the DB using the low-level HDF5 copy interface
- Args:
- db_name (str)
Prefix for hdf database
- out_name (str)
Prefix for output (pruned) database
- removeSeqs (list)
Names of sequences to remove from database
- full_names (bool)
If True, db_name and out_name are the full paths to h5 files
utils.py¶
General utility functions for data read/writing/manipulation in PopPUNK
- PopPUNK.utils.check_and_set_gpu(use_gpu, gpu_lib, quit_on_fail=False)[source]¶
Check GPU libraries can be loaded and set managed memory.
- Args:
- use_gpu (bool)
Whether GPU packages have been requested
- gpu_lib (bool)
Whether GPU packages are available
- Returns:
- use_gpu (bool)
Whether GPU packages can be used
- PopPUNK.utils.decisionBoundary(intercept, gradient, adj=0.0)[source]¶
Returns the co-ordinates where the triangle the decision boundary forms meets the x- and y-axes.
- Args:
- intercept (numpy.array)
Cartesian co-ordinates of point along line (transformLine()) which intercepts the boundary
- gradient (float)
Gradient of the line
- adj (float)
Fraction by which to shift the intercept up the y axis
- Returns:
- x (float)
The x-axis intercept
- y (float)
The y-axis intercept
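The geometry can be sketched as follows, under a stated assumption: the boundary is taken to be the straight line through the intercept point that is perpendicular to a line with the given gradient. That perpendicularity is not stated in the description above, so treat this as one plausible reading rather than PopPUNK's definition. The helper name `boundary_intercepts` is hypothetical and the adj shift is not modelled.

```python
def boundary_intercepts(point, gradient):
    """Axis intercepts of the line through `point` perpendicular to `gradient`.

    Assumption: the boundary has slope -1/gradient.
    """
    x0, y0 = point
    # Setting y = 0 in y - y0 = -(x - x0) / gradient
    x_int = x0 + y0 * gradient
    # Setting x = 0 in the same equation
    y_int = y0 + x0 / gradient
    return x_int, y_int


# The line through (1, 1) perpendicular to y = x is x + y = 2
x, y = boundary_intercepts((1.0, 1.0), 1.0)
```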
- PopPUNK.utils.get_match_search_depth(rlist, rank_list)[source]¶
Return a default search depth for lineage model fitting.
- Args:
- rlist (list)
List of sequences in database
- rank_list (list)
List of ranks to be used to fit lineage models
- Returns:
- max_search_depth (int)
Maximum kNN used for lineage model fitting
- PopPUNK.utils.isolateNameToLabel(names)[source]¶
Function to process isolate names to labels appropriate for visualisation.
- Args:
- names (list)
List of isolate names.
- Returns:
- labels (list)
List of isolate labels.
- PopPUNK.utils.iterDistRows(refSeqs, querySeqs, self=True)[source]¶
Gets the ref and query ID for each row of the distance matrix
Returns an iterable with ref and query ID pairs by row.
- Args:
- refSeqs (list)
List of reference sequence names.
- querySeqs (list)
List of query sequence names.
- self (bool)
Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs. Default is True
- Returns:
- ref, query (str, str)
Iterable of tuples with ref and query names for each distMat row.
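A generator of this kind can be sketched as below. The row counts follow from the description (one row per unordered pair for a self-comparison, one row per ref and query combination otherwise), but the exact order of rows and of the names within each tuple is an assumption here, not PopPUNK's documented behaviour. The name `iter_pairs` is hypothetical.

```python
def iter_pairs(ref_seqs, query_seqs, self_comparison=True):
    """Yield one (name, name) tuple per distance row (illustrative ordering)."""
    if self_comparison:
        if ref_seqs != query_seqs:
            raise ValueError("self comparison requires ref_seqs == query_seqs")
        # Assumption: rows cover each unordered pair once, upper triangle first
        for i in range(len(ref_seqs)):
            for j in range(i + 1, len(ref_seqs)):
                yield ref_seqs[i], ref_seqs[j]
    else:
        # Assumption: all references are visited for each query in turn
        for query in query_seqs:
            for ref in ref_seqs:
                yield ref, query


names = ["a", "b", "c"]
self_rows = list(iter_pairs(names, names))
query_rows = list(iter_pairs(names, ["q1", "q2"], self_comparison=False))
```

For n references a self-comparison gives n(n-1)/2 rows, which is a useful sanity check against the number of rows in the distance matrix.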
- PopPUNK.utils.joinClusterDicts(d1, d2)[source]¶
Join two dictionaries returned by readIsolateTypeFromCsv() with return_dict = True. Useful for concatenating ref and query assignments
- Args:
- d1 (dict of dicts)
First dictionary to concat
- d2 (dict of dicts)
Second dictionary to concat
- Returns:
- d1 (dict of dicts)
d1 with d2 appended
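A merge of two dicts of dicts might look like the sketch below. It assumes the outer keys name a clustering and the inner dicts map sample to cluster; that layout, the key names in the example, and the helper name `join_cluster_dicts` are all assumptions for illustration.

```python
def join_cluster_dicts(d1, d2):
    """Append the inner assignments of d2 to d1, clustering by clustering."""
    for name, assignments in d2.items():
        # Create the clustering in d1 if it is missing, then add d2's samples
        d1.setdefault(name, {}).update(assignments)
    return d1


ref_clusters = {"Cluster": {"ref1": "1", "ref2": "2"}}
query_clusters = {"Cluster": {"query1": "1"}}
combined = join_cluster_dicts(ref_clusters, query_clusters)
```

As in the description above, the first dictionary is modified in place and returned.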
- PopPUNK.utils.listDistInts(refSeqs, querySeqs, self=True)[source]¶
Gets the ref and query ID for each row of the distance matrix
Returns an iterable with ref and query ID pairs by row.
- Args:
- refSeqs (list)
List of reference sequence names.
- querySeqs (list)
List of query sequence names.
- self (bool)
Whether a self-comparison, used when constructing a database. Requires refSeqs == querySeqs. Default is True
- Returns:
- ref, query (str, str)
Iterable of tuples with ref and query names for each distMat row.
- PopPUNK.utils.readIsolateTypeFromCsv(clustCSV, mode='clusters', return_dict=False)[source]¶
Read cluster definitions from CSV file.
- Args:
- clustCSV (str)
File name of CSV with isolate assignments
- mode (str)
Type of file to read: ‘clusters’, ‘lineages’, or ‘external’
- return_dict (bool)
If True, return a dict with sample->cluster instead of sets [default = False]
- Returns:
- clusters (dict)
Dictionary of cluster assignments (keys are cluster names, values are sets containing samples in the cluster). Or if return_dict is set keys are sample names, values are cluster assignments.
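The two return shapes described above can be illustrated with the standard csv module. This sketch assumes a two-column file with a header row; the column names `Taxon` and `Cluster`, the helper name `read_clusters`, and the omission of the mode argument are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle, return_dict=False):
    """Read sample,cluster rows into one of the two shapes described above."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    if return_dict:
        # sample -> cluster
        return {sample: cluster for sample, cluster in reader}
    # cluster -> set of samples
    clusters = defaultdict(set)
    for sample, cluster in reader:
        clusters[cluster].add(sample)
    return dict(clusters)


csv_text = "Taxon,Cluster\ns1,1\ns2,1\ns3,2\n"
by_cluster = read_clusters(io.StringIO(csv_text))
by_sample = read_clusters(io.StringIO(csv_text), return_dict=True)
```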
- PopPUNK.utils.readPickle(pklName, enforce_self=False, distances=True)[source]¶
Loads core and accessory distances saved by
storePickle()
Called during
--fit-model
- Args:
- pklName (str)
Prefix for saved files
- enforce_self (bool)
Error if self == False
[default = False]
- distances (bool)
Read the distance matrix
[default = True]
- Returns:
- rlist (list)
List of reference sequence names (for iterDistRows())
- qlist (list)
List of query sequence names (for iterDistRows())
- self (bool)
Whether an all-vs-all self DB (for iterDistRows())
- X (numpy.array)
n x 2 array of core and accessory distances
- PopPUNK.utils.readRfile(rFile, oneSeq=False)[source]¶
Reads in files for sketching. Names and sequence, tab separated
- Args:
- rFile (str)
File with locations of assembly files to be sketched
- oneSeq (bool)
Return only the first sequence listed, rather than a list (used with mash)
- Returns:
- names (list)
Array of sequence names
- sequences (list of lists)
Array of sequence files
- PopPUNK.utils.read_rlist_from_distance_pickle(fn, allow_non_self=True, include_queries=False)[source]¶
Return the list of reference sequences from a distance pickle.
- Args:
- fn (str)
Name of distance pickle
- allow_non_self (bool)
Whether non-self distance datasets are permissible
- include_queries (bool)
Whether queries should be included in the rlist
- Returns:
- rlist (list)
List of reference sequence names
- PopPUNK.utils.set_env(**environ)[source]¶
Temporarily set the process environment variables.
>>> with set_env(PLUGINS_DIR=u'test/plugins'):
...     "PLUGINS_DIR" in os.environ
True
>>> "PLUGINS_DIR" in os.environ
False
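A context manager with this behaviour is commonly written with contextlib. The version below is a generic sketch of that pattern, not a copy of PopPUNK's code; the name `set_env_sketch` and the variable `POPPUNK_SKETCH_DEMO_VAR` are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def set_env_sketch(**environ):
    """Set environment variables for the duration of a with-block."""
    old_environ = dict(os.environ)
    os.environ.update(environ)
    try:
        yield
    finally:
        # Restore the environment exactly as it was, even after an exception
        os.environ.clear()
        os.environ.update(old_environ)


with set_env_sketch(POPPUNK_SKETCH_DEMO_VAR="1"):
    inside = "POPPUNK_SKETCH_DEMO_VAR" in os.environ
outside = "POPPUNK_SKETCH_DEMO_VAR" in os.environ
```

Restoring in a finally clause means the previous environment comes back even if the body of the with-block raises.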
- PopPUNK.utils.setupDBFuncs(args)[source]¶
Wraps common database access functions from sketchlib and mash, to try and make their API more similar
- Args:
- args (argparse.opts)
Parsed command lines options
- qc_dict (dict)
Table of parameters for QC function
- Returns:
- dbFuncs (dict)
Functions with consistent arguments to use as the database API
- PopPUNK.utils.stderr_redirected(to='/dev/null')[source]¶
import os
with stdout_redirected(to=filename):
    print("from Python")
    os.system("echo non-Python applications are also supported")
- PopPUNK.utils.storePickle(rlist, qlist, self, X, pklName)[source]¶
Saves core and accessory distances in a .npy file, names in a .pkl
Called during
--create-db
- Args:
- rlist (list)
List of reference sequence names (for iterDistRows())
- qlist (list)
List of query sequence names (for iterDistRows())
- self (bool)
Whether an all-vs-all self DB (for iterDistRows())
- X (numpy.array)
n x 2 array of core and accessory distances
If None, do not save
- pklName (str)
Prefix for output files
- PopPUNK.utils.transformLine(s, mean0, mean1)[source]¶
Return x and y co-ordinates for traversing along a line between mean0 and mean1, parameterised by a single scalar distance s from the start point mean0.
- Args:
- s (float)
Distance along line from mean0
- mean0 (numpy.array)
Start position of line (x0, y0)
- mean1 (numpy.array)
End position of line (x1, y1)
- Returns:
- x (float)
The Cartesian x-coordinate
- y (float)
The Cartesian y-coordinate
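The parameterisation can be sketched by stepping a distance s along the unit vector from mean0 towards mean1. This is a minimal standard-library sketch with a hypothetical name, `point_along_line`, and plain tuples in place of numpy arrays.

```python
import math


def point_along_line(s, mean0, mean1):
    """Return (x, y) at distance s from mean0 in the direction of mean1."""
    dx = mean1[0] - mean0[0]
    dy = mean1[1] - mean0[1]
    length = math.hypot(dx, dy)
    # Scale the direction vector to unit length, then step s along it
    return mean0[0] + s * dx / length, mean0[1] + s * dy / length


# Halfway along a 3-4-5 triangle's hypotenuse
x, y = point_along_line(2.5, (0.0, 0.0), (3.0, 4.0))
```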
- PopPUNK.utils.update_distance_matrices(refList, distMat, queryList=None, query_ref_distMat=None, query_query_distMat=None, threads=1)[source]¶
Convert distances from long form (1 matrix with n_comparisons rows and 2 columns) to a square form (2 NxN matrices), with merging of query distances if necessary.
- Args:
- refList (list)
List of references
- distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between reference db sequences
- queryList (list)
List of queries
- query_ref_distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between queries and reference db sequences
- query_query_distMat (numpy.array)
Two column long form list of core and accessory distances for pairwise comparisons between query sequences
- threads (int)
Number of threads to use
- Returns:
- seqLabels (list)
Combined list of reference and query sequences
- coreMat (numpy.array)
NxN array of core distances for N sequences
- accMat (numpy.array)
NxN array of accessory distances for N sequences
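The long-to-square conversion for the reference-only case can be sketched as below. It assumes the long-form rows list each unordered pair once in upper-triangle, row-major order; that ordering, the use of nested lists instead of numpy arrays, and the name `long_to_square` are assumptions for illustration. Merging query distances is not shown.

```python
def long_to_square(n, long_rows):
    """Expand n(n-1)/2 rows of (core, accessory) into two symmetric n x n matrices."""
    core = [[0.0] * n for _ in range(n)]
    acc = [[0.0] * n for _ in range(n)]
    row = 0
    for i in range(n):
        for j in range(i + 1, n):
            c, a = long_rows[row]
            # Distances are symmetric, so fill both triangles
            core[i][j] = core[j][i] = c
            acc[i][j] = acc[j][i] = a
            row += 1
    return core, acc


# Pairs in assumed order: (0,1), (0,2), (1,2)
long_rows = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
core_mat, acc_mat = long_to_square(3, long_rows)
```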
visualise.py¶
poppunk_visualise
main function
web.py¶
Functions used by the web API to convert a sketch to an h5 database, then generate visualisations and post results to PopPUNK-web.
- PopPUNK.web.calc_prevalence(cluster, cluster_list, num_samples)[source]¶
Cluster prevalences for Plotly.js