Through use of the -f, --features-vcf
argument it is possible to provide lorikeet with a list of known variants or variants of concern that will be forcibly called by
the algorithm if there is any activity occurring at that genomic location in the provided samples. It is possible that lorikeet
will also call any potential variation events surrounding a given location as well. This can be useful if you are more
interested in the activity occurring around some given locations rather than specific variation events.
The list of variants must be provided in VCF format with some caveats on how the variant locations are written. E.g.
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##source=lorikeet-v0.7.0
##contig=<ID=random10000~random_sequence_length_10000_1,length=10000>
#CHROM POS ID REF ALT QUAL FILTER
random10000~random_sequence_length_10000_1 223 . G C 1445.63 .
random10000~random_sequence_length_10000_1 435 . G C 1941.63 .
random10000~random_sequence_length_10000_1 949 . T A 383.629 .
Note that you need to specific both the genome name and contig name in the CHROM
column separated by the ~
character
e.g. random10000~random_sequence_length_10000_1
. Additionally, the standard FORMAT
and INFO
columns are optional when
provding a VCF file.
Finally, the provided VCF file must be compressed using bgzip
and indexed using bcftools index
e.g.
bgzip -c random10000.vcf > random10000.vcf;
bcftools index random10000.vcf.gz;
lorikeet call -r random10000.fna -1 forward_reads.fastq -2 reverse_reads.fastq -l longread.bam -f random10000.vcf.gz
The ANI tables that Lorikeet produces can feel overwhelming at first, but luckily they can be fairly easily visualized into interpretable and digestable heatmaps. The following is going to be a short walkthrough on the processes I have taken previously to generate heatmap results for a singular novel MAG.
Imagine we've used SingleM to search the SRA for hits for our novel organism and then used Kingfisher to download all of the samples where our organism is apparently appearing in excess of 5 times coverage and contains 10 or more gene hits. We would then pass all of these samples to Lorikeet with our novel MAG to generate the variants, ANI tables, and optionally the strain genotypes as well. If we just wanted the variants and ANI tables we'd use this single command:
lorikeet call -r novel_mag.fna -c sra_samples/*_[12].fastq* -t 30 --bam-file-cache-directory bams/ -o lorikeet_out/
Since this example involves mapping over 90 samples to our reference genome, I've chosen to use the --bam-file-cache-directory
flag so that the read mappins are saved and we can reuse them if we experience a crash or wish to change the settings. If we
wanted to rerun Lorikeet using the BAM files generate from the mappings then we'd just change our command to:
lorikeet call -r novel_mag.fna -b bams/short/*.bam -t 30 -o lorikeet_out/
Once Lorikeet has finished up, we should have an output folder filled with ANI tables in lorikeet_out/novel_mag/
. We then
want to use some R libraries like data.table
, ggplot2
, and plotly
to visualize what the results look like.
A simple example of this would look like:
library(data.table)
library(ggplot2)
library(plotly)
pop_ani <- fread("lorikeet_out/novel_mag/novel_mag_population_ani.tsv", skip="SampleID",) # skip here so we skip all the contig header info
con_ani <- fread("lorikeet_out/novel_mag/novel_mag_consensus_ani.tsv", skip="SampleID")
sub_pop_ani <- fread("lorikeet_out/novel_mag/novel_mag_population_ani.tsv", skip="SampleID",)
## An example using consensus ANI
# sample_labels
con_ani[, SampleID:=factor(SampleID)]
tmp <-heatmaply(con_ani[, -1, with=FALSE],
xlab = "Sample Name",
ylab = "Sample Name",
file = "con_ani_heatmap.html",
k_row = 1, k_col = 1, plot_method="ggplot")
This will produce a HTML
file using plotly
that should look something like heatmap where
dark cells indicate more distant ANI values and lighter cells represent closely related samples. This example only explores
the use of consensus ANI, but we strongly recommend using both population and subpopulation ANI to assess how your communities
are differing.
Powered by Doctave