Rosella provides the user with many different outputs for each genome, and some of the information in those outputs requires explanation.
A FASTA file is a text-based file used to represent either nucleotide sequences or amino acid sequences. It consists of
headers (lines starting with >) and blocks of sequence immediately following the headers. FASTA files are the format
used for the input references/MAGs that Rosella uses. The extension for such files is usually .fasta, .fa, or .fna.
For more info refer to the Wikipedia article on the FASTA format.
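The header/sequence layout described above can be illustrated with a small reader. This is a minimal sketch in Python, not how Rosella itself parses its input; the function name `read_fasta` and the example records are made up for illustration.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence blocks may span several lines; collect them all.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record FASTA content, held in memory for the example.
example = io.StringIO(">contig_1 example\nACGTAC\nGTTA\n>contig_2\nGGCC\n")
records = list(read_fasta(example))
```

Each record's sequence is joined across its lines, so wrapped and unwrapped FASTA files produce the same result.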
FASTQ is the file format used to store the reads produced by sequencing. The sequences present in FASTQ files represent short
stretches of genomic DNA. FASTQ files are used to build assemblies, bin MAGs, calculate genomic coverage, and so on. You can provide
both paired-end and unpaired reads to Rosella, as well as short and long reads from a variety of different sequencing platforms.
The file extension for FASTQ files is generally .fastq, but often they have been compressed, so the extension ends in .gz.
Compressed FASTQ files are accepted as input to Rosella, so you do not have to uncompress them.
For more info refer to the Wikipedia article on the FASTQ format.
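To make the record layout and the role of the .gz extension concrete, here is a minimal Python sketch that reads four-line FASTQ records, optionally from a gzip-compressed source. It assumes each record occupies exactly four lines (the common case); `open_reads` and `read_fastq` are hypothetical helper names and are not part of Rosella.

```python
import gzip
import io

def open_reads(path):
    """Open a FASTQ file as text, decompressing if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        header = header.rstrip()
        if not header:
            continue  # skip stray blank lines
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip()
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], seq, qual

# Hypothetical reads, compressed in memory to mimic a .fastq.gz file.
raw = "@read_1\nACGT\n+\nIIII\n@read_2\nGGCA\n+\nFFFF\n"
compressed = io.BytesIO(gzip.compress(raw.encode("ascii")))
with gzip.open(compressed, "rt") as handle:
    reads = list(read_fastq(handle))
```

The quality string has one character per base, which is why the sketch rejects records where the two lengths differ.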
BAM and SAM (Sequence Alignment/Map) files are the standard format for recording the alignment start, end, and quality
of reads (FASTQ files) mapped to reference sequences (FASTA files). BAM is the binary form of SAM and so cannot be read as plain text.
When performing read mapping, the output from the alignment tool will most likely be SAM or BAM files. When supplied with raw reads,
Rosella produces BAM files, which can be stored using the --bam-file-cache-directory argument.
For more info refer to the SAM specification.
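As an illustration of how alignment start, end, and quality can be read from a SAM record, the Python sketch below parses one tab-separated line. The start (POS) and mapping quality (MAPQ) are stored directly, while the end is derived from POS and the CIGAR string by summing the operations that consume reference bases (M, D, N, = and X). The function name `alignment_span` and the example record are hypothetical, and the sketch assumes a mapped read.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")

def alignment_span(sam_line):
    """Return (reference name, 1-based start, 1-based end, MAPQ) for a mapped read."""
    fields = sam_line.rstrip("\n").split("\t")
    rname = fields[2]       # RNAME: reference (contig) name
    pos = int(fields[3])    # POS: 1-based leftmost mapping position
    mapq = int(fields[4])   # MAPQ: mapping quality
    cigar = fields[5]       # CIGAR: alignment operations
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return rname, pos, pos + ref_len - 1, mapq

# Hypothetical alignment record with the 11 mandatory SAM fields.
line = "read_1\t0\tcontig_1\t101\t60\t5S20M2I10M3D8M\t*\t0\t0\t*\t*"
span = alignment_span(line)
```

In this example the soft clip (5S) and insertion (2I) do not consume reference bases, so the alignment covers 20 + 10 + 3 + 8 = 41 reference positions, from 101 to 141.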
Binning refers to the process of taking a metagenomic assembly and using certain statistics to cluster segments of the assembly into candidate genomes. These statistics include contig coverage and tetranucleotide frequency.
An assembly is the set of contigs generated by piecing the sequenced reads within a sample back together. See SPAdes.
A contig is a contiguous stretch of DNA contained within an assembly.
Metagenome-assembled genomes (MAGs) are sets of one or more contigs that represent candidate species genomes. They are generated from metagenome assemblies via binning. The quality of a MAG is assessed with marker-based tools such as CheckM.
Coverage is an estimate of how deeply sequenced a particular genome or contig is. It is calculated from the read mappings across a specific genomic interval, as the mean number of reads covering each position in that interval. There are a number of different ways to calculate coverage, all of which are covered by CoverM.
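The mean-depth definition above can be written out as a short calculation. This Python sketch is only an illustration of that definition under simple assumptions (each alignment is an ungapped 1-based inclusive interval); it is not CoverM's or Rosella's implementation, and `mean_coverage` is a hypothetical name.

```python
def mean_coverage(alignments, start, end):
    """Mean per-base read depth over the 1-based inclusive interval [start, end].

    `alignments` is an iterable of (aln_start, aln_end) pairs, also 1-based inclusive.
    """
    length = end - start + 1
    depth = [0] * length
    for aln_start, aln_end in alignments:
        # Clip each alignment to the interval of interest.
        lo = max(aln_start, start)
        hi = min(aln_end, end)
        for position in range(lo, hi + 1):
            depth[position - start] += 1
    return sum(depth) / length

# Three hypothetical reads mapped to a 20 bp contig.
alignments = [(1, 10), (5, 14), (8, 20)]
cov = mean_coverage(alignments, 1, 20)
```

Here the reads cover 10, 10 and 13 positions respectively, giving 33 covered bases over 20 positions, a mean coverage of 1.65.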
Tetranucleotide frequency (TNF) is the frequency of each k-mer of size 4 across a given contig. The occurrences of each 4-mer are tallied for the contig, and each count is then divided by the total number of 4-mers observed on the contig to produce a frequency.
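The tally-then-normalise procedure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming windows containing characters other than A, C, G and T are skipped and that reverse-complement k-mers are counted separately; binning tools often merge reverse complements, and this section does not state which convention Rosella uses. The function name is hypothetical.

```python
from collections import Counter
from itertools import product

def tetranucleotide_frequency(sequence):
    """Return a dict mapping each of the 256 possible 4-mers to its frequency."""
    sequence = sequence.upper()
    counts = Counter()
    # Slide a window of size 4 along the contig, one base at a time.
    for i in range(len(sequence) - 3):
        kmer = sequence[i:i + 4]
        if set(kmer) <= set("ACGT"):  # skip windows with N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ("".join(p) for p in product("ACGT", repeat=4))
    # Divide each count by the total number of observed 4-mers.
    return {k: (counts[k] / total if total else 0.0) for k in all_kmers}

# Hypothetical 10 bp contig containing 7 overlapping 4-mers.
tnf = tetranucleotide_frequency("ACGTACGTAC")
```

For this example, ACGT, CGTA and GTAC each occur twice and TACG once, so their frequencies are 2/7, 2/7, 2/7 and 1/7, and every other 4-mer has frequency 0.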
Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that produces a low-dimensional representation of a high-dimensional manifold. It aims to preserve both the global and local topological structure of the high-dimensional manifold when projecting into low-dimensional space. See UMAP.
Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a hierarchical clustering algorithm and then extracting a flat clustering based on the stability of clusters.