Concepts

Rosella produces many different outputs for each genome, and some of the information in those outputs requires background knowledge to interpret. This page explains the file types and concepts involved.

File types

FASTA

A FASTA file is a text-based file used to represent either nucleotide sequences or amino acid sequences. It consists of headers (lines starting with >) and blocks of sequence immediately following each header. FASTA is the format used for the input reference/MAGs that Rosella uses. The extension for such files is usually .fasta, .fa, or .fna.

For more info refer to the Wikipedia article.
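As an illustration of the header/sequence layout described above, here is a minimal sketch of a FASTA reader in Python. The function name `parse_fasta` and the example contig names are made up for this illustration and are not part of Rosella.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    chunks: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: use the first word after ">" as the identifier.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            chunks[current_id] = []
        elif current_id is not None:
            # Sequence line: a record's sequence may span several lines.
            chunks[current_id].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


# Hypothetical two-contig example.
example = """>contig_1 example description
ACGTACGT
ACGT
>contig_2
TTGGCCAA
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

The sketch keeps every sequence in memory, which is fine for a small example but may not suit very large assemblies.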

FASTQ

FASTQ is the file format used to store the data produced by sequencing. The sequences in a FASTQ file represent short stretches of genomic DNA. FASTQ files are used to build assemblies, bin MAGs, calculate genomic coverage, and so on. You can provide both paired-end and unpaired reads to Rosella, as well as short and long reads from a variety of sequencing platforms. The file extension for FASTQ files is generally .fastq, but the files are often compressed, in which case the extension ends in .gz. Rosella accepts compressed FASTQ files as input, so you do not have to decompress them.

For more info refer to the Wikipedia article.
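To make the format concrete, the following sketch reads four-line FASTQ records in Python and opens .gz files transparently. It assumes the common layout in which each record occupies exactly four lines; the helper names `open_maybe_gzip` and `read_fastq` are hypothetical and unrelated to how Rosella reads its input.

```python
import gzip
import io
from typing import Iterator, TextIO, Tuple


def open_maybe_gzip(path: str) -> TextIO:
    """Open a text file, decompressing on the fly if the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the "+" separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        # The read identifier is the first word after "@".
        yield header[1:].split()[0], sequence, quality


# In-memory example with two made-up reads.
demo = io.StringIO("@read_1\nACGT\n+\nIIII\n@read_2 extra\nGGCCA\n+\nIIIII\n")
reads = list(read_fastq(demo))
for read_id, sequence, quality in reads:
    print(read_id, len(sequence))
```

For a file on disk, the same generator could be used as `read_fastq(open_maybe_gzip("sample.fastq.gz"))`, where the file name is a placeholder.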

BAM/SAM

BAM and SAM (Sequence Alignment/Map) files are the standard formats for recording how reads (FASTQ) align to reference sequences (FASTA), including the start, end, and quality of each alignment. BAM is the binary form of SAM, so BAM files cannot be read directly as plain text. The output of a read-mapping tool will most likely be a SAM or BAM file. When supplied with raw reads, Rosella produces BAM files, which can be stored using the --bam-file-cache-directory argument.

For more info refer to the SAM specification.
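The sketch below shows how the alignment start, end, and quality mentioned above can be recovered from a single text SAM line. It only covers the plain-text SAM form (not binary BAM) and the first six mandatory columns; `SamRecord`, `parse_sam_line`, and `reference_end` are illustrative names, and the example line is invented.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # read name
    flag: int   # bitwise flags
    rname: str  # reference (contig) name
    pos: int    # 1-based leftmost mapping position
    mapq: int   # mapping quality
    cigar: str  # CIGAR string describing the alignment


CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line: str) -> SamRecord:
    """Parse the first six mandatory columns of a SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 tab-separated fields")
    return SamRecord(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5])


def reference_end(record: SamRecord) -> int:
    """1-based inclusive end position of the alignment on the reference."""
    # Only the M, D, N, = and X operations consume reference bases.
    consumed = sum(int(n) for n, op in CIGAR_OP.findall(record.cigar) if op in "MDN=X")
    return record.pos + consumed - 1


def is_unmapped(record: SamRecord) -> bool:
    """Flag bit 0x4 marks a read that did not align."""
    return bool(record.flag & 0x4)


line = "read_1\t0\tcontig_1\t100\t60\t5M1I4M2D3M\t*\t0\t0\tACGTACGTACGTA\tIIIIIIIIIIIII"
record = parse_sam_line(line)
print(record.rname, record.pos, reference_end(record), record.mapq)
```

In practice a dedicated library or samtools would normally be used to read SAM and BAM files; the sketch is only meant to show what the columns contain.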

Binning

Binning is the process of taking a metagenomic assembly and using statistics calculated for each contig to cluster the contigs into candidate genomes. These statistics include contig coverage and tetranucleotide frequency.

Assembly

The set of contigs generated by joining the sequenced reads from a sample back together. See SPAdes.

Contig

A contiguous stretch of DNA contained within an assembly.

MAGs

Metagenome-assembled genomes (MAGs) are sets of one or more contigs that represent candidate species genomes. They are generated from metagenome assemblies via binning. The quality of a MAG is assessed with marker-based tools such as CheckM.

Coverage

Coverage is an estimate of how deeply a particular genome or contig has been sequenced. It is calculated by examining the read mappings across a specific genomic interval and taking the mean number of reads covering each position in that interval. There are a number of different ways to calculate coverage, all of which are available in CoverM.
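The mean-depth calculation described above can be sketched as follows. This is a simplified illustration and not CoverM's implementation: it assumes alignments are given as 0-based, half-open (start, end) intervals on a single contig, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple


def per_base_depth(contig_length: int,
                   alignments: Iterable[Tuple[int, int]]) -> List[int]:
    """Number of reads covering each position of a contig."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    diff = [0] * (contig_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, contig_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth = []
    running = 0
    for i in range(contig_length):
        running += diff[i]
        depth.append(running)
    return depth


def mean_coverage(contig_length: int,
                  alignments: Iterable[Tuple[int, int]]) -> float:
    """Mean number of reads covering each position of the contig."""
    if contig_length == 0:
        return 0.0
    return sum(per_base_depth(contig_length, alignments)) / contig_length


# Three made-up reads aligned to a contig of length 10.
alignments = [(0, 5), (3, 8), (6, 10)]
depth = per_base_depth(10, alignments)
coverage = mean_coverage(10, alignments)
print(depth)     # [1, 1, 1, 2, 2, 1, 2, 2, 1, 1]
print(coverage)  # 1.4
```

Here the three reads contribute 14 aligned bases over 10 positions, giving a mean coverage of 1.4.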

Tetranucleotide frequency

Tetranucleotide frequency (TNF) is the relative frequency of each k-mer of size 4 across a given contig. The occurrences of each k-mer are tallied for the contig, and each count is then divided by the total number of k-mers observed on the contig to produce a frequency.
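The tally-then-normalise procedure above can be sketched in a few lines of Python. The function name is hypothetical, and skipping k-mers that contain characters other than A, C, G, or T is an assumption of this sketch rather than documented Rosella behaviour.

```python
from collections import Counter
from itertools import product
from typing import Dict


def tetranucleotide_frequency(sequence: str, k: int = 4) -> Dict[str, float]:
    """Relative frequency of every k-mer (k=4 by default) in a sequence."""
    sequence = sequence.upper()
    counts: Counter = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: ignore k-mers with ambiguous bases such as N.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    total = sum(counts.values())
    # Report all 4**k possible k-mers, including those never observed.
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


tnf = tetranucleotide_frequency("ACGTACGTAC")
print(len(tnf))     # 256 possible tetranucleotides
print(tnf["ACGT"])  # 2 of the 7 observed k-mers
```

Binning tools often also merge each k-mer with its reverse complement before normalising; this sketch does not do that.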

UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimensionality reduction technique that produces a low-dimensional representation of a high-dimensional manifold. It aims to preserve both the global and local topological structure of that manifold when projecting into low-dimensional space. See UMAP.

HDBSCAN

Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a hierarchical clustering algorithm and then extracting a flat clustering based on the stability of clusters.