Quality control during NGS data analysis in Intelliseq workflows
Monika Opalek, PhD
July 26, 2023
Implementing Quality Control (QC) steps in Next-Generation Sequencing (NGS) data analysis pipelines helps ensure the reliability of the generated genomic information. NGS technologies produce vast amounts of data, making it crucial to identify and address potential issues early in the analysis workflow. QC involves a series of systematic tests and filtering steps aimed at assessing the quality of sequencing data and removing artifacts. Rigorous QC allows for confident identification of the genetic variants that should be passed to the downstream analysis stages.
All workflows developed by Intelliseq include quality control steps that are summarized using MultiQC software. MultiQC is a tool that aggregates the results from multiple quality control analyses across many samples into a single report. It summarizes numerical statistics and produces interactive plots that allow fast identification of global biases and trends. This article focuses on the quality control of raw sequencing reads (fastq files) and aligned reads (bam files). If you’re interested in how we identify and filter variants at later stages of the analysis, please see the articles describing the workflows that identify SNVs and structural variants.
Quality control of raw sequencing reads: The quality of each fastq file is assessed using the FastQC software. It provides a comprehensive analysis of the raw sequencing reads (fastq) to assess their quality and detect potential issues or biases that may arise from sequencing errors. FastQC provides the following information:
Sequence Counts - the total number of sequencing reads obtained from the sequencing run.
Sequence Quality Histograms - a graphical representation of the mean quality value at each base position across the sequencing reads. Quality is expressed as a Phred score, a logarithmic transformation of the probability that a base call is incorrect. Higher scores indicate higher confidence: for example, a score of 30 corresponds to a 1 in 1,000 chance of an incorrect base call, and a score of 50 to a 1 in 100,000 chance. The plot shows the overall quality of the sequencing data and reveals positions of low quality that might require trimming.
Per Sequence Quality Scores - the distribution of mean quality scores (Phred scores) per read, i.e. the number of reads at each mean quality value. This shows whether a subset of the reads is of poor quality.
Per Base Sequence Content - proportion of nucleotide bases (A, T, C, and G) at each position in the sequencing reads. It helps to identify any biases or anomalies in the nucleotide composition at specific positions, which could indicate potential sequencing or library preparation issues.
Per Sequence GC Content - calculates and visualizes the distribution of GC (guanine-cytosine) content across all reads. Typically, a roughly normal distribution of GC content is expected.
Per Base N Content - evaluates the percentage of N calls at each position. An N is reported at a given position when the sequencer is unable to make a base call with sufficient confidence.
Sequence Length Distribution - the distribution of read lengths in the dataset.
Overrepresented Sequences - identifies sequences that occur at unusually high frequencies, which could be contaminants or adapter/primer sequences that need to be removed.
Adapter Content - checks for the presence of adapter sequences in the data, which are commonly used in library preparation and may need to be trimmed.
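Several of the metrics above reduce to simple arithmetic on the fastq records. The following is a minimal Python sketch of that arithmetic, not FastQC's actual implementation. It assumes the Phred+33 quality encoding (the one used by Sanger and current Illumina fastq files), and all function names are hypothetical, chosen only for illustration.

```python
import math

# Assumption: qualities are encoded as Phred+33 ASCII characters.
PHRED_OFFSET = 33


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_qualities(quality_string):
    """Turn the fourth line of a fastq record into a list of Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def mean_quality_per_position(quality_strings):
    """Mean Phred score at each read position (reads may differ in length)."""
    max_len = max(len(qs) for qs in quality_strings)
    sums = [0] * max_len
    counts = [0] * max_len
    for qs in quality_strings:
        for i, q in enumerate(decode_qualities(qs)):
            sums[i] += q
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


def gc_fraction(sequence):
    """Fraction of G and C bases in a single read (0.0 for an empty read)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def n_fraction_per_position(sequences):
    """Fraction of N calls at each read position across all reads."""
    max_len = max(len(s) for s in sequences)
    n_counts = [0] * max_len
    totals = [0] * max_len
    for seq in sequences:
        for i, base in enumerate(seq.upper()):
            totals[i] += 1
            if base == "N":
                n_counts[i] += 1
    return [n / t for n, t in zip(n_counts, totals)]
```

For instance, phred_to_error_prob(30) gives the 1 in 1,000 error probability mentioned above, and gc_fraction applied to every read yields the values whose distribution is plotted in the Per Sequence GC Content module.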
Quality control of aligned reads (bam files): The quality of aligned reads is checked using Picard software, which provides the following information:
Alignment Summary - an overview of the alignment results for the sequencing reads, including statistics such as the number and percentage of aligned and unmapped reads as well as the overall alignment rate.
Mean Read Length - the mean length of the sequencing reads in the dataset. Note that this is the length of the sequenced reads, not of the underlying DNA fragments.
Insert Size - the length of the DNA fragment located between the two sequenced ends in paired-end sequencing, estimated from the positions of the two aligned mates. It reflects the size of the library fragments obtained after fragmentation and size selection, not the length of the original, unfragmented DNA.
Mark Duplicates - identifies and flags duplicate reads, i.e. reads that originate from the same DNA fragment, such as those generated by PCR amplification during library preparation or by optical artifacts on the sequencer.
HSMetrics [WES only] - provides various metrics and statistics related to the performance of the target enrichment process, which selectively captures the exome for sequencing.
Target Region Coverage [WES only] - the depth of coverage, i.e. the number of sequencing reads that align to the specific target regions (exons) of interest in Whole Exome Sequencing.
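HS Penalty [WES only] - a measure of the efficiency of target enrichment in Whole Exome Sequencing: the fold over-sequencing needed to raise 80% of the target bases to a given coverage depth (for example 20X). Lower values indicate more efficient and more uniform capture.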
WGS Coverage [WGS only] - the depth of coverage, i.e. the number of sequencing reads that align to each position in the whole genome.
WGS Filtered Bases [WGS only] - the number of bases that were excluded from the whole genome coverage calculation for reasons such as low mapping quality, duplicated reads, low base quality, or overlapping inserts.
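Picard writes each of these metrics to a tab-separated text file in which the table follows a "## METRICS CLASS" line and ends at the first blank line. The following Python sketch shows one way to read such a table. It is only an illustration and not part of the Intelliseq workflows: the function name is hypothetical, and the embedded example uses a reduced set of MarkDuplicates columns with made-up values.

```python
def parse_picard_metrics(text):
    """Return the rows of the METRICS table as a list of dicts (values kept as strings)."""
    rows = []
    header = None
    in_metrics = False
    for line in text.splitlines():
        if line.startswith("## METRICS CLASS"):
            in_metrics = True
            continue
        if not in_metrics:
            continue
        if not line.strip():
            if header is None:
                continue  # tolerate a blank line before the column header
            break  # a blank line after the table ends the metrics section
        if line.startswith("#"):
            break  # next section (for example a histogram) starts
        fields = line.split("\t")
        if header is None:
            header = fields
        else:
            rows.append(dict(zip(header, fields)))
    return rows


# Illustrative input only: shortened column list and made-up values.
EXAMPLE = "\n".join([
    "## htsjdk.samtools.metrics.StringHeader",
    "# MarkDuplicates (command line omitted in this example)",
    "",
    "## METRICS CLASS\tpicard.sam.DuplicationMetrics",
    "\t".join(["LIBRARY", "READ_PAIRS_EXAMINED",
               "READ_PAIR_DUPLICATES", "PERCENT_DUPLICATION"]),
    "\t".join(["lib1", "1000", "50", "0.05"]),
    "",
    "## HISTOGRAM\tjava.lang.Double",
    "BIN\tVALUE",
    "1.0\t1.0",
])

rows = parse_picard_metrics(EXAMPLE)
duplication = float(rows[0]["PERCENT_DUPLICATION"])
print(f"{rows[0]['LIBRARY']}: {duplication:.1%} duplicated reads")
```

Values are kept as strings because different Picard tools report different column types, so the caller converts only the fields it needs. In practice MultiQC performs this parsing itself; the sketch is meant only to show what the underlying files contain.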