Identification of SNVs (Single Nucleotide Variants) and small indels (insertions and deletions) with iFlow

Monika Opalek, PhD

The most common type of genetic variation in the human genome is the single nucleotide variant (SNV). SNVs include substitutions of a single nucleotide (transitions and transversions) as well as insertions or deletions of a single nucleotide within a DNA sequence.

All of us inherit multiple single nucleotide variants (SNVs) from our parents. Additionally, de novo mutations can arise during our lifetime as a result of internal cellular damage or environmental factors, such as UV radiation. SNVs can occur in both coding and non-coding regions of the genome, and their impact on phenotype and disease varies with their location and functional consequences. While the majority of variants have no phenotypic effect, any of us may still carry pathogenic variants. Therefore, the examination of SNVs is particularly important in the analysis of hereditary disorders, including rare diseases.

Diagnosis of rare diseases is particularly challenging, not only because specific symptoms may be lacking but also because different genetic mutations can result in similar medical problems. Genetic diagnosis therefore plays a crucial role in providing tailored treatment. Identifying the precise genetic variants that cause the illness enables clinicians to develop treatments suited to the individual needs of the patient, ultimately leading to personalized healthcare plans. Considering the patient's unique genetic makeup increases the probability of a successful outcome.

iFlow allows for high-resolution identification of genetic variants. Below, we present how SNVs are analyzed on the iFlow platform, using the hereditary disorder workflows as an example.

1. Alignment and BAM file processing

Reads are aligned with BWA-MEM to the human reference genome GRCh38 without alternative loci.

Mapped reads are then screened for contamination. Reads with a high fraction of mismatched bases (differences between the sequenced base and the base expected from the reference genome) and gap openings (insertions or deletions) are removed. A fraction greater than 0.1 (10%) suggests potential contamination or sequencing/alignment artifacts. For example, a 100-base read exceeds this threshold if it contains more than 10 such differences.
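The threshold described above can be sketched in a few lines of Python. This is a minimal illustration, not the iFlow implementation: the function names are hypothetical, and the mismatch and gap-opening counts are assumed to have been extracted from the alignment beforehand.

```python
def error_fraction(mismatches: int, gap_openings: int, read_length: int) -> float:
    """Fraction of mismatched bases plus gap openings relative to the read length."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return (mismatches + gap_openings) / read_length


def is_suspect_read(mismatches: int, gap_openings: int, read_length: int,
                    max_fraction: float = 0.1) -> bool:
    """True when the error fraction is strictly greater than the threshold."""
    return error_fraction(mismatches, gap_openings, read_length) > max_fraction
```

With these definitions, a 100-base read with 8 mismatches and 2 gap openings sits exactly at the threshold and is kept, while one more difference would cause it to be removed.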

Very short reads with soft-clipped bases at both ends are also removed. Soft-clipping occurs when the alignment algorithm leaves a portion of a read unaligned because it does not match the reference sequence. Such reads are considered less reliable and may introduce noise or bias into downstream analyses.
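As an illustration, soft clips at both ends can be detected from a read's CIGAR string. The sketch below is an assumption-laden example rather than the filter used by iFlow: the function names are hypothetical, "very short" is interpreted as the number of aligned read bases, and the cut-off of 30 bases is an arbitrary placeholder.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume read bases and represent aligned (not clipped) sequence.
ALIGNED_READ_OPS = {"M", "I", "=", "X"}


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string such as '5S20M5S' into (length, operation) pairs."""
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def soft_clipped_both_ends(cigar: str) -> bool:
    """True if the read starts and ends with a soft clip (hard clips are ignored)."""
    ops = [op for _, op in parse_cigar(cigar) if op != "H"]
    return len(ops) >= 3 and ops[0] == "S" and ops[-1] == "S"


def aligned_length(cigar: str) -> int:
    """Number of read bases that take part in the alignment."""
    return sum(length for length, op in parse_cigar(cigar) if op in ALIGNED_READ_OPS)


def should_remove(cigar: str, min_aligned_length: int = 30) -> bool:
    """Hypothetical rule: clipped at both ends AND fewer aligned bases than the cut-off."""
    return soft_clipped_both_ends(cigar) and aligned_length(cigar) < min_aligned_length
```

For example, `should_remove("5S20M5S")` flags the read, while a read with CIGAR `5S90M5S` is kept because enough of it is aligned.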

During DNA sequencing, it is possible to obtain multiple identical or near-identical reads that arise from technical artifacts rather than from independent DNA fragments. Therefore, GATK MarkDuplicates is used to identify and flag reads that are likely duplicates.
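The idea behind duplicate marking can be sketched as follows. This is a deliberately simplified, single-end illustration and not the MarkDuplicates algorithm itself: the `Read` class and `mark_duplicates` function are hypothetical, reads are grouped by chromosome, 5' position and strand, and the read with the highest sum of base qualities is assumed to be the one kept as the representative.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    chrom: str
    five_prime_pos: int          # 5' alignment position (assumed to be clipping-corrected)
    reverse: bool                # True if aligned to the reverse strand
    base_qualities: list[int]
    is_duplicate: bool = False


def mark_duplicates(reads: list[Read]) -> list[Read]:
    """Flag all but the best-scoring read in each group of reads sharing a start and strand."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.five_prime_pos, read.reverse)].append(read)

    for group in groups.values():
        # Keep the read with the highest summed base quality as the representative.
        best = max(group, key=lambda r: sum(r.base_qualities))
        for read in group:
            read.is_duplicate = read is not best
    return reads
```

Note that duplicates are flagged rather than deleted, so downstream tools can decide whether to ignore them.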

The filtering of mapped reads improves the overall quality and accuracy of the NGS data, which ensures that subsequent analysis steps are performed on more reliable reads.

In addition, for WES and NGS panel sequencing data, the GATK BaseRecalibrator and ApplyBQSR tools are used to adjust base qualities. This step improves the accuracy of base quality scores by modeling factors that introduce systematic errors, e.g. sequencing machine artifacts or biases.

2. Short variant calling and removal of spurious variants

Variant calling is the step in which aligned reads from a sample are compared to the reference genome and the differences (variants) are identified. GATK HaplotypeCaller with the -ERC GVCF option is used for variant calling, producing a GVCF file. The GVCF (Genomic Variant Call Format) file contains information about each genomic position, including the likelihoods of each possible genotype, quality scores, and coverage depth. Next, single-sample genotyping is performed with the GATK GenotypeGVCFs tool. The resulting VCF is then hard-filtered according to the recommendations of the Broad Institute [1]. In particular, SNPs are removed if any of the following conditions is true (see the detailed description of the parameters at the end of the article):

  • Variant Quality (QUAL) < 30 
  • QualByDepth (QD) < 2.0 
  • FisherStrand (FS) > 60.0
  • RMSMappingQuality (MQ) < 40.0 
  • MappingQualityRankSumTest (MQRankSum) < -12.5 
  • ReadPosRankSumTest (ReadPosRankSum) < -8.0

The thresholds differ for indels. In particular, indels are removed if any of the following conditions is true (see the detailed description of the parameters at the end of the article):

  • QualByDepth (QD) < 2.0
  • FisherStrand (FS) > 200.0
  • ReadPosRankSumTest (ReadPosRankSum) < -20.0
  • StrandOddsRatio (SOR) > 10.0
  • Variant Quality (QUAL) < 30.0
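The two sets of thresholds can be expressed as simple rule tables. The sketch below is only an illustration of the "remove if any condition is true" logic, with hypothetical names (`SNP_FILTERS`, `INDEL_FILTERS`, `failed_filters`); it also assumes that an annotation missing from a variant record does not cause the variant to fail.

```python
import operator

# (annotation, comparison, threshold): a variant is removed if ANY rule evaluates to True.
SNP_FILTERS = [
    ("QUAL", operator.lt, 30.0),
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 60.0),
    ("MQ", operator.lt, 40.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]

INDEL_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("FS", operator.gt, 200.0),
    ("ReadPosRankSum", operator.lt, -20.0),
    ("SOR", operator.gt, 10.0),
    ("QUAL", operator.lt, 30.0),
]


def failed_filters(annotations: dict, variant_type: str) -> list[str]:
    """Return the names of the annotations that trigger removal (empty list = variant passes)."""
    if variant_type == "SNP":
        rules = SNP_FILTERS
    elif variant_type == "INDEL":
        rules = INDEL_FILTERS
    else:
        raise ValueError(f"unknown variant type: {variant_type}")

    failed = []
    for name, compare, threshold in rules:
        value = annotations.get(name)
        # Assumption: a missing annotation is treated as not failing the rule.
        if value is not None and compare(value, threshold):
            failed.append(name)
    return failed
```

The same strand-bias value can lead to different outcomes depending on the variant type: an FS of 100 fails an SNP but passes an indel.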

Mitochondrial variants are called and filtered separately with GATK Mutect2 and FilterMutectCalls, both run in mitochondrial mode.

3. Short variant annotation and annotation-based filtering

Variant pathogenicity is assessed using information from the following databases and tools: gnomAD v2.1 and v3 (frequencies, coverage, constraint), 1000Genomes (frequencies), MITOMAP (frequencies, associated diseases), ClinVar (associated diseases, pathogenicity), HPO (inheritance mode, associated phenotypes and diseases), UCSC (repeats, PHAST conservation scores), SIFT4G (constraint), SnpEff, VEP and LOFTEE (predicted impact on gene product), dbSNP (rsID), Ensembl (gene and transcript information), COSMIC (somatic mutations data), dbNSFP (in silico pathogenicity predictors), dbscSNV (abnormal splicing predictors), UniProt and NextProt (effect of a mutation on the protein), CIViC (somatic mutations data).

4. ACMG Classification and report generation

Annotated variants, for genes that are likely to contribute to the patient phenotype and/or for genes from the user-defined gene list/panels, are then classified and prioritized according to the ACMG criteria. A detailed description of the method used for variant classification can be found on the Intelliseq website.

In short, the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) categorize genetic variants in terms of the strength of their pathogenic or benign impact on human phenotype [2-4]. The following variant characteristics are taken into account:

  • the impact of a given variant on a particular gene or its protein product, i.e. the mutation type (e.g. loss of function, missense), based on SnpEff, Ensembl VEP, and LOFTEE annotations
  • whether the detected type of mutation is a validated mechanism of disease for the affected gene (based on ClinVar statistics and gnomAD constraint data)
  • computational predictions for the mutated site (dbNSFP conservation and functional scores)
  • whether the variant disrupts conserved splicing motifs (dbscSNV)
  • ClinVar database information about the pathogenicity of the same or similar variants
  • UniProt database information about mutation effects indicated in functional studies
  • whether the variant lies within a region essential for protein function (UniProt database)
  • whether the affected gene is likely to contribute to the patient phenotype (based on the Human Phenotype Ontology database)

These characteristics are used to score variant pathogenicity and assign ACMG minor categories. The final ACMG score is a weighted sum of the gained subcategory points, with the negative weights being used for the benign and positive ones for the pathogenic subcategories. The absolute values of the applied weights correspond to the given subcategory importance. The final score is used to classify variants into one of the major categories: pathogenic, likely pathogenic, benign, likely benign, or of uncertain significance. In addition, each variant is checked for its inheritance pattern taking into account the zygosity. 
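The weighted-sum scoring can be illustrated with a short sketch. The article does not list the actual weights or cut-offs, so every number below is a placeholder chosen for illustration only; the criterion codes follow ACMG/AMP naming, and the names `WEIGHTS`, `acmg_score` and `classify` are hypothetical.

```python
# Placeholder weights: positive for pathogenic, negative for benign subcategories.
# These values are illustrative assumptions, NOT the weights used by iFlow.
WEIGHTS = {
    "PVS1": 8.0, "PS1": 4.0, "PM2": 2.0, "PP3": 1.0,
    "BP4": -1.0, "BS1": -4.0, "BA1": -8.0,
}


def acmg_score(points: dict[str, float]) -> float:
    """Weighted sum of the points gained in each subcategory."""
    return sum(WEIGHTS[criterion] * gained for criterion, gained in points.items())


def classify(score: float) -> str:
    """Map the final score to a major category using placeholder cut-offs."""
    if score >= 10:
        return "pathogenic"
    if score >= 6:
        return "likely pathogenic"
    if score <= -7:
        return "benign"
    if score <= -1:
        return "likely benign"
    return "uncertain significance"
```

For instance, `classify(acmg_score({"PVS1": 1, "PM2": 1}))` yields "pathogenic" under these placeholder values, while conflicting evidence such as `{"PM2": 1, "PP3": 1, "BP4": 1}` stays in the uncertain range because the benign weight partly cancels the pathogenic ones.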

By default, the main report shows a maximum of 50 variants (together with a potential second variant from a compound heterozygote). These include all variants classified in the ClinVar database as pathogenic or likely pathogenic, along with the variants that gained the highest pathogenicity scores in the ACMG classification. This selection can also be modified manually with the Manual Filtering option.
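One way to picture this default selection is shown below. The sketch rests on assumptions the article does not confirm: variants are plain dictionaries with hypothetical `clinvar` and `acmg_score` keys, ClinVar pathogenic and likely pathogenic variants are always kept, the remaining slots are filled by descending ACMG score, and compound heterozygote partners are not modeled.

```python
CLINVAR_REPORTABLE = {"pathogenic", "likely pathogenic"}


def select_report_variants(variants: list[dict], max_variants: int = 50) -> list[dict]:
    """Keep ClinVar (likely) pathogenic variants, then fill up with the top-scoring others."""
    clinvar_hits, others = [], []
    for variant in variants:
        if variant.get("clinvar") in CLINVAR_REPORTABLE:
            clinvar_hits.append(variant)
        else:
            others.append(variant)

    # Highest ACMG score first.
    others.sort(key=lambda v: v["acmg_score"], reverse=True)
    free_slots = max(0, max_variants - len(clinvar_hits))
    return clinvar_hits + others[:free_slots]
```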

Apart from single-nucleotide variations, mutations can include large-scale genome rearrangements. If you’re interested in how we analyze structural variants (SVs) with the iFlow workflows, check the article.

We encourage you to sign up for a demo with one of our highly knowledgeable bioinformatics experts. Don't hesitate to contact us if you are interested in testing the iFlow platform.

Parameters description

  • Variant Quality (QUAL) - Variant Quality (QUAL) represents an estimate of the confidence of the variant call at a specific genomic position. The QUAL score is typically reported as a Phred-scaled quality score, which is a logarithmic transformation of the probability that the variant call is incorrect. Higher scores indicate higher confidence. For example, a QUAL score of 30 corresponds to a 1 in 1,000 chance of a false positive, and a score of 50 corresponds to a 1 in 100,000 chance of a false positive.
  • QualByDepth (QD) - QualByDepth (QD) is a normalized measure of the quality of the variant call relative to the depth of coverage. It is calculated by dividing the quality score (QUAL) of the variant by the depth of coverage (DP) at that position. It helps to evaluate if the variant call is supported by a sufficient number of high-quality reads or if it may be influenced by low coverage or sequencing artifacts. A higher QD value indicates a higher quality variant call, as it indicates that the variant is supported by a greater number of high-quality reads relative to the depth of coverage.
  • FisherStrand (FS) - The FisherStrand test examines whether there is an imbalance in the distribution of reference and alternate alleles between the forward strand and the reverse strand. A high FS value indicates a significant imbalance, suggesting potential strand bias, while a low FS value indicates a more balanced distribution. Strand bias can result from biases or artifacts in the sequencing or mapping process, but it can also reflect true biological phenomena.
  • RMSMappingQuality (MQ) - RMSMappingQuality (root mean square mapping quality) assesses the average mapping quality of the reads aligned to a specific genomic position. Mapping quality expresses the confidence that a read is aligned to its true position in the reference genome, so a higher RMSMappingQuality value indicates higher confidence.
  • MappingQualityRankSumTest (MQRankSum) -  MappingQualityRankSumTest is a statistical test used to assess the difference in mapping qualities between reads supporting the reference allele and reads supporting the alternate allele at a specific genomic position. A positive MQRankSum score suggests that the alternate-supporting reads have higher mapping qualities compared to the reference-supporting reads, while a negative MQRankSum score suggests that the reference-supporting reads have higher mapping qualities compared to the alternate-supporting reads.
  • ReadPosRankSumTest (ReadPosRankSum) - The ReadPosRankSum test examines the position, within the reads, of the bases supporting the reference allele and the alternate allele at a variant site. It evaluates whether the position of the alternate allele in reads is significantly different from the position of the reference allele, allowing for identification of potential sequencing or alignment artifacts.
  • StrandOddsRatio (SOR) - The StrandOddsRatio (SOR) is a statistical metric used to assess the balance between forward and reverse reads supporting the reference allele and the alternate allele at a given variant site. Strand bias can be indicative of sequencing or alignment artifacts, where one strand is more prone to errors than the other. The SOR helps to identify potential false positive variant calls.
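The two arithmetic definitions in this list, the Phred scale behind QUAL and the QUAL-to-depth ratio behind QD, can be checked with a few lines of Python. The function names are hypothetical, and the QD calculation follows the simplified definition given above (QUAL divided by DP).

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that the call is wrong for a Phred-scaled score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Inverse transformation: Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


def qual_by_depth(qual: float, depth: int) -> float:
    """QD under the simplified definition used here: QUAL normalized by coverage depth."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return qual / depth
```

A QUAL of 30 gives an error probability of about 0.001 (1 in 1,000) and a QUAL of 50 about 0.00001 (1 in 100,000), matching the examples above. A variant with QUAL 60 at depth 30 has a QD of 2.0, which is right at the hard-filtering threshold and is kept, because only values below 2.0 are removed.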

References:

[1] https://gatk.broadinstitute.org/hc/en-us/articles/360035890471-Hard-filtering-germline-short-variants

[2] S. Richards et al., Standards and Guidelines for the Interpretation of Sequence Variants: A Joint Consensus Recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology, Genet. Med. Off. J. Am. Coll. Med. Genet., vol. 17, no. 5, pp. 405-424, May 2015.

[3] A. N. Abou Tayoun et al., Recommendations for interpreting the loss of function PVS1 ACMG/AMP variant criterion, Hum. Mutat., vol. 39, no. 11, pp. 1517-1524, 2018, Online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4544753/

[4] S. V. Tavtigian et al., Modeling the ACMG/AMP Variant Classification Guidelines as a Bayesian Classification Framework, Genet. Med. Off. J. Am. Coll. Med. Genet., vol. 20, no. 9, pp. 1054-1060, Sep. 2018.

Want to know more?

Get in touch with us.