The most common type of genetic variation in the human genome is the single nucleotide variant (SNV). SNVs include a single nucleotide substitution (transition or transversion), insertion, or deletion within a DNA sequence.
All of us inherit multiple single nucleotide variants (SNVs) from our parents. Additionally, during our lifetime, de novo mutations can arise as a result of internal cellular damage or environmental factors, such as UV radiation. SNVs can occur in both coding and non-coding regions of the genome, and their impact on phenotype and disease varies with their location and functional consequences. While the majority of variants do not cause any phenotypic symptoms, any of us may still carry pathogenic variants. The examination of SNVs is therefore particularly important in the analysis of hereditary disorders, including rare diseases.
Diagnosis of rare diseases is particularly challenging, not only because specific symptoms may be lacking but also because different genetic mutations can result in similar medical problems. Genetic diagnosis therefore plays a crucial role in providing tailored treatment. Specifically, identifying the precise genetic variations that cause the illness enables clinicians to develop treatments suited to the individual needs of the patient, ultimately leading to personalized healthcare plans. Considering the patient's unique genetic makeup increases the probability of a successful outcome.
Reads are aligned with the bwa-mem software to the human reference genome GRCh38 (without alternative loci).
Mapped reads are then examined for contamination. This step removes reads with a high fraction of mismatched bases (differences between the base sequenced in a read and the base expected from the reference genome) and gap openings (insertions or deletions). A fraction greater than 0.1 (10%) suggests potential contamination or sequencing/alignment artifacts. For example, a 100-base read exceeds the 0.1 threshold if more than 10 such differences are identified in it.
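The threshold described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the code used in the workflow; the function names and the way the counts are passed in are assumptions made for the example.

```python
def error_fraction(mismatches: int, gap_openings: int, read_length: int) -> float:
    """Fraction of mismatches plus gap openings relative to read length."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return (mismatches + gap_openings) / read_length


def is_suspect_read(mismatches: int, gap_openings: int, read_length: int,
                    max_fraction: float = 0.1) -> bool:
    """True if the read exceeds the threshold and would be removed."""
    return error_fraction(mismatches, gap_openings, read_length) > max_fraction


# A 100-base read with exactly 10 differences is kept (0.1 is not > 0.1);
# with 11 differences it is removed.
kept = not is_suspect_read(mismatches=8, gap_openings=2, read_length=100)
removed = is_suspect_read(mismatches=9, gap_openings=2, read_length=100)
```

Note that the comparison is strict, which matches the "more than 10 such differences" wording for a 100-base read.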
Very short reads with soft-clipped bases at both ends are also removed. Soft-clipping is a process in which the alignment algorithm allows a portion of a read not to match the reference sequence. Very short reads with soft-clipped bases at both ends are considered less reliable, so they are removed because they may introduce noise or bias into downstream analyses.
During DNA sequencing, it is possible to obtain multiple identical or near-identical reads that arise from technical artifacts rather than from independent DNA fragments. Therefore, GATK MarkDuplicates is used to identify and flag reads that are likely duplicates.
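As a rough illustration, soft-clipping at both ends can be detected from the CIGAR string of an alignment, where the S operation marks soft-clipped bases. The sketch below is not the workflow's implementation: the article does not state what counts as "very short", so the min_aligned cutoff of 30 bases and the choice to measure the aligned portion of the read are assumptions, as are the function names.

```python
import re

# One CIGAR element: a length followed by an operation code (SAM specification).
CIGAR_ELEMENT = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string such as '10S80M10S' into (length, op) pairs."""
    return [(int(length), op) for length, op in CIGAR_ELEMENT.findall(cigar)]


def clipped_at_both_ends(cigar: str) -> bool:
    """True if the alignment starts and ends with soft-clipped bases."""
    # Hard clips (H) may sit outside the soft clips, so ignore them here.
    ops = [op for _, op in parse_cigar(cigar) if op != "H"]
    return len(ops) >= 3 and ops[0] == "S" and ops[-1] == "S"


def aligned_read_bases(cigar: str) -> int:
    """Number of read bases that take part in the alignment (M, =, X, I)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "M=XI")


def should_remove(cigar: str, min_aligned: int = 30) -> bool:
    """Hypothetical rule: clipped at both ends and short aligned portion."""
    return clipped_at_both_ends(cigar) and aligned_read_bases(cigar) < min_aligned
```

With these assumptions, should_remove("10S20M10S") flags the read, while "10S80M10S" (long aligned portion) and "20M10S" (clipped at one end only) are kept.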
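The idea behind duplicate flagging can be illustrated with a simplified sketch: reads that share the same chromosome, 5' position, and strand are treated as copies of one fragment, and all but the best-quality copy are flagged. This is a conceptual illustration only and not how GATK MarkDuplicates is implemented; the Read class and its fields are hypothetical, and real tools also handle read pairs, clipping, and optical duplicates.

```python
from dataclasses import dataclass


@dataclass
class Read:
    # Hypothetical minimal read record for the example.
    name: str
    chrom: str
    five_prime_pos: int
    reverse: bool
    quality_sum: int
    duplicate: bool = False


def flag_duplicates(reads: list[Read]) -> list[Read]:
    """Flag all but the highest-quality read at each (chrom, pos, strand)."""
    best: dict[tuple[str, int, bool], Read] = {}
    for read in reads:
        key = (read.chrom, read.five_prime_pos, read.reverse)
        if key not in best or read.quality_sum > best[key].quality_sum:
            best[key] = read
    for read in reads:
        key = (read.chrom, read.five_prime_pos, read.reverse)
        # Reads are flagged, not deleted, so downstream tools can decide.
        read.duplicate = best[key] is not read
    return reads


reads = flag_duplicates([
    Read("a", "chr1", 100, False, 900),
    Read("b", "chr1", 100, False, 950),  # same position and strand as "a"
    Read("c", "chr1", 100, True, 800),   # opposite strand, separate group
])
```

Here read "a" is flagged as a duplicate of "b", while "c" is left unflagged because it lies on the opposite strand.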
The filtering of mapped reads improves the overall quality and accuracy of the NGS data, which ensures that subsequent analysis steps are performed on more reliable reads.
In addition, for WES and NGS panel sequencing data, the GATK BaseRecalibrator and ApplyBQSR tools are used to adjust base qualities. During this step, the BaseRecalibrator tool improves the accuracy of base quality scores by taking into account factors that can introduce systematic errors, e.g. sequencing machine artifacts or biases.
Variant calling is the step in which aligned reads obtained from a sample are compared to the reference genome and the differences (variants) are identified. The GATK HaplotypeCaller with the -ERC GVCF option is used for the variant calling step, resulting in a GVCF file. The GVCF (Genomic Variant Call Format) file contains information about each genomic position, including the likelihoods of each possible genotype, quality scores, and coverage depth. Next, single-sample genotyping is performed with the GATK GenotypeGVCFs tool. The resulting VCF is then hard-filtered according to the recommendations of the Broad Institute [1]. In particular, SNPs are removed if any of the following conditions is true (see the detailed description of the parameters at the end of the article):
The parameter values differ in the case of indels. In particular, indels are removed if any of the following conditions is true (see the detailed description of the parameters at the end of the article):
Mitochondrial variants are called and filtered separately with the GATK Mutect2 and FilterMutectCalls programs, run in mitochondrial mode.
Variant pathogenicity is assessed based on the information acquired from the databases. These include: gnomAD v2.1 and v3 (frequencies, coverage, constraint), 1000Genomes (frequencies), MITOMAP (frequencies, contributed diseases), ClinVar (contributed diseases, pathogenicity), HPO (inheritance mode, contributed phenotypes and diseases), UCSC (repeats, PHAST conservation scores), SIFT4G (constraint), SnpEff, VEP and LOFTEE (predicted impact on gene product), dbSNP (rsID), Ensembl (gene and transcript information), COSMIC (somatic mutations data), dbNSFP (in silico pathogenicity predictors), dbscSNV (abnormal splicing predictors), UniProt and NextProt (effect of a mutation on the protein), CIViC (somatic mutations data).
Annotated variants, for genes that are likely to contribute to the patient phenotype and/or for genes from the user-defined gene list/panels, are then classified and prioritized according to the ACMG criteria. A detailed description of the method used for variant classification can be found on the Intelliseq website.
In short, the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) categorize genetic variants in terms of the strength of their pathogenic or benign impact on human phenotype [2-4]. The following variant characteristics are taken into account:
These characteristics are used to score variant pathogenicity and assign ACMG minor categories. The final ACMG score is a weighted sum of the gained subcategory points, with the negative weights being used for the benign and positive ones for the pathogenic subcategories. The absolute values of the applied weights correspond to the given subcategory importance. The final score is used to classify variants into one of the major categories: pathogenic, likely pathogenic, benign, likely benign, or of uncertain significance. In addition, each variant is checked for its inheritance pattern taking into account the zygosity.
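The weighted-sum scoring described above can be sketched as follows. The subcategory names, weights, and classification thresholds in this example are placeholders chosen for illustration; the values actually used in the workflow are not given in this article.

```python
def acmg_score(points: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of subcategory points (negative weights = benign)."""
    return sum(weights[name] * value for name, value in points.items())


def classify(score: float) -> str:
    """Map a final score to a major category (hypothetical thresholds)."""
    if score >= 10:
        return "pathogenic"
    if score >= 6:
        return "likely pathogenic"
    if score <= -7:
        return "benign"
    if score <= -1:
        return "likely benign"
    return "uncertain significance"


# Hypothetical weights: larger absolute value = more important subcategory.
weights = {"PVS1": 8, "PM2": 2, "BP4": -1}
# Points gained by one variant in each subcategory (also illustrative).
points = {"PVS1": 1.0, "PM2": 1.0, "BP4": 1.0}

score = acmg_score(points, weights)  # 8 + 2 - 1 = 9.0
category = classify(score)           # "likely pathogenic"
```

In this example a single benign subcategory lowers the score from 10 to 9, which is enough to move the variant from "pathogenic" to "likely pathogenic" under the placeholder thresholds.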
By default, the main report shows a maximum of 50 variants (together with a potential second variant from a compound heterozygote). These include all variants classified in the ClinVar database as pathogenic or likely pathogenic, along with the variants that gained the highest pathogenicity score in the ACMG classification. This selection can also be modified manually with the Manual Filtering option.
Apart from single-nucleotide variations, mutations can include large-scale genome rearrangements. If you’re interested in how we analyze structural variants (SVs) with our workflows on the iFlow platform, check the article.
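One possible reading of this selection rule is sketched below. It is an assumption-laden illustration, not the platform's logic: the dictionary keys are hypothetical, ClinVar pathogenic and likely pathogenic variants are assumed to count toward the limit, and the optional second variant of a compound heterozygote is not modeled.

```python
CLINVAR_REPORTABLE = {"pathogenic", "likely pathogenic"}


def select_for_report(variants: list[dict], limit: int = 50) -> list[dict]:
    """ClinVar P/LP variants first, then the highest ACMG scores, up to limit."""
    by_score = sorted(variants, key=lambda v: v["acmg_score"], reverse=True)
    clinvar_hits = [v for v in by_score if v.get("clinvar") in CLINVAR_REPORTABLE]
    remaining = [v for v in by_score if v.get("clinvar") not in CLINVAR_REPORTABLE]
    return (clinvar_hits + remaining)[:limit]


variants = [
    {"id": "v1", "acmg_score": 2, "clinvar": "pathogenic"},
    {"id": "v2", "acmg_score": 9, "clinvar": None},
    {"id": "v3", "acmg_score": 5, "clinvar": "benign"},
]
# With a limit of 2, the ClinVar pathogenic variant is listed first,
# followed by the highest-scoring remaining variant.
selected = [v["id"] for v in select_for_report(variants, limit=2)]
```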
We encourage you to sign up for a demo with one of our highly knowledgeable bioinformatics experts. Don't hesitate to contact us if you are interested in testing the iFlow platform.
Parameters description
References:
[2] S. Richards et al., Standards and Guidelines for the Interpretation of Sequence Variants: A Joint Consensus Recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology, Genet. Med. Off. J. Am. Coll. Med. Genet., vol. 17, no. 5, pp. 405-424, May 2015.
[3] A. N. Abou Tayoun et al., Recommendations for interpreting the loss of function PVS1 ACMG/AMP variant criterion, Hum. Mutat., vol. 39, no. 11, pp. 1517-1524, 2018, Online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4544753/
[4] S. V. Tavtigian et al., Modeling the ACMG/AMP Variant Classification Guidelines as a Bayesian Classification Framework, Genet. Med. Off. J. Am. Coll. Med. Genet., vol. 20, no. 9, pp. 1054-1060, Sep. 2018.