Structural Variants (SVs) – Large-scale genome rearrangements

Monika Opalek, PhD

May 25, 2023

Structural variants (SVs) are large-scale genetic alterations that affect the structure of the genome. They can involve deletions, insertions, inversions, duplications, and translocations of DNA segments that are at least 50 base pairs long.

Scheme illustrating the types of SVs:

Deletions: Loss of a segment of DNA from the genome.
Insertions: Addition of a segment of DNA to the genome.
Inversions: Reversal of the orientation of a DNA segment.
Duplications: Copying and insertion of a segment of DNA into the genome.
Translocations: Movement of a segment of DNA from one chromosome to another, or to a different location on the same chromosome.

Structural variants (SVs) can be found in both coding and non-coding regions of the genome. Some SVs can have a significant impact on gene expression, protein function, and disease susceptibility, resulting in severe genetic disorders, while other SVs can be benign. 

Due to the complexity and diversity of the human genome, the identification and classification of SVs is challenging. At IntelliseqFlow, we have developed workflows specifically designed to detect SVs in WGS (Whole Genome Sequencing), WES (Whole Exome Sequencing), and NGS panel data. Below you can find a description of how we analyse NGS data to identify SVs.

WGS structural variants research analysis 

For WGS (Whole Genome Sequencing) data, structural variants are identified with the lumpy software. Lumpy integrates multiple SV detection signals simultaneously (e.g. read-pair, split-read, and read-depth), which increases the sensitivity of SV identification compared to other tools [1]. The calls are genotyped with svtyper, a tool designed specifically for genotyping structural variants. Both lumpy and svtyper are run through the smoove wrapper, a tool that simplifies calling and genotyping structural variants by parallelizing some steps and adding internal read filtering [2].

The duphold software [3] is used to add sequencing depth information. In particular, the following scores are added: (1) variant coverage relative to other genomic regions with similar GC content, (2) variant coverage relative to the coverage of its flanking regions (1 kb), and (3) the number of heterozygous SNVs within the structural variant. These values are used to increase or decrease confidence in the predicted SVs.

Next, variants are filtered based on the quality score (QUAL) and the values obtained from duphold. All variants with a QUAL score below 30 are excluded. Deletions with relative coverage higher than 0.7 or with a high fraction of heterozygous SNVs within the region (higher than 0.25), as well as duplications with relative coverage lower than 1.3, are also removed.
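The filtering rules above can be sketched in a few lines of Python. This is only an illustration of the thresholds described in the text, not the code used in the workflow: the `SVCall` class and its field names are hypothetical, and the sketch assumes that the relative coverage and the heterozygous SNV fraction have already been extracted from the duphold annotations.

```python
from dataclasses import dataclass

# Thresholds taken from the description above.
QUAL_MIN = 30.0
DEL_MAX_REL_COVERAGE = 0.7
DEL_MAX_HET_FRACTION = 0.25
DUP_MIN_REL_COVERAGE = 1.3


@dataclass
class SVCall:
    """Hypothetical, simplified view of one annotated SV call."""
    sv_type: str                   # e.g. "DEL", "DUP", "INV"
    qual: float                    # QUAL score of the call
    rel_coverage: float            # relative coverage (assumed to come from duphold)
    het_snv_fraction: float = 0.0  # fraction of heterozygous SNVs (assumed precomputed)


def passes_wgs_filters(call: SVCall) -> bool:
    """Return True if the call survives the QUAL and depth-based filters."""
    if call.qual < QUAL_MIN:
        return False
    if call.sv_type == "DEL":
        # A real deletion should show reduced coverage and few heterozygous SNVs.
        if call.rel_coverage > DEL_MAX_REL_COVERAGE:
            return False
        if call.het_snv_fraction > DEL_MAX_HET_FRACTION:
            return False
    elif call.sv_type == "DUP":
        # A real duplication should show increased coverage.
        if call.rel_coverage < DUP_MIN_REL_COVERAGE:
            return False
    return True


calls = [
    SVCall("DEL", qual=120.0, rel_coverage=0.45, het_snv_fraction=0.05),
    SVCall("DEL", qual=80.0, rel_coverage=0.95),
    SVCall("DUP", qual=25.0, rel_coverage=1.6),
]
kept = [c for c in calls if passes_wgs_filters(c)]
```

In this toy input only the first deletion is kept: the second has coverage close to that of its surroundings, and the duplication fails the QUAL threshold.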

The analysis differs when the input files are WES (Whole Exome Sequencing) or NGS panel data. In particular, copy number variants are called using ExomeDepth; this step also requires several reference samples without SVs for coverage comparison. Genotype assignment follows these rules: in the case of duplication, all variants are assumed to be heterozygous, while in the case of deletion, variants with relative coverage below 0.1 are assumed to be homozygous and all other variants are assumed to be heterozygous. The BF (Bayes Factor) parameter is used to filter variants, keeping only those with a BF above 40.0.
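The zygosity and Bayes Factor rules for WES and panel data can be written down as a short Python sketch. The function names are hypothetical. The sketch also assumes that "relative coverage" means a ratio of observed to expected reads for the call, which the description above does not state explicitly.

```python
BF_MIN = 40.0                    # calls must have a BF above this value
HOM_DEL_MAX_REL_COVERAGE = 0.1   # deletions below this coverage are called homozygous


def assign_zygosity(cnv_type: str, rel_coverage: float) -> str:
    """Apply the zygosity rules: only near-zero-coverage deletions are homozygous."""
    if cnv_type == "deletion" and rel_coverage < HOM_DEL_MAX_REL_COVERAGE:
        return "homozygous"
    # All duplications and all remaining deletions are treated as heterozygous.
    return "heterozygous"


def passes_bf_filter(bayes_factor: float) -> bool:
    """Keep only calls whose Bayes Factor is strictly above the threshold."""
    return bayes_factor > BF_MIN


# Example: (type, relative coverage, BF) for three hypothetical calls.
raw_calls = [
    ("deletion", 0.02, 85.0),
    ("deletion", 0.55, 12.0),
    ("duplication", 1.48, 61.5),
]
reported = [
    (cnv_type, assign_zygosity(cnv_type, cov), bf)
    for cnv_type, cov, bf in raw_calls
    if passes_bf_filter(bf)
]
```

Here the second deletion is dropped by the BF filter, the first deletion is labelled homozygous, and the duplication is labelled heterozygous.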

The annotation and annotation-based filtering are conducted in the same way for all types of input files (WGS, WES, and NGS panels). In particular, the variant annotation and classification according to ACMG are performed with the AnnotSV program with minor custom modifications. The report includes all duplications and deletions that are classified as pathogenic or likely pathogenic according to the ACMG classification. 
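As a minimal sketch of the report selection step, the snippet below keeps only deletions and duplications labelled pathogenic or likely pathogenic. The dictionary keys and class labels are hypothetical placeholders chosen for illustration; they are not the actual column names or values produced by AnnotSV.

```python
REPORTED_TYPES = {"deletion", "duplication"}
REPORTED_CLASSES = {"pathogenic", "likely pathogenic"}


def select_for_report(variants):
    """Return annotated variants that qualify for the report."""
    return [
        v for v in variants
        if v["type"] in REPORTED_TYPES
        and v["acmg_class"].lower() in REPORTED_CLASSES
    ]


annotated = [
    {"id": "sv1", "type": "deletion", "acmg_class": "Pathogenic"},
    {"id": "sv2", "type": "duplication", "acmg_class": "Benign"},
    {"id": "sv3", "type": "inversion", "acmg_class": "Likely pathogenic"},
]
report_ids = [v["id"] for v in select_for_report(annotated)]
```

Only `sv1` is selected: `sv2` is benign, and `sv3` is neither a deletion nor a duplication.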

A fragment of the demo report from the Whole Exome Sequencing analysis of structural variants:
1 - total number of pathogenic variants identified
2 - location of the SV in the genome: chromosome number
3 - location of the SV in the genome: position
4 - type of SV identified (deletion, duplication)
5 - diseases known to be associated with each of the identified genes (9) affected by the SV, according to the HPO database
6 - ACMG score of the SV identified
7 - CNV score of the SV identified
8 - zygosity (homozygous or heterozygous)
9 - list of genes affected by SV
10 - BF (Bayes Factor): statistical measure employed to evaluate the likelihood of an SV occurring at a specific genomic location. A high BF suggests strong evidence in favour of the presence of an SV at the given genomic location. All variants with a BF below 40 are removed.
11 - total number of genes checked within the analysis. For more details, check our article on how the gene panels are created.

References

[1] Layer, R.M., Chiang, C., Quinlan, A.R. et al. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol 15, R84 (2014). https://doi.org/10.1186/gb-2014-15-6-r84 

[2] https://brentp.github.io/post/smoove/ 

[3] Brent S Pedersen, Aaron R Quinlan, Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls, GigaScience, Volume 8, Issue 4, April 2019, giz040, https://doi.org/10.1093/gigascience/giz040 

Software sources:

AnnotSV https://lbgi.fr/AnnotSV/ 

duphold https://github.com/brentp/duphold 

ExomeDepth https://github.com/vplagnol/ExomeDepth 

lumpy https://github.com/arq5x/lumpy-sv 

smoove https://github.com/brentp/smoove

svtyper https://github.com/hall-lab/svtyper

Want to know more?

Get in touch with us.