A typical single read using current short-read sequencing technology spans approximately 150 nucleotides. Alignment of these reads becomes difficult when a sequence lacks specificity, such as when it contains repetitive motifs. Consequently, such reads often receive lower mapping quality scores since they can potentially align to multiple regions within a genome. For example, the read shown below can be equally well mapped to several regions on chromosome 22.
Long-read technology significantly improves mapping accuracy due to the considerably longer lengths of individual reads. The increased read length substantially increases the probability of identifying a unique sequence within each read. This in turn facilitates clear and unambiguous alignment to a specific genomic region.
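As a toy illustration of this point, the sketch below counts exact matches of a short and a long read against a made-up reference string. The reference, both reads, and the helper `find_exact_hits` are invented for this example. Real aligners tolerate mismatches and score candidate locations rather than searching for exact substrings, so this shows only the underlying idea.

```python
def find_exact_hits(reference: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches `reference` exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        # Step forward by one base so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return hits


# Hypothetical reference: two copies of a repetitive motif around a unique stretch.
reference = "GGATC" + "ACAC" * 5 + "TTGCA" + "ACAC" * 5 + "CCGTA"

# A short read that lies entirely inside the repeat.
short_read = "ACACACAC"
# A longer read that spans the repeat and the unique stretch next to it.
long_read = "ACAC" * 5 + "TTGCA" + "ACAC"

print("short read hits:", find_exact_hits(reference, short_read))
print("long read hits:", find_exact_hits(reference, long_read))
```

In this toy case the short read matches at many positions, while the long read matches at exactly one, because it includes sequence that occurs only once in the reference.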
Another significant advantage of long-read sequencing technology is the ability to conduct phasing. Phasing is the process of determining which genetic variants lie together on the same copy of a chromosome, making it possible to identify haplotypes, i.e. the specific combination of alleles carried on each chromosome.
Haplotypes are specific combinations of genetic variants, or alleles, that tend to be inherited together on the same chromosome. In pharmacogenetics, identifying whole haplotypes rather than individual variants is crucial for determining the association with drug metabolism. Haplotypes are relevant because they capture the complexity of genetic variation within genes. Since multiple genetic variants can affect the function of a pharmacogene, analysis of haplotypes provides a more complete understanding of how genetic variants interact and affect drug metabolism. The use of long-read sequencing technology enhances the accuracy of haplotype determination.
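The idea behind read-based phasing can be sketched with a small greedy example. Each read is reduced to the alleles it shows at heterozygous positions, and reads that agree at shared positions are grouped into the same haplotype. The read names, positions, alleles, and the function `phase_reads` are all hypothetical. Production phasing tools use more robust statistical models and handle sequencing errors, which this sketch does not.

```python
def phase_reads(reads: list[tuple[str, dict[int, str]]]):
    """Greedily split reads into two haplotype groups.

    Each read is (name, {variant_position: observed_allele}).
    Assumes error-free reads given in an order where each read
    overlaps a variant already seen (a simplification).
    """
    haplotypes = {"A": {}, "B": {}}   # position -> allele, per haplotype
    groups = {"A": [], "B": []}       # read names, per haplotype

    def score(hap: dict[int, str], alleles: dict[int, str]) -> int:
        # +1 for every shared position that agrees, -1 for every one that disagrees.
        return sum(1 if hap[p] == a else -1 for p, a in alleles.items() if p in hap)

    for name, alleles in reads:
        score_a = score(haplotypes["A"], alleles)
        score_b = score(haplotypes["B"], alleles)
        label = "A" if score_a >= score_b else "B"
        groups[label].append(name)
        # Extend the chosen haplotype with any positions it has not seen yet.
        for pos, allele in alleles.items():
            haplotypes[label].setdefault(pos, allele)
    return haplotypes, groups


# Hypothetical reads covering four heterozygous positions.
reads = [
    ("read1", {100: "A", 250: "G"}),
    ("read2", {100: "C", 250: "T"}),
    ("read3", {250: "G", 400: "T"}),
    ("read4", {250: "T", 400: "G", 900: "A"}),
    ("read5", {400: "T", 900: "C"}),
]

haplotypes, groups = phase_reads(reads)
print(haplotypes)
print(groups)
```

The longer a read is, the more heterozygous positions it tends to cover, so neighbouring reads share more positions and can be linked into longer haplotype blocks.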
CYP2D6 is one of the most important pharmacogenes. It is involved in the metabolism of ~25% of commonly used drugs. Below, the CYP2D6 gene is visualized in IGV, with the top panel showing reads from short-read sequencing and the bottom panel showing reads from long-read sequencing (colors indicate phased reads).
Huntington's disease is caused by a mutation in the HTT (huntingtin) gene on chromosome 4. The mutation is an expansion of a CAG trinucleotide repeat, which encodes the amino acid glutamine. In healthy individuals, the number of repeats is usually between 10 and 35, while in people with Huntington's disease it is typically 40 or more. Although the size of the repeat region can be determined using short-read technology, determining the number of repeats on each haplotype is much more direct with long-read technology.
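When a single read spans the whole repeat region, counting the repeats reduces to finding the longest uninterrupted run of CAG units in that read. The sketch below shows this on an invented sequence. The function `longest_cag_run` and the example read are assumptions made for illustration. Dedicated repeat-genotyping tools also account for sequencing errors and interrupted repeats.

```python
import re


def longest_cag_run(sequence: str) -> int:
    """Return the number of CAG units in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)


# Hypothetical long read: unique flank, 42 CAG units, then more flanking sequence.
read = "TTCC" + "CAG" * 42 + "CAACAG" + "CCGCCA"

print("CAG units in longest run:", longest_cag_run(read))
```

A read shorter than the repeat region cannot contain the full run, which is why counting becomes indirect with short reads once the expansion approaches the read length.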
Using short-read sequencing to identify large genomic rearrangements, i.e. structural variants, can be challenging, especially when they are heterozygous. The example below shows a heterozygous deletion of the whole CYP2D6 gene on one copy of chromosome 22 (blue space in the top panel). Although short-read coverage is lower in this region (middle panel), the deletion remained undetected. In contrast, phased long reads (shown in colors in the bottom panel) clearly reveal the deletion on one chromosome copy while showing a normal sequence on the other.
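The reason phasing makes a heterozygous deletion easier to see can be sketched with read depth. Without phasing, the deletion appears only as a partial drop in total coverage. With phased reads, coverage on one haplotype drops to zero while the other stays unchanged. The coordinates, haplotype labels `H1` and `H2`, and the function `depth_by_haplotype` below are invented for this example and do not correspond to the figure.

```python
from collections import Counter


def depth_by_haplotype(alignments, position: int) -> Counter:
    """Count aligned segments covering `position`, grouped by haplotype label.

    Each alignment is (start, end, haplotype) with a half-open interval [start, end).
    """
    depth = Counter()
    for start, end, hap in alignments:
        if start <= position < end:
            depth[hap] += 1
    return depth


# Hypothetical phased alignments; haplotype H2 carries a deletion of 4000-6000,
# so its reads align as segments on either side of the gap.
alignments = [
    (0, 7000, "H1"),
    (2000, 10000, "H1"),
    (0, 4000, "H2"),
    (1000, 4000, "H2"),
    (6000, 9000, "H2"),
    (6000, 10000, "H2"),
]

outside = depth_by_haplotype(alignments, 3000)
inside = depth_by_haplotype(alignments, 5000)

print("outside deletion:", dict(outside), "total", sum(outside.values()))
print("inside deletion: ", dict(inside), "total", sum(inside.values()))
```

In this toy data the total depth inside the deleted interval is half of the depth outside it, which is easy to mistake for ordinary coverage fluctuation, whereas the per-haplotype depth for `H2` is zero.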
Although long-read sequencing has improved significantly, its accuracy is still lower than that of short-read sequencing. Currently, multiple sequencing technologies often need to be combined to produce the most reliable genomes.
Another challenge is computational resources: the raw sequence files generated by long-read sequencing tend to be larger and require more computing power to analyze than files generated by short-read sequencing.