Whose genome is the reference genome?

Katarzyna Tomala, PhD

In this text, we’ll take a closer look at the human reference genome: what it is and when we need it in the context of Next-Generation Sequencing (NGS) data analysis.

PRECISION AND SENSITIVITY OF NGS DATA ANALYSIS 

The first step in standard Next-Generation Sequencing data analysis is alignment. During this process, reads obtained from the sequencing device are compared to a standard reference sequence representing the whole genome of the analyzed species. Each read is assigned to the part of the reference where it fits best (has the lowest number of mismatches). It is also given a mapping quality score, which reflects how confident the assignment is. In the variant calling step, we then use the aligned reads to see where they differ from the reference sequence.
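To make the idea of "fits best" concrete, here is a minimal sketch of an ungapped alignment that places a read where it has the fewest mismatches. It is a toy model under simplifying assumptions: the sequences and the function names `hamming` and `align_read` are made up for illustration, and real aligners use indexed references, allow gaps, and weigh base qualities.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_read(read: str, reference: str) -> tuple[int, int]:
    """Return (0-based position, mismatches) of the best ungapped placement.

    Toy model only: every possible start position is tried in turn.
    """
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = hamming(read, reference[pos:pos + len(read)])
        if mm < best_mm:  # keep the first placement with the fewest mismatches
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


# Hypothetical toy sequences, not real genomic data.
reference = "ACGTTAGCCGATACGGA"
read = "GCCGTTAC"

pos, mismatches = align_read(read, reference)
print(f"best position: {pos}, mismatches: {mismatches}")
```

In this example the single mismatch at the best placement is the kind of difference that the variant calling step would then evaluate.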

If you’re wondering what determines the precision and sensitivity of the variant calling step, the answer is the quality, purity, and number of reads. The reference sequence matters as well. For example, if the reference lacks some fragment of the real genome of the species, or represents it with many errors, reads covering that fragment will be excluded from the analysis or aligned to an incorrect location. Either way, you risk losing information about the fragment. What’s more, misaligned reads can produce false variant calls.
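The effect of a missing fragment can be sketched with a toy example. Assume, purely for illustration, that the real genome carries two similar copies of a sequence but the reference contains only one of them. An error-free read from the missing copy then lands on the copy that is present, and its differences look like variants. All sequences and the helper `best_hit` are hypothetical.

```python
def best_hit(read: str, reference: str) -> tuple[int, list[int]]:
    """Best ungapped placement: (position, reference positions that mismatch)."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        diffs = [pos + i
                 for i, (r, g) in enumerate(zip(read, reference[pos:]))
                 if r != g]
        if best is None or len(diffs) < len(best[1]):
            best = (pos, diffs)
    return best


# Hypothetical sequences: two similar copies separated by a spacer.
copy_a = "ACGTACGGTCAT"
copy_b = "ACGTTCGGTGAT"  # differs from copy A at offsets 4 and 9
spacer = "TTTTTTTT"

complete_reference = copy_a + spacer + copy_b
incomplete_reference = copy_a + spacer  # copy B is missing

read = copy_b  # an error-free read that truly originates from copy B

pos_full, diffs_full = best_hit(read, complete_reference)
pos_part, diffs_part = best_hit(read, incomplete_reference)

print("complete reference:  ", pos_full, diffs_full)
print("incomplete reference:", pos_part, diffs_part)
```

With the complete reference the read matches its true origin exactly. With the incomplete one it is placed on copy A, and the two mismatches would be reported as variants that do not exist at that location.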

HUMAN REFERENCE GENOME

Let’s look back at history for a moment. The first version of the human reference genome was published in 2001 and has since been gradually improved by the Genome Reference Consortium (GRC, https://www.ncbi.nlm.nih.gov/grc/human). The current version, GRCh38 (p14), is almost complete. Still, it does not correctly cover some regions of the genome (the short arms of acrocentric chromosomes, centromeres, several duplicated euchromatic regions, and heterochromatic regions). It is also not fully assembled, which means that some parts of the genome are not placed within the chromosomes but are left as short, unlocalized sequences.

IS THERE A PERSON OUT THERE WITH THE REFERENCE GENOME?

Let us explain that question. The genome of each of us consists of about 6 billion nucleotides. On average, the genomes of any two unrelated people differ at ~5 million positions. So there is no such thing as “the most representative” or “the best” human DNA. By the same token, the reference genome is neither perfect, nor the most common, nor a healthy one. In fact, it doesn’t even represent a real human haplotype. Instead, it’s a mixture of genomic sequences from several volunteers, and it contains a small number of pathogenic variants. But since it is established as a standard, all human genetic variation data is represented as deviations from it. Such data, including known population frequencies of variants and their associations with diseases and phenotypes, is essential for evaluating the pathogenicity of called variants.
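The idea that variation is stored as deviations from the reference can be illustrated with a short sketch. It assumes toy sequences of equal length and handles single-nucleotide substitutions only. The function names are made up for this example, and real variant formats such as VCF also describe insertions, deletions, and genotypes.

```python
def diff_against_reference(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List substitutions as (1-based position, reference base, sample base)."""
    return [(i + 1, r, s)
            for i, (r, s) in enumerate(zip(reference, sample))
            if r != s]


def apply_variants(reference: str, variants: list[tuple[int, str, str]]) -> str:
    """Rebuild the sample sequence from the reference and its deviations."""
    seq = list(reference)
    for pos, ref, alt in variants:
        # The record is only meaningful against the reference it was called on.
        assert seq[pos - 1] == ref, "variant does not match this reference"
        seq[pos - 1] = alt
    return "".join(seq)


# Hypothetical toy sequences, not real genomic data.
reference = "ACGTTAGCCGAT"
sample = "ACGTCAGCCGTT"

variants = diff_against_reference(reference, sample)
print(variants)
print(apply_variants(reference, variants) == sample)
```

The check inside `apply_variants` reflects why a shared standard matters: a list of deviations can only be interpreted against the exact reference it was computed from.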

TELOMERE-TO-TELOMERE (T2T) GENOME

Recently, the Telomere-to-Telomere (T2T) Consortium published a new human reference. Its autosomes and chromosome X represent a single haplotype of European origin, taken from the complete hydatidiform mole cell line CHM13, while chromosome Y comes from the GIAB HG002 sample. Thanks to long-read sequencing technologies used alongside short-read data, the published sequence is fully and correctly assembled and also covers the repetitive regions of the genome that are missing from the GRCh38 reference (~8% of the genome). Even more interesting, new protein-coding genes have been identified within these regions.

HUMAN DIVERSITY

Both the T2T-CHM13 and GRCh38 references share one additional limitation: they can’t represent the full diversity of human genomes. This is most noticeable in highly variable regions, such as the HLA loci. In some individuals, the sequence within such regions differs from the reference so much that reads originating from them cannot be correctly aligned.

There have been several attempts to solve this problem. As a first solution, ALT contigs containing sequences of such regions from different individuals were added to the GRCh38 reference. Although this enabled analysis of these regions for some individuals, it led to other problems, because the ALT contigs combine the highly variable regions with sequence identical to that present in the “normal” contigs. During alignment, unless such contigs were handled properly, many reads mapped equally well to both the “normal” and ALT contigs and received low mapping quality scores.
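Why equally good placements lower the mapping quality can be seen in a simplified model. Mapping quality is commonly expressed on the Phred scale, MAPQ = -10 * log10(P(placement is wrong)). The sketch below assumes that a read with n equally good best hits has a 1/n chance of being placed correctly. This is a deliberately crude assumption for illustration, not the formula used by any particular aligner, and `toy_mapq` with its cap of 60 is a hypothetical helper.

```python
import math


def toy_mapq(n_equal_best_hits: int, cap: int = 60) -> int:
    """Phred-scaled mapping quality under a toy 'equally likely hits' model."""
    if n_equal_best_hits < 1:
        raise ValueError("a mapped read needs at least one hit")
    p_wrong = 1.0 - 1.0 / n_equal_best_hits
    if p_wrong == 0.0:
        return cap  # unique hit: report the maximum score
    return min(cap, round(-10.0 * math.log10(p_wrong)))


for hits in (1, 2, 4):
    print(hits, "hit(s) ->", toy_mapq(hits))
```

Under this model, a read that fits both a “normal” contig and an ALT contig equally well drops from the maximum score to a very low one, even though the read itself is perfectly good. Some aligners are stricter still and assign such reads a score of 0.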

Another solution has been proposed by the Human Pangenome Reference Consortium. The consortium collects high-quality haplotype assemblies of people from different populations in order to incorporate them into a graph-based reference pangenome. As a result, this type of reference will contain not just one linear genome but also edges representing distinct variants and haplotypes. The consortium is also developing the tools needed to use this type of reference in sequencing data analyses.
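A graph-based reference can be pictured with a very small sketch. Here each node holds a piece of sequence, each edge says which piece may follow which, and every path through the graph spells one possible haplotype. The node contents and the function `spell_haplotypes` are made up for illustration, and real pangenome graphs and their tools are far more elaborate.

```python
from typing import Iterator

# Hypothetical toy graph: a single-nucleotide variant (A or G)
# flanked by sequence shared by all haplotypes.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_haplotypes(node: int,
                     nodes: dict[int, str],
                     edges: dict[int, list[int]]) -> Iterator[str]:
    """Yield every sequence spelled by a path from `node` to a sink node."""
    if not edges[node]:  # sink: no outgoing edges
        yield nodes[node]
        return
    for nxt in edges[node]:
        for suffix in spell_haplotypes(nxt, nodes, edges):
            yield nodes[node] + suffix


haplotypes = sorted(spell_haplotypes(1, nodes, edges))
print(haplotypes)
```

A linear reference corresponds to just one of these paths. The graph keeps the alternatives side by side, so a read carrying either allele can match the reference exactly.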

OK, SO WHICH REFERENCE SHOULD WE USE?

To sum up, we currently use the GRCh38 reference genome in our workflows. This gives us easy access to information from large databases and services that support data based on this reference: gnomAD, Ensembl, dbSNP, dbNSFP, UCSC, etc. Given the huge impact of reference quality on the precision of NGS analysis, we will switch to the more accurate T2T-CHM13 assembly, or even to a graph-based one, as soon as these databases start to support it and the analysis tools are ready.

Katarzyna Tomala, PhD

Senior Genome Scientist at Intelliseq


REFERENCES

International Human Genome Sequencing Consortium. “Finishing the euchromatic sequence of the human genome.” Nature vol. 431,7011 (2004): 931-45. doi:10.1038/nature03001

Ballouz, Sara et al. “Is it time to change the reference genome?” Genome biology vol. 20,1 159. 9 Aug. 2019, doi:10.1186/s13059-019-1774-4

Nurk, Sergey et al. “The complete sequence of a human genome.” Science (New York, N.Y.) vol. 376,6588 (2022): 44-53. doi:10.1126/science.abj6987

Church, Deanna M. “A next-generation human genome sequence.” Science (New York, N.Y.) vol. 376,6588 (2022): 34-35. doi:10.1126/science.abo5367

https://gatk.broadinstitute.org/hc/en-us/articles/360037498992--How-to-Map-reads-to-a-reference-with-alternate-contigs-like-GRCH38

Wang, Ting et al. “The Human Pangenome Project: a global resource to map genomic diversity.” Nature vol. 604,7906 (2022): 437-446. doi:10.1038/s41586-022-04601-8
