Author(s) of the reactions

Lluís Montoliu

Research professor at the National Biotechnology Centre (CNB-CSIC) and at the CIBERER-ISCIII

 

The sequencing of the first human genome in 2001 was a remarkable milestone. Being able to read the more than three billion base pairs of the genome (albeit with many gaps and uncertainties) gave us, for the first time, a reference genome against which any individual genome could be compared to identify possible disease-causing mutations. That first sequenced genome did not belong to a single individual but was assembled from the genetic data of several people. The technology used at the time allowed only relatively short reads. In 2022, massive sequencing (which generally also produces short reads, of about 150 bases) was combined with long-read sequencing to fill in many of the gaps and add about 200 million new letters to the human genome. This was the work of a consortium of researchers who named themselves “Telomere-to-Telomere” (T2T), after the telomeres, the ends of chromosomes: something like “from end to end.” In 2023, the sequencing of the Y chromosome, one of the smallest and the last still incomplete, was finished, adding another 30 million letters to the human genome and bringing its size to 3.23 billion base pairs. Any two humans share 99.9% of those letters and differ in only 0.1%, which corresponds to about 3.2 million letters in the copy inherited from our mother and another 3.2 million in the copy inherited from our father.

The technology that enables the reading of very long DNA strands (tens or hundreds of thousands of intact bases) made it possible in 2023 to begin uncovering the underlying genetic variability between different human genomes. At that time, the genomes of 47 people from populations around the world were characterized. This was the first version of the so-called "Pangenome," a collection of genomes that captures the existing genetic variability among human beings. There is no single genome; rather, each population (and essentially each individual) has a slightly different genome, especially in the intergenic regions, those between genes, which make up a whopping 98% of our genome, leaving just 2% for our twenty thousand genes, which are the ones we need to live.

This week, the journal Nature publishes two related collaborative papers from the T2T consortium, along with contributions from many other international laboratories (mostly German and American), in which the most optimized versions of long-read DNA sequencing technologies have been applied. What these researchers found is a large number of structural variants (SVs) that had previously gone unnoticed. For example, if a 5,000-base DNA segment is repeated several times in tandem and we sequence the genome using short fragments of 150 bases, we will not be able to detect all the repetitions, because every copy of the segment is essentially identical; at most we might detect a few. However, if we apply long-read sequencing technology and pass through a nanopore a DNA strand long enough to contain all of these tandemly repeated units, whether as direct or inverted repeats, we can deduce that one person has, say, 47 repetitions while another has only 23, and inverted ones at that. In other words, this once again reveals additional underlying genetic variability in our genomes, variability we suspected but did not know or could not interpret until technologies emerged that allow us to read very long strands of intact DNA, such as the most sophisticated sequencing methods developed by Oxford Nanopore Technologies (ONT) and PacBio.
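The tandem-repeat example can be made concrete with a small simulation. The following is only an illustrative sketch, not the method used in the papers: it assumes error-free reads, random made-up sequences, direct (not inverted) repeats, and hypothetical helper names such as `build_locus` and `copies_in_spanning_read`. It compares the set of distinct 150-base reads obtainable from a locus with 47 copies against one with 23 copies, and then counts the copies from a single read that spans the whole array.

```python
import random

READ_LEN = 150     # short-read length mentioned in the text
UNIT_LEN = 5_000   # length of the repeated segment in the text's example


def random_dna(rng, length):
    """Return a random DNA string (made-up sequence, for illustration only)."""
    return "".join(rng.choices("ACGT", k=length))


def build_locus(left, unit, copies, right):
    """Unique left flank + `copies` direct tandem copies of `unit` + unique right flank."""
    return left + unit * copies + right


def distinct_short_reads(sequence, read_len=READ_LEN):
    """Every distinct error-free read of `read_len` bases, taken at every start position."""
    return {sequence[i:i + read_len] for i in range(len(sequence) - read_len + 1)}


def copies_in_spanning_read(read, left, unit, right):
    """Count repeat units in a long read that covers the array from flank to flank."""
    if not (read.startswith(left) and read.endswith(right)):
        raise ValueError("read does not span the whole repeat array")
    array_len = len(read) - len(left) - len(right)
    return array_len // len(unit)


rng = random.Random(0)
left = random_dna(rng, 1_000)
unit = random_dna(rng, UNIT_LEN)
right = random_dna(rng, 1_000)

person_a = build_locus(left, unit, 47, right)
person_b = build_locus(left, unit, 23, right)

# Short reads: both people yield exactly the same set of distinct 150-base reads.
same_short_reads = distinct_short_reads(person_a) == distinct_short_reads(person_b)

# Long reads: one read spanning the whole locus gives the copy number directly.
copies_a = copies_in_spanning_read(person_a, left, unit, right)
copies_b = copies_in_spanning_read(person_b, left, unit, right)

print(same_short_reads, copies_a, copies_b)
```

In this toy setting the two loci produce identical sets of short reads, because no 150-base window can see more than one junction between neighbouring copies, whereas the spanning read recovers 47 and 23. Real short-read data also carries read-depth information that can hint at copy number, but only roughly, and it says nothing about the order or orientation of the copies.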

The first paper reports up to 65 representative human genomes (expanding the pangenome) containing up to 130 haplotypes (contiguous chromosomal fragments inherited together from parents), filling in many of the previously unknown intervals and gaps still present in the human genome. The second paper details the most precise sequencing yet, using long reads from over a thousand human individuals, which enabled the identification of up to 100,000 structural variants and 300,000 variable number tandem repeats. Mobile elements (jumping genes, transposons, and retrotransposons) are proposed as the origin of this structural diversity, along with homologous recombination events, that is, the exchange of sequences based on the similarity of their bases.

We still know very little about the true meaning and impact of having 40 or 400 copies of a specific DNA segment, but what these two publications show is that each individual’s genome is unique, with its own structural variations, which can coexist within a population. Hence the move toward the pangenome (a set of descriptive genomes from dozens of human populations) as our new "reference genome": no longer a single genome but many genomes, which we should use to detect the presence or absence of mutations in genes or intergenic sequences that can help diagnose people with genetic diseases. Genetic diagnosis always precedes the development of any potential gene therapy. That’s why these two papers are significant: they reveal the additional complexity of our genome, which is much more variable between individuals than we ever imagined. And this should help us better diagnose patients affected by congenital disorders or genetically based diseases.
