Closing the Gaps in the Human Genome: Why Y Was the Final Hurdle


By 2022, every human chromosome had been fully sequenced except the Y chromosome.1 Despite being one of the shortest, this chromosome has been the toughest to sequence because it is studded with repetitive DNA.2 Common sequencing techniques collect short reads from random sites on a chromosome and piece them together into a single sequence wherever they overlap, but the Y chromosome's repetitive DNA complicates assembly because a read taken from a repeat can overlap at multiple sites.
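The overlap problem can be illustrated with a toy example. The sketch below is not an assembler and does not reflect any tool used in the studies; the sequences, the repeat unit, and the helper `placements` are all made up for illustration. It only shows why a short read drawn from a repeat fits in more than one place, while a read long enough to span the repeat and its flanks fits in exactly one.

```python
def placements(read: str, genome: str) -> list[int]:
    """Return every start position at which `read` matches `genome` exactly."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits


# Hypothetical toy "chromosome": unique flanks around three copies of a repeat.
REPEAT = "GATTACA"
genome = "CCGT" + REPEAT * 3 + "TTAGC"

# A short read that lies entirely inside the repeat array.
short_read = "TTACAGAT"
# A long read that spans the whole array plus part of each unique flank.
long_read = "CGT" + REPEAT * 3 + "TT"

print(placements(short_read, genome))  # more than one position: ambiguous
print(placements(long_read, genome))   # a single position: unambiguous
```

In a real assembly the repeats are far longer and the reads contain errors, so the ambiguity is much harder to resolve than in this exact-match toy.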

At last, two teams of scientists have jointly tackled this challenge and fully sequenced the Y chromosome. They reported their results in two separate studies published in Nature. The first study described a carefully validated, complete reference sequence, while the second study reported Y chromosome variation among 43 men from different backgrounds.3,4 Together, these data create new opportunities for exploring the genetic makeup and diversity of the Y chromosome.

“A lot of people don’t appreciate the technological development that went on under the hood. It’s really impressive, and it’s going to make assembling accurate and full genomes a lot more possible,” said Brianna Chrisman, a computational genomics researcher at the cancer genomics company GRAIL, who was not involved in either study.


In the first study, researchers from different institutes banded together under the Telomere-to-Telomere (T2T) consortium to fill in the gaps in the human reference genome. To sequence the Y chromosome, Adam Phillippy, a genomics researcher at the National Human Genome Research Institute and study coauthor, and his colleagues chose nanopore sequencing because it produces long reads, which overlap unambiguously even when repetitive DNA is present.5 However, this technique is error-prone, producing an error every 100 bases or so. So, the researchers also used a high-fidelity technique called single-molecule circular consensus sequencing, which produces shorter reads and generates an error every 1000 bases on average.6 Then, in a first, the T2T consortium used an algorithm named Verkko that combined reads from both techniques to assemble a full, highly accurate Y chromosome sequence.7
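To put the two error rates side by side, here is a minimal arithmetic sketch. The rates (roughly one error per 100 bases and one per 1000 bases) come from the text above; the 10,000-base stretch is an arbitrary length chosen for illustration, not a read length reported by either study, and the function name is hypothetical.

```python
def expected_errors(num_bases: int, bases_per_error: int) -> float:
    """Expected number of errors in `num_bases`, given one error per `bases_per_error` bases."""
    return num_bases / bases_per_error


# Illustrative assumption: compare both techniques over the same 10,000 bases.
STRETCH = 10_000

nanopore_errors = expected_errors(STRETCH, 100)    # about one error per 100 bases
hifi_errors = expected_errors(STRETCH, 1000)       # about one error per 1000 bases

print(nanopore_errors, hifi_errors)
```

Over the same stretch of DNA, the high-fidelity reads would be expected to carry about a tenth as many errors, which is why pairing them with the longer nanopore reads is attractive.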

The first full sequence of the Y chromosome contained 30 million new base pairs. Phillippy said that most of these newly discovered sequences relate to sequences on other chromosomes but carry subtle variations. “Now the question is ‘are those subtle variations doing anything interesting?’” he said.

Phillippy and his colleagues found 110 new genes, 41 of which are predicted to code for proteins. The majority were extra copies of the TSPY gene, which is involved in sperm production. It’s not clear why these backups have evolved.

The new Y chromosome sequences could spell change for metagenomics research, which involves sequencing microbial genomes. Human DNA contaminants often creep into these studies.8 “You have people in the lab shedding skin cells into their reagents,” Phillippy explained, and these contaminant sequences could be incorrectly attributed to microbes. From a bioethics standpoint, contaminants could contain DNA signatures of the individuals from whom they came. He added that people who donate samples in human microbiome studies, for example, are promised anonymity, and their DNA needs to be excluded from published datasets so that it cannot later be traced back to them.

The 30 million base pairs in the Y chromosome that were not sequenced until now created a blind spot: human contaminants from these regions could have slipped through the filters. Using the complete Y chromosome sequence rather than previous versions, the team identified nearly 1000 more potential contaminants in these datasets. “It would be helpful and doable to go through the collection of public bacterial reference genomes we have, and maybe viruses as well, and try to flag these Y chromosome sequences,” Chrisman said.
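The blind spot can be sketched with a toy filter. This is only an illustration of the idea, not the method used in the study: real contamination screens rely on alignment or large k-mer databases, and every sequence, name, and threshold below is a made-up assumption. The sketch flags a read as human if it shares any short subsequence (k-mer) with a reference, and shows how a read from a region missing in the older reference passes unflagged.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def is_flagged(read: str, reference: str, k: int = 8) -> bool:
    """Flag `read` as possible human contamination if it shares a k-mer with `reference`."""
    return bool(kmers(read, k) & kmers(reference, k))


# Hypothetical toy sequences (not real Y chromosome data).
complete_ref = "ACGTACGGTCAGTTCAAGGCTTAGCCATGA"
old_ref = complete_ref[:16]            # older reference with the end missing
human_contaminant = complete_ref[18:]  # read from the previously missing region
microbial_read = "GGGGAAAACCCCTTTT"    # read unrelated to the human reference

print(is_flagged(human_contaminant, old_ref))       # missed by the old reference
print(is_flagged(human_contaminant, complete_ref))  # caught by the complete one
print(is_flagged(microbial_read, complete_ref))     # genuine microbial read kept
```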

Charles Lee, a genomics researcher at the Jackson Laboratory who led the second study, approached the problem from a different angle. Once the T2T consortium had fine-tuned the sequencing protocol they used for their study, Lee and his colleagues adopted it and applied it to 43 Y chromosomes from men from every inhabited continent except Australia. “They have samples from all over the world focusing a little bit more on South America, West Africa, and East Asia, which have been historically underrepresented,” Chrisman said. Half of the chromosomes came from African backgrounds, which were among the most genetically diverse because populations that migrated to other continents lost genetic variation along the way.9 By comparing variations across all 43 chromosomes, the researchers estimated that the most recent common ancestor lived approximately 183,000 years ago.

On average, each chromosome showed a striking degree of variation, including three inverted sequences longer than 1000 base pairs, 88 large insertions or deletions longer than 50 base pairs, and more than 3000 single-base-pair mutations. Charting this diversity could help to identify genes that affect health and fertility in males.

Sex chromosomes have been overlooked in disease research because they were not fully sequenced until recently. “Now, there’s no excuse not to include the Y chromosome in studies of human health,” said Melissa Wilson, a computational evolutionary biologist at Arizona State University and coauthor of the study by the T2T consortium. In fact, the Y chromosome has recently garnered attention in cancer research because its loss in aging cells correlates with a poor prognosis in bladder cancer.10

“What I’m looking for next is the ability to do what we’ve done here at the single-cell level” to explore variation within an individual, said Lee. Although single-cell sequencing technology already exists, it cannot collect long reads from the DNA of one cell, he explained.

References

  1. Nurk S, et al. The complete sequence of a human genome. Science. 2022; 376(6588):44–53.
  2. Bachtrog D, Charlesworth B. Towards a complete sequence of the human Y chromosome. Genome Biol. 2001; 2(1016.1).
  3. Rhie A, et al. The complete sequence of a human Y chromosome. Nature. 2023; 620(7975).
  4. Hallast P, et al. Assembly of 43 human Y chromosomes reveals extensive complexity and variation. Nature. 2023; 620(7975).
  5. Goodwin S, et al. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res. 2015; 25:1750–1756.
  6. Rhoads A, Au KF. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinformatics. 2015; 13(5):278–289.
  7. Rautiainen M, et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat Biotechnol. 2023.
  8. Chrisman B, et al. The human “contaminome”: bacterial, viral, and computational contamination in whole genome sequences from 1000 families. Sci Rep. 2022; 12:9863.
  9. Choudhury A, et al. High-depth African genomes inform human migration and health. Nature. 2020; 586(7831):741–748.
  10. Abdel-Hafiz HA, et al. Y chromosome loss in cancer drives growth by evasion of adaptive immunity. Nature. 2023; 619(7970):624–631.

Note: August 30: This story was updated to correct the error rates for nanopore sequencing and single-molecule circular consensus sequencing.
