I periodically draw attention on social media to the fact that the human genome sequence is not complete. I thoroughly enjoy the reaction of others who are shocked to find that this is actually true. Why should anyone care if the human genome sequence is not yet complete? Surely the few bits that have been missing from the reference genome aren’t all that significant. At least that’s how many who write (and talk) about the genome, including the few who acknowledge that the reference genome is incomplete, describe it.
But is the incompleteness really insignificant? Let’s take a look at some of what we know of our genomes and then ask that question again.
- Comparisons of “whole genome sequences” are not comparisons of “whole” genome sequences.
- No multicellular organism genome has yet been fully sequenced.
- Many comparative analyses continue to use much older reference genomes than the ones available today.
- Computational power and time are limiting factors in whole genome comparisons, so many continue to pick and choose which sequences to use in generating phylogenetic comparisons.
Is there any impetus under the current paradigm to try to rectify this limited—and possibly biased—approach to gene or genome comparisons?
The Remarkably Complex Human Genome
DNA may be the blueprint for life, but it’s far from the simple caricature of a twisted ladder. The human genome, made up of our autosomal chromosomes (22 pairs) and sex chromosomes (most commonly a pair of XX or XY), is an extremely complex conglomeration of double-stranded DNA, RNA, and associated proteins, together known as chromatin. Chromatin is the material of our chromosomes. (See banner picture.) Every one of us has roughly 3,200 Mbp (Mbp = megabase pair; each Mbp equals one million base pairs) of DNA from each parent in each nucleated cell in our bodies. Portions of our 6,400 Mbp (diploid) cellular DNA are constantly being chemically modified, accessed, read, and transcribed into molecules (RNA transcripts) that regulate cell functions.
DNA serves as the template for dozens of different kinds of RNA transcripts in human cells. The types and numbers of transcripts vary depending on the cell-cycle and developmental stages of the organism, cell/tissue type, external stimuli (biochemical, electrochemical, mechanical, etc.), and other factors. RNA transcript levels are highly dynamic within a given cell and are highly differentiated between different cell types in a given organism. Some RNA transcripts, known as messenger RNAs (mRNAs), are translated into proteins. Proteins serve as structural and functional entities within the cell. Many proteins regulate other cellular and physiological functions. The bits of DNA that serve as the blueprints for proteins have been historically referenced as “genes,” and sequencing these bits of DNA was at the heart of the Human Genome Project (HGP).
Human Genome Project Goals
Most genes encoding cellular proteins are found in the active and accessible portions of chromatin. These regions of chromatin are known as euchromatin. In contrast, other portions of our DNA are tightly bound to proteins, condensed, and sequestered. These bits (known as heterochromatin) are thought to be functionally silenced or inactive in most cells. Heterochromatin, long thought to play critical structural roles even if silenced, is highly repetitious sequence found primarily at the telomeres and centromeres. The condensed bits of heterochromatin are hard to access. And since heterochromatin is sequestered and presumed silent or inactive, these regions have been of less interest to those scientists seeking to understand the active and accessible bits of the human genome. For these reasons, HGP teams endeavored to sequence the entire euchromatic portions of the human genome and render a reference genome for discovery, comparison, and annotation of the protein coding regions. Sequencing the heterochromatin and inaccessible bits of the human genome was not part of the original HGP goals.
The Draft Sequence
In 2001, two teams of scientists reported the completion of the first draft of the human genome1—a truly herculean accomplishment that took a little over 10 years! They reported successful coverage of ~96% of the euchromatic part of the human genome. The draft nature of the genome acknowledged gaps that existed within the euchromatic sequences—many due to copy number variations in highly repetitive sequences or segments not captured by DNA cloning—as well as gaps associated with heterochromatic and inaccessible regions.
Resolving or closing these gaps and correcting initial errors in the euchromatic regions required great effort over several more years, and that work continues to this day.
Finishing the Human Genome
In 2003, a featured news article in Nature announced the completion of all of the original goals of the Human Genome Project.2 This announcement prompted reports in the news and elsewhere of the completion of the human genome. But completion of the original goals of the HGP was and is not the same thing as completing the human reference genome. In 2004, the International Human Genome Sequencing Consortium (IHGSC) published a follow-up report indicating ~99% completion of the euchromatic human genome.3 Commenting on the earlier reports from 2001, the authors stated, “Both draft sequences, however, had important shortcomings. The IHGSC sequence, for example, omitted ~10% of the euchromatic genome; it was interrupted by ~150,000 gaps,” and, “The draft sequence contained regions in which the local order and orientation were unknown; these have now been resolved.”
In the paragraphs that followed in their report, they compare the 2004 reference genome with the original drafts. “The number of gaps has been reduced 400-fold to only 341, most of which are associated with segmental duplications and will require new methods for resolution. The assembled near-complete genome sequence has an error rate of only ~1 event per 100,000 bases; it contains 2.85 billion nucleotides [single strand haploid number] and covers ~99% of the euchromatic genome.” And, “Additionally, the draft sequence contained substantial artifactual duplication, including local events caused by errors in merging some adjacent BAC-based sequences, made by the first-generation global assembly program, and global events caused by contamination of shotgun assemblies of some BACs with data from other clones. These artifacts have now been eliminated.”4 (BAC refers to bacterial artificial chromosome.)
The 2004 report discusses many of the challenges in finishing the human (euchromatic) reference genome, including this observation: “The goal for completion . . . was challenging because the human genome is replete with such features as dispersed repeats and large segmental duplications, which greatly complicate the determination of genome structure and sequence.”5 The human genome is actually about 50% repetitive sequences, and much of that repetition lies in the euchromatic portions, not just in heterochromatin.
By 2004, although the original goals of the HGP had seemingly been met, the human reference genome was nevertheless incomplete. The report cited completion of 2,851,330,913 nucleotides, lying almost entirely within the euchromatic portion of the genome, interrupted by only 341 gaps. Of those 341 interruptions, only 33 gaps (~198 Mb) reflect heterochromatin, and 308 gaps (~28 Mb) are euchromatic. The euchromatic genome was thus reported as ~2.88 Gb and the overall human genome was ~3.08 Gb.6
This report indicates that in 2004 the human genome in toto (euchromatin plus heterochromatin) was ~6.5% incomplete. But here’s a funny thing about gaps: as more sequences are generated and sequencing technology continues to advance, sometimes gaps multiply rather than close, and sometimes they’re bigger than imagined.
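The figures quoted from the 2004 report can be cross-checked with a little arithmetic. The sketch below is only a rough check, not part of the report: it assumes the gap sizes are exactly 28 Mb and 198 Mb (the text gives them as approximate) and treats all sequenced bases as euchromatic. Under those assumptions, the ~6.5% figure is reproduced from the rounded totals (3.08 Gb versus 2.88 Gb), which counts only the heterochromatic gaps; counting the euchromatic gaps as well gives a somewhat larger share.

```python
# Figures quoted in the text from the 2004 IHGSC report.
SEQUENCED_BASES = 2_851_330_913

# Assumption: the approximate gap sizes are taken at face value.
EUCHROMATIC_GAP_BASES = 28_000_000        # ~28 Mb across 308 gaps
HETEROCHROMATIC_GAP_BASES = 198_000_000   # ~198 Mb across 33 gaps

# Euchromatic genome = sequenced bases plus the euchromatic gaps.
euchromatic_total = SEQUENCED_BASES + EUCHROMATIC_GAP_BASES
# Whole genome = euchromatic genome plus the heterochromatic gaps.
genome_total = euchromatic_total + HETEROCHROMATIC_GAP_BASES

# Share of the genome missing when only heterochromatic gaps are counted.
missing_heterochromatin_only = HETEROCHROMATIC_GAP_BASES / genome_total
# Share of the genome missing when every gap is counted.
missing_all_gaps = (
    EUCHROMATIC_GAP_BASES + HETEROCHROMATIC_GAP_BASES
) / genome_total
# Same calculation using the rounded totals quoted in the text.
missing_from_rounded_totals = (3.08 - 2.88) / 3.08

print(f"Euchromatic genome: {euchromatic_total / 1e9:.2f} Gb")
print(f"Whole genome:       {genome_total / 1e9:.2f} Gb")
print(f"Missing (heterochromatic gaps only): {missing_heterochromatin_only:.1%}")
print(f"Missing (all gaps):                  {missing_all_gaps:.1%}")
print(f"Missing (from rounded totals):       {missing_from_rounded_totals:.1%}")
```

With these inputs the totals come out at about 2.88 Gb and 3.08 Gb, matching the report, and the missing share falls between roughly 6.4% and 7.3% depending on which gaps are counted.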
Minding the Gaps
Two years later, in 2006, as researchers reported closing more gaps and the completion of the human chromosome 1 (euchromatic) sequence,7 Helen Pearson wrote a Nature news article entitled “Human Genome Completed (Again)”.8 In it, she commented, “Haven’t scientists already announced the completion of the human genome? Well, yes. Twice. In June 2000,9 two teams declared with great fanfare that they had produced a draft copy of the human genetic code, but there were many gaps and errors in this version. Another announcement, in 2003 [and the follow-up publication in 2004],10 marked the completion of a far more accurate ‘finished’ sequence by those involved in the public-financed Human Genome Project, although there are still a few gaps and uncertain areas in this one too.”
The Genome Reference Consortium
In 2009, nearly ten years after the draft genome was first announced, Nature produced a special anniversary edition, highlighting the numerous and significant reports building on the HGP reference genome, entitled “The Human Genome at Ten.”11 Tucked away at the bottom of the table of contents was one news feature, “The Genome Finishers” by Elie Dolgin.12 This article is insightful, if not very detail-oriented. One takeaway is that the Genome Reference Consortium (GRC) picked up where the IHGSC and HGP left off.
So in 2018, twelve years after the misreports of a complete human genome, why are we still addressing the status of the human reference genome? For one thing, it’s still incomplete. For another, evidence of functionality associated with heterochromatic regions continues to grow.
Current Genome Status
Note: This section is a bit technical; readers may skip ahead to the next section if the technicalities presented here seem too much.
As of 2017, according to the GRC, 875 gaps remain in the human reference genome. The most relevant statistics, taken from the site on December 12, 2017, appear in the following table, along with a necessary disclaimer: “Total lengths are calculated by summing the length of the sequenced bases and estimated gaps.”
Human assembly information
| Current major assembly | |
| Regions with alternate loci | |
| Total non-N bases | |
Interestingly, as sequence issues are resolved and work is done to improve the quality of the human reference genome, the number of unplaced scaffolds has risen. Between the reference genome GRCh37 (Feb. 2009) and the Dec. 2017 version GRCh38.p12, the total number of scaffolds dropped slightly, but the number of unplaced scaffolds (sequences localized to a chromosome, but unplaced within the chromosome) rose from ~6.1 million to ~11.5 million. The number of gaps also rose (from 372 to 875), in part because some assembled sequences were added into previously unspanned gaps. Placing a sequence inside a gap without overlapping either end of the preexisting gap splits that one gap into two, so each such addition raises the gap count by one. Most importantly, the total number of bases in the genome (including estimated gaps) continues to rise, e.g., from 3,137,144,693 to 3,257,347,282 in the 6-month period between the GRCh38.p11 (Jun. 2017) and GRCh38.p12 (Dec. 2017; see table above) assemblies.
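To see why adding sequence can raise the gap count, consider a toy model. It is purely illustrative and is not how the GRC tracks gaps: each gap is a hypothetical (start, end) coordinate pair, and `place_sequence` is a made-up helper. A sequence that lands inside a gap without reaching either edge leaves a smaller gap on each side. The last lines also work out the growth in total bases between the two assemblies quoted above.

```python
def place_sequence(gaps, seq_start, seq_end):
    """Return the gaps left after placing a sequence at [seq_start, seq_end).

    `gaps` is a list of (start, end) half-open intervals. This is a toy
    model with hypothetical coordinates, not a real assembly format.
    """
    remaining = []
    for gap_start, gap_end in gaps:
        # The new sequence does not touch this gap, so keep it unchanged.
        if seq_end <= gap_start or seq_start >= gap_end:
            remaining.append((gap_start, gap_end))
            continue
        # Part of the gap is still open to the left of the new sequence.
        if seq_start > gap_start:
            remaining.append((gap_start, seq_start))
        # Part of the gap is still open to the right of the new sequence.
        if seq_end < gap_end:
            remaining.append((seq_end, gap_end))
    return remaining


one_gap = [(100, 200)]

# No overlap with either edge: one gap becomes two.
print(place_sequence(one_gap, 120, 150))
# Anchored at the left edge: the gap shrinks but the count is unchanged.
print(place_sequence(one_gap, 100, 150))
# Spans the whole gap: the gap is closed.
print(place_sequence(one_gap, 100, 200))

# Growth in total bases (including estimated gaps) quoted in the text.
GRCH38_P11_TOTAL = 3_137_144_693
GRCH38_P12_TOTAL = 3_257_347_282
growth = GRCH38_P12_TOTAL - GRCH38_P11_TOTAL
print(f"Added bases: {growth:,} ({growth / GRCH38_P11_TOTAL:.1%})")
```

Only the first case increases the number of gaps, which is one way a reference assembly can improve while its gap count goes up.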
Is Sequencing Heterochromatin Worth the Effort?
Thankfully, many researchers believe the difficulties with accessing and sequencing the heterochromatic portions of the human genome are worth the effort. Researchers continue to rise to the challenge, as indicated by a February 2018 publication of a Y-chromosomal centromeric (heterochromatic) sequence.13 This is the first report of a completed centromeric sequence from the Y chromosome, one of the smallest human chromosomes. [Future blogs will cover even more recent findings.]
Heterochromatin that is tightly sequestered away in any given cell at any given time was likely once actively transcribed (e.g., during development or prior to cell differentiation). Or it may be actively transcribed during particular cellular processes (during cell division or in response to a particular physiological signal) that are neither assayed nor captured in any particular analysis.
The human genome is not a static set of 23 chromosomes. Our genomes are dynamic! They vary from cell to cell in which bits are active and which are sequestered. They vary in a single cell during the cell cycle (when cells grow and divide). They vary in response to stimuli, from hormones to other biochemical and bioelectric signals. They even vary in response to mechanical signals like pressure, stretching, shearing, or torque. The twisted ladder we often see in pictures of our DNA, or the X-shaped structures that represent our chromosomes, do not give us a good basis for thinking about the dynamic genome.
Like many other things that come to mind when we think of DNA, genomes, or cells, our visual models are extreme simplifications rendered to help us comprehend complex concepts and components in isolation or subsets. As we focus on particular components and tease out their structure and function, we lose sight of complexity at every other level. It’s no surprise, then, as we pursue reductionistic explanations of complex systems that we continually underestimate the complexity of not just the system itself but the individual components, which may have multiple forms or functions. And we often lose sight of the greater network of systems regulating, feeding, and interacting with our particular system of study or interest.
Other bits of our genomic DNA have multiple copy repeats, and still others seem to vary greatly between human individuals. And there are gaps (called muted gaps) where sequences are present in some individuals but apparently absent in others and/or in the reference genome.
A Post-Genomic Era?
Our words matter. They convey meaning from which many of us draw conclusions about risks and rewards. If the genome is truly “finished,” and if we’ve moved into a “post-genomic era,” then certainly we have increased confidence to edit the text of the human genome. In the highly repetitious landscape of the human genome, we wouldn’t want to target 12.8 billion bases of information-bearing sequences if we don’t already know what the 12.8 billion bases are and where target decoys or duplications might occur. This is especially true if we intend to alter the genome for generations of offspring to follow.
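The text does not derive the 12.8 billion figure. One reading, offered here as an assumption, is that it counts individual bases on both strands of the diploid genome, starting from the roughly 3,200 Mbp haploid estimate given earlier. The short sketch below simply walks through that arithmetic.

```python
# Haploid genome size quoted earlier in the text, in megabase pairs (Mbp).
HAPLOID_MBP = 3_200
BASE_PAIRS_PER_MBP = 1_000_000

# Two haploid sets (one from each parent) make up the diploid genome.
diploid_mbp = 2 * HAPLOID_MBP
diploid_base_pairs = diploid_mbp * BASE_PAIRS_PER_MBP

# Assumption: the 12.8 billion figure counts both bases of every base pair.
individual_bases = 2 * diploid_base_pairs

print(f"Diploid genome:   {diploid_mbp:,} Mbp")
print(f"Individual bases: {individual_bases / 1e9:.1f} billion")
```

On this reading, 3,200 Mbp doubles to 6,400 Mbp for the diploid genome and doubles again to 12.8 billion bases once both strands are counted.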
The human genome isn’t complete, and just how incomplete it is remains hard to determine. Harvard geneticist George Church and other human genome researchers estimate that 4–10% is still missing. My estimates of 8–10% fall in that range.14 The truth is, no one knows for sure.
What we do know for sure is that those who keep repeating that the human genome is complete are either uninformed, being imprecise, or propagating a meaningful (yet, maybe in their minds, insignificant) lie. Some might choose to obscure what we’ve actually accomplished for a variety of reasons: perhaps to maintain and advance a paradigm that humans are not so unique, or to facilitate further applications (CRISPR genome editing) altering the human genome, or to avoid undermining the significance of hundreds to thousands of publications making comparisons and definitive statements based on such comparisons.
In contrast, others might seek to bring these observations and omissions to light, speaking more accurately about what we have accomplished and have yet to accomplish. Such dialogue is crucial, especially as many rush to alter the human genome in irresponsible ways, similar to those claimed by Chinese fertility researcher He Jiankui last year.15 We are fearfully and wonderfully made. And the human genome is extremely complex, far more so than we imagine today. Surely we shouldn’t allow hubris to push us toward risky actions based on good intentions but incomplete information.