One of the many surprises to stem from sequencing the human genome was the revelation that protein-coding sequences make up a relatively small proportion of our DNA. These exons, collectively known as the exome, account for less than 2 percent of the human genome. Still, scientists often search through exomes for the genetic basis of diseases—and such searches have proven fruitful, identifying the culprits behind rare diseases and pathological genetic changes in tumors. But researchers are increasingly realizing that whole-exome sequencing tells only part of the story: Mutations in noncoding regions of the genome can also cause disease—for example, by affecting the transcription of a gene.
To begin to uncover some of these overlooked effects, researchers recently analyzed the whole genome sequences of more than 150,000 individuals from the UK Biobank, a massive database that contains DNA samples and phenotypic data from 500,000 individuals. Their findings, published July 20 in Nature, include 12 genetic variants not detected in whole exome sequencing that influence traits such as height and age at menarche.
The Scientist spoke with Kári Stefánsson, founder of deCODE Genetics, which sequenced half of the genomes analyzed in the study, about the importance of whole genome sequencing. (Amgen, deCODE’s parent company, was one of four companies that contributed to the study’s funding; the other half of the sequencing was performed by the Wellcome Sanger Institute.)
The Scientist: What is the UK Biobank, and what is its whole genome sequencing consortium attempting to achieve?
Kári Stefánsson: What we are always aspiring to do in population studies like this is to develop understanding of human diversity. The diversity in risk of disease, response to treatment, diversity when it comes to educational attainment, socioeconomic status, et cetera.
People have been debating whether to use whole exome sequencing or whole genome sequencing, and which one of these two yields the most useful data.
When we look at these 150,000 genomes, we began to look at regions that . . . have substantial sequence conservation. The assumption is that the regions that are least tolerant of sequence diversity are the regions that must be of greatest functional significance. And when we look at the 1 percent of the genome that is least tolerant of sequence diversity . . . 83 percent of them are in the intragenic sequences, not in the exons. So it is absolutely clear that there is enormous information to be mined out [of] those regions.
In this paper, we also . . . listed about 12 phenotypes where we found variants in the genome that associate with them, where we could not find the same by using whole exome sequencing. It is absolutely clear . . . that whole exome sequencing was extremely valuable, gave us a spectacular insight into the role of coding sequences in the pathogenesis of all kinds of diseases, but that whole exome sequencing doesn’t suffice.
TS: So whole genome sequencing was attempted because whole exome sequencing has not captured the whole picture?
KS: Evolution is absolutely ruthless and sheds everything that we don’t need. The exons are only a very small part of a genome, and the rest of the genome is not useless. It’s absolutely clear that the rest of the genome is functionally very important and therefore does not allow for boundless sequence diversity.
TS: What were the technical challenges in doing whole genome sequencing on this very large scale?
KS: There are all kinds of challenges, but we are fairly used to scaling up and taking processes that are usually done on a relatively small scale and do them on a large scale. . . . Certainly there’s an enormous amount of data that comes out of 150,000 genomes. There is a challenge, for example, in joint variant calling [the process to identify genetic variants from sequence data], when you’re calling the variants in all of these genomes simultaneously. There’s a challenge when it comes to just scoring and managing and mining these data. This is becoming, first and foremost, an informatics challenge.
TS: What are the remaining challenges?
KS: We are all of us aspiring to understand human diversity. And if you look at the data coming out of the UK Biobank, it is not an unbiased sample of the population of Great Britain. There is an overrepresentation of people of European descent. And what we have of sequence diversity from people of African descent, of Asian descent, et cetera, is much less than we need.
It’s incredibly important . . . from a scientific point of view, to get more representation of people of other ethnic groups. It is also, from a societal point of view, unacceptable to have this little information on people of other descents. The health care disparity in the world begins with the fact that we know less about the nature of diseases in people of other origins than European. . . . So one of the challenges is to make sure that we have formidable cohorts of people of other descents to work with.
TS: What did you learn from the whole genome sequencing published in the paper?
KS: The main, most important lesson is . . . how [an] incredibly large percentage of the regions with high sequence conservation are outside of the exons. . . . It means that we have a formidable task in front of us to annotate the regions with low depletion score or little tolerance for sequence diversity.
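For readers who want to see the arithmetic behind a statement like this, the following is a minimal sketch, not the study's actual method. It assumes the genome has been split into windows, each with a hypothetical tolerance score (lower means less tolerant of sequence diversity) and a flag for whether it overlaps an exon. The function name, the toy scores, and the 25 percent cutoff used in the example are all illustrative assumptions.

```python
def fraction_outside_exons(windows, bottom_fraction=0.01):
    """Return the share of the least-tolerant windows that do not overlap an exon.

    windows: list of (score, overlaps_exon) tuples; a lower score is assumed
    to mean less tolerance for sequence diversity (hypothetical scoring).
    """
    if not windows:
        raise ValueError("windows must not be empty")
    # Rank windows from least tolerant to most tolerant.
    ranked = sorted(windows, key=lambda w: w[0])
    # Keep the bottom fraction, but always at least one window.
    n = max(1, int(len(ranked) * bottom_fraction))
    lowest = ranked[:n]
    outside = sum(1 for _, overlaps_exon in lowest if not overlaps_exon)
    return outside / n


# Toy data, invented for illustration only.
toy_windows = [
    (0.90, True), (0.05, False), (0.40, False), (0.10, True),
    (0.75, False), (0.60, True), (0.30, False), (0.85, False),
]
share = fraction_outside_exons(toy_windows, bottom_fraction=0.25)
```

With real data, the cutoff would be the 1 percent mentioned in the interview, and the scores would come from the published analysis rather than from made-up numbers.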
TS: And you identified several variants associated with phenotypic diversity?
KS: That is just the first step. We listed about 12 associations, but this is sequence diversity for the rest of the world to work on, to look for correlations between variants in the sequence and phenotypes. And we just put a few examples of how we could do this with whole genome sequencing where we could not find this with the whole exome sequencing.
TS: The genome sequences are available online, for other researchers to work on?
KS: They will be available through the UK Biobank. We also put on our website a database of allelic frequencies. The reason we did that is that when you do whole genome sequencing for diagnostic purposes, it is extremely important to have a reference that you can go to, to make sure if you are sequencing someone with a particular disease and you find a rare variant . . . that the variant that you find in the unfortunate child isn’t found in a bunch of healthy individuals. So it is a valuable resource for those who want to work on diagnostic sequencing. . . . We felt it was our duty to make it available to everyone who’s working on diagnostic sequencing.
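The check Stefánsson describes can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not deCODE's pipeline or the format of its database: the reference table, the variant keys, the frequencies, and the 0.0001 threshold are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical reference table: (chromosome, position, ref, alt) -> allele
# frequency in a largely healthy cohort. Values are invented for illustration.
REFERENCE_FREQUENCIES = {
    ("chr1", 12345, "A", "G"): 0.21,
    ("chr2", 67890, "C", "T"): 0.00002,
}


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str


def rare_candidates(variants, reference, max_frequency=0.0001):
    """Keep variants that are absent from the reference or rarer than max_frequency."""
    kept = []
    for v in variants:
        # A variant missing from the reference is treated as frequency 0.0.
        freq = reference.get((v.chrom, v.pos, v.ref, v.alt), 0.0)
        if freq <= max_frequency:
            kept.append(v)
    return kept


# Toy patient data, invented for illustration only.
patient_variants = [
    Variant("chr1", 12345, "A", "G"),   # common in the reference, so dropped
    Variant("chr2", 67890, "C", "T"),   # very rare in the reference, so kept
    Variant("chr3", 13579, "G", "A"),   # absent from the reference, so kept
]
candidates = rare_candidates(patient_variants, REFERENCE_FREQUENCIES)
```

A variant that is common among healthy people is unlikely to explain a rare disease, so only the rare or unseen variants remain as candidates for follow-up.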
Editor’s note: This interview has been edited for brevity.