
whole genome sequencing

All of Us: Release of Nearly 100,000 Whole Genome Sequences Sets Stage for New Discoveries


Diverse group of cartoon people with associated DNA

Nearly four years ago, NIH opened national enrollment for the All of Us Research Program. This historic program is building a vital research community within the United States of at least 1 million participant partners from all backgrounds. Its unifying goal is to advance precision medicine, an emerging form of health care tailored specifically to the individual, not the average patient as is now often the case. As part of this historic effort, many participants have offered DNA samples for whole genome sequencing, which provides information about almost all of an individual’s genetic makeup.

Earlier this month, the All of Us Research Program hit an important milestone. We released the first set of nearly 100,000 whole genome sequences from our participant partners. The sequences are stored in the All of Us Researcher Workbench, a powerful, cloud-based analytics platform that makes these data broadly accessible to registered researchers.

The All of Us Research Program and its many participant partners are leading the way toward more equitable representation in medical research. About half of this new genomic information comes from people who self-identify with a racial or ethnic minority group. That’s extremely important because, until now, over 90 percent of participants in large genomic studies were of European descent. This lack of diversity has had huge impacts: it has deepened health disparities and kept scientific discovery from fully benefiting everyone.

The Researcher Workbench also contains information from many of the participants’ electronic health records, Fitbit devices, and survey responses. Another neat feature is that the platform links to data from the U.S. Census Bureau’s American Community Survey to provide more details about the communities where participants live.

This unique and comprehensive combination of data will be key in transforming our understanding of health and disease. For example, given the vast amount of data and diversity in the Researcher Workbench, new diseases are undoubtedly waiting to be uncovered and defined. Many new genetic variants are also waiting to be identified that may better predict disease risk and response to treatment.

To speed up the discovery process, these data are being made available, both widely and wisely. To protect participants’ privacy, the program has removed all direct identifiers from the data and upholds strict requirements for researchers seeking access. Already, more than 1,500 scientists across the United States have gained access to the Researcher Workbench through their institutions after completing training and agreeing to the program’s strict rules for responsible use. Some of these researchers are already making discoveries that promote precision medicine, such as finding ways to predict how best to prevent vision loss in patients with glaucoma.

Beyond making genomic data available for research, All of Us participants have the opportunity to receive their personal DNA results, at no cost to them. So far, the program has offered genetic ancestry and trait results to more than 100,000 participants. Plans are underway to begin sharing health-related DNA results on hereditary disease risk and medication-gene interactions later this year.

This first release of genomic data is a huge milestone for the program and for health research more broadly, but it’s also just the start. The program’s genome centers continue to generate the genomic data and process about 5,000 additional participant DNA samples every week.

The ultimate goal is to gather health data from at least 1 million people living in the United States, and there’s plenty of time to join the effort. Whether you would like to contribute your own DNA and health information, engage in research, or support the All of Us Research Program as a partner, it’s easy to get involved. By taking part in this historic program, you can help to build a better and more equitable future for health research and precision medicine.

Note: Joshua Denny, M.D., M.S., is the Chief Executive Officer of NIH’s All of Us Research Program.


All of Us Research Program (NIH)

All of Us Research Hub

Join All of Us (NIH)

A Global Look at Cancer Genomes


Cancer Genomics

Cancer is a disease of the genome. It can be driven by many different types of DNA misspellings and rearrangements, which can cause cells to grow uncontrollably. While the first oncogenes with the potential to cause cancer were discovered more than 35 years ago, it’s been a long slog to catalog the universe of these potential DNA contributors to malignancy, let alone explore how they might inform diagnosis and treatment. So, I’m thrilled that an international team has completed the most comprehensive study to date of the entire genomes—the complete sets of DNA—of 38 different types of cancer.

Among the team’s most important discoveries is that the vast majority of tumors—about 95 percent—contained at least one identifiable spelling change in their genomes that appeared to drive the cancer [1]. That’s significantly higher than the level of “driver mutations” found in past studies that analyzed only a tumor’s exome, the small fraction of the genome that codes for proteins. Because many cancer drugs are designed to target specific proteins affected by driver mutations, the new findings indicate it may be worthwhile, perhaps even life-saving in many cases, to sequence the entire tumor genomes of a great many more people with cancer.

The latest findings, detailed in an impressive collection of 23 papers published in Nature and its affiliated journals, come from the international Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium. Known as the Pan-Cancer Project for short, it builds on earlier efforts to characterize the genomes of many cancer types, including NIH’s The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC).

In these latest studies, a team including more than 1,300 researchers from around the world analyzed the complete genomes of more than 2,600 cancer samples. Those samples included tumors of the brain, skin, esophagus, liver, and more, along with matched healthy cells taken from the same individuals.

In each of the resulting new studies, teams of researchers dug deep into various aspects of the cancer DNA findings to make a series of important inferences and discoveries. Here are a few intriguing highlights:

• The average cancer genome was found to contain not just one driver mutation, but four or five.

• About 13 percent of those driver mutations were found in so-called non-coding DNA, portions of the genome that don’t code for proteins [2].

• The mutations arose within about 100 different molecular processes, as indicated by their unique patterns or “mutational signatures” [3,4].

• Some of those signatures are associated with known cancer causes, including aberrant DNA repair and exposure to known carcinogens, such as tobacco smoke or UV light. Interestingly, many others are as-yet unexplained, suggesting there’s more to learn with potentially important implications for cancer prevention and drug development.

• A comprehensive analysis of 47 million genetic changes pieced together the chronology of cancer-causing mutations. This work revealed that many driver mutations occur years, if not decades, prior to a cancer’s diagnosis, a discovery with potentially important implications for early cancer detection [5].

The findings represent a big step toward cataloging all the major cancer-causing mutations, with important implications for the future of precision cancer care. And yet, the fact that the drivers in 5 percent of cancers remain mysterious (though they do have RNA abnormalities) comes as a reminder that there’s still a lot more work to do. The challenging next steps include connecting the cancer genome data to treatments and building meaningful predictors of patient outcomes.

To help in these endeavors, the Pan-Cancer Project has made all of its data and analytic tools available to the research community. As researchers at NIH and around the world continue to detail the diverse genetic drivers of cancer and the molecular processes that contribute to them, there is hope that these findings and others will ultimately vanquish, or at least rein in, this Emperor of All Maladies.


[1] Pan-Cancer analysis of whole genomes. ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium. Nature. 2020 Feb;578(7793):82-93.

[2] Analyses of non-coding somatic drivers in 2,658 cancer whole genomes. Rheinbay E et al; PCAWG Consortium. Nature. 2020 Feb;578(7793):102-111.

[3] The repertoire of mutational signatures in human cancer. Alexandrov LB et al; PCAWG Consortium. Nature. 2020 Feb;578(7793):94-101.

[4] Patterns of somatic structural variation in human cancer genomes. Li Y et al; PCAWG Consortium. Nature. 2020 Feb;578(7793):112-121.

[5] The evolutionary history of 2,658 cancers. Gerstung M, Jolly C, Leshchiner I, Dentro SC et al; PCAWG Consortium. Nature. 2020 Feb;578(7793):122-128.


The Genetics of Cancer (National Cancer Institute/NIH)

Precision Medicine in Cancer Treatment (NCI)

ICGC/TCGA Pan-Cancer Project

The Cancer Genome Atlas Program (NIH)

NCI and the Precision Medicine Initiative (NCI)

NIH Support: National Cancer Institute, National Human Genome Research Institute

Whole-Genome Sequencing Plus AI Yields Same-Day Genetic Diagnoses


Caption: Rapid whole-genome sequencing helped doctors diagnose Sebastiana Manuel with Ohtahara syndrome, a neurological condition that causes seizures. Her data are now being used as part of an effort to speed the diagnosis of other children born with unexplained illnesses. Credits: Getty Images (left); Jenny Siegwart (right).

Back in April 2003, when the international Human Genome Project successfully completed the first reference sequence of the human DNA blueprint, we were thrilled to have achieved that feat in just 13 years. Sure, the U.S. contribution to that first human reference sequence cost an estimated $400 million, but we knew (or at least we hoped) that the costs would come down quickly, and the speed would accelerate. How far we’ve come since then! A new study shows that whole genome sequencing—combined with artificial intelligence (AI)—can now be used to diagnose genetic diseases in seriously ill babies in less than 24 hours.

Take a moment to absorb this. I would submit that there is no other technology in the history of planet Earth that has experienced this degree of progress in speed and affordability. And, at the same time, DNA sequence technology has achieved spectacularly high levels of accuracy. The time-honored adage that you can only get two out of three for “faster, better, and cheaper” has been broken—all three have been dramatically enhanced by the advances of the last 16 years.

Rapid diagnosis is critical for infants born with mysterious conditions because it enables them to receive potentially life-saving interventions as soon as possible after birth. In a study in Science Translational Medicine, NIH-funded researchers describe development of a highly automated, genome-sequencing pipeline that’s capable of routinely delivering a diagnosis to anxious parents and health-care professionals dramatically earlier than typically has been possible [1].

While the cost of rapid DNA sequencing continues to fall, challenges remain in utilizing this valuable tool to make quick diagnostic decisions. In most clinical settings, the wait for whole-genome sequencing results still runs more than two weeks. Attempts to obtain faster results also have been labor intensive, requiring dedicated teams of experts to sift through the data, one sample at a time.

In the new study, a research team led by Stephen Kingsmore, Rady Children’s Institute for Genomic Medicine, San Diego, CA, describes a streamlined approach that accelerates every step in the process, making it possible to obtain whole-genome test results in a median time of about 20 hours and with much less manual labor. They propose that the system could deliver answers for 30 patients per week using a single genome sequencing instrument.

Here’s how it works: Instead of manually preparing blood samples, his team used special microbeads to isolate DNA much more rapidly with very little labor. The approach reduced the time for sample preparation from 10 hours to less than three. Then, using a state-of-the-art DNA sequencer, they sequenced those samples to obtain good quality whole genome data in just 15.5 hours.

The next potentially time-consuming challenge is making sense of all that data. To speed up the analysis, Kingsmore’s team took advantage of a machine-learning system called MOON. The automated platform sifts through all the data using artificial intelligence to search for potentially disease-causing variants.

The researchers paired MOON with a clinical language processing system, which allowed them to extract relevant information from the child’s electronic health records within seconds. Teaming that patient-specific information with data on more than 13,000 known genetic diseases in the scientific literature, the machine-learning system could pick out a likely disease-causing mutation from among 4.5 million potential variants in an impressive 5 minutes or less!
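To make the idea of matching a patient's clinical features against known diseases more concrete, here is a toy sketch in Python. It is not MOON's algorithm, which this post does not describe. The variant identifiers, disease names, feature lists, and the choice of a simple set-overlap (Jaccard) score are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate variant and the disease it might explain."""
    variant_id: str              # made-up identifier, not a real variant
    disease: str                 # made-up disease label
    disease_features: frozenset  # clinical features linked to that disease


def overlap_score(patient_features, disease_features):
    """Jaccard overlap: shared features divided by all features seen."""
    union = patient_features | disease_features
    if not union:
        return 0.0
    return len(patient_features & disease_features) / len(union)


def rank_candidates(patient_features, candidates):
    """Return candidates sorted from best to worst feature overlap."""
    return sorted(
        candidates,
        key=lambda c: overlap_score(patient_features, c.disease_features),
        reverse=True,
    )


# Features as they might be pulled from a health record (illustrative only).
patient = frozenset({"seizures", "developmental delay"})

candidates = [
    Candidate("variant_B", "disease B", frozenset({"immunodeficiency"})),
    Candidate("variant_A", "disease A",
              frozenset({"seizures", "developmental delay", "hypotonia"})),
    Candidate("variant_C", "disease C",
              frozenset({"seizures", "metabolic acidosis"})),
]

ranked = rank_candidates(patient, candidates)
for c in ranked:
    print(c.variant_id, round(overlap_score(patient, c.disease_features), 2))
```

A real system would weigh far more evidence than feature overlap, such as how rare a variant is and how it affects the gene, so this sketch only conveys the general shape of the matching step.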

To put the system to the test, the researchers first evaluated its ability to reach a correct diagnosis in a sample of 101 children with 105 previously diagnosed genetic diseases. In nearly every case, the automated diagnosis matched the opinions reached previously via the more lengthy and laborious manual interpretation of experts.

Next, the researchers tested the automated system in assisting diagnosis of seven seriously ill infants in the intensive care unit, and three previously diagnosed infants. They showed that their automated system could reach a diagnosis in less than 20 hours. That’s compared to the fastest manual approach, which typically took about 48 hours. The automated system also required about 90 percent less manpower.
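The stage durations quoted earlier in this post can be tallied against that 20-hour figure. The short Python sketch below does the arithmetic, under two assumptions: "less than three" hours of sample preparation is treated as an upper bound of 3.0 hours, and the three stages listed are the only ones the post itemizes. The stage names are labels chosen here, not terms from the study.

```python
# Stage durations in hours, taken from the figures quoted in this post.
STAGE_HOURS = {
    "sample_preparation": 3.0,            # "less than three" hours, used as an upper bound
    "sequencing": 15.5,                   # stated sequencing time
    "automated_interpretation": 5 / 60,   # "5 minutes or less", used as an upper bound
}

REPORTED_LIMIT_HOURS = 20.0  # diagnosis reached in "less than 20 hours"


def total_hours(stages):
    """Sum the duration of every itemized stage."""
    return sum(stages.values())


itemized = total_hours(STAGE_HOURS)
remaining = REPORTED_LIMIT_HOURS - itemized

print(f"Itemized stages: {itemized:.2f} hours")
print(f"Left for steps the post does not itemize: {remaining:.2f} hours")
```

On these assumptions, the itemized stages account for a little under 19 hours, which fits within the reported figure and leaves roughly an hour and a half for steps the post does not break out separately.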

The system nailed a rapid diagnosis for 3 of 7 infants without returning any false-positive results. Those diagnoses were made with an average time savings of more than 22 hours. In each case, the early diagnosis immediately influenced the treatment those children received. That’s key given that, for young children suffering from serious and unexplained symptoms such as seizures, metabolic abnormalities, or immunodeficiencies, time is of the essence.

Of course, artificial intelligence may never replace doctors and other healthcare providers. Kingsmore notes that 106 years after the invention of the autopilot, two pilots are still required to fly a commercial aircraft. Likewise, health care decisions based on genome interpretation also will continue to require the expertise of skilled physicians.

Still, such a rapid automated system will prove incredibly useful. For instance, this system can provide immediate provisional diagnosis, allowing the experts to focus their attention on more difficult unsolved cases or other needs. It may also prove useful in re-evaluating the evidence in the many cases in which manual interpretation by experts fails to provide an answer.

The automated system may also be useful for periodically reanalyzing data in the many cases that remain unsolved. Keeping up with such reanalysis is a particular challenge considering that researchers continue to discover hundreds of disease-associated genes and thousands of variants each and every year. The hope is that in the years ahead, the combination of whole genome sequencing, artificial intelligence, and expert care will make all the difference in the lives of many more seriously ill babies and their families.


[1] Diagnosis of genetic diseases in seriously ill children by rapid whole-genome sequencing and automated phenotyping and interpretation. Clark MM, Hildreth A, Batalov S, Ding Y, Chowdhury S, Watkins K, Ellsworth K, Camp B, Kint CI, Yacoubian C, Farnaes L, Bainbridge MN, Beebe C, Braun JJA, Bray M, Carroll J, Cakici JA, Caylor SA, Clarke C, Creed MP, Friedman J, Frith A, Gain R, Gaughran M, George S, Gilmer S, Gleeson J, Gore J, Grunenwald H, Hovey RL, Janes ML, Lin K, McDonagh PD, McBride K, Mulrooney P, Nahas S, Oh D, Oriol A, Puckett L, Rady Z, Reese MG, Ryu J, Salz L, Sanford E, Stewart L, Sweeney N, Tokita M, Van Der Kraan L, White S, Wigby K, Williams B, Wong T, Wright MS, Yamada C, Schols P, Reynders J, Hall K, Dimmock D, Veeraraghavan N, Defay T, Kingsmore SF. Sci Transl Med. 2019 Apr 24;11(489).


DNA Sequencing Fact Sheet (National Human Genome Research Institute/NIH)

Genomics and Medicine (NHGRI/NIH)

Genetic and Rare Disease Information Center (National Center for Advancing Translational Sciences/NIH)

Stephen Kingsmore (Rady Children’s Institute for Genomic Medicine, San Diego, CA)

NIH Support: National Institute of Child Health and Human Development; National Human Genome Research Institute; National Center for Advancing Translational Sciences

International “Big Data” Study Offers Fresh Insights into T2D


World map
Caption: This international “Big Data” study involved hundreds of researchers in 22 countries (red).

It’s estimated that about 10 percent of the world’s population either has type 2 diabetes (T2D) or will develop the disease during their lives [1]. Type 2 diabetes (formerly called “adult-onset”) happens when the body doesn’t produce or use insulin properly, causing glucose levels to rise. While diet and exercise are critical contributory factors to this potentially devastating disease, genetic factors are also important. In fact, over the last decade alone, studies have turned up more than 80 genetic regions that contribute to T2D risk, with much more still to be discovered.

Now, a major international effort, which includes work from my own NIH intramural research laboratory, has published new data that accelerate understanding of how a person’s genetic background contributes to T2D risk. The new study, reported in Nature and unprecedented in its investigative scale and scope, pulled together the largest-ever inventory of DNA sequence changes involved in T2D, and compared their distribution in people from around the world [2]. This “Big Data” strategy has already yielded important new insights into the biology underlying the disease, some of which may yield novel approaches to diabetes treatment and prevention.

The study, led by Michael Boehnke at the University of Michigan, Ann Arbor, Mark McCarthy at the University of Oxford, England, and David Altshuler, until recently at the Broad Institute, Cambridge, MA, involved more than 300 scientists in 22 countries.