Big Data

All of Us: Release of Nearly 100,000 Whole Genome Sequences Sets Stage for New Discoveries

Posted on by Joshua Denny, M.D., M.S., and Lawrence Tabak, D.D.S., Ph.D.

Nearly four years ago, NIH opened national enrollment for the All of Us Research Program. This historic program is building a vital research community within the United States of at least 1 million participant partners from all backgrounds. Its unifying goal is to advance precision medicine, an emerging form of health care tailored specifically to the individual, not the average patient as is now often the case. As part of this historic effort, many participants have offered DNA samples for whole genome sequencing, which provides information about almost all of an individual’s genetic makeup.

Earlier this month, the All of Us Research Program hit an important milestone. We released the first set of nearly 100,000 whole genome sequences from our participant partners. The sequences are stored in the All of Us Researcher Workbench, a powerful, cloud-based analytics platform that makes these data broadly accessible to registered researchers.

The All of Us Research Program and its many participant partners are leading the way toward more equitable representation in medical research. About half of this new genomic information comes from people who self-identify with a racial or ethnic minority group. That’s extremely important because, until now, over 90 percent of participants in large genomic studies were of European descent. This lack of diversity has had huge impacts—deepening health disparities and hindering scientific discovery from fully benefiting everyone.

The Researcher Workbench also contains information from many of the participants’ electronic health records, Fitbit devices, and survey responses. Another neat feature is that the platform links to data from the U.S. Census Bureau’s American Community Survey to provide more details about the communities where participants live.

This unique and comprehensive combination of data will be key in transforming our understanding of health and disease. For example, given the vast amount of data and diversity in the Researcher Workbench, new diseases are undoubtedly waiting to be uncovered and defined. Many new genetic variants are also waiting to be identified that may better predict disease risk and response to treatment.

To speed up the discovery process, these data are being made available, both widely and wisely. To protect participants’ privacy, the program has removed all direct identifiers from the data and upholds strict requirements for researchers seeking access. Already, more than 1,500 scientists across the United States have gained access to the Researcher Workbench through their institutions after completing training and agreeing to the program’s strict rules for responsible use. Some of these researchers are already making discoveries that promote precision medicine, such as finding ways to predict how best to prevent vision loss in patients with glaucoma.

Beyond making genomic data available for research, All of Us participants have the opportunity to receive their personal DNA results, at no cost to them. So far, the program has offered genetic ancestry and trait results to more than 100,000 participants. Plans are underway to begin sharing health-related DNA results on hereditary disease risk and medication-gene interactions later this year.

This first release of genomic data is a huge milestone for the program and for health research more broadly, but it’s also just the start. The program’s genome centers continue to generate genomic data, processing about 5,000 additional participant DNA samples every week.

The ultimate goal is to gather health data from at least 1 million people living in the United States, and there’s plenty of time to join the effort. Whether you would like to contribute your own DNA and health information, engage in research, or support the All of Us Research Program as a partner, it’s easy to get involved. By taking part in this historic program, you can help to build a better and more equitable future for health research and precision medicine.

Note: Joshua Denny, M.D., M.S., is the Chief Executive Officer of NIH’s All of Us Research Program.

Links:

All of Us Research Program (NIH)

All of Us Research Hub

Join All of Us (NIH)


Preventing Glaucoma Vision Loss with ‘Big Data’

Posted on by Dr. Francis Collins

Each morning, more than 2 million Americans start their rise-and-shine routine by remembering to take their eye drops. The drops treat their open-angle glaucoma, the most common form of the disease, caused by obstructed drainage of fluid where the eye’s cornea and iris meet. The slow drainage increases fluid pressure at the front of the eye. Meanwhile, at the back of the eye, fluid pushes on the optic nerve, causing its bundled fibers to fray and leading to gradual loss of side vision.

For many, the eye drops help to lower intraocular pressure and prevent vision loss. But for others, the drops aren’t sufficient and their intraocular pressure remains high. Such people will need next-level care, possibly including eye surgery, to reopen the clogged drainage ducts and slow this disease that disproportionately affects older adults and African Americans over age 40.

Sally Baxter
Credit: University of California San Diego

Sally Baxter, a physician-scientist with expertise in ophthalmology at the University of California, San Diego (UCSD), wants to learn how to predict who is at greatest risk for serious vision loss from open-angle and other forms of glaucoma. That way, those patients can receive more aggressive early care to protect their vision from this second-leading cause of blindness in the U.S.

To pursue this challenging research goal, Baxter has received a 2020 NIH Director’s Early Independence Award. Her research will build on the clinical observation that people with glaucoma frequently battle other chronic health problems, such as high blood pressure, diabetes, and heart disease. To learn more about how these and other chronic health conditions might influence glaucoma outcomes, Baxter has begun mining a rich source of data: electronic health records (EHRs).

In an earlier study of patients at UCSD, Baxter showed that EHR data helped to predict which people would need glaucoma surgery within the next six months [1]. The finding suggested that the EHR, especially information on a patient’s blood pressure and medications, could predict the risk for worsening glaucoma.

In her NIH-supported work, she’s already extended this earlier “Big Data” finding by analyzing data from more than 1,200 people with glaucoma who participate in NIH’s All of Us Research Program [2]. With consent from the participants, Baxter used their EHRs to train a computer to find telltale patterns within the data and then predict with 80 to 99 percent accuracy who would later require eye surgery.

The findings confirm that machine learning approaches and EHR data can indeed help in managing people with glaucoma. That’s true even when the EHR data don’t contain any information specific to a person’s eye health.
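Baxter’s actual models aren’t reproduced here, but the general idea of predicting surgery risk from systemic EHR features can be sketched with synthetic data. Everything below is hypothetical: the three features (systolic blood pressure, medication count, age), the simulated outcome labels, and the plain logistic-regression classifier trained by gradient descent all stand in for the real pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for de-identified EHR features (all hypothetical).
n = 400
X = np.column_stack([
    rng.normal(130, 15, n),  # systolic blood pressure (mmHg)
    rng.integers(0, 4, n),   # number of glaucoma medications
    rng.normal(65, 10, n),   # age (years)
])

# Simulated outcome: higher BP and more medications raise surgery risk.
logit = 0.04 * (X[:, 0] - 130) + 0.9 * X[:, 1] - 1.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

# Standardize features, add an intercept column, and fit logistic
# regression by batch gradient descent.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
Xb = np.column_stack([np.ones(n), Xs])
w = np.zeros(Xb.shape[1])
for _ in range(2000):
    p = 1 / (1 + np.exp(-Xb @ w))
    w -= 0.1 * Xb.T @ (p - y) / n  # gradient of the logistic loss

pred = (1 / (1 + np.exp(-Xb @ w)) > 0.5).astype(float)
accuracy = (pred == y).mean()
```

Even this toy version captures the key point of the paragraph above: none of the inputs is eye-specific, yet a model can still learn a useful risk signal from systemic data.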

In fact, the work of Baxter and other groups has pointed to an especially important role for blood pressure in shaping glaucoma outcomes. Hoping to explore this lead further with the support of her Early Independence Award, Baxter also will enroll patients in a study to test whether blood-pressure monitoring smart watches can add important predictive information on glaucoma progression. By combining round-the-clock blood pressure data with EHR data, she hopes to predict glaucoma progression with even greater precision. She’s also exploring innovative ways to track whether people with glaucoma use their eye drops as prescribed, which is another important predictor of the risk of irreversible vision loss [3].

Glaucoma research continues to make great progress. This progress ranges from basic research to the development of new treatments and high-resolution imaging technologies to improve diagnostics. But Baxter’s quest to develop practical clinical tools holds great promise, too, and hopefully will help one day to protect the vision of millions of people with glaucoma around the world.

References:

[1] Machine learning-based predictive modeling of surgical intervention in glaucoma using systemic data from electronic health records. Baxter SL, Marks C, Kuo TT, Ohno-Machado L, Weinreb RN. Am J Ophthalmol. 2019 Dec; 208:30-40.

[2] Predictive analytics for glaucoma using data from the All of Us Research Program. Baxter SL, Saseendrakumar BR, Paul P, Kim J, Bonomi L, Kuo TT, Loperena R, Ratsimbazafy F, Boerwinkle E, Cicek M, Clark CR, Cohn E, Gebo K, Mayo K, Mockrin S, Schully SD, Ramirez A, Ohno-Machado L; All of Us Research Program Investigators. Am J Ophthalmol. 2021 Jul;227:74-86.

[3] Smart electronic eyedrop bottle for unobtrusive monitoring of glaucoma medication adherence. Aguilar-Rivera M, Erudaitius DT, Wu VM, Tantiongloc JC, Kang DY, Coleman TP, Baxter SL, Weinreb RN. Sensors (Basel). 2020 Apr 30;20(9):2570.

Links:

Glaucoma (National Eye Institute/NIH)

All of Us Research Program (NIH)

Video: Sally Baxter (All of Us Research Program)

Sally Baxter (University of California San Diego)

Baxter Project Information (NIH RePORTER)

NIH Director’s Early Independence Award (Common Fund)

NIH Support: Common Fund


Using Artificial Intelligence to Catch Irregular Heartbeats

Posted on by Dr. Francis Collins

ECG Readout
Credit: gettyimages/enot-poloskun

Thanks to advances in wearable health technologies, it’s now possible for people to monitor their heart rhythms at home for days, weeks, or even months via wireless electrocardiogram (EKG) patches. In fact, my Apple Watch makes it possible to record a real-time EKG whenever I want. (I’m glad to say I am in normal sinus rhythm.)

For true medical benefit, however, the challenge lies in analyzing the vast amounts of data—often hundreds of hours worth per person—to distinguish reliably between harmless rhythm irregularities and potentially life-threatening problems. Now, NIH-funded researchers have found that artificial intelligence (AI) can help.

A powerful computer “studied” more than 90,000 EKG recordings, from which it “learned” to recognize patterns, form rules, and apply them accurately to future EKG readings. The computer became so “smart” that it could classify 10 different types of irregular heart rhythms, including atrial fibrillation (AFib). In fact, after just seven months of training, the computer-devised algorithm was as good as—and in some cases even better than—cardiology experts at making the correct diagnostic call.

EKG tests measure electrical impulses in the heart, which signal the heart muscle to contract and pump blood to the rest of the body. The precise, wave-like features of the electrical impulses allow doctors to determine whether a person’s heart is beating normally.

For example, in people with AFib, the heart’s upper chambers (the atria) contract rapidly and unpredictably, causing the ventricles (the main heart muscle) to contract irregularly rather than in a steady rhythm. This is an important arrhythmia to detect, even if it may only be present occasionally over many days of monitoring. That’s not always easy to do with current methods.
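One reason AFib is detectable at all in long recordings is that it produces “irregularly irregular” spacing between heartbeats. The sketch below is purely illustrative, not the Stanford team’s method: it simulates beat times for a steady sinus rhythm and an AFib-like rhythm, then flags irregularity when the coefficient of variation of the R-R intervals (the time between successive beats) exceeds an arbitrary cutoff.

```python
import numpy as np

def rr_irregularity(r_peak_times_s):
    """Coefficient of variation of R-R intervals (beat-to-beat spacing)."""
    rr = np.diff(np.asarray(r_peak_times_s))
    return rr.std() / rr.mean()

rng = np.random.default_rng(1)

# Steady sinus rhythm: beats ~0.8 s apart with small natural variation.
sinus = np.cumsum(rng.normal(0.8, 0.02, 60))

# AFib-like rhythm: beat spacing varies widely and unpredictably.
afib = np.cumsum(rng.uniform(0.4, 1.2, 60))

CV_THRESHOLD = 0.10  # illustrative cutoff, not a clinical standard

def flag(beat_times):
    return rr_irregularity(beat_times) > CV_THRESHOLD
```

A real detector works on the raw EKG waveform rather than pre-extracted beat times, which is exactly why the deep learning approach described next is so useful.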

Here’s where the team, led by computer scientists Awni Hannun and Andrew Ng, Stanford University, Palo Alto, CA, saw an AI opportunity. As published in Nature Medicine, the Stanford team started by assembling a large EKG dataset from more than 53,000 people [1]. The data included various forms of arrhythmia and normal heart rhythms from people who had worn the FDA-approved Zio patch for about two weeks.

The Zio patch is a 2-by-5-inch adhesive patch, worn much like a bandage, on the upper left side of the chest. It’s water resistant and can be kept on around the clock while a person sleeps, exercises, or takes a shower. The wireless patch continuously monitors heart rhythms, storing EKG data for later analysis.

The Stanford researchers looked to machine learning to process all the EKG data. In machine learning, computers rely on large datasets of examples in order to learn how to perform a given task. The accuracy improves as the machine “sees” more data.

But the team’s real interest was in using a special class of machine learning called deep neural networks, or deep learning. Deep learning is inspired by how our own brain’s neural networks process information, learning to focus on some details but not others.

In deep learning, computers look for patterns in data. As they begin to “see” complex relationships, some connections in the network are strengthened while others are weakened. The network is typically composed of multiple information-processing layers, which operate on the data and compute increasingly complex and abstract representations.

Those data reach the final output layer, which acts as a classifier, assigning each bit of data to a particular category or, in the case of the EKG readings, a diagnosis. In this way, computers can learn to analyze and sort highly complex data using both more obvious and hidden features.
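The layered idea above can be shown in miniature with a single forward pass. Everything here is a hypothetical stand-in: the sizes (187 EKG samples in, 32 hidden units, 3 rhythm classes out) and the random weights are not the published network, whose trained weights would be learned from labeled EKGs.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Random stand-ins for trained weights (hypothetical sizes).
W1 = rng.normal(0, 0.1, (187, 32))  # hidden layer: would learn low-level wave shapes
W2 = rng.normal(0, 0.1, (32, 3))    # output layer: acts as the classifier

def classify(ekg_batch):
    hidden = np.maximum(0, ekg_batch @ W1)  # ReLU keeps some features, zeroes others
    return softmax(hidden @ W2)             # per-class probabilities for each reading

# Five fake "EKG segments" of 187 samples each.
probs = classify(rng.normal(0, 1, (5, 187)))
```

The output layer returns a probability for each rhythm class per reading, which is what “acting as a classifier” means in practice: the predicted diagnosis is simply the class with the highest probability.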

Ultimately, the computer in the new study could differentiate between EKG readings representing 10 different arrhythmias as well as a normal heart rhythm. It could also tell the difference between irregular heart rhythms and background “noise” caused by interference of one kind or another, such as a jostled or disconnected Zio patch.

For validation, the computer attempted to assign a diagnosis to the EKG readings of 328 additional patients. Independently, several expert cardiologists also read those EKGs and reached a consensus diagnosis for each patient. In almost all cases, the computer’s diagnosis agreed with the consensus of the cardiologists. The computer also made its calls much faster.

Next, the researchers compared the computer’s diagnoses to those of six individual cardiologists who weren’t part of the original consensus committee. The results show that the computer actually outperformed these experienced cardiologists!

The findings suggest that artificial intelligence can be used to improve the accuracy and efficiency of EKG readings. In fact, Hannun reports that iRhythm Technologies, maker of the Zio patch, has already incorporated the algorithm into the interpretation now being used to analyze data from real patients.

As impressive as this is, we are surely just at the beginning of AI applications to health and health care. In recognition of the opportunities ahead, NIH has recently launched a working group on AI to explore ways to make the best use of existing data, and harness the potential of artificial intelligence and machine learning to advance biomedical research and the practice of medicine.

Meanwhile, more and more impressive NIH-supported research featuring AI is being published. In my next blog, I’ll highlight a recent paper that uses AI to make a real difference for cervical cancer, particularly in low resource settings.

Reference:

[1] Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Hannun AY, Rajpurkar P, Haghpanahi M, Tison GH, Bourn C, Turakhia MP, Ng AY. Nat Med. 2019 Jan;25(1):65-69.

Links:

Arrhythmia (National Heart, Lung, and Blood Institute/NIH)

Video: Artificial Intelligence: Collecting Data to Maximize Potential (NIH)

Andrew Ng (Palo Alto, CA)

NIH Support: National Heart, Lung, and Blood Institute


Meeting with Congressman Ro Khanna

Posted on by Dr. Francis Collins

Larry Tabak, Congressman Ro Khanna and Francis Collins at the NIH Clinical Center

We had a great visit with Congressman Ro Khanna (center) of California. Our discussion included recent advances in neuroscience, genomics, Big Data, and research on food allergies. NIH Deputy Director Larry Tabak (left) and I welcomed Congressman Khanna to the NIH Clinical Center on July 30, 2018.


Crowdsourcing 600 Years of Human History

Posted on by Dr. Francis Collins

Family Tree

Caption: A 6,000-person family tree, showing individuals spanning seven generations (green) and their marital links (red).
Credit: Columbia University, New York City

You may have worked on constructing your family tree, perhaps listing your ancestry back to your great-grandparents. Or with so many public records now available online, you may have even uncovered enough information to discover some unexpected long-lost relatives. Or maybe you’ve even submitted a DNA sample to one of the commercial sources to see what you could learn about your ancestry. But just how big can a family tree grow using today’s genealogical tools?

A recent paper offers a truly eye-opening answer. With permission to download the publicly available, online profiles of 86 million genealogy hobbyists, most of European descent, the researchers assembled more than 5 million family trees. The largest totaled more than 13 million people! By merging individual trees from these crowd-sourced public data, including the relatively modest 6,000-person seedling shown above, the researchers were able to trace lineages back 11 generations on average, reaching the 15th century and the days of Christopher Columbus. Doubly exciting, these large datasets offer a powerful new resource for studying human health, having already provided novel insights into our family structures, genes, and longevity.
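Mechanically, merging overlapping trees is a connected-components problem: any two trees that share a person collapse into one larger tree. A toy union-find sketch, with made-up names and only parent-child links (the real data also include marital links), illustrates the idea:

```python
# Two hypothetical crowd-sourced trees that share the profile "carla".
tree_a = [("alice", "bob"), ("bob", "carla")]  # (parent, child) links
tree_b = [("carla", "dan"), ("dan", "erin")]

parent = {}

def find(x):
    """Return the representative of x's connected tree."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression keeps lookups fast
        x = parent[x]
    return x

def union(a, b):
    """Merge the trees containing a and b."""
    parent[find(a)] = find(b)

# Every parent-child link places both people in the same connected tree.
for p, c in tree_a + tree_b:
    union(p, c)

# Because "carla" appears in both trees, all five people end up in one tree.
merged_size = sum(1 for person in parent if find(person) == find("alice"))
```

At the scale of the study, the same principle applied across millions of overlapping hobbyist trees is what yields single trees of 13 million people.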


Creative Minds: Looking for Common Threads in Rare Diseases

Posted on by Dr. Francis Collins

Valerie Arboleda
Credit: UCLA/Margaret Sison Photography

Four years ago, Valerie Arboleda accomplished something most young medical geneticists rarely do. She helped discover a rare congenital disease now known as KAT6A syndrome [1]. From the original 10 cases to the more than 100 diagnosed today, KAT6A kids share a single altered gene that causes neuro-developmental delays, most prominently in learning to walk and talk, plus a spectrum of possible abnormalities involving the head, face, heart, and immune system.

Now, Arboleda wants to accomplish something even more groundbreaking. With a 2017 NIH Director’s Early Independence Award, she will develop ways to mine Big Data—the voluminous amounts of DNA sequence and other biological information now stored in public databases—to unearth new clues into the biology of rare disorders like KAT6A syndrome. If successful, Arboleda’s work could bring greater precision to the diagnosis and potentially treatment of Mendelian disorders, as well as provide greater clarity into the specific challenges that might lie ahead for an affected child.


Creative Minds: Building Better Computational Models of Common Disease

Posted on by Dr. Francis Collins

Hilary Finucane

Not so long ago, Hilary Finucane was a talented young mathematician about to complete a master’s degree in theoretical computer science. As much as she enjoyed exploring pure mathematics, Finucane had begun having second thoughts about her career choice. She wanted to use her gift for numbers in a way that would have more real-world impact.

The solution to her dilemma was, literally, standing right by her side. Her husband Yakir Reshef, also a mathematician, was developing a new algorithm at the Broad Institute of MIT and Harvard, Cambridge, MA, to improve detection of unexpected associations in large data sets. So, Finucane helped the Broad team with modeling biomedical topics ranging from the gut microbiome to global health. That work led to her co-authoring a paper in the journal Science [1], providing a strong start to what’s shaping up to be a rewarding career in computational biology.


Cardiometabolic Disease: Big Data Tackles a Big Health Problem

Posted on by Dr. Francis Collins

Cardiometabolic risk loci

More and more studies are popping up that demonstrate the power of Big Data analyses to get at the underlying molecular pathology of some of our most common diseases. A great example, which may have flown a bit under the radar during the summer holidays, involves cardiometabolic disease. It’s an umbrella term for common vascular and metabolic conditions, including hypertension, impaired glucose and lipid metabolism, excess belly fat, and inflammation. All of these components of cardiometabolic disease can increase a person’s risk for a heart attack or stroke.

In the study, an international research team tapped into the power of genomic data to develop clearer pictures of the complex biocircuitry in seven types of vascular and metabolic tissue known to be affected by cardiometabolic disease: the liver, the heart’s aortic root, visceral abdominal fat, subcutaneous fat, internal mammary artery, skeletal muscle, and blood.

The researchers found that while some circuits might regulate the level of gene expression in just one tissue, that’s often not the case. In fact, the researchers’ computational models show that such genetic circuitry can be organized into super networks that work together to influence how multiple tissues carry out fundamental life processes, such as metabolizing glucose or regulating lipid levels. When these networks are perturbed, perhaps by things like inherited variants that affect gene expression, or environmental influences such as a high-carb diet, sedentary lifestyle, the aging process, or infectious disease, the researchers’ modeling work suggests that multiple tissues can be affected, resulting in chronic, systemic disorders including cardiometabolic disease.
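The co-expression circuitry such studies model can be illustrated in miniature: correlate each gene’s expression across samples, then keep only strong correlations as network edges. The genes, samples, and cutoff below are all synthetic stand-ins, far simpler than the study’s multi-tissue models, but they show the basic construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expression of 4 genes across 50 samples of one tissue.
# Genes 0 and 1 share a common "driver", mimicking a co-regulated circuit.
driver = rng.normal(0, 1, 50)
expr = np.stack([
    driver + rng.normal(0, 0.3, 50),  # gene 0, driven
    driver + rng.normal(0, 0.3, 50),  # gene 1, driven
    rng.normal(0, 1, 50),             # gene 2, independent
    rng.normal(0, 1, 50),             # gene 3, independent
])

# Pairwise correlation matrix; strong entries become network edges.
corr = np.corrcoef(expr)
CUTOFF = 0.7  # illustrative threshold, not from the study
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)
         if abs(corr[i, j]) > CUTOFF]
```

Only the co-regulated pair survives the cutoff. Scaling this up across seven tissues, and asking which modules of edges recur or connect across tissues, is the essence of the super-network analysis described above.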


Big Data and Imaging Analysis Yields High-Res Brain Map

Posted on by Dr. Francis Collins

The HCP’s multi-modal cortical parcellation

Caption: Map of 180 areas in the left and right hemispheres of the cerebral cortex.
Credit: Matthew F. Glasser, David C. Van Essen, Washington University Medical School, Saint Louis, Missouri

Neuroscientists have been working for a long time to figure out how the human brain works, and that has led many through the years to attempt to map its various regions and create a detailed atlas of their complex geography and functions. While great progress has been made in recent years, existing brain maps have remained relatively blurry and incomplete, reflecting only limited aspects of brain structure or function and typically in just a few people.

In a study reported recently in the journal Nature, an NIH-funded team of researchers has begun to bring this map of the human brain into much sharper focus [1]. By combining multiple types of cutting-edge brain imaging data from more than 200 healthy young men and women, the researchers were able to subdivide the cerebral cortex, the brain’s outer layer, into 180 specific areas in each hemisphere. Remarkably, almost 100 of those areas had never before been described. This new high-resolution brain map will advance fundamental understanding of the human brain and will help to bring greater precision to the diagnosis and treatment of many brain disorders.


International “Big Data” Study Offers Fresh Insights into T2D

Posted on by Dr. Francis Collins

World map
Caption: This international “Big Data” study involved hundreds of researchers in 22 countries (red).

It’s estimated that about 10 percent of the world’s population either has type 2 diabetes (T2D) or will develop the disease during their lives [1]. Type 2 diabetes (formerly called “adult-onset”) happens when the body doesn’t produce or use insulin properly, causing glucose levels to rise. While diet and exercise are critical contributory factors to this potentially devastating disease, genetic factors are also important. In fact, over the last decade alone, studies have turned up more than 80 genetic regions that contribute to T2D risk, with much more still to be discovered.

Now, a major international effort, which includes work from my own NIH intramural research laboratory, has published new data that accelerate understanding of how a person’s genetic background contributes to T2D risk. The new study, reported in Nature and unprecedented in its investigative scale and scope, pulled together the largest-ever inventory of DNA sequence changes involved in T2D, and compared their distribution in people from around the world [2]. This “Big Data” strategy has already yielded important new insights into the biology underlying the disease, some of which may yield novel approaches to diabetes treatment and prevention.
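At its core, testing whether a DNA sequence variant contributes to T2D risk means comparing allele frequencies between people with the disease (cases) and people without it (controls). A minimal sketch, with hypothetical counts and a hand-rolled Pearson chi-square on a 2x2 allele-count table (not the study’s actual statistical machinery, which is far more sophisticated), looks like this:

```python
import numpy as np

def allele_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic on a 2x2 table of allele counts (1 d.o.f.)."""
    obs = np.array([[case_alt, case_ref],
                    [ctrl_alt, ctrl_ref]], dtype=float)
    row = obs.sum(axis=1, keepdims=True)
    col = obs.sum(axis=0, keepdims=True)
    exp = row * col / obs.sum()            # expected counts if no association
    return ((obs - exp) ** 2 / exp).sum()  # large values suggest association

# Hypothetical counts: the risk allele is more common among T2D cases.
chi2_assoc = allele_chi_square(300, 700, 200, 800)

# A variant with near-identical case/control frequencies shows little signal.
chi2_null = allele_chi_square(250, 750, 248, 752)
```

With 1 degree of freedom, a statistic above about 3.84 corresponds to p < 0.05, though genome-wide studies like this one demand far stricter thresholds because millions of variants are tested at once.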

The study, led by Michael Boehnke at the University of Michigan, Ann Arbor, Mark McCarthy at the University of Oxford, England, and David Altshuler, until recently at the Broad Institute, Cambridge, MA, involved more than 300 scientists in 22 countries.

