machine learning

Using AI to Advance Understanding of Long COVID Syndrome

The COVID-19 pandemic continues to present considerable public health challenges in the United States and around the globe. One of the most puzzling is why many people who get over an initial and often relatively mild COVID illness later develop new and potentially debilitating symptoms. These symptoms run the gamut, including fatigue, shortness of breath, brain fog, anxiety, and gastrointestinal trouble.

People understandably want answers to help them manage this complex condition referred to as Long COVID syndrome. But because Long COVID is so variable from person to person, it’s extremely difficult to work backwards and determine what these people had in common that might have made them susceptible to Long COVID. The variability also makes it difficult to identify all those who have Long COVID, whether they realize it or not. But a recent study, published in the journal Lancet Digital Health, shows that a well-trained computer and its artificial intelligence can help [1].

Researchers found that computers, after scanning thousands of electronic health records (EHRs) from people with Long COVID, could reliably make the call. The results, though still preliminary and in need of further validation, point the way to developing a fast, easy-to-use computer algorithm to help determine whether a person with a positive COVID test is likely to battle Long COVID.

In this groundbreaking study, NIH-supported researchers led by Emily Pfaff, University of North Carolina, Chapel Hill, and Melissa Haendel, the University of Colorado Anschutz Medical Campus, Aurora, relied on machine learning. In machine learning, a computer sifts through vast amounts of data to look for patterns. One reason machine learning is so powerful is that it doesn’t require humans to tell the computer which features it should look for. As such, machine learning can pick up on subtle patterns that people would otherwise miss.

In this case, Pfaff, Haendel, and team decided to “train” their computer on EHRs from people who had reported a COVID-19 infection. (The records are de-identified to protect patient privacy.) The researchers found just what they needed in the National COVID Cohort Collaborative (N3C), a national, publicly available data resource sponsored by NIH’s National Center for Advancing Translational Sciences. The study is part of NIH’s Researching COVID to Enhance Recovery (RECOVER) initiative, which aims to improve understanding of Long COVID.

The researchers defined a group of more than 1.5 million adults in N3C who either had been diagnosed with COVID-19 or had a record of a positive COVID-19 test at least 90 days prior. Next, they examined common features, including any doctor visits, diagnoses, or medications, from roughly 100,000 adults in that group.

They fed that EHR data into a computer, along with health information from almost 600 patients who’d been seen at a Long COVID clinic. They developed three machine learning models: one to identify potential Long COVID patients across the whole dataset and two others that focused separately on people who had or hadn’t been hospitalized.

All three models proved effective for identifying people with potential Long COVID. Each of the models had an 85 percent or better discrimination threshold, indicating they are highly accurate. That’s important because, once researchers can identify those with Long COVID in a large database of people such as N3C, they can begin to ask and answer many critical questions about any differences in an individual’s risk factors or treatment that might explain why some get Long COVID and others don’t.
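
For readers curious how a model’s ability to discriminate between two groups can be scored, here is a minimal sketch. It assumes the 85 percent figure can be read as an area under the ROC curve (the post does not name the metric, so this is an assumption), and the labels and scores are made up for illustration.

```python
def auroc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive case gets the
    higher score; ties count as half a win."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = seen at a Long COVID clinic, 0 = not.
example_labels = [1, 1, 1, 0, 0, 0, 0]
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

print(auroc(example_labels, example_scores))
```

A value of 0.5 would mean the model ranks patients no better than chance, while 1.0 would mean every Long COVID patient scores higher than every other patient.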

This new study is also an excellent example of N3C’s goal to assemble data from EHRs that enable researchers around the world to get rapid answers and seek effective interventions for COVID-19, including its long-term health effects. It’s also made important progress toward the urgent goal of the RECOVER initiative to identify people with or at risk for Long COVID who may be eligible to participate in clinical trials of promising new treatment approaches.

Long COVID remains a puzzling public health challenge. Another recent NIH study published in the journal Annals of Internal Medicine set out to identify people with symptoms of Long COVID, most of whom had recovered from mild-to-moderate COVID-19 [2]. More than half had signs of Long COVID. But, despite extensive testing, the NIH researchers were unable to pinpoint any underlying cause of the Long COVID symptoms in most cases.

So if you’d like to help researchers solve this puzzle, RECOVER is now enrolling adults and kids—including those who have and have not had COVID—at more than 80 study sites around the country.

References:

[1] Identifying who has long COVID in the USA: a machine learning approach using N3C data. Pfaff ER, Girvin AT, Bennett TD, Bhatia A, Brooks IM, Deer RR, Dekermanjian JP, Jolley SE, Kahn MG, Kostka K, McMurry JA, Moffitt R, Walden A, Chute CG, Haendel MA; N3C Consortium. Lancet Digit Health. 2022 May 16:S2589-7500(22)00048-6.

[2] A longitudinal study of COVID-19 sequelae and immunity: baseline findings. Sneller MC, Liang CJ, Marques AR, Chung JY, Shanbhag SM, Fontana JR, Raza H, Okeke O, Dewar RL, Higgins BP, Tolstenko K, Kwan RW, Gittens KR, Seamon CA, McCormack G, Shaw JS, Okpali GM, Law M, Trihemasava K, Kennedy BD, Shi V, Justement JS, Buckner CM, Blazkova J, Moir S, Chun TW, Lane HC. Ann Intern Med. 2022 May 24:M21-4905.

Links:

COVID-19 Research (NIH)

National COVID Cohort Collaborative (N3C) (National Center for Advancing Translational Sciences/NIH)

RECOVER Initiative

Emily Pfaff (University of North Carolina, Chapel Hill)

Melissa Haendel (University of Colorado, Aurora)

NIH Support: National Center for Advancing Translational Sciences; National Institute of General Medical Sciences; National Institute of Allergy and Infectious Diseases


Millions of Single-Cell Analyses Yield Most Comprehensive Human Cell Atlas Yet

A field of playing cards showing different body tissues

There are 37 trillion or so cells in our bodies that work together to give us life. But it may surprise you that we still haven’t put a good number on how many distinct cell types there are within those trillions of cells.

That’s why in 2016, a team of researchers from around the globe launched a historic project called the Human Cell Atlas (HCA) consortium to identify and define the hundreds of presumed distinct cell types in our bodies. Knowing where each cell type resides in the body, and which genes each one turns on or off to create its own unique molecular identity, will revolutionize our studies of human biology and medicine across the board.

Since its launch, the HCA has progressed rapidly. In fact, it has already reached an important milestone with the recent publication in the journal Science of four studies that, together, comprise the first multi-tissue drafts of the human cell atlas. This draft, based on analyses of millions of cells, defines more than 500 different cell types in more than 30 human tissues. A second draft, with even finer definition, is already in the works.

Making the HCA possible are recent technological advances in RNA sequencing. RNA sequencing is a topic that’s been mentioned frequently on this blog in a range of research areas, from neuroscience to skin rashes. Researchers use it to detect and analyze all the messenger RNA (mRNA) molecules in a biological sample, in this case individual human cells from a wide range of tissues, organs, and individuals who voluntarily donated their tissues.

By quantifying these RNA messages, researchers can capture the thousands of genes that any given cell actively expresses at any one time. These precise gene expression profiles can be used to catalogue cells from throughout the body and understand the important similarities and differences among them.
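
One simple way to compare two expression profiles is to treat each as a vector of gene counts and measure the angle between them. The sketch below is illustrative only: the four-gene profiles are invented, and cosine similarity is an assumed stand-in rather than the consortium’s actual method.

```python
import math


def cosine_similarity(profile_a, profile_b):
    """Cosine similarity between two equal-length expression vectors."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same genes")
    dot = sum(a * b for a, b in zip(profile_a, profile_b))
    norm_a = math.sqrt(sum(a * a for a in profile_a))
    norm_b = math.sqrt(sum(b * b for b in profile_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero profile is treated as unlike anything
    return dot / (norm_a * norm_b)


# Hypothetical mRNA counts for the same four genes in three cells.
cell_1 = [5, 0, 3, 0]
cell_2 = [4, 0, 2, 1]
cell_3 = [0, 6, 0, 2]

print(cosine_similarity(cell_1, cell_2))  # similar profiles score near 1
print(cosine_similarity(cell_1, cell_3))  # no shared active genes
```

Cells whose profiles score close to 1 would be candidates for the same cell type, while low scores point to distinct molecular identities.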

In one of the published studies, funded in part by the NIH, a team co-led by Aviv Regev, a founding co-chair of the consortium at the Broad Institute of MIT and Harvard, Cambridge, MA, established a framework for multi-tissue human cell atlases [1]. (Regev is now on leave from the Broad Institute and MIT and has recently moved to Genentech Research and Early Development, South San Francisco, CA.)

Among its many advances, Regev’s team optimized single-cell RNA sequencing for use on cell nuclei isolated from frozen tissue. This technological advance paved the way for single-cell analyses of the vast numbers of samples that are stored in research collections and freezers all around the world.

Using their new pipeline, Regev and team built an atlas including more than 200,000 single-cell RNA sequence profiles from eight tissue types collected from 16 individuals. These samples were archived earlier by NIH’s Genotype-Tissue Expression (GTEx) project. The team’s data revealed unexpected differences among cell types but surprising similarities, too.

For example, they found that genetic profiles seen in muscle cells were also present in connective tissue cells in the lungs. Using novel machine learning approaches to help make sense of their data, they’ve linked the cells in their atlases with thousands of genetic diseases and traits to identify cell types and genetic profiles that may contribute to a wide range of human conditions.

By cross-referencing 6,000 genes previously implicated in causing specific genetic disorders with their single-cell genetic profiles, they identified new cell types that may play unexpected roles. For instance, they found some non-muscle cells that may play a role in muscular dystrophy, a group of conditions in which muscles progressively weaken. More research will be needed to make sense of these fascinating and potentially vital discoveries.
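
The cross-referencing idea can be pictured as a set intersection between a list of disease genes and the genes that mark each cell type. This is a rough sketch under stated assumptions: the gene names are placeholders, the marker lists are invented, and the real analysis is certainly more involved.

```python
def cell_types_expressing(disease_genes, markers_by_cell_type):
    """Map each cell type to the disease genes found among its marker genes."""
    hits = {}
    for cell_type, markers in markers_by_cell_type.items():
        shared = set(disease_genes) & set(markers)
        if shared:
            hits[cell_type] = sorted(shared)
    return hits


# Placeholder gene names; these are not real gene symbols.
disease_genes = {"GENE_A", "GENE_B"}
markers_by_cell_type = {
    "muscle cell": {"GENE_A", "GENE_C"},
    "fibroblast": {"GENE_B", "GENE_D"},
    "macrophage": {"GENE_E"},
}

print(cell_types_expressing(disease_genes, markers_by_cell_type))
```

A cell type that turns up here but was not previously linked to the disorder would be a lead worth following up in the lab.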

The team also compared genes that are more active in specific cell types to genes with previously identified links to more complex conditions. Again, their data surprised them. They identified new cell types that may play a role in conditions such as heart disease and inflammatory bowel disease.

Two of the other papers, one of which was funded in part by NIH, explored the immune system, especially the similarities and differences among immune cells that reside in specific tissues, such as scavenging macrophages [2,3]. This is a critical area of study. Most of our understanding of the immune system comes from immune cells that circulate in the bloodstream, not these resident macrophages and other immune cells.

These immune cell atlases, which are still first drafts, already provide an invaluable resource toward designing new treatments to bolster immune responses, such as vaccines and anti-cancer treatments. They also may have implications for understanding what goes wrong in various autoimmune conditions.

Scientists have been working for more than 150 years to characterize the trillions of cells in our bodies. Thanks to this timely effort and its advances in describing and cataloguing cell types, we now have a much better foundation for understanding these fundamental units of the human body.

But the latest data are just the tip of the iceberg, with vast flows of biological information from throughout the human body sure to be released in the years ahead. And while consortium members continue making history, their hard work to date is freely available to the scientific community to explore critical biological questions with far-reaching implications for human health and disease.

References:

[1] Single-nucleus cross-tissue molecular reference maps toward understanding disease gene function. Eraslan G, Drokhlyansky E, Anand S, Fiskin E, Subramanian A, Segrè AV, Aguet F, Rozenblatt-Rosen O, Ardlie KG, Regev A, et al. Science. 2022 May 13;376(6594):eabl4290.

[2] Cross-tissue immune cell analysis reveals tissue-specific features in humans. Domínguez Conde C, Xu C, Jarvis LB, Rainbow DB, Farber DL, Saeb-Parsy K, Jones JL, Teichmann SA, et al. Science. 2022 May 13;376(6594):eabl5197.

[3] Mapping the developing human immune system across organs. Suo C, Dann E, Goh I, Jardine L, Marioni JC, Clatworthy MR, Haniffa M, Teichmann SA, et al. Science. 2022 May 12:eabo0510.

Links:

Ribonucleic acid (RNA) (National Human Genome Research Institute/NIH)

Studying Cells (National Institute of General Medical Sciences/NIH)

Human Cell Atlas

Regev Lab (Broad Institute of MIT and Harvard, Cambridge, MA)

NIH Support: Common Fund; National Cancer Institute; National Human Genome Research Institute; National Heart, Lung, and Blood Institute; National Institute on Drug Abuse; National Institute of Mental Health; National Institute on Aging; National Institute of Allergy and Infectious Diseases; National Institute of Neurological Disorders and Stroke; National Eye Institute


Artificial Intelligence Accurately Predicts RNA Structures, Too

A mechanical claw grabs molecular models
Credit: Camille L.L. Townshend

Researchers recently showed that a computer could “learn” from many examples of protein folding to predict the 3D structure of proteins with great speed and precision. Now a recent study in the journal Science shows that a computer also can predict the 3D shapes of RNA molecules [1]. This includes the mRNA that codes for proteins and the non-coding RNA that performs a range of cellular functions.

This work marks an important basic science advance. RNA therapeutics, from COVID-19 vaccines to cancer drugs, have already benefited millions of people and will help many more in the future. Now, the ability to predict RNA shapes quickly and accurately on a computer will help to accelerate understanding of these critical molecules and expand their healthcare uses.

As with proteins, the shapes of single-stranded RNA molecules are important for their ability to function properly inside cells. Yet far less is known about these RNA structures and the rules that determine their precise shapes. The RNA elements (bases) can form internal hydrogen-bonded pairs, but the number of possible combinations of pairings is almost astronomical for any RNA molecule with more than a few dozen bases.

In hopes of moving the field forward, a team led by Stephan Eismann and Raphael Townshend in the lab of Ron Dror, Stanford University, Palo Alto, CA, looked to a machine learning approach known as deep learning. It is inspired by how our own brain’s neural networks process information, learning to focus on some details but not others.

In deep learning, computers look for patterns in data. As they begin to “see” complex relationships, some connections in the network are strengthened while others are weakened.

One of the things that makes deep learning so powerful is that it doesn’t rely on any preconceived notions. It also can pick up on important features and patterns that humans can’t possibly detect. But, as successful as this approach has been in solving many different kinds of problems, it has primarily been applied to areas of biology, such as protein folding, in which lots of data were available for researchers to train the computers.

That’s not the case with RNA molecules. To work around this problem, Dror’s team designed a neural network they call ARES. (No, it’s not the Greek god of war. It’s short for Atomic Rotationally Equivariant Scorer.)

To start, the researchers trained ARES on just 18 small RNA molecules for which structures had been experimentally determined. They gave ARES these structural models specified only by their atomic structure and chemical elements.

The next test was to see if ARES could determine from this small training set the best structural model for RNA sequences it had never seen before. The researchers put it to the test with RNA molecules whose structures had been determined more recently.

ARES, however, doesn’t come up with the structures itself. Instead, the researchers give ARES a sequence and at least 1,500 possible 3D structures it might take, all generated using another computer program. Based on patterns in the training set, ARES scores each of the possible structures to find the one it predicts is closest to the actual structure. Remarkably, it does this without being provided any prior information about features important for determining RNA shapes, such as nucleotides, steric constraints, and hydrogen bonds.
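
The score-and-select step described above can be sketched in a few lines. This is only an illustration: `toy_scorer` is a made-up placeholder standing in for a trained network such as ARES, the two-atom “structures” are invented, and the convention that a lower score means a better model is an assumption.

```python
import math


def pick_best(candidates, scorer):
    """Score every candidate structure and return (index, score) of the
    one with the lowest score, where lower is assumed to mean closer to
    the true structure."""
    scored = [(scorer(coords), index) for index, coords in enumerate(candidates)]
    best_score, best_index = min(scored)
    return best_index, best_score


def toy_scorer(coords):
    """Placeholder scorer, NOT the ARES network: mean distance of the
    atoms from their centroid."""
    n = len(coords)
    center = [sum(axis) / n for axis in zip(*coords)]
    return sum(math.dist(atom, center) for atom in coords) / n


# Three hypothetical candidate structures, each a list of (x, y, z) atoms.
candidates = [
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
]

print(pick_best(candidates, toy_scorer))
```

In the study, the candidate list would hold at least 1,500 structures per sequence, and the scorer would be the network trained on the 18 known RNA structures.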

It turns out that ARES consistently outperforms humans and all other previous methods to produce the best results. In fact, it outperformed at least nine other methods to come out on top in a community-wide RNA-puzzles contest. It also can make predictions about RNA molecules that are significantly larger and more complex than those upon which it was trained.

The success of ARES and this deep learning approach will help to elucidate RNA molecules with potentially important implications for health and disease. It’s another compelling example of how deep learning promises to solve many other problems in structural biology, chemistry, and the material sciences when—at the outset—very little is known.

Reference:

[1] Geometric deep learning of RNA structure. Townshend RJL, Eismann S, Watkins AM, Rangan R, Karelina M, Das R, Dror RO. Science. 2021 Aug 27;373(6558):1047-1051.

Links:

Structural Biology (National Institute of General Medical Sciences/NIH)

The Structures of Life (National Institute of General Medical Sciences/NIH)

RNA Biology (NIH)

RNA Puzzles

Dror Lab (Stanford University, Palo Alto, CA)

NIH Support: National Cancer Institute; National Institute of General Medical Sciences


First Molecular Profiles of Severe COVID-19 Infections

COVID-19 Severity Test
Credit: NIH

To ensure that people with coronavirus disease 2019 (COVID-19) get the care they need, it would help if a simple blood test could predict early on which patients are most likely to progress to severe and life-threatening illness—and which are more likely to recover without much need for medical intervention. Now, researchers have provided some of the first evidence that such a test might be possible.

This tantalizing possibility comes from a study reported recently in the journal Cell. In this study, researchers took blood samples from people with mild to severe COVID-19 and analyzed them for nearly 2,000 proteins and metabolites [1]. Their detailed analyses turned up hundreds of molecular changes in blood that differentiated milder COVID-19 symptoms from more severe illness. What’s more, they found that they could train a computer to use the most informative of the proteins and predict the disease severity with a high degree of accuracy.

The findings come from the lab of Tiannan Guo, Westlake University, Zhejiang Province, China. His team recognized that, while we’ve learned a lot about the clinical symptoms of COVID-19 and the spread of the illness around the world, much less is known about the condition’s underlying molecular features. It also remains mysterious what distinguishes the 80 percent of symptomatic infected people who recover with little to no need for medical care from the other 20 percent, who suffer from much more serious illness, including respiratory distress requiring oxygen or even more significant medical interventions.

In search of clues, Guo and colleagues analyzed hundreds of molecular changes in blood samples collected from 53 healthy people and 46 people with COVID-19, including 21 with severe disease involving respiratory distress and decreased blood-oxygen levels. Their studies turned up more than 470 proteins and metabolites that differed in people with COVID-19 compared to healthy people. Of those, levels of about 300 were associated with disease severity.

Further analysis revealed that the majority of proteins and metabolites on the list are associated with the suppression or dysregulation of one of three biological processes. Two processes are related to the immune system, including early immune responses and the function of particular scavenging immune cells called macrophages. The third relates to the function of platelets, which are sticky, disc-shaped cell fragments that play an essential role in blood clotting. Such biological insights might help pave the way for potentially effective new ways to treat COVID-19 down the road.

Next, the researchers turned to “machine learning” to explore the possibility that such molecular changes also might be used to predict mild versus severe COVID-19. Machine learning involves the use of computers to discern patterns, or molecular signatures, in large data sets that a human being couldn’t readily pick out. In this case, the question was whether the computer could “learn” to tell the difference between mild and severe COVID-19 based on molecular data alone.

Their analyses showed that a computer, once trained, could differentiate mild and severe COVID-19 based on just 22 proteins and 7 metabolites. Their model correctly classified all but one person in the original training set, for an accuracy of about 94 percent. And importantly, in further prospective validation tests, they confirmed that this model accurately identified mild versus severe COVID-19 in most cases.
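
To make the idea of “training” a computer on 29 molecular measurements concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic random numbers, the group sizes are arbitrary, and a simple nearest-centroid rule is swapped in for whatever model the researchers actually used.

```python
import random


def centroid(rows):
    """Average each feature across a group of patients."""
    return [sum(column) / len(rows) for column in zip(*rows)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def train(samples, labels):
    """Learn one centroid (average molecular profile) per severity label."""
    groups = {}
    for row, label in zip(samples, labels):
        groups.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in groups.items()}


def predict(model, row):
    """Assign the label whose centroid lies closest to this patient."""
    return min(model, key=lambda label: squared_distance(model[label], row))


N_FEATURES = 29  # 22 proteins + 7 metabolites, as in the post
rng = random.Random(0)


def fake_patient(mean):
    # Synthetic measurements, not real patient data.
    return [rng.gauss(mean, 1.0) for _ in range(N_FEATURES)]


samples = [fake_patient(0.0) for _ in range(15)] + [fake_patient(3.0) for _ in range(15)]
labels = ["mild"] * 15 + ["severe"] * 15

model = train(samples, labels)
correct = sum(predict(model, row) == label for row, label in zip(samples, labels))
accuracy = correct / len(samples)
print(accuracy)
```

Accuracy on the training set, as computed here, tends to be optimistic, which is why the validation on additional patients mentioned above matters.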

While these findings are certainly encouraging, there’s much more work to do. It will be important to explore these molecular signatures in many more people. It also will be critical to find out how early in the course of the disease such telltale signatures arise. While we await those answers, I find encouragement in all that we’re learning—and will continue to learn—about COVID-19 each day.

Reference:

[1] Proteomic and metabolomic characterization of COVID-19 patient sera. Shen B et al. Cell. 28 May 2020. [Epub ahead of publication]

Links:

Coronavirus (COVID-19) (NIH)

Blood Tests (National Heart, Lung, and Blood Institute/NIH)

Tiannan Guo Lab (Westlake University, Zhejiang Province, China)


Whole-Genome Sequencing Plus AI Yields Same-Day Genetic Diagnoses

Sebastiana
Caption: Rapid whole-genome sequencing helped doctors diagnose Sebastiana Manuel with Ohtahara syndrome, a neurological condition that causes seizures. Her data are now being used as part of an effort to speed the diagnosis of other children born with unexplained illnesses. Credits: Getty Images (left); Jenny Siegwart (right).



Back in April 2003, when the international Human Genome Project successfully completed the first reference sequence of the human DNA blueprint, we were thrilled to have achieved that feat in just 13 years. Sure, the U.S. contribution to that first human reference sequence cost an estimated $400 million, but we knew (or at least we hoped) that the costs would come down quickly, and the speed would accelerate. How far we’ve come since then! A new study shows that whole genome sequencing—combined with artificial intelligence (AI)—can now be used to diagnose genetic diseases in seriously ill babies in less than 24 hours.

Take a moment to absorb this. I would submit that there is no other technology in the history of planet Earth that has experienced this degree of progress in speed and affordability. And, at the same time, DNA sequence technology has achieved spectacularly high levels of accuracy. The time-honored adage that you can only get two out of three for “faster, better, and cheaper” has been broken—all three have been dramatically enhanced by the advances of the last 16 years.

Rapid diagnosis is critical for infants born with mysterious conditions because it enables them to receive potentially life-saving interventions as soon as possible after birth. In a study in Science Translational Medicine, NIH-funded researchers describe development of a highly automated, genome-sequencing pipeline that’s capable of routinely delivering a diagnosis to anxious parents and health-care professionals dramatically earlier than typically has been possible [1].

While the cost of rapid DNA sequencing continues to fall, challenges remain in utilizing this valuable tool to make quick diagnostic decisions. In most clinical settings, the wait for whole-genome sequencing results still runs more than two weeks. Attempts to obtain faster results also have been labor intensive, requiring dedicated teams of experts to sift through the data, one sample at a time.

In the new study, a research team led by Stephen Kingsmore, Rady Children’s Institute for Genomic Medicine, San Diego, CA, describes a streamlined approach that accelerates every step in the process, making it possible to obtain whole-genome test results in a median time of about 20 hours and with much less manual labor. They propose that the system could deliver answers for 30 patients per week using a single genome sequencing instrument.

Here’s how it works: Instead of manually preparing blood samples, Kingsmore’s team used special microbeads to isolate DNA much more rapidly with very little labor. The approach reduced the time for sample preparation from 10 hours to less than three. Then, using a state-of-the-art DNA sequencer, they sequenced those samples to obtain good quality whole genome data in just 15.5 hours.

The next potentially time-consuming challenge is making sense of all that data. To speed up the analysis, Kingsmore’s team took advantage of a machine-learning system called MOON. The automated platform sifts through all the data using artificial intelligence to search for potentially disease-causing variants.

The researchers paired MOON with a clinical language processing system, which allowed them to extract relevant information from the child’s electronic health records within seconds. Teaming that patient-specific information with data on more than 13,000 known genetic diseases in the scientific literature, the machine-learning system could pick out a likely disease-causing mutation from among 4.5 million potential variants in an impressive 5 minutes or less!
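
The general idea of matching a patient’s symptoms against known disease descriptions to prioritize variants can be sketched as follows. This is not how MOON works internally, which the post does not describe; the gene names, variant IDs, and the Jaccard overlap score are all hypothetical choices made for illustration.

```python
def jaccard(a, b):
    """Overlap between two sets of clinical terms, from 0 to 1."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_variants(patient_terms, variants, disease_terms_by_gene):
    """Order variants by how well the disease linked to each variant's
    gene matches the terms extracted from the patient's record."""
    scored = []
    for variant in variants:
        terms = disease_terms_by_gene.get(variant["gene"], set())
        scored.append((jaccard(patient_terms, terms), variant["id"]))
    # Highest score first; ties broken by variant ID for stable output.
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Hypothetical inputs; the genes and variant IDs are placeholders.
patient_terms = {"seizures", "developmental delay"}
disease_terms_by_gene = {
    "GENE_A": {"seizures", "developmental delay", "hypotonia"},
    "GENE_B": {"hearing loss"},
}
variants = [
    {"id": "var1", "gene": "GENE_B"},
    {"id": "var2", "gene": "GENE_A"},
    {"id": "var3", "gene": "GENE_C"},
]

ranked = rank_variants(patient_terms, variants, disease_terms_by_gene)
print(ranked)
```

A real system would start from millions of variants and filter on many other criteria as well, but the ranking step gives a sense of why pulling symptoms from the health record narrows the search so quickly.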

To put the system to the test, the researchers first evaluated its ability to reach a correct diagnosis in a sample of 101 children with 105 previously diagnosed genetic diseases. In nearly every case, the automated diagnosis matched the opinions reached previously via the more lengthy and laborious manual interpretation of experts.

Next, the researchers tested the automated system in assisting diagnosis of seven seriously ill infants in the intensive care unit, and three previously diagnosed infants. They showed that their automated system could reach a diagnosis in less than 20 hours. That’s compared to the fastest manual approach, which typically took about 48 hours. The automated system also required about 90 percent less manpower.

The system nailed a rapid diagnosis for 3 of 7 infants without returning any false-positive results. Those diagnoses were made with an average time savings of more than 22 hours. In each case, the early diagnosis immediately influenced the treatment those children received. That’s key given that, for young children suffering from serious and unexplained symptoms such as seizures, metabolic abnormalities, or immunodeficiencies, time is of the essence.

Of course, artificial intelligence may never replace doctors and other healthcare providers. Kingsmore notes that 106 years after the invention of the autopilot, two pilots are still required to fly a commercial aircraft. Likewise, health care decisions based on genome interpretation also will continue to require the expertise of skilled physicians.

Still, such a rapid automated system will prove incredibly useful. For instance, this system can provide immediate provisional diagnosis, allowing the experts to focus their attention on more difficult unsolved cases or other needs. It may also prove useful in re-evaluating the evidence in the many cases in which manual interpretation by experts fails to provide an answer.

The automated system may also be useful for periodically reanalyzing data in the many cases that remain unsolved. Keeping up with such reanalysis is a particular challenge considering that researchers continue to discover hundreds of disease-associated genes and thousands of variants each and every year. The hope is that in the years ahead, the combination of whole genome sequencing, artificial intelligence, and expert care will make all the difference in the lives of many more seriously ill babies and their families.

Reference:

[1] Diagnosis of genetic diseases in seriously ill children by rapid whole-genome sequencing and automated phenotyping and interpretation. Clark MM, Hildreth A, Batalov S, Ding Y, Chowdhury S, Watkins K, Ellsworth K, Camp B, Kint CI, Yacoubian C, Farnaes L, Bainbridge MN, Beebe C, Braun JJA, Bray M, Carroll J, Cakici JA, Caylor SA, Clarke C, Creed MP, Friedman J, Frith A, Gain R, Gaughran M, George S, Gilmer S, Gleeson J, Gore J, Grunenwald H, Hovey RL, Janes ML, Lin K, McDonagh PD, McBride K, Mulrooney P, Nahas S, Oh D, Oriol A, Puckett L, Rady Z, Reese MG, Ryu J, Salz L, Sanford E, Stewart L, Sweeney N, Tokita M, Van Der Kraan L, White S, Wigby K, Williams B, Wong T, Wright MS, Yamada C, Schols P, Reynders J, Hall K, Dimmock D, Veeraraghavan N, Defay T, Kingsmore SF. Sci Transl Med. 2019 Apr 24;11(489).

Links:

DNA Sequencing Fact Sheet (National Human Genome Research Institute/NIH)

Genomics and Medicine (NHGRI/NIH)

Genetic and Rare Disease Information Center (National Center for Advancing Translational Sciences/NIH)

Stephen Kingsmore (Rady Children’s Institute for Genomic Medicine, San Diego, CA)

NIH Support: National Institute of Child Health and Human Development; National Human Genome Research Institute; National Center for Advancing Translational Sciences

