Creative Minds: Using Machine Learning to Understand Genome Function
Posted by Dr. Francis Collins
Science has always fascinated Anshul Kundaje, whether it was biology, physics, or chemistry. When he left his home country of India to pursue graduate studies in electrical engineering at Columbia University, New York, his plan was to focus on telecommunications and computer networks. But a course in computational genomics during his first semester showed him he could follow his interest in computing without giving up his love for biology.
Now an assistant professor of genetics and computer science at Stanford University, Palo Alto, CA, Kundaje has received a 2016 NIH Director’s New Innovator Award to explore not just how the human genome sequence encodes function, but also why it functions in the way that it does. Kundaje even envisions a time when it might be possible to use sophisticated computational approaches to predict the genomic basis of many human diseases.
Kundaje began studying genome function as a graduate student. He first focused on how gene activity is regulated in laboratory yeast before moving on to humans. After earning his Ph.D., he worked on the NIH-funded ENCODE Consortium, a collaborative effort to build a comprehensive parts list of functional elements in the human genome.
When that project was completed, he continued to pursue his interest in genome function through the NIH-funded Roadmap Epigenomics Project. Epigenomics refers to the study of chemical modifications to DNA and to the proteins that package DNA. These modifications can help explain how the cells making up different tissues (for instance, the brain, pancreas, and liver) can function so differently despite having the same DNA sequence.
Epigenomics also has important implications for the understanding of genome function in the context of human disease. As Kundaje has pointed out, researchers have identified long lists of genetic variations with links to one disease or another. But it’s often unclear exactly where in the body those genetic spelling differences are having their detrimental effects.
Sometimes, the answers can be unanticipated. For example, you might expect that mutations tied to Alzheimer’s disease would influence the activity of genes in the brain’s neurons. But in a study of epigenomic signals in the brains of mice and humans, Kundaje and his colleagues recently came to a different conclusion. According to their analysis, variants associated with Alzheimer’s disease that are found in the noncoding DNA (portions of the genome that don’t encode proteins) appear to alter the function of immune cells that are responsible for removing potentially damaging plaques from the brain. Their analysis revealed that Alzheimer’s disease has a strong, and potentially causal, immune component.
As another example of the disease implications of his work, Kundaje points to the gene MYC. It encodes a transcription factor with known links to cancer. But, Kundaje asks, why does the MYC protein bind certain DNA sequences in some cell types and not others? And what would happen if MYC were mutated in a particular way? Kundaje hopes one day to be able to answer such questions.
To get there, Kundaje will apply his computational skills through an approach known as “machine learning.” As I’ve highlighted previously on the blog, machine learning is a powerful way to discern patterns in large data sets that may not be immediately apparent. For instance, machine learning has recently been used to identify patterns in brain scans that accurately predict whether a child will go on to develop autism.
Kundaje is using an even more sophisticated version of machine learning that is inspired by neural networks of the brain. As he explains it, each individual neuron of the brain is attuned to a simple pattern. It’s only when you combine many thousands of neurons together that it becomes possible to recognize complex patterns that differentiate, say, a cat from a dog or a rock song from a lullaby. Consequently, he is developing algorithms involving artificial neural networks to unravel complex patterns in DNA sequences associated with genome function.
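To make the idea of simple detectors combining into a complex one a bit more concrete, here is a minimal Python sketch. It is an illustration only, not Kundaje's model: the weights are set by hand rather than learned from data, the function names are hypothetical, and the two motifs (the E-box sequence CACGTG, which MYC is known to recognize, and a TATA-like sequence) were chosen simply as familiar examples.

```python
# Illustrative sketch only: hand-set weights, not a trained neural network.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 vectors (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def scan(encoded, filt):
    """Slide a filter along the encoded sequence; return one score per window."""
    width = len(filt)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][j] * filt[i][j]
            for i in range(width)
            for j in range(4)
        )
        scores.append(score)
    return scores

def detector(seq, motif):
    """First-layer unit: outputs 1.0 only if the motif occurs exactly in seq."""
    filt = one_hot(motif)  # hypothetical filter: +1 for each matching base
    best = max(scan(one_hot(seq), filt), default=0.0)
    # Thresholding (ReLU with a bias) keeps only perfect matches.
    return max(0.0, best - (len(motif) - 1))

def combined_unit(seq, motif_a, motif_b):
    """Second-layer unit: fires only when both simple detectors fire."""
    return max(0.0, detector(seq, motif_a) + detector(seq, motif_b) - 1.0)

if __name__ == "__main__":
    both = "GGCACGTGTTTATAAAGG"   # contains CACGTG and TATAAA
    one = "GGCACGTGTTGGGGGGGG"    # contains CACGTG only
    print(combined_unit(both, "CACGTG", "TATAAA"))
    print(combined_unit(one, "CACGTG", "TATAAA"))
```

Each detector responds to one short pattern, while the combined unit responds only to the pair. A real network would stack many thousands of such units and learn their weights from experimental data.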
While complex machine learning models are often used as “black box” predictors, Kundaje also plans to look inside the “box.” His goal is to make sense of the complex patterns learned by the models and the rules that govern them.
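One simple way to peek inside such a box is to mutate each DNA letter in turn and see how much the model's prediction changes. The Python sketch below illustrates that idea, often called in silico mutagenesis. Please note that it is a stand-in chosen for its simplicity: it is not the method described in the paper listed below on propagating activation differences, and the `toy_model` function is a hypothetical placeholder for a trained network.

```python
# Illustrative sketch only: a toy scoring function stands in for a trained model.
BASES = "ACGT"

def toy_model(seq, motif="CACGTG"):
    """Hypothetical 'model': best number of bases matching the motif in any window."""
    width = len(motif)
    best = 0
    for start in range(len(seq) - width + 1):
        matches = sum(1 for i in range(width) if seq[start + i] == motif[i])
        best = max(best, matches)
    return best

def in_silico_mutagenesis(seq, model):
    """For each position, report the largest drop in score caused by any single-base change."""
    baseline = model(seq)
    importance = []
    for pos, original in enumerate(seq):
        mutated_scores = [
            model(seq[:pos] + base + seq[pos + 1:])
            for base in BASES
            if base != original
        ]
        importance.append(baseline - min(mutated_scores))
    return importance

if __name__ == "__main__":
    sequence = "TTCACGTGTT"
    for base, score in zip(sequence, in_silico_mutagenesis(sequence, toy_model)):
        print(base, score)
```

Positions where a mutation lowers the score are the ones the model relies on. In this toy case they are exactly the six letters of the motif, while the flanking letters score zero.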
Such an approach promises to yield many new hypotheses that could be experimentally tested in the lab, perhaps leading to new ways of understanding health and disease. While there’s still a long way to go, Kundaje says he’s already working with collaborators to explore his ideas in the context of heart disease and colorectal cancer. So, we all should be grateful that this highly creative mind has found a way to team his love of computers with his love of biology!
An integrated encyclopedia of DNA elements in the human genome. ENCODE Project Consortium, Dunham I, Kundaje A, et al. Nature. 2012 Sep 6;489(7414):57-74.
Integrative analysis of 111 reference human epigenomes. Roadmap Epigenomics Consortium, Kundaje A, et al. Nature. 2015 Feb 19;518(7539):317-330.
Conserved epigenomic signals in mice and humans reveal immune basis of Alzheimer’s disease. Gjoneska E, Pfenning AR, Mathys H, Quon G, Kundaje A, Tsai LH, Kellis M. Nature. 2015 Feb 19;518(7539):365-369.
Learning important features through propagating activation differences. Shrikumar A, Greenside PG, Kundaje A. Proceedings of the 34th International Conference on Machine Learning, PMLR 70:3145-3153, 2017.
ENCODE: Encyclopedia of DNA Elements (ENCODE Project Consortium)
Epigenomics (National Human Genome Research Institute/NIH)
Anshul Kundaje (Stanford University, Palo Alto, CA)
Kundaje NIH Project Information (NIH RePORTER)
NIH Director’s New Innovator Award (Common Fund)
NIH Support: Common Fund; National Human Genome Research Institute