Researchers need guidance as they untangle a massive jungle of biomedical data in their search for therapies, prevention techniques and cures to some of today’s most enigmatic diseases.
To assist them, the National Institutes of Health has awarded USC a three-year, $6.3 million grant to build Big Data U, the nation’s first Training Coordination Center aimed at teaching people with different backgrounds how to assemble astronomical amounts of data into compatible and comparable statistics. The goal is to find trends, interesting relationships and clustering effects.
“A lot of the big data we are dealing with haven’t even been collected yet,” said John Van Horn, the project’s lead investigator, associate professor of neurology and education, and director of a new Master of Science in neuroimaging and informatics at the Keck School of Medicine of USC. “It’s still off in the future. What we do now and how we train people to be able to deal with that will prepare us for the time when getting many terabytes worth of data is considered trivial — a relatively small or even ‘cute’ little study.”
A scientific revolution
Big data science has moved away from a traditional reductionist model, where a hypothesis is formed and tested by including a single variable in a controlled experiment.
Disorders such as Alzheimer’s disease, which the NIH estimates to be the third leading cause of death in the United States, involve many intricate, interacting components.
Isolating a single variable when it comes to conditions involving the brain may provide one answer, but not necessarily the complete one, said Arthur Toga, a provost professor with joint appointments at the Keck School of Medicine and the USC Viterbi School of Engineering.
“We’re letting the data lead us to the discovery. It’s kind of an upside down way of thinking about things,” he said. “Big data allows us to look at all these variables simultaneously and put together a comprehensive picture. Only in concert do they produce the function and structure that you’re trying to understand. If you study only one variable at a time, you may never fully understand how it works.”
What is a Training Coordination Center?
Big Data U, tentatively set to launch in the spring of next year, will be a hybrid of massive open online courses (MOOCs) and YouTube video tutorials. It’s a free resource for anyone who wants a self-guided or semi-structured study of topics relevant to biomedical science. Social media tools will provide ratings for course content and guide the selection of relevant training media.
“We will promote opportunities for big data research rotations, host ‘innovation labs’ for new grant proposal development, develop hackathons and other training activities,” Van Horn said. “Some of these activities will be up to the user to complete, but others will have an expectation of required completion and will entail a report or tangible product.”
Special tools need to be created because traditional ones such as Excel do not scale when astronomical collections of data points have to be crunched, Van Horn said.
The Training Coordination Center is a part of the NIH’s Big Data to Knowledge (BD2K) initiative, launched in 2012 to transform how science is done. The movement harvests biomedical big data to advance science’s understanding of human health and disease.
USC is at the forefront of biomedical big science and hopes to use it to address “wicked problems”: complex, 21st-century dilemmas such as Alzheimer’s disease or traumatic brain injury.
“The purpose of the Training Coordination Center is to coordinate training activities both among the BD2K consortium members and with others engaging in similar efforts,” said Michelle Dunn, NIH senior adviser for data science training, diversity and outreach. “The outreach aspect of the TCC is important because BD2K awardees need to be aware of other efforts, whether funded by NIH or not, in order to make best use of limited funds. In addition to coordination, the TCC will develop resources to enable others to discover educational resources needed for biomedical data science.”
BD2K has 11 Centers of Excellence for Big Data Computing, two of which are at USC: the Big Data for Discovery Science Center, with Toga as principal investigator, and the ENIGMA Consortium, with Paul Thompson as principal investigator. Stanford University, Harvard University Medical School and UCLA also host Centers of Excellence.
While each Center of Excellence has its own training responsibilities, Big Data U at USC is the only center tasked with harmonizing these efforts into a concerted action.
Big Data U, which will have a major impact on all 11 of NIH’s Centers of Excellence, will include collaborators from USC’s Information Sciences Institute and the USC School of Cinematic Arts. Participating faculty include José Luis Ambite, Kristina Lerman and Michael Taylor.
“The DC Office of Research Advancement and Steve Moldin at USC were instrumental in obtaining this grant money, as the development of a proposal such as this is enormously complicated,” Toga said. “Their effort places USC in the coveted position of creating a free, online biomedical training center.”
How Big Data U will work
Part of the Training Coordination Center project includes harvesting the Web to automatically organize online resources into an Educational Resource Discovery Index (ERuDite).
Users will create free profiles on Big Data U, which will generate a personalized set of lessons to help scientists and other learners reach their learning goals. Senior investigators such as professors, junior investigators such as postdoctoral students, and graduate and undergraduate students alike will be able to hone their fluency in fields such as genomics (the mapping of genomes) and phenomics (the measuring of physical and biochemical traits in organisms).
The Training Coordination Center will have boot camps, MOOCs, videos and one-off lessons in math, statistics, informatics, computer science and biomedical science — many of which will be created with a spectrum of learners in mind.
An intuitive and intelligent website will identify the prerequisites users need before they can graduate to more complex topics and will suggest new topics they may be interested in based on the profiles of learners like them, Van Horn said.
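For readers curious how prerequisite tracking of this kind might work, the short Python sketch below orders topics so that each one comes after the topics it depends on. It is an illustration only, not Big Data U's actual design: the topic map, the `study_plan` function and the choice of a topological sort are all assumptions made for this example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical prerequisite map (an assumption for illustration):
# each topic maps to the set of topics a learner should finish first.
PREREQUISITES = {
    "statistics": {"math"},
    "informatics": {"computer science"},
    "genomics": {"statistics", "biomedical science"},
    "phenomics": {"genomics", "informatics"},
}


def study_plan(goal, completed=frozenset()):
    """Return the topics still needed to reach `goal`, prerequisites first."""
    # Walk backward from the goal, collecting every topic it depends on
    # and skipping anything the learner has already completed.
    needed, stack = set(), [goal]
    while stack:
        topic = stack.pop()
        if topic in needed or topic in completed:
            continue
        needed.add(topic)
        stack.extend(PREREQUISITES.get(topic, ()))

    # Keep only dependencies among the needed topics, then sort them so
    # every topic appears after all of its prerequisites.
    graph = {t: PREREQUISITES.get(t, set()) & needed for t in needed}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    # A learner who already knows math and wants to reach genomics.
    print(study_plan("genomics", completed={"math"}))
```

In this sketch, a learner who has finished math and is aiming for genomics would be shown statistics and biomedical science first, with genomics last. Suggesting topics based on similar learners' profiles would need a separate recommendation step that is not shown here.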
Big Data U will also be a coordination center in that it will advertise training opportunities at any of the BD2K centers and live-stream events. It could even facilitate and partially finance big data-focused mini-projects that last for a few weeks, Van Horn said.
“Science is now a digital enterprise,” he added. “Big data sets are pretty much how science is being done. How you share and exchange that data to get as many eyes looking at these sets as possible will lead to new discoveries. It will lead to new insights into disease and hopefully help treat and cure them.”
History of commitment to big data research
In the past decade, USC has shown a commitment to informatics — the science of processing data for storage and retrieval.
In addition to having two BD2K Centers of Excellence, it also has recently advanced two new master’s programs relevant to big data biomedicine and founded the newly named USC Mark and Mary Stevens Neuroimaging and Informatics Institute, which includes the Laboratory of Neuro Imaging and the Imaging Genetics Center. Based on the Health Sciences Campus, the institute will be equipped with state-of-the-art computing systems as well as sophisticated MRI brain imaging systems.
USC’s comprehensive experience in biomedical big data, existing infrastructure and multidisciplinary teamwork mentality will aid in the success of Big Data U. The project is set to run through 2018.