There are 1,800 transcription factors encoded in the human genome — and Peggy Farnham wants to understand the function of each one.
Farnham, professor of biochemistry and molecular biology at the Keck School of Medicine of USC, is one of more than 440 researchers participating in the Encyclopedia of DNA Elements project (ENCODE), which is dedicated to finding all functional elements in the human genome. These basic scientists hope that by mining the human genome for gene expression, chromatin structure and transcription factor binding sites, they will help clinical researchers reach an understanding of human disease.
“It has become clear that many diseases are caused by deregulation of a critical gene,” said Farnham, who has been working on the project since 2004. “To understand how gene regulation occurs, we need to identify the transcription factor binding sites. Which genes are controlled by these sites? What signaling pathways are they involved in? The answers are in the regulatory regions, and ENCODE information combined with disease information will provide those answers.”
Thus far, the ENCODE Consortium has studied approximately 200 transcription factors, “so there’s a long way to go,” Farnham said.
The project marked the progress of Farnham and other researchers by releasing a set of more than 30 papers in journals, including Nature and Genome Biology, on Sept. 5, providing more insight into the workings of the human genome.
During the study, researchers found that more than 80 percent of the human genome sequence is linked to biological function, and they mapped more than 4 million regulatory regions where proteins interact with the DNA with exquisite specificity. These findings represent a significant advance in understanding the precise and complex controls over the expression of genetic information within a cell.
The findings bring into much sharper focus the continually active genome in which proteins routinely turn genes on and off using sites that are sometimes at great distances from the genes they regulate; where sites on a chromosome interact with each other, also sometimes at great distances; where chemical modifications of DNA influence gene expression; and where various functional forms of RNA, a form of nucleic acid related to DNA, help regulate the whole system.
Farnham’s team was among the first to adapt the ChIP (chromatin immunoprecipitation) assay to find all the places within the human genome where proteins bind, known as transcription factor binding sites. Understanding what happens when a binding site is altered by mutation or natural human variation, including the consequences and which genes are deregulated as a result of the loss, could help a clinical researcher develop a therapy that reverses the deregulation, Farnham said.
The ever-growing cache of data collected as part of ENCODE has been publicly available to clinical scientists since the early days of the project, and now that much more is available, the researchers want to make their clinical colleagues aware of the new data release, Farnham said.
“We hope this sets up a dialogue between basic and clinical scientists,” she said.
A new phase of the project will soon begin with a goal of increasing the depth of the ENCODE catalog with respect to the types of functional elements and cell types studied. New tools for more sophisticated analyses of the data will also be developed as part of this phase.
The data sets are available in several databases that can be accessed on the Internet through the ENCODE project portal, as well as at the University of California, Santa Cruz, genome browser; the National Center for Biotechnology Information; and the European Bioinformatics Institute.
The coordinated publication set includes one main integrative paper and five other papers in Nature; 18 papers in Genome Research; and six papers in Genome Biology. The three journals developed a pioneering way to present the information in an integrated form that they call “threads.”
Since the same topics were addressed in different ways in different papers, a new website enables users to follow a topic through all of the papers in the ENCODE publication set in which it appears, by clicking on the relevant “thread” at the Nature ENCODE explorer page. For example, thread No. 1 compiles figures, tables and text relevant to genetic variation and disease from several papers and displays them all on one page. ENCODE scientists believe this will illuminate many biological themes emerging from the analyses.
In addition to the “threaded papers,” six review articles are being published in The Journal of Biological Chemistry, and other affiliated papers are appearing in Science, Cell and other journals.
ENCODE, supported by the National Human Genome Research Institute, was launched as a pilot project in 2003 to develop the methods and strategies that would be needed to map the functional elements of the human genome, focusing on only 1 percent of the genome.
By 2007, the pilot had become a full-scale project in which the institute invested approximately $123 million over five years. In addition, the institute has devoted about $40 million to the ENCODE pilot project plus approximately $125 million to ENCODE-related technology development and model organism research since 2003.
Researchers across the United States, United Kingdom, Spain, Singapore and Japan performed more than 1,600 sets of experiments on 147 types of tissue using numerous technologies standardized across the consortium.
The experiments relied on innovative uses of next-generation sequencing technologies, which had only become available around the start of the ENCODE production effort five years ago, due in large part to advances enabled by the institute’s DNA sequencing technology initiative. In total, ENCODE generated more than 15 trillion bytes of raw data, and analyzing it consumed the equivalent of more than 300 years of compute time.