Although the human genome was sequenced more than fifteen years ago, much is still unknown about how it functions. With the advent of high-throughput genomics technologies, it is now possible to measure a property across the entire genome in a single experiment, such as where a given protein binds to the DNA or which genes are expressed. However, the complexity and massive scale of these data sets (billions of base pairs with thousands of measurements each) pose challenges to their analysis. My research focuses on developing new machine learning methods that address these challenges.
I will focus on two projects. First, I will present chromatin state annotations of 164 human cell types. The ENCODE Project was founded with the goal of creating a catalog of functional elements in the human genome. To that end, ENCODE and other consortia have generated thousands of genomics assays from hundreds of human tissue and cell types. The annotations I will present were generated by Segway, a computational method that partitions and labels the genome of a given cell type based on a collection of genomics data sets. Second, I will present a method for understanding chromatin domains. The genomic domain where a gene resides (on the scale of 100k-1M base pairs) influences its regulation: the same gene with the same local regulatory elements (that is, the same promoter) may be expressed in one neighborhood but silent in another. This domain-level regulation is crucial for controlling gene expression, but is currently much less well understood than local regulation. I will present a new method for discovering and annotating genomic domains that integrates many types of genomics data sets. Unlike previous methods, this approach can incorporate information about the 3D conformation of the genome in the nucleus. Using this approach, we discovered a new type of genomic domain.