My senior thesis in Computational Biology took shape under the mentorship of Dr. Nicola Neretti and Dr. Sorin Istrail. The project grew out of my research in chromatin structure and organization that I had been doing for the past two years. It was a concerted push to answer a single question and was the most involved research project I’ve taken on to date. I’m going to summarize the work here, but the full text of the thesis can be found at the bottom of this page. Much of the introduction and background reading would be relevant to anyone looking to review the 3D chromatin structure reconstruction methods available as of April 2015. This research was also presented in poster form at Brown’s Theories in Action 2015 conference, and the poster is available here.
The main question I was trying to answer: given a 2D matrix representing the contact frequencies of segments along a chromosome from a Hi-C experiment, how can you infer a 3D structure that would give rise to the experimental data? Many scientists have tried to answer this question, starting with protein structure and folding long before chromatin structure information was accessible. After reviewing the literature, it was clear that there were statistical methods available to do one of two tasks:
- Construct 3D structures that could accurately explain the 2D data from a very small section of a chromosome – perhaps 100kb. The “MonteGrappa” algorithm (detailed in [1]) is the best example of this. Using a Markov Chain Monte Carlo (MCMC) approach, it reconstructs an ensemble of 3D structures that accurately explain a 2D contact map when averaged together. This method is powerful because it attempts to explain the underlying chromatin structures that are in a population of cells – no single structure could accurately model that.
- Provide information about the 3D arrangement of larger domains of a single chromosome or multiple chromosomes. In this paradigm, we’re not trying to give singular structures that can explain a chromosome – such a model would be too complex. Instead, we try to place each larger domain (larger than 100kb, for example) somewhere in 3D space and show how it interacts with the other domains. These types of models are important for visualizing 3D structures of entire chromosomes and calculating statistics on the structure, but they lack the complexity and detail at finer scales like the models in (1).
My contribution to the field was to build a statistical model that could bridge the gap between point (1) and point (2). I wanted to answer the question “how can we build a statistical model that can reconstruct an entire chromosome while maintaining the level of detail that finer-scale methods have?”
This was dangerous territory, because the model had the potential to become incredibly complex. Methods from point (1) cannot be applied to larger sections of a chromosome because there are too many points to simulate in space and computing the structures takes too long. I needed a way to leverage the benefits of both method types.
I eventually came up with a hybrid method that drew from both paradigms. The algorithm is titled “Monte Monster” because it uses MCMC techniques and tackles a monster statistical challenge! The basic idea of the method follows:
- We are going to reconstruct an ensemble of 3D structures for a single chromosome. Taken together, these structures represent the information contained in the 2D contact map provided by a Hi-C experiment.
- Divide the chromosome up into small domains. I chose to use Topologically Associated Domains (TADs, highly self-interacting regions along the chromosome) as the basic definition of chromosome domains.
- Using the MonteGrappa algorithm defined above, compute an ensemble of structures for each domain. This will capture the 3D structure within each domain, but not the interactions between domains.
- Use the MonteMonster algorithm to pick a single structure from the ensemble for each domain along the chromosome. Chose a position, rotation and scale for this structure in 3D space. Compute the probability of the structure given the 2D data.
- Iteration takes place with a MCMC sampling scheme. Each iteration can modulate one aspect of the 3D structure. A domain is chosen at random, and a move is applied to that structure. Either the position, rotation or scale of the structure is perturbed, or the structure of the domain is swapped for another member of the ensemble from step 3.
- MCMC moves are accepted according to the Metropolis criterion. Moves are always accepted if they increase the probability of the structure, and are accepted if they decrease the probability sometimes.
- Run the chain until it reaches a stationary distribution in Ns draws. Draws from the chain every Ns iterations after this point theoretically represent independent samplings of the chain, and can be used to build an ensemble of 3D structures for an entire chromosome.
This explanation is incomplete and leaves out many important details, mainly the math behind it all! It does give a main idea of the method, though. The key is that I am using pre-computed ensembles of structures to get the structure of an entire chromosome, rather than trying to simulate it all at once.
By the end of my senior year, I had started to program code to implement this method, but I was unable to finish and test it completely. I was able to show that it worked for the interaction between two neighboring domains. That is a far stretch from an entire chromosome which can have hundreds of domains. Dr. Neretti is currently looking for a new student to take over the project and move it to a practical implementation.
The full content of the thesis can be found below.
References
[1] Giorgetti, L. et al. Predictive polymer modeling reveals coupled fluctuations in chromosome conformation and transcription. Cell 157, 950–63 (2014)