Bachelor's studies
Mathematical statistics
Pairwise Markov model (PMM), supervisor Jüri Lember.
Reference: van der Maaten, L., Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9.
Processes in living cells can seem magical at first sight: DNA is replicated faithfully at each cell division, independent signals are passed to the right recipients without crosstalk, the cell separates into two functional daughters, etc. Many of these behaviours can be described globally, ignoring the actual reactions that take place (e.g. cell divides), or locally via the reactions, without giving information about how this relates to the complex behaviour (e.g. nMP + ATP -> nMP-AMP + P2O7). The goal of this project is to help develop better intuition about these behaviours by defining them via local abstract rules, like A + B -> C, and simulating the outcome to obtain the global picture. More technically, you would implement a particle simulator that evolves according to a stochastic context-free grammar that defines the transitions. The user interface would include both an area for defining the local rules and a canvas for the global simulation.
As an example, consider the problem of DNA replication, where an exact copy of a long molecule is made. The process is relatively well understood, and could be written down as a large set of complicated differential equations. In a simplified abstract model, one could represent a single free-floating DNA base by X0 (where X could be A, C, G, or T), the base in a polymer by X1, and a paired one by X2. The greatly simplified replication process in a complicated molecular soup of A0, T0, C0, G0, etc., could then be condensed to two rules:
X1 + X0 => X2X1
Y1 + X1 => Y2X2.
There is a wide range of complex behaviours that can be reached with such simple relations. Similarly, there are many complicated biochemical models of cell behaviour that one could attempt to reduce to a small number of rules. Your work would allow people to define and explore them.
This educational product is aimed at scientists or educators who want to develop intuition about the workings of biomolecules, and more generally, at people who like exploring how complex natural phenomena can be captured by computational rules. There is an existing basic simulator (OrganicBuilder, GPL v3 license) that achieves some of the goals set above. However, it lacks a nice interface, such as touch-aided controls, as well as more lifelike behaviours, such as different molecule sizes and diffusion rates, and stochastic execution of the rules. We have the developer's permission to fork and expand on this code base, if that is more reasonable than starting from scratch.
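To make the idea of stochastic local rules concrete, here is a minimal sketch in Python of the abstract rule A + B -> C from the description above. It is an illustration under strong assumptions, not the project's design: the soup is treated as well mixed (no positions, sizes, or diffusion), the names `RULES`, `step`, and `simulate` are hypothetical, and the firing probability of 0.5 is invented. The replication rules would additionally need a representation of bonds between particles, which this sketch omits.

```python
import random
from collections import Counter

# Hypothetical rule table: an unordered pair of reactant labels maps to
# (list of product labels, probability that the rule fires on an encounter).
RULES = {
    frozenset(["A", "B"]): (["C"], 0.5),
}


def step(soup, rules, rng):
    """Let two randomly chosen particles meet; apply a matching rule, if any.

    Returns True when a rule fired and the soup was modified.
    """
    if len(soup) < 2:
        return False
    i, j = rng.sample(range(len(soup)), 2)  # two distinct particles
    key = frozenset([soup[i], soup[j]])
    if key not in rules:
        return False
    products, probability = rules[key]
    if rng.random() >= probability:
        return False  # the particles met, but the rule did not fire
    # Remove the reactants, higher index first so the lower one stays valid.
    for k in sorted((i, j), reverse=True):
        soup.pop(k)
    soup.extend(products)
    return True


def simulate(counts, rules, n_steps, seed=0):
    """Run n_steps random encounters and return the final particle counts."""
    rng = random.Random(seed)
    soup = [label for label, n in counts.items() for _ in range(n)]
    for _ in range(n_steps):
        step(soup, rules, rng)
    return Counter(soup)


if __name__ == "__main__":
    print(simulate({"A": 20, "B": 10}, RULES, n_steps=5000))
```

Whatever the random draws, each firing consumes one A and one B and produces one C, so the number of A plus C and the number of B plus C stay constant. Checks of this kind are a cheap way to validate a rule set before looking at the global picture.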
https://github.com/BertrandDechoux/OrganicBuilder
https://bertranddechoux.github.io/OrganicBuilder
We previously attacked the problem of predicting individual characteristics using genomic information in yeast [1], and found that traits can be predicted surprisingly well, with on average 91% accuracy, when using information about variation in DNA, as well as other measurements for the same individual. Importantly, close relatives greatly aided prediction. This demonstrated that there are no fundamental limitations to accurate prediction, and we are now asking if the same holds true for human health information.
The aim of this project is to predict elements of electronic health records based on all the rest of the available data on the person, including DNA sequence and phenotypes of closely related individuals. The methods used would initially follow those of [1], starting with standard linear mixed models to combine information from the genome and other traits, and expanding to random forest based methods for a more flexible model class. If desired, other types of approaches, such as deep neural networks, can be tested. The project is in collaboration with the Estonian Genome Center (Geenivaramu) and its scientists.
This data science project is well-suited for someone with experience in (or desire to acquire) machine learning or statistical modeling methods, and basic data science skills of obtaining, cleaning, and visualising data. Knowledge of genomics is beneficial.
References: 1) Kaspar Märtens, Johan Hallin, Jonas Warringer, Gianni Liti, Leopold Parts. “Predicting quantitative traits from genome and phenome with near perfect accuracy”. Nature Communications, 2016. http://www.nature.com/ncomms/2016/160510/ncomms11512/full/ncomms11512.html
The goal of this project is to explore cellular processes that span a range of timescales using simulations in state-process models. The state X_t of the cellular system at time t is described by N state variables x_{t,n}, so X_t = (x_{t,1}, ..., x_{t,N}). For example, a state could be the number of certain mRNA molecules, number of ATP molecules, number of PolII-bound promoters of a gene, or cell cycle stage. The state can have associated uncertainty, and be represented by a distribution that can be used to calculate average values and higher moments.
The state changes due to the action of M cellular processes F = (f_1, ... , f_M). Every process f_m takes in the current cellular state X_t, and calculates the change \Delta_{t,m} to all the state variables impacted by it over time dt. A process captures either a single biochemical reaction (a low-level process with a short timescale, such as transcription factor binding) or a more complicated sequence of events, such as the TCA cycle or the replication of all DNA. In a special case, the definition reduces to a standard time-difference model using a graph Laplacian (e.g. Gunawardena 2013), \frac{dX_t}{dt} = L X_t, but in general it does not require linearity in the state, and can make use of uncertainty in the variables. As a downside, analytic solutions are likely not obtainable in general. Thus, the main goal of this model is accurate and intuitive simulation.
This model formalism enables intuitive and efficient calculation. First, evolution of the model over time can be captured with a message passing algorithm, where processes and states interact by sending information about their statistics. Second, the messages can be passed at timescales that depend on local conditions: if states fluctuate a lot, and the fluctuation has large downstream consequences, the timescale can be tuned to be shorter. Finally, several processes operating at similar timescales could be combined into a single one with matching input/output characteristics, allowing simpler descriptions and natural compositions.
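The update described above can be sketched in a few lines of Python. This is a simplified illustration that rests on several assumptions not stated in the text: the changes from all processes are combined additively, X_{t+dt} = X_t + \sum_m \Delta_{t,m}; the time step dt is fixed; each state variable is a single number rather than a distribution; and the two example processes with their rate constants are invented. The names `transcription`, `degradation`, and `simulate` are hypothetical.

```python
def transcription(state, dt, rate=2.0):
    """Hypothetical process: produce mRNA at a constant rate."""
    return {"mRNA": rate * dt}


def degradation(state, dt, rate=0.1):
    """Hypothetical process: remove mRNA in proportion to its abundance."""
    return {"mRNA": -rate * state["mRNA"] * dt}


def simulate(initial_state, processes, dt, n_steps):
    """Advance the state by n_steps steps of length dt.

    Every process sees the same current state X_t and returns its change
    Delta_{t,m} for the variables it affects; the changes are then summed
    (an assumption of this sketch).
    """
    state = dict(initial_state)
    for _ in range(n_steps):
        # Compute all deltas from X_t before applying any of them.
        deltas = [process(state, dt) for process in processes]
        for delta in deltas:
            for name, change in delta.items():
                state[name] += change
    return state


if __name__ == "__main__":
    final = simulate({"mRNA": 0.0}, [transcription, degradation],
                     dt=0.01, n_steps=20000)
    print(final)
```

With these two processes the mRNA count relaxes towards the ratio of the production rate to the degradation rate. Because each process is an ordinary function of the state, a nonlinear process can be added without changing the simulation loop.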
This project is well-suited to a person with a background in mathematical modeling, elementary coding skills, and an interest in biochemistry, molecular biology, or living systems in general.
The features that determine the efficacy of Cas9 function have only been tested to some extent. For example, it is known that DNA sequence composition and its accessibility play a role [1,2]. Whether the editing results in a change for a cell further depends on the expression of the targeted gene and exon, conservation of the region across evolution, the protein domain edited, and the genetic background of the line. However, the extent of the influence of these factors remains poorly characterized for now, and it is difficult to predict whether a newly designed gRNA will perform well in a genome editing experiment.
The aim of this project is to build a predictive model of genome editing outcome, focusing on the properties of the targeted region. As a first step, the editing readouts can be modeled as in [2], and the model expanded to include the abundant additional genomic information. Alternatively, other machine learning approaches can be tested. The project will be in collaboration with the Genetic Screens of Cellular Traits group at the Wellcome Trust Sanger Institute, where new data for validating the findings can be generated.
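As an illustration of what "properties of the targeted region" could look like as model inputs, the following Python sketch turns a guide sequence into a numeric feature vector. It covers only sequence composition, which the text names as one known factor. The function names and the example sequence are hypothetical, and this is not the feature set used in [1] or [2].

```python
ALPHABET = "ACGT"


def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def one_hot(seq):
    """Position-specific encoding: four 0/1 indicators per base, in ACGT order."""
    return [1.0 if base == letter else 0.0
            for base in seq.upper()
            for letter in ALPHABET]


def guide_features(seq):
    """Feature vector: GC content followed by the one-hot encoding."""
    return [gc_content(seq)] + one_hot(seq)


if __name__ == "__main__":
    example = "GACGTTAGCA"  # made-up sequence, for illustration only
    print(guide_features(example))
```

Vectors of this kind could be extended with further columns, such as accessibility or expression of the targeted gene, before being passed to a regression or other machine learning model.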
This cutting-edge project is well suited for someone with experience in (or desire to acquire) machine learning or statistical modeling methods, and basic data science skills of obtaining, cleaning, and visualising data. Knowledge of genomics is beneficial.
References: 1) Smith, Justin D., et al. "Quantitative CRISPR interference screens in yeast identify chemical-genetic interactions and new rules for guide RNA design." Genome Biology 17.1 (2016). http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0900-9
2) Li, W. et al. “MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens” Genome Biology 15:554 (2014). https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0554-4
In the course of the research, the student is expected to establish what the modified regression estimator is and what its properties are, and to apply the estimator to compute monthly estimates for the Estonian Labour Force Survey. The R package ReGenesees can be used to compute the estimates.
Knowledge of survey sampling theory and generalised regression analysis is required.