Hidden Markov models (HMMs) and related fields. HMM is a subclass of latent variable models, where it is assumed that observed data exhibit unobserved patterns or structures that can be described by latent variables. For example, magnetic resonance tomography measurements for a human body depend on tissue classes that cannot be directly observed. HMMs are widely used for modeling for example in genetics, bioinformatics, signal processing, and image analysis. The aim is to study different theoretical problems related to HMMs and their generalizations as well as applications in real data. One of the main topics of the research group is segmentation – predicting or estimating the latent variable path based on observations. Besides classical segmentation algorithms such as Viterbi or forward-backward ones, the group also studies the alternative approaches (so-called hybrid algorithms). Theoretical research is largely focused on the asymptotic properties of the Viterbi path and several limit theorems. The research is partially international, the main foreign partner is A. Koloydenko (RHUL). Besides classical HMMs, the group also studies several modern generalizations as pairwise and triplet Markov models. Motivated largely by applications, an important research topic of the group is the segmentation in the Bayesian setup. In the Bayesian context, the classical segmentation algorithms are not applicable anymore and so the new methods and algorithms specific for the Bayesian setup need to be worked out.
Random sequence comparison. In many research areas, the central object of study is long (but finite length) sequences from (relatively small) finite alphabet: DNA sequences, protein alignments, written texts, binary codes. One of the central questions is how to measure the similarity (or homology in genetics) of two sequences. There are several methods for that, mostly it is done via so-called similarity score, the most popular score is the length of the longest common subsequence - the longer is the longest common subsequence, the more related are the sequences deemed to be. To incorporate the variety, the sequences are modeled as random processes (often iid) and not-related sequences are taken as an independent. Now the score is a random variable as well and to distinguish the independent sequences from dependent ones, one has to know the (asymptotic) behavior of the score. Unfortunately, analyzing the score is a notoriously hard mathematical problem and since the limit laws are absent, one starts with simpler yet important questions like what is the speed growth of expected score, variation, and other moments. The standard tools of probability theory allow rather easily find some upper bounds to the moments, but lower bounds need a different approach. Together with international partners from GeorgiaTech (mainly H. Matzinger), a general method for lower bounds is worked out. This method requires neither the sequences being independent nor consisting of independent entries (as it is in the case of many standard tools for upper bounds), hence also applicable for pairwise Markov models (as described in the previous section) linking the two research areas of the group. Besides lower bounds, several other questions like path structure of longest common subsequences, the influence of score parameters, suboptimal alignments, etc are studied.
Maximum spacing method and model validation. In statistical modeling, it is essential to find a model that is in some sense most suitable for describing reality. Unlike information criteria such as Akaike or Bayes, the maximum spacing method (MSP) enables together with parameter estimation to decide about the suitability of the assigned model class for given data. The purpose is to study the properties of MSP estimators and the possibilities of the method as a model validation tool for model classes of different complexity.
Moran-type evolution models. The evolution model describes the dynamics of the genetic structure of the population as time evolves. These dynamics are usually driven by two components: selection and mutation. There are a large variety of evolution models, some of them model the evolution time in generations (e.g. Wright-Fisher model), some of them in terms of individuals – a new individual with a certain genotype is born and, to keep the population size fixed, another individual with possibly but not necessarily different genotype is discarded. Typically the births and deaths are modeled as being random and so the models become stochastic. To every genotype is attached a number – fitness – that describes how the given type influences the probability of birth or death of the individual. More fit individuals have smaller probabilities to die. Or, when fitness influences the birth, then the more fit is a type, the bigger is the probability that a newborn individual is of that type. The classical individual-level model is the so-called Moran model. Since the genetic structure of the population after birth/death event depends on the structure before the event and additional randomness, the genetic structure of the population is a (discrete-time) Markov process. The properties of the process, its stationary distribution, reversibility, limit laws as the population size increases – these are the research questions of the group. Evolution models are a rather new field for the group. The main collaborators are C. Watkins (RHUL) and F. Zucca, D. Bertacchi (Milano). The group investigates already existing models (e.g. so-called Guiol-Machado-Schinazi model) as well as proposes new models.
Research Group staff
Research Group projects
Main past projects
Hidden Markov models (HMMs) and related fields:
Random sequence comparison:
Maximum spacing method and model validation:
Moran type evolution models: