Hidden Markov models (HMMs) and related fields. HMM is a subclass of latent variable models, where it is assumed that observed data exhibit unobserved patterns or structures that can be described by latent variables. For example, magnetic resonance tomography measurements for a human body depend on tissue classes that cannot be directly observed. HMMs are widely used for modeling for example in genetics, bioinformatics, signal processing, and image analysis. The aim is to study different theoretical problems related to HMMs and their generalizations as well as applications in real data. One of the main topics of the research group is segmentation – predicting or estimating the latent variable path based on observations. Besides classical segmentation algorithms such as Viterbi or forward-backward ones, the group also studies the alternative approaches (so-called hybrid algorithms). Theoretical research is largely focused on the asymptotic properties of the Viterbi path and several limit theorems. The research is partially international, the main foreign partner is A. Koloydenko (RHUL). Besides classical HMMs, the group also studies several modern generalizations as pairwise and triplet Markov models. Motivated largely by applications, an important research topic of the group is the segmentation in the Bayesian setup. In the Bayesian context, the classical segmentation algorithms are not applicable anymore and so the new methods and algorithms specific for the Bayesian setup need to be worked out.
Random sequence comparison. In many research areas, the central object of study is long (but finite length) sequences from (relatively small) finite alphabet: DNA sequences, protein alignments, written texts, binary codes. One of the central questions is how to measure the similarity (or homology in genetics) of two sequences. There are several methods for that, mostly it is done via so-called similarity score, the most popular score is the length of the longest common subsequence - the longer is the longest common subsequence, the more related are the sequences deemed to be. To incorporate the variety, the sequences are modeled as random processes (often iid) and not-related sequences are taken as an independent. Now the score is a random variable as well and to distinguish the independent sequences from dependent ones, one has to know the (asymptotic) behavior of the score. Unfortunately, analyzing the score is a notoriously hard mathematical problem and since the limit laws are absent, one starts with simpler yet important questions like what is the speed growth of expected score, variation, and other moments. The standard tools of probability theory allow rather easily find some upper bounds to the moments, but lower bounds need a different approach. Together with international partners from GeorgiaTech (mainly H. Matzinger), a general method for lower bounds is worked out. This method requires neither the sequences being independent nor consisting of independent entries (as it is in the case of many standard tools for upper bounds), hence also applicable for pairwise Markov models (as described in the previous section) linking the two research areas of the group. Besides lower bounds, several other questions like path structure of longest common subsequences, the influence of score parameters, suboptimal alignments, etc are studied.
Maximum spacing method and model validation. In statistical modeling, it is essential to find a model that is in some sense most suitable for describing reality. Unlike information criteria such as Akaike or Bayes, the maximum spacing method (MSP) enables together with parameter estimation to decide about the suitability of the assigned model class for given data. The purpose is to study the properties of MSP estimators and the possibilities of the method as a model validation tool for model classes of different complexity.
Moran-type evolution models. The evolution model describes the dynamics of the genetic structure of the population as time evolves. These dynamics are usually driven by two components: selection and mutation. There are a large variety of evolution models, some of them model the evolution time in generations (e.g. Wright-Fisher model), some of them in terms of individuals – a new individual with a certain genotype is born and, to keep the population size fixed, another individual with possibly but not necessarily different genotype is discarded. Typically the births and deaths are modeled as being random and so the models become stochastic. To every genotype is attached a number – fitness – that describes how the given type influences the probability of birth or death of the individual. More fit individuals have smaller probabilities to die. Or, when fitness influences the birth, then the more fit is a type, the bigger is the probability that a newborn individual is of that type. The classical individual-level model is the so-called Moran model. Since the genetic structure of the population after birth/death event depends on the structure before the event and additional randomness, the genetic structure of the population is a (discrete-time) Markov process. The properties of the process, its stationary distribution, reversibility, limit laws as the population size increases – these are the research questions of the group. Evolution models are a rather new field for the group. The main collaborators are C. Watkins (RHUL) and F. Zucca, D. Bertacchi (Milano). The group investigates already existing models (e.g. so-called Guiol-Machado-Schinazi model) as well as proposes new models.
Research Group staff
- Professor Jüri Lember (Head of Research Group)
- Associate Professor Kristi Kuljus
- PhD student Joonas Sova
- Stochastic modeling: mainly models related to Hidden Markov models (HMMs, pairwise and triplet Markov models (PMM and TMM)), segmentation (Viterbi related algorithms) parameter inference, limit theorems in segmentation, infinite alignments, Bayesian modeling, applications.
- Random sequence comparison: longest common subsequence related scores and optimal alignments, asymptotic properties, PMMs in sequence comparison.
- Maximum spacing methods: asymptotic properties of MSP estimators in the case of multivariate observations; extension of MSP method to dependent observations; MSP method as a model validation tool.
- Evolution models: Moran-like reversible models, large population limits, Dirichlet processes, Polya urn models, GMS models.
Research Group projects
- PI: IUT34-5 Mathematical statistics: theory and applications
- PRG865 Statistical modelling with hidden Markov models and model validation
Main past projects
- ETF9288 Hidden Markov models and random sequence comparison (The project is a continuation of the ETF grants 5694 and 7553.) (01.2012-07.2016)
- SF0180015s12 Stochastics: theory and applications (01.2012-12.2014)
Hidden Markov models (HMMs) and related fields:
- Lember, Jüri; Sova, Joonas (2021). Regenerativity of Viterbi Process for Pairwise Markov Models. Journal of Theoretical Probability. DOI: 10.1007/s10959-020-01022-z [not yet published].
- Koloydenko, A.; Kuljus, K.; Lember, J. (2020). MAP segmentation in Bayesian hidden Markov models: a case study. Journal of Applied Statistics. DOI: 10.1080/02664763.2020.1858273.
- Lember, Jüri; Sova, Joonas (2020). Existence of infinite Viterbi path for pairwise Markov models. Stochastic Processes and their Applications, 130 (3), 1388−1425. DOI: 10.1016/j.spa.2019.05.004.
- Lember, J., Gasbarra, D., Koloydenko, A. and Kuljus, K. (2019). Estimation of Viterbi path in Bayesian hidden Markov models. METRON, 1−27. DOI: 10.1007/s40300-019-00152-7.
- Kuljus, Kristi; Bayisa, Fekadu L.; Bolin, David; Lember, Jüri; Yu, Jun (2018). Comparison of hidden Markov chain models and hidden Markov random field models in estimation of computed tomography images. Communications in Statistics: Case Studies, Data Analysis and Applications , 4 (1), 46−55.10.1080/23737484.2018.1473059.
- Kuljus, Kristi; Lember, Jüri (2016). On the accuracy of the MAP inference in HMMs. Methodology And Computing In Applied Probability, 18 (3), 597−627.
- Koloydenko, Alexey; Lember, Jüri (2014). Bridging Viterbi and Posterior Decoding: A Generalized Risk Approach to Hidden Path Inference Based on Hidden Markov Models. Journal of Machine Learning Research, 15, 1−58.
- Kuljus, Kristi; Lember, Jüri (2012). Asymptotic risks of Viterbi segmentation. Stochastic Processes and their Applications, 122 (9), 3312−3341. DOI: 10.1016/j.spa.2012.05.014.
- Lember, J.; Koloydenko, A. (2008). The adjusted Viterbi training for hidden Markov models. Bernoulli, 14 (1), 180−206.
Random sequence comparison:
- Lember, Jüri; Matzinger, Heinrich; Sova, Joonas; Zucca, Fabio (2018). Lower bounds for moments of global scores of pairwise Markov chains. Stochastic Processes and their Applications, 128, 1678−1710. DOI: 10.1016/j.spa.2017.08.009.
- Gong, Ruoting; Houdre, Christian; Lember, Jüri (2018). Lower Bounds on the Generalized Central Moments of the Optimal Alignments Score of Random Sequences. Journal of Theoretical Probability, 31 (2), 1−41. DOI: 10.1007/s10959-016-0730-4.
- Lember, Jüri; Matzinger, Heinrich; Vollmer, Anna-Lisa (2014). Optimal alignments of longest common subsequences and their path properties. Bernoulli, 20 (3), 1292−1343. DOI: 10.3150/13-BEJ522.
- Lember, J.; Matzinger, H.; Torres, F. (2012). The rate of the convergence of the mean score in random sequence comparison. The Annals of Applied Probability, 22 (3), 1046−1058. DOI: 10.1214/11-AAP778.
- Lember, J.; Matzinger, H. (2009). Standard deviation of the longest common subsequences Annals of probability, 37 (3), 1192−1235. DOI: 10.1214/08-AOP436.
Maximum spacing method and model validation:
- Kuljus, Kristi; Ranneby, Bo (2020). Asymptotic normality of generalized maximum spacing estimators for multivariate observations. Scandinavian Journal of Statistics, 47 (3), 968−989. DOI: 10.1111/sjos.12436.
- Kuljus, Kristi; Ranneby, Bo (2015). Generalized maximum spacing estimation for multivariate observations. Scandinavian Journal of Statistics, 42 (4), 1092−1108.
Moran type evolution models:
- Lember, Jüri; Watkins, Chris (2021). An Evolutionary Model that Satisfies Detailed Balance. Methodology And Computing In Applied Probability. DOI: 10.1007/s11009-020-09835-5 [ilmumas].
- Bertacchi, Daniela; Lember, Jüri; Zucca, Fabio (2018). A stochastic model for the evolution of species with random fitness. Electronic Communications in Probability, 23 (88), 1−13. DOI: 10.1214/18-ECP190.