Stochastic Models in Bioinformatics

Instructor: Dr. István MIKLÓS

Text: Durbin, Eddy, Krogh, Mitchison: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids + handouts.

Prerequisite: None, but some mathematical maturity is needed. The course starts with a short overview of the mathematics and biology required.

Course description: Bioinformatics is a new and rapidly growing discipline. It is highly application-oriented, yet it rests on a rich theoretical background that combines combinatorics, probability theory, statistics and algorithm theory. This course is an introduction to the mathematical background of bioinformatics, with special emphasis on problem solving and applications.

Topics:

Basics: Models in biology. Biological sequences. RNA secondary structures and pseudo-knotted structures. Protein folding. Evolutionary trees. Basic concepts of evolutionary and comparative biology. Introduction to statistical inference: likelihood function, maximum likelihood estimation, expectation maximization, Bayes' theorem, Bayesian statistics.

Sequence alignment: The classical and automaton approaches to aligning sequences. Hidden Markov Models (HMMs): aligning sequences to a structure. Aligning sequences with pair-HMMs.

Stochastic grammars: The Chomsky hierarchy. Regular grammars are HMMs. Stochastic Context-Free Grammars (SCFGs) and their applications in RNA structure prediction. The algorithmic theory of regular grammars and SCFGs. The algebraic dynamic programming approach.

Evolutionary trees: Concepts for inferring trees. Stochastic models of evolutionary trees. Kingman's coalescent.

Continuous-time Markov models: Substitution models of nucleic acids and amino acids. Insertion-deletion models. Statistical sequence alignment. Comparative bioinformatics.

Optional topics (depending on how much time we have):

Markov chain Monte Carlo: The concept of MCMC. Metropolis-Hastings. The Gibbs sampler. Partial Importance Sampler. Simulated Annealing. Parallel Tempering. Applications: Bayesian statistics of evolutionary trees, multiple sequence alignment, genome rearrangement.

RNA structures (advanced): Stochastic grammars for inferring pseudo-knotted structures. Folding simulations. Co-transcriptional folding.