4. Theory#

This page explains the scientific and statistical theory behind Maud, and then explains in broad terms how Maud implements that theory.

Look out for a paper setting this out in more detail in the near future!

4.1. The big picture#

The overall aim of an analysis using Maud is to synthesise information from the following sources:

Quantitative measurements of a cell’s metabolites, enzymes and fluxes.
A mechanistic model that describes how each reaction in the metabolic network works in terms of some unknown parameters. Since Maud’s model involves thermodynamics and kinetics, and because it sounds cool, we call it a “thermokinetic model”.
Information about the parameters in the model from sources other than the measurements, such as databases like [Placzek et al., 2016].

These information sources are interesting on their own, but it is clear that they are even better in combination. For example, there is not much gained by accurately parameterising an enzyme’s mechanism without knowing the values of the parameters, and the more sources of parameter information the better!

By treating this problem as a case of Bayesian statistical inference, it is possible to partition these information sources into separate sub-models:

A measurement model describing what is learned from any set of measurements using conditional probability distributions with the form $p(y\mid\hat{y})$, where $y$ represents the observed and $\hat{y}$ the true values of the measured quantities.
A generative model, encompassing our thermokinetic model, that describes how measurable quantities depend on parameters, i.e. a function $g:\Theta\rightarrow\hat{\mathfrak{y}}$ mapping any possible configuration of parameters $\theta$ to a set of measurable quantities $\hat{y}$.
A prior model describing what is known about the unobserved parameters using an unconditional probability function with the form $p(\theta)$.

Given submodels in this form, and a set of measurements $y$ the posterior distribution $p(\theta\mid y)$ synthesises the available information as we want. Questions about the network can be formulated in terms of integrals over the posterior distribution. For example, suppose that our thermokinetic model says that the flux through a reaction is given by the function $flux: \Theta\rightarrow\mathbb{R}$; our model’s probability, given measurements $y$, that the flux is greater than $x$ is then $\int_x^\inftyp(\theta\mid y)d\theta$.

The rest of this page sets out how Maud fleshes out the three sub-models and then briefly explains how it goes about calculating the required posterior integrals.

(sec-thermokinetic-models-in-general)

4.2. Thermokinetic models in general#

A thermokinetic model describes the rates (usually referred to as ‘fluxes’) of the chemical reactions in a metabolic network parametrically, in terms of the concentrations of the network’s metabolites, some kinetic parameters describing the network’s reactions, some thermodynamic parameters and some parameters describing the network’s boundary conditions. eq4.1 therefore captures the general form of a thermokinetic model.

(4.1)#\[ v = f_v(x, \theta) \]

In eq4.1, the term $x$ represents a vector of metabolite concentrations, the term $\theta$ represents a parameter vector, and $v$ is a vector of real numbers, one per reaction.

Thermokinetic models of cell metabolism are often contrasted with constraint-based models. Constraint-based models do not describe metabolic fluxes parametrically; instead they simply identify constraints that exclude certain flux values.

Many thermokinetic models are possible, depending on how the modeller trades off detail against other factors such as computational feasibility, model identification and ease of curation.

One thing that almost all thermokinetic models have in common is the assumption that the stoichiometric coefficient—that is, the number of molecules created or destroyed—of every metabolite in every reaction is known and collected in a stoichiometric matrix $S$ with a row for each metabolite and a column for each reaction.

4.3. Maud’s thermokinetic model#

The thermokinetic model that we chose to implement in Maud decomposes into factors as shown in eq4.2. The full model is presented here for completeness and to illustrate the general nature of the equations that we use, but the details can safely be skipped over.

(4.2)#\[ f_v(x, \theta) = Enzyme \cdot k_{cat} \cdot Reversibility \cdot Saturation \cdot Allostery \cdot Phosphorylation \]

Each of the terms on the right hand side of eq4.2 is a function of $x$ and $\theta$ representing a conceptually distinct aspect of how an enzyme-catalysed reaction works.

The term $Enzyme$ is a vector of non-negative real numbers representing the concentration of the enzyme catalysing each reaction.

The term $k_{cat}$ is a vector of non-negative real numbers representing the amount of flux carried per unit of saturated enzyme.

The term $Reversibility$ is a vector of real numbers capturing the impact of thermodynamics on the reaction’s flux, as shown in eq-reversibility. $$ \begin{aligned} Reversibility &= 1 - \exp(\frac{\Delta_{r}G' + RT \cdot S^T \ln(x)}{RT}) \\ \Delta_{r}G' &= S^{T}\Delta_{f}G' + n F \psi \end{aligned} $$(eq-reversibility)

In eq-reversibility, the term

$T$ is the temperature in Kelvin (a number),
$R$ is the gas constant (a number),
$\Delta_rG'$ is a vector representing the Gibbs free energy change of each reaction in standard conditions,
$\Delta_fG'$ is a vector representing the standard condition Gibbs free energy change of each metabolite’s formation reaction, or in other words each metabolite’s ‘formation energy’.
$n$ is a vector representing the number of charges transported by each reaction.
$F$ is the Faraday constant (a number)
$\psi$ is a vector representin each reaction’s membrane potential (these numbers only matter for reactions that transport non-zero charge)

Note that, for reactions with zero transported charge, the thermodynamic effect on each reaction is derived from metabolite formation energies. This formulation is helpful because, provided that all reactions’ rates are calculated from the same formation energies, they are guaranteed to be thermodynamically consistent.

The term $n$ accounts for both the charge and the directionality. For instance, a reaction that exports 2 protons to the extracellular space in the forward direction would have -2 charge. If a negatively charged molecule like acetate is exported in the forward direction, $n$ would be 1.

Notice how this way of modelling the effect of transported charge does not take into account that the concentration gradient used by the transport is that of the dissociated molecules. Thus, this expression is only correct for ions whose concentration can be expressed in the model only in the charged form; e.g., protons, $K^+$, $Na^+$, $Cl^-$, etc.

The term $Saturation$ in equation eq4.2 is a vector of non-negative real numbers representing, for each reaction, the fraction of enzyme that is saturated, i.e. bound to one of the reaction’s substrates. To describe saturation we use eq4.3, which is taken from [Liebermeister et al., 2010].

(4.3)#\[\begin{split} \begin{aligned} Saturation_r &= a \cdot \text{free enzyme ratio} \\ a &= \prod_{\text{s substrate}}\frac{x_s}{km_{rs}} \\ \text{free enzyme ratio} &= \begin{cases} \prod_{\text{s sustrate}} (1 + \frac{x_s}{km_{rs}})^{S_sr} + \sum_{\text{c inhibitor}}\frac{x_c}{ki_{rc}} & r\text{ irreversible} \\ -1 + \prod_{\text{s sustrate}} (1 + \frac{x_s}{km_{rs}})^{S_sr} + \sum_{\text{c inhibitor}}\frac{x_c}{ki_{rc}} + \prod_{\text{p product}} (1 + \frac{x_p}{km_{rp}})^{S_pr} & r\text{ reversible} \end{cases} \end{aligned} \end{split}\]

The term $Allostery$ in eq4.2 is a vector of non-negative numbers describing the effect of allosteric regulation on each reaction. Allosteric regulation happens when binding to a certain molecule changes an enzyme’s shape in a way that changes its catalytic behaviour. We use eq4.4, originally from [Popova and Sel'kov, 1975], to describe this phenomenon.

(4.4)#\[\begin{split} \begin{aligned} Allostery_r &= \frac{1}{1 + tc_r \cdot (\text{free enzyme ratio}_r \cdot \frac{Qtense}{Qrelaxed})^{subunits}} \\ Qtense &= 1 + \sum_{\text{i inhibitor}} \frac{x_i}{dc_{ri}} \\ Qrelaxed &= 1 + \sum_{\text{a activator}} \frac{x_a}{dc_{ra}} \end{aligned} \end{split}\]

Finally, the term $Phosphorylation$ in eq4.2 captures an important effect whereby enzyme activity is altered due to a coupled process of phosphorylation and dephosphorylation.

(4.5)#\[\begin{split} \begin{aligned} Phosphorylation_r &= (\frac{\beta}{\alpha + \beta})^{subunits} \\ \alpha &= \sum_{\text{p phosphoylator}} kcat_{p} \cdot concp_p \\ \beta &= \sum_{\text{p dephosphoylator}} kcat_{d} \cdot concd_d \\ \end{aligned} \end{split}\]

4.3.1. Steady state assumption#

As well as assuming that the fluxes in our metabolic network will behave as our thermokinetic model expects, we also assume that the system was measured in a steady state, so that every non-boundary metabolite’s concentration was not changing. This assumption is represented mathematically in eq4.6.

(4.6)#\[ S\cdot f_v(x, \theta) = 0 \]

The steady state assumption removes degrees of freedom equal to the rank of $S$ from our model, so that $v$ is constrained to lie in the right null space of S. Usually, given $\theta$ it is possible can solve eq4.6 to find a steady state metabolite concentration $x_{steady}$.

4.3.2. Summary: Maud’s full thermokinetic model#

Maud’s full thermokinetic model starts with a parameter vector $\theta$. It then calculates the steady state metabolite concentration vector $x_steady$ by solving eq4.6.

:::{note}

Note that in practice we do not solve eq4.6 directly but instead use ODE simulation - see section Solving the steady state problem for details

:::

Metabolite concentrations are now available and can be compared with metabolite measurements. To find flux values to compare with measurements we simply calculate $v_{steady}=f_v(x_{steady}, \theta)$.

4.3.3. Measurement model#

For measurements of metabolite and enzyme concentrations we use independent lognormal regression models:

(4.7)#\[\begin{split} \begin{aligned} y_{x} &\sim LN(\ln(x_{steady}), \sigma_x) \\ y_{enzyme} &\sim LN(\ln(Enzyme), \sigma_{enzyme}) \end{aligned} \end{split}\]

For measurements of steady state fluxes we use independent linear regression models:

(4.8)#\[ \begin{equation} y_{v} \sim N(f_v(x_{steady}, \theta), \sigma_v) \end{equation} \]

We ensure that flux measurements correspond to the flux modes of the measured network, so that the same pathway is never measured twice. See chapter 10 of [Palsson, 2015], entitled “the left null space” for a discussion of this issue.

We assume that the measurement errors $\sigma_x$, $\sigma_{enzyme}$ and $\sigma_v$ are known for each measurement.

These measurements $y_x$ and $y_{enzyme}$ are typically derived from quantitative metabolomics and proteomics experiments. Our choice of measurement model for these measurements is imperfect in at least these ways:

The measurement error is not known, and is in fact very difficult to estimate.
The measurements are not independent, as they are typically far more reliable as to differences—both between metabolites in the same experiment and between the same metabolite in different experiments—than they are as to absolute concentrations.

These problems also apply to flux measurements, but there is another issue: flux meausrements are derived from k

4.3.4. Prior model#

For kinetic parameters including $km$, $kcat$, $ki$ and parameters governing regulation we use independent informative lognormal prior distributions based on information gleaned from online databases, literature searches and intuition.

For boundary conditions including unbalanced metabolite concentrations, boundary fluxes and enzyme concentrations we use informative independent lognormal or normal prior distributions depending on the natural sign constraints of the variable (for example fluxes are not constrained to be positive, so in this case we use independent normal prior distributions). We sometimes use measurements to determine informative prior distributions for boundary conditions.

For thermodynamic parameters—i.e. metabolite formation energies—we use an informative multivariate normal distribution that is derived from equilibrium constant measurements reported in the NIST TECRDB [Goldberg et al., 2004]. The distribution is calculated using the component contribution method [Noor et al., 2013] as implemented in the software equilibrator [Beber et al., 2021]. In future we would like to use our own software to generate these priors.

4.4. Implementation#

In order to implement the statistical model described above, Maud performs posterior sampling using adaptive Hamiltonian Monte Carlo as provided by Stan.

To our knowledge this is the only viable approach. Realistic models have too many parameters for many Bayesian computation approaches, including rejection sampling, ABC, Metropolis-Hastings and Gibbs sampling, while a similar study [St. John et al., 2018], corroborated by our experience, indicates that variational inference is unlikely to provide satisfactory approximations in this case.

We believe that the non-linear, multi-parameter equations shown in Maud’s thermokinetic model create particular problems for MCMC sampling because they induce complex correlations in the posterior distribution. Consequently, although adaptive Hamiltonian Monte Carlo makes Bayesian inference for realistic thermokinetic models possible, many leapfrog steps are typically required per sample, as the sampler must traverse the posterior distribution in small steps in order to avoid discretisation errors.

4.4.1. Solving the steady state problem#

As mentioned in Summary: Maud’s full thermokinetic model, Maud’s generative model calculates steady state metabolite concentrations given parameter values by solving the steady state equation eq4.6. Since adaptive Hamiltonian Monte Carlo requires gradients of the posterior distribution, it is also necessary to calculate sensitivities of the steady state solution with respect to all parameters.

Our approach to this problem is to choose a starting concentration vector $x_{0}$ and a simulation time $t$, then find $x_t$ using numerical ODE integration. To verify whether $x_t$ is a steady state we then evaluate $S\cdot f_v(x_t, \theta)$ and check if the result is sufficiently close to zero.

Stan provides an interface to the Sundials ODE solver CVODES, including gradient calculations.

For the systems we have investigated, this method works better than solving the steady state problem using an algebra solver.

It is relevant to note a caveat for the Maximum A Posteriori estimation (optimization) in opposition to sampling the full distribution. In an optimization context, a hard rejection of non-steady states may be disadvantageous. The reason is that, when a rejection occurs, the gradients are lost, removing the trajectory of the optimization. For this application, maud provides the possibility of including the departure from steady state in the likelihood, such that the gradients are not lost but the optimization still favors steady states. This is implemented in the form of eq4.9

(4.9)#\[ S\cdot f_v(x_t, \theta) \sim N(0, $\epsilon$ $C$) \]

where $C$ is the vector of balanced concentrations and $\epsilon$ is a configurable parameter.

4.4.2. Analysis of results#

After sampling is complete, Maud uses the Bayesian analysis library arviz to transform the results into an InferenceData object and save it as a json file, along with a range of files containing debug information. These files can be used to validate the computation and to draw conclusions about the measured system.

[1]

Moritz E. Beber, Mattia G. Gollub, Dana Mozaffari, Kevin M. Shebek, and Elad Noor. eQuilibrator 3.0 – a platform for the estimation of thermodynamic constants. arXiv:2103.00621 [q-bio], February 2021. URL: http://arxiv.org/abs/2103.00621 (visited on 2021-12-03), arXiv:2103.00621.

[2]

Robert N. Goldberg, Yadu B. Tewari, and Talapady N. Bhat. Thermodynamics of enzyme-catalyzed reactions—a database for quantitative biochemistry. Bioinformatics, 20(16):2874–2877, November 2004. URL: https://doi.org/10.1093/bioinformatics/bth314 (visited on 2022-07-11), doi:10.1093/bioinformatics/bth314.

[3]

Wolfram Liebermeister, Jannis Uhlendorf, and Edda Klipp. Modular rate laws for enzymatic reactions: thermodynamics, elasticities and implementation. Bioinformatics, 26(12):1528–1534, June 2010. URL: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btq141 (visited on 2019-09-23), doi:10.1093/bioinformatics/btq141.

[4]

Elad Noor, Hulda S. Haraldsdóttir, Ron Milo, and Ronan M. T. Fleming. Consistent Estimation of Gibbs Energy Using Component Contributions. PLoS Computational Biology, 9(7):e1003098, July 2013. URL: https://dx.plos.org/10.1371/journal.pcbi.1003098 (visited on 2019-11-07), doi:10.1371/journal.pcbi.1003098.

[5]

Bernhard Ø. Palsson. Systems Biology: Constraint-based Reconstruction and Analysis. Cambridge University Press, Cambridge, 2015. ISBN 978-1-107-03885-1. URL: https://www.cambridge.org/core/books/systems-biology/7F8445BC87019806B3625DFC4B5C27D4 (visited on 2023-04-20), doi:10.1017/CBO9781139854610.

[6]

Sandra Placzek, Ida Schomburg, Antje Chang, Lisa Jeske, Marcus Ulbrich, Jana Tillack, and Dietmar Schomburg. BRENDA in 2017: new perspectives and new tools in BRENDA. Nucleic Acids Research, 45(D1):D380–D388, October 2016. URL: https://doi.org/10.1093/nar/gkw952, arXiv:https://academic.oup.com/nar/article-pdf/45/D1/D380/8847269/gkw952.pdf, doi:10.1093/nar/gkw952.

[7]

S. V. Popova and E. E. Sel'kov. Generalization of the model by monod, wyman and changeux for the case of a reversible monosubstrate reaction. FEBS Letters, 53(3):269–273, 1975. URL: https://febs.onlinelibrary.wiley.com/doi/abs/10.1016/0014-5793%2875%2980034-2 (visited on 2019-07-12), doi:10.1016/0014-5793(75)80034-2.

[8]

Peter St. John, Jonathan Strutz, Linda J Broadbelt, Keith E J Tyo, and Yannick J Bomble. Bayesian inference of metabolic kinetics from genome-scale multiomics data. bioRxiv, October 2018. URL: http://biorxiv.org/lookup/doi/10.1101/450163 (visited on 2019-02-19), doi:10.1101/450163.