4. Theory#

This page explains the scientific and statistical theory behind Maud, and then explains in broad terms how Maud implements that theory.

Look out for a paper setting this out in more detail in the near future!

4.1. The big picture#

The overall aim of an analysis using Maud is to synthesise information from the following sources:

Quantitative measurements of a cell’s metabolites, enzymes and fluxes.
A mechanistic model that describes how each reaction in the metabolic network works in terms of some unknown parameters. Since Maud’s model involves thermodynamics and kinetics, and because it sounds cool, we call it a “thermokinetic model”.
Information about the parameters in the model from sources other than the measurements, such as databases like [Placzek et al., 2016].

These information sources are interesting on their own, but it is clear that they are even better in combination. For example, there is not much gained by accurately parameterising an enzyme’s mechanism without knowing the values of the parameters, and the more sources of parameter information the better!

By treating this problem as a case of Bayesian statistical inference, it is possible to partition these information sources into separate sub-models:

A measurement model describing what is learned from any set of measurements using conditional probability distributions with the form $p(y\mid\hat{y})$, where $y$ represents the observed and $\hat{y}$ the true values of the measured quantities.
A generative model, encompassing our thermokinetic model, that describes how measurable quantities depend on parameters, i.e. a function $g:\Theta\rightarrow\hat{\mathfrak{y}}$ mapping any possible configuration of parameters $\theta$ to a set of measurable quantities $\hat{y}$.
A prior model describing what is known about the unobserved parameters using an unconditional probability function with the form $p(\theta)$.

Given submodels in this form, and a set of measurements $y$ the posterior distribution $p(\theta\mid y)$ synthesises the available information as we want. Questions about the network can be formulated in terms of integrals over the posterior distribution. For example, suppose that our thermokinetic model says that the flux through a reaction is given by the function $flux: \Theta\rightarrow\mathbb{R}$; our model’s probability, given measurements $y$, that the flux is greater than $x$ is then $\int_x^\inftyp(\theta\mid y)d\theta$.

The rest of this page sets out how Maud fleshes out the three sub-models and then briefly explains how it goes about calculating the required posterior integrals.

(sec-thermokinetic-models-in-general)

4.2. Thermokinetic models in general#

A thermokinetic model describes the rates (usually referred to as ‘fluxes’) of the chemical reactions in a metabolic network parametrically, in terms of the concentrations of the network’s metabolites, some kinetic parameters describing the network’s reactions, some thermodynamic parameters and some parameters describing the network’s boundary conditions. eq4.1 therefore captures the general form of a thermokinetic model.

(4.1)#\[ v = f_v(x, \theta) \]

In eq4.1, the term $x$ represents a vector of metabolite concentrations, the term $\theta$ represents a parameter vector, and $v$ is a vector of real numbers, one per reaction.

Thermokinetic models of cell metabolism are often contrasted with constraint-based models. Constraint-based models do not describe metabolic fluxes parametrically; instead they simply identify constraints that exclude certain flux values.

Many thermokinetic models are possible, depending on how the modeller trades off detail against other factors such as computational feasibility, model identification and ease of curation.

One thing that almost all thermokinetic models have in common is the assumption that the stoichiometric coefficient—that is, the number of molecules created or destroyed—of every metabolite in every reaction is known and collected in a stoichiometric matrix $S$ with a row for each metabolite and a column for each reaction.

4.3. Maud’s thermokinetic model#

The thermokinetic model that we chose to implement in Maud decomposes into factors as shown in eq4.2. The full model is presented here for completeness and to illustrate the general nature of the equations that we use, but the details can safely be skipped over.

(4.2)#\[ f_v(x, \theta) = Enzyme \cdot k_{cat} \cdot Reversibility \cdot Saturation \cdot Allostery \cdot Phosphorylation \]

Each of the terms on the right hand side of eq4.2 is a function of $x$ and $\theta$ representing a conceptually distinct aspect of how an enzyme-catalysed reaction works.

The term $Enzyme$ is a vector of non-negative real numbers representing the concentration of the enzyme catalysing each reaction.

The term $k_{cat}$ is a vector of non-negative real numbers representing the amount of flux carried per unit of saturated enzyme.

The term $Reversibility$ is a vector of real numbers capturing the impact of thermodynamics on the reaction’s flux, as shown in eq4.3.

(4.3)#\[\begin{split} \begin{aligned} Reversibility &= 1 - \exp(\frac{\Delta_{r}G' + RT \cdot S^T \ln(x)}{RT}) \\ \Delta_{r}G' &= S^{T}\Delta_{f}G' + n F \psi \end{aligned} \end{split}\]

In eq4.3, the term

$T$ is the temperature in Kelvin (a number),
$R$ is the gas constant (a number),
$\Delta_rG'$ is a vector representing the Gibbs free energy change of each reaction in standard conditions,
$\Delta_fG'$ is a vector representing the standard condition Gibbs free energy change of each metabolite’s formation reaction, or in other words each metabolite’s ‘formation energy’.
$n$ is a vector representing the number of charges transported by each reaction.
$F$ is the Faraday constant (a number)
$\psi$ is a vector representin each reaction’s membrane potential (these numbers only matter for reactions that transport non-zero charge)

Note that, for reactions with zero transported charge, the thermodynamic effect on each reaction is derived from metabolite formation energies. This formulation is helpful because, provided that all reactions’ rates are calculated from the same formation energies, they are guaranteed to be thermodynamically consistent.

The term $n$ accounts for both the charge and the directionality. For instance, a reaction that exports 2 protons to the extracellular space in the forward direction would have -2 charge. If a negatively charged molecule like acetate is exported in the forward direction, $n$ would be 1.

Notice how this way of modelling the effect of transported charge does not take into account that the concentration gradient used by the transport is that of the dissociated molecules. Thus, this expression is only correct for ions whose concentration can be expressed in the model only in the charged form; e.g., protons, $K^+$, $Na^+$, $Cl^-$, etc.

The term $Saturation$ in equation eq4.2 is a vector of non-negative real numbers representing, for each reaction, the fraction of enzyme that is saturated, i.e. bound to one of the reaction’s substrates. To describe saturation we use eq4.4, which is taken from [Liebermeister et al., 2010].

(4.4)#\[\begin{split} \begin{aligned} Saturation_r &= a \cdot \text{free enzyme ratio} \\ a &= \prod_{\text{s substrate}}\frac{x_s}{km_{rs}} \\ \text{free enzyme ratio} &= \begin{cases} \prod_{\text{s sustrate}} (1 + \frac{x_s}{km_{rs}})^{S_sr} + \sum_{\text{c inhibitor}}\frac{x_c}{ki_{rc}} & r\text{ irreversible} \\ -1 + \prod_{\text{s sustrate}} (1 + \frac{x_s}{km_{rs}})^{S_sr} + \sum_{\text{c inhibitor}}\frac{x_c}{ki_{rc}} + \prod_{\text{p product}} (1 + \frac{x_p}{km_{rp}})^{S_pr} & r\text{ reversible} \end{cases} \end{aligned} \end{split}\]

The term $Allostery$ in eq4.2 is a vector of non-negative numbers describing the effect of allosteric regulation on each reaction. Allosteric regulation happens when binding to a certain molecule changes an enzyme’s shape in a way that changes its catalytic behaviour. We use eq4.5, originally from [Popova and Sel'kov, 1975], to describe this phenomenon.

(4.5)#\[\begin{split} \begin{aligned} Allostery_r &= \frac{1}{1 + tc_r \cdot (\text{free enzyme ratio}_r \cdot \frac{Qtense}{Qrelaxed})^{subunits}} \\ Qtense &= 1 + \sum_{\text{i inhibitor}} \frac{x_i}{dc_{ri}} \\ Qrelaxed &= 1 + \sum_{\text{a activator}} \frac{x_a}{dc_{ra}} \end{aligned} \end{split}\]

Finally, the term $Phosphorylation$ in eq4.2 captures an important effect whereby enzyme activity is altered due to a coupled process of phosphorylation and dephosphorylation.

(4.6)#\[\begin{split} \begin{aligned} Phosphorylation_r &= (\frac{\beta}{\alpha + \beta})^{subunits} \\ \alpha &= \sum_{\text{p phosphoylator}} kcat_{p} \cdot concp_p \\ \beta &= \sum_{\text{p dephosphoylator}} kcat_{d} \cdot concd_d \\ \end{aligned} \end{split}\]

4.3.1. Steady state assumption#

As well as assuming that the fluxes in our metabolic network will behave as our thermokinetic model expects, we also assume that the system was measured in a steady state, so that every non-boundary metabolite’s concentration was not changing. This assumption is represented mathematically in eq4.7.

(4.7)#\[ S\cdot f_v(x, \theta) = 0 \]

The steady state assumption removes degrees of freedom equal to the rank of $S$ from our model, so that $v$ is constrained to lie in the right null space of S. Usually, given $\theta$ it is possible can solve eq4.7 to find a steady state metabolite concentration $x_{steady}$.

4.3.2. Summary: Maud’s full thermokinetic model#

Maud’s full thermokinetic model starts with a parameter vector $\theta$. It then calculates the steady state metabolite concentration vector $x_steady$ by solving eq4.7.

:::{note}

See section Solving the steady state problem for how we solve eq4.7 in practice.

:::

Metabolite concentrations are now available and can be compared with metabolite measurements. To find flux values to compare with measurements we simply calculate $v_{steady}=f_v(x_{steady}, \theta)$.

4.3.3. Measurement model#

For measurements of metabolite and enzyme concentrations we use independent lognormal regression models:

(4.8)#\[\begin{split} \begin{aligned} y_{x} &\sim LN(\ln(x_{steady}), \sigma_x) \\ y_{enzyme} &\sim LN(\ln(Enzyme), \sigma_{enzyme}) \end{aligned} \end{split}\]

For measurements of steady state fluxes we use independent linear regression models:

(4.9)#\[ \begin{equation} y_{v} \sim N(f_v(x_{steady}, \theta), \sigma_v) \end{equation} \]

We ensure that flux measurements correspond to the flux modes of the measured network, so that the same pathway is never measured twice. See chapter 10 of [Palsson, 2015], entitled “the left null space” for a discussion of this issue.

We assume that the measurement errors $\sigma_x$, $\sigma_{enzyme}$ and $\sigma_v$ are known for each measurement.

These measurements $y_x$ and $y_{enzyme}$ are typically derived from quantitative metabolomics and proteomics experiments. Our choice of measurement model for these measurements is imperfect in at least these ways:

The measurement error is not known, and is in fact very difficult to estimate.
The measurements are not independent, as they are typically far more reliable as to differences—both between metabolites in the same experiment and between the same metabolite in different experiments—than they are as to absolute concentrations.

These problems also apply to flux measurements, but there is another issue: flux meausrements are derived from k

4.3.4. Prior model#

For kinetic parameters including $km$, $kcat$, $ki$ and parameters governing regulation we use independent informative lognormal prior distributions based on information gleaned from online databases, literature searches and intuition.

For boundary conditions including unbalanced metabolite concentrations, boundary fluxes and enzyme concentrations we use informative independent lognormal or normal prior distributions depending on the natural sign constraints of the variable (for example fluxes are not constrained to be positive, so in this case we use independent normal prior distributions). We sometimes use measurements to determine informative prior distributions for boundary conditions.

For thermodynamic parameters—i.e. metabolite formation energies—we use an informative multivariate normal distribution that is derived from equilibrium constant measurements reported in the NIST TECRDB [Goldberg et al., 2004]. The distribution is calculated using the component contribution method [Noor et al., 2013] as implemented in the software equilibrator [Beber et al., 2021]. In future we would like to use our own software to generate these priors.

4.4. Implementation#

In order to implement the statistical model described above, Maud performs posterior sampling using adaptive Hamiltonian Monte Carlo as provided by Stan.

To our knowledge this is the only viable approach. Realistic models have too many parameters for many Bayesian computation approaches, including rejection sampling, ABC, Metropolis-Hastings and Gibbs sampling, while a similar study [St. John et al., 2018], corroborated by our experience, indicates that variational inference is unlikely to provide satisfactory approximations in this case.

We believe that the non-linear, multi-parameter equations shown in Maud’s thermokinetic model create particular problems for MCMC sampling because they induce complex correlations in the posterior distribution. Consequently, although adaptive Hamiltonian Monte Carlo makes Bayesian inference for realistic thermokinetic models possible, many leapfrog steps are typically required per sample, as the sampler must traverse the posterior distribution in small steps in order to avoid discretisation errors.

4.4.1. Solving the steady state problem#

As mentioned in Summary: Maud’s full thermokinetic model, Maud’s generative model calculates steady state metabolite concentrations given parameter values by solving the steady state equation eq4.7. Since adaptive Hamiltonian Monte Carlo requires gradients of the posterior distribution, it is also necessary to calculate sensitivities of the steady state solution with respect to all parameters.

Our approach to this problem is, before MCMC sampling, to choose a concentration vector $x_{guess}$. Every time Maud’s sampler evaluates the statistical model’s log probability and gradients, it numerically solves eq4.7, and finds its gradients, using this starting guess and Stan’s interface to the Sundials Newton solver IDAS.

The system function that is passed to IDAS includes a representation of eq4.7 in Stan code. In order to model the instantaneous change of metabolite concentrations for a given concentration vector due to eq4.7, Maud uses the Stan interface to the Sundials ODE solver CVODES to evolve the system. This hybrid approach, using both numerical ODE solving and algebra solving, follows the one taken in [Margossian, 2018].

It is relevant to note a caveat for the Maximum A Posteriori estimation (optimization) in opposition to sampling the full distribution. In an optimization context, a hard rejection of non-steady states may be disadvantageous. The reason is that, when a rejection occurs, the gradients are lost, removing the trajectory of the optimization. For this application, maud provides the possibility of including the departure from steady state in the likelihood, such that the gradients are not lost but the optimization still favors steady states. This is implemented in the form of eq4.10

(4.10)#\[ S\cdot f_v(x_t, \theta) \sim N(0, $\epsilon$ $C$) \]

where $C$ is the vector of balanced concentrations and $\epsilon$ is a configurable parameter.

4.4.2. Analysis of results#

After sampling is complete, Maud uses the Bayesian analysis library arviz to transform the results into an InferenceData object and save it as a json file, along with a range of files containing debug information. These files can be used to validate the computation and to draw conclusions about the measured system.

4.4.3. Metabolic Control Analysis#

Determining systems level changes in flux and metabolite concentration is not intuitive given the interation on the scale of individual enzymes. This problem has been addressed using metabolic control analysis (MCA) [Kacser and Burns, 1973]. MCA can be applied to any model parameter, however, in Maud we will address the most common application: changes in enzyme concentration. The question that MCA addresses is: How does the change in a given parameter change the system wide properties of flux and concentration? MCA provides a tool to address these changes close to the reference state, which can be any parameterised model that achieves a steady-state.

Given the solution to the steady-state is given by the following equation, we can take the derivative of this steady state with respect to the parameter of interest as shown in equation eq4.11.

(4.11)#\[ \frac{dx}{dt} = S \cdot f_v(x_t, \theta) \]

By taking the derivative of the steady state with respect to the parameters we can determine the systems wide change for a given parameter. To do so, we need to apply the chain rule, this separates into the following terms: $S\frac{\partial f_v}{\partial \theta}$, that determines the change in concentration due to the direct impact of the flux vector; and, $S\frac{\partial f_v}{\partial S}\frac{\partial x}{\partial \theta}$, that accounts for the change in flux due to the local change in metabolite concentration that resulted from the parameter change. This is presented in equation eq4.12.

(4.12)#\[ \partial \frac{S \cdot f_v(x_t, \theta)}{\partial \theta} = S\frac{\partial f_v}{\partial \theta} + S\frac{\partial f_v}{\partial x}\frac{\partial x}{\partial \theta} \]

Because we are at steady state it follows that the derivative is equal to zero:

(4.13)#\[ S\frac{\partial f_v}{\partial \theta} + S\frac{\partial f_v}{\partial x}\frac{\partial x}{\partial \theta} = 0 \]

By rearranging the above equation we can isolate the derivative of the substrate concentration with respect to the parameter (equation eq4.14).

(4.14)#\[ \frac{d x}{d \theta} = -\left( S\frac{\partial f_v}{\partial x} \right)^{-1} S \frac{\partial f_v}{\partial \theta} \]

(4.15)#\[ \frac{\partial f_v}{\partial P} = \frac{\partial f_v}{\partial \theta} + \frac{\partial f_v}{\partial x}\frac{\partial x}{\partial \theta} \]

Given the flux vector $f_v(x, \theta)$, we can differentiate this with respect to the parameter of interest $\theta$. We can also distinguish between the change in parameters with respect to the local flux $f_v(f, \theta)$, and that of the global flux that results from the changes in metabolites $J$. Differentiating the flux vector with respect to the parameter distributions arrive at equation eq4.16.

(4.16)#\[ \frac{\partial J}{\partial P} = \left( I - \frac{\partial f_v}{\partial x} \left( S \frac{\partial f_v}{\partial x} \right)^{-1} S \right) \frac{\partial f_v}{\partial \theta} \]

By substituting eq4.14 into eq4.15, we arrive at equation #eq4.16 that is only dependent on the stoichiometric matrix, and the local sensitivities and elasticities.

Where the local elasticities is defined in equation eq4.17 and is commonly known as the elasticity matrix. The difference between the local and global quantities is the requirement to be at steady state. For example equation eq4.15 is at steady state, whereas the corresponding sensitivity matrix defined in equation eq4.18 is not.

(4.17)#\[ \epsilon = \frac{\partial f_v}{\partial x} \]

(4.18)#\[ \sigma = \frac{\partial f_v}{\partial \theta} \]

We refer to the global changes $\frac{\partial x}{\partial \theta}$ and $\frac{\partial J}{\partial \theta}$ represented in equations eq4.14 and eq4.16. Each of these equations includes the local sensitivity $\sigma$ multiplied by the global control coefficients – this ensures that the quantities are represented at the steady-state. Therefore, we can investigate the global quantities as separate values, which are referred to as control coefficients. These are the changes in the respective parameter and the local change in flux. The flux control coefficient and concentration control coefficients are defined below in equations eq4.19 eq4.20.

(4.19)#\[ C_{i,j}^J = I - \frac{\partial f_v}{\partial x} \left( S \frac{\partial f_v}{\partial x} \right)^{-1} S \]

(4.20)#\[ C_{i,k}^x = -\left( S\frac{\partial f_v}{\partial x} \right)^{-1} S \]

4.4.3.1. Notes on MCA#

Unscaled parameters were used because knockouts can cause the normalised constants to be divided by 0.
The units are whatever the units are that the users passed to Maud.

[1]

Moritz E. Beber, Mattia G. Gollub, Dana Mozaffari, Kevin M. Shebek, and Elad Noor. eQuilibrator 3.0 – a platform for the estimation of thermodynamic constants. arXiv:2103.00621 [q-bio], February 2021. URL: http://arxiv.org/abs/2103.00621 (visited on 2021-12-03), arXiv:2103.00621.

[2]

Robert N. Goldberg, Yadu B. Tewari, and Talapady N. Bhat. Thermodynamics of enzyme-catalyzed reactions—a database for quantitative biochemistry. Bioinformatics, 20(16):2874–2877, November 2004. URL: https://doi.org/10.1093/bioinformatics/bth314 (visited on 2022-07-11), doi:10.1093/bioinformatics/bth314.

[3]

H. Kacser and J. A. Burns. The control of flux. Symposia of the Society for Experimental Biology, 27:65–104, 1973. arXiv:4148886.

[4]

Wolfram Liebermeister, Jannis Uhlendorf, and Edda Klipp. Modular rate laws for enzymatic reactions: thermodynamics, elasticities and implementation. Bioinformatics, 26(12):1528–1534, June 2010. URL: https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btq141 (visited on 2019-09-23), doi:10.1093/bioinformatics/btq141.

[5]

Charles Margossian. Computing steady states with Stan's nonlinear algebraic solver. In Stan Conference 2018. January 2018. URL: 10.5281/zenodo.1284375, doi:10.5281/zenodo.1284375.

[6]

Elad Noor, Hulda S. Haraldsdóttir, Ron Milo, and Ronan M. T. Fleming. Consistent Estimation of Gibbs Energy Using Component Contributions. PLoS Computational Biology, 9(7):e1003098, July 2013. URL: https://dx.plos.org/10.1371/journal.pcbi.1003098 (visited on 2019-11-07), doi:10.1371/journal.pcbi.1003098.

[7]

Bernhard Ø. Palsson. Systems Biology: Constraint-based Reconstruction and Analysis. Cambridge University Press, Cambridge, 2015. ISBN 978-1-107-03885-1. URL: https://www.cambridge.org/core/books/systems-biology/7F8445BC87019806B3625DFC4B5C27D4 (visited on 2023-04-20), doi:10.1017/CBO9781139854610.

[8]

Sandra Placzek, Ida Schomburg, Antje Chang, Lisa Jeske, Marcus Ulbrich, Jana Tillack, and Dietmar Schomburg. BRENDA in 2017: new perspectives and new tools in BRENDA. Nucleic Acids Research, 45(D1):D380–D388, October 2016. URL: https://doi.org/10.1093/nar/gkw952, arXiv:https://academic.oup.com/nar/article-pdf/45/D1/D380/8847269/gkw952.pdf, doi:10.1093/nar/gkw952.

[9]

S. V. Popova and E. E. Sel'kov. Generalization of the model by monod, wyman and changeux for the case of a reversible monosubstrate reaction. FEBS Letters, 53(3):269–273, 1975. URL: https://febs.onlinelibrary.wiley.com/doi/abs/10.1016/0014-5793%2875%2980034-2 (visited on 2019-07-12), doi:10.1016/0014-5793(75)80034-2.

[10]

Peter St. John, Jonathan Strutz, Linda J Broadbelt, Keith E J Tyo, and Yannick J Bomble. Bayesian inference of metabolic kinetics from genome-scale multiomics data. bioRxiv, October 2018. URL: http://biorxiv.org/lookup/doi/10.1101/450163 (visited on 2019-02-19), doi:10.1101/450163.