From ELBO to EEG: A Practical Guide to Variational Inference for BCI Engineers

Probabilistic models are everywhere in modern BCI research. You've probably heard that Active Inference minimises free energy, that Bayesian decoders give you calibrated posteriors, or that real-time EEG pipelines can adapt online. But if you've ever tried to implement any of this from scratch, you've quickly hit the same wall: exact Bayesian inference is intractable for any model complex enough to be useful.
The practical answer to that wall is variational inference (VI). It's the algorithm that makes probabilistic BCI models run in real time. Once you understand it, the rest of Active Inference — free energy, precision weighting, expected free energy — snaps into place. If you want a systems-level intro first, start with What Is Active Inference? A Practical Primer for BCI Engineers. This post builds the mathematical foundation underneath it.
Why Exact Inference Breaks Down for EEG
Every probabilistic model asks the same question: given some observations $x$ (your EEG data) and a model with latent states $z$, what is the posterior $p(z \mid x)$?
By Bayes' rule:

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)}$$

The denominator $p(x) = \int p(x \mid z)\, p(z)\, dz$ is the model evidence: an integral over all possible latent states. For a model with continuous latent variables (neural source amplitudes, attention states, motor intention vectors), this integral has no closed form. You can't compute it exactly.
This is not a niche problem. It appears the moment your generative model goes beyond a simple Gaussian linear model. For any BCI system handling real EEG — non-stationary, multi-channel, session-variable — exact inference is off the table.
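To make the intractability concrete, here is a toy sketch (my own illustrative example, not from any particular pipeline). For a one-dimensional conjugate Gaussian model the evidence integral happens to have a closed form, and even brute-force quadrature works; the point is that quadrature cost grows exponentially with the number of latent dimensions.

```python
import numpy as np

# 1-D conjugate Gaussian model: z ~ N(0, 1), x | z ~ N(z, 1).
# In this special case the evidence has a closed form: x ~ N(0, 2).
x = 1.3
closed_form = np.exp(-x**2 / (2 * 2)) / np.sqrt(2 * np.pi * 2)

# Brute-force the same integral p(x) = ∫ p(x|z) p(z) dz on a grid.
zs = np.linspace(-10, 10, 20001)
dz = zs[1] - zs[0]
prior = np.exp(-zs**2 / 2) / np.sqrt(2 * np.pi)
lik = np.exp(-(x - zs)**2 / 2) / np.sqrt(2 * np.pi)
grid = np.sum(lik * prior) * dz            # matches the closed form

# But a grid of 20001 points per dimension needs 20001**d evaluations
# for d latent dimensions: hopeless for realistic EEG source models.
print(closed_form, grid)
```

The moment the likelihood stops being conjugate to the prior (a nonlinear observation model, a non-Gaussian noise term), even the one-dimensional closed form disappears and only the grid remains.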
The ELBO: Turning Inference into Optimisation
Variational inference sidesteps the intractable integral by reframing the problem. Instead of computing the true posterior exactly, we pick an approximate distribution $q(z)$ from a tractable family (e.g. a factorised Gaussian) and find the $q$ that is closest to the true posterior $p(z \mid x)$.
Closeness is measured by the KL divergence $\mathrm{KL}[\,q(z)\,\|\,p(z \mid x)\,]$. Minimising this divergence directly is still hard because it involves the unknown posterior $p(z \mid x)$. But with a little algebra you can show:

$$\log p(x) = \underbrace{\mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)]}_{\mathrm{ELBO}(q)} + \mathrm{KL}[\,q(z)\,\|\,p(z \mid x)\,]$$

Because $\log p(x)$ is a constant with respect to $q$, and the KL divergence is non-negative, maximising the ELBO is equivalent to minimising the KL divergence. The ELBO (Evidence Lower Bound) is your tractable proxy objective: it only involves the joint $p(x, z)$ and your own $q(z)$, both of which you can evaluate.
The two terms in the ELBO have intuitive interpretations:
- $\mathbb{E}_{q(z)}[\log p(x, z)]$ — accuracy: how well does your approximate posterior explain the data under the model?
- $-\mathbb{E}_{q(z)}[\log q(z)] = \mathbb{H}[q]$ — entropy: how uncertain (exploratory) is your approximate posterior?
If you've read about Active Inference, this decomposition should look familiar. Variational free energy is simply the negative ELBO: $F = -\mathrm{ELBO}(q)$. Minimising free energy is maximising the evidence lower bound. The two literatures are describing the same computation.
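The ELBO identity is easy to check numerically. The following sketch reuses a hypothetical one-dimensional conjugate model ($z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z,1)$, so the exact posterior and evidence are known) and verifies that $\log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}$ holds for any Gaussian $q$, with the bound tight exactly when $q$ equals the true posterior.

```python
import numpy as np

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1). The exact posterior is
# N(x/2, 1/2) and the log evidence is log N(x; 0, 2).
x = 1.3
log_evidence = -0.5 * np.log(2 * np.pi * 2) - x**2 / (2 * 2)

def elbo(m, s2):
    """ELBO = E_q[log p(x, z)] + H[q] for q = N(m, s2)."""
    e_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s2)
    e_logprior = -0.5 * np.log(2 * np.pi) - 0.5 * (m**2 + s2)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
    return e_loglik + e_logprior + entropy

def kl_to_posterior(m, s2):
    """KL(q || p(z|x)) between two 1-D Gaussians."""
    mp, s2p = x / 2, 0.5
    return 0.5 * (np.log(s2p / s2) + (s2 + (m - mp) ** 2) / s2p - 1)

# The identity log p(x) = ELBO(q) + KL(q || posterior) holds for ANY q:
for m, s2 in [(0.0, 1.0), (2.0, 0.3), (x / 2, 0.5)]:
    assert np.isclose(elbo(m, s2) + kl_to_posterior(m, s2), log_evidence)

# At q = exact posterior the KL term vanishes and the bound is tight:
assert np.isclose(elbo(x / 2, 0.5), log_evidence)
```

Notice that `elbo` never touches the posterior or the evidence, yet maximising it over `(m, s2)` recovers both.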
Message Passing: How VI Runs in Real Time
Maximising the ELBO over all parameters simultaneously is expensive. The practitioner's trick is coordinate ascent variational inference (CAVI): hold all but one factor of fixed, update that factor analytically, repeat. This turns a global optimisation into a sequence of local updates.
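Here is what one CAVI loop looks like for a deliberately simple model: a Normal with unknown mean and precision under a mean-field factorisation $q(\mu)q(\tau)$ (the textbook example from Bishop's PRML, §10.1.3). The priors and synthetic data below are illustrative, not from any real pipeline.

```python
import numpy as np

# CAVI for x_i ~ N(mu, tau^-1) with conjugate priors:
# mu ~ N(mu0, (lam0 * tau)^-1), tau ~ Gamma(a0, b0).
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=500)           # synthetic "data"
n, xbar = len(x), x.mean()
mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3    # vague priors

e_tau = 1.0                                   # initial guess for E[tau]
for _ in range(20):                           # repeat local updates
    # Update q(mu) = N(m, 1/lam) holding q(tau) fixed:
    m = (lam0 * mu0 + n * xbar) / (lam0 + n)
    lam = (lam0 + n) * e_tau
    e_mu, e_mu2 = m, m**2 + 1.0 / lam         # moments of q(mu)
    # Update q(tau) = Gamma(a, b) holding q(mu) fixed:
    a = a0 + (n + 1) / 2
    b = b0 + 0.5 * (lam0 * (e_mu2 - 2 * e_mu * mu0 + mu0**2)
                    + np.sum(x**2) - 2 * e_mu * np.sum(x) + n * e_mu2)
    e_tau = a / b

# Should recover roughly the generating parameters (mu = 2, tau = 1):
print(m, e_tau)
```

Each update only needs expectations under the *other* factor, which is exactly the locality that message passing exploits at scale.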
For graphical models — which is what most BCI generative models are — these updates decompose across the graph structure into message-passing operations. Algorithms like Belief Propagation and Variational Message Passing (VMP) implement exactly this: local nodes receive messages from their neighbours, update their beliefs, and pass updated messages forward.
This matters for real-time EEG because:
- Incremental updates are cheap. When a new EEG sample arrives, you don't re-run inference from scratch — you propagate a message through the relevant part of the graph.
- The graph structure encodes your priors. Temporal dynamics, spatial covariance, and session-level drift all live in the graph, not in ad-hoc post-processing steps.
- Uncertainty propagates automatically. Downstream nodes receive distributions, not point estimates, so your action selection (or classifier) always knows how confident the current estimate is.
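To see the incremental flavour, here is a minimal hand-rolled sketch (not the API of any particular framework): a one-dimensional random-walk state with Gaussian observations, where each new sample triggers only two local message updates, a forward prediction through the dynamics node and an observation correction, and the belief stays a full distribution throughout.

```python
import numpy as np

def step(mean, var, y, q=0.01, r=1.0):
    """One message-passing update on a scalar random-walk model:
    predict through the dynamics (process noise q), then absorb
    the new observation y (observation noise r)."""
    mean_pred, var_pred = mean, var + q          # forward message
    k = var_pred / (var_pred + r)                # observation weight
    mean_new = mean_pred + k * (y - mean_pred)
    var_new = (1 - k) * var_pred
    return mean_new, var_new

rng = np.random.default_rng(1)
mean, var = 0.0, 10.0                            # vague initial belief
for y in rng.normal(1.5, 1.0, size=200):         # stream of "EEG features"
    mean, var = step(mean, var, y)               # O(1) work per sample

# Downstream consumers get (mean, var), not just a point estimate:
print(mean, var)
```

The cost per sample is constant no matter how long the session runs; in a real factor graph the same pattern repeats per node, with vector-valued messages.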
Modern Julia-based probabilistic programming frameworks — including those that underpin Nimbus's inference stack — implement VMP natively, letting you write a generative model in high-level code while the framework handles much of the message-passing inference machinery.
From ELBO to Active Inference: The Connection
Active Inference extends this framework to the action side. An agent doesn't just infer hidden states — it also selects actions that minimise expected free energy (EFE) over future time steps.
Expected free energy decomposes into:
- Risk (extrinsic, goal-directed term): how far are expected future outcomes from preferred outcomes (your desired motor output, cursor position, or communication intent)?
- Ambiguity (epistemic term): how uncertain are the observations an action is expected to produce? Low-ambiguity actions yield informative observations, which is what reduces uncertainty about hidden states.
Minimising EFE drives the agent to simultaneously achieve goals and gather information. For BCI control, this is exactly the behaviour you want: a decoder that confidently selects actions when it is certain, and withholds or queries when it isn't.
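A minimal numerical sketch of the decomposition, using the common discrete-state form $G = \mathrm{KL}[\,q(o)\,\|\,p(o)\,] + \mathbb{E}_{q(s)}\,\mathbb{H}[p(o \mid s)]$. The two-state likelihood matrix, preferences, and predicted state beliefs below are all made-up numbers for illustration.

```python
import numpy as np

def efe(qs, A, log_c):
    """Expected free energy for predicted state beliefs qs, likelihood
    A[o, s] = p(o|s), and log-preferences over outcomes log_c."""
    qo = A @ qs                                        # predicted outcomes
    risk = np.sum(qo * (np.log(qo + 1e-16) - log_c))   # KL(q(o) || p(o))
    h = -np.sum(A * np.log(A + 1e-16), axis=0)         # H[p(o|s)] per state
    ambiguity = h @ qs                                 # expected ambiguity
    return risk + ambiguity

A = np.array([[0.9, 0.2],                  # rows = outcomes, cols = states
              [0.1, 0.8]])
log_c = np.log(np.array([0.95, 0.05]))     # strong preference for outcome 0

# Action 1 drives the system toward state 0, action 2 toward state 1:
g1 = efe(np.array([0.9, 0.1]), A, log_c)
g2 = efe(np.array([0.1, 0.9]), A, log_c)
assert g1 < g2    # the preference-matching action has lower EFE
```

Sampling actions via a softmax over $-G$ then gives exactly the behaviour described above: confident selection when predictions match preferences, hesitation when outcomes are ambiguous.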
The key insight is that both perception (inferring $z$ from $x$) and action (selecting policies to minimise EFE) are grounded in the same variational objective. In online systems, inference and adaptation can be tightly coupled rather than treated as fully separate stages: the model updates beliefs, evaluates uncertainty, and selects actions as part of one continuous loop.
Practical Implications for Your EEG Pipeline
Understanding VI has direct consequences for how you build and debug probabilistic BCI models:
Choose your variational family carefully. A mean-field (fully factorised) approximation is fast but assumes posterior independence between latent variables. For EEG, where channel and time correlations are real and informative, structured approximations (e.g. Gaussian with full or block-diagonal covariance) are often worth the extra cost.
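The cost of mean-field is easy to quantify in the Gaussian case: the factorised $q$ that minimises $\mathrm{KL}[q \,\|\, p]$ has per-dimension precision equal to the diagonal of the posterior precision matrix (a standard result, e.g. Bishop PRML §10.1.2), which underestimates the true marginal variances whenever latents are correlated. The covariance numbers below are arbitrary.

```python
import numpy as np

# True posterior covariance for two strongly correlated latents
# (think: two nearby EEG channels driven by the same source):
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
Lambda = np.linalg.inv(Sigma)              # posterior precision matrix

# Optimal mean-field Gaussian: variance 1 / Lambda_ii per dimension.
mean_field_var = 1.0 / np.diag(Lambda)
true_marginal_var = np.diag(Sigma)
print(mean_field_var, true_marginal_var)   # 0.36 vs 1.0 per dimension
```

A 64% underestimate of posterior variance translates directly into overconfident downstream action selection, which is why structured approximations often pay for themselves.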
Monitor the ELBO, not just accuracy. A model can achieve high decoding accuracy while fitting a poor approximate posterior. Tracking the ELBO tells you whether your variational distribution is actually approximating the true posterior or just memorising labels.
Warm-start from the previous time step. In online inference, initialise at each new sample from the posterior at the previous sample. This reduces the number of VMP iterations needed per update and keeps latency bounded.
Precision parameters are variational parameters too. Precision weighting in Active Inference (the mechanism by which the model attends to reliable channels and ignores noisy ones) is learned by treating precision as a latent variable with its own approximate posterior. When a channel goes noisy mid-session, the model can downgrade its precision automatically, reducing reliance on manual artefact rejection.
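As a sketch of the idea (a conjugate Gamma update on a per-channel noise precision; the hyperparameters and channel data below are invented for illustration):

```python
import numpy as np

# Treat each channel's noise precision tau as a latent variable with a
# Gamma(a0, b0) prior. Given residuals r_i, the conjugate posterior is
# Gamma(a0 + n/2, b0 + sum(r_i^2)/2).
rng = np.random.default_rng(2)
a0, b0 = 1.0, 1.0

def posterior_precision(residuals):
    """E[tau] under the Gamma posterior given channel residuals."""
    n = len(residuals)
    a = a0 + n / 2
    b = b0 + np.sum(residuals**2) / 2
    return a / b

clean = rng.normal(0.0, 1.0, 200)    # well-behaved channel
noisy = rng.normal(0.0, 5.0, 200)    # channel that went bad mid-session

# The inferred precision automatically down-weights the noisy channel:
assert posterior_precision(clean) > posterior_precision(noisy)
```

In a full model the residuals themselves depend on the inferred states, so this precision update becomes just another message in the same VMP loop.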
Conclusion
If you've been reading about Active Inference and feeling like the maths is floating slightly out of reach, start here. Once the ELBO clicks, everything else — free energy, expected free energy, policy selection — is just notation layered on top of the same core idea: approximate the posterior, maximise the bound, repeat.
If you want a deeper dive on the computational substrate, Factor Graphs and Message Passing: The Engine Behind Real-Time Bayesian BCI and Reactive Message Passing for BCI: How RxInfer.jl Brings Active Inference to Real Time are good follow-ons.