A Markov Chain Approach to Preference Alignment

Takuya Koriyama; Tengyuan Liang

June 2026. Takuya Koriyama, Tengyuan Liang. arXiv preprint.

Abstract:

We propose Markov Chain from Human Feedback (MCHF), an elementary approach for aligning generative models from pairwise human preferences. Unlike Reinforcement Learning from Human Feedback (RLHF), which reduces comparisons to a scalar reward, and Nash Learning from Human Feedback (NLHF), which preserves pairwise utilities through a KL-regularized minimax optimization, MCHF uses pairwise preferences directly to define a transition mechanism over model outputs. Given a pairwise utility $U(x,y)$, which quantifies human preference for $y$ over $x$, and a reference probability distribution $\mu_{\mathsf{ref}}$, we define a Markov kernel and take the Markov chain started from $\mu_{\mathsf{ref}}$ as an iterative alignment procedure. We show that MCHF converges geometrically fast to its stationary distribution, with a convergence rate governed by a seminorm that quantifies the non-transitive structure of the pairwise utility. We further show that a mirror-descent algorithm for NLHF satisfies an analogous structure-adaptive convergence guarantee. Finally, through a perturbation analysis, we prove that when this seminorm is small, MCHF and NLHF agree up to first order around an RLHF solution, which yields a unified view of reward-based, game-theoretic, and Markovian approaches to alignment.
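
To make the idea of "iterating a Markov kernel from $\mu_{\mathsf{ref}}$" concrete, here is a minimal sketch on a finite output space. The abstract does not state the form of the kernel, so the one used here, $K(x,y) \propto \mu_{\mathsf{ref}}(y)\exp(\beta\,U(x,y))$, is purely an illustrative assumption and should not be read as the paper's construction. The names `make_kernel`, `step`, `run_chain`, and the parameter `beta` are likewise hypothetical.

```python
import math

def make_kernel(U, mu_ref, beta=1.0):
    """Build a row-stochastic matrix from a pairwise utility.

    ASSUMPTION (not from the paper): K[x][y] is proportional to
    mu_ref[y] * exp(beta * U[x][y]), normalized over y.
    """
    n = len(mu_ref)
    K = []
    for x in range(n):
        weights = [mu_ref[y] * math.exp(beta * U[x][y]) for y in range(n)]
        total = sum(weights)
        K.append([w / total for w in weights])
    return K

def step(mu, K):
    """One step of the chain on distributions: mu_next[y] = sum_x mu[x] * K[x][y]."""
    n = len(mu)
    return [sum(mu[x] * K[x][y] for x in range(n)) for y in range(n)]

def run_chain(U, mu_ref, steps=50, beta=1.0):
    """Start at mu_ref and apply the kernel repeatedly, recording every iterate."""
    K = make_kernel(U, mu_ref, beta)
    mu = list(mu_ref)
    history = [mu]
    for _ in range(steps):
        mu = step(mu, K)
        history.append(mu)
    return K, history

# Toy example with a fully transitive utility U(x, y) = r[y] - r[x].
r = [0.0, 1.0, 2.0]
U = [[r[y] - r[x] for y in range(3)] for x in range(3)]
mu_ref = [1 / 3, 1 / 3, 1 / 3]

K, history = run_chain(U, mu_ref)
mu_final = history[-1]
print(mu_final)
```

Under this assumed kernel, a purely transitive utility makes every row of `K` identical, so the chain reaches its stationary distribution, proportional to $\mu_{\mathsf{ref}}(y)\exp(r(y))$, after a single step. A utility with a non-transitive (cyclic) component would make the rows differ, and the iterates in `history` would then approach the stationary distribution only gradually.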

Citation

Takuya Koriyama and Tengyuan Liang. 2026. “A Markov Chain Approach to Preference Alignment.” arXiv:2606.22652.

@misc{KoriyamaLiang2026,
  title = {A Markov Chain Approach to Preference Alignment},
  author = {Koriyama, Takuya and Liang, Tengyuan},
  year = {2026},
  eprint = {2606.22652},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2606.22652},
}
