Order structure in preference data
The preference datasets that drive reinforcement learning from human feedback (Christiano et al., 2017), alignment pipelines, and reward modelling (Ouyang et al., 2022) are almost always ordinal in their underlying structure: a labeller says that response is preferred to response , and that is the data. The standard pipeline then maps these preferences through a Bradley–Terry model (Bradley & Terry, 1952) or a near relative into real-valued reward scores, and training proceeds against those scores. The map gains tractability; what it loses is some of the order structure that the labellers actually provided — including the parts that say these two things are not comparable at all, and the parts that record this labeller is using a different criterion from that one.
This essay works through the alternative carefully: what an ordinal preference dataset actually is as a mathematical object, when it can be faithfully represented by a real-valued utility, what happens to the structure when one refuses to score it, and what the resulting order-theoretic picture suggests about disagreement, dimension, and the coherence of the underlying labelling process. The treatment follows the long lattice-theoretic tradition: classical references include Birkhoff (1967) and Davey & Priestley (2002); for FCA specifically, Ganter & Wille (1999).
1. Preferences as orders
Let be a set of items — say, candidate responses produced by a language model. A strict preference relation on is a binary relation which we read as “is strictly less preferred than.” The classical axioms ask that be a strict partial order, i.e. irreflexive and transitive:
We will not assume totality. Two items are comparable if or holds, and incomparable otherwise; we write in the latter case. Incomparability is not the same as indifference: indifference says “I see no preference between these two,” while incomparability says “the criterion you are asking me to apply does not return a verdict here.” Bradley–Terry collapses the two; the order-theoretic view keeps them apart.
A chain is a totally ordered subset of ; an antichain is a subset whose elements are pairwise incomparable. The height of is the length of its longest chain; its width is the size of its largest antichain. Dilworth’s theorem says that the width equals the minimum number of chains needed to cover . In the preference setting, width has an immediate interpretation: it is the number of distinct “non-mergeable” preference axes one would need to explain the data.
2. Utility representation — what counts as a score
A utility function is a map that represents in the sense that
The reward model in an RLHF pipeline is, in this language, a learned utility function. The question one rarely asks in that pipeline is whether the preference data being learned from admits a utility function at all.
The classical results on existence go back to the early twentieth century. For a countable totally ordered set , Cantor’s theorem on dense unbounded countable chains implies that a utility representation always exists — one can in fact map into . For uncountable totally ordered sets, existence is not automatic: the lexicographic order on has no real-valued utility representation, because any such representation would have to be injective on the uncountably many disjoint vertical intervals required by the second coordinate. The standard sufficient-condition theorem is due to Debreu (1954): a continuous total preorder on a connected separable topological space admits a continuous utility, and the gap lemma in (Debreu, 1954) is what makes the construction go through.
For partial orders, the situation is sharper still. A real-valued assigns to every pair exactly one of , so it can never witness genuine incomparability:
To use a real-valued utility on data that contains incomparabilities, one must either declare the incomparabilities to be indifferences (losing information) or extend the partial order to a total order (making a choice).
3. The Szpilrajn extension theorem
The “making a choice” path is principled, and it is principled in a classical way. Szpilrajn’s theorem (Szpilrajn, 1930) says that every strict partial order on a set extends to a strict total order on :
Let be a strict partial order on . Then there exists a strict total order on with .
The proof, by Zorn’s lemma, is short. The consequence is long: the set of all such linear extensions of is itself a non-trivial combinatorial object,
and almost everything order-theoretic about can be read off the structure of . The cardinality — the number of linear extensions — is a basic invariant. It is also -complete to compute in general (Brightwell & Winkler, 1991).
When one fits a Bradley–Terry score to a preference dataset, one is implicitly choosing one element of , weighted by likelihood under the BT model. The fact that can be astronomical is one order-theoretic way of saying that this implicit choice is doing more work than it appears to.
4. Interval orders and semiorders
Most real preference data lies between “fully total” and “arbitrary partial.” Two intermediate families are particularly relevant.
A partial order is an interval order if every can be assigned a real interval such that
Equivalently, an order is an interval order iff it contains no subposet isomorphic to (two disjoint two-element chains). This is Fishburn’s classical characterisation (Fishburn, 1985). Interval orders model the natural situation in which each item has a range of plausible utilities — different labellers, different sessions, different prompts — and one item is preferred to another only when the ranges do not overlap.
A semiorder is an interval order in which all intervals have the same length. Semiorders go back to (Luce, 1956) and formalise the “just-noticeable difference” intuition: two items within of each other are indistinguishable, two items more than apart are comparable. The set of semiorders on is a strict subclass of interval orders, characterised by the exclusion of as well as .
Both classes admit real-valued representations of a weakened kind — not utility functions in Debreu’s sense, but pairs of real-valued maps. For preference data that is intrinsically noisy rather than higher-dimensional, an interval-order representation is a strictly better fit than a single score.
5. Dimension
When the data is genuinely higher-dimensional — when labellers disagree because they care about different things — neither utility nor intervals will do. The classical order-theoretic invariant here is the order dimension , introduced by Dushnik and Miller in 1941:
Equivalently, is the smallest for which embeds order-preservingly into with the product (Pareto) order. A total order has dimension . A two-element antichain has dimension . Computing the dimension is -hard in general.
This invariant is a direct measure of how many independent reward signals are needed to faithfully reconstruct a preference dataset. A reward model that fits the data well at tells one something quite different from a reward model that does its best at . The latter is not a model of the preference — there is no such thing — but a projection of a five-dimensional structure onto one axis.
6. Cycles, consistency, and the data graph
In practice, preference datasets arrive not as an order but as a multiset of preference pairs
The relation that induces is the directed graph . The transitive closure is the smallest transitive relation containing . It is a partial order iff has no directed cycles. Cycles in are inconsistencies — either intrinsic (a labeller’s preferences are not transitive) or extrinsic (different labellers disagree). Either way, the standard scoring pipelines silently average them away.
A more honest pipeline begins with a cycle decomposition of . Let be the set of strongly connected components of . Contracting each non-trivial component to a point yields a DAG , whose transitive closure is a partial order on the contracted node set. The contracted nodes — the non-trivial strongly connected components — are exactly the places where the data is internally inconsistent. They are not noise to be smoothed; they are locations where the choice of representation is forced.
The size and distribution of non-trivial SCCs is, in this picture, the right first diagnostic for a preference corpus. It is also a diagnostic that takes constant time per edge.
7. A concept-lattice view of disagreement
The disagreement structure of a preference corpus admits a clean FCA treatment. Let be the set of items and let be the set of labellers. For each ordered pair , define a formal context
where iff labeller rendered verdict on the pair . The concepts of are the maximal subsets of labellers who agreed on a single verdict.
Glueing the per-pair contexts produces a global context whose objects are labellers and whose attributes are pair-verdict combinations. Concepts of the glued context describe coalitions of labellers that agreed on a maximal set of pair-verdicts. The order of concepts captures the inclusion structure of these coalitions, and the attribute concepts describe the minimal “argument” — the set of pair-verdicts — that distinguishes one coalition from another.
For a corpus of any realistic size the glued context is too large to visualise as a Hasse diagram in full, but it can be queried, scaled (see conceptual scaling), and projected onto subsets of pairs. The substantive payoff is that the structure of disagreement becomes a first-class object, rather than an averaged-out residual.
8. What changes operationally
This is an essay rather than a method paper; the operational claim that closes it is small and concrete. Before fitting a reward model, compute three things.
- The order dimension lower bound of on the cycle-free part of . If it is bounded by a small constant, a single score is defensible; if it grows with corpus size, it is not.
- The strongly connected components of . Inconsistencies are not noise; they are locations where the labelling instrument was registering more than one signal. Surfacing them is cheap; acting on them is the responsibility of whoever curates the corpus.
- A coalition concept lattice for the labellers, computed on a manageable subset of pairs. The concept structure is the non-scalar summary of who agrees with whom about what.
None of these computations is large. Each one tells a reward model something its loss function cannot. The classical theory of orders exists in part because real preference data has always been fundamentally ordinal; that the recent ML pipelines have re-discovered this — and that the tools for working with it have been on the shelf since Birkhoff and Debreu — is a coincidence worth making explicit.
A longer companion piece, including a worked example on a small publicly available preference dataset and a concrete recipe for the coalition concept lattice on a per-prompt basis, is in preparation.
Further reading. For lattice and order theory, see (Birkhoff, 1967) and the more accessible (Davey & Priestley, 2002). For measurement theory and the representation question, (Krantz et al., 1971) is the canonical reference. For FCA as a tool for the disagreement analysis sketched in §7, (Ganter & Wille, 1999) is the textbook; (Wille, 1982) sets out the philosophical motivation. For enumerative aspects of and the order polytope, (Stanley, 1986).