com.eignex.kumulant/bandit/univariate/TopTwoThompsonBandit

TopTwoThompsonBandit

class TopTwoThompsonBandit<R : Result>(val nbrArms: Int, val policy: ThompsonSampling<R>, val beta: Double = 0.5, val maxResamples: Int = 32, val random: Random = Random.Default) : UnivariateBandit, PerArmBandit<R> (source)

Top-Two Thompson Sampling (Russo 2020); pure-exploration variant of Thompson sampling for best-arm identification: sample every arm's posterior, take the argmax arm1, play it with probability beta, or else resample until the argmax differs from arm1 and play that runner-up.

The forced resampling keeps a fraction of the budget on the runner-up so the gap to the best arm is identified asymptotically optimally. Converges to the optimal exploration fraction beta = 0.5 for two-armed problems; tune lower to bias toward exploitation when running in the regret-minimisation regime.

Doesn't expose com.eignex.kumulant.bandit.Scorable: arm selection samples jointly and conditionally resamples, so there's no per-arm score callers can read in isolation. The per-arm posterior state still fits PerArmBandit.

Use cases: best-arm identification and pure-exploration problems where shrinking the regret gap matters more than minimising cumulative regret; any posterior expressible as a ThompsonSampling policy.

Arms: indexless, nbrArms ≥ 2 fixed at construction; each arm owns one posterior cell from policy.createArm().

Memory: O(nbrArms · arm-state); per-arm posterior plus a step counter.

Choose: O(nbrArms) expected; O(maxResamples · nbrArms) worst case when the second-arm resample loop spins to the cap.

Update: O(1) on the targeted arm, delegated to policy.update.

Randomness: every posterior sample and the beta coin flip use the caller-supplied random; reproducible under a fixed seed if the policy is.

Concurrency: per-arm com.eignex.kumulant.core.SeriesStat carries its own concurrency. The step counter is non-atomic; concurrent choose calls race on it. Cross-arm snapshot consistency during choose is best-effort under racing updates.

Constructors

TopTwoThompsonBandit

constructor(nbrArms: Int, policy: ThompsonSampling<R>, beta: Double = 0.5, maxResamples: Int = 32, random: Random = Random.Default)(source)

Properties

beta

val beta: Double(source)

Probability of playing the round's top sample; default 0.5 is Russo's recipe.

maxResamples

val maxResamples: Int(source)

Cap on the resample loop when searching for a different second arm.

nbrArms

open override val nbrArms: Int(source)

Number of arms.

policy

val policy: ThompsonSampling<R>(source)

Per-arm posterior + arm spec.

random

open override val random: Random(source)

Single source of randomness.

Functions

armResult

open override fun armResult(armIndex: Int): R(source)

Per-arm snapshot at armIndex. Default implementation reads from the full snapshot; implementations may override to avoid building the entire list when only one arm is needed.

choose

open override fun choose(): Int(source)

Choose an arm via the top-two protocol.

create

open override fun create(random: Random): TopTwoThompsonBandit<R>(source)

Spawn a fresh bandit with the same configuration; state resets to the prior seed. The random source is replaced; pass the source you want the new bandit to use for exploration (which is independent of merging in another snapshot's state).

Useful when a worker accepts a stream of snapshots to apply sequentially: create(random).also { it.merge(snapshot) }.

merge

open override fun merge(other: List<R>)(source)

Fold another replica's other state into this bandit. Most families merge exactly via the underlying stat's parallel-merge formula; SGD- based contextual bandits merge approximately. Each concrete bandit's KDoc documents its merge semantics.

reset

open override fun reset()(source)

Clear all state back to the prior-seeded baseline. Equivalent to spawning a fresh bandit with the same configuration via Snapshotable.create, but in place; keeps the same arm count, policy, concurrency mode, and random instance.

sampleArgmax

fun sampleArgmax(): Int(source)

Single posterior sample per arm; return the argmax.

snapshot

open override fun snapshot(): List<R>(source)

Materialise the current state as a serialisable snapshot. Reads are non-mutating; call as often as needed without affecting decisions. Same snapshot consistency rules as com.eignex.kumulant.core.Stat.read ; under com.eignex.kumulant.core.Concurrency.Relaxed coupled cells may drift by ULPs.

update

open override fun update(armIndex: Int, value: Double, weight: Double = 1.0)(source)

Fold a single observed reward value into the arm at armIndex with the given weight. Weight is the same observation-weight that runs through the rest of the library; typically 1.0, occasionally importance-weighted for off-policy correction.

Index out of range throws; some bandits also bound-check the value (e.g. Bernoulli arms require value in {0.0, 1.0}).

updateAll

open fun updateAll(armIndices: IntArray, values: DoubleArray, weights: DoubleArray? = null)

Batched update: fold one observation per arm/value pair in a single call. Equivalent to looping update but skips per-call overhead and may take a per-bandit lock once.