stat.regression.tree

Online VFDT decision trees and random forests, plus the shared machinery they're built on. The package covers both regression (continuous y) and classification (y in [0, numClasses)) under one consistent shape.

Picking a tree stat

Stat	Output	Leaf state
DecisionTreeRegressionStat	Continuous `y` prediction	com.eignex.kumulant.stat.summary.WeightedVarianceResult
RandomForestRegressionStat	Bagged ensemble of regression trees	One DecisionTreeRegressionStat per tree
DecisionTreeClassifierStat	K-way class probabilities	ClassCountsResult
RandomForestClassifierStat	Bagged ensemble of classifier trees; predictions average per-class probabilities	One DecisionTreeClassifierStat per tree

Each takes a list of Split candidates (axis-aligned thresholds or arbitrary ExprSplit expressions) and a config: RegressionTreeConfig for the regression side, ClassificationTreeConfig for the classification side. A split fires when the best candidate beats the runner-up by the Hoeffding bound on the configured metric: VarianceReduction for regression, GiniReduction / InformationGain for classification.

Split candidates

ThresholdSplit is the standard axis-aligned predicate x[i] <= t. ExprSplit takes any com.eignex.kumulant.schema.BoolExpr for non-axis-aligned splits, which is useful when you want a tree to split on a derived feature (x[0] + x[1] > 0, (x[0] - x[1]).abs() > 1) without materialising the feature in the input vector.

Pass an empty splitCandidates list to disable growth; the stat then degenerates into a single global accumulator. For the random forests, config.mtry defaults to ceil(sqrt(p)), Breiman-style.
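The mtry default can be sketched in a few lines. This is a standalone illustration, not kumulant code: resolveMtry is a hypothetical helper, and the source does not say whether p counts features or split candidates, so the sketch takes p as a plain count.

```kotlin
import kotlin.math.ceil
import kotlin.math.sqrt

// Hypothetical helper, not part of kumulant: resolve an optional mtry
// against p available items, falling back to ceil(sqrt(p)).
fun resolveMtry(mtry: Int?, p: Int): Int {
    if (p <= 0) return 0
    val fallback = ceil(sqrt(p.toDouble())).toInt()
    // Clamping to 1..p is an assumption made for this sketch.
    return (mtry ?: fallback).coerceIn(1, p)
}
```

With p = 10 the fallback is 4; an explicit mtry of 3 is returned unchanged.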

Posteriors

TreePosterior (and its forest counterpart ForestPosterior) turn a TreeRegressionResult / ForestRegressionResult into a scalar score:

MeanTreePosterior / MeanForestPosterior: deterministic leaf mean.
ThompsonTreePosterior / ThompsonForestPosterior: Normal-Gamma draw from the leaf's (mean, variance, totalWeights) triplet.
UcbTreePosterior / UcbForestPosterior: UCB-style mean + exploration * sqrt(variance / (totalWeights + priorWeight)).

These plug into com.eignex.kumulant.bandit.contextual.RegressionContextualBandit for non-linear contextual bandits with the same Thompson / UCB / mean choice that the GLM side has via the com.eignex.kumulant.stat.regression.glm.LinearPosterior family.

Merge

All four merge approximately through tree.mergeSnapshot(), which combines matching leaf accumulators: weighted Welford per leaf for the regression trees, cell-wise class-count sums for the classifiers, per-tree for the forests. The combine assumes the tree structures line up (same split history); divergent trees merge only their shared leaves. For distributed training, prefer feeding one stat the ordered stream and treating merge as a roll-up.

Concurrency

The hot update path touches exactly one accumulator: the leaf the observation routes to. Internal split nodes carry no live arm; subtree aggregates are derived by combining descendants at snapshot time. Each leaf arm honours the com.eignex.kumulant.core.Concurrency level passed in, so multiple threads landing in different leaves never contend. Split conversion takes a per-tree lock that is acquired only at split decisions. See RegressionTree / ClassificationTree for the full concurrency design.

Types

ClassCountsResult

@Serializable

@SerialName(value = "ClassCountsResult")

data class ClassCountsResult(val numClasses: Int, val counts: DoubleArray) : Result

Snapshot of a K-bin weighted class-count vector. Used as the leaf aggregate for ClassificationTree: the per-leaf running tally of how many (weighted) observations of each class landed in this leaf. Exposes derived class probabilities and the argmax prediction.
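As a rough illustration of the derived quantities mentioned above, the following standalone sketch turns a weighted count vector into class probabilities and an argmax prediction. The function names are hypothetical and the uniform fallback for an empty leaf is an assumption; the actual accessors on ClassCountsResult are not shown in this page.

```kotlin
// Standalone sketch; these function names are hypothetical, not kumulant API.
fun classProbabilities(counts: DoubleArray): DoubleArray {
    val total = counts.sum()
    // Assumption: an empty leaf falls back to a uniform distribution.
    if (total <= 0.0) return DoubleArray(counts.size) { 1.0 / counts.size }
    return DoubleArray(counts.size) { counts[it] / total }
}

fun argmaxClass(counts: DoubleArray): Int {
    require(counts.isNotEmpty()) { "need at least one class" }
    var best = 0
    // Ties resolve to the lowest class index.
    for (k in 1 until counts.size) if (counts[k] > counts[best]) best = k
    return best
}
```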

ClassCountsStat

class ClassCountsStat(val numClasses: Int, val concurrency: Concurrency = Concurrency.None) : SeriesStat<ClassCountsResult>

Series stat over discrete class-index inputs. update(value, weight) interprets value.toInt() as the class index; out-of-range values are dropped. Each class cell is an independent striped atomic adder so updates commute under any Concurrency level.

ClassificationAuditLeaf

class ClassificationAuditLeaf(val arm: SeriesStat<ClassCountsResult>, val candidates: List<SerializableSplit>, val pos: List<SeriesStat<ClassCountsResult>>, val neg: List<SeriesStat<ClassCountsResult>>) : ClassificationLeafNode

Audit leaf tracking per-candidate pos/neg class-count accumulators.

ClassificationLeafNode

sealed class ClassificationLeafNode : ClassificationNode

Leaf node that owns a per-class count accumulator.

ClassificationNode

sealed interface ClassificationNode

Classification mirror of RegressionNode. Identical structure to the regression side except that leaf arms carry class-count snapshots rather than weighted-variance summaries.

ClassificationSplitMetric

@Serializable

sealed interface ClassificationSplitMetric

Classification analogue of SplitMetric: scores a candidate split using the leaf's per-class counts. Higher is better; returns 0 when the split has no signal (one side empty, or zero impurity reduction).

ClassificationSplitNode

class ClassificationSplitNode(val split: SerializableSplit, pos: ClassificationNode, neg: ClassificationNode, carryover: SeriesStat<ClassCountsResult>? = null) : ClassificationNode

Classification mirror of RegressionSplitNode: the predicate routes to pos or neg, and carryover absorbs orphan aggregates produced by mixed-structure merges or pre-split snapshots.

ClassificationTerminalLeaf

class ClassificationTerminalLeaf(val arm: SeriesStat<ClassCountsResult>) : ClassificationLeafNode

Frozen leaf; no further splits considered.

ClassificationTree

class ClassificationTree(numClasses: Int, splitCandidates: List<SerializableSplit>, config: ClassificationTreeConfig = ClassificationTreeConfig(), concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<ClassCountsResult> = { ClassCountsStat(numClasses, concurrency) }, randomSeed: Int = 0)

Classification mirror of RegressionTree: online VFDT decision tree where each leaf carries a per-class count accumulator and audit leaves track class counts per candidate split. Splits fire when a candidate clears the Hoeffding bound on the configured ClassificationSplitMetric (Gini or information gain).

ClassificationTreeConfig

@Serializable

data class ClassificationTreeConfig(val delta: Double = 0.05, val deltaDecay: Double = 0.9, val tau: Double = 0.05, val minSamplesSplit: Double = 30.0, val minSamplesLeaf: Double = 5.0, val splitPeriod: Int = 10, val maxDepth: Int = 16, val maxNodes: Int = 1024, val metric: ClassificationSplitMetric = GiniReduction, val mtry: Int? = null)

Classification analogue of RegressionTreeConfig. Same tunables, but the split metric defaults to GiniReduction and the criterion is a ClassificationSplitMetric.

DecisionTreeClassifierStat

class DecisionTreeClassifierStat(val featureSize: Int, val numClasses: Int, val splitCandidates: List<SerializableSplit>, val config: ClassificationTreeConfig = ClassificationTreeConfig(), val concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<ClassCountsResult> = { ClassCountsStat(numClasses, concurrency) }, randomSeed: Int = 0) : RegressionStat<TreeClassificationResult>

Online VFDT decision-tree classifier, the classification counterpart of DecisionTreeRegressionStat. Each leaf carries a ClassCountsResult (per-class weighted counts); splits fire when a candidate beats the runner-up by the Hoeffding bound on Gini reduction or information gain.

DecisionTreeRegressionStat

class DecisionTreeRegressionStat(val featureSize: Int, val splitCandidates: List<SerializableSplit>, val config: RegressionTreeConfig = RegressionTreeConfig(), val concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<WeightedVarianceResult> = { VarianceStat(concurrency) }, randomSeed: Int = 0) : RegressionStat<TreeRegressionResult>

Online VFDT decision-tree regressor: a piecewise-constant predictor over the feature space that grows on the fly via the Hoeffding bound. Wraps a RegressionTree in the kumulant RegressionStat contract so it composes with everything that consumes regressors (the bandit family, schemas, op pipelines).

ExprSplit

@Serializable

@SerialName(value = "ExprSplit")

data class ExprSplit(val expr: BoolExpr) : SerializableSplit

Route by an arbitrary BoolExpr evaluated against the context vector. The expression sees the context's first coordinate as X, the second as Y, and the full vector via V(i), matching the existing kumulant AST conventions. Wire-portable through skema's polymorphism on BoolExpr.

ForestClassificationResult

@Serializable

@SerialName(value = "ForestClassificationResult")

data class ForestClassificationResult(val numClasses: Int, val trees: List<TreeClassificationResult>) : Result

Snapshot of a RandomForestClassifierStat: per-tree immutable snapshots plus ensemble-aware predict helpers.
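The ensemble prediction averages per-class probabilities across trees. A minimal standalone sketch of that averaging step follows; averageProbabilities is a hypothetical helper operating on plain arrays, not one of the predict helpers of ForestClassificationResult.

```kotlin
// Hypothetical helper: each element of perTree is one tree's class-probability
// vector for the same query; the result is their unweighted mean.
fun averageProbabilities(perTree: List<DoubleArray>): DoubleArray {
    require(perTree.isNotEmpty()) { "need at least one tree" }
    val k = perTree[0].size
    val out = DoubleArray(k)
    for (p in perTree) {
        require(p.size == k) { "all trees must agree on numClasses" }
        for (c in 0 until k) out[c] += p[c]
    }
    for (c in 0 until k) out[c] /= perTree.size
    return out
}
```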

ForestPosterior

sealed interface ForestPosterior : RegressionPosterior<ForestRegressionResult>

TreePosterior family ported to forests: every leaf snapshot the query routes to is merged into a single weighted-variance result, then scored with the tree-posterior semantics. Same options, applied to the ensembled leaf.

ForestRegressionResult

@Serializable

@SerialName(value = "ForestRegressionResult")

data class ForestRegressionResult(val trees: List<TreeRegressionResult>) : Result

Snapshot of a RandomForestRegressionStat: per-tree immutable snapshots.

GiniReduction

@Serializable

@SerialName(value = "GiniReduction")

data object GiniReduction : ClassificationSplitMetric

Weighted Gini-impurity reduction. The classic CART classification criterion.
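For reference, the textbook form of this criterion can be sketched as below. It is a standalone illustration over raw count arrays with hypothetical function names; the library's own metric signature is not shown in this page, and returning 0 for an empty side follows the ClassificationSplitMetric contract described above.

```kotlin
// Gini impurity of a weighted class-count vector: 1 - sum_k p_k^2.
fun gini(counts: DoubleArray): Double {
    val n = counts.sum()
    if (n <= 0.0) return 0.0
    var sumSq = 0.0
    for (c in counts) {
        val p = c / n
        sumSq += p * p
    }
    return 1.0 - sumSq
}

// Parent impurity minus the weight-averaged impurity of the two children.
fun giniReduction(pos: DoubleArray, neg: DoubleArray): Double {
    val nPos = pos.sum()
    val nNeg = neg.sum()
    if (nPos <= 0.0 || nNeg <= 0.0) return 0.0 // one side empty: no signal
    val n = nPos + nNeg
    val parent = DoubleArray(pos.size) { pos[it] + neg[it] }
    val children = (nPos / n) * gini(pos) + (nNeg / n) * gini(neg)
    return (gini(parent) - children).coerceAtLeast(0.0)
}
```

A split that separates two balanced classes perfectly scores 0.5, the full parent impurity.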

InformationGain

@Serializable

@SerialName(value = "InformationGain")

data object InformationGain : ClassificationSplitMetric

Information-gain split criterion: parent entropy minus weighted children entropy.
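The definition above can be written out directly. The sketch below is standalone and uses hypothetical function names; it measures entropy in nats (natural log), which is an assumption since the source does not state the log base.

```kotlin
import kotlin.math.ln

// Shannon entropy (in nats) of a weighted class-count vector.
fun entropy(counts: DoubleArray): Double {
    val n = counts.sum()
    if (n <= 0.0) return 0.0
    var h = 0.0
    for (c in counts) {
        if (c > 0.0) {
            val p = c / n
            h -= p * ln(p)
        }
    }
    return h
}

// Parent entropy minus the weight-averaged entropy of the two children.
fun informationGain(pos: DoubleArray, neg: DoubleArray): Double {
    val nPos = pos.sum()
    val nNeg = neg.sum()
    if (nPos <= 0.0 || nNeg <= 0.0) return 0.0 // one side empty: no signal
    val n = nPos + nNeg
    val parent = DoubleArray(pos.size) { pos[it] + neg[it] }
    val children = (nPos / n) * entropy(pos) + (nNeg / n) * entropy(neg)
    return (entropy(parent) - children).coerceAtLeast(0.0)
}
```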

MeanForestPosterior

data object MeanForestPosterior : ForestPosterior

Forest counterpart to MeanTreePosterior.

MeanTreePosterior

data object MeanTreePosterior : TreePosterior

Score is the leaf's running mean; point estimate, no exploration.

RandomForestClassifierStat

class RandomForestClassifierStat(val featureSize: Int, val numClasses: Int, val splitCandidates: List<SerializableSplit>, val nbrTrees: Int = 10, config: ClassificationTreeConfig = ClassificationTreeConfig(), val bagging: Boolean = true, val concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<ClassCountsResult> = { ClassCountsStat(numClasses, concurrency) }, randomSeed: Int = 0) : RegressionStat<ForestClassificationResult>

Online random-forest classifier, the classification counterpart of RandomForestRegressionStat. Same diversity tricks (Oza & Russell bagging, per-leaf mtry), but per-tree leaves are ClassCountsResult and ensemble predictions average per-class probabilities across trees.
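Oza & Russell online bagging presents each observation to each tree k times, with k drawn from Poisson(1). The sketch below shows one way to produce such per-tree multipliers. It is a standalone illustration: poisson and baggingWeights are hypothetical helpers, and applying k as a weight multiplier (rather than replaying the observation k times) is an assumption about how the forest consumes it.

```kotlin
import kotlin.math.exp
import kotlin.random.Random

// Knuth's multiplication method for a Poisson(lambda) draw; adequate for small lambda.
fun poisson(lambda: Double, rng: Random): Int {
    val limit = exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
        k++
        p *= rng.nextDouble()
    }
    return k
}

// Hypothetical helper: one weight per tree for a single observation.
// A multiplier of 0 means that tree skips the observation.
fun baggingWeights(nbrTrees: Int, weight: Double, rng: Random): DoubleArray =
    DoubleArray(nbrTrees) { weight * poisson(1.0, rng) }
```

Because the multipliers average to 1, each tree sees roughly the same total weight as the raw stream while differing in which observations it emphasises.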

RandomForestRegressionStat

class RandomForestRegressionStat(val featureSize: Int, val splitCandidates: List<SerializableSplit>, val nbrTrees: Int = 10, config: RegressionTreeConfig = RegressionTreeConfig(), val bagging: Boolean = true, val concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<WeightedVarianceResult> = { VarianceStat(concurrency) }, randomSeed: Int = 0) : RegressionStat<ForestRegressionResult>

Online random-forest regressor: a population of RegressionTrees sharing the candidate-split pool. Diversity comes from Oza & Russell bagging and per-leaf mtry, the same tricks described for RandomForestClassifierStat.

RegressionAuditLeaf

class RegressionAuditLeaf<Row>(val arm: SeriesStat<WeightedVarianceResult>, val candidates: List<Split<Row>>, val pos: List<SeriesStat<WeightedVarianceResult>>, val neg: List<SeriesStat<WeightedVarianceResult>>) : RegressionLeafNode<Row>

Leaf that tracks per-candidate pos/neg stats. When a candidate clears the Hoeffding-bound test, this leaf is replaced by a RegressionSplitNode. The candidate subset is per-leaf, picked at leaf birth, so mtry-style random subspace selection lives at the leaf level.

RegressionLeafNode

sealed class RegressionLeafNode<Row> : RegressionNode<Row>

Leaf node: the terminus of the tree walk for a given row, and the only node type that owns a live accumulator.

RegressionNode

sealed interface RegressionNode<Row>

Internal tree node, generic over the feature Row the tree routes. The hot update path touches only the leaf an observation routes to; internal split nodes are never written by RegressionTree.update. Splits may carry an optional carryover arm: a one-shot snapshot of the pre-split aggregate captured at the moment a leaf converts into a split, plus any orphaned aggregates folded in by mixed-structure merges. Subtree aggregates include the carryover but the hot path never reads or writes it.

RegressionSplitNode

class RegressionSplitNode<Row>(val split: Split<Row>, pos: RegressionNode<Row>, neg: RegressionNode<Row>, carryover: SeriesStat<WeightedVarianceResult>? = null) : RegressionNode<Row>

Routes by split to either pos (true) or neg (false). The optional carryover holds aggregates that don't structurally belong to either child: the pre-split data frozen at split time, or orphans absorbed from a mixed merge. Never written by the update hot path; never read by findLeaf or predict; included by subtreeAggregate.

RegressionTerminalLeaf

class RegressionTerminalLeaf<Row>(val arm: SeriesStat<WeightedVarianceResult>) : RegressionLeafNode<Row>

Frozen leaf; no further splits will be considered.

RegressionTree

class RegressionTree<Row>(splitCandidates: List<Split<Row>>, config: RegressionTreeConfig = RegressionTreeConfig(), concurrency: Concurrency = Concurrency.None, leafArmFactory: () -> SeriesStat<WeightedVarianceResult> = { VarianceStat(concurrency) }, randomSeed: Int = 0)

Online VFDT-style decision tree partitioning feature rows of type Row. Each leaf carries a weighted-variance accumulator; audit leaves additionally track pos/neg sub-arms per candidate split and, every RegressionTreeConfig.splitPeriod observations, evaluate them against the Hoeffding bound to decide whether to convert themselves into a RegressionSplitNode.

RegressionTreeConfig

@Serializable

data class RegressionTreeConfig(val delta: Double = 0.05, val deltaDecay: Double = 0.9, val tau: Double = 0.05, val minSamplesSplit: Double = 30.0, val minSamplesLeaf: Double = 5.0, val splitPeriod: Int = 10, val maxDepth: Int = 16, val maxNodes: Int = 1024, val metric: SplitMetric = VarianceReduction, val mtry: Int? = null)

Tunables for RegressionTree growth, shared by DecisionTreeRegressionStat and RandomForestRegressionStat.

SerializableSplit

@Serializable

sealed interface SerializableSplit : Split<VectorView>

Wire-portable Split over a dense VectorView context. Sealed + serializable so tree snapshots round-trip cleanly through kotlinx.serialization. Built-in implementations: ThresholdSplit (numeric x[i] <= t) and ExprSplit (wrapping a BoolExpr); callers needing custom predicates compose them as BoolExpr AST nodes and wrap in ExprSplit.

Split

interface Split<in Row>

Growth-time routing predicate over a feature Row. The tree engine (RegressionTree) only ever calls direction to route an observation to a child, so any feature representation can drive growth by supplying its own Splits, for example a downstream library's typed, constraint-coupled splits over a non-vector row.

SplitInfo

data class SplitInfo(val top1: Double, val top2: Double, val bestIndex: Int)

Result of evaluating all candidate splits at a leaf: best score, runner-up, best index.
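The best score and runner-up feed the Hoeffding test. The sketch below shows the textbook VFDT decision rule under stated assumptions: hoeffdingEpsilon and shouldSplit are hypothetical names, range is the assumed range of the metric, and using tau as a tie-break threshold is the standard VFDT convention rather than something this page confirms for kumulant (the roles of deltaDecay and splitPeriod are not modelled).

```kotlin
import kotlin.math.ln
import kotlin.math.sqrt

// Hoeffding bound: with probability 1 - delta, the true mean of a variable with
// the given range lies within epsilon of the mean observed over n samples.
fun hoeffdingEpsilon(range: Double, delta: Double, n: Double): Double =
    sqrt(range * range * ln(1.0 / delta) / (2.0 * n))

// Textbook VFDT rule (assumed): split when the best candidate leads the runner-up
// by more than epsilon, or when epsilon has shrunk below the tie-break tau.
fun shouldSplit(
    top1: Double,
    top2: Double,
    range: Double,
    delta: Double,
    tau: Double,
    n: Double,
): Boolean {
    if (n <= 0.0 || top1 <= 0.0) return false // nothing observed or no signal
    val eps = hoeffdingEpsilon(range, delta, n)
    return (top1 - top2) > eps || eps < tau
}
```

With range 1, delta 0.05 and 100 observations, epsilon is about 0.122, so a lead of 0.4 splits while a lead of 0.05 waits for more data.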

SplitMetric

@Serializable

sealed interface SplitMetric

Scores a candidate split against a leaf's pre-split distribution. Higher is better. Returned score must satisfy value(total, total, empty) == 0 so that "no signal" is always last in the ranking.

ThompsonForestPosterior

data class ThompsonForestPosterior(val priorWeight: Double = 1.0, val priorVariance: Double = 1.0) : ForestPosterior

Forest counterpart to ThompsonTreePosterior.

ThompsonTreePosterior

data class ThompsonTreePosterior(val priorWeight: Double = 1.0, val priorVariance: Double = 1.0) : TreePosterior

Thompson sampling over the leaf's Normal-Gamma posterior. Given the leaf's pseudo-count n, sample mean m, and sample variance v, draws are mu ~ N(m, exploration * v / max(n, 1)), which is the posterior on the leaf mean under a Normal-Gamma conjugate model with a weak prior. exploration = 0.0 collapses to the leaf mean.
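The draw described above can be sketched as follows. This is a standalone JVM illustration that follows only the formula in the text: thompsonDraw is a hypothetical function, and it does not model the priorWeight and priorVariance constructor parameters, whose exact role is not described here.

```kotlin
import java.util.Random
import kotlin.math.max
import kotlin.math.sqrt

// Draw mu ~ N(mean, exploration * variance / max(n, 1)).
fun thompsonDraw(
    mean: Double,
    variance: Double,
    n: Double,
    exploration: Double,
    rng: Random,
): Double {
    // Guard against tiny negative variances from floating-point round-off.
    val scaled = (exploration * variance / max(n, 1.0)).coerceAtLeast(0.0)
    return mean + sqrt(scaled) * rng.nextGaussian()
}
```

Setting exploration to 0.0 makes the standard deviation zero, so the draw is exactly the leaf mean.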

ThresholdSplit

@Serializable

@SerialName(value = "ThresholdSplit")

data class ThresholdSplit(val featureIndex: Int, val threshold: Double) : SerializableSplit

Route by row[featureIndex] <= threshold. Threshold is inclusive on the "pos" side.
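To make the routing rule concrete, here is a standalone stand-in for the predicate together with a simple way to generate a candidate grid. SketchThresholdSplit and gridCandidates are hypothetical names for illustration only; the Boolean return type of direction and the use of a DoubleArray row (instead of VectorView) are assumptions.

```kotlin
// Standalone stand-in for the documented data class; not the kumulant type.
data class SketchThresholdSplit(val featureIndex: Int, val threshold: Double) {
    // true routes to "pos"; the comparison is inclusive, as documented.
    fun direction(row: DoubleArray): Boolean = row[featureIndex] <= threshold
}

// Hypothetical helper: `steps` evenly spaced interior thresholds per feature,
// assuming every feature lives in the same [lo, hi] range.
fun gridCandidates(
    featureSize: Int,
    lo: Double,
    hi: Double,
    steps: Int,
): List<SketchThresholdSplit> =
    (0 until featureSize).flatMap { i ->
        (1..steps).map { s -> SketchThresholdSplit(i, lo + (hi - lo) * s / (steps + 1)) }
    }
```

Two features with three steps over [0, 1] yield six candidates at thresholds 0.25, 0.5 and 0.75 per feature.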

TreeClassificationLeafResult

@Serializable

@SerialName(value = "TreeClassificationLeafResult")

data class TreeClassificationLeafResult(val value: ClassCountsResult) : TreeClassificationNodeResult

Immutable classification leaf-node snapshot.

TreeClassificationNodeResult

@Serializable

sealed interface TreeClassificationNodeResult

Snapshot of a single classification-tree node, either a split or a leaf.

TreeClassificationResult

@Serializable

@SerialName(value = "TreeClassificationResult")

data class TreeClassificationResult(val root: TreeClassificationNodeResult) : Result

Classification mirror of TreeRegressionResult.

TreeClassificationSplitResult

@Serializable

@SerialName(value = "TreeClassificationSplitResult")

data class TreeClassificationSplitResult(val split: SerializableSplit, val pos: TreeClassificationNodeResult, val neg: TreeClassificationNodeResult, val value: ClassCountsResult) : TreeClassificationNodeResult

Immutable classification split-node snapshot.

TreeLeafResult

@Serializable

@SerialName(value = "TreeLeafResult")

data class TreeLeafResult(val value: WeightedVarianceResult) : TreeNodeResult

Immutable leaf-node snapshot.

TreeNodeResult

@Serializable

sealed interface TreeNodeResult

Snapshot of a single tree node, either a split or a leaf.

TreePosterior

sealed interface TreePosterior : RegressionPosterior<TreeRegressionResult>

RegressionTree-aware scorer: routes the query x to a leaf snapshot and turns its weighted-variance summary into a single Double. Parallels the linear-side com.eignex.kumulant.stat.regression.glm.LinearPosterior family for the tree regressor shape.

TreeRegressionResult

@Serializable

@SerialName(value = "TreeRegressionResult")

data class TreeRegressionResult(val root: TreeNodeResult) : Result

Immutable, wire-portable snapshot of a RegressionTree over a dense VectorView context. Carries the tree structure (SerializableSplit predicates + per-node weighted-variance aggregates) so callers can route a context vector to its leaf without reaching back into the live stat.
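Routing a context vector through a snapshot is a plain walk from the root to a leaf. The sketch below mirrors the split/leaf shape with hypothetical stand-in types (SketchNode, SketchLeaf, SketchSplit) that hold only a threshold predicate and a leaf mean; it is not the TreeNodeResult hierarchy itself.

```kotlin
// Hypothetical stand-ins for an immutable tree snapshot.
sealed interface SketchNode
data class SketchLeaf(val mean: Double) : SketchNode
data class SketchSplit(
    val featureIndex: Int,
    val threshold: Double,
    val pos: SketchNode,
    val neg: SketchNode,
) : SketchNode

// Walk down until a leaf is reached; x[i] <= t routes to pos, otherwise neg.
tailrec fun predict(node: SketchNode, x: DoubleArray): Double = when (node) {
    is SketchLeaf -> node.mean
    is SketchSplit ->
        predict(if (x[node.featureIndex] <= node.threshold) node.pos else node.neg, x)
}
```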

TreeSplitResult

@Serializable

@SerialName(value = "TreeSplitResult")

data class TreeSplitResult(val split: SerializableSplit, val pos: TreeNodeResult, val neg: TreeNodeResult, val value: WeightedVarianceResult) : TreeNodeResult

Immutable split-node snapshot.

UcbForestPosterior

data class UcbForestPosterior(val priorWeight: Double = 1.0, val priorVariance: Double = 1.0) : ForestPosterior

Forest counterpart to UcbTreePosterior.

UcbTreePosterior

data class UcbTreePosterior(val priorWeight: Double = 1.0, val priorVariance: Double = 1.0) : TreePosterior

UCB-style score: mean + exploration * sqrt(variance / (totalWeights + priorWeight)). The sqrt(.) term is the leaf's standard error of the mean; the prior-weight floor keeps the bound finite at empty leaves.
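The score formula transcribes directly into code. The sketch below is a standalone illustration: ucbScore is a hypothetical function, and exploration is passed as an explicit argument because the page does not show where that multiplier is configured.

```kotlin
import kotlin.math.sqrt

// mean + exploration * sqrt(variance / (totalWeights + priorWeight)).
// The priorWeight term keeps the denominator positive for an empty leaf.
fun ucbScore(
    mean: Double,
    variance: Double,
    totalWeights: Double,
    exploration: Double,
    priorWeight: Double = 1.0,
): Double = mean + exploration * sqrt(variance / (totalWeights + priorWeight))
```

For mean 1, variance 4, total weight 3 and exploration 2, the bonus is 2 * sqrt(4 / 4) = 2 and the score is 3.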

VarianceReduction

@Serializable

@SerialName(value = "VarianceReduction")

data object VarianceReduction : SplitMetric

Mean variance reduction. The classic CART regression criterion.
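In its usual form the criterion subtracts the weight-averaged child variances from the parent variance. The sketch below is standalone: Moments and varianceReduction are hypothetical names, population variance is assumed, and returning 0 when either side is empty matches the SplitMetric requirement that value(total, total, empty) == 0.

```kotlin
// Hypothetical summary of a set of weighted observations.
data class Moments(val n: Double, val mean: Double, val variance: Double)

// Parent variance minus the weight-averaged variance of the two children.
fun varianceReduction(total: Moments, pos: Moments, neg: Moments): Double {
    if (total.n <= 0.0 || pos.n <= 0.0 || neg.n <= 0.0) return 0.0 // no signal
    val children = (pos.n / total.n) * pos.variance + (neg.n / total.n) * neg.variance
    return (total.variance - children).coerceAtLeast(0.0)
}
```

For targets 0, 0, 2, 2 split into {0, 0} and {2, 2}, the parent variance is 1 and both children have variance 0, so the reduction is 1.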

Functions

aggregate

fun RegressionNode<*>.aggregate(): WeightedVarianceResult

Public per-node subtree aggregate over all observations routed through this node (leaves: the arm snapshot; splits: the exact Chan-merge of both children plus any RegressionSplitNode.carryover). Lets callers score internal nodes (e.g. a Thompson walk-down that picks a branch by its subtree's posterior) without the tree having to keep a live arm on every split node. Equivalent to a directly-accumulated internal arm: Chan's parallel merge is exact for weighted-variance aggregates.
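Chan's parallel merge combines two (weight, mean, sum of squared deviations) summaries without revisiting the data. The sketch below is a standalone illustration with hypothetical names (Summary, chanMerge); the fields of WeightedVarianceResult are not shown in this page, so the layout here is an assumption.

```kotlin
// Hypothetical summary: total weight, mean, and sum of squared deviations (M2).
data class Summary(val n: Double, val mean: Double, val m2: Double) {
    val variance: Double get() = if (n > 0.0) m2 / n else 0.0
}

// Chan et al. pairwise update: exact for weighted mean and variance.
fun chanMerge(a: Summary, b: Summary): Summary {
    val n = a.n + b.n
    if (n <= 0.0) return Summary(0.0, 0.0, 0.0)
    val delta = b.mean - a.mean
    val mean = a.mean + delta * b.n / n
    val m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Summary(n, mean, m2)
}

// A split's aggregate: both children merged, plus the carryover if present.
fun splitAggregate(pos: Summary, neg: Summary, carryover: Summary?): Summary {
    val children = chanMerge(pos, neg)
    return if (carryover == null) children else chanMerge(children, carryover)
}
```

Merging the summaries of {0, 0} and {2, 2} gives weight 4, mean 1 and variance 1, the same as accumulating all four values directly.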

mergeSnapshot

fun RegressionTree<VectorView>.mergeSnapshot(other: TreeNodeResult)

Snapshot merge using only the immutable result. Mirrors RegressionTree.merge but the "other" side is a TreeNodeResult tree-of-results rather than a live tree. VectorView only, for the same reason snapshot is.

rank

fun SplitMetric.rank(total: WeightedVarianceResult, pos: List<WeightedVarianceResult>, neg: List<WeightedVarianceResult>, minSamplesSplit: Double, minSamplesLeaf: Double): SplitInfo

Score every candidate split and return the top-2 + index. Splits that don't meet minSamplesLeaf on both sides or minSamplesSplit in total are skipped.
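The ranking step can be sketched over precomputed scores. This is a standalone illustration: SketchSplitInfo and rankScores are hypothetical, the scores are assumed to have been produced by a metric already, and returning bestIndex = -1 when nothing qualifies is an assumption about the sentinel.

```kotlin
// Hypothetical mirror of the documented result shape.
data class SketchSplitInfo(val top1: Double, val top2: Double, val bestIndex: Int)

// Keep the best and second-best score among candidates that pass the size filters.
fun rankScores(
    scores: DoubleArray,
    posWeight: DoubleArray,
    negWeight: DoubleArray,
    totalWeight: Double,
    minSamplesSplit: Double,
    minSamplesLeaf: Double,
): SketchSplitInfo {
    if (totalWeight < minSamplesSplit) return SketchSplitInfo(0.0, 0.0, -1)
    var top1 = 0.0
    var top2 = 0.0
    var best = -1
    for (i in scores.indices) {
        // Skip candidates that would leave either child too small.
        if (posWeight[i] < minSamplesLeaf || negWeight[i] < minSamplesLeaf) continue
        val s = scores[i]
        if (s > top1) {
            top2 = top1
            top1 = s
            best = i
        } else if (s > top2) {
            top2 = s
        }
    }
    return SketchSplitInfo(top1, top2, best)
}
```

Because the running best starts at 0 and the comparison is strict, a zero-score ("no signal") candidate is never selected.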

snapshot

fun RegressionNode<VectorView>.snapshot(): TreeNodeResult

Freeze a live VectorView tree node into an immutable, serializable snapshot. Internal split aggregates are derived from the snapshotted children so the wire format stays stable even though live splits hold no arm.