stat.score
Online evaluation metrics. Inputs are paired (prediction, truth) observations (or richer shapes for distributional metrics) and outputs are accuracy / discrimination / calibration / distributional summaries.
The calibration diagnostic (Reliability) and the calibration fixes live in com.eignex.kumulant.stat.calibration. This package is for the rest: regression errors, proper scoring rules, discrimination metrics, classification metrics, and distributional forecast diagnostics.
Regression errors
| Stat | Result | Use |
|---|---|---|
| MseLoss (in Loss.kt) | com.eignex.kumulant.stat.summary.WeightedMeanResult | Mean squared error; weights large errors quadratically. |
| MaeLoss (in Loss.kt) | com.eignex.kumulant.stat.summary.WeightedMeanResult | Mean absolute error; a robust alternative that is minimised by the median and is less sensitive to outliers. |
| PinballLossStat | com.eignex.kumulant.stat.summary.WeightedMeanResult | Quantile (pinball) loss at quantile tau. The right pick when the model emits a quantile rather than a mean. |
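To make the three per-row losses in the table concrete, here is a minimal standalone sketch. The function names (`squaredError`, `absoluteError`, `pinballLoss`) are hypothetical helpers for illustration, not part of this package's API.

```kotlin
import kotlin.math.abs
import kotlin.math.max

// Hypothetical helpers: per-row losses only, no streaming state.
fun squaredError(prediction: Double, truth: Double): Double {
    val diff = truth - prediction
    return diff * diff
}

fun absoluteError(prediction: Double, truth: Double): Double =
    abs(truth - prediction)

// Pinball loss at quantile level tau in (0, 1).
// Under-prediction (truth > prediction) costs tau per unit,
// over-prediction costs (1 - tau) per unit.
fun pinballLoss(tau: Double, prediction: Double, truth: Double): Double {
    val diff = truth - prediction
    return max(tau * diff, (tau - 1.0) * diff)
}
```

With tau = 0.9, under-predicting by 2 costs 1.8 while over-predicting by 2 costs only 0.2, which is what pushes the minimiser towards the 90th percentile. At tau = 0.5 the pinball loss is half the absolute error.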
Binary proper scoring rules
| Stat | Result | Use |
|---|---|---|
| LogLoss (in Loss.kt) | com.eignex.kumulant.stat.summary.WeightedMeanResult | Cross-entropy / negative log-likelihood. Use for likelihood-shaped objectives. |
| BrierScoreStat | com.eignex.kumulant.stat.summary.WeightedMeanResult | Bounded squared-error counterpart of log loss. Reliability-decomposable. |
LogLoss penalises confident-wrong predictions much more harshly than Brier. Pick LogLoss when you want a likelihood-shaped objective; pick Brier when you want a bounded, calibration-decomposable error.
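The difference in penalty can be seen on a single confident-wrong row. The sketch below uses hypothetical helper names (`logLossRow`, `brierRow`), and the `eps` clamp is an assumption added to keep `ln` finite; whether the library clamps probabilities is not stated here.

```kotlin
import kotlin.math.ln

// Hypothetical per-row binary log loss. The eps clamp is an assumption
// of this sketch, used to avoid ln(0).
fun logLossRow(probability: Double, outcome: Double, eps: Double = 1e-15): Double {
    val p = probability.coerceIn(eps, 1.0 - eps)
    return -(outcome * ln(p) + (1.0 - outcome) * ln(1.0 - p))
}

// Hypothetical per-row Brier score: bounded in [0, 1] for outcomes in {0, 1}.
fun brierRow(probability: Double, outcome: Double): Double {
    val diff = probability - outcome
    return diff * diff
}
```

For a prediction of 0.99 against a true outcome of 0, the Brier row is about 0.98 (close to its upper bound of 1), while the log loss row is about 4.6 and grows without bound as the prediction approaches 1.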
Discrimination
AucStat reports streaming ROC AUC over a fixed-resolution score histogram. AUC measures whether positives score higher than negatives on average and is calibration-agnostic; a perfectly-discriminative model can still be miscalibrated, and a perfectly-calibrated model can have mediocre AUC.
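One way a fixed-resolution score histogram can yield AUC is sketched below. This is an illustrative standalone class, not the AucStat implementation: the name `HistogramAuc`, the assumption that scores lie in [0, 1], and the choice to count same-bin positive/negative pairs as half-concordant (equivalent to the trapezoidal rule on the binned ROC curve) are all assumptions of the sketch.

```kotlin
// Hypothetical sketch of AUC over a score histogram.
class HistogramAuc(private val numBins: Int) {
    private val pos = DoubleArray(numBins)
    private val neg = DoubleArray(numBins)

    // label in [0, 1]: a soft label splits the weight between the
    // positive and negative histograms.
    fun update(score: Double, label: Double, weight: Double = 1.0) {
        val bin = (score.coerceIn(0.0, 1.0) * numBins).toInt().coerceAtMost(numBins - 1)
        pos[bin] += weight * label
        neg[bin] += weight * (1.0 - label)
    }

    // Probability that a random positive outscores a random negative,
    // with same-bin pairs counted as ties (one half each).
    fun auc(): Double {
        val totalPos = pos.sum()
        val totalNeg = neg.sum()
        if (totalPos == 0.0 || totalNeg == 0.0) return Double.NaN
        var negBelow = 0.0
        var concordant = 0.0
        for (b in 0 until numBins) {
            concordant += pos[b] * (negBelow + 0.5 * neg[b])
            negBelow += neg[b]
        }
        return concordant / (totalPos * totalNeg)
    }
}
```

Because only bin counts are stored, the resolution of the estimate is limited by `numBins`: positives and negatives that fall in the same bin are indistinguishable and contribute 0.5 each.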
Classification
| Stat | Result | Use |
|---|---|---|
| AccuracyStat | com.eignex.kumulant.stat.summary.WeightedMeanResult | Weighted classification accuracy: fraction of predicted == truth. |
| ConfusionMatrixStat | ConfusionMatrixResult | K-by-K confusion matrix with per-class precision / recall / F1, macro / micro averages, multiclass MCC. |
AccuracyStat is the O(1) shortcut when only the scalar accuracy matters. ConfusionMatrixStat is the full P/R/F1 surface with a per-class breakdown; reach for it when accuracy alone hides class-imbalance effects.
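As an illustration of what the per-class breakdown adds over scalar accuracy, the following standalone sketch derives accuracy, precision, recall, and F1 from a matrix indexed as counts[predicted][truth]. The class `ConfusionSketch` and its method names are hypothetical; they are not the ConfusionMatrixResult getters.

```kotlin
// Hypothetical sketch of a weighted K-by-K confusion matrix.
class ConfusionSketch(private val numClasses: Int) {
    // Rows are predicted classes, columns are true classes.
    val counts = Array(numClasses) { DoubleArray(numClasses) }

    // Class indices arrive as doubles; truncate and drop out-of-range pairs.
    fun update(predicted: Double, truth: Double, weight: Double = 1.0) {
        val p = predicted.toInt()
        val t = truth.toInt()
        if (p !in 0 until numClasses || t !in 0 until numClasses) return
        counts[p][t] += weight
    }

    fun accuracy(): Double {
        val total = counts.sumOf { it.sum() }
        val diagonal = (0 until numClasses).sumOf { counts[it][it] }
        return diagonal / total
    }

    // Precision: correct predictions of k over everything predicted as k (row sum).
    fun precision(k: Int): Double = counts[k][k] / counts[k].sum()

    // Recall: correct predictions of k over everything truly k (column sum).
    fun recall(k: Int): Double =
        counts[k][k] / (0 until numClasses).sumOf { counts[it][k] }

    fun f1(k: Int): Double {
        val p = precision(k)
        val r = recall(k)
        return 2.0 * p * r / (p + r)
    }

    fun macroF1(): Double = (0 until numClasses).map { f1(it) }.average()
}
```

In this sketch a class that was never predicted or never observed produces NaN for the affected ratio (0.0 / 0.0); a real implementation would need to pick a convention for that case.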
Distributional forecast diagnostics
The PIT (probability integral transform) family covers calibration of distributional forecasts:
- pitHistogram(numBins) (factory in PitHistogram.kt): feeds PIT values into an equiprobable LinearHistogramStat over [0, 1]. Under correct distributional forecasts the histogram should be uniform; deviations diagnose under- or over-coverage and tail mis-specifications.
- The functions in PitTests.kt run the standard PIT uniformity tests on the histogram (Kolmogorov-Smirnov-style summary statistics).
Use these when the model emits a CDF (not just a point estimate) and you want to check whether the predicted distribution matches the empirical one.
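To show what a PIT value is and how it ends up in a histogram, here is a minimal standalone sketch. It does not use pitHistogram or LinearHistogramStat; `exponentialCdf` and `pitCounts` are hypothetical helpers, and the exponential forecast is just a convenient distribution with a closed-form CDF.

```kotlin
import kotlin.math.exp

// Hypothetical forecast CDF: exponential distribution with the given rate.
fun exponentialCdf(rate: Double, y: Double): Double =
    if (y <= 0.0) 0.0 else 1.0 - exp(-rate * y)

// Bin PIT values u = F(y) into numBins equal-width buckets over [0, 1].
// A value of exactly 1.0 is placed in the last bucket.
fun pitCounts(pitValues: DoubleArray, numBins: Int): DoubleArray {
    val counts = DoubleArray(numBins)
    for (u in pitValues) {
        val bin = (u.coerceIn(0.0, 1.0) * numBins).toInt().coerceAtMost(numBins - 1)
        counts[bin] += 1.0
    }
    return counts
}
```

Each observation y contributes one PIT value `exponentialCdf(rate, y)`. If the forecast rate matches the data-generating rate, the resulting counts should be roughly flat; a U-shaped histogram suggests forecasts that are too narrow, a hump-shaped one forecasts that are too wide.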
Compose patterns
- MseLoss.windowed(window) for windowed regression error.
- BrierScore.transform(...) after a Platt or Isotonic step to score calibrated probabilities.
- Auc + Reliability in parallel: AUC tells you discrimination, reliability tells you calibration. A pipeline that monitors both catches different failure modes.
Merge
All paired-mean-shaped metrics (MseLoss, MaeLoss, LogLoss, BrierScore, PinballLoss, Accuracy) merge via the underlying MeanStat's Chan-style parallel formula; exact across replicas. AucStat and ConfusionMatrixStat merge via cell-wise bin / matrix addition.
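The two merge shapes can be sketched as follows. `WeightedMean`, `mergeMeans`, and `mergeCells` are hypothetical names for illustration; the snippet shows the pairwise weighted-mean combination and cell-wise addition in isolation, not the library's own merge code.

```kotlin
// Hypothetical snapshot of a weighted mean.
data class WeightedMean(val mean: Double, val weight: Double)

// Pairwise combination of two weighted means: shift a's mean towards
// b's mean by b's share of the combined weight.
fun mergeMeans(a: WeightedMean, b: WeightedMean): WeightedMean {
    val total = a.weight + b.weight
    if (total == 0.0) return WeightedMean(0.0, 0.0)
    return WeightedMean(a.mean + (b.weight / total) * (b.mean - a.mean), total)
}

// Cell-wise addition for histogram bins or flattened matrix cells.
fun mergeCells(a: DoubleArray, b: DoubleArray): DoubleArray {
    require(a.size == b.size) { "shapes must match" }
    return DoubleArray(a.size) { a[it] + b[it] }
}
```

For example, merging a replica with mean 2.0 over weight 1 and a replica with mean 4.0 over weight 3 gives mean 3.5 over weight 4, the same value a single replica would report after seeing all four observations.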
Concurrency
The mean-shaped metrics (AccuracyStat, BrierScoreStat, PinballLossStat, and the MseLoss / MaeLoss / LogLoss stats in Loss.kt) inherit MeanStat's Welford-coupled model: locked under com.eignex.kumulant.core.Concurrency.Strict / com.eignex.kumulant.core.Concurrency.HighWrite, drifting by ULPs under com.eignex.kumulant.core.Concurrency.Relaxed but never throwing. AucStat and ConfusionMatrixStat apply independent striped atomic increments to their histogram / matrix cells; lock-free and exact under every level, with the trapezoidal / precision-recall read running single-threaded.
Types
Streaming classification accuracy: paired (predictedClass, trueClass) aggregated as the weighted mean of 1[predicted == truth]. Classes are compared on toLong() so floating-point class indices round-trip safely.
AUC snapshot with the per-bin counts needed for merge. auc is NaN until at least one positive and one negative have been observed; consult totalPositives / totalNegatives to detect that case.
Streaming binary ROC-AUC by score-binning. Each update is paired (score, label) with label in {0, 1} (soft labels work too via the convex split into pos/neg weights).
Streaming Brier score for binary probabilistic forecasts. Paired input is (probability, outcome) where outcome in {0, 1}; aggregated as the mean of (probability - outcome)^2.
Snapshot of a weighted K-by-K confusion matrix indexed as counts[predicted][truth].
Streaming K-by-K confusion matrix over paired (predictedClass, trueClass) observations. Inputs are class indices in [0, numClasses); the doubles are truncated to ints via toInt() and out-of-range pairs are ignored. Use for online classifier evaluation; pair with the metric getters on ConfusionMatrixResult for accuracy, per-class P/R/F1, macro F1, and MCC.
Streaming binary log loss (cross-entropy): paired (probability, outcome) aggregated as the mean of -[y*ln(p) + (1-y)*ln(1-p)].
Streaming mean absolute error: paired (prediction, truth) aggregated as the mean of |prediction - truth|.
Streaming mean squared error: paired (prediction, truth) aggregated as the mean of (prediction - truth)^2.
Streaming pinball / quantile loss at level tau. Paired input is (prediction, truth); the per-row loss is max(tau*(y - yhat), (tau - 1)*(y - yhat)), which equals 0.5 * |y - yhat| (half the absolute error) when tau = 0.5.
Functions
Pearson chi-squared statistic for uniformity on [0, 1]. Compares the empirical bin counts against the uniform expectation total / numBins and sums (observed - expected)^2 / expected over all numBins bins.
Probability Integral Transform histogram: bins F(y) (the forecast CDF evaluated at the observed truth) into numBins equal-width buckets across [0, 1]. A uniform empirical distribution indicates a well-calibrated forecaster; concentrated mass indicates miscalibration.
Kolmogorov-Smirnov statistic against the uniform distribution on [0, 1]. Walks every bin (including empty ones) and returns the supremum of |empCdf(x) - x| evaluated at bin upper boundaries.
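Both uniformity statistics can be computed directly from PIT bin counts. The sketch below follows the descriptions above; the function names `chiSquaredUniform` and `ksUniform` are hypothetical and the signatures in PitTests.kt may differ.

```kotlin
import kotlin.math.abs
import kotlin.math.max

// Pearson chi-squared against the uniform expectation total / numBins.
fun chiSquaredUniform(counts: DoubleArray): Double {
    val expected = counts.sum() / counts.size
    return counts.sumOf { (it - expected) * (it - expected) / expected }
}

// Kolmogorov-Smirnov distance to Uniform[0, 1], evaluated at the upper
// boundary of every bin (empty bins included).
fun ksUniform(counts: DoubleArray): Double {
    val total = counts.sum()
    var cumulative = 0.0
    var sup = 0.0
    for (i in counts.indices) {
        cumulative += counts[i]
        val upper = (i + 1).toDouble() / counts.size
        sup = max(sup, abs(cumulative / total - upper))
    }
    return sup
}
```

For the U-shaped counts [10, 0, 0, 10], the expected count per bin is 5, so the chi-squared statistic is 4 * 25 / 5 = 20, and the empirical CDF at the bin boundaries is 0.5, 0.5, 0.5, 1.0 against 0.25, 0.5, 0.75, 1.0, giving a KS statistic of 0.25. Both functions in this sketch return NaN when the histogram is empty.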