kumulant

stat.sketch

Structural queries on a stream that aren't shaped like a cardinality or a quantile. Four members; each answers a different question.

Picking a sketch

StatResultQuestion
BloomFilterStatBloomFilterResult"Have I seen this key before?" Membership only, one-sided false positives, no per-key counts.
CountMinSketchStatCountMinSketchResult"How many times has this key appeared?" Approximate per-key counts; biased upward by collisions, exact for heavy items.
SpaceSavingStatHeavyHittersResult"What are the top K heaviest keys?" Long-tail problem: out of millions of distinct items, a handful might dominate volume.
MinHashStatMinHashResult"How similar are two sets?" Jaccard similarity sketch; estimates overlap without comparing the sets directly.

All four take a stream of opaque Long keys through com.eignex.kumulant.core.DiscreteStat. As with com.eignex.kumulant.stat.cardinality, hash domain-specific keys through com.eignex.kumulant.math.hash64 first so the input carries uniform 64-bit entropy.

When to pick what

  • BloomFilter when you need yes/no membership at the absolute cheapest memory per key. A "yes" may be a false positive (rate controlled by configured bits + hashes); a "no" is always exact. Standard tool for deduplication of seen-already keys.

  • CountMinSketch when you need per-key counts but an exact map would blow memory. Counts are biased upward by collisions but exact for heavy items, which is fine for top-K-like questions and rate monitoring.

  • SpaceSaving when you only need the K heaviest keys: viral URLs, hot database rows, top users. Cheaper than a full map: keeps K counters and guarantees that any key whose true frequency exceeds the K-th largest is in the retained set.

  • MinHash for near-duplicate detection over streams. Standard Jaccard-similarity sketch; two sets' MinHash signatures collide on hash j with probability equal to their Jaccard index.

Merge

All four families work cleanly in distributed pipelines: workers track slices, ship snapshots, the coordinator merges.

Concurrency

Each family decomposes updates into independent atomic operations on single cells (BloomFilter / CountMin / MinHash) or a small bounded set of cells (SpaceSaving). Lock-free and exact under every com.eignex.kumulant.core.Concurrency level.

Types

Link copied to clipboard
@Serializable
@SerialName(value = "BloomFilterResult")
data class BloomFilterResult(val bits: Int, val hashes: Int, val words: LongArray, val totalSeen: Long, val hasher: HasherRef = HasherRef.SplitMix64) : Result

Bloom-filter snapshot. words is the bitset packed as bits / 64 longs; merging two snapshots requires identical bits and hashes.

Link copied to clipboard
class BloomFilterStat(val bits: Int = 1 shl 16, val hashes: Int = 7, val hasher: LongHasher = SplitMix64, val concurrency: Concurrency = Concurrency.None) : DiscreteStat<BloomFilterResult>

Bloom filter - probabilistic set-membership test with no false negatives. bits bits are split across hashes positions per insert, derived from the hasher (default SplitMix64) via the Kirsch-Mitzenmacher double-hashing scheme.

Link copied to clipboard
@Serializable
@SerialName(value = "CountMinSketchResult")
data class CountMinSketchResult(val depth: Int, val width: Int, val seed: Long, val counters: LongArray, val totalSeen: Long, val hasher: HasherRef = HasherRef.SplitMix64) : Result

CountStat-MinStat sketch snapshot. counters is the depth x width matrix of counters in row-major order. seed determines the per-row hash salts; merging two snapshots requires identical depth, width, and seed. totalSeen is the unweighted update count.

Link copied to clipboard
class CountMinSketchStat(val depth: Int = 5, val width: Int = 1024, val seed: Long = -7046029254386353133L, val hasher: LongHasher = SplitMix64, val concurrency: Concurrency = Concurrency.None) : DiscreteStat<CountMinSketchResult>

CountStat-MinStat sketch - a probabilistic frequency estimator over a depth x width matrix of counters. Each update hashes the value with depth independent salts (derived from seed) through the hasher (default SplitMix64) and increments one counter per row; the estimated count for any value is the minimum counter across rows.

Link copied to clipboard
@Serializable
@SerialName(value = "HeavyHittersResult")
data class HeavyHittersResult(val capacity: Int, val keys: LongArray, val counts: LongArray, val errors: LongArray, val totalSeen: Long) : Result

Space-Saving heavy-hitters snapshot. keys, counts, errors are parallel arrays of length <= capacity; for each tracked key, counts is the (over)estimated weighted count and errors is the Space-Saving overestimate bound (the count is at most this much above the true count). totalSeen is the unweighted update count.

Link copied to clipboard
@Serializable
@SerialName(value = "MinHashResult")
data class MinHashResult(val numHashes: Int, val seed: Long, val signatures: LongArray, val totalSeen: Long) : Result

MinHash signature snapshot. signatures is the per-hash running minimum of splitmix64(value xor splitmix64(seed + i)) over all updates; merging two snapshots takes element-wise min and requires identical numHashes and seed.

Link copied to clipboard
class MinHashStat(val numHashes: Int = 128, val seed: Long = -3724518991637283867L, val hasher: LongHasher = SplitMix64, val concurrency: Concurrency = Concurrency.None) : DiscreteStat<MinHashResult>

MinHash signature - for each of numHashes independent hash functions (salted by splitmix64(seed + i)), maintain the running minimum hash over all inserted values. The Jaccard similarity between two sets is estimated by the fraction of slots whose signatures agree (see jaccard).

Link copied to clipboard
class SpaceSavingStat(val capacity: Int, val concurrency: Concurrency = Concurrency.None) : DiscreteStat<HeavyHittersResult>

Heavy-hitters tracker; keeps the capacity most-frequent keys with one-sided overestimate guarantees on their counts.

Functions

Link copied to clipboard

True iff every bit set during an update(value) is still set in words. Re-derives the bit positions with the same LongHasher the filter used (resolved by name via Hashers), so a custom hasher must be registered before querying.

Link copied to clipboard

Estimated weighted count of value - the minimum across rows. Re-derives the per-row index with the same LongHasher the sketch used (resolved by name via Hashers), so a custom hasher must be registered before querying.

Link copied to clipboard

Estimated Jaccard similarity between the two underlying sets - the fraction of slots where signatures agree. Requires matching numHashes and seed.