Net v0.9 — "Killing Moon" Phase II
v0.9 is a hardening release. No new features, no new transports, no new SDK surfaces — every commit on this branch is a bug fix, a regression test, or a documentation tightening. The conviction we shipped under v0.8 ("Killing Moon") was that distributed compute should not be a control-plane problem. v0.9 is the version where we stand behind that conviction by walking it through audit after audit and tightening every seam we found.
The work was driven by four parallel-pass internal audits totalling 102 items across the bus, the shard manager, the RedEX append log and its CortEX fold, the JetStream and Redis adapters, the mesh transport, the FFI surface, and every binding.
Addressed in this release
RedEX & CortEX (storage + folded state)
- Lost events on partial replay failure: `MigrationTargetHandler::drain_pending` returned on first delivery error without restoring the undelivered tail; everything past the failure was permanently lost. Fix preserves the tail for the next drain and a regression test pins both the resume and the prefix-not-redelivered invariant.
- Silent eviction during tail backfill: backfill could miss the `Lagged` signal under retention rollover and silently drop events. Now signals correctly during backfill.
- Index task exits permanently after `Lagged`: the tail task halted on `Lagged` and never recovered. Now clears the index, re-tails live-only with a 5/20/60/250 ms saturation backoff, and surfaces a `lag_resets()` counter so aggregating downstreams can detect lossy resets.
- Snapshot-store retention drops high-water mark on `remove`: a stale producer could re-stage older snapshots after a remove. Added a per-entity high-water table that survives `remove`. `forget()` is now `pub(crate)` so the anti-rewind invariant can't be defeated externally.
- Observable seq rollback via `next_seq()`: external readers could observe a temporarily-bumped `next_seq` mid-rollback. Now reads under the state lock.
- `new_heap` accepts `RedexFlags::INLINE`: the heap path silently accepted the inline flag, breaking invariants. Now rejected.
- `append_batch` empty-input returns plausible-looking seq (breaking): returned `0` for both empty input and the legitimate seq-0 first write. Now `Result<Option<u64>, _>`. See breaking-changes section.
- Age-retention off-by-one (breaking): boundary was `>` (entries at exact cutoff dropped); now `>=` (retained). See breaking-changes section.
- `Stop` policy halts without final `changes_tx` notify: subscribers got no signal on halt. Initial fix added `notify_waiters` + `changes_tx.send(seq)`; the broadcast was later refined to NOT emit the failing seq, since `changes_tx` is documented as carrying successfully-folded sequences.
- Cortex `changes_tx` broadcasts failing seq on Stop+non-recoverable halt: pre-fix subscribers could observe a phantom `Seq(failing_seq)`, mis-routing state. Now drops the broadcast on halt; subscribers poll `is_running()`.
- `RedexFile::Debug` deadlock footgun: `Debug` called `len()` and `next_seq()`, both of which take the state lock. Now reads only the lock-free atomics.
- `RedexIndex::clear()` on Lagged is silent: added the `lag_resets()` accessor as a public sentinel.
- `RedexIndex` saturation-resume can hot-loop: under sustained burst with an under-sized `tail_buffer_size` the loop emitted a warn per cycle. Now backed off and rate-limited.
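As an illustration of the first item in this list, tail preservation on a failed drain could look like the sketch below. `Pending` and its `drain` method are hypothetical stand-ins, not the real `MigrationTargetHandler` API; the only point is that an early return must leave the undelivered events queued.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a pending-event buffer; not the crate's real type.
pub struct Pending<T> {
    queue: VecDeque<T>,
}

impl<T> Pending<T> {
    pub fn new(items: impl IntoIterator<Item = T>) -> Self {
        Self { queue: items.into_iter().collect() }
    }

    pub fn len(&self) -> usize {
        self.queue.len()
    }

    /// Delivers events in order. On the first error, the failing event and
    /// everything after it stay queued, so the next drain resumes there and
    /// the already-delivered prefix is never redelivered.
    pub fn drain<E>(
        &mut self,
        mut deliver: impl FnMut(&T) -> Result<(), E>,
    ) -> Result<usize, E> {
        let mut delivered = 0;
        while let Some(event) = self.queue.front() {
            deliver(event)?; // an early return leaves the tail in `queue`
            self.queue.pop_front(); // remove only after a successful delivery
            delivered += 1;
        }
        Ok(delivered)
    }
}
```

A second `drain` call then starts at the event that failed, which is the resume-plus-no-redelivery pair the regression test is described as pinning.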
Bus, shards, and dispatch
- Activation-failure abort drops drain-worker scratch buffer / Batch worker abort drops in-memory `current_batch`: `.abort()` dropped events. Now graceful await + dispatch with bounded `tokio::time::timeout(2 × adapter_timeout)` so the rollback can't hang on a parked worker.
- `num_shards` decremented on rollback that never incremented it: activate-failure rollback over-decremented `num_shards` for never-activated shards. Decrement is now gated on the shard's mapper state. A targeted `remove_specific_stopped_shard` replaces the bulk `remove_stopped_shards()` so sequential `manual_scale_down` doesn't prune sibling state under itself.
- `ShardManager::activate_shard` double-counts on idempotent calls: repeated activates kept bumping `num_shards`. Now gated on the mapper's `transitioned` signal.
- `activate()` budget gate: load-then-store is safe today because the held write lock on `shards` serializes both the load and the mutation. The lock-held invariant is now documented as the correctness gate (CAS would be belt-and-braces, not strictly required).
- Shutdown drain race past `in_flight_ingests`: single zero-pass could miss late producers. Now requires two consecutive zero passes.
- `shutdown()` returns `Ok(())` after timeout-with-drops: lossy shutdown looked successful. Now surfaces via `events_dropped` + a dedicated `shutdown_was_lossy` flag.
- `drain_finalize_ready`: `Release` pairs only via implicit fence on the in-flight spin's SeqCst; promoted to SeqCst at the store site so the happens-before is explicit. Deadline-break path documented as the data-loss escape hatch.
- `PollMerger` default shard list is wrong after dynamic scale-down: polled from a stale `0..num_shards` range, missing live shards. Now uses the live shard id set, propagated through both add and remove paths.
- `poll_merger` ArcSwap leaves polls operating on stale topology: topology-snapshot semantics now documented on `poll()`.
- `per_shard_limit` silently capped at 10 000: caller had no signal. Surfaced via `truncated_at_per_shard_cap: bool` in `ConsumeResponse`.
- `has_more=true` from a stalled adapter is silently suppressed: stalled shards invisible to the caller. Now surfaced via `stalled_shards: Vec<u16>`.
- `Cursor::encode` returns empty cursor on serialization failure: empty cursor restarted polling from zero (silent rewind). Initial fix used `expect(...)`; later refined to return `Result<String, ConsumerError>` so an async `poll()` panic can't take down a runtime worker. Minor breaking change for direct callers.
- `PER_SHARD_FETCH_CAP` made public: exposed an internal tuning knob as API. Now `#[doc(hidden)]`. Read `truncated_at_per_shard_cap` instead.
- `add_events(vec![])` flushes as a side effect: load-bearing for the rollback path. Documented and pinned by `add_events_empty_can_flush_via_timeout`.
- `flush()` baseline excludes events flushed via `remove_shard_internal`: verified `events_dispatched` is bumped on stranded-flush; was already correct.
- `dispatch_batch` final attempt collapses error reasons: all retries were tagged with one collapsed error. Now structured per-attempt `reason`.
- `dispatch_batch` retry sleep has no jitter / backoff: synchronized retry storms across shards. Now jittered exponential via `retry_backoff(shard_id, attempt)`.
- `drain_finalize_ready` ordering doc: clarified that the SeqCst happens-before only covers the non-deadline exit; deadline-path stranded events are exactly the ones surfaced via `events_dropped` + `shutdown_was_lossy`.
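A jittered exponential backoff of the kind described for `dispatch_batch` might look like the following. The name `retry_backoff_sketch`, the `u16`/`u32` parameter types, and the `BASE_MS`/`CAP_MS` constants are assumptions made for illustration; the crate's actual base, cap, and jitter range are not stated in these notes.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::time::Duration;

// Illustrative constants; the real base and cap are not given in the notes.
pub const BASE_MS: u64 = 10;
pub const CAP_MS: u64 = 5_000;

/// Exponential backoff with deterministic per-shard jitter. The delay lands
/// in [ceiling / 2, ceiling], where ceiling = min(CAP_MS, BASE_MS * 2^attempt).
pub fn retry_backoff_sketch(shard_id: u16, attempt: u32) -> Duration {
    // checked_shl returns None once the shift reaches 64 bits; treat that as
    // "already past the cap" instead of wrapping.
    let factor = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    let ceiling = BASE_MS.saturating_mul(factor).min(CAP_MS);

    // Hash (shard, attempt) so different shards retrying the same attempt
    // number wake at different offsets.
    let mut hasher = DefaultHasher::new();
    (shard_id, attempt).hash(&mut hasher);

    let half = ceiling / 2;
    let jitter = hasher.finish() % (ceiling - half + 1);
    Duration::from_millis(half + jitter)
}
```

Deriving the jitter from a hash rather than a random source keeps the delay reproducible for a given `(shard_id, attempt)` pair, which makes it easy to test, at the cost of depending on the hasher's output distribution.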
Atomics, timestamps, and counters
- `pushes_since_drain_start` mismatched atomic ordering: producer used Relaxed, drain side used Acquire. Now both Acquire.
- `in_flight_ingests` is `AtomicU32` with no saturating semantics: pathological producer counts could wrap. Widened to `AtomicU64`.
- `TimestampGenerator` uses hard-coded baseline `0`: TSC delta math wrong. Now captures baseline at construction.
- `TimestampGenerator` monotonicity stalls before the documented panic: stalled spin instead of advertised panic. Now panics preemptively at `u64::MAX`.
- `velocity_samples` `VecDeque` bounded only by time, not count: burst could grow unbounded. Now also count-capped.
- Partition `next_id` reuses ID 0 on `u64::MAX` overflow: wrap-around silently re-issued IDs. Now saturates.
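One way to stop an ID counter from wrapping back to 0 is sketched below with `AtomicU64::fetch_update`. `IdAllocator` is a hypothetical type, and returning `None` at exhaustion is an assumption about what "saturates" means here; the real partition code may instead keep handing back the maximum value.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical allocator; illustrates a counter that pins at u64::MAX.
pub struct IdAllocator {
    next: AtomicU64,
}

impl IdAllocator {
    pub fn starting_at(next: u64) -> Self {
        Self { next: AtomicU64::new(next) }
    }

    /// Returns the next ID, or None once the counter has reached u64::MAX.
    /// The counter never wraps, so ID 0 cannot be issued a second time.
    pub fn next_id(&self) -> Option<u64> {
        self.next
            // The closure returns None on overflow, which leaves the stored
            // value untouched and makes fetch_update return Err.
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| v.checked_add(1))
            .ok()
    }
}
```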
Adapters (JetStream / Redis / dedup)
- JetStream `as u16` truncates `shard_id`: values > 65 535 wrapped silently. Now rejected with `Fatal` (and `poll_shard` propagates the `Fatal` instead of log-and-skipping).
- JetStream `unwrap_or_default()` on remote JSON: malformed `r` field re-serialized as empty bytes. Now propagated as `Fatal`.
- JetStream cold-stream poll walks `fetch_limit * 10` round-trips: ~1010 RTTs per poll on cold streams. Now bails after `consecutive_not_found_cap`, gated on `first_seq == 0` so populated sparse streams (events at seq 1, 500, 1000) walk past arbitrary deletion gaps.
- JetStream `from_id` cursor `seq + 1` overflows: wrapped to 0 at `u64::MAX`, silent restart. Now `checked_add(1).unwrap_or(seq)`.
- JetStream `Fatal` drops accumulated batch in `poll_shard`: documented; acceptable since `Fatal` is non-retryable.
- Redis `is_healthy` PING has no enforced timeout: could hang indefinitely. Now wrapped in `command_timeout`.
- Redis & JetStream `limit + 1` overflow on adversarial limits: wrapped to 0, silent under-delivery. Now `saturating_add(1)`.
- `RedisStreamDedup::new` accepts unbounded capacity: clamped at `MAX_CAPACITY = 1<<24`.
- `RedisStreamDedup` is FIFO eviction, not LRU as documented: docs were wrong. Updated to describe FIFO accurately.
- `dedup_state` silently swallows fsync failures: `let _ = f.sync_all()` ignored disk-full errors. Propagated; cross-platform fixed via single writable handle (`File::open` returned read-only on Windows; `FlushFileBuffers` failed silently).
- `dedup_state::create_new(true)` poison after crash: a stale tempfile from a crashed prior run could break every subsequent save. Added `fs::remove_file(&tmp).ok()` before `create_new`.
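The three integer fixes in this list share one pattern: replace a wrapping or truncating operation with one that either clamps or fails loudly. The helper names below are hypothetical, the `String` error stands in for the adapters' `Fatal` error, and the "over-fetch by one" reading of `limit + 1` is an assumption.

```rust
use std::convert::TryFrom;

/// Next sequence to request after `seq`. At u64::MAX this stays at `seq`
/// instead of wrapping to 0 and silently restarting the stream.
pub fn next_cursor_seq(seq: u64) -> u64 {
    seq.checked_add(1).unwrap_or(seq)
}

/// Fetch size for a caller-supplied limit plus one extra item, clamped at
/// usize::MAX so an adversarial limit cannot wrap the request to 0.
pub fn fetch_size(limit: usize) -> usize {
    limit.saturating_add(1)
}

/// Narrows a wide shard id to u16, rejecting values that do not fit rather
/// than truncating them the way `as u16` would.
pub fn shard_id_u16(raw: u64) -> Result<u16, String> {
    u16::try_from(raw).map_err(|_| format!("shard id {} exceeds u16::MAX", raw))
}
```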
Security & permissions
- `ttl_seconds = 0` token mints expired: born-expired tokens with no diagnostic to the issuer. `try_issue` returns `TokenError::ZeroTtl`.
- `Identity::issue_token` panic on `Duration::ZERO`: first fix routed through `try_issue.expect(...)`, which still aborted the process with a misleading "ReadOnly" message. Now soft-clamps to 1 second, `debug_assert!`s in dev builds, and the wrapper's panic messages match each `try_issue` variant precisely.
- `PermissionToken::issue` panic message misattributes ZeroTtl as ReadOnly: fixed in tandem with the above.
- Anti-replay window cleared on large legitimate jumps: whole bitmap zeroed silently. Now emits a structured warn before zeroing.
- `OriginStamp` has no per-packet binding: threat model documented.
- Untrusted-input panics in subnet config: added `try_*` fallible constructors for SDK callers.
- Channel decoder accepts trailing bytes on UNSUBSCRIBE/ACK: decoder now requires `cur.remaining() == 0` after the channel name + token.
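The trailing-bytes check in the last item can be sketched on a plain byte slice. The frame layout used here (one length byte, a UTF-8 channel name, an 8-byte big-endian token) is invented for the example and is not the real wire format; `DecodeError` and `decode_unsubscribe` are likewise hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum DecodeError {
    Truncated,
    InvalidName,
    TrailingBytes(usize),
}

/// Hypothetical layout: [name_len: u8][name: UTF-8][token: 8 bytes, big-endian].
pub fn decode_unsubscribe(buf: &[u8]) -> Result<(String, u64), DecodeError> {
    let (&name_len, rest) = buf.split_first().ok_or(DecodeError::Truncated)?;
    let name_len = name_len as usize;
    if rest.len() < name_len + 8 {
        return Err(DecodeError::Truncated);
    }
    let (name, rest) = rest.split_at(name_len);
    let (token, rest) = rest.split_at(8);

    // The frame must end exactly after the token; leftover bytes are rejected
    // instead of being silently ignored.
    if !rest.is_empty() {
        return Err(DecodeError::TrailingBytes(rest.len()));
    }

    let name = std::str::from_utf8(name)
        .map_err(|_| DecodeError::InvalidName)?
        .to_owned();
    let mut token_bytes = [0u8; 8];
    token_bytes.copy_from_slice(token);
    Ok((name, u64::from_be_bytes(token_bytes)))
}
```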
Bindings (Node, Python, Go, C)
- Node binding `u32 → u8` truncation on member index: `as u8` silently truncated > 255. Switched to `try_into` with explicit `> 255` rejection.
- Python bindings hold GIL across blocking compute ops: `scale_to`, `on_node_failure`, `sync_standbys`, `promote` blocked the GIL during long ops. Now release via PyO3 0.28's `py.detach`.
- Node-binding groups carry an unused `kind: String` field: removed dead field.
- `RedisStreamDedup` stripped from generated Node binding surface: a regen-without-redis-feature dropped the class from `index.d.ts` and `index.js`. Re-ran NAPI generation with `--features redis,…`.
- Python parity test for `append_batch([])` returns `None`: added so future binding regenerations don't silently drop the contract.
- `include_str!` of `go/net.h` escapes the crate root: broke `cargo publish` and out-of-repo vendoring. Copied to in-crate `include/net.go.h` and updated the parity test.
- C SDK README: fixed stale references to a removed `bindings/go/net/net.h` path.
- `Runtime::block_on` from `extern "C"` shims unwinds across FFI: reentrancy hazard documented.
Behavior rules & evaluators
- Lossy `as_f64` for all numeric ordering in rules: big i64/u64 values lost precision through f64. Now compares i64/u64 directly with sign-aware mixed-type fallback.
- `compare_numbers` brittle with `serde_json/arbitrary_precision`: a transitive dep enabling that feature would silently make rules fail closed. Added `debug_assert!` so the misuse is loud in dev.
- Non-deterministic verdict ordering: `window_failures` ordering depended on iteration order. Now sorts and dedups for determinism.
- `record_execution` window-reset across rule reload: counters mis-reset for non-rate-limited rules. Now skipped for those.
- Stream tight-loop spin: zero `poll_interval` spun the loop. Clamped to non-zero.
- Stream backoff overflow on absurd `poll_interval`: doubling overflowed. Now saturating.
- `Rule::new` lossily casts `u128` millis to `u64`: long uptimes truncated. Now uses saturating `u64::try_from`.
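To see why routing integer comparisons through f64 is lossy, and what a sign-aware mixed-type comparison can look like, consider the sketch below. `Num` and `compare_exact` are hypothetical names and do not reproduce the rule engine's `compare_numbers`.

```rust
use std::cmp::Ordering;

/// Hypothetical integer value as a rule might see it after JSON parsing.
#[derive(Clone, Copy, Debug)]
pub enum Num {
    I(i64),
    U(u64),
}

/// Compares two integers exactly, without a detour through f64.
pub fn compare_exact(a: Num, b: Num) -> Ordering {
    match (a, b) {
        (Num::I(x), Num::I(y)) => x.cmp(&y),
        (Num::U(x), Num::U(y)) => x.cmp(&y),
        // Mixed signedness: a negative i64 is below every u64; a non-negative
        // i64 always fits in u64, so that cast is lossless.
        (Num::I(x), Num::U(y)) => {
            if x < 0 { Ordering::Less } else { (x as u64).cmp(&y) }
        }
        (Num::U(x), Num::I(y)) => {
            if y < 0 { Ordering::Greater } else { x.cmp(&(y as u64)) }
        }
    }
}
```

For example, `u64::MAX` and `u64::MAX - 1` both round to the same f64 (2^64), so an f64-based ordering reports them as equal, while `compare_exact` orders them correctly.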
Compute (daemons + migration)
- Migration `next_seq` overflow: `replayed_through + 1` could panic at `u64::MAX`. Now `saturating_add`.
- DashMap entry guard held across registry I/O: `start_snapshot` held the entry guard across user-supplied snapshot code, deadlock-prone. Drops the guard before I/O. Two racing starts produce two `MeshDaemon::snapshot()` calls; non-idempotent daemons must single-flight at their layer (documented).
- `on_node_recovery` does not break after first matching partition: documented as intentional for overlapping partitions.
Mesh transport & packet codec
- Silent `event_count` truncation in packet builder: builder accepted oversized batches and truncated. Now rejects with explicit error.
- `StreamWindow.decode` unbounded `total_consumed`: consumer-side clamp was already enforced; documented.
- Modulo bias in equal-weight candidate selection: `hash % len` biased low for non-power-of-2. Now Lemire's `(hash * len) >> 64`.
- `cpus.saturating_mul(2)` caps `max_shards: u16` at 65 535: documented as intentional.
- `mapper.rs` cooldown check + scale mutation atomicity: RwLock-implicit serialization documented.
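The multiply-shift reduction named in the candidate-selection item needs a 128-bit intermediate in Rust. A minimal sketch follows; `pick_index` is a hypothetical name, and this is only the multiply-shift step without the rejection loop from Lemire's full method.

```rust
/// Maps a uniformly distributed 64-bit hash onto 0..len without using `%`.
/// Returns None for an empty candidate list.
pub fn pick_index(hash: u64, len: usize) -> Option<usize> {
    if len == 0 {
        return None;
    }
    // hash / 2^64 is a fraction in [0, 1); scaling it by len and taking the
    // integer part (the high 64 bits of the product) gives an index below len.
    let wide = (hash as u128) * (len as u128);
    Some((wide >> 64) as usize)
}
```

Without a rejection step, some indices still receive one more hash value than others when `len` does not divide 2^64, but those indices are spread across the range rather than clustered at the low end.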
SDK & error surface
- `SdkError::Ingestion(String)` flattens structured `IngestionError`: backpressure / sampled / unrouted all funnelled through one stringly-typed variant. Routed to structured `Sampled` / `Unrouted` / `Backpressure`. Breaking; see breaking-changes section.
- `SdkError` enum is breaking and not `#[non_exhaustive]`: added `#[non_exhaustive]` so future variant additions are minor-version changes.
- `NetBuilder::identity()` silently overrides `entity_keypair`: builder accepted both fields and silently dropped one; now rejects the conflict at build time.
- `NetAdapterConfig::validate` accepts pathological values: added upper bounds + heartbeat floor.
- `Drop` releases shutdown gates synchronously while workers hold `Arc<Self>`: no partial-destruction UB; documented.
Test hygiene
- `MigrationTargetHandler::drain_pending` regression test: strengthened to also assert the prefix is NOT redelivered.
- `add_events_empty_can_flush_via_timeout`: pins that empty input flushes after `max_delay`. Load-bearing for the rollback path.
- `retry_backoff` jitter test: relaxed from `>= 8 / 16` to `>= 4 / 16` to stay robust against `DefaultHasher` distribution drift across toolchain versions.
- `debug_does_not_acquire_state_lock`: pins the lock-free `Debug` invariant by holding `state.lock()` across `format!("{:?}", file)`.
- `stop_policy_does_not_broadcast_failing_seq`: pins the cortex broadcast contract.
- `cold_stream_bail_gate_only_fires_when_first_seq_is_zero`: pins the JetStream sparse-stream gate.
Breaking changes
Rust core (net crate)
RedexFile::append_batch signature changed
append_batch and append_batch_ordered now return Result<Option<u64>, RedexError> instead of Result<u64, RedexError>.
Why: the prior shape returned Ok(0) for an empty batch, which collided with the legitimate "first event of a non-empty batch landed at seq 0" return — callers couldn't distinguish "I appended nothing" from "I appended one event at seq 0".
Migrate:
```rust
// Before
let first_seq: u64 = file.append_batch(&payloads)?;
// After
let first_seq: Option<u64> = file.append_batch(&payloads)?;
```

Same change cascaded through OrderedAppender::append_batch and TypedRedexFile::append_batch.
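Call sites that previously used the returned sequence directly now have to handle the empty case. The sketch below uses a hypothetical in-memory `MemLog` that mirrors only the new return shape, with `String` standing in for `RedexError`; it is not the real `RedexFile`.

```rust
/// Hypothetical in-memory stand-in that mirrors only the new return shape.
pub struct MemLog {
    entries: Vec<Vec<u8>>,
}

impl MemLog {
    pub fn new() -> Self {
        Self { entries: Vec::new() }
    }

    /// Ok(None) for an empty batch; otherwise Ok(Some(seq)) carrying the
    /// sequence assigned to the first payload of the batch.
    pub fn append_batch(&mut self, payloads: &[Vec<u8>]) -> Result<Option<u64>, String> {
        if payloads.is_empty() {
            return Ok(None);
        }
        let first = self.entries.len() as u64;
        self.entries.extend_from_slice(payloads);
        Ok(Some(first))
    }
}

/// The two cases that used to collide on `0` are now distinct.
pub fn describe(result: Option<u64>) -> String {
    match result {
        Some(seq) => format!("batch starts at seq {}", seq),
        None => "nothing appended".to_string(),
    }
}
```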
Retention boundary semantics
Age-based retention now uses >= instead of > for the cutoff. An entry whose timestamp equals the cutoff exactly is retained (was: evicted).
Why: the original > comparison was off-by-one — entries on the boundary lasted strictly less than the configured retention_max_age. Production deployments with tight age caps observed events expiring one tick early.
Migrate: no source change required, but tests that asserted exact-boundary entries were evicted will now fail. Update assertions to expect retention.
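The boundary change can be pictured with a small predicate. This sketch assumes an entry is kept when its timestamp is at or after `now - max_age`; `retain_by_age` and the plain `u64` timestamps are illustrative and not the crate's retention code.

```rust
/// Returns the timestamps that survive an age-based sweep at `now`.
/// Assumption: an entry is kept when `ts >= now - max_age`.
pub fn retain_by_age(timestamps: &[u64], now: u64, max_age: u64) -> Vec<u64> {
    let cutoff = now.saturating_sub(max_age);
    timestamps
        .iter()
        .copied()
        // v0.9 semantics: `>=` keeps the entry sitting exactly on the cutoff.
        // The pre-v0.9 comparison corresponds to `ts > cutoff`.
        .filter(|&ts| ts >= cutoff)
        .collect()
}
```

With `now = 100` and `max_age = 10`, the entry at timestamp 90 sits exactly on the cutoff: it is kept under `>=` and would have been evicted under `>`.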
Cursor::encode returns Result
CompositeCursor::encode now returns Result<String, ConsumerError> instead of String. Affects callers using the type directly; EventBus::poll() already handles the new shape.
Migrate: append .unwrap() (in tests) or ? (in production) to existing call sites.
PollMerger::new signature
PollMerger::new takes Vec<u16> of active shard IDs instead of num_shards: u16. This is an internal-leaning type but pub; downstream wrappers may need to update.
ConsumeResponse struct fields
Added truncated_at_per_shard_cap: bool and stalled_shards: Vec<u16>. Callers that construct ConsumeResponse directly need to populate the new fields. Pattern matches with .. unaffected.
PER_SHARD_FETCH_CAP is #[doc(hidden)]
Still pub const (reachable by path), but no longer documented as API. Callers checking truncation should read ConsumeResponse::truncated_at_per_shard_cap instead of comparing against the constant.
SnapshotStore::forget is pub(crate)
Was pub. The function defeats the high-water-mark anti-rewind invariant — exposing it publicly let any caller stage stale snapshots over fresh ones. No production callers existed; only test code referenced it.
Rust SDK (net-sdk)
SdkError is #[non_exhaustive] + new variants
SdkError now carries the #[non_exhaustive] attribute. Two new variants moved out of the stringly-typed Ingestion(String) fallback:
- `Sampled`: event deliberately dropped by a sampling / decimation policy. Retry is pointless.
- `Unrouted`: no routable shard for the event (typically a topology-transient state). Retry once topology stabilizes.
From<IngestionError> now routes IngestionError::Sampled and IngestionError::Unrouted to these structured variants. Code that string-matched on the content of Ingestion(String) for those causes silently stops matching.
Migrate:
```rust
// Match arms now must include a wildcard
match err {
    SdkError::Backpressure => { /* drop or retry */ }
    SdkError::Sampled => { /* accept the drop */ }
    SdkError::Unrouted => { /* retry after topology stabilizes */ }
    SdkError::NotConnected => { /* peer gone */ }
    _ => { /* future-proof catch-all */ }
}
```

If you were substring-matching on Ingestion(...) for "sampled" or "no shard", switch to the structured variants.
Identity::issue_token no longer panics on Duration::ZERO
Previously the panicking convenience wrapper aborted with a misleading "public-only keypair" message when ttl == Duration::ZERO. It now soft-clamps to 1 second and debug_assert!s in dev builds, so the misuse surfaces in tests but doesn't take down the process in release.
Identity::try_issue_token (the explicit fallible surface) still rejects zero-TTL with TokenError::ZeroTtl — bindings route through it.
Migrate: nothing strictly required. Tests that exercised the panic with #[should_panic(expected = "public-only keypair")] need updating — the new debug-assert message contains "Duration::ZERO".
Bindings
| Binding | Change |
|---|---|
| Node | appendBatch(...) returns bigint \| null (was bigint). Empty input → null. |
| Python | append_batch(...) returns int \| None (was int). Empty input → None. |
| Node | RedisStreamDedup class is back on the binding surface (it had been stripped by an earlier feature-incomplete regen — not a breaking change for downstream npm consumers, just a regression repaired). |
| Go | IssueToken{TTLSeconds: 0} returns a non-nil error (was: same — surfaced from FFI's try_issue path). No source change. |
Behavioral fixes that may surface as test breakage
These aren't strictly API-breaking, but if your test suite asserted the old behavior they will need updating:
- `num_shards` rollback: `add_shard` + failed `activate_shard` + rollback no longer over-decrements `num_shards`. Tests that expected the off-by-one will fail.
- JetStream sparse-stream polling: `poll_shard` no longer breaks early on 64 consecutive `NotFound`s when `info()` reported a populated stream (`first_seq > 0`). Tests on populated sparse streams that asserted early-bail behavior will see longer walks.
- Cortex `changes_with_lag` halt path: on `Stop` + non-recoverable error the failing seq is no longer broadcast on `changes_tx`. Subscribers need to poll `is_running()` to detect halt; pre-fix they could have observed (incorrectly) a `ChangeEvent::Seq(failing_seq)`.
- `RedexFile::Debug`: no longer acquires the state mutex; reads only the lock-free atomics. Output format changed (`next_seq_atomic` field name; `len` removed).
- `SnapshotStore::store`: equal-seq concurrent-store linearization is now documented to be on the snapshots-side entry guard, not on the high-water mark. Behavior unchanged; doc clarified.
How to upgrade
- Bump your `Cargo.toml` / `package.json` / `requirements.txt` / `go.mod` to the v0.9 line.
- Recompile. The signature changes (`append_batch` → `Result<Option<u64>>`, `Cursor::encode` → `Result`, `SdkError` `#[non_exhaustive]`) will surface as compile errors at the exact call sites that need updating; follow the Migrate snippets above.
- If you have tests that assert pre-fix behavior on the items in "Behavioral fixes that may surface as test breakage", update those assertions.
- Bindings consumers (Node / Python): no source change is required (the type-stub updates are forward-compatible), but treat the new `null` / `None` empty-input returns as the canonical "I appended nothing" signal in your call sites.
- Re-run your full suite. The lib + binding suites run green; if your suite covers integration paths not exercised by the audit, this is the right release to catch any drift.
Released 2026-05-02.
License
See LICENSE.