Module xoshiro_simd


SIMD-batched fill_bytes (feature simd).

The SIMD-batched Xoshiro256++ implementation is architecture-conditional (NEON on aarch64, AVX2 on x86_64). It is excluded from coverage measurement via .tarpaulin.toml because a single-platform tarpaulin run can never observe both halves. Instead, it is validated by the dedicated simd CI matrix job, which runs cargo test --features simd on both ubuntu-latest and macos-latest.

Holds K independent Xoshiro256++ states (K = 2 on AArch64 NEON, K = 4 on x86_64 AVX2) in SIMD registers and advances all of them in one inner-loop iteration. Each fill_bytes call derives the K lane states by SplitMix64-whitening the scalar generator’s state with a distinct lane-specific constant. This is cheap (~10 ns of setup) and gives statistically independent lanes by construction. The scalar state is then advanced by the equivalent number of next_u64 calls, so that subsequent scalar calls remain consistent with the scalar-only path.
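The derivation described above can be sketched in portable Rust, with the K lanes emulated by plain arrays instead of SIMD registers. This is an illustration under stated assumptions, not the module's actual code: the LANE_CONSTANTS values, the per-word whitening scheme, the round-robin byte layout, and the names derive_lanes and fill_bytes_batched are all hypothetical, and "equivalent number of next_u64 calls" is assumed to mean one call per 8 output bytes, rounded up.

```rust
/// One SplitMix64 step: adds the golden-ratio increment to `x`,
/// then returns the mixed (whitened) value.
fn splitmix64(x: &mut u64) -> u64 {
    *x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *x;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Scalar Xoshiro256++ step.
fn xoshiro256pp_next(s: &mut [u64; 4]) -> u64 {
    let result = s[0].wrapping_add(s[3]).rotate_left(23).wrapping_add(s[0]);
    let t = s[1] << 17;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = s[3].rotate_left(45);
    result
}

/// ASSUMPTION: arbitrary distinct constants chosen for this sketch;
/// the module's real lane constants are not shown in its docs.
const LANE_CONSTANTS: [u64; 4] = [
    0x9E37_79B9_7F4A_7C15,
    0xBF58_476D_1CE4_E5B9,
    0x94D0_49BB_1331_11EB,
    0xD6E8_FEB8_6659_FD93,
];

/// Derives K lane states (1 <= K <= 4) by whitening each word of the
/// scalar state, XORed with that lane's constant, through SplitMix64.
fn derive_lanes<const K: usize>(scalar: &[u64; 4]) -> [[u64; 4]; K] {
    let mut lanes = [[0u64; 4]; K];
    for (lane, state) in lanes.iter_mut().enumerate() {
        for (word, out) in state.iter_mut().enumerate() {
            let mut x = scalar[word] ^ LANE_CONSTANTS[lane];
            *out = splitmix64(&mut x);
        }
    }
    lanes
}

/// Fills `dest` from K lanes in round-robin order, then advances the
/// scalar state by one next_u64 per 8 bytes produced (rounded up).
fn fill_bytes_batched<const K: usize>(scalar: &mut [u64; 4], dest: &mut [u8]) {
    let mut lanes: [[u64; 4]; K] = derive_lanes(scalar);
    for (i, chunk) in dest.chunks_mut(8).enumerate() {
        let word = xoshiro256pp_next(&mut lanes[i % K]);
        chunk.copy_from_slice(&word.to_le_bytes()[..chunk.len()]);
    }
    for _ in 0..(dest.len() + 7) / 8 {
        xoshiro256pp_next(scalar);
    }
}
```

Because the SplitMix64 step is a bijection on its input, distinct lane constants give distinct lane states in this sketch. A real SIMD implementation would advance all K lanes with vector instructions rather than one lane per loop iteration.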

An earlier draft used crate::xoshiro::Xoshiro256PlusPlus::jump for 2¹²⁸-step separation per lane, but at 256 scalar next_u64 steps per jump call, its ~256 ns setup wiped out the SIMD win for buffers under ~4 KiB. The SplitMix derivation keeps lanes uncorrelated at a fraction of the cost: the probability of a state collision is ≤ K²/2²⁵⁶, which is negligible.

§Reproducibility contract

The same seed produces a different byte stream on the scalar path than on the SIMD path. This is fundamental: there is no correctness-preserving way to interleave K independent Xoshiro generators into the same sequence that a single scalar generator would produce. Code that depends on bit-for-bit reproducibility across feature sets must use the scalar path.

Statistical quality is unchanged: each lane is a full Xoshiro256++ and inherits all of its properties.
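The contract can be illustrated with a small portable sketch that compares one scalar generator against two derived lanes interleaved word by word. The lane constants, the whitening scheme, and the names scalar_stream and two_lane_stream are assumptions made for this example; they are not taken from the module.

```rust
/// SplitMix64 output function applied to a single input value.
fn splitmix64_mix(z: u64) -> u64 {
    let mut z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Scalar Xoshiro256++ step.
fn next_u64(s: &mut [u64; 4]) -> u64 {
    let result = s[0].wrapping_add(s[3]).rotate_left(23).wrapping_add(s[0]);
    let t = s[1] << 17;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = s[3].rotate_left(45);
    result
}

/// Scalar path: `n` consecutive outputs of a single generator.
fn scalar_stream(seed: [u64; 4], n: usize) -> Vec<u64> {
    let mut s = seed;
    (0..n).map(|_| next_u64(&mut s)).collect()
}

/// SIMD-like path: two lanes derived from the same seed, with their
/// outputs interleaved lane 0, lane 1, lane 0, ...
fn two_lane_stream(seed: [u64; 4], n: usize) -> Vec<u64> {
    // ASSUMPTION: arbitrary lane constants chosen for this sketch.
    const LANE: [u64; 2] = [0x9E37_79B9_7F4A_7C15, 0xD6E8_FEB8_6659_FD93];
    let mut lanes = [seed, seed];
    for (lane, c) in lanes.iter_mut().zip(LANE.iter()) {
        for w in lane.iter_mut() {
            *w = splitmix64_mix(*w ^ *c);
        }
    }
    (0..n).map(|i| next_u64(&mut lanes[i % 2])).collect()
}
```

Both functions are deterministic for a given seed, so each path is reproducible on its own. Only a comparison across the two paths differs, which is why golden-value tests should pin one path explicitly.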

Functions§

fill_bytes: x86_64 dispatch. Prefers AVX2 if the CPU supports it; otherwise falls back to the scalar path.
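Runtime dispatch of this kind is commonly written with the standard library's is_x86_feature_detected! macro. The sketch below shows only the selection step; FillPath and select_fill_path are hypothetical names introduced for illustration, and the module's actual dispatch code may be structured differently.

```rust
/// Hypothetical tag for the implementation chosen at runtime.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FillPath {
    Avx2,
    Scalar,
}

/// Picks AVX2 when compiled for x86_64 and the running CPU reports
/// support for it; every other case falls back to the scalar path.
pub fn select_fill_path() -> FillPath {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return FillPath::Avx2;
        }
    }
    FillPath::Scalar
}
```

The check happens at runtime, so a single x86_64 binary can run on CPUs with and without AVX2. On other architectures the cfg-gated block is compiled out and the function always returns FillPath::Scalar.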