SIMD-batched fill_bytes (feature simd).
Architecture-conditional (NEON on aarch64, AVX2 on x86_64);
excluded from coverage measurement via .tarpaulin.toml because
a single-platform tarpaulin run can never observe both halves.
Validated by the dedicated simd CI matrix job that runs
cargo test --features simd on both ubuntu-latest and
macos-latest.
SIMD-batched Xoshiro256++ for fill_bytes.
Holds K independent Xoshiro256++ states (K = 2 on AArch64 NEON,
K = 4 on x86_64 AVX2) in SIMD registers and advances all of them in
one inner-loop iteration. Each fill_bytes call derives the K lane
states by SplitMix64-whitening the scalar generator’s state with a
distinct lane-specific constant. This is cheap (~10 ns of setup) and
yields statistically independent lanes by construction. The scalar state
is advanced by the equivalent number of next_u64 calls so that
subsequent scalar calls remain consistent with the scalar-only
path.
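The lane derivation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the crate's actual code: `derive_lanes`, the lane-constant schedule (multiples of the SplitMix64 increment), and the per-word offset are hypothetical. Only the SplitMix64 mixer constants are the standard ones.

```rust
/// Output mixer of SplitMix64 (standard constants). Each step is
/// invertible, so distinct inputs always give distinct outputs.
fn splitmix64_mix(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical helper: derive K lane states from the scalar state by
/// whitening each state word with a lane-specific constant.
fn derive_lanes<const K: usize>(scalar: &[u64; 4]) -> [[u64; 4]; K] {
    // Assumed constant schedule: odd multiplier, so lane constants differ.
    const GOLDEN: u64 = 0x9E37_79B9_7F4A_7C15;
    let mut lanes = [[0u64; 4]; K];
    for (lane, state) in lanes.iter_mut().enumerate() {
        let lane_const = GOLDEN.wrapping_mul(lane as u64 + 1);
        for (w, word) in state.iter_mut().enumerate() {
            // The word offset keeps equal scalar words from mapping
            // to equal lane words.
            *word = splitmix64_mix(scalar[w] ^ lane_const.wrapping_add(w as u64));
        }
    }
    lanes
}
```

Because the mixer is a bijection and the lane constants are distinct, two lanes derived from the same scalar state can never receive identical states in this sketch.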
An earlier draft used crate::xoshiro::Xoshiro256PlusPlus::jump
for 2¹²⁸-step separation per lane, but a jump runs 256 scalar
next_u64 steps, and its ~256 ns setup per call wiped out the SIMD
win for buffers under ~4 KiB. The SplitMix derivation keeps lanes
uncorrelated (the probability of a state collision is ≤ K²/2²⁵⁶,
which is negligible) at a fraction of the cost.
§Reproducibility contract
The same seed produces a different byte stream on the scalar path than on the SIMD path. This is fundamental: there is no correctness-preserving way to interleave K independent Xoshiro256++ generators into the sequence that a single scalar generator would produce. Code that depends on bit-for-bit reproducibility across feature sets must use the scalar path.
Statistical quality is unchanged: each lane is a full Xoshiro256++ and inherits all of its properties.
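The divergence between the two paths can be seen in a portable, scalar emulation of lane interleaving. This is a sketch only: `fill_interleaved`, the round-robin output layout, the little-endian byte order, and the lane states are all assumptions chosen for illustration, and the real implementation keeps the lanes in SIMD registers rather than in an array.

```rust
/// One step of the reference Xoshiro256++ algorithm.
fn next_u64(s: &mut [u64; 4]) -> u64 {
    let result = s[0].wrapping_add(s[3]).rotate_left(23).wrapping_add(s[0]);
    let t = s[1] << 17;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = s[3].rotate_left(45);
    result
}

/// Hypothetical emulation of the batched loop: each 8-byte chunk of
/// `dest` is taken from the next lane in round-robin order.
/// With K = 1 this degenerates to a plain scalar fill.
fn fill_interleaved<const K: usize>(lanes: &mut [[u64; 4]; K], dest: &mut [u8]) {
    let mut lane = 0;
    for chunk in dest.chunks_mut(8) {
        let bytes = next_u64(&mut lanes[lane]).to_le_bytes();
        // The final chunk may be shorter than 8 bytes.
        chunk.copy_from_slice(&bytes[..chunk.len()]);
        lane = (lane + 1) % K;
    }
}
```

Filling 16 bytes with one lane at state `[1, 2, 3, 4]` and with two lanes at `[1, 2, 3, 4]` and `[5, 6, 7, 8]` gives the same first 8 bytes but different bytes 8..16: the single lane emits its own second output there, while the two-lane fill emits the first output of the second lane.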
Functions§
- fill_bytes - x86_64 dispatch: prefer AVX2 if the CPU supports it; else scalar.
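A runtime dispatch of this kind is commonly written with the standard library's `is_x86_feature_detected!` macro. The sketch below shows only the selection step. The `Backend` enum and `select_backend` function are hypothetical names, not part of this crate's API.

```rust
/// Hypothetical tag for the code path chosen at runtime.
#[allow(dead_code)] // `Avx2` is never constructed on non-x86_64 targets.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Avx2,
    Scalar,
}

/// Prefer AVX2 when the running CPU reports support; otherwise scalar.
fn select_backend() -> Backend {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime CPU feature check provided by the standard library.
        if std::is_x86_feature_detected!("avx2") {
            return Backend::Avx2;
        }
    }
    Backend::Scalar
}
```

Because the check happens at runtime, a single x86_64 binary can run on CPUs with and without AVX2. On other architectures the `cfg` block is compiled out and this sketch always returns `Backend::Scalar`.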