Show HN: Shishua - Fast pseudo-random generator


SOURCE: https://espadrine.github.io/blog/posts/shishua-the-fastest-prng-in-the-world.html
Summary

cpb is the number of CPU cycles spent to generate a byte of output. […] or become widely used by the major websites of the world.

To improve your cpb, you can do three things: generate more bytes for the same amount of work, do less work to generate the same number of bytes, or parallelize the work. We will do all of the above.

Therefore, to start with point 1, we need to output more bits on each iteration. I am worried that people might say, “this is not a PRNG unless it outputs 32-bit numbers,” or “64-bit numbers”. […] And today, Intel and AMD CPUs support 256-bit operations through AVX2. Just as RC4 outputs 1 byte at a time and drand48 can only output 4, we will output 32 bytes at a time. Obviously, while 8 bytes could be output as a 64-bit number, […] for reasons that will slowly become obvious over the course of the article.

It looks like this: […] Let’s dive in line by line.

Our state is cut into two pieces that each fit in an AVX2 register (256 bits). We keep the output around in the state to gain a bit of speed, but it is not actually part of the state. We also have a 64-bit counter; it too is kept in an AVX2 register to ease computation. […] or within the right 128 bits if they started on the right.

Here are the shuffle constants: […] To make the shuffle really strengthen the output, we move weak (low-diffusion) 32-bit parts into irreducible combinations of XOR and AND expressions across 64-bit positions. Storing the result of the addition in the state keeps that diffusion permanently.

So, where do we get the output from? Easy: the structure we built is laid out in such a way that […]

[…] but an odd increment goes through all integers. We use a different odd number for each 64-bit number in the state, which is enough entropy to start and stay strong. Then we do the following a ROUNDS number of times: […] Setting the state to the output increases the diffusion of the state.

[…] coming out of the PractRand PRNG quality tool.

Speed measurement benchmarks are tricky for so many reasons. I use a dedicated CPU instruction, RDTSC, which returns a cycle count. To make sure that everyone can reproduce my results, I use a cloud virtual machine.
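The remark that an odd increment goes through all integers can be checked exhaustively on a scaled-down counter. This sketch uses a hypothetical helper, `increment_period`, that wraps at 2^bits instead of 2^64 (exhaustive checking at 64 bits is impractical, but the same argument holds there): it counts how many additions it takes for the counter to come back to 0.

```c
#include <stdint.h>

/* Hypothetical helper: counts steps until x -> (x + inc) mod 2^bits
   first returns to 0, starting from x = 0. Since x_k = k * inc mod 2^bits,
   every value before that return is distinct, so a period of 2^bits means
   every integer in [0, 2^bits) is visited.
   Valid for 1 <= bits <= 24 (the loop is exhaustive, so keep it small). */
static uint32_t increment_period(uint32_t inc, unsigned bits) {
    const uint32_t mask = (1u << bits) - 1u;
    uint32_t x = 0, steps = 0;
    do {
        x = (x + inc) & mask; /* add, wrapping at 2^bits */
        steps++;
    } while (x != 0);
    return steps;
}
```

With an 8-bit counter, increments 1 and 7 give a period of 256, while 2 gives only 128 and 4 gives 64; in general the period is 2^bits / gcd(inc, 2^bits), which equals 2^bits exactly when the increment is odd.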
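A cpb figure is simply cycles elapsed divided by bytes produced. The sketch below shows one way to take such a measurement with RDTSC through the `__rdtsc()` intrinsic that GCC and Clang expose in `<x86intrin.h>`. The generator being timed is a hypothetical stand-in (splitmix64), not SHISHUA, and `measure_cpb` and `fill_bytes` are illustrative names. Note that on many modern x86 CPUs the time-stamp counter ticks at a constant reference frequency rather than at the current core clock, which is one of the reasons such benchmarks are tricky.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h> /* __rdtsc() on GCC and Clang */
#define HAVE_RDTSC 1
#endif

/* cpb: cycles spent per generated byte. */
static double cycles_per_byte(uint64_t cycles, uint64_t bytes) {
    return bytes ? (double)cycles / (double)bytes : 0.0;
}

/* Hypothetical stand-in generator (splitmix64), writing 8 bytes per step.
   Swap in the PRNG you actually want to measure. */
static uint64_t sm_state = 0x9e3779b97f4a7c15ULL;

static void fill_bytes(uint8_t *buf, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        uint64_t z = (sm_state += 0x9e3779b97f4a7c15ULL);
        z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
        z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
        z ^= z >> 31;
        size_t k = (n - i < 8) ? n - i : 8; /* handle a short tail */
        memcpy(buf + i, &z, k);
    }
}

/* Times one fill of n bytes with RDTSC and returns its cpb,
   or -1.0 where RDTSC is unavailable (non-x86 builds). */
static double measure_cpb(size_t n) {
#ifdef HAVE_RDTSC
    if (n == 0) return 0.0;
    uint8_t *buf = malloc(n);
    if (buf == NULL) return -1.0;
    uint64_t start = __rdtsc();
    fill_bytes(buf, n);
    uint64_t end = __rdtsc();
    volatile uint8_t sink = buf[n - 1]; /* keep the generated bytes observable */
    (void)sink;
    free(buf);
    return cycles_per_byte(end - start, n);
#else
    (void)n;
    return -1.0;
#endif
}
```

A single call such as `measure_cpb(1 << 20)` is noisy; repeating it and keeping the minimum or median is a common way to reduce that noise.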
[…] Impossible to parallelize; I got 7.5 cpb (cycles spent per generated byte).

Now, let’s look at a very common and fast MCG (multiplicative congruential generator): Lehmer128. […]

[…] by combining the four states pairwise: […] And that is how you get SHISHUA, and its 0.06 cpb speed. That is twice as fast as the previously fastest PRNG in the world that passes 32 TiB of PractRand. […] (Please make it so.)

It passes BigCrush and 32 TiB of PractRand without suspicion. In fact, all four of its outputs do.

One of the not-ideal aspects of the design is that SHISHUA is not reversible. You can see this with a reduction to a four-bit state, with s0 = [a, b] and s1 = [c, d].
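For comparison with SHISHUA’s 32-byte steps, a Lehmer-style MCG with a 128-bit state fits in a few lines: multiply the state by a constant modulo 2^128 and output the high 64 bits. This is a sketch under assumptions: it uses the GCC/Clang `__uint128_t` extension, the multiplier `0xda942042e4dd58b5` that appears in published Lehmer benchmarks (for example Daniel Lemire’s lehmer64), and a hypothetical seeding helper; none of these come from this page.

```c
#include <stdint.h>

/* Hypothetical Lehmer-style MCG: 128-bit state, 64-bit output.
   Requires the GCC/Clang __uint128_t extension. */
typedef struct {
    __uint128_t state;
} lehmer128_t;

/* Hypothetical seeding helper: an MCG state must never be 0, so the
   seed is shifted and the low bit forced to 1, keeping the state odd. */
static void lehmer128_seed(lehmer128_t *g, uint64_t seed) {
    g->state = ((__uint128_t)seed << 1) | 1u;
}

static uint64_t lehmer128_next(lehmer128_t *g) {
    /* x_{n+1} = a * x_n mod 2^128 (wrap-around of the 128-bit multiply) */
    g->state *= (__uint128_t)0xda942042e4dd58b5ULL;
    /* Output the high 64 bits, which are better mixed than the low ones. */
    return (uint64_t)(g->state >> 64);
}
```

Each call performs one 128-bit multiplication and yields 8 bytes, and since both the multiplier and the seeded state are odd, the state stays odd (and thus nonzero) forever.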
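“Not reversible” means the round function is not a bijection on states: two different states can lead to the same next state, so the previous state cannot be recovered. On a tiny state this can be checked by brute force. The sketch below uses a hypothetical toy round on a 4-bit state made of two 2-bit words [a, b], in the spirit of a shuffle followed by an addition (each new word is the other word plus the old word shifted right by 1, modulo 4). The shift and the word sizes are assumptions; this is not the reduction the article uses, nor SHISHUA’s actual round.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy round on a 4-bit state [a, b] of two 2-bit words.
   NOT SHISHUA's round: it only mimics "swap words, then add a shifted copy". */
static uint8_t toy_round(uint8_t s) {
    uint8_t a = (s >> 2) & 3u, b = s & 3u;
    uint8_t na = (uint8_t)((b + (a >> 1)) & 3u); /* other word + own >> 1, mod 4 */
    uint8_t nb = (uint8_t)((a + (b >> 1)) & 3u);
    return (uint8_t)((na << 2) | nb);
}

/* Counts how many of the 16 states are produced by toy_round.
   A reversible (bijective) round would produce all 16. */
static int toy_distinct_images(void) {
    bool seen[16] = { false };
    int distinct = 0;
    for (uint8_t s = 0; s < 16; s++) {
        uint8_t t = toy_round(s);
        if (!seen[t]) {
            seen[t] = true;
            distinct++;
        }
    }
    return distinct;
}
```

Here states 6 and 9 (words [1, 2] and [2, 1]) both map to 10, and 0 and 15 both map to 0, so only 14 of the 16 states are ever produced and the round cannot be undone.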
