How to Shuffle and Sample on the Command-Line (jpalardy.com)
82 points by jpalardy on Oct 16, 2015 | hide | past | favorite | 20 comments


Interesting. I did the same investigation myself a few years back, but was frustrated by the lack of the -r flag for shuf(1). It seems that's been added at some point recently (though many of my systems do not have it--GNU coreutils percolates slowly through older Debian/Ubuntu versions. :))

Good to know things are still getting better in coreutils!


Oh, nice, on Fedora 23 beta, I can simulate die rolls

     shuf -r -z -n 100 -e 1 2 3 4 5 6;echo -e "\n"
And rolls of a non-transitive (Grime) die

     shuf -r -z -n 100 -e 3 3 3 3 3 6;echo -e "\n"
Which is timely. I sometimes forget how flexible the terminal prompt is...


It's pseudo-random though, right?


Oh, I'd imagine so. Good enough for illustrative purposes and for catching any gross errors in my arithmetic when analysing the games. Not good enough for anything 'real'.


I recently discovered shuf since I needed to shuffle a fairly large number of URLs in a file to allow multiple processes to work through them in parallel (yes, I could have actually done this with a queue and a producer/consumer, but this was a one-time deal, so it was faster to just throw a bit more hardware at it). What amazed me is that 'cat urls.txt | shuf > new-urls.txt' took just about a second to complete even though the original file was about 1GB. How does it work so incredibly fast?
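Presumably shuf reads the whole file into memory and does a single Fisher–Yates pass, which is O(n) in the number of lines. A sketch of that idea in awk (illustrative only; shuf itself is C, and the `seq 1 100` input is just a stand-in):

```shell
# In-memory Fisher-Yates shuffle sketched in awk: slurp all lines,
# then walk backwards swapping each element with a random earlier one.
seq 1 100 | awk 'BEGIN { srand() }
  { a[NR] = $0 }                          # read every line into an array
  END {
    for (i = NR; i > 1; i--) {
      j = int(rand() * i) + 1             # random index in 1..i
      t = a[i]; a[i] = a[j]; a[j] = t     # swap
    }
    for (i = 1; i <= NR; i++) print a[i]
  }'
```

One pass of swaps gives every permutation equal probability, and the only per-line work after reading is an array swap, which is why the whole thing is so cheap.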


This uses a few utils and techniques to deal 5 random cards:

https://twitter.com/pixelbeat_/status/587703133717057537

    paste -d '' <(printf '%s\n' $(seq 2 9) T J Q K A | sed 'p;p;p') \
    <(yes $'H\nD\nS\nC' | head -n52) |
    shuf -n5


On Windows, Get-Random handles the basic cases nicely:

    > 1..100 | Get-Random
    72
    > 1..100 | Get-Random -count 3
    16
    96
    56
but it doesn't have something like -r to resample, nor a nice way to simply shuffle the whole collection (the workaround is to pass -count a value as large as or larger than the collection).


Nice! I've been using

    sort -R | head -n100
but obviously this requires the entire file to be shuffled before printing the first 100 lines.


-R sorts by a random hash of each line, so it will do the wrong thing in the face of repeated lines. (Either you get all instances of a repeated line or none.)
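The grouping behaviour is easy to see directly (assuming GNU sort, which is where -R/--random-sort comes from):

```shell
# GNU sort -R keys on a random hash of each line, computed once per
# invocation: identical lines hash identically and come out adjacent,
# so this prints a,a,b,b or b,b,a,a -- never interleaved.
printf 'a\na\nb\nb\n' | sort -R
```

That makes sort -R a shuffle of *distinct* lines (with duplicates travelling as a block), not a shuffle of individual lines the way shuf is.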


The -R option not being available on OS X, you might do something like

  awk "BEGIN { srand($RANDOM) } { print int(rand() * 1000000), \$0 }" | sort -n | cut -d' ' -f2-
to shuffle an input



Note that hnov's awk command is the equivalent of "sort (random order)" at that link, and shows good randomness properties in the plot. However, that link shows "sort (random comparator)" by default, which looks terrible at randomly sorting lists. hnov's awk script should be suitable for most needs, though I'd tweak it a bit:

     awk 'BEGIN { srand() } { print rand(), $0 }' | sort -g | cut -d' ' -f2-
which is shorter and allows more than 1,000,000 distinct random values, namely ~52 bits in awk implementations that use 64-bit doubles. (The single quotes matter: in double quotes the shell would expand $0 itself, and without srand() some awks produce the same "shuffle" on every run.)


How does this handle files that are gigs in size? It looks to me like you load the entire file into memory and pass it along, shuffled.



If you have an unknown input size and a fixed number k of requested output lines, you can still do it in O(n) time with memory proportional only to k, by giving each new line a steadily decreasing chance of replacing one of the lines you've kept.
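That technique is reservoir sampling; a sketch in awk (k=5 and the seq input are just for illustration):

```shell
# Reservoir sampling: keep the first k lines, then give each later
# line a k/NR chance of replacing a randomly chosen kept line.
seq 1 10000 | awk -v k=5 'BEGIN { srand() }
  NR <= k { r[NR] = $0; next }   # fill the reservoir
  {
    j = int(rand() * NR) + 1     # pick a slot in 1..NR
    if (j <= k) r[j] = $0        # replace with probability k/NR
  }
  END { for (i = 1; i <= k; i++) print r[i] }'
```

Every input line ends up in the sample with equal probability k/n, no matter how long the input turns out to be, and only k lines are ever held in memory.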


    man shuf


Yeah but what's sad is that

    $ man -k shuffle
gives

    pstops (1) - shuffle pages in a PostScript file

I wish I'd known about shuf. I've been using sort -R or Python.


It's not the documentation, it's knowing which tools exist in the first place.


    man coreutils
?

ooh topical:

    ls /usr/bin | shuf -n1 | xargs man


oh sorry, I mean

    http://linux.die.net/man/1/shuf
since if it's not on the internet, how will anyone ever find it?



