How to Shuffle and Sample on the Command-Line (jpalardy.com)
82 points by jpalardy on Oct 16, 2015 | hide | past | favorite | 20 comments


Interesting. I did the same investigation myself a few years back, but was frustrated by the lack of the -r flag for shuf(1). It seems that's been added at some point recently (though many of my systems do not have it--GNU coreutils percolates slowly through older Debian/Ubuntu versions. :))

Good to know things are still getting better in coreutils!


Oh, nice, on Fedora 23 beta, I can simulate die rolls

     shuf -r -z -n 100 -e 1 2 3 4 5 6;echo -e "\n"
And rolls of a non-transitive (Grime) die

     shuf -r -z -n 100 -e 3 3 3 3 3 6;echo -e "\n"
Which is timely. I sometimes forget how flexible the terminal prompt is...


It's pseudo-random though, right?


Oh, I'd imagine so. Good enough for illustrative purposes and for catching any gross errors in my arithmetic when analysing the games. Not good enough for anything 'real'.


I recently discovered shuf since I needed to shuffle a fairly large number of URLs in a file to allow multiple processes to work through them in parallel (yes, I could have actually done this with a queue and a producer/consumer, but this was a one-time deal, so it was faster to just throw a bit more hardware at it). What amazed me is that 'cat urls.txt | shuf > new-urls.txt' took just about a second to complete even though the original file was about 1GB. How does it work so incredibly fast?
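Presumably shuf reads the whole file into memory and does a single Fisher–Yates pass, which is O(n) in the number of lines. A sketch of that idea in awk (illustrative only; shuf itself is C, and the `seq 1 100` input is just a stand-in):

```shell
# In-memory Fisher-Yates shuffle sketched in awk: slurp all lines,
# then walk backwards swapping each element with a random earlier one.
seq 1 100 | awk 'BEGIN { srand() }
  { a[NR] = $0 }                          # read every line into an array
  END {
    for (i = NR; i > 1; i--) {
      j = int(rand() * i) + 1             # random index in 1..i
      t = a[i]; a[i] = a[j]; a[j] = t     # swap
    }
    for (i = 1; i <= NR; i++) print a[i]
  }'
```

One pass of swaps gives every permutation equal probability, and the only per-line work after reading is an array swap, which is why the whole thing is so cheap.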


This uses a few utils and techniques to deal 5 random cards:

https://twitter.com/pixelbeat_/status/587703133717057537

    paste -d '' <(printf '%s\n' $(seq 2 9) T J Q K A | sed 'p;p;p') \
    <(yes $'H\nD\nS\nC' | head -n52) |
    shuf -n5


On Windows, Get-Random handles the basic cases nicely:

    > 1..100 | Get-Random
    72
    > 1..100 | Get-Random -count 3
    16
    96
    56
but it doesn't have something like -r to resample, nor a nice way to simply shuffle the whole collection (the workaround is to pass -count a value as large as or larger than the collection).


Nice! I've been using

    sort -R | head -n100
but obviously this requires the entire file to be shuffled before printing the first 100 lines.


-R sorts by a random hash of each line, so it will do the wrong thing in the face of repeated lines. (Either you get all instances of a repeated line or none.)
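The grouping behaviour is easy to see directly (assuming GNU sort, which is where -R/--random-sort comes from):

```shell
# GNU sort -R keys on a random hash of each line, computed once per
# invocation: identical lines hash identically and come out adjacent,
# so this prints a,a,b,b or b,b,a,a -- never interleaved.
printf 'a\na\nb\nb\n' | sort -R
```

That makes sort -R a shuffle of *distinct* lines (with duplicates travelling as a block), not a shuffle of individual lines the way shuf is.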


The -R option not being available on OS X, you might do something like

  awk "BEGIN { srand($RANDOM) } { print int(rand() * 1000000), \$0 }" | sort -n | cut -d' ' -f2-
to shuffle an input



Note that hnov's awk command is the equivalent of "sort (random order)" at that link, and shows good randomness properties in the plot. However, that link shows "sort (random comparator)" by default, which looks terrible at randomly sorting lists. hnov's awk script should be suitable for most needs, though I'd tweak it a bit:

     awk 'BEGIN { srand() } { print rand(), $0 }' | sort -g | cut -d' ' -f2-
which is shorter and allows more than 1,000,000 distinct random values, namely ~52 bits in awk implementations that use 64-bit doubles. (The single quotes matter: in double quotes the shell would expand $0 itself, and without srand() some awks produce the same "shuffle" on every run.)


How does this handle files that are gigs in size? It looks to me like you load the entire file into memory and pass it along, shuffled.



If you have an unknown input size and a fixed number k of requested output lines, you can still do it in O(n) time with memory proportional only to k, by giving each new line a steadily decreasing chance of replacing one of the lines you've kept.
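That technique is reservoir sampling; a sketch in awk (k=5 and the seq input are just for illustration):

```shell
# Reservoir sampling: keep the first k lines, then give each later
# line a k/NR chance of replacing a randomly chosen kept line.
seq 1 10000 | awk -v k=5 'BEGIN { srand() }
  NR <= k { r[NR] = $0; next }   # fill the reservoir
  {
    j = int(rand() * NR) + 1     # pick a slot in 1..NR
    if (j <= k) r[j] = $0        # replace with probability k/NR
  }
  END { for (i = 1; i <= k; i++) print r[i] }'
```

Every input line ends up in the sample with equal probability k/n, no matter how long the input turns out to be, and only k lines are ever held in memory.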


    man shuf


Yeah but what's sad is that

    $ man -k shuffle
gives

    pstops (1) - shuffle pages in a PostScript file

I wish I'd known about shuf. I've been using sort -R or Python.


It's not the documentation, it's knowing which tools exist in the first place.


    man coreutils
?

ooh topical:

    ls /usr/bin | shuf -n1 | xargs man


oh sorry, I mean

    http://linux.die.net/man/1/shuf
since if it's not on the internet, how will anyone ever find it?



