Efficient Tabular Storage

SloopJon · on Aug 29, 2015

It looks like the NYCTaxi dataset is here:

http://www.andresmh.com/nyctaxitrips/

Some background on this data:

http://chriswhong.com/data-visualization/taxitechblog1/

And data for 2014 directly from the city:

https://data.cityofnewyork.us/view/gn7m-em8n

TheGuyWhoCodes · on Aug 29, 2015

Vertica has all those performance enhancements. Great DB, but I can't recommend it because of the pricing :(

beagle3 · on Aug 29, 2015

kdb+ answers the same description.

And it's a 300KB executable with no dependencies (other than glibc/MSVCRT).

owlish · on Aug 29, 2015

How do databases like MySQL store data efficiently for querying? It seems like something like protobuf would do well here, though you'd need to generate code for each dataset.

kragen · on Aug 29, 2015

Typically they use row-oriented binary storage, optionally with individual columns or subsets of columns duplicated into indices for fast querying. Have you tried protobufs? How many hundreds of megs per second do you get? I think they are remarkably slow at the scales we're talking about here.
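To make the row-oriented-plus-index idea concrete, here is a minimal sketch in Python. It is not how MySQL or any real engine lays out data; the three-column row layout, the sample values, and the function names (`pack_rows`, `build_index`, `lookup`) are all made up for illustration. Each row is packed into a fixed-width binary record, and one column is duplicated into a separate lookup structure that maps values to byte offsets.

```python
import struct

# Hypothetical fixed-width row layout (an assumption for this sketch):
# trip id (uint32), passenger count (uint8), fare (float64), little-endian, no padding.
ROW = struct.Struct("<IBd")

def pack_rows(rows):
    """Concatenate fixed-width binary records into one buffer (row-oriented storage)."""
    return b"".join(ROW.pack(*r) for r in rows)

def build_index(buf, column):
    """Duplicate one column into an index: value -> byte offsets of the rows holding it."""
    index = {}
    for offset in range(0, len(buf), ROW.size):
        key = ROW.unpack_from(buf, offset)[column]
        index.setdefault(key, []).append(offset)
    return index

def lookup(buf, index, key):
    """Fetch matching rows through the index instead of scanning the whole buffer."""
    return [ROW.unpack_from(buf, off) for off in index.get(key, [])]

# Made-up sample data, not taken from the taxi dataset.
rows = [(1, 2, 12.5), (2, 1, 7.0), (3, 2, 30.25)]
buf = pack_rows(rows)
by_passengers = build_index(buf, column=1)
matches = lookup(buf, by_passengers, 2)
```

Because every record has the same width, a row's position is just its offset in the buffer, so the index only needs to store offsets. A real engine would use an on-disk B-tree rather than an in-memory dict, and would handle variable-width columns and pages.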

brudgers · on Aug 29, 2015

Traditional DBMSs get performance by optimizing storage down to the physical layout of the data on the hardware. So MySQL makes a lot of assumptions based on the mechanics of spinning disks, with buffers tailored to their physics. Database Systems: The Complete Book is a good text on the subject, and the second half is all about the hardware and software used in implementing traditional systems.

Little_Peter · on Aug 29, 2015

How is it different from HDF5 (h5py and pytables)?

Ashim_Usmani · on Aug 29, 2015

Great share

Sprint · on Aug 29, 2015

This is super interesting, thanks!