Efficient Tabular Storage (matthewrocklin.com)
62 points by ah- on Aug 28, 2015 | 9 comments


It looks like the NYCTaxi dataset is here:

http://www.andresmh.com/nyctaxitrips/

Some background on this data:

http://chriswhong.com/data-visualization/taxitechblog1/

And data for 2014 directly from the city:

https://data.cityofnewyork.us/view/gn7m-em8n


Vertica has all those performance enhancements. Great DB, but I can't recommend it because of the pricing :(


kdb+ answers the same description.

And it's a 300KB executable with no dependencies (other than glibc/MSVCRT).


How do databases like MySQL store data efficiently for querying? It seems like something like protobuf would do well here, though you'd need to generate code for each dataset.


Typically they use row-oriented binary storage, optionally with individual columns or subsets of columns duplicated into indices for fast querying. Have you tried protobufs? How many hundreds of megs per second do you get? I think it is remarkably slow on the scales we're talking about here.
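To make "row-oriented binary storage plus an index" concrete, here is a minimal Python sketch. It is not how MySQL actually lays out pages; the three-field schema, the `ROW` format string, and the helper names are all made up for illustration. Each row is packed into a fixed-width record, so row i lives at byte offset i * record size, and one column is duplicated into a side dictionary so lookups on it skip the full scan.

```python
import struct
from collections import defaultdict

# Hypothetical schema, for illustration only:
# (trip_id: int64, passengers: int32, fare: float64)
ROW = struct.Struct("<qid")  # little-endian, no padding: 8 + 4 + 8 bytes


def pack_rows(rows):
    """Serialize rows back to back into one buffer (row-oriented layout)."""
    return b"".join(ROW.pack(*row) for row in rows)


def read_row(buf, i):
    """Fetch row i by computing its byte offset; no scan is needed."""
    return ROW.unpack_from(buf, i * ROW.size)


def build_index(buf, column):
    """Duplicate one column into a side structure: value -> row numbers."""
    index = defaultdict(list)
    for i in range(len(buf) // ROW.size):
        index[read_row(buf, i)[column]].append(i)
    return index


rows = [(1, 2, 9.5), (2, 1, 14.0), (3, 2, 7.25)]
buf = pack_rows(rows)

# Index the "passengers" column, then answer a query through the index.
by_passengers = build_index(buf, column=1)
two_passenger_trips = [read_row(buf, i) for i in by_passengers[2]]
print(two_passenger_trips)
```

The trade-off is the usual one: the index makes lookups on that column cheap, but it stores the column a second time and has to be updated on every write.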


Traditional DBMSs get performance by optimizing storage down to the physical layout of the data on the hardware. So MySQL makes a lot of assumptions based on the mechanics of spinning disks, and its buffers are tailored to their physics. Database Systems: The Complete Book is a good text on the subject, and the second half is all about the hardware and software used in implementing traditional systems.


How is it different from HDF5 (h5py and pytables)?


Great share


This is super interesting, thanks!



