For some data sets we use an archive format, and the corresponding script unpacks the data. For others, we keep all the individual files in S3. Archives download faster, and sometimes we provide both: narrowly defined data sets use the archive file, while others pick and choose specific samples to include or exclude from the individual files.
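As a rough illustration of the two retrieval styles, here is a minimal sketch. It assumes a zip archive purely for convenience (the archive format we use is not the point), and `unpack_archive`, `fetch_samples`, and the `fetch` callable are hypothetical names. `fetch` stands in for whatever downloads one object by key, for example a thin wrapper around an S3 client.

```python
import zipfile
from pathlib import Path
from typing import Callable, Iterable


def unpack_archive(archive_path: Path, dest: Path) -> list[str]:
    """Unpack a whole archive into dest and return the file names it contained."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
        # Skip directory entries so only files are reported.
        return sorted(n for n in zf.namelist() if not n.endswith("/"))


def fetch_samples(
    keys: Iterable[str],
    fetch: Callable[[str], bytes],  # hypothetical: downloads one object by key
    dest: Path,
    exclude: frozenset = frozenset(),
) -> list[str]:
    """Fetch individual files one by one, skipping any excluded keys."""
    dest.mkdir(parents=True, exist_ok=True)
    written = []
    for key in keys:
        if key in exclude:
            continue
        target = dest / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch(key))
        written.append(key)
    return written
```

The archive path is one request and one unpack; the per-file path costs more round trips but lets each script include or exclude samples individually.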

One of the nice things about the simple script approach is that the data set can be defined however you like. Whatever the script retrieves and unpacks for you is, by definition, the data set of that script. It could be a superset or subset of other data, or an intersection of samples with a certain property drawn from other data. As long as that definition and the backing data are both treated as immutable, the script defines a specific collection of items, however you choose to draw it.
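To make the "definition is the data set" idea concrete, here is a small sketch that treats each definition as an immutable set of samples. Everything in it is an assumption for illustration: the `Sample` fields, the keys, the placeholder hashes, and the `fingerprint` helper are hypothetical, not our actual scripts.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    key: str      # hypothetical object key, e.g. a path within a bucket
    sha256: str   # content hash; pins the backing data (placeholder values below)
    label: str    # an example property to select on


def fingerprint(samples: frozenset) -> str:
    """Order-independent hash identifying one exact data set definition."""
    h = hashlib.sha256()
    for s in sorted(samples, key=lambda s: (s.key, s.sha256)):
        h.update(f"{s.key}\0{s.sha256}\n".encode())
    return h.hexdigest()


# Two existing (made-up) collections.
ALL = frozenset({
    Sample("img/001.png", "aa11", "cat"),
    Sample("img/002.png", "bb22", "dog"),
    Sample("img/003.png", "cc33", "cat"),
})
OTHER = frozenset({
    Sample("img/003.png", "cc33", "cat"),
    Sample("img/004.png", "dd44", "cat"),
})

# A subset chosen by property, an intersection with other data, and a superset.
CATS = frozenset(s for s in ALL if s.label == "cat")
CATS_IN_BOTH = CATS & OTHER
EVERYTHING = ALL | OTHER
```

Because the sets are frozen and each sample carries a content hash, the fingerprint only changes if the definition or the backing data changes, which is the immutability the approach relies on.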