Oh, this resonates with me so much! I'm running 4 different DeepSpeech models right now, each using a differently processed version of the LibriSpeech dataset (MFCC/fbanks/linear spectrograms, deltas? energy? padding? etc.), because the original DS papers didn't bother describing the preprocessing, and every implementation I found uses completely different methods and libraries.
Not to mention that every one of those implementations packages its preprocessed version into a different data format, and then creates a different data pipeline (and I only looked at TensorFlow implementations).
The DeepSpeech2 paper does not include any details about audio processing. I see an older Baidu-Research implementation of DS1 that uses "log of linear spectrogram from FFT energy". Also, there's a pytorch implementation [1], where they use Librosa's STFT, is that what you're referring to?
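For anyone comparing these variants, here is a rough sketch of what "log of linear spectrogram from FFT energy" can look like in plain NumPy. This is not taken from the Baidu or PyTorch code: the function name, the 20 ms window, the 10 ms stride, the Hann window and the `eps` floor are all my assumptions, and those are exactly the details the implementations disagree on.

```python
import numpy as np

def log_linear_spectrogram(signal, sample_rate=16000, window_ms=20, stride_ms=10, eps=1e-10):
    """Log of the linear-frequency power spectrogram of a 1-D signal.

    All defaults are illustrative assumptions, not values from any paper.
    """
    win = int(sample_rate * window_ms / 1000)   # samples per frame
    hop = int(sample_rate * stride_ms / 1000)   # samples between frame starts
    n_frames = 1 + (len(signal) - win) // hop   # frames that fit without padding
    # Index matrix that slices the signal into overlapping frames: (n_frames, win)
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(win)      # taper each frame
    # Energy in each linear frequency bin: (n_frames, win // 2 + 1)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + eps)                  # eps avoids log(0) on silence
```

With these defaults, one second of 16 kHz audio becomes 99 frames of 161 frequency bins. Swapping in mel filterbanks, deltas or padding changes the shape and the values, which is why models trained on different pipelines are hard to compare.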
That's two more implementations that I haven't considered. I'm sure most of the processing steps under the hood are the same or similar, but as I'm not an audio processing expert, I can't tell which method is better (and why).
And it's hard to tell if it "works well" because or despite the way I processed the files.
A step in the right direction for machine learning in science, but they could have done some more research into naming conflicts:
$ apt-cache show quilt
Package: quilt
[..]
Description-en: Tool to work with series of patches
Quilt manages a series of patches by keeping track of the changes
each of them makes. They are logically organized as a stack, and you can
apply, un-apply, refresh them easily by traveling into the stack (push/pop).
.
Quilt is good for managing additional patches applied to a package received
as a tarball or maintained in another version control system. The stacked
organization is proven to be efficient for the management of very large patch
sets (more than hundred patches). As matter of fact, it was designed by and
for Linux kernel hackers (Andrew Morton, from the -mm branch, is the
original author), and its main use by the current upstream maintainer is to
manage the (hundreds of) patches against the kernel made for the SUSE
distribution.
.
This package provides seamless integration into Debhelper or CDBS,
allowing maintainers to easily add a quilt-based patch management system in
their packages. The package also provides some basic support for those not
using those tools. See README.Debian for more information.
i hear you. on pypi the name is uncontested so, at least in the python eco-system, there is only one quilt. that said, for future revisions we'll try for a unique name because it can indeed be confusing, e.g. in the apt-get case.
I have been researching data orchestration/versioning tools for a long time and have been following the Quilt guys closely. It is definitely one of the more powerful tools in the ML/AI engineer's toolbox and solves a huge problem that almost everyone runs into right out of the gate. It's still early days in this space but Quilt gets a lot of things right and I'm super excited to see this product develop.
Full disclosure: I run Paperspace (https://www.paperspace.com) and am working with the Quilt team to integrate their tools into our platform.
I use Quilt pretty much daily, and while I like AWS open datasets, I don't think they are as actively developed as Quilt is. The DAT project, on the other hand, I really do like as a way to simply transfer large amounts of data between contributors. That said, if you are just trying to get data out there and have people use it freely for their own work, I think Quilt is the better solution, because its datasets are searchable and easy to use from Python (and, I think, there is an R repo too).
Pachyderm founder here. We're not really a data hosting provider, although we may offer that in the future. Right now Pachyderm is more like intranet data hosting for companies. You have to spin up your own Kubernetes cluster and deploy Pachyderm on it. It's also not normally used to download data onto your local machine for processing because it has its own computation layer which allows you to run code at scale and tracks the provenance of the data to keep things reproducible.
I've tried dat (https://datproject.org/) and git lfs (https://git-lfs.github.com/) but so far have found quilt to be easiest to use & best fitting to my use case (experimental physics characterization experiments).
Just use AWS S3 (or similar) and shell scripts. My team uses a git repository named something like "data-packages", which is nothing but a collection of shell scripts with the name <dataset>.sh, that perform the necessary download and extraction steps to get a dataset from S3. Data sets are immutable by convention, so any change to a data set requires you to provide a totally new shell script. That script could download an older data set and then mutate it if you don't want to maintain large copies of big data sets, but the older data set itself is not permitted to be mutated on S3.
My team has found this drastically easier than Quilt, and we do a ton of stuff with reproducible environments in Docker, creating Makefiles to reproduce exact model training with the exact same data, etc. We probably hit just about every case there is (huge models, small models, models where we'd like to train separately or collectively on a bunch of different benchmark data sets, in-house data sets, models that need to be refreshed with new data in pipelines, etc.) So far, Quilt has not been competitive with a simple repo of shell scripts for us, in terms of ease of use or effectiveness in maintaining different packages of data.
The other super nice thing is that when people start out on new models or experiments, we already have our in-house maintained copies of a bunch of academic data sets, private data sets, etc., and you can throw together an incredibly simple Dockerfile or Makefile that uses the appropriate script. It's just one or two lines of shell code and voila, you have an environment with the dataset you want. Check that into git and now your experiment is immediately reproducible from day one. We've found this to dramatically increase the amount of code review that researchers engage in for checking their statistical methodology and sanity checking their intended models or experiments. With Quilt, you have the extra issue of versioning (rather than harshly enforcing all data sets to be immutable ... even just adding one more training example to the data set means you must provide a new shell script that downloads the old data, injects your lone additional sample, and has a documentation entry about exactly what it is doing), as well as the overhead of using yet another tool instead of super standard shell scripts.
For me, any of the tools that pop up attempting to be like conda-forge but for data packages is sort of like taking a gatling gun to a problem that can be solved with a hammer.
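To make the "new script instead of a mutation" rule concrete, here is a minimal sketch. The team described above uses shell scripts against S3; this is a Python stand-in on a local directory, and `derive_dataset` plus all of its parameters are hypothetical names chosen for illustration.

```python
import shutil
from pathlib import Path

def derive_dataset(root, old_name, new_name, extra_files):
    """Create dataset `new_name` as a copy of `old_name` plus extra files.

    The old dataset is only read, never modified. Publishing under an
    existing name is refused, so a given name always means the same files.
    """
    root = Path(root)
    src, dst = root / old_name, root / new_name
    if dst.exists():
        raise FileExistsError(f"{new_name} already exists; pick a new name")
    shutil.copytree(src, dst)                      # materialize a full copy
    for path in extra_files:
        shutil.copy2(path, dst / Path(path).name)  # inject the additional samples
    return dst
```

Adding one training example then means calling something like `derive_dataset(root, "corpus-v1", "corpus-v2", ["new_sample.wav"])` and checking that call into git. Note that the check here only guards the new name; nothing stops someone from editing the old directory by hand, which is the "by convention" part.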
Interesting thoughts. Quilt has a ways to grow. You correctly point out that, in some cases, S3 is lighter weight. You'll see future versions of Quilt get lighter, and offer more S3-like "just store this" functionality. In its next minor revision, Quilt simplifies point updates (i.e. it will be possible to update a single training example without materializing the entire package).
That said, there are a few areas where your system glosses over the needs of a data pipeline:
* "immutable by convention" is not a data preservation strategy; the system should enforce immutability
* what about deserialization? it's not enough to store and move bits. there are so many examples of "serdes" headaches. pickling (yes, pickle is a horrible format) in python 2 vs python 3 is one example. not to mention performance. my point is not that scripts can't do serdes, but that serdes information should travel with the data, so it's (mostly) transparent to the consumer.
* multiple writers (e.g. suppose you are generating training data in a distributed manner) requires write atomicity at the bucket level, which S3 doesn't provide
* deduplication of data fragments - I can see how one might do this with a "scripts over S3" strategy, but it's complicated enough that it's far easier to rely on a third-party app that just works in this regard
* fine-grained permissions - what if each data package has a different audience? sure, you can roll this with S3, but is that the best use of developer time?
* change history and access auditing
* querying and filtering - in many cases there is an enormous data corpus which needs to be sliced a different way by each user, e.g. Google Open Images. it is much more robust to have a single query mechanism that understands data layout than to write a fresh script for each slice.
> "immutable by convention" is not a data preservation strategy; the system should enforce immutability
I actually disagree with this. In a Python-like “consenting adults” philosophy, I think it’s worse to spend engineering effort guaranteeing immutability than to trust people not to mutate data and just have a reasonable system of backups.
Immutability by convention is 99.99999999% as good as enforced immutability for this particular type of task, and there’s even less risk with a good backup strategy to fall back if there is an accident.
Change history and access auditing are super easy on S3, as is fine-grained access control. With immutability by convention, change history is just git history, and you can customize access groups on a file-by-file basis if you want. You could also instrument logging in the shell scripts themselves if you really want, and I’m not convinced that’s worse than a third party doing it, especially if your logging backend is prone to change, or you want to pipe stuff to Grafana, etc., which are quite common needs.
Querying and filtering are separate post-processing tasks. They should be expressed as source code that mutates a data set after downloading a local copy, and in fact version controlling any data cleaning, post-processing, etc., should be kept completely separate from the management of a data package. They are logically hugely different parts of the process. Slicing data ought to be up to the individual developer or researcher, to choose their tools, to optimize, etc. Version control of that source code is the right way to make that part of the work reproducible, not trying to tie custom treatments into a version of a data package.
Your points about dedup and deserialization are good ones. I can imagine problem cases for a simple script approach, but I can also say even for gigantic in-house image data sets, creating multiple slightly different materialized copies has rarely been an issue.
Do you store the datasets as tar/zip archives on S3, or do you have some way of representing how a collection of items goes together to form a dataset?
Quilt co-founder here. No, Quilt doesn’t use tar or zip. Each package version has a manifest that specifies the set of items it contains. Each item in the collection is stored and transported separately. Items are identified by their hash and stored once even if used in more than one package version.
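As a rough illustration of the manifest-plus-hash idea (a toy in-memory sketch, not Quilt's actual code or API; the `ContentStore` class, its methods and the choice of SHA-256 are assumptions made for the example):

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: each blob is kept once, keyed by its SHA-256."""

    def __init__(self):
        self.blobs = {}       # hash -> bytes
        self.manifests = {}   # (package, version) -> {item name: hash}

    def push(self, package, version, items):
        """Record a package version; `items` maps item name -> bytes."""
        manifest = {}
        for name, data in items.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)   # stored once, however often it is used
            manifest[name] = digest
        self.manifests[(package, version)] = manifest

    def pull(self, package, version):
        """Rebuild the items of a package version from its manifest."""
        manifest = self.manifests[(package, version)]
        return {name: self.blobs[h] for name, h in manifest.items()}
```

Pushing a second version that adds one item to two unchanged ones leaves three blobs in the store rather than five, because the unchanged items hash to keys that already exist.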
For some data sets we use an archive format and the corresponding script unpackages the data. For others, we have all individual files in S3. Using archives improves download speed, and sometimes we provide both, so some narrowly-defined data sets benefit from the archive file, while others can pick and choose specific samples to include or exclude.
One of the nice things about the simple script approach is that the dataset can be defined however you like. Whatever the script retrieves and unpacks for you, that is the data set of that script. It could be a superset or subset of other data, or an intersection of samples with a certain property from other data. As long as that definition is treated as immutable, and the backing data is immutable, it defines a specific collection of items however you want.
Has anyone tried out the option of self-hosting Quilt registries? I really like the idea of Quilt, although I am worried that my network bandwidth would be an issue for 10-100GB datasets...
Quilt isn't doing the inference (the PyTorch model is). But, in any case, no. Super-resolution is more than blurring, it's pixel inference. https://arxiv.org/abs/1609.05158