
Very much agree that this is the direction data orchestration platforms should go - the basic DAG creation can be straightforward, depending on how you do the authoring (parsing SQL is always the wrong answer, but it's tempting) - but backfills, code updates, etc. are when it starts to get spicy.


I think this is where it gets interesting. With partition dependency propagation, backfills are just “hey, this range of partitions should exist”. Or, since the “wants” for those partitions are probably still active, you can just taint the existing ones. Tainting invalidates the existing partitions, so the wants trigger builds again, and existing consumers don't see the tainted partitions as live. I think things actually get a lot simpler when you stop trying to reason about those data relationships manually!
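To make the taint/want mechanics concrete, here's a minimal Python sketch. All the names here (PartitionStore, want, taint) are hypothetical, not from any particular platform - it just shows how backfills and invalidations reduce to the same loop:

```python
# Hypothetical sketch of a taint/want partition model; not a real
# orchestrator's API.
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    MISSING = "missing"
    LIVE = "live"
    TAINTED = "tainted"


@dataclass
class Partition:
    key: str                        # e.g. "events/2024-01-01"
    state: State = State.MISSING


class PartitionStore:
    def __init__(self) -> None:
        self.partitions: dict[str, Partition] = {}
        self.wants: set[str] = set()

    def want(self, key: str) -> None:
        """Declare that a partition should exist. A backfill is just a
        range of wants."""
        self.wants.add(key)

    def taint(self, key: str) -> None:
        """Invalidate an existing partition: consumers stop seeing it as
        live, and any active want for it will trigger a rebuild."""
        if key in self.partitions:
            self.partitions[key].state = State.TAINTED

    def to_build(self) -> list[str]:
        """Wanted partitions that aren't live need (re)building."""
        return sorted(
            k for k in self.wants
            if self.partitions.get(k, Partition(k)).state is not State.LIVE
        )

    def mark_built(self, key: str) -> None:
        self.partitions[key] = Partition(key, State.LIVE)


store = PartitionStore()
for day in ("2024-01-01", "2024-01-02"):
    store.want(f"events/{day}")
    store.mark_built(f"events/{day}")

store.taint("events/2024-01-01")   # e.g. a code update invalidated it
print(store.to_build())            # -> ['events/2024-01-01']
```

The nice property is that backfills, code updates, and ordinary scheduling all reduce to the same question: which wanted partitions aren't live right now?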


This is true, but you can get combinatorial complexity explosions, especially with the efficiency-oriented data modeling patterns common at some companies - e.g. a mix of latest-dimension tables and historical snapshots, without clear delineations about when you're using which. A common example is a recursive incremental table that has to be rebuilt from its first partition seed. Some SQL can also be very opaque (syntactically, or via special DB features) about which partitions are actually being referenced, especially once aggregates get involved.
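As a rough sketch of why this explodes, consider three dependency patterns at once: a same-date snapshot, a latest-only dimension read, and a recursive incremental chain. The tables and propagation rules below are invented for illustration:

```python
# Hypothetical sketch: how different dependency patterns (same-date
# snapshot, latest-only read, recursive incremental) propagate a taint.
from collections import deque

DAYS = ["d1", "d2", "d3"]


def downstream(table: str, day: str) -> list[tuple[str, str]]:
    """Which partitions read (table, day)? Invented rules:
    - "fact" snapshots "dims" for the same date
    - "report" reads only the *latest* "dims" partition, for every day
    - "agg" is recursive: day d reads "fact" day d and "agg" day d-1
    """
    deps: list[tuple[str, str]] = []
    if table == "dims":
        deps.append(("fact", day))
        if day == DAYS[-1]:
            deps += [("report", d) for d in DAYS]
    if table == "fact":
        deps.append(("agg", day))
    if table == "agg":
        i = DAYS.index(day)
        if i + 1 < len(DAYS):
            deps.append(("agg", DAYS[i + 1]))
    return deps


def invalidate(table: str, day: str) -> set[tuple[str, str]]:
    """Transitively taint everything downstream of one partition."""
    seen: set[tuple[str, str]] = set()
    queue = deque([(table, day)])
    while queue:
        for nxt in downstream(*queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Tainting the *first* dims partition cascades down the recursive chain:
print(sorted(invalidate("dims", "d1")))
# -> [('agg', 'd1'), ('agg', 'd2'), ('agg', 'd3'), ('fact', 'd1')]
```

Tainting an early "dims" partition cascades through the whole recursive chain, while the latest-only readers mean the downstream set changes depending on which day you taint - that's the combinatorics the orchestrator has to get right.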

It's absolutely solvable if you're building clean; things get messy when you're retrofitting onto existing dataflow, and then you're managing user/customer expectations under a stricter system. People like to be able to do wild things!



