(Hot Take 🌶️)
The call graph for your data pipelines should have depth no more than 2.
OK, what I meant by that: I frequently see pipelines structured like:
first_we_do_x
then_we_do_y
and_another_thing
and then when you look into first_we_do_x, you see that it consists of more non-trivial substeps, and so on.
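A minimal, hypothetical sketch of what this anti-pattern tends to look like (all names and logic invented for illustration): the top level reads like three steps, but the real data flow hides a level down.

```python
def first_we_do_x(raw):
    # Surprise: this "step" is itself a pipeline of non-trivial substeps,
    # invisible from the top level.
    parsed = [line.split(",") for line in raw]
    cleaned = [row for row in parsed if len(row) == 2]
    return {name: int(value) for name, value in cleaned}

def then_we_do_y(totals):
    return {name: value * 2 for name, value in totals.items()}

def and_another_thing(totals):
    return sum(totals.values())

def run_pipeline(raw):
    # Looks like depth 1, is actually depth 2+ once you open each step.
    data = first_we_do_x(raw)
    data = then_we_do_y(data)
    return and_another_thing(data)

print(run_pipeline(["a,1", "b,2", "bad"]))  # -> 6
```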
So it looks like you're decomposing the task, but you're actually obfuscating the flow of data. I personally find it better when the data flow is always visible at the top level: you process the different parts of the data there and compose them there.
Put differently: every level should work at some level of abstraction that makes it possible to understand what is going on without having to dive into all the subfunctions first.
If you realize you need to break complex processing out into subfunctions, do so in a way that is separate from the data pipeline, and make them generalized enough that they can stand on their own.
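Here is one hedged sketch of the restructured version (again, all names invented): the helpers are generic enough to stand on their own, and the whole data flow is visible at the top level.

```python
def parse_csv_rows(lines):
    """Generic helper: split comma-separated lines into field lists."""
    return [line.split(",") for line in lines]

def keep_rows_with_width(rows, width):
    """Generic helper: drop rows that don't have exactly `width` fields."""
    return [row for row in rows if len(row) == width]

def rows_to_int_totals(rows):
    """Generic helper: turn (name, value) rows into a name -> int mapping."""
    return {name: int(value) for name, value in rows}

def run_pipeline(raw):
    # The entire data flow is visible here, at one level of abstraction;
    # each helper below is a standalone, reusable piece.
    rows = parse_csv_rows(raw)
    valid = keep_rows_with_width(rows, 2)
    totals = rows_to_int_totals(valid)
    doubled = {name: value * 2 for name, value in totals.items()}
    return sum(doubled.values())

print(run_pipeline(["a,1", "b,2", "bad"]))  # -> 6
```

Same behavior as before, but now you can read the whole pipeline in one place without opening any subfunction.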
Is this always possible? No idea. I also don't claim that everything I do always looks like this. Just that when I manage to structure it like this, it gets much easier to understand and work with.
Maybe there's some overlap with ideas like POJOs, functional side-effect-free programming, and dependency injection. Essentially: build simple things that can be composed to do what you want.
(end of rant 🌶️)