Automatic rerun detection
Overview
Depending on how many people work on the same pipeline, manually deleting outdated output files gets tedious quickly. Therefore, luisy.tasks.base.Task implements a feature that detects whether a task needs to be rerun because its code has changed.
More specifically: whenever a luisy pipeline has been executed successfully, a hash of the code of all executed tasks is computed and stored in a .luisy.hash file located in the project directory, and whenever a luisy pipeline is executed, the hashes of the tasks computed at runtime are compared with the persisted ones. If the code of a task changes, its hash changes as well: the task is executed again and its output file is overwritten. Moreover, all downstream dependencies of this task are re-run too, as their input may have changed.
Long story short: you will never have to delete files manually again.
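Conceptually, the rerun decision is a comparison against the persisted hash file. The following minimal sketch only illustrates the idea; the layout of .luisy.hash and all names used here are assumptions, not luisy's actual API:

    import json

    def needs_rerun(task_name, current_hash, hash_file='.luisy.hash'):
        # Illustrative sketch: assume the hash file maps task names to hashes.
        try:
            with open(hash_file) as f:
                stored_hashes = json.load(f)
        except FileNotFoundError:
            return True  # no previous successful run recorded
        return stored_hashes.get(task_name) != current_hash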
Mechanics of the hash computation
The obvious first idea is to create a hash value from the source code of every task. However, it is not enough to capture only the code of the task itself; we also need to capture:
the source code of functions, classes, and constants used in the source code of the task
the versions of the external libraries that are used
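As an illustration, consider the following schematic module (the task class is a bare stand-in; the actual luisy task API is omitted). A hash of FilterTask alone would miss changes to clean(), THRESHOLD, or the installed pandas version:

    import pandas as pd   # external library: its version must enter the hash

    THRESHOLD = 10        # module-level constant: its source must enter the hash

    def clean(df):        # module-level helper: its source must enter the hash
        return df.dropna()

    class FilterTask:     # stand-in for a luisy task
        def run(self):
            df = clean(pd.DataFrame({"value": [5.0, 15.0, None]}))
            return df[df["value"] > THRESHOLD]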
Core Algorithm
Using Python's ast module, we analyse the source code inside the body of the task class. With ast, we can collect all variable names that are used inside the class body. We also obtain all local variables that are assigned (stored) inside the class body. This way, we can identify the variable names that are
used inside the class body, but
not defined/assigned inside the class body.
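A minimal sketch of this name analysis, independent of luisy's actual implementation:

    import ast
    import textwrap

    SOURCE = textwrap.dedent('''
        class MyTask:
            def run(self):
                df = load_data()               # comes from outside the class
                limit = 5                      # assigned locally
                return (df * FACTOR)[:limit]   # FACTOR comes from outside
    ''')

    used, assigned = set(), set()
    for node in ast.walk(ast.parse(SOURCE)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                used.add(node.id)
            else:
                assigned.add(node.id)
        elif isinstance(node, ast.arg):  # function arguments count as assigned
            assigned.add(node.arg)

    print(used - assigned)  # {'load_data', 'FACTOR'}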
These variables must come from outside the class body. Now we can check which of these three cases applies:
Import from an external library
Import from another module of the same package as our task is in
Variable is a function, class or constant from the same module as our task is in
For the last two cases, we apply the same step recursively and add the source code to a list.
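A simplified stand-in for this classification, using inspect to locate where an object is defined (the real luisy logic may differ):

    import inspect

    def classify(name, task_module):
        # `name` is a variable that is used, but not assigned, inside the
        # task's class body; assume it resolves in the task's module.
        obj = getattr(task_module, name)
        defining_module = inspect.getmodule(obj)
        if defining_module is None:
            return "constant from the task's module"        # e.g. an int or str
        project = task_module.__name__.split(".")[0]
        if not defining_module.__name__.startswith(project):
            return "import from an external library"        # collect version info
        if defining_module is task_module:
            return "defined in the task's own module"       # recurse into its source
        return "import from another module of the package"  # recurse into its source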
Whenever we hit an external dependency, we collect its version info from requirements.txt and add it to the list. If we cannot find a requirements.txt, we use pipdeptree to infer the version of the dependency from the installed distributions. When the recursive algorithm has terminated, we generate a hash of all the source code and version info that we have collected.
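The final digest can be computed along these lines (the hash function and encoding are assumptions; luisy may combine the pieces differently):

    import hashlib

    def compute_task_hash(collected):
        # `collected` holds the source snippets and version strings gathered
        # by the recursive walk described above.
        digest = hashlib.md5()
        for piece in collected:
            digest.update(piece.encode("utf-8"))
        return digest.hexdigest()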
Dealing with external requirements
First, we find all the external packages that our task uses. If a package (package A) is not listed directly in requirements.txt, we look for another package (package B) that requires package A and is itself listed in requirements.txt. We can then include the version info of package B in the hash.
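A sketch of this lookup, using importlib.metadata in place of pipdeptree for brevity (all names here are hypothetical):

    import re
    from importlib.metadata import requires, version

    def dist_name(requirement):
        # Reduce a requirement string like "numpy (>=1.20)" to its bare name.
        return re.split(r"[\s;<>=!~\[\(]", requirement, maxsplit=1)[0].lower()

    def version_pin(package, pinned):
        # `pinned`: the package names parsed from requirements.txt.
        if package in pinned:
            return f"{package}=={version(package)}"
        for candidate in pinned:                   # look for a package B ...
            for req in requires(candidate) or []:  # ... that requires package A
                if dist_name(req) == package.lower():
                    return f"{candidate}=={version(candidate)}"
        return None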
Guidelines when creating the code of your task
To use the task hash functionality most efficiently, note the following points:
Do not shadow module-scope variables in class or function scope. If you shadow a module-level variable with a class- or function-level one, changes to the former will not be detected (see the example after this list).
Try to import only what you need, not the whole module. Otherwise, a tiny change in the imported module changes the hash of your task and leads to a re-execution.
Star imports are not supported. They will make the hash creation fail.
When using eval, the variables used inside the expression will not be tracked.
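For example, the first point rules out patterns like this (schematic code):

    THRESHOLD = 10        # module-level constant

    class MyTask:         # stand-in for a luisy task
        THRESHOLD = 20    # shadows the module-level THRESHOLD: inside the class
                          # body the name is both assigned and used, so changes
                          # to the module-level value above go undetected

        def run(self):
            return self.THRESHOLD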