Cloud synchronisation
Note
This is an experimental feature. Expect sharp edges and bugs.
Each luisy.Task comes with the parameters upload and download, which allow the target of a task to be synchronized with the Azure cloud.
How the synchronisation works
When a luisy pipeline is invoked, the following steps are performed in order:
Starting at the root-task given by the user, all upstream tasks are detected and their hash is determined using the local code-base (see also Overview for more details).
These hashes are compared with the hashes stored in the local filesystem to detect the tasks whose hashes have changed.
All local files whose tasks need a rerun are deleted.
Using luigi, the pipeline is executed. If --download is used, luisy checks for each task whether a file with the correct hash resides in the cloud. If so, the file is downloaded instead of executing the task locally.
After the pipeline has run through, all files of the pipeline whose hash differs from the hash in the cloud are uploaded if --upload is specified.
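The steps above can be pictured with a minimal sketch. It is not luisy's implementation: the function plan_sync and the plain dictionaries of hashes are hypothetical stand-ins, and the sketch ignores scheduling details such as the order in which luigi visits tasks.

```python
def plan_sync(code_hashes, local_hashes, cloud_hashes, download=False, upload=False):
    """Plan one pipeline run (illustrative only, not luisy's API).

    code_hashes:  task name -> hash computed from the current code base
    local_hashes: task name -> hash of the output stored locally
    cloud_hashes: task name -> hash of the output stored in the cloud
    """
    # Step 2: a task needs a rerun if its local hash is missing or outdated.
    stale = [t for t, h in code_hashes.items() if local_hashes.get(t) != h]

    # Step 3: outdated local outputs of those tasks are deleted.
    to_delete = [t for t in stale if t in local_hashes]

    # Step 4: with --download, an output with the correct hash is fetched
    # from the cloud instead of running the task locally.
    to_download = [
        t for t in stale
        if download and cloud_hashes.get(t) == code_hashes[t]
    ]
    to_run = [t for t in stale if t not in to_download]

    # Step 5: with --upload, every output whose hash differs from the
    # cloud copy (or is missing there) is uploaded after the run.
    to_upload = [
        t for t, h in code_hashes.items()
        if upload and cloud_hashes.get(t) != h
    ]

    return {
        "delete": to_delete,
        "download": to_download,
        "run": to_run,
        "upload": to_upload,
    }


if __name__ == "__main__":
    plan = plan_sync(
        code_hashes={"A": "a1", "B": "b2", "C": "c1"},
        local_hashes={"A": "a1", "B": "b1"},
        cloud_hashes={"B": "b2", "C": "c0"},
        download=True,
        upload=True,
    )
    print(plan)
```

In this made-up example, task B is outdated locally but current in the cloud, so it is downloaded; task C has to run locally; and A and C are uploaded because the cloud copy is missing or outdated.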
Note
Note that --download and --upload are asymmetric to some extent: only what is needed is downloaded, but everything is uploaded. The reason is that we want to minimize the data stored locally on the clients, while everything should be available in the cloud storage.
Note
As with any other synchronization tool, uploading to and downloading from your cloud storage system creates additional costs with your cloud provider. Depending on your cloud service, the costs can depend on the file sizes as well as on the number of files. luisy has no control over these costs and is not responsible for them.
Prerequisites
To use this service, the access token for the storage has to be set, together with the container name and the account name:
export LUISY_AZURE_STORAGE_KEY=SECRET_KEY
export LUISY_AZURE_CONTAINER_NAME=CONTAINER_NAME
export LUISY_AZURE_ACCOUNT_NAME=ACCOUNT_NAME
Note
The account name can be read off the Azure URL: https://[ACCOUNT_NAME].blob.core.windows.net
Warning
Don't add this to your .bashrc or something similar. The secret key is sensitive and should never be stored unencrypted. By adding a space in front of the export command, you keep the command out of your .bash_history, provided that the HISTCONTROL variable of your bash session contains ignorespace or ignoreboth.
Here, we use the container projects. In this container, a folder named after the project (that is, the Python package in which the task lives) has to be created.
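Before invoking a pipeline with --download or --upload, it can help to check that the three variables are actually visible to the process. The following is a minimal sketch; the helper names missing_variables and blob_account_url are hypothetical and not part of luisy, and only the variable names and the URL pattern are taken from this page.

```python
import os

# Environment variables named in the prerequisites above.
REQUIRED_VARIABLES = (
    "LUISY_AZURE_STORAGE_KEY",
    "LUISY_AZURE_CONTAINER_NAME",
    "LUISY_AZURE_ACCOUNT_NAME",
)


def missing_variables(environ=None):
    """Return the required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


def blob_account_url(account_name):
    """Build the storage URL from which the account name can be read off."""
    return f"https://{account_name}.blob.core.windows.net"


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        # The secret key is deliberately never printed.
        print("Storage account:", blob_account_url(os.environ["LUISY_AZURE_ACCOUNT_NAME"]))
```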
Downloading
If --download is added to the task execution, as in
luisy --download --module my_module.tasks MyTask
then, if the task is not yet complete (that is, if its output file does not exist), luisy checks whether the file exists in the cloud. If it does, the file is downloaded and the task is marked as complete.
Uploading
If --upload is added to the task execution, as in
luisy --upload --module my_module.tasks MyTask
then the result of the task is uploaded to the cloud. The flag can also be used with tasks that have already been executed; in this case, only the upload is done.
Tree traversal
If a task is called with both --upload and --download, as in
luisy --upload --download --module my_module.tasks MyTask
then uploading and downloading are done recursively through the tree for all tasks whose completion is checked, provided all involved tasks are luisy.Tasks that correctly forward the parameters, for example with luisy.requires and luisy.inherits.
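The phrase "all tasks whose completion is checked" can be illustrated with a toy model. The sketch below is not luisy's code: the function synchronise, the dictionary of requirements and the sets of existing outputs are hypothetical, hashes are left out, and it assumes the luigi-style behaviour that the upstream tasks of a complete task are not visited.

```python
def synchronise(root, requires, local, cloud, download=False, upload=False):
    """Toy model of a recursive run with --download / --upload.

    requires: task name -> list of upstream task names
    local:    names of tasks whose output exists locally
    cloud:    names of tasks whose output exists in the cloud
    Returns (action, task) pairs in the order in which they happen.
    """
    local, cloud = set(local), set(cloud)  # work on copies
    actions = []
    checked = []  # tasks whose completion was checked, in visiting order

    def ensure_complete(task):
        if task in checked:
            return
        checked.append(task)
        if task in local:
            return  # complete: upstream tasks are not visited
        if download and task in cloud:
            actions.append(("download", task))
            local.add(task)
            return  # complete after the download
        for upstream in requires.get(task, []):
            ensure_complete(upstream)
        actions.append(("run", task))
        local.add(task)

    ensure_complete(root)

    if upload:
        # Only tasks that were visited take part in the upload.
        for task in checked:
            if task not in cloud:
                actions.append(("upload", task))
                cloud.add(task)
    return actions


if __name__ == "__main__":
    requires = {"C": ["B"], "B": ["A"]}
    print(synchronise("C", requires, local=set(), cloud={"B"},
                      download=True, upload=True))
```

In this example, B is found in the cloud and downloaded, so A is never touched; C is then run locally and uploaded.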