Decorators ========== luisy's core functionality is the use of decorators. Our idea is to reduce lots of boilerplate code by offering the user various decorators to decorate his tasks. Decorators are used for: * Reading / Writing Tasks * Filename generation * Passing parameters between tasks * Sorting tasks into `raw/interim/final` directory structure Local targets ------------- Here is a list of decorators for the outfile of a task: Generally, tasks with those decorators can persist their output with :py:func:`self.write()` and tasks requiring these tasks can read their output with `self.input().read()` * :py:func:`~luisy.decorators.hdf_output`: For tasks whose output is a two dimensional dataframe * :py:func:`~luisy.decorators.xlsx_output`: For tasks whose output is an Excel-file. * :py:func:`~luisy.decorators.csv_output`: For tasks whose output is a CSV file * :py:func:`~luisy.decorators.pickle_output`: For tasks whose output is a pickle. May be used to (de)serialize any python object. * :py:func:`~luisy.decorators.parquetdir_output`: For tasks whose output is a directory holding parquet files, like the result of a Spark computation. * :py:func:`~luisy.decorators.make_directory_output`: Factory to create your own directory output decorator. This method wants you to pass a function, which tells luisy how to handle the files in your directory. Parquet dir output ~~~~~~~~~~~~~~~~~~ When persisting results of a spark computation, the prefered output format are parquet files. This may look as follows: .. code-block:: python import luisy @luisy.raw @luisy.requires(SomeTask) @luisy.parquet_output('some_dir/spark_result') class SomePySparkResult(luisy.SparkTask): def run(self): df = self.read_input() # do something self.write(df) If the resulting path needs to be parametrized, the method :code:`get_folder_name` needs to be implemented: .. code-block:: python import luisy @luisy.raw @luisy.requires(SomeTask) @luisy.parquet_output class SomePySparkResult(luisy.SparkTask): partition = luigi.Parameter() def run(self): df = self.read_input() # do something self.write(df) def get_folder_name(self): return f"some_dir/spark_result/Partition=self.partition" In some cases, only the result of a spark pipeline triggered externally need to be further processed. Here, these files are external inputs to the pipeline and :py:func:`~luisy.decorators.parquetdir_output` together with the cloud-synchronisation can be used download these files automatically and process them locally: .. code-block:: python import luisy @luisy.raw @luisy.parquetdir_output('some_dir/some_pyspark_output') class SomePySparkResult(luisy.ExternalTask): pass This task points to the folder :code:`[project_name]/raw/some_dir/some_py/` in the Blob storage holding multiple parquet-files. Cloud targets ------------- The following targets can be used in combination with a `luisy.task.SparkTask`: * :py:func:`~luisy.decorators.deltatable_output`: For spark tasks whose output should be saved in a deltatable * :py:func:`~luisy.decorators.azure_blob_storage_output`: For spark tasks whose output should be saved in a azure blob storage. See :ref:`pyspark` for more details on how to use these targets. Cloud inputs ------------ To ease the development of cloud pipelines using `luisy.task.SparkTask` that should access data already persisted in cloud storages, we provide input decorators that render the usage of :py:class:`~luisy.tasks.base.ExternalTask` unnecessary: * :py:func:`~luisy.decorators.deltatable_input`: For spark tasks that should acess data saved in a deltatable * :py:func:`~luisy.decorators.azure_blob_storage_input`: For spark tasks whose output should be saved in a azure blob storage. See :ref:`pyspark` for more details on how to implement pipelines using these inputs. .. note:: Make sure that the `delta-spark` extension is installed into your spark cluster. See more `here `_. Directory structure ------------------- The following decorators denote the directory in the project directory * :py:func:`~luisy.decorators.raw`: For tasks whose output is in :code:`working_dir/project_name/raw` * :py:func:`~luisy.decorators.interim`: For tasks whose output is in :code:`working_dir/project_name/interim` * :py:func:`~luisy.decorators.final`: For tasks whose output is in :code:`working_dir/project_name/final` The :code:`project_name` dir-layer can be modified using :py:func:`~luisy.decorators.project_name` .. automodule:: luisy.decorators :noindex: