luigi.contrib.dataproc module

Luigi bindings for Google Dataproc on Google Cloud.
class luigi.contrib.dataproc.DataprocBaseTask(*args, **kwargs)[source]
Bases: luigi.contrib.dataproc._DataprocBaseTask

Base task for running jobs in Dataproc. It is recommended to use one of the tasks specific to your job type. Extend this class if you need fine-grained control over what kind of job gets submitted to your Dataproc cluster.
class luigi.contrib.dataproc.DataprocSparkTask(*args, **kwargs)[source]
Bases: luigi.contrib.dataproc.DataprocBaseTask

Runs a Spark job on your Dataproc cluster.

main_class = Parameter
jars = Parameter (defaults to )
job_args = Parameter (defaults to )
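A minimal sketch of how this task might be subclassed; every class name, GCS path and the comma-separated job_args format below are illustrative assumptions, not taken from the module, and cluster/project selection goes through the base task's own parameters, which are not listed in this section:

    import luigi
    from luigi.contrib.dataproc import DataprocSparkTask


    class AggregateEvents(DataprocSparkTask):
        """Hypothetical Spark step; every value below is illustrative."""
        main_class = luigi.Parameter(default="com.example.AggregateEvents")
        jars = luigi.Parameter(default="gs://my-bucket/jobs/aggregate-events.jar")
        # Assumed to be a comma-separated string of arguments passed to the job.
        job_args = luigi.Parameter(default="gs://my-bucket/events/,gs://my-bucket/out/")

Running it then looks like any other luigi task, e.g. luigi --module my_pipeline AggregateEvents --local-scheduler, with the remaining parameters supplied on the command line or in configuration.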
class luigi.contrib.dataproc.DataprocPysparkTask(*args, **kwargs)[source]
Bases: luigi.contrib.dataproc.DataprocBaseTask

Runs a PySpark job on your Dataproc cluster.

job_file = Parameter
extra_files = Parameter (defaults to )
job_args = Parameter (defaults to )
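As with the Spark task above, a hedged sketch; the paths are made up, and extra_files is assumed here to take a comma-separated list of additional files shipped with the job:

    import luigi
    from luigi.contrib.dataproc import DataprocPysparkTask


    class CleanEvents(DataprocPysparkTask):
        """Hypothetical PySpark step; every path below is illustrative."""
        job_file = luigi.Parameter(default="gs://my-bucket/jobs/clean_events.py")
        # Assumed comma-separated list of extra files shipped alongside the job.
        extra_files = luigi.Parameter(default="gs://my-bucket/jobs/helpers.py")
        job_args = luigi.Parameter(default="gs://my-bucket/raw/,gs://my-bucket/clean/")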
class luigi.contrib.dataproc.CreateDataprocClusterTask(*args, **kwargs)[source]
Bases: luigi.contrib.dataproc._DataprocBaseTask

Task for creating a Dataproc cluster.

gcloud_zone = Parameter (defaults to europe-west1-c)
gcloud_network = Parameter (defaults to default)
master_node_type = Parameter (defaults to n1-standard-2)
master_disk_size = Parameter (defaults to 100)
worker_node_type = Parameter (defaults to n1-standard-2)
worker_disk_size = Parameter (defaults to 100)
worker_normal_count = Parameter (defaults to 2)
worker_preemptible_count = Parameter (defaults to 0)
image_version = Parameter (defaults to )
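A sketch of overriding a few of the documented defaults; the zone, machine types and counts are illustrative, and passing the counts as strings (matching the plain Parameter type shown above) is an assumption:

    import luigi
    from luigi.contrib.dataproc import CreateDataprocClusterTask


    class CreateAnalyticsCluster(CreateDataprocClusterTask):
        """Hypothetical cluster spec; values are illustrative only."""
        gcloud_zone = luigi.Parameter(default="us-central1-b")
        worker_node_type = luigi.Parameter(default="n1-standard-4")
        worker_normal_count = luigi.Parameter(default="4")       # assumed string, like the other Parameters
        worker_preemptible_count = luigi.Parameter(default="2")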
class luigi.contrib.dataproc.DeleteDataprocClusterTask(*args, **kwargs)[source]
Bases: luigi.contrib.dataproc._DataprocBaseTask

Task for deleting a Dataproc cluster. One use of this class is to extend it and have it require a Dataproc task that does a calculation, with that task in turn built on the cluster creation task. This lets you build chains that create a cluster, run your job and remove the cluster right away, as sketched below. (Store your input and output files in gs://… rather than hdfs://… if you do this, since the cluster's HDFS disappears along with the cluster.)
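One way such a chain could be wired with requires(); the class names and GCS paths are illustrative, and the shared cluster/project parameters from the base task (not listed in this section) are assumed to come from the command line or configuration:

    import luigi
    from luigi.contrib.dataproc import (CreateDataprocClusterTask,
                                        DataprocPysparkTask,
                                        DeleteDataprocClusterTask)


    class CreateCluster(CreateDataprocClusterTask):
        """Brings the cluster up, using the documented defaults as-is."""


    class Calculate(DataprocPysparkTask):
        """Runs the actual calculation once the cluster exists (paths are illustrative)."""
        job_file = luigi.Parameter(default="gs://my-bucket/jobs/calculate.py")
        job_args = luigi.Parameter(default="gs://my-bucket/in/,gs://my-bucket/out/")

        def requires(self):
            return CreateCluster()


    class TearDownCluster(DeleteDataprocClusterTask):
        """Removes the cluster as soon as the calculation has finished."""

        def requires(self):
            return Calculate()

Scheduling TearDownCluster then creates the cluster, runs the job and deletes the cluster in one go, with all inputs and outputs kept in gs://… as recommended above.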