luigi.contrib.dataproc module

Luigi bindings for Google Cloud Dataproc.

luigi.contrib.dataproc.get_dataproc_client()[source]
luigi.contrib.dataproc.set_dataproc_client(client)[source]
class luigi.contrib.dataproc.DataprocBaseTask(*args, **kwargs)[source]

Bases: _DataprocBaseTask

Base task for running jobs on Dataproc. It is recommended to use one of the tasks specific to your job type; extend this class only if you need fine-grained control over the kind of job submitted to your Dataproc cluster.

submit_job(job_config)[source]
submit_spark_job(jars, main_class, job_args=None)[source]
submit_pyspark_job(job_file, extra_files=[], job_args=None)[source]
wait_for_job()[source]
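
For example, a job type not covered by the specific tasks can be submitted through submit_job directly. The sketch below is hypothetical: the Hive payload mirrors the Dataproc jobs.submit REST body rather than anything documented on this page, and the dataproc_cluster_name attribute is assumed to be a parameter inherited from the base task.

    import luigi
    from luigi.contrib.dataproc import DataprocBaseTask

    class HiveOnDataprocTask(DataprocBaseTask):
        """Hypothetical task submitting a Hive job through the generic submit_job hook."""
        query_file = luigi.Parameter()

        def run(self):
            # Payload shaped like the Dataproc jobs.submit REST body (an assumption,
            # not something this module documents); self.dataproc_cluster_name is
            # assumed to be inherited from the base task.
            job_config = {"job": {
                "placement": {"clusterName": self.dataproc_cluster_name},
                "hiveJob": {"queryFileUri": self.query_file},
            }}
            self.submit_job(job_config)
            self.wait_for_job()
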
class luigi.contrib.dataproc.DataprocSparkTask(*args, **kwargs)[source]

Bases: DataprocBaseTask

Runs a Spark job on your Dataproc cluster.

main_class = Parameter
jars = Parameter (defaults to )
job_args = Parameter (defaults to )
run()[source]

The task run method, to be overridden in a subclass.

See Task.run
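
A minimal sketch of a concrete Spark task, assuming jars and job_args take comma-separated strings (they are plain Parameters above); the class name, jar path and GCS buckets are placeholders. The project and cluster parameters defined on the base task (not shown on this page) still need to be supplied, typically on the command line.

    import luigi
    from luigi.contrib.dataproc import DataprocSparkTask

    class WordCount(DataprocSparkTask):
        """Hypothetical Spark job; parameter defaults below are placeholders."""
        main_class = luigi.Parameter(default="com.example.WordCount")
        jars = luigi.Parameter(default="gs://my-bucket/jars/wordcount.jar")
        # Assumed to be a single comma-separated string of arguments.
        job_args = luigi.Parameter(default="gs://my-bucket/input/,gs://my-bucket/output/")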

class luigi.contrib.dataproc.DataprocPysparkTask(*args, **kwargs)[source]

Bases: DataprocBaseTask

Runs a PySpark job on your Dataproc cluster.

job_file = Parameter
extra_files = Parameter (defaults to )
job_args = Parameter (defaults to )
run()[source]

The task run method, to be overridden in a subclass.

See Task.run
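
A minimal sketch of triggering a PySpark job from Python, again assuming extra_files and job_args accept comma-separated strings; all paths and arguments are placeholders, and the base task's project and cluster parameters are omitted.

    import luigi
    from luigi.contrib.dataproc import DataprocPysparkTask

    luigi.build([
        DataprocPysparkTask(
            job_file="gs://my-bucket/jobs/etl.py",
            extra_files="gs://my-bucket/jobs/helpers.py",  # assumed comma-separated
            job_args="--date,2024-01-01",                  # assumed comma-separated
        )
    ], local_scheduler=True)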

class luigi.contrib.dataproc.CreateDataprocClusterTask(*args, **kwargs)[source]

Bases: _DataprocBaseTask

Task for creating a Dataproc cluster.

gcloud_zone = Parameter (defaults to europe-west1-c)
gcloud_network = Parameter (defaults to default)
master_node_type = Parameter (defaults to n1-standard-2)
master_disk_size = Parameter (defaults to 100)
worker_node_type = Parameter (defaults to n1-standard-2)
worker_disk_size = Parameter (defaults to 100)
worker_normal_count = Parameter (defaults to 2)
worker_preemptible_count = Parameter (defaults to 0)
image_version = Parameter (defaults to )
complete()[source]

If the task has any outputs, return True if all outputs exist. Otherwise, return False.

However, you may freely override this method with custom logic.

run()[source]

The task run method, to be overridden in a subclass.

See Task.run
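
A minimal sketch of creating a cluster with non-default sizing, assuming the parameters above can be passed as constructor arguments; the zone, machine type and worker count are placeholders, and the project and cluster-name parameters from the base task are omitted.

    import luigi
    from luigi.contrib.dataproc import CreateDataprocClusterTask

    luigi.build([
        CreateDataprocClusterTask(
            gcloud_zone="us-central1-b",
            worker_node_type="n1-standard-4",
            worker_normal_count="4",  # plain Parameter, so a string value is assumed
        )
    ], local_scheduler=True)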

class luigi.contrib.dataproc.DeleteDataprocClusterTask(*args, **kwargs)[source]

Bases: _DataprocBaseTask

Task for deleting a Dataproc cluster. One way to use this class is to extend it and have it require a Dataproc task that does your calculation, with that calculation task in turn extending the cluster creation task. This lets you build chains that create a cluster, run your job, and remove the cluster right away. (If you do this, store your input and output files in gs://… rather than hdfs://…, since anything on HDFS disappears with the cluster.)

complete()[source]

If the task has any outputs, return True if all outputs exist. Otherwise, return False.

However, you may freely override this method with custom logic.

run()[source]

The task run method, to be overridden in a subclass.

See Task.run
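
A minimal sketch of the create-run-delete chain described above, wired with requires(); every class name, parameter default and GCS path is a placeholder. Running TearDownCluster then creates the cluster, runs the job, and removes the cluster in one go.

    import luigi
    from luigi.contrib.dataproc import (CreateDataprocClusterTask,
                                        DataprocPysparkTask,
                                        DeleteDataprocClusterTask)

    class CreateCluster(CreateDataprocClusterTask):
        """Hypothetical cluster-creation step."""
        gcloud_zone = luigi.Parameter(default="us-central1-b")

    class RunCalculation(DataprocPysparkTask):
        """The job itself; reads and writes gs:// paths so results outlive the cluster."""
        job_file = luigi.Parameter(default="gs://my-bucket/jobs/etl.py")

        def requires(self):
            return CreateCluster()

    class TearDownCluster(DeleteDataprocClusterTask):
        """Removes the cluster once the calculation has finished."""

        def requires(self):
            return RunCalculation()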