luigi.contrib.beam_dataflow module

class luigi.contrib.beam_dataflow.DataflowParamKeys[source]

Bases: object

Defines the naming conventions for Dataflow execution params. For example, the Java API expects param names in lower camel case, whereas the Python implementation expects snake case.

runner
project
zone
region
staging_location
temp_location
gcp_temp_location
num_workers
autoscaling_algorithm
max_num_workers
disk_size_gb
worker_machine_type
worker_disk_type
job_name
service_account
network
subnetwork
labels
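
For illustration, a minimal sketch of a concrete implementation, assuming an SDK that expects snake_case flag names (the class name and flag strings below are assumptions, not part of this module):

    from luigi.contrib.beam_dataflow import DataflowParamKeys

    class SnakeCaseParamKeys(DataflowParamKeys):
        """Hypothetical param-key mapping for an SDK expecting snake_case flags."""
        runner = "runner"
        project = "project"
        zone = "zone"
        region = "region"
        staging_location = "staging_location"
        temp_location = "temp_location"
        gcp_temp_location = "gcp_temp_location"
        num_workers = "num_workers"
        autoscaling_algorithm = "autoscaling_algorithm"
        max_num_workers = "max_num_workers"
        disk_size_gb = "disk_size_gb"
        worker_machine_type = "worker_machine_type"
        worker_disk_type = "worker_disk_type"
        job_name = "job_name"
        service_account = "service_account"
        network = "network"
        subnetwork = "subnetwork"
        labels = "labels"
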
class luigi.contrib.beam_dataflow.BeamDataflowJobTask[source]

Bases: luigi.task.MixinNaiveBulkComplete, luigi.task.Task

Luigi wrapper for a Dataflow job. Must be overridden for each Beam SDK with that SDK’s dataflow_executable().

For more documentation, see:
https://cloud.google.com/dataflow/docs/guides/specifying-exec-params

The following required Dataflow properties must be set:

project           # GCP project ID
temp_location     # Cloud Storage path for temporary files

The following optional Dataflow properties can be set:

runner                   # PipelineRunner implementation for your Beam job.
                         Default: DirectRunner
num_workers              # The number of workers to start the task with.
                         Default: determined by the Dataflow service
autoscaling_algorithm    # The autoscaling mode for the Dataflow job.
                         Default: THROUGHPUT_BASED
max_num_workers          # Maximum number of workers, used if autoscaling is enabled.
                         Default: determined by the Dataflow service
network                  # GCE network to use for launching workers.
                         Default: a network named "default"
subnetwork               # GCE subnetwork to use for launching workers.
                         Default: determined by the Dataflow service
disk_size_gb             # Remote worker disk size in GB; the minimum value is 30.
                         Default: 0, which uses the GCP project default
worker_machine_type      # Machine type to use for the Dataflow worker VMs.
                         Default: determined by the Dataflow service
job_name                 # Custom job name; must be unique across the project's active jobs.
worker_disk_type         # Full URL of the disk type resource, e.g. to use SSD instead of the default hard disk.
                         Default: determined by the Dataflow service
service_account          # Service account of the Dataflow VMs/workers.
                         Default: the active GCE service account
region                   # Region to deploy the Dataflow job to.
                         Default: us-central1
zone                     # Availability zone for launching worker instances.
                         Default: an available zone in the specified region
staging_location         # Cloud Storage bucket for Dataflow to stage binary files.
                         Default: the value of temp_location
gcp_temp_location        # Cloud Storage path for Dataflow to stage temporary files.
                         Default: the value of temp_location
labels                   # Custom GCP labels to attach to the Dataflow job.
                         Default: none
project = None
runner = None
temp_location = None
staging_location = None
gcp_temp_location = None
num_workers = None
autoscaling_algorithm = None
max_num_workers = None
network = None
subnetwork = None
disk_size_gb = None
worker_machine_type = None
job_name = None
worker_disk_type = None
service_account = None
zone = None
region = None
labels = {}
cmd_line_runner

alias of _CmdLineRunner

dataflow_params = None
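
As a rough sketch of how these pieces fit together: a subclass assigns a DataflowParamKeys instance to dataflow_params, sets the required properties, and implements dataflow_executable(). All task names, buckets, and values below are placeholders, and SnakeCaseParamKeys refers to the hypothetical sketch shown earlier:

    from luigi.contrib.gcs import GCSTarget
    from luigi.contrib.beam_dataflow import BeamDataflowJobTask

    class WordCountDataflow(BeamDataflowJobTask):
        """Hypothetical task that launches a Python Beam pipeline on Dataflow."""

        # Naming convention for the SDK's flags (see the DataflowParamKeys sketch above).
        dataflow_params = SnakeCaseParamKeys()

        # Required properties (placeholder values).
        project = "my-gcp-project"
        temp_location = "gs://my-bucket/dataflow/tmp"

        # A couple of the optional properties.
        runner = "DataflowRunner"
        region = "europe-west1"

        def dataflow_executable(self):
            # Command that the Dataflow properties and args() are appended to.
            return ["python", "wordcount_pipeline.py"]

        def output(self):
            return GCSTarget("gs://my-bucket/wordcount/output/")
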
dataflow_executable()[source]

Command representing the Dataflow executable to be run. For example:

    return ['java', 'com.spotify.luigi.MyClass', '-Xmx256m']

args()[source]

Extra String arguments that will be passed to your Dataflow job. For example:

    return ['--setup_file=setup.py']

before_run()[source]

Hook that gets called right before the Dataflow job is launched. Can be used to setup any temporary files/tables, validate input, etc.
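
For example, a sketch (continuing the hypothetical task above) that fails fast if an upstream input is missing:

    import luigi
    from luigi.contrib.beam_dataflow import BeamDataflowJobTask

    class WordCountDataflow(BeamDataflowJobTask):
        # ... as sketched above

        def before_run(self):
            # Check every upstream target before paying for Dataflow workers.
            for target in luigi.task.flatten(self.input()):
                if not target.exists():
                    raise RuntimeError("Missing input: {}".format(target))
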

on_successful_run()[source]

Callback that gets called right after the Dataflow job has finished successfully but before validate_output is run.

validate_output()[source]

Callback that can be used to validate your output before it is moved to its final location. Returning False here will cause the job to fail, and its output to be removed instead of published.
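
A hedged sketch, continuing the hypothetical task above; _count_output_rows() is an assumed helper, not part of luigi:

    class WordCountDataflow(BeamDataflowJobTask):
        # ... as sketched above

        def validate_output(self):
            # Only publish the output if the pipeline produced at least one row.
            # _count_output_rows() is a hypothetical helper, not part of luigi.
            return self._count_output_rows() > 0
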

file_pattern()[source]

If one or more of the input targets' files do not follow the default part-* naming pattern, return a mapping from the affected target's key to the file pattern that should be appended on the command line. Any input target key not found in this dict is assumed to use the part-* pattern.

:return: A dictionary mapping input target keys to their overridden (non part-*) file patterns.
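
For example, a sketch (continuing the hypothetical task above) assuming a dependency keyed as "side_input" that writes Avro files instead of part-* shards:

    class WordCountDataflow(BeamDataflowJobTask):
        # ... as sketched above

        def file_pattern(self):
            # Append "*.avro" for the "side_input" target; every other input
            # keeps the default part-* pattern.
            return {"side_input": "*.avro"}
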

on_successful_output_validation()[source]

Callback that gets called after the Dataflow job has finished successfully if validate_output returns True.

cleanup_on_error(error)[source]

Callback that gets called after the Dataflow job has finished unsuccessfully, or if validate_output returns False.
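
A sketch, continuing the hypothetical task above; _drop_temp_table() is an assumed helper, not part of luigi:

    import logging

    class WordCountDataflow(BeamDataflowJobTask):
        # ... as sketched above

        def cleanup_on_error(self, error):
            # Tear down whatever before_run created, then surface the failure.
            # _drop_temp_table() is a hypothetical helper, not part of luigi.
            self._drop_temp_table()
            logging.getLogger(__name__).error("Dataflow job failed: %s", error)
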

run()[source]

The task run method, to be overridden in a subclass.

See Task.run

static get_target_path(target)[source]

Given a luigi Target, determine a stringly typed path to pass as a Dataflow job argument.
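
For example, assuming a GCS output (the bucket path is a placeholder):

    from luigi.contrib.gcs import GCSTarget
    from luigi.contrib.beam_dataflow import BeamDataflowJobTask

    # File-based targets such as GCSTarget resolve to their path string.
    BeamDataflowJobTask.get_target_path(GCSTarget("gs://my-bucket/wordcount/output/"))
    # -> "gs://my-bucket/wordcount/output/"
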