luigi.contrib.beam_dataflow module

class luigi.contrib.beam_dataflow.DataflowParamKeys

    Bases: object

    Defines the naming conventions for Dataflow execution params. For example, the Java API expects param names in lower camel case, whereas the Python implementation expects snake case.
    runner
    project
    zone
    region
    staging_location
    temp_location
    gcp_temp_location
    num_workers
    autoscaling_algorithm
    max_num_workers
    disk_size_gb
    worker_machine_type
    worker_disk_type
    job_name
    service_account
    network
    subnetwork
    labels
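    Subclasses of DataflowParamKeys map each attribute to the flag name expected by a particular Beam SDK. As a minimal sketch, assuming a snake_case mapping for the Python SDK (the class name and flag strings below are illustrative, not the shipped implementation):

        # Illustrative sketch: a DataflowParamKeys subclass that maps each
        # attribute to a snake_case flag name. Name and values are assumptions.
        from luigi.contrib.beam_dataflow import DataflowParamKeys

        class SnakeCaseDataflowParamKeys(DataflowParamKeys):
            runner = "runner"
            project = "project"
            zone = "zone"
            region = "region"
            staging_location = "staging_location"
            temp_location = "temp_location"
            gcp_temp_location = "gcp_temp_location"
            num_workers = "num_workers"
            autoscaling_algorithm = "autoscaling_algorithm"
            max_num_workers = "max_num_workers"
            disk_size_gb = "disk_size_gb"
            worker_machine_type = "worker_machine_type"
            worker_disk_type = "worker_disk_type"
            job_name = "job_name"
            service_account = "service_account"
            network = "network"
            subnetwork = "subnetwork"
            labels = "labels"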
class luigi.contrib.beam_dataflow.BeamDataflowJobTask

    Bases: luigi.task.MixinNaiveBulkComplete, luigi.task.Task

    Luigi wrapper for a Dataflow job. Must be overridden for each Beam SDK with that SDK's dataflow_executable().

    For more documentation, see:
    https://cloud.google.com/dataflow/docs/guides/specifying-exec-params

    The following required Dataflow properties must be set:

        project           # GCP project ID
        temp_location     # Cloud Storage path for temporary files
    The following optional Dataflow properties can be set:

        runner                 # PipelineRunner implementation for your Beam job.
                               # Default: DirectRunner
        num_workers            # The number of workers to start the task with.
                               # Default: determined by the Dataflow service
        autoscaling_algorithm  # The autoscaling mode for the Dataflow job.
                               # Default: THROUGHPUT_BASED
        max_num_workers        # Maximum number of workers; used if autoscaling is enabled.
                               # Default: determined by the Dataflow service
        network                # GCE network to be used for launching workers.
                               # Default: a network named "default"
        subnetwork             # GCE subnetwork to be used for launching workers.
                               # Default: determined by the Dataflow service
        disk_size_gb           # Remote worker disk size. Minimum value is 30GB.
                               # Default: 0, which uses the GCP project default
        worker_machine_type    # Machine type for the Dataflow worker VMs.
                               # Default: determined by the Dataflow service
        job_name               # Custom job name; must be unique across the project's active jobs.
        worker_disk_type       # Full URL of the disk type resource to use, e.g. SSD for local disk;
                               # otherwise defaults to hard disk.
                               # Default: determined by the Dataflow service
        service_account        # Service account of the Dataflow VMs/workers.
                               # Default: the active GCE service account
        region                 # Region to deploy the Dataflow job to.
                               # Default: us-central1
        zone                   # Availability zone for launching worker instances.
                               # Default: an available zone in the specified region
        staging_location       # Cloud Storage bucket for Dataflow to stage binary files.
                               # Default: the value of temp_location
        gcp_temp_location      # Cloud Storage path for Dataflow to stage temporary files.
                               # Default: the value of temp_location
        labels                 # Custom GCP labels attached to the Dataflow job.
                               # Default: nothing
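    A minimal sketch of a concrete subclass, assuming a Python-SDK pipeline and reusing the SnakeCaseDataflowParamKeys sketch above; the task name, GCS paths, and pipeline script are illustrative assumptions:

        # Illustrative sketch: task name, GCS paths, and pipeline script are assumptions.
        import luigi
        from luigi.contrib import gcs
        from luigi.contrib.beam_dataflow import BeamDataflowJobTask

        class MyDataflowTask(BeamDataflowJobTask):
            # Required Dataflow properties
            project = "my-gcp-project"
            temp_location = "gs://my-bucket/tmp"

            # Optional Dataflow properties
            region = "us-central1"
            num_workers = 2
            labels = {"team": "data"}

            # Must be a DataflowParamKeys instance matching the SDK's expected
            # flag names (here, the SnakeCaseDataflowParamKeys sketch above).
            dataflow_params = SnakeCaseDataflowParamKeys()

            def dataflow_executable(self):
                # Command prefix used to launch the Beam pipeline.
                return ["python", "my_pipeline.py"]

            def output(self):
                return gcs.GCSTarget("gs://my-bucket/output/")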
    project = None
    runner = None
    temp_location = None
    staging_location = None
    gcp_temp_location = None
    num_workers = None
    autoscaling_algorithm = None
    max_num_workers = None
    network = None
    subnetwork = None
    disk_size_gb = None
    worker_machine_type = None
    job_name = None
    worker_disk_type = None
    service_account = None
    zone = None
    region = None
    labels = {}
    cmd_line_runner
        alias of _CmdLineRunner

    dataflow_params = None
    dataflow_executable()
        Command representing the Dataflow executable to be run. For example:

            return ['java', 'com.spotify.luigi.MyClass', '-Xmx256m']
    args()
        Extra string arguments that will be passed to your Dataflow job. For example:

            return ['--setup_file=setup.py']
    before_run()
        Hook that gets called right before the Dataflow job is launched. Can be used to set up any temporary files/tables, validate input, etc.
    on_successful_run()
        Callback that gets called right after the Dataflow job has finished successfully, but before validate_output is run.
    validate_output()
        Callback that can be used to validate your output before it is moved to its final location. Returning False here will cause the job to fail, and the output to be removed instead of published.
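        For example, a sketch of an override that checks for a success marker in the staged output before it is published (the GCS path and the _SUCCESS-marker convention are assumptions about the pipeline, not part of the API):

            def validate_output(self):
                # Hypothetical check: only publish if the pipeline wrote a
                # _SUCCESS marker alongside its output files.
                from luigi.contrib import gcs
                client = gcs.GCSClient()
                return client.exists("gs://my-bucket/staging-output/_SUCCESS")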
    file_pattern()
        If one or more of the input target files do not follow the part-* naming pattern, add the key of the required target and the correct file pattern that should be appended in the command line here. If an input target's key is not found in this dict, its file pattern is assumed to be part-*.

        Returns: a dictionary of overridden file patterns for inputs whose files are not named part-*
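        For example, a sketch of an override for a task with one input whose files are named data-* rather than part-* (the "other_input" key and pattern are illustrative; the key is assumed to match the corresponding key of this task's inputs):

            def file_pattern(self):
                # Hypothetical: the input registered under "other_input"
                # produces files named data-* instead of part-*.
                return {"other_input": "data-*"}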
    on_successful_output_validation()
        Callback that gets called after the Dataflow job has finished successfully, if validate_output returns True.