luigi.contrib.spark module

class luigi.contrib.spark.SparkSubmitTask(*args, **kwargs)[source]

Bases: ExternalProgramTask

Template task for running a Spark job

Supports running jobs on Spark local, standalone, Mesos, or YARN

See http://spark.apache.org/docs/latest/submitting-applications.html for more information
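
A minimal subclass sketch (the jar, entry class, parameter, and output path below are hypothetical, not part of luigi):

    import luigi
    from luigi.contrib.spark import SparkSubmitTask

    class MySparkJob(SparkSubmitTask):
        # Hypothetical application jar and main class
        app = 'my-app-assembly.jar'
        entry_class = 'com.example.MySparkJob'
        name = 'My Spark Job'

        date = luigi.DateParameter()

        def output(self):
            # Hypothetical success marker written by the job
            return luigi.LocalTarget('/tmp/my-spark-job/%s' % self.date)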

name = None
entry_class = None
app = None
always_log_stderr = False
stream_for_searching_tracking_url = 'stderr'

Defines which output stream is searched for the tracking URL; may be set to ‘stdout’, ‘stderr’ or ‘none’.

In the base ExternalProgramTask the default is ‘none’, so no URL tracking is performed; SparkSubmitTask sets it to ‘stderr’.

property tracking_url_pattern

Regex pattern used to extract the tracking URL from the stream selected by stream_for_searching_tracking_url.
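
For example, a subclass submitting to YARN might watch stderr for the tracking URL (the pattern below is illustrative; the first capture group must match the URL itself):

    from luigi.contrib.spark import SparkSubmitTask

    class TrackedSparkJob(SparkSubmitTask):
        stream_for_searching_tracking_url = 'stderr'

        @property
        def tracking_url_pattern(self):
            # Illustrative regex; group(1) captures the URL
            return r'tracking URL: (https?://\S+)'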

app_options()[source]

Override this method to map your task parameters to the app’s arguments
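
A sketch of that mapping (the jar and parameters here are hypothetical):

    import luigi
    from luigi.contrib.spark import SparkSubmitTask

    class CopyData(SparkSubmitTask):
        app = 'copy-data.jar'  # hypothetical
        src = luigi.Parameter()
        dst = luigi.Parameter()

        def app_options(self):
            # Passed as positional arguments to the application's main()
            return [self.src, self.dst]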

property pyspark_python
property pyspark_driver_python
property hadoop_user_name
property spark_version
property spark_submit
property master
property deploy_mode
property jars
property packages
property py_files
property files
property conf
property properties_file
property driver_memory
property driver_java_options
property driver_library_path
property driver_class_path
property executor_memory
property driver_cores
property supervise
property total_executor_cores
property executor_cores
property queue
property num_executors
property archives
property hadoop_conf_dir
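
Each of these properties typically falls back to the [spark] section of the luigi configuration, with the key spelled like the corresponding spark-submit flag, so cluster-wide defaults can live in the config file. An illustrative snippet (assuming the hyphenated key spelling; values are examples only):

    [spark]
    spark-submit: /usr/bin/spark-submit
    master: yarn
    deploy-mode: cluster
    driver-memory: 4g
    executor-memory: 8g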
get_environment()[source]
program_environment()[source]

Override this method to control environment variables for the program

Returns:

dict mapping environment variable names to values
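
A sketch of extending the inherited environment (the variable below is hypothetical):

    from luigi.contrib.spark import SparkSubmitTask

    class EnvAwareSparkJob(SparkSubmitTask):
        app = 'my-app.jar'  # hypothetical

        def program_environment(self):
            env = super().program_environment()
            # Hypothetical variable made visible to the spark-submit process
            env['MY_APP_ENV'] = 'production'
            return env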

program_args()[source]

Override this method to map your task parameters to the program arguments

Returns:

list to pass as args to subprocess.Popen

spark_command()[source]
app_command()[source]
class luigi.contrib.spark.PySparkTask(*args, **kwargs)[source]

Bases: SparkSubmitTask

Template task for running an inline PySpark job

Simply implement the main method in your subclass

You can optionally define package names to be distributed to the cluster with py_packages (uses luigi’s global py-packages configuration by default)
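
A minimal word-count sketch (the parameter names and paths are hypothetical):

    import luigi
    from luigi.contrib.spark import PySparkTask

    class InlineWordCount(PySparkTask):
        src = luigi.Parameter()
        dst = luigi.Parameter()

        def main(self, sc, *args):
            # sc is the SparkContext created by the pyspark runner
            counts = (sc.textFile(self.src)
                        .flatMap(lambda line: line.split())
                        .map(lambda word: (word, 1))
                        .reduceByKey(lambda a, b: a + b))
            counts.saveAsTextFile(self.dst)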

app = '<site-packages>/luigi/contrib/pyspark_runner.py'
property name
property py_packages
property files
property pickle_protocol
setup(conf)[source]

Called by the pyspark_runner with a SparkConf instance that will be used to instantiate the SparkContext

Parameters:

conf – SparkConf
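
For example, a subclass could tune the SparkConf before the context is created (the serializer setting is illustrative):

    from luigi.contrib.spark import PySparkTask

    class TunedPySparkJob(PySparkTask):
        def setup(self, conf):
            # conf is a pyspark SparkConf; mutate it in place
            conf.set('spark.serializer',
                     'org.apache.spark.serializer.KryoSerializer')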

setup_remote(sc)[source]
main(sc, *args)[source]

Called by the pyspark_runner with a SparkContext and any arguments returned by app_options()

Parameters:
  • sc – SparkContext

  • args – arguments list

app_command()[source]
run()[source]

The task run method, to be overridden in a subclass.

See Task.run
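
Tasks built on these templates run like any other luigi task; for instance, assuming the hypothetical InlineWordCount sketch above (the HDFS paths are examples only):

    import luigi

    luigi.build(
        [InlineWordCount(src='hdfs:///tmp/in.txt', dst='hdfs:///tmp/out')],
        local_scheduler=True,
    )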