luigi.contrib.spark module

class luigi.contrib.spark.SparkSubmitTask(*args, **kwargs)[source]

Bases: luigi.contrib.external_program.ExternalProgramTask

Template task for running a Spark job.

Supports running jobs on Spark local, standalone, Mesos, or YARN.

See http://spark.apache.org/docs/latest/submitting-applications.html for more information.

name = None
entry_class = None
app = None
always_log_stderr = False
app_options()[source]

Override this method to map your task parameters to the application's command-line arguments.

The following properties configure the spark-submit invocation; they are typically read from the [spark] section of luigi's configuration, and subclasses may override them directly:

spark_submit
master
deploy_mode
jars
packages
py_files
files
conf
properties_file
driver_memory
driver_java_options
driver_library_path
driver_class_path
executor_memory
driver_cores
supervise
total_executor_cores
executor_cores
queue
num_executors
archives
hadoop_conf_dir
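
These values are normally supplied through configuration, but a minimal sketch of overriding a few of them on the task class could look like the following (the class name and values are illustrative assumptions, not part of this module):

    from luigi.contrib.spark import SparkSubmitTask


    class YarnClusterSparkTask(SparkSubmitTask):
        # Illustrative overrides; these would otherwise come from the
        # [spark] section of the luigi configuration file.

        @property
        def master(self):
            return 'yarn'

        @property
        def deploy_mode(self):
            return 'cluster'

        @property
        def executor_memory(self):
            return '4g'
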
get_environment()[source]
program_environment()[source]
program_args()[source]
spark_command()[source]
app_command()[source]
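
A minimal sketch of a concrete subclass (the application jar, entry class, and date parameter are illustrative assumptions):

    import luigi
    from luigi.contrib.spark import SparkSubmitTask


    class MySparkApp(SparkSubmitTask):
        # Illustrative application jar and entry point
        name = 'My Spark App'
        app = 'build/libs/my-spark-app.jar'
        entry_class = 'com.example.MySparkApp'

        date = luigi.DateParameter()

        def app_options(self):
            # Map task parameters to the application's command-line arguments
            return ['--date', str(self.date)]
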
class luigi.contrib.spark.PySparkTask(*args, **kwargs)[source]

Bases: luigi.contrib.spark.SparkSubmitTask

Template task for running an inline PySpark job.

Simply implement the main method in your subclass.

You can optionally define package names to be distributed to the cluster with py_packages (uses luigi's global py-packages configuration by default).

app = '<luigi package path>/luigi/contrib/pyspark_runner.py'
deploy_mode = 'client'
name
py_packages
setup(conf)[source]

Called by the pyspark_runner with a SparkConf instance that will be used to instantiate the SparkContext

Parameters:
  • conf – SparkConf
setup_remote(sc)[source]
main(sc, *args)[source]

Called by the pyspark_runner with a SparkContext and any arguments returned by app_options()

Parameters:
  • sc – SparkContext
  • args – arguments list
program_args()[source]
app_command()[source]
run()[source]
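
A minimal sketch of a concrete subclass (the parameter names, serializer setting, and word-count logic are illustrative assumptions):

    import luigi
    from luigi.contrib.spark import PySparkTask


    class InlineWordCount(PySparkTask):
        # Illustrative parameters for input and output locations
        input_path = luigi.Parameter()
        output_path = luigi.Parameter()

        def setup(self, conf):
            # Tune the SparkConf before the SparkContext is created
            conf.set('spark.serializer',
                     'org.apache.spark.serializer.KryoSerializer')

        def main(self, sc, *args):
            # Classic word count, written directly against the SparkContext
            counts = (sc.textFile(self.input_path)
                        .flatMap(lambda line: line.split())
                        .map(lambda word: (word, 1))
                        .reduceByKey(lambda a, b: a + b))
            counts.saveAsTextFile(self.output_path)

In practice such a task would typically also declare an output() target so luigi can track completion.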