luigi.contrib.scalding module

luigi.contrib.scalding.logger = <Logger luigi-interface (WARNING)>

Scalding support for Luigi.

Example configuration section in luigi.cfg:

[scalding]
# scala home directory, which should include a lib subdir with scala jars.
scala-home: /usr/share/scala

# scalding home directory, which should include a lib subdir with
# scalding-*-assembly-* jars as built from the official Twitter build script.
scalding-home: /usr/share/scalding

# provided dependencies, e.g. jars required for compiling but not executing
# scalding jobs. Currently required jars:
# org.apache.hadoop/hadoop-core/0.20.2
# org.slf4j/slf4j-log4j12/1.6.6
# log4j/log4j/1.2.15
# commons-httpclient/commons-httpclient/3.1
# commons-cli/commons-cli/1.2
# org.apache.zookeeper/zookeeper/3.3.4
scalding-provided: /usr/share/scalding/provided

# additional jars required.
scalding-libjars: /usr/share/scalding/libjars
class luigi.contrib.scalding.ScaldingJobRunner[source]

Bases: JobRunner

JobRunner for pyscald commands. Used to run a ScaldingJobTask.

get_scala_jars(include_compiler=False)[source]
get_scalding_jars()[source]
get_scalding_core()[source]
get_provided_jars()[source]
get_libjars()[source]
get_tmp_job_jar(source)[source]
get_build_dir(source)[source]
get_job_class(source)[source]
build_job_jar(job)[source]
run_job(job, tracking_url_callback=None)[source]
class luigi.contrib.scalding.ScaldingJobTask(*args, **kwargs)[source]

Bases: BaseHadoopJobTask

A job task for Scalding that define a scala source and (optional) main method.

requires() should return a dictionary where the keys are Scalding argument names and values are sub tasks or lists of subtasks.

For example:

{'input1': A, 'input2': C} => --input1 <Aoutput> --input2 <Coutput>
{'input1': [A, B], 'input2': [C]} => --input1 <Aoutput> <Boutput> --input2 <Coutput>
relpath(current_file, rel_path)[source]

Compute path given current file and relative path.

source()[source]

Path to the scala source for this Scalding Job

Either one of source() or jar() must be specified.

jar()[source]

Path to the jar file for this Scalding Job

Either one of source() or jar() must be specified.

extra_jars()[source]

Extra jars for building and running this Scalding Job.

job_class()[source]

optional main job class for this Scalding Job.

job_runner()[source]
atomic_output()[source]

If True, then rewrite output arguments to be temp locations and atomically move them into place after the job finishes.

requires()[source]

The Tasks that this Task depends on.

A Task will only run if all of the Tasks that it requires are completed. If your Task does not require any other Tasks, then you don’t need to override this method. Otherwise, a subclass can override this method to return a single Task, a list of Task instances, or a dict whose values are Task instances.

See Task.requires

job_args()[source]

Extra arguments to pass to the Scalding job.

args()[source]

Returns an array of args to pass to the job.