luigi.contrib.hive module

exception luigi.contrib.hive.HiveCommandError(message, out=None, err=None)[source]

Bases: exceptions.RuntimeError

luigi.contrib.hive.load_hive_cmd()[source]
luigi.contrib.hive.get_hive_syntax()[source]
luigi.contrib.hive.run_hive(args, check_return_code=True)[source]

Runs the hive from the command line, passing in the given args, and returning stdout.

With the apache release of Hive, so of the table existence checks (which are done using DESCRIBE do not exit with a return code of 0 so we need an option to ignore the return code and just return stdout for parsing

luigi.contrib.hive.run_hive_cmd(hivecmd, check_return_code=True)[source]

Runs the given hive query and returns stdout.

luigi.contrib.hive.run_hive_script(script)[source]

Runs the contents of the given script in hive and returns stdout.

class luigi.contrib.hive.HiveClient[source]

Bases: object

table_location(table, database='default', partition=None)[source]

Returns location of db.table (or db.table.partition). partition is a dict of partition key to value.

table_schema(table, database='default')[source]

Returns list of [(name, type)] for each column in database.table.

table_exists(table, database='default', partition=None)[source]

Returns true if db.table (or db.table.partition) exists. partition is a dict of partition key to value.

partition_spec(partition)[source]

Turn a dict into a string partition specification

class luigi.contrib.hive.HiveCommandClient[source]

Bases: luigi.contrib.hive.HiveClient

Uses hive invocations to find information.

table_location(table, database='default', partition=None)[source]
table_exists(table, database='default', partition=None)[source]
table_schema(table, database='default')[source]
partition_spec(partition)[source]

Turns a dict into the a Hive partition specification string.

class luigi.contrib.hive.ApacheHiveCommandClient[source]

Bases: luigi.contrib.hive.HiveCommandClient

A subclass for the HiveCommandClient to (in some cases) ignore the return code from the hive command so that we can just parse the output.

table_schema(table, database='default')[source]
class luigi.contrib.hive.MetastoreClient[source]

Bases: luigi.contrib.hive.HiveClient

table_location(table, database='default', partition=None)[source]
table_exists(table, database='default', partition=None)[source]
table_schema(table, database='default')[source]
partition_spec(partition)[source]
class luigi.contrib.hive.HiveThriftContext[source]

Bases: object

Context manager for hive metastore client.

luigi.contrib.hive.get_default_client()[source]
class luigi.contrib.hive.HiveQueryTask(*args, **kwargs)[source]

Bases: luigi.contrib.hadoop.BaseHadoopJobTask

Task to run a hive query.

n_reduce_tasks = None
bytes_per_reducer = None
reducers_max = None
query()[source]

Text of query to run in hive

hiverc()[source]

Location of an rc file to run before the query if hiverc-location key is specified in luigi.cfg, will default to the value there otherwise returns None.

Returning a list of rc files will load all of them in order.

hiveconfs()[source]

Returns an dict of key=value settings to be passed along to the hive command line via –hiveconf. By default, sets mapred.job.name to task_id and if not None, sets:

  • mapred.reduce.tasks (n_reduce_tasks)
  • mapred.fairscheduler.pool (pool) or mapred.job.queue.name (pool)
  • hive.exec.reducers.bytes.per.reducer (bytes_per_reducer)
  • hive.exec.reducers.max (reducers_max)
job_runner()[source]
class luigi.contrib.hive.HiveQueryRunner[source]

Bases: luigi.contrib.hadoop.JobRunner

Runs a HiveQueryTask by shelling out to hive.

prepare_outputs(job)[source]

Called before job is started.

If output is a FileSystemTarget, create parent directories so the hive command won’t fail

run_job(job, tracking_url_callback=None)[source]
class luigi.contrib.hive.HiveTableTarget(table, database='default', client=None)[source]

Bases: luigi.target.Target

exists returns true if the table exists.

exists()[source]
path

Returns the path to this table in HDFS.

open(mode)[source]
class luigi.contrib.hive.HivePartitionTarget(table, partition, database='default', fail_missing_table=True, client=None)[source]

Bases: luigi.target.Target

exists returns true if the table’s partition exists.

exists()[source]
path

Returns the path for this HiveTablePartitionTarget’s data.

open(mode)[source]
class luigi.contrib.hive.ExternalHiveTask(*args, **kwargs)[source]

Bases: luigi.task.ExternalTask

External task that depends on a Hive table/partition.

database = Parameter (defaults to default)
table = Parameter
partition = Parameter (defaults to None): Python dictionary specifying the target partition e.g. {"date": "2013-01-25"}
output()[source]