luigi.contrib.hive module¶
- exception luigi.contrib.hive.HiveCommandError(message, out=None, err=None)[source]¶
Bases: RuntimeError
- luigi.contrib.hive.run_hive(args, check_return_code=True)[source]¶
Runs hive from the command line, passing in the given args, and returns stdout.
With the Apache release of Hive, some of the table existence checks (which are done using DESCRIBE) do not exit with a return code of 0, so we need an option to ignore the return code and just return stdout for parsing.
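The effect of check_return_code can be illustrated without a Hive installation. The following is a minimal sketch, not the module's implementation: run_command and CommandError are hypothetical names, and a child Python process stands in for the hive binary.

```python
import subprocess
import sys


class CommandError(RuntimeError):
    """Hypothetical stand-in for HiveCommandError: keeps stdout and stderr."""

    def __init__(self, message, out=None, err=None):
        super().__init__(message, out, err)
        self.message = message
        self.out = out
        self.err = err


def run_command(cmd, check_return_code=True):
    """Run cmd (a list of strings) and return its stdout as text.

    When check_return_code is False, a non-zero exit status is ignored
    and stdout is still returned so the caller can parse it.
    """
    proc = subprocess.run(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    if check_return_code and proc.returncode != 0:
        raise CommandError(
            "Command {} failed with exit code {}".format(cmd, proc.returncode),
            proc.stdout,
            proc.stderr,
        )
    return proc.stdout


# Assumption for illustration: this child process imitates a command that
# prints usable output but still exits with a non-zero status.
fake_describe = [
    sys.executable,
    "-c",
    "import sys; print('table_found'); sys.exit(1)",
]

# Lenient mode returns stdout despite the exit status of 1.
lenient_out = run_command(fake_describe, check_return_code=False)
```

In strict mode (the default) the same call raises CommandError, and the captured stdout remains available on the exception's out attribute.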
- luigi.contrib.hive.run_hive_cmd(hivecmd, check_return_code=True)[source]¶
Runs the given hive query and returns stdout.
- luigi.contrib.hive.run_hive_script(script)[source]¶
Runs the contents of the given script in hive and returns stdout.
- class luigi.contrib.hive.HiveClient[source]¶
Bases: object
- abstract table_location(table, database='default', partition=None)[source]¶
Returns location of db.table (or db.table.partition). partition is a dict of partition key to value.
- abstract table_schema(table, database='default')[source]¶
Returns list of [(name, type)] for each column in database.table.
- class luigi.contrib.hive.HiveCommandClient[source]¶
Bases: HiveClient
Uses hive invocations to find information.
- table_location(table, database='default', partition=None)[source]¶
Returns location of db.table (or db.table.partition). partition is a dict of partition key to value.
- table_exists(table, database='default', partition=None)[source]¶
Returns true if db.table (or db.table.partition) exists. partition is a dict of partition key to value.
- class luigi.contrib.hive.ApacheHiveCommandClient[source]¶
Bases: HiveCommandClient
A subclass for the HiveCommandClient to (in some cases) ignore the return code from the hive command so that we can just parse the output.
- class luigi.contrib.hive.MetastoreClient[source]¶
Bases: HiveClient
- table_location(table, database='default', partition=None)[source]¶
Returns location of db.table (or db.table.partition). partition is a dict of partition key to value.
- table_exists(table, database='default', partition=None)[source]¶
Returns true if db.table (or db.table.partition) exists. partition is a dict of partition key to value.
- class luigi.contrib.hive.HiveThriftContext[source]¶
Bases: object
Context manager for hive metastore client.
- class luigi.contrib.hive.WarehouseHiveClient(hdfs_client=None, warehouse_location=None)[source]¶
Bases: HiveClient
Client for managed tables that makes decisions based on the presence of a directory in HDFS.
- table_schema(table, database='default')[source]¶
Returns list of [(name, type)] for each column in database.table.
- table_location(table, database='default', partition=None)[source]¶
Returns location of db.table (or db.table.partition). partition is a dict of partition key to value.
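A directory-based existence check of this kind can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: managed_table_path and table_dir_exists are hypothetical helpers, every database (including default) is assumed to map to a `<database>.db` directory under the warehouse location, partition keys are assumed to be supplied in the table's partition column order, and a plain set of paths stands in for the HDFS client.

```python
import posixpath


def managed_table_path(warehouse_location, table, database="default", partition=None):
    """Build the assumed directory for db.table (or db.table.partition).

    partition is a dict of partition key to value; keys are used in the
    order given, which is assumed to match the partition column order.
    """
    parts = [warehouse_location, database + ".db", table]
    for key, value in (partition or {}).items():
        parts.append("{}={}".format(key, value))
    return posixpath.join(*parts)


def table_dir_exists(exists, warehouse_location, table, database="default", partition=None):
    """Return True if the assumed directory is reported present by exists().

    exists is any callable taking a path and returning a bool; a real
    deployment would pass something backed by an HDFS client.
    """
    return exists(managed_table_path(warehouse_location, table, database, partition))


# Stand-in for HDFS: a set of directories assumed to be present.
known_dirs = {"/user/hive/warehouse/sales.db/orders/date=2013-01-25"}

partition_path = managed_table_path(
    "/user/hive/warehouse", "orders", database="sales", partition={"date": "2013-01-25"}
)
found = table_dir_exists(
    known_dirs.__contains__, "/user/hive/warehouse", "orders", "sales", {"date": "2013-01-25"}
)
missing = table_dir_exists(
    known_dirs.__contains__, "/user/hive/warehouse", "orders", "sales", {"date": "2013-01-26"}
)
```

Passing the existence check in as a callable keeps the path logic testable without a cluster.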
- class luigi.contrib.hive.HiveQueryTask(*args, **kwargs)[source]¶
Bases: BaseHadoopJobTask
Task to run a hive query.
- n_reduce_tasks = None¶
- bytes_per_reducer = None¶
- reducers_max = None¶
- hiverc()[source]¶
Location of an rc file to run before the query. If the hiverc-location key is specified in luigi.cfg, this defaults to the value there; otherwise it returns None.
Returning a list of rc files will load all of them in order.
- hivevars()[source]¶
Returns a dict of key=value settings to be passed along to the hive command line via --hivevar. This option can be used as a separate namespace for script-local variables. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+VariableSubstitution
- hiveconfs()[source]¶
Returns a dict of key=value settings to be passed along to the hive command line via --hiveconf. By default, sets mapred.job.name to task_id and, for each of the following attributes that is not None, sets:
mapred.reduce.tasks (n_reduce_tasks)
mapred.fairscheduler.pool (pool) or mapred.job.queue.name (pool)
hive.exec.reducers.bytes.per.reducer (bytes_per_reducer)
hive.exec.reducers.max (reducers_max)
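The defaults listed above can be sketched as a small standalone function. This is a hedged illustration rather than the method's real body: build_hiveconfs and to_cli_args are hypothetical names, and the pool setting is left out because which of the two keys applies depends on the scheduler configuration.

```python
def build_hiveconfs(task_id, n_reduce_tasks=None, bytes_per_reducer=None, reducers_max=None):
    """Assemble the key=value settings described above.

    mapred.job.name is always set; the other keys are added only when the
    corresponding attribute is not None.
    """
    confs = {"mapred.job.name": task_id}
    optional = {
        "mapred.reduce.tasks": n_reduce_tasks,
        "hive.exec.reducers.bytes.per.reducer": bytes_per_reducer,
        "hive.exec.reducers.max": reducers_max,
    }
    for key, value in optional.items():
        if value is not None:
            confs[key] = value
    return confs


def to_cli_args(flag, settings):
    """Turn a settings dict into repeated '<flag> key=value' arguments."""
    args = []
    for key, value in settings.items():
        args += [flag, "{}={}".format(key, value)]
    return args


# Hypothetical task id, used only for this example.
confs = build_hiveconfs("MyQuery_2013_01_25", n_reduce_tasks=10)
cli_args = to_cli_args("--hiveconf", confs)
```

The same to_cli_args helper would work for a hivevars() dict by passing "--hivevar" as the flag.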
- class luigi.contrib.hive.HiveQueryRunner[source]¶
Bases: JobRunner
Runs a HiveQueryTask by shelling out to hive.
- class luigi.contrib.hive.HivePartitionTarget(table, partition, database='default', fail_missing_table=True, client=None)[source]¶
Bases: Target
Target representing Hive table or Hive partition
@param table: Table name
@type table: str
@param partition: partition specification in the form of a dict {"partition_column_1": "partition_value_1", "partition_column_2": "partition_value_2", ...}. If partition is None or {}, the target is a non-partitioned Hive table
@param database: Database name
@param fail_missing_table: flag to ignore errors raised due to table nonexistence
@param client: HiveCommandClient instance. A default client is used if client is None
- property path¶
Returns the path for this HivePartitionTarget’s data.
- class luigi.contrib.hive.HiveTableTarget(table, database='default', client=None)[source]¶
Bases: HivePartitionTarget
Target representing non-partitioned table
@param table: Table name
@type table: str
@param partition: partition specification in the form of a dict {"partition_column_1": "partition_value_1", "partition_column_2": "partition_value_2", ...}. If partition is None or {}, the target is a non-partitioned Hive table
@param database: Database name
@param fail_missing_table: flag to ignore errors raised due to table nonexistence
@param client: HiveCommandClient instance. A default client is used if client is None
- class luigi.contrib.hive.ExternalHiveTask(*args, **kwargs)[source]¶
Bases: ExternalTask
External task that depends on a Hive table/partition.
- database = Parameter (defaults to default)¶
- table = Parameter¶
- partition = DictParameter (defaults to {}): Python dictionary specifying the target partition e.g. {"date": "2013-01-25"}¶
- output()[source]¶
The output that this Task produces.
The output of the Task determines if the Task needs to be run: the task is considered finished iff the outputs all exist. Subclasses should override this method to return a single Target or a list of Target instances.
- Implementation note
If running multiple workers, the output must be a resource that is accessible by all workers, such as a DFS or database. Otherwise, workers might compute the same output since they don’t see the work done by other workers.
See Task.output
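The "finished iff the outputs all exist" rule can be sketched in isolation. The snippet below is an illustration only: FakeTarget, as_target_list, and is_finished are hypothetical names, and the only assumption about a target is that it exposes an exists() method.

```python
class FakeTarget:
    """Hypothetical stand-in for a Target: only exposes exists()."""

    def __init__(self, present):
        self.present = present

    def exists(self):
        return self.present


def as_target_list(output):
    """Normalize a single target or a list of targets to a list."""
    if isinstance(output, (list, tuple)):
        return list(output)
    return [output]


def is_finished(output):
    """A task counts as finished only if every one of its outputs exists."""
    return all(target.exists() for target in as_target_list(output))


# One existing output: finished. One missing output among several: not finished.
single_done = is_finished(FakeTarget(True))
mixed_done = is_finished([FakeTarget(True), FakeTarget(False)])
```

For an external Hive task this means the dependency is satisfied once the table or partition named by its parameters exists.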