luigi.contrib.bigquery module

luigi.contrib.bigquery.is_error_5xx(err)[source]
class luigi.contrib.bigquery.CreateDisposition[source]

Bases: object

CREATE_IF_NEEDED = 'CREATE_IF_NEEDED'
CREATE_NEVER = 'CREATE_NEVER'
class luigi.contrib.bigquery.WriteDisposition[source]

Bases: object

WRITE_TRUNCATE = 'WRITE_TRUNCATE'
WRITE_APPEND = 'WRITE_APPEND'
WRITE_EMPTY = 'WRITE_EMPTY'
class luigi.contrib.bigquery.QueryMode[source]

Bases: object

INTERACTIVE = 'INTERACTIVE'
BATCH = 'BATCH'
class luigi.contrib.bigquery.SourceFormat[source]

Bases: object

AVRO = 'AVRO'
CSV = 'CSV'
DATASTORE_BACKUP = 'DATASTORE_BACKUP'
NEWLINE_DELIMITED_JSON = 'NEWLINE_DELIMITED_JSON'
PARQUET = 'PARQUET'
class luigi.contrib.bigquery.FieldDelimiter[source]

Bases: object

The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF8. BigQuery converts the string to ISO-8859-1 encoding, and then uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence “ “ to specify a tab separator. The default value is a comma (‘,’).

https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load

COMMA = ','
TAB = '\t'
PIPE = '|'
class luigi.contrib.bigquery.PrintHeader[source]

Bases: object

TRUE = True
FALSE = False
class luigi.contrib.bigquery.DestinationFormat[source]

Bases: object

AVRO = 'AVRO'
CSV = 'CSV'
NEWLINE_DELIMITED_JSON = 'NEWLINE_DELIMITED_JSON'
class luigi.contrib.bigquery.Compression[source]

Bases: object

GZIP = 'GZIP'
NONE = 'NONE'
class luigi.contrib.bigquery.Encoding[source]

Bases: object

[Optional] The character encoding of the data. The supported values are UTF-8 or ISO-8859-1. The default value is UTF-8.

BigQuery decodes the data after the raw, binary data has been split using the values of the quote and fieldDelimiter properties.

UTF_8 = 'UTF-8'
ISO_8859_1 = 'ISO-8859-1'
class luigi.contrib.bigquery.BQDataset(project_id, dataset_id, location)

Bases: tuple

Create new instance of BQDataset(project_id, dataset_id, location)

dataset_id

Alias for field number 1

location

Alias for field number 2

project_id

Alias for field number 0

class luigi.contrib.bigquery.BQTable(project_id, dataset_id, table_id, location)[source]

Bases: BQTable

Create new instance of BQTable(project_id, dataset_id, table_id, location)

property dataset
property uri
class luigi.contrib.bigquery.BigQueryClient(oauth_credentials=None, descriptor='', http_=None)[source]

Bases: object

A client for Google BigQuery.

For details of how authentication and the descriptor work, see the documentation for the GCS client. The descriptor URL for BigQuery is https://www.googleapis.com/discovery/v1/apis/bigquery/v2/rest

dataset_exists(dataset)[source]

Returns whether the given dataset exists. If regional location is specified for the dataset, that is also checked to be compatible with the remote dataset, otherwise an exception is thrown.

param dataset:

type dataset:

BQDataset

table_exists(table)[source]

Returns whether the given table exists.

Parameters:

table (BQTable) –

make_dataset(dataset, raise_if_exists=False, body=None)[source]

Creates a new dataset with the default permissions.

Parameters:
  • dataset (BQDataset) –

  • raise_if_exists – whether to raise an exception if the dataset already exists.

Raises:

luigi.target.FileAlreadyExists – if raise_if_exists=True and the dataset exists

delete_dataset(dataset, delete_nonempty=True)[source]

Deletes a dataset (and optionally any tables in it), if it exists.

Parameters:
  • dataset (BQDataset) –

  • delete_nonempty – if true, will delete any tables before deleting the dataset

delete_table(table)[source]

Deletes a table, if it exists.

Parameters:

table (BQTable) –

list_datasets(project_id)[source]

Returns the list of datasets in a given project.

Parameters:

project_id (str) –

list_tables(dataset)[source]

Returns the list of tables in a given dataset.

Parameters:

dataset (BQDataset) –

get_view(table)[source]

Returns the SQL query for a view, or None if it doesn’t exist or is not a view.

Parameters:

table (BQTable) – The table containing the view.

update_view(table, view)[source]

Updates the SQL query for a view.

If the output table exists, it is replaced with the supplied view query. Otherwise a new table is created with this view.

Parameters:
  • table (BQTable) – The table to contain the view.

  • view (str) – The SQL query for the view.

run_job(project_id, body, dataset=None)[source]

Runs a BigQuery “job”. See the documentation for the format of body.

Note

You probably don’t need to use this directly. Use the tasks defined below.

Parameters:

dataset (BQDataset) –

Returns:

the job id of the job.

Return type:

str

Raises:

luigi.contrib.BigQueryExecutionError – if the job fails.

copy(source_table, dest_table, create_disposition='CREATE_IF_NEEDED', write_disposition='WRITE_TRUNCATE')[source]

Copies (or appends) a table to another table.

Parameters:
class luigi.contrib.bigquery.BigQueryTarget(project_id, dataset_id, table_id, client=None, location=None)[source]

Bases: Target

classmethod from_bqtable(table, client=None)[source]

A constructor that takes a BQTable.

Parameters:

table (BQTable) –

exists()[source]

Returns True if the Target exists and False otherwise.

class luigi.contrib.bigquery.MixinBigQueryBulkComplete[source]

Bases: object

Allows to efficiently check if a range of BigQueryTargets are complete. This enables scheduling tasks with luigi range tools.

If you implement a custom Luigi task with a BigQueryTarget output, make sure to also inherit from this mixin to enable range support.

classmethod bulk_complete(parameter_tuples)[source]
class luigi.contrib.bigquery.BigQueryLoadTask(*args, **kwargs)[source]

Bases: MixinBigQueryBulkComplete, Task

Load data into BigQuery from GCS.

property source_format

The source format to use (see SourceFormat).

property encoding

The encoding of the data that is going to be loaded (see Encoding).

property write_disposition

What to do if the table already exists. By default this will fail the job.

See WriteDisposition

property schema

Schema in the format defined at https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.schema.

If the value is falsy, it is omitted and inferred by BigQuery.

property max_bad_records

The maximum number of bad records that BigQuery can ignore when reading data.

If the number of bad records exceeds this value, an invalid error is returned in the job result.

property field_delimiter

The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character.

source_uris()[source]

The fully-qualified URIs that point to your data in Google Cloud Storage.

Each URI can contain one ‘*’ wildcard character and it must come after the ‘bucket’ name.

property skip_leading_rows

The number of rows at the top of a CSV file that BigQuery will skip when loading the data.

The default value is 0. This property is useful if you have header rows in the file that should be skipped.

property allow_jagged_rows

Accept rows that are missing trailing optional columns. The missing values are treated as nulls.

If false, records with missing trailing columns are treated as bad records, and if there are too many bad records,

an invalid error is returned in the job result. The default value is false. Only applicable to CSV, ignored for other formats.

property ignore_unknown_values

Indicates if BigQuery should allow extra values that are not represented in the table schema.

If true, the extra values are ignored. If false, records with extra columns are treated as bad records,

and if there are too many bad records, an invalid error is returned in the job result. The default value is false.

The sourceFormat property determines what BigQuery treats as an extra value:

CSV: Trailing columns JSON: Named values that don’t match any column names

property allow_quoted_new_lines

Indicates if BigQuery should allow quoted data sections that contain newline characters in a CSV file. The default value is false.

configure_job(configuration)[source]

Set additional job configuration.

This allows to specify job configuration parameters that are not exposed via Task properties.

Parameters:

configuration – Current configuration.

Returns:

New or updated configuration.

run()[source]

The task run method, to be overridden in a subclass.

See Task.run

class luigi.contrib.bigquery.BigQueryRunQueryTask(*args, **kwargs)[source]

Bases: MixinBigQueryBulkComplete, Task

property write_disposition

What to do if the table already exists. By default this will fail the job.

See WriteDisposition

property create_disposition

Whether to create the table or not. See CreateDisposition

property flatten_results

Flattens all nested and repeated fields in the query results. allowLargeResults must be true if this is set to False.

property query

The query, in text form.

property query_mode

The query mode. See QueryMode.

property udf_resource_uris

Iterator of code resource to load from a Google Cloud Storage URI (gs://bucket/path).

property use_legacy_sql

Whether to use legacy SQL

configure_job(configuration)[source]

Set additional job configuration.

This allows to specify job configuration parameters that are not exposed via Task properties.

Parameters:

configuration – Current configuration.

Returns:

New or updated configuration.

run()[source]

The task run method, to be overridden in a subclass.

See Task.run

class luigi.contrib.bigquery.BigQueryCreateViewTask(*args, **kwargs)[source]

Bases: Task

Creates (or updates) a view in BigQuery.

The output of this task needs to be a BigQueryTarget. Instances of this class should specify the view SQL in the view property.

If a view already exist in BigQuery at output(), it will be updated.

property view

The SQL query for the view, in text form.

complete()[source]

If the task has any outputs, return True if all outputs exist. Otherwise, return False.

However, you may freely override this method with custom logic.

run()[source]

The task run method, to be overridden in a subclass.

See Task.run

class luigi.contrib.bigquery.ExternalBigQueryTask(*args, **kwargs)[source]

Bases: MixinBigQueryBulkComplete, ExternalTask

An external task for a BigQuery target.

class luigi.contrib.bigquery.BigQueryExtractTask(*args, **kwargs)[source]

Bases: Task

Extracts (unloads) a table from BigQuery to GCS.

This tasks requires the input to be exactly one BigQueryTarget while the output should be one or more GCSTargets from luigi.contrib.gcs depending on the use of destinationUris property.

property destination_uris

The fully-qualified URIs that point to your data in Google Cloud Storage. Each URI can contain one ‘*’ wildcard character and it must come after the ‘bucket’ name.

Wildcarded destinationUris in GCSQueryTarget might not be resolved correctly and result in incomplete data. If a GCSQueryTarget is used to pass wildcarded destinationUris be sure to overwrite this property to suppress the warning.

property print_header

Whether to print the header or not.

property field_delimiter

The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character.

property destination_format

The destination format to use (see DestinationFormat).

property compression

Whether to use compression.

configure_job(configuration)[source]

Set additional job configuration.

This allows to specify job configuration parameters that are not exposed via Task properties.

Parameters:

configuration – Current configuration.

Returns:

New or updated configuration.

run()[source]

The task run method, to be overridden in a subclass.

See Task.run

luigi.contrib.bigquery.BigqueryClient

alias of BigQueryClient

luigi.contrib.bigquery.BigqueryTarget

alias of BigQueryTarget

luigi.contrib.bigquery.MixinBigqueryBulkComplete

alias of MixinBigQueryBulkComplete

luigi.contrib.bigquery.BigqueryLoadTask

alias of BigQueryLoadTask

luigi.contrib.bigquery.BigqueryRunQueryTask

alias of BigQueryRunQueryTask

luigi.contrib.bigquery.BigqueryCreateViewTask

alias of BigQueryCreateViewTask

luigi.contrib.bigquery.ExternalBigqueryTask

alias of ExternalBigQueryTask

exception luigi.contrib.bigquery.BigQueryExecutionError(job_id, error_message)[source]

Bases: Exception

Parameters:
  • job_id (str) – BigQuery Job ID

  • error_message (str) – status[‘status’][‘errorResult’] for the failed job