luigi.contrib.gcs module

luigi bindings for Google Cloud Storage

luigi.contrib.gcs.is_error_5xx(err)[source]

exception luigi.contrib.gcs.InvalidDeleteException[source]

Bases: FileSystemException

class luigi.contrib.gcs.GCSClient(oauth_credentials=None, descriptor='', http_=None, chunksize=10485760, **discovery_build_kwargs)[source]

Bases: FileSystem

An implementation of a FileSystem over Google Cloud Storage.

There are several ways to use this class. By default it uses the application default credentials, as described at https://developers.google.com/identity/protocols/application-default-credentials . Alternatively, you may pass a google-auth credentials object. For example, to use a service account:

 credentials = google.auth.jwt.Credentials.from_service_account_info(
     '012345678912-ThisIsARandomServiceAccountEmail@developer.gserviceaccount.com',
     'These are the contents of the p12 file that came with the service account',
     scope='https://www.googleapis.com/auth/devstorage.read_write')
 client = GCSClient(oauth_credentials=credentials)

The chunksize parameter specifies how much data, in bytes, to transfer at a time
when downloading or uploading files. The default of 10485760 is 10 MiB.

Warning

By default this class uses “automated service discovery”, which requires a connection to the web. The Google API client downloads a JSON file to “create” the library interface on the fly. If you want a more hermetic build, you can pass the contents of this file (currently found at https://www.googleapis.com/discovery/v1/apis/storage/v1/rest ) as the descriptor argument.
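As a rough sketch of the hermetic setup described in this warning, the discovery document can be saved once and read from disk at startup. The helper names load_descriptor and make_hermetic_client and the idea of keeping the file next to the code are assumptions made for illustration; only the descriptor argument comes from the signature above.

```python
def load_descriptor(path):
    """Return the saved discovery document as text."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def make_hermetic_client(descriptor_path):
    """Build a GCSClient from a local discovery document (hypothetical helper)."""
    # Imported lazily so the file-handling part works without luigi installed.
    from luigi.contrib.gcs import GCSClient

    # Passing the document contents avoids the discovery download at startup.
    return GCSClient(descriptor=load_descriptor(descriptor_path))
```

Credentials are still resolved as described above; this sketch only removes the discovery download.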

exists(path)[source]

Return True if the file or directory at path exists, False otherwise.

Parameters:

path (str) – a path within the FileSystem to check for existence.

isdir(path)[source]

Return True if the location at path is a directory. If not, return False.

Parameters:

path (str) – a path within the FileSystem to check as a directory.

Note: This method is optional; not all FileSystem subclasses implement it.

remove(path, recursive=True)[source]

Remove the file or directory at location path.

Parameters:
  • path (str) – a path within the FileSystem to remove.

  • recursive (bool) – if the path is a directory, recursively remove the directory and all of its descendants. Defaults to True.

put(filename, dest_path, mimetype=None, chunksize=None)[source]

put_multiple(filepaths, remote_directory, mimetype=None, chunksize=None, num_process=1)[source]

put_string(contents, dest_path, mimetype=None)[source]

mkdir(path, parents=True, raise_if_exists=False)[source]

Create a directory at location path.

Creates the directory at path and implicitly creates parent directories if they do not already exist.

Parameters:
  • path (str) – a path within the FileSystem to create as a directory.

  • parents (bool) – Create parent directories when necessary. When parents=False and the parent directory doesn’t exist, raise luigi.target.MissingParentDirectory

  • raise_if_exists (bool) – raise luigi.target.FileAlreadyExists if the folder already exists.

copy(source_path, destination_path)[source]

Copy a file or a directory with contents. Currently, LocalFileSystem and MockFileSystem support only single file copying but S3Client copies either a file or a directory as required.

rename(*args, **kwargs)[source]

Alias for move()

move(source_path, destination_path)[source]

Rename/move an object from one GCS location to another.

listdir(path)[source]

Get an iterable with the contents of a GCS folder. The iterable contains paths relative to the queried path.

list_wildcard(wildcard_path)[source]

Yields full object URIs matching the given wildcard.

Currently only the ‘*’ wildcard after the last path delimiter is supported.

(If we need “full” wildcard functionality we should bring in gsutil dependency with its https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/wildcard_iterator.py…)
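To make the restriction concrete, here is a minimal sketch of matching a single ‘*’ in the last path component against a list of object URIs. It is an illustration only, not the library's implementation; the function name match_single_wildcard and the example bucket are hypothetical.

```python
def match_single_wildcard(wildcard_path, candidates):
    """Yield candidates matching a path whose last component contains one '*'."""
    prefix_dir, pattern = wildcard_path.rsplit("/", 1)
    if "*" in prefix_dir:
        raise ValueError("'*' is only supported after the last '/'")
    parts = pattern.split("*")
    if len(parts) != 2:
        raise ValueError("exactly one '*' is supported")
    head = prefix_dir + "/" + parts[0]
    tail = parts[1]
    for uri in candidates:
        # The length check keeps head and tail from overlapping in short names.
        if uri.startswith(head) and uri.endswith(tail) and len(uri) >= len(head) + len(tail):
            yield uri


uris = ["gs://b/logs/a.csv", "gs://b/logs/b.txt", "gs://b/logs/sub/c.csv"]
matches = list(match_single_wildcard("gs://b/logs/*.csv", uris))
```

In this sketch the ‘*’ is a plain prefix and suffix test, so it also matches names containing further ‘/’ characters, such as gs://b/logs/sub/c.csv.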

download(path, chunksize=None, chunk_callback=<function GCSClient.<lambda>>)[source]

Downloads the object contents to local file system.

Optionally stops after the first chunk for which chunk_callback returns True.
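The early-stop behaviour can be pictured with a small stand-alone loop. This is a sketch under assumptions: copy_in_chunks is a hypothetical helper that works on in-memory streams, and passing the partially written file object to the callback is assumed here rather than taken from the text above.

```python
import io


def copy_in_chunks(source, sink, chunksize, chunk_callback=lambda fp: False):
    """Copy source to sink chunk by chunk; stop early if the callback returns True."""
    while True:
        chunk = source.read(chunksize)
        if not chunk:
            break  # source exhausted
        sink.write(chunk)
        # The callback inspects what has been written so far.
        if chunk_callback(sink):
            break
    return sink


# Stop as soon as the first line (for example, a header) has arrived.
source = io.BytesIO(b"header\nrow1\nrow2\nrow3\n")
sink = copy_in_chunks(source, io.BytesIO(), chunksize=8,
                      chunk_callback=lambda fp: b"\n" in fp.getvalue())
```

A callback like this is useful when only the start of a large object is needed, since the remaining chunks are never fetched.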

class luigi.contrib.gcs.AtomicGCSFile(path, gcs_client)[source]

Bases: AtomicLocalFile

A GCS file that writes to a temporary file and uploads it to GCS on close.

move_to_final_destination()[source]

class luigi.contrib.gcs.GCSTarget(path, format=None, client=None)[source]

Bases: FileSystemTarget

Initializes a FileSystemTarget instance.

Parameters:

path – the path associated with this FileSystemTarget.

fs = None

open(mode='r')[source]

Open the FileSystem target.

This method returns a file-like object which can either be read from or written to depending on the specified mode.

Parameters:

mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas w will open the FileSystemTarget in write mode. Subclasses can implement additional options. Using b is not supported; initialize with format=Nop instead.
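The two modes can be exercised with helpers that accept any target exposing open(). This is a minimal sketch: write_lines and read_lines are hypothetical names, and the code assumes the returned file-like objects can be used as context managers.

```python
def write_lines(target, lines):
    """Write text lines through target.open('w'); the file is finalized on close."""
    with target.open("w") as handle:
        for line in lines:
            handle.write(line + "\n")


def read_lines(target):
    """Read text lines back through target.open('r')."""
    with target.open("r") as handle:
        return [line.rstrip("\n") for line in handle]
```

With luigi installed, target would typically be a GCSTarget built from a gs:// path of your own; for binary data, construct the target with format=Nop as noted above rather than adding b to the mode.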

class luigi.contrib.gcs.GCSFlagTarget(path, format=None, client=None, flag='_SUCCESS')[source]

Bases: GCSTarget

Defines a target directory with a flag-file (defaults to _SUCCESS) used to signify job success.

This checks for two things:

  • the path exists (just like the GCSTarget)

  • the _SUCCESS file exists within the directory.

Because Hadoop outputs into a directory and not a single file, the path is assumed to be a directory.

This is meant to be a handy alternative to AtomicGCSFile.

The AtomicFile approach can be burdensome for GCS since there are no directories, per se.

If we have 1,000,000 output files, then we have to rename 1,000,000 objects.

Initializes a GCSFlagTarget.

Parameters:
  • path (str) – the directory where the files are stored.

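  • client – the GCS client used to access the path; defaults to None.

  • flag (str) – the name of the flag file; defaults to _SUCCESS.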

fs = None

exists()[source]

Returns True if the path for this FileSystemTarget exists; False otherwise.

This method is implemented by using fs.
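The flag-file idea behind GCSFlagTarget can be sketched with stand-in classes. FlagCheck and FakeFS are hypothetical and are not luigi's classes; appending a trailing ‘/’ to the path is a choice made for this sketch, not documented behaviour.

```python
class FakeFS:
    """In-memory stand-in for a file system client (hypothetical)."""

    def __init__(self, objects):
        self.objects = set(objects)

    def exists(self, path):
        return path in self.objects


class FlagCheck:
    """Stand-in that mimics a flag-file existence check (not luigi's class)."""

    def __init__(self, path, fs, flag="_SUCCESS"):
        # Treat the path as a directory prefix, so make sure it ends with '/'.
        self.path = path if path.endswith("/") else path + "/"
        self.fs = fs
        self.flag = flag

    def exists(self):
        # The output counts as complete only when the flag object is present.
        return self.fs.exists(self.path + self.flag)


fs = FakeFS({"gs://b/out/part-0000", "gs://b/out/_SUCCESS"})
done = FlagCheck("gs://b/out", fs).exists()
pending = FlagCheck("gs://b/tmp/", fs).exists()
```

Because completion is signalled by one small object, a job with many output files needs a single extra write instead of a rename per object.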