luigi bindings for Google Cloud Storage
GCSClient(oauth_credentials=None, descriptor='', http_=None, chunksize=10485760, **discovery_build_kwargs)
An implementation of a FileSystem over Google Cloud Storage.
There are several ways to use this class. By default it uses the application default credentials, as described at https://developers.google.com/identity/protocols/application-default-credentials . Alternatively, you may pass a google-auth credentials object, e.g. to use a service account:
    credentials = google.auth.jwt.Credentials.from_service_account_info(
        '012345678912-ThisIsARandomServiceAccountEmail@developer.gserviceaccount.com',
        'These are the contents of the p12 file that came with the service account',
        scope='https://www.googleapis.com/auth/devstorage.read_write')
    client = GCSClient(oauth_credentials=credentials)
The chunksize parameter specifies how much data to transfer per chunk when downloading or uploading files (default 10485760, i.e. 10 MiB).
By default this class uses “automated service discovery”, which requires a connection to the web. The Google API client downloads a JSON file to “create” the library interface on the fly. If you want a more hermetic build, you can pass the contents of this file (currently found at https://www.googleapis.com/discovery/v1/apis/storage/v1/rest ) as the descriptor argument.
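The hermetic setup described above could look roughly like the following. This is a minimal sketch under stated assumptions: the helper names load_descriptor and make_hermetic_client are hypothetical, the import path luigi.contrib.gcs is assumed, and the discovery document is assumed to have been saved to a local file beforehand.

```python
import json


def load_descriptor(descriptor_file):
    """Read a locally saved discovery document and return its text.

    Hypothetical helper: it only checks that the file parses as JSON.
    """
    with open(descriptor_file, encoding="utf-8") as fh:
        text = fh.read()
    json.loads(text)  # raises ValueError if the file is not valid JSON
    return text


def make_hermetic_client(descriptor_file, chunksize=10 * 1024 * 1024):
    """Build a GCSClient from a saved descriptor instead of service discovery.

    Assumption: GCSClient is importable from luigi.contrib.gcs. The import is
    lazy so that load_descriptor can be used without luigi installed.
    """
    from luigi.contrib.gcs import GCSClient

    return GCSClient(descriptor=load_descriptor(descriptor_file),
                     chunksize=chunksize)
```

Credentials are still resolved as described earlier; only the discovery step is replaced by the local file.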
Returns True if a file or directory exists at path, False otherwise.
Parameters: path (str) – a path within the FileSystem to check for existence.
Returns True if the location at path is a directory. If not, returns False.
Parameters: path (str) – a path within the FileSystem to check as a directory.
Note: This method is optional; not all FileSystem subclasses implement it.
Remove file or directory at location path.
Parameters:
- path (str) – a path within the FileSystem to remove.
- recursive (bool) – if the path is a directory, recursively remove the directory and all of its descendants. Defaults to True.
put(filename, dest_path, mimetype=None, chunksize=None)
put_multiple(filepaths, remote_directory, mimetype=None, chunksize=None, num_process=1)
put_string(contents, dest_path, mimetype=None)
mkdir(path, parents=True, raise_if_exists=False)
Create directory at location path.
Creates the directory at path and implicitly creates parent directories if they do not already exist.
Parameters:
- path (str) – a path within the FileSystem to create as a directory.
- parents (bool) – create parent directories when necessary. When parents=False and the parent directory doesn’t exist, raise luigi.target.MissingParentDirectory.
- raise_if_exists (bool) – raise luigi.target.FileAlreadyExists if the folder already exists.
Copy a file or a directory with contents. Currently, LocalFileSystem and MockFileSystem support only single file copying but S3Client copies either a file or a directory as required.
Rename/move an object from one GCS location to another.
Get an iterable with GCS folder contents. Iterable contains paths relative to queried path.
Yields full object URIs matching the given wildcard.
Currently only the ‘*’ wildcard after the last path delimiter is supported.
(If we need “full” wildcard functionality we should bring in gsutil dependency with its https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/wildcard_iterator.py…)
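The wildcard restriction described above can be illustrated with a small, self-contained model. This is a sketch, not luigi's implementation: the function name match_wildcard and the in-memory list of object URIs are hypothetical stand-ins for a real bucket listing, and the bucket and object names are made up.

```python
def match_wildcard(wildcard_path, object_uris):
    """Yield the URIs from object_uris that match wildcard_path.

    Only a single '*' after the last '/' is accepted, mirroring the
    restriction stated in the text. In this simplified model the '*'
    matches any run of characters, including '/'.
    """
    prefix_dir, pattern = wildcard_path.rsplit("/", 1)
    if "*" in prefix_dir:
        raise ValueError("'*' is only supported after the last '/'")
    if pattern.count("*") != 1:
        raise ValueError("exactly one '*' is expected")

    head, tail = pattern.split("*")
    start = prefix_dir + "/" + head
    for uri in object_uris:
        # The length check stops head and tail from overlapping in short names.
        if (uri.startswith(start) and uri.endswith(tail)
                and len(uri) >= len(start) + len(tail)):
            yield uri
```

For example, the pattern gs://bucket/logs/*.csv would select gs://bucket/logs/a.csv but not gs://bucket/other/c.csv, while gs://bucket/*/x.csv would be rejected because the wildcard precedes the last delimiter.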
download(path, chunksize=None, chunk_callback=<function <lambda>>)
Downloads the object contents to local file system.
Optionally stops after the first chunk for which chunk_callback returns True.
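The early-stop behaviour of chunk_callback can be sketched without touching GCS. The function copy_in_chunks below is hypothetical and works on in-memory streams; in particular, passing the destination file object to the callback is an assumption of this sketch, not something the text above specifies.

```python
import io


def copy_in_chunks(source, dest, chunksize=4, chunk_callback=lambda _: False):
    """Copy source to dest chunk by chunk.

    Stops after the first chunk for which chunk_callback returns True.
    Assumption: the callback receives the destination file object.
    """
    while True:
        chunk = source.read(chunksize)
        if not chunk:
            break  # source exhausted
        dest.write(chunk)
        if chunk_callback(dest):
            break  # caller asked to stop early
    return dest


# Usage: stop as soon as at least 4 bytes have been written.
partial = copy_in_chunks(io.BytesIO(b"abcdefghij"), io.BytesIO(),
                         chunk_callback=lambda fp: fp.tell() >= 4)
```

With the default callback, which always returns False, the whole stream is copied. A callback like the one in the usage lines is handy when only a header or the first few records are needed.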
A GCS file that writes to a temp file and is put to GCS on close.
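The write-locally-then-upload pattern can be modelled in a few lines. The class AtomicUploadFile and its upload argument are hypothetical names for this sketch; upload is any callable taking (local_filename, dest_path), which is the same argument order as the put method listed earlier.

```python
import os
import tempfile


class AtomicUploadFile:
    """Buffer writes in a local temp file and upload once on close()."""

    def __init__(self, dest_path, upload):
        self.dest_path = dest_path
        self._upload = upload  # hypothetical callable: (local_filename, dest_path)
        fd, self.tmp_path = tempfile.mkstemp()
        self._file = os.fdopen(fd, "w")

    def write(self, data):
        return self._file.write(data)

    def close(self):
        self._file.close()
        try:
            # Nothing appears at dest_path until this single upload happens.
            self._upload(self.tmp_path, self.dest_path)
        finally:
            os.remove(self.tmp_path)
```

Because the destination is only touched in close(), a task that fails midway leaves no partial object behind in this model.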
GCSTarget(path, format=None, client=None)
Open the FileSystem target.
This method returns a file-like object which can either be read from or written to depending on the specified mode.
Parameters: mode (str) – the mode r opens the FileSystemTarget in read-only mode, whereas w will open the FileSystemTarget in write mode. Subclasses can implement additional options. Using b is not supported; initialize with format=Nop instead.
GCSFlagTarget(path, format=None, client=None, flag='_SUCCESS')
Defines a target directory with a flag-file (defaults to _SUCCESS) used to signify job success.
This checks for two things:
- the path exists (just like the GCSTarget)
- the _SUCCESS file exists within the directory.
Because Hadoop outputs into a directory and not a single file, the path is assumed to be a directory.
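The two-part check listed above can be expressed as a small model over a set of object names. This is only a sketch of the documented behaviour, not luigi's code: the function flag_target_exists and the in-memory set are hypothetical, and appending a trailing slash to the path is an assumption made here for convenience.

```python
def flag_target_exists(path, object_names, flag="_SUCCESS"):
    """Return True if path holds objects and contains the flag file.

    object_names is a hypothetical in-memory stand-in for a bucket listing.
    """
    if not path.endswith("/"):
        path += "/"  # assumption: treat the path as a directory prefix
    # 1. The path exists, i.e. at least one object lives under it.
    path_exists = any(name.startswith(path) for name in object_names)
    # 2. The flag file exists within the directory.
    flag_exists = (path + flag) in object_names
    return path_exists and flag_exists
```

A directory that contains part files but no flag file is therefore reported as incomplete, which is what lets downstream tasks wait for the whole job rather than for the first output file.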
This is meant to be a handy alternative to AtomicGCSFile.
The AtomicFile approach can be burdensome for GCS since there are no directories, per se.
If we have 1,000,000 output files, then we have to rename 1,000,000 objects.
Initializes a GCSFlagTarget.
Parameters:
- path (str) – the directory where the files are stored.
- client –
- flag (str) –