luigi.contrib.hdfs.config

You can choose which HDFS client to use by setting the “client” option in the “hdfs” section of the configuration, or by passing the --hdfs-client command line option. “hadoopcli” is the slowest client, but it should work out of the box.
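As a rough illustration of the setting described above, the sketch below parses a configuration fragment with Python's standard configparser module rather than with Luigi's own configuration loader. The file contents and the fallback value passed to get() are assumptions made for this example.

```python
import configparser

# Hypothetical contents of a Luigi configuration file; only the
# [hdfs] section and its "client" key come from the text above.
CONFIG_TEXT = """
[hdfs]
client = hadoopcli
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)

# Read the configured client name. The fallback is an assumption for
# this sketch, not necessarily the default that Luigi itself applies.
client = parser.get("hdfs", "client", fallback="hadoopcli")
print(client)
```

The same choice can be made for a single run by passing --hdfs-client hadoopcli on the command line.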

Functions

get_configured_hadoop_version()

CDH4 (hadoop 2+) has a slightly different syntax for interacting with hdfs via the command line.

get_configured_hdfs_client()

This is a helper that fetches the configuration value for 'client' in the [hdfs] section.

load_hadoop_cmd()

tmppath([path, include_unix_username])

path (str) is the target path for which a temporary location is generated; include_unix_username is a bool. Returns a str.

Classes

hadoopcli(*args, **kwargs)

hdfs(*args, **kwargs)

class luigi.contrib.hdfs.config.hdfs(*args, **kwargs)

client_version

Parameter whose value is an int.

namenode_host

Class to parse optional parameters.

namenode_port

Parameter whose value is an int.

client

Parameter whose value is a str, and a base class for other parameter types.

Parameters are objects set on the Task class level to make it possible to parameterize tasks. For instance:

import luigi

class MyTask(luigi.Task):
    foo = luigi.Parameter()

class RequiringTask(luigi.Task):
    def requires(self):
        return MyTask(foo="hello")

    def run(self):
        print(self.requires().foo)  # prints "hello"

This makes it possible to instantiate multiple tasks, e.g. MyTask(foo='bar') and MyTask(foo='baz'). Each task will then have its foo attribute set accordingly.

When a task is instantiated, it first uses any argument passed to it as the value of the parameter, e.g. if you instantiate a = TaskA(x=44) then a.x == 44. When no value is provided, the value is resolved in this order of falling priority:

  • Any value provided on the command line:

    • To the root task (e.g. --param xyz)

    • Then to the class, using the qualified task name syntax (e.g. --TaskA-param xyz).

  • Any value set in the configuration, using the [TASK_NAME]>PARAM_NAME: <serialized value> syntax. See Parameters from config Ingestion.

  • Any default value set using the default flag.

Parameter objects may be reused, but you must then set the positional=False flag.
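The resolution order above can be sketched as a small lookup that returns the first value supplied. This is only an illustration of the priority rules, not Luigi's implementation; the function name, its arguments, and the sentinel object are all hypothetical.

```python
# Sentinel meaning "no value supplied at this level" (hypothetical).
_MISSING = object()


def resolve_parameter(explicit=_MISSING, cli_root=_MISSING,
                      cli_qualified=_MISSING, config=_MISSING,
                      default=_MISSING):
    """Return (source, value) for the first supplied value,
    checked in falling priority."""
    candidates = [
        ("constructor argument", explicit),
        ("command line, root task", cli_root),
        ("command line, qualified task name", cli_qualified),
        ("configuration", config),
        ("default flag", default),
    ]
    for source, value in candidates:
        if value is not _MISSING:
            return source, value
    raise ValueError("no value available for parameter")


# A configured value wins over the default flag.
print(resolve_parameter(config="from-config", default="fallback"))
# An explicit constructor argument wins over everything else.
print(resolve_parameter(explicit=44, default=0))
```

A sentinel is used instead of None so that None itself can still be passed as a legitimate value at any level.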

tmp_dir

Class to parse optional parameters.

class luigi.contrib.hdfs.config.hadoopcli(*args, **kwargs)

command

Parameter whose value is a str, and a base class for other parameter types.

version

Parameter whose value is a str, and a base class for other parameter types.

luigi.contrib.hdfs.config.load_hadoop_cmd()

luigi.contrib.hdfs.config.get_configured_hadoop_version()

CDH4 (hadoop 2+) has a slightly different syntax for interacting with hdfs via the command line.

The default version is CDH4, but one can override this setting with “cdh3” or “apache1” in the hadoop section of the config in order to use the old syntax.
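A minimal sketch of the fallback described above, written against the standard configparser module instead of Luigi's configuration loader. The helper name is hypothetical. The key name "version" is inferred from the hadoopcli.version parameter, and lower-casing the value is an assumption.

```python
import configparser


def configured_hadoop_version(config_text):
    """Hypothetical helper: read the Hadoop version from a config
    fragment, falling back to "cdh4" when it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # The "version" key name and the lower-casing are assumptions
    # made for this sketch.
    return parser.get("hadoop", "version", fallback="cdh4").lower()


# No [hadoop] section: the default applies.
print(configured_hadoop_version(""))
# Explicit override selecting the old command line syntax.
print(configured_hadoop_version("[hadoop]\nversion = cdh3\n"))
```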

luigi.contrib.hdfs.config.get_configured_hdfs_client()

This is a helper that fetches the configuration value for ‘client’ in the [hdfs] section. It will return the client that retains backwards compatibility when ‘client’ isn’t configured.

luigi.contrib.hdfs.config.tmppath(path=None, include_unix_username=True)

Parameters:

  • path (str): target path for which a temporary location should be generated.

  • include_unix_username (bool)

Return type: str

Note that include_unix_username might work on Windows too.