
added chicken param to remove methods, consistent apis #628

Closed
wants to merge 169 commits into from

Conversation

@coderfi coderfi commented Jan 22, 2015

I added the chicken parameter to the HdfsClient remove methods.
By default, this is True, which tells the remove method to raise an exception if the requested remove path is dangerous.
Such paths include the root and single-level paths directly under the root, e.g. '/' or '/etc'.
Hopefully this prevents embarrassing 'hadoop fs -rmr /' attempts. :)
In addition, the HdfsClientCdh3 mkdir and remove methods were API-incompatible with the base class's methods, so I went ahead and fixed those.
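
(For readers skimming the thread, here is a minimal sketch of the idea being proposed; the helper and method names below are illustrative and may not match the PR's actual diff.)

# Sketch only -- not the exact code in this PR.
def is_dangerous_rm_path(path):
    """Return True if removing path looks catastrophic, e.g. '/' or '/etc'."""
    if not path:
        return True                      # None or empty string: refuse outright
    path = path.strip()
    if '//' in path:
        raise ValueError('Funny path, double slash: %s' % path)
    path = path.rstrip('/')
    if not path:
        return True                      # the path was just '/'
    # single-level absolute paths such as '/etc' have exactly one slash
    return path.startswith('/') and path.count('/') == 1

class HdfsClient(object):                # simplified stand-in for the real client
    def remove(self, path, recursive=True, chicken=True):
        if chicken and is_dangerous_rm_path(path):
            raise ValueError('Refusing to remove dangerous path: %r' % path)
        # ... hand the vetted path on to the underlying hadoop fs -rmr call ...

With chicken=True (the default), remove('/') or remove('/etc') would raise instead of shelling out; callers who really mean it could pass chicken=False.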

@coderfi changed the title from "added chicken param to remove methods, consistent remove api" to "added chicken param to remove methods, consistent apis" on Jan 22, 2015
    return True

if '//' in path:
    raise ValueError('Funny path, double slash: %s' % path)
Contributor

Why do you care about this at all?

Author

You mean the '//' check? It's somewhat of a kludge to simplify the following code.
Also, since this path string might ultimately be passed to a hadoop fs -rmr process, I would definitely be concerned about the behavior of:
hadoop fs -rmr //

I'm not sure whether the hadoop CLI would try to delete the root directory or exit 1.
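
(As a small aside: even Python's own POSIX path normalization treats a bare leading double slash specially, which is one more reason to be suspicious of such inputs. This only illustrates posixpath behaviour and says nothing about what the hadoop CLI itself does.)

import posixpath

# Interior double slashes collapse, but a leading '//' is preserved,
# since POSIX allows exactly two initial slashes to carry special meaning.
print(posixpath.normpath('/etc//foo'))   # -> /etc/foo
print(posixpath.normpath('//'))          # -> //
print(posixpath.normpath('///'))         # -> /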

Contributor

I don't think this code is necessary, but it can stay if you're worried about cases like //. The code is at least easy to digest here.

@Tarrasch
Contributor

You should amend your commits. Other than that, this looks good, I think.

    False
    """
    if not path:
        return True
Contributor

What happens if we don't have this? I mean, can't we assume that the user passes in a string? Can something bad happen if we don't guard against this?

Author

The method itself doesn't know that.

The original impetus for this whole check is to help prevent developers writing Luigi tasks from shooting themselves in the foot and giving some sysadmin a very long night restoring from backups (if they exist!).

I've seen (and also caught myself doing this) that, through some obscure luigi command line mechanism or programmatic call, the path ultimately passed in to the remove method can be basically anything -- and when such a luigi task is invoked by a cron job or some other automatic process, that "anything" could be doubly anything. :)

The path could actually be None or a blank string, depending on the calling method and how that method got its parameters (through a luigi.Parameter via some shell command line argument, whether said parameter had default=None, etc.).

In the case something bad happens... presumably the caller (the remove method) asks whether path=None or path='' is dangerous.

If it is not dangerous, the invoked CLI process might look like:
1) hadoop fs -rmr None (if the caller did naive string concatenation)
or
2) hadoop fs -rmr

For #1, the best case is we get a file-not-found error from the hadoop CLI. The worst case is we delete someone's file called 'None'. :(
For #2, the best case is the hadoop CLI complains that a required command line argument is missing. The worst case is some hadoop distribution (or maybe someone's hadoop wrapper) assumes something else and inadvertently deletes something.
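
(A tiny illustration of case 1, assuming the caller builds the command line with naive string formatting:)

path = None
cmd = 'hadoop fs -rmr %s' % path
print(cmd)   # -> hadoop fs -rmr None  -- the literal string 'None' becomes the target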

Contributor

Yeah, your worst case for 2) is completely unlikely. You mean the default would be to delete /? And your worst case for 1) is not that bad. Besides, case 1) will not happen; what will happen is that you get an exception thrown from path.strip().

I don't like this line as it's logic that gives no value, and I get confused reading it since I expect lines of code to be useful. :)
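
(Concretely, the exception referred to here:)

path = None
path.strip()   # AttributeError: 'NoneType' object has no attribute 'strip'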

@coderfi
Author

coderfi commented Jan 23, 2015

Responded to the various comments from @Tarrasch, thanks for the feedback!
One thing I did not think to check is whether the path is a full HDFS path, e.g. hdfs:///, hdfs://, or hdfs://master/.
All three cases could potentially trigger a disastrous consequence.

I think this is worth looking into, but maybe add checks for that sort of thing in a separate commit?

@Tarrasch
Contributor

Yeah, hold off on the hdfs:/// stuff please. Meanwhile, read through my comments and let me know what you think. :)

@coderfi
Author

coderfi commented Jan 29, 2015

Code updated per suggestions, hopefully it is more to your liking. :)
I noticed conflicts since my branch was too far behind your master, so I merged those in as well.

@Tarrasch
Contributor

The commits are messed up now. The "updates to is_dangerous_rm_path(): " commit is shown twice. Can you properly rebase your commits and force push? :)

@coderfi
Author

coderfi commented Jan 30, 2015

Ugh, you might have to help me on that.
I'm an old-school svn guy, and it was miraculous for me to even get the branches working -_-

How's the code aside from the source control mess?

@erikbern
Contributor

try git rebase master

Then you can also try git rebase -i HEAD~9 to squash the last commits (it will let you cherry pick which ones)

daveFNbuck and others added 17 commits February 4, 2015 23:45
I woke up this morning to find that very few jobs had run overnight because
something weird happened with our hadoop cluster that caused all of our jobs to
hang permanently. When I checked in the morning, all workers were busy running
jobs that had been going for about 10 hours, and thousands of pending jobs that
would normally have been run overnight had piled up.

After killing and restarting all of my workers, everything started running fine
again because whatever weird issue had caused everything to hang was over. In
order to automate this fix, I've added a worker_timeout property to tasks that
tells the worker how long to attempt to run the task before killing it.

This property defaults to 0, meaning do not time tasks out. It can be changed
individually for a task by override or globally for all non-overridden tasks
via client.cfg. This only applies when running with more than 1 worker process,
as we need to run the task in a separate process in order to be able to kill it
after the timeout.
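
(A hedged sketch of how a per-task override might look; the attribute name is taken from the commit message above, and the value is only an example.)

import luigi

class SlowTask(luigi.Task):
    # Ask the worker to kill this task if it runs longer than an hour.
    # 0 (the default) means never time out.
    worker_timeout = 3600

    def run(self):
        pass  # long-running work goes here

Per the commit message, the same default can also be set globally in client.cfg for all tasks that do not override it.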
as opposed to expecting lists always
also cleared up the docs a bit
I've noticed that my visualiser has become unusable recently when it has too
many done tasks scheduled. The tab freezes up for a long time while processing
all the tasks, and then is very sluggish afterward. When I have this many done
tasks, it's too much to really browse through anyway, so this limits the
information shown to just the count. This trades off available information with
usability of the tool.

The maximum number of shown tasks is controlled by the scheduler and can be
adjusted in client.cfg. It may be better to make this part of the query so that
it could potentially be controlled by the end user. We may also in the future
want to add additional server queries to update larger sections for searches,
as this removes useful search information.
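
(If you need to tune this, the knob lives in the scheduler configuration; the key name below is an assumption based on the commit message, so check the scheduler docs for the exact spelling.)

# client.cfg -- key name assumed
[scheduler]
max_shown_tasks: 10000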
Add sphinx docs about how to enable the db task history, and what
functionality it provides (including screen shots). Adds a bit more
context to the existing docs on configuring the scheduler.
Adds support for tasks with non-alphanumeric values (e.g. dates
with dashes in them).
I've been seeing half-second get_work runtimes recently on my scheduler. The
cause of this has been a large backfill that scheduled a lot of done tasks. With
about 100k done tasks and about 350 pending tasks, a lot of time is wasted
looping over done tasks just to ignore them. This essentially caches these
computations by keeping a dictionary to group tasks by status and updating it
whenever task statuses are changed. This has sped my scheduling up by an order
of magnitude, now taking about 20ms per get_work.

In order to ensure updates happen every time, the task state has to take over
any functions that modify status. This is necessary for remote task storage
anyway, so it had to be done eventually. There are no new tests here because
there are no functionality changes, only improvements in efficiency and
different internal representations.
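
(The gist of the optimization, as a standalone sketch rather than the scheduler's actual data structures:)

from collections import defaultdict

class Task(object):
    def __init__(self, task_id, status='PENDING'):
        self.id = task_id
        self.status = status

class SimpleTaskState(object):
    """Toy illustration: bucket tasks by status so get_work only scans PENDING."""

    def __init__(self):
        self._status_tasks = defaultdict(dict)   # status -> {task_id: Task}

    def add_task(self, task):
        self._status_tasks[task.status][task.id] = task

    def set_status(self, task, new_status):
        # every status change keeps the index in sync
        self._status_tasks[task.status].pop(task.id, None)
        task.status = new_status
        self._status_tasks[new_status][task.id] = task

    def get_pending_tasks(self):
        # O(number of pending tasks) instead of O(number of all tasks)
        return list(self._status_tasks['PENDING'].values())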
steenzout and others added 25 commits February 4, 2015 23:48
* Use sphinx-build
* Users should not have to read tox.ini in order to understand how to use it,
  so use html as the default output format.
* Kept same output dir for html just in case.
$ tox -e docsapi
$ git add doc/api
This class adds support for "taskless" parameters
Also made the default Config task include the section in the command line arguments
This new style makes it possible to override at runtime (using e.g. --hdfs-namenode-port) but also to use a configuration file as before
Adapted from code submitted to Luigi user list by Felix Gao:
* https://groups.google.com/forum/#!topic/luigi-user/BIqAChl8sdw
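
(A hedged sketch of the luigi.Config style described a few lines above; the config section and the command line prefix are derived from the class name, and the class/parameter names here are just examples.)

import luigi

class hdfs(luigi.Config):
    # Values come from the [hdfs] section of the configuration file,
    # but can also be overridden at runtime, e.g. --hdfs-namenode-port 8020.
    namenode_host = luigi.Parameter(default='localhost')
    namenode_port = luigi.IntParameter(default=50070)

# anywhere in the code base:
port = hdfs().namenode_port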
ignore autopep8 E309 since, if a class has a docstring, it will be separated by a blank line.

changed sqlalchemy import statements.

this is necessary to fix a bug in sphinx-doc, issue #1530:
sphinx-doc/sphinx#1530

fixed luigi.interface.Interface.run sphinx complaints.
improved luigi.interface.run docstring.

fixed luigi.lock.getpcmd sphinx complaints.

fixed luigi.contrib.mysqldb.MySqlTarget docstring complaints and made 1st character of parameter description lowercase.

fixed luigi.contrib.redshift.RedshiftManifestTask sphinx complaints.

fixed sphinx SEVERE: Title level inconsistent.

fixed sphinx "ERROR: Unknown target name" errors.

fixed sphinx "WARNING: Explicit markup ends without a blank line" errors.

fixed sphinx "WARNING: Explicit markup ends without a blank line" errors.

fixed sphinx "Duplicate explicit target name: 'spotify'" error.

fixed broken URL to Yarn REST API.

fixed URL to oozie.

fixed URL for luigi-user Google group.
fixed URL to spotify.

changed URL to source code to make sphinx linkcheck pass.

added new readme.rst to exclude_pattern to fix sphinx complaint that it's not part of a table of contents.
removed the html_static_path to fix sphinx build issue.

sphinx warnings now make the build break.

added docs-test tox environment to the travis build.
problems with documentation will now make the build break.

added more rules to .gitignore based on gitignore.io: Python, vim, OSX, Vagrant, PyCharm.

fixed sphinx warning to include luigi.tools.

merge tox docsapi with tox docs.
the tox docsapi environment creates new files that might break the documentation generation, so the command was moved into the tox docs environment.

synchronize README.rst in github and readthedocs.

copy of README.rst to doc directory.
URLs pointing to github changed to point to local files.
badges removed from the doc/README.rst.
added comments to build doc commands.
whitelist cp and sed commands.

README.rst added to sphinx exclude_patterns.
this bypasses the warning sphinx gives because README.rst is not part of a table of contents.
fixes 3 landscape errors on luigi.contrib.redshift.
During docs generation we copy README.rst to doc/ and clean it up
a bit. Let's delete it afterwards.
@coderfi
Author

coderfi commented Feb 5, 2015

Bah, I think I really messed things up on my branch, will try one more time with a new pull request.

@coderfi coderfi closed this Feb 5, 2015
@coderfi coderfi deleted the chicken branch February 5, 2015 08:13
@coderfi coderfi restored the chicken branch February 5, 2015 08:13