lzo.index.tmp files not deleted #87

Open
gszjulcsi opened this issue Jan 29, 2014 · 4 comments

Comments

@gszjulcsi

We run the distributed LZO indexer on EMR (Hadoop version 1.0.3), with files stored on Amazon S3.

We have occasionally (twice so far) hit the following issue:

All of the lzo.index files are generated, but some of the lzo.index.tmp files are not deleted, and they cause problems when the data is processed with Pig. No exception or error is thrown during indexing, and the job is reported as successful.

@dvryaboy
Contributor

We have not seen this in our self-hosted environment. It might be due to something EC2-specific. Do you have any theories about the root cause?

@gszjulcsi
Author

In the meantime we have noticed that these index.tmp files have disappeared. We
suspect this was an S3 eventual consistency issue: it took S3 a long time
(about 7 hours) to become consistent.


@dvryaboy
Contributor

I see. In that case it may make sense to add a filter to the LZO input formats so that they ignore these temp files and you don't get an error. Feel free to send a pull request with such a change; we will be happy to take a look.
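
A minimal sketch of what such a filter could look like, for illustration only. It is plain Java over path strings rather than the project's actual input format code; the class name `LzoTempFileFilter`, its methods, and the example bucket paths are hypothetical, and the `.lzo.index.tmp` suffix is taken from the file names reported in this issue. In a real Hadoop job the same check would presumably live in an `org.apache.hadoop.fs.PathFilter` implementation or in the input format's file listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: decides which listed paths should be read as job input.
class LzoTempFileFilter {
    // Suffix of the leftover temp files described in this issue (assumption:
    // all temp index files end with exactly this suffix).
    static final String INDEX_TMP_SUFFIX = ".lzo.index.tmp";

    // Returns true when the path should be kept, false for leftover temp index files.
    static boolean accept(String path) {
        return !path.endsWith(INDEX_TMP_SUFFIX);
    }

    // Applies accept() to a directory listing and keeps the original order.
    static List<String> filterInputs(List<String> paths) {
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (accept(path)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Example listing with one stale temp file (paths are made up).
        List<String> listing = Arrays.asList(
            "s3://example-bucket/logs/part-00000.lzo",
            "s3://example-bucket/logs/part-00000.lzo.index",
            "s3://example-bucket/logs/part-00000.lzo.index.tmp");
        for (String path : filterInputs(listing)) {
            System.out.println(path);
        }
    }
}
```

Matching on the full `.lzo.index.tmp` suffix rather than on `.tmp` alone keeps the filter narrow, so unrelated files that happen to end in `.tmp` are not silently dropped. Whether the broader or the narrower match is preferable is a design choice for the pull request.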

@rangadi
Contributor

rangadi commented Jan 29, 2014

Excluding .tmp files is a good fix.

There are other subtle issues with S3 caused by these delays, e.g. https://github.com/kevinweil/elephant-bird/issues/309
