
nsq_to_file: malformed gzip data #677

Closed
wants to merge 1 commit into from

Conversation

blakesmith

Hey there!

We use a long-running nsq_to_file process to back up our topics, for replay and offsite storage. We also use the -gzip flag to keep our log files compressed to save on disk space. Occasionally, one of the log files contains malformed gzip data at the end of the log. This tends to occur for our higher volume topics (100-200 messages / second), but not always. Using a sequence numbering scheme for all our messages, we tend to see 5-20 messages missing between the end of the corrupted log file and the start of the next one.

If you try to unzip one of the log files, you get these errors:

```
blake@message01:~$ zcat /busmsgs/disaster_recovery/tw_event.message01-000000.2015-10-27_05.log.gz > /dev/null

gzip: /busmsgs/disaster_recovery/tw_event.message01-000000.2015-10-27_05.log.gz: unexpected end of file
blake@message01:~$
```

We see similar gzip errors in our java code that gunzips. The malformed log file is always cut off midway through a message.

I'm not sure if this PR will fix the problem (I'd like to get feedback before I put it into production). Basically it seems that the file rotation sequence doesn't close / sync in the correct order. In my mind, the correct order to close the file would be:

  1. Close gzip writer
  2. Sync file
  3. Close file

This PR changes the order, and adds some more logging. It seems possible that we could lose data if not closed in this order, but I might be missing some important detail here.

What do you think?

Thanks for nsq! We love it so far.

Blake

@jehiah
Member

jehiah commented Oct 30, 2015

@blakesmith The way nsq_to_file structures its gzip files, it's the same as if you just concatenated a bunch of small gzip files together. There are some compression tools that don't support that type of gzip file (I've had some issues getting this format to work on hadoop in the past, for example).

The specific error you listed, unexpected end of file, is expected if you zcat a file that is still in use by nsq_to_file. This is because it has started the framing for the last gzip chunk, but only closes that chunk when it does a sync(), leaving some data in a state where it can't be decompressed yet.

It would be helpful if you can clarify if you are seeing that file while nsq_to_file has the file open, or after it's been rotated by nsq_to_file.

All that said, this change looks good. While the Close() should flush data to the OS, that doesn't mean it's Sync()'d, so this is a good change.

@blakesmith
Author

Hey @jehiah,

Thanks for the quick response. In our case, the corrupt files are older logs that have long since been closed by nsq_to_file - some from many months ago. Less than 1% of our log files exhibit this problem; in fact, most logs can be read entirely to completion. The logs that do have gzip problems fail not just with zcat, but also with gunzip and with java's GZIPInputStream:

```
java.io.EOFException: Unexpected end of ZLIB input stream
    at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240) ~[na:1.8.0_45]
    at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158) ~[na:1.8.0_45]
    at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:117) ~[na:1.8.0_45]
    at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[na:1.8.0_45]
```

I'll do some further testing with my patch locally, and then against our production workload and let you know.

@jehiah
Member

jehiah commented Oct 30, 2015

@blakesmith It occurred to me that the other potential cause of the unexpected end of file error is hard restarts (either kill -9 of the process, or of the host machine). It might be tougher for you to rule those out, but can you advise if either of those might be in play here?

In the hard restart case, while the file will have a corrupt, unfinished gzip block, nsq_to_file will not have acknowledged the messages in that block, so NSQ will hand them out again (i.e. no data loss).

@blakesmith
Author

@jehiah For sure. Before I started digging into the nsq_to_file source, I tried to rule out nsq_to_file process restarts. From our nsq_to_file process log data, the file corruptions don't seem correlated with process restarts. Also, because we employ a message sequencing system, we can observe that the missing message sequences do not appear in the log file following the corrupt one.

I'm going to try running the patched nsq_to_file on a separate machine over the weekend and try to get some better data. If you have any other ideas or places to look, let me know.

Thanks for the help!

@mreiferson mreiferson added the bug label Oct 30, 2015
@mreiferson mreiferson changed the title nsq_to_file: Ensure we Sync() after we Close() the gzip writer. nsq_to_file: Ensure we Sync() after we Close() the gzip writer Oct 30, 2015
@blakesmith
Author

Just wanted to give an update on this. After A/B testing this patch over the weekend, with the control group running the old 0.3.5 code and the experiment running the patched 0.3.6 code, I'm seeing identical gzip log failures on each server / node:

```
INFO 11-02 21:23:34 [main] c.s.b.s.LogSanityCheck: Total logs processed: 2153, failed: 6
```

The next step for me is to try to get a deterministic test case that reproduces this bug. Will check back in once I do some more hunting.

@ploxiln
Member

ploxiln commented Nov 3, 2015

If the server OS itself didn't crash, then it makes sense that sync() would not make a difference. But it is good that you verified :)

@mreiferson
Member

Curiously awaiting more data, I'm sure this one will be a doozy 🔥

@mreiferson
Member

@blakesmith any updates?

@blakesmith
Author

@mreiferson My team and I have been unable to get a deterministic test case for this failure. As a workaround, we ended up switching to uncompressed logging and gzipping offline, and our message logs are stable now. I might be able to spend some more time investigating soon.

@mreiferson
Member

@blakesmith thanks for the update. Clearly something is going on, so I'm gonna leave this open for discussion.

@mreiferson mreiferson changed the title nsq_to_file: Ensure we Sync() after we Close() the gzip writer nsq_to_file: malformed gzip data Jan 19, 2016
@mreiferson
Member

opened #716 to track this

@mreiferson mreiferson closed this Jan 24, 2016