nsq_to_file: malformed gzip data #677
@blakesmith The way The specific error you listed It would be helpful if you can clarify if you are seeing that file while All that said, this change looks good. While the Close() should flush data to the OS, it doesn't mean it's
Hey @jehiah, Thanks for the quick response. In our case, the corrupt files are older logs that have long since been closed by
I'll do some further testing with my patch locally, and then against our production workload and let you know.
@blakesmith It occurred to me that the other potential cause of the In the hard restart case, while the file will have a corrupt unfinished gzip block, nsq_to_file will not have acknowledged the messages in that block, so NSQ will hand them out again (i.e., no data loss).
@jehiah For sure. Before I started digging into the I'm going to try running the patched Thanks for the help!
Just wanted to give an update on this. After A/B testing this patch over the weekend, with the control group having the old 0.3.5 code, and the experiment running the patched 0.3.6 code, I'm seeing identical gzip log failures on each server / node:
Next step for me is to try to get a deterministic test case that reproduces this bug. Will check back in once I do some more hunting.
If the server OS itself didn't crash, then it makes sense that sync() would not make a difference. But it is good that you verified :)
Curiously awaiting more data, I'm sure this one will be a doozy 🔥
@blakesmith any updates?
@mreiferson My team and I have been unable to get a deterministic test case for this failure. We ended up switching to uncompressed logging, and then gzipping offline as a workaround, and our message logs are stable now. I might be able to spend some more time investigating soon.
@blakesmith thanks for the update. Clearly something is going on, so I'm gonna leave this open for discussion.
opened #716 to track this
Hey there!
We use a long-running nsq_to_file process to back up our topics, for replay and offsite storage. We also use the -gzip flag to keep our log files compressed to save on disk space. Occasionally, one of the log files contains malformed gzip data at the end of the log. This tends to occur for our higher-volume topics (100-200 messages / second), but not always. Using a sequence numbering scheme for all our messages, we tend to see 5-20 messages missing between the end of the corrupted log file and the start of the next one.
If you try to unzip one of the log files, you get these errors:
We see similar gzip errors in our Java code that gunzips. The malformed log file is always cut off midway through a message.
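For anyone who wants to check their own files, here is a rough Go sketch of how a truncated gzip stream can be detected by decompressing it to the end. It is not part of nsq_to_file; the helper names compress and verify are made up for this illustration, and the stream is cut short by hand to imitate a file that lost its tail.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// compress (hypothetical helper) gzips data in memory and returns the
// complete stream, including the trailer written by Close.
func compress(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// verify (hypothetical helper) decompresses the whole stream and discards
// the output. It returns the number of decompressed bytes read and any
// error; a stream that ends early yields a non-nil error.
func verify(compressed []byte) (int64, error) {
	zr, err := gzip.NewReader(bytes.NewReader(compressed))
	if err != nil {
		return 0, err
	}
	defer zr.Close()
	return io.Copy(io.Discard, zr)
}

func main() {
	full, err := compress([]byte("msg 1\nmsg 2\nmsg 3\n"))
	if err != nil {
		panic(err)
	}

	// A complete stream decompresses without error.
	n, err := verify(full)
	fmt.Println("complete:", n, err)

	// Dropping the last few bytes imitates a file cut off before the
	// gzip trailer was fully written.
	n, err = verify(full[:len(full)-5])
	fmt.Println("truncated:", n, err)
}
```

The same check should work on a real log file by passing an opened os.File to gzip.NewReader instead of an in-memory reader.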
I'm not sure if this PR will fix the problem (I'd like to get feedback before I put it into production). Basically it seems that the sequence of file rotation doesn't close / sync in the correct order. In my mind, the correct order to close the file would be:
This PR changes the order, and adds some more logging. It seems possible that we could lose data if not closed in this order, but I might be missing some important detail here.
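To make the ordering question concrete, here is a minimal Go sketch of one plausible close sequence for a gzip-compressed file: close the gzip writer, sync the file, then close the file. This order is an assumption made for illustration, not the code from this PR, and the helper names (closeGzipFile, writeLog, readLog, roundTrip) are hypothetical.

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// closeGzipFile (hypothetical) finishes a gzip-compressed file in an
// assumed order: gzip writer first, then fsync, then the file itself.
func closeGzipFile(zw *gzip.Writer, f *os.File) error {
	// 1. Closing the gzip writer flushes buffered data and writes the
	//    gzip trailer into the file.
	if err := zw.Close(); err != nil {
		f.Close()
		return err
	}
	// 2. Sync asks the OS to commit the file contents to stable storage.
	if err := f.Sync(); err != nil {
		f.Close()
		return err
	}
	// 3. Only then release the file descriptor.
	return f.Close()
}

// writeLog (hypothetical) writes one message per line through gzip.
func writeLog(path string, messages []string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	zw := gzip.NewWriter(f)
	for _, m := range messages {
		if _, err := io.WriteString(zw, m+"\n"); err != nil {
			zw.Close()
			f.Close()
			return err
		}
	}
	return closeGzipFile(zw, f)
}

// readLog (hypothetical) decompresses the whole file back into a string.
func readLog(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	zr, err := gzip.NewReader(f)
	if err != nil {
		return "", err
	}
	defer zr.Close()
	b, err := io.ReadAll(zr)
	return string(b), err
}

// roundTrip (hypothetical) writes messages to a temporary file and reads
// them back, removing the temporary directory afterwards.
func roundTrip(messages []string) (string, error) {
	dir, err := os.MkdirTemp("", "gzip-rotate")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "topic.log.gz")
	if err := writeLog(path, messages); err != nil {
		return "", err
	}
	return readLog(path)
}

func main() {
	out, err := roundTrip([]string{"msg 1", "msg 2"})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

If the file were closed before the gzip writer, the trailer would never reach disk and the file would fail to decompress at the very end, which is the kind of symptom this sketch is meant to illustrate.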
What do you think?
Thanks for nsq! We love it so far.
Blake