nsqd: changes to support IoT usage, avoid data loss #855
Comments
I found #34 (nsqd: disk-backed deferred queue) and #376, related issues from 2012 and 2014 respectively. Both were closed in favour of #510 and #625. The WAL feature looks like it's on the back-burner, so persisting the deferred queue is evidently not trivial with the current codebase. In contrast, I'm looking for something simple, and I don't have strong requirements for preserving the semantics of deferred messages. So here's my spec for a third attempt at poor-man's resilience:
This pretty simple change (maybe 100 LoC) would give a lifeline "resilience" mode where users can elect to restrict requeue times in order to achieve a degree of safety from power cuts. When the WAL feature @mreiferson is working on lands, the stopgap above can easily be removed, and users will simply get improved functionality with a natural upgrade path.
Hi @mcorb, thanks for the feedback and suggestions. As you've correctly discovered, the ultimate solution to this problem would be to complete #510 (as implemented by #625). While your proposal might address the issues, the changes don't feel like generally useful additions and would serve as a rather specific stop-gap for this particular use case. I'm curious about the system's architecture though — have you considered not putting …
@mcorb, ~5 years ago I dealt with the exact same problem, where user-mode code used SQLite. Ultimately, it turned out that no production-ready (reliable) file systems were good enough: SQLite (with or without WAL) writes to the same place on each transaction (typically 3 or 4 times), and standard filesystems (ext, xfs, etc.) map that to the exact same block on the disk. Flash cards advertise wear-levelling, but tests of both commercial and consumer cards showed that it's a joke. In fact, consumer cards were more reliable. The system-level solution would be to use a log-structured file system on the data (rw) volume. The ad-hoc solution was to buffer data in RAM in the user code, for up to N events or up to M minutes. In any case, I would advise against placing this functionality into …
I am using nsq as a store-and-forward agent for IoT devices that report metrics to a central HTTPS API.
The small devices are offline for periods of time, and need to send back data opportunistically. The power supply is also intermittent.
You can think of the use case as passenger-bus instrumentation: metrics are collected between stops where wifi is available, with the power supply going up and down between those bus stops.
As with most IoT devices the flash storage has really limited write lifetime.
I've identified the following `nsqd` options as relevant to this use-case. Our first approach was to disable the memory queue entirely with `-mem-queue-size 0` and to set `-sync-timeout 20s`. However, this isn't working out, for a couple of reasons:
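For reference, that first approach corresponds to an invocation along these lines (`/var/lib/nsq` is a placeholder data path, not from the original report):

```shell
nsqd -data-path=/var/lib/nsq \
     -mem-queue-size=0 \
     -sync-timeout=20s
```

With `-mem-queue-size 0` every message bypasses the in-memory queue and goes straight to the disk-backed queue, and `-sync-timeout` controls how often the disk queue is fsynced.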
My next attempt was to implement flushing of the memory queue to the disk queue (based on the first part of `Channel.flush()`), called by a new 20s ticker. This way we can enable the memory queue, allowing good performance and avoiding disk wear, while still achieving data-safety guarantees.
Two problems with this:
I'd like to get a recommendation from the `nsqd` developer community on how to proceed. It looks like whatever change is needed won't be more than a few lines of code, but I'm not sure of the best way to meet these requirements. I'll be glad to submit patches and documentation for IoT use once we hear the best approach from an experienced developer.