Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: deleting a "live" topic #182

Merged
merged 1 commit into from
Apr 28, 2013
Merged

Conversation

mreiferson
Copy link
Member

there appears to be an issue with deleting a topic (like via nsqadmin) when that topic is still producing messages.

ie. the producers "disappear" from the list in nsqlookupd despite continuing to have the topic and produce messages.

it appears that nsqd is sending out of order REGISTER/UNREGISTER commands, presumably because they are just spawned goroutines and there is no guarantee which gets runtime cycles first.

cc @jehiah @elubow

@jehiah
Copy link
Member

jehiah commented Apr 25, 2013

I think this is because there is a race between when a topic is removed from the *NSQd.topicMap and the lock that is used.

We release the lock immediately before performing the cleanup steps so that other operations are not blocked, but this means the eventual propagation of deletion to nsqadmin, and the cleanup of files are delayed and could happen after a separate goroutine/connection creates a new topic with the same name.

To solve we need to add code to block accesses to that topic, while releasing the global lock.

Deletion would get split into the following steps

  • NSQd.Lock()
  • topic.MarkForDeletion() which set's topic.exitFlag
  • NSQd.Unlock()
  • topic.Delete() (all remaining steps, with an inlined topic.notifier.Notify(t))
  • NSQd.Lock()
  • delete from NSQd.topicMap
  • NSQd.Unlock()

We would then add topic.CloseWaitgroup sync.WaitGroup. In topic.NewTopic it would .Add(1) and after removed from NSQd.topicMap we would topic.CloseWaitgroup.Done()

This would allow GetTopic() to in detection of a topic with topic.exitFlag set block on topic.CloseWaitgroup.Wait() and then retry GetTopic().

@mreiferson
Copy link
Member Author

Commit that's pushed up for review takes the stance that "You're Doing It Wrong (tm)" if you're deleting a topic that is still being written to (and therefore it is entirely acceptable that some writes fail).

With that mindset, the only changes were re-ordering some of the events that happen to before the topic is removed from the lookup map, for correctness of lookupd state (the actual bug).

cc @jehiah

@mreiferson
Copy link
Member Author

I should add that this approach, coupled with #183 landing, makes it even harder to imagine a scenario where someone would want to delete a live topic.

@mreiferson
Copy link
Member Author

RFR @jehiah

@jehiah
Copy link
Member

jehiah commented Apr 28, 2013

LGTM

jehiah added a commit that referenced this pull request Apr 28, 2013
@jehiah jehiah merged commit 1f833d2 into nsqio:master Apr 28, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants