Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nsqd: message loss during topic/channel bootstrap from metadata #1032

Merged
merged 1 commit into from
May 22, 2018

Conversation

michaelyou
Copy link
Contributor

@michaelyou michaelyou commented May 10, 2018

#1031

I am not sure will this introduce other bugs~

This is what the pr do:

Initialize topic with pause=1, so when start t.messagePump goroutine,it won't read msg out.
In the method GetTopic,after NewTopic return, check n.isLoading to see if it is loading from metadata file(restart). If not, unpause it immediately, else, unpause it when all channels are created in LoadMetadata .

NewTopic is only called by GetTopic, so the change should not affect other code.


image

nsqd/nsqd.go Outdated
@@ -372,6 +372,7 @@ func (n *NSQD) LoadMetadata() error {
channel.Pause()
}
}
topic.UnPause()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above, you'll see that paused state is stored in the topic metadata, so it does:

 if t.Paused {
 	topic.Pause()
}

So you can remove that now, but here only un-pause if the topic was not originally paused.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, i will correct it

nsqd/nsqd.go Outdated
@@ -515,6 +516,10 @@ func (n *NSQD) GetTopic(topicName string) *Topic {
t = NewTopic(topicName, &context{n}, deleteCallback)
n.topicMap[topicName] = t

if atomic.LoadInt32(&n.isLoading) == 0 {
t.UnPause()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can move this below the channels block, as an alternative to taking the topic lock to block PutMessages() while getting channels.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The channelMap in getOrCreateChannel needs the topic's lock . I am not sure if we can do this~

// this expects the caller to handle locking
func (t *Topic) getOrCreateChannel(channelName string) (*Channel, bool) {
    channel, ok := t.channelMap[channelName]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - you'd call t.GetChannel() instead. But this is still tentative supposition :)

@mreiferson mreiferson changed the title nsqd: Fix lost message when loading from metadata file nsqd: message loss during topic/channel bootstrap from metadata May 12, 2018
@mreiferson mreiferson added the bug label May 12, 2018
@mreiferson
Copy link
Member

I wonder if we should just create a NewTopicWithChannels function that allows this to be handled in the correct order before goroutines are launched.

@michaelyou michaelyou force-pushed the hotfix-lost-msg-restart branch from 90a4f63 to 6505605 Compare May 14, 2018 08:55
@michaelyou
Copy link
Contributor Author

michaelyou commented May 14, 2018

@mreiferson Creating a NewTopicWithChannels looks like a big change. I modified my commit and does it look more natural now~

@mreiferson
Copy link
Member

It will be a slightly bigger change, but I feel like this code is a little too clever and surprising. We could comment why the ordering is what it is, but I feel like I'd rather have slightly more code that's clear in its intent?

What do you think @ploxiln?

@ploxiln
Copy link
Member

ploxiln commented May 15, 2018

I think the LoadMetadata() unpause-after style is reasonably elegant, and maybe we could improve the locking/latency of GetTopic() with that style. A NewTopicWithChannels() would not work for GetTopic() (because the nsqd lock would have to be held while querying lookupd). But I don't have a strong opinion - just adding a NewTopicWithChannels() just for LoadMetadata() would work fine too.

@ploxiln
Copy link
Member

ploxiln commented May 20, 2018

It looks like this changed since I last viewed it - instead of initializing a Topic with paused: true in NewTopic(), it calls Topic.Pause() just before releasing the nsqd lock in GetTopic() iff nsqd.isLoading. It seems to me like that works too.

Anyway, I think this is a fine minimal fix-up, and further refactoring can be proposed in a separate PR. @mreiferson?

Copy link
Member

@mreiferson mreiferson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after adding these comments, thanks!

nsqd/nsqd.go Outdated
@@ -515,6 +515,10 @@ func (n *NSQD) GetTopic(topicName string) *Topic {
t = NewTopic(topicName, &context{n}, deleteCallback)
n.topicMap[topicName] = t

if atomic.LoadInt32(&n.isLoading) == 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment above this line that reads:

// if we're creating a new topic in this process while we're loading from metadata, the topic may already
// have message data persisted on disk. To ensure that all channels are created before message flow
// begins, we pause the topic.

nsqd/nsqd.go Outdated
@@ -372,6 +369,9 @@ func (n *NSQD) LoadMetadata() error {
channel.Pause()
}
}
if !t.Paused {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And a comment above this line:

// this logic is reversed, and done _after_ channel creation to ensure that all channels
// are created before messages begin flowing

@michaelyou michaelyou force-pushed the hotfix-lost-msg-restart branch from 6505605 to 1055073 Compare May 21, 2018 16:00
@michaelyou
Copy link
Contributor Author

@mreiferson done

@ploxiln
Copy link
Member

ploxiln commented May 21, 2018

Thanks for squashing down to one commit. The final little issue is that the commit title is not descriptive of the overall change. I suggest:

nsqd: pause topics while loading metadata

@michaelyou michaelyou force-pushed the hotfix-lost-msg-restart branch from 1055073 to 07f8e2f Compare May 22, 2018 02:49
@michaelyou
Copy link
Contributor Author

@ploxiln OK, done

@ploxiln
Copy link
Member

ploxiln commented May 22, 2018

Thanks!

@ploxiln ploxiln merged commit 1169772 into nsqio:master May 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants