Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug(server): Fix crash in ZPOPMIN #977

Merged
merged 5 commits into from
Mar 25, 2023
Merged

bug(server): Fix crash in ZPOPMIN #977

merged 5 commits into from
Mar 25, 2023

Conversation

chakaz
Copy link
Collaborator

@chakaz chakaz commented Mar 21, 2023

Crash was due to an attempt to access nullptr[-1], which is a bad idea :)

Fixes #949

Crash was due to an attempt to access nullptr[-1], which is bad :)

Signed-off-by: chakaz <chakaz@chakaz>
src/redis/sds.h Outdated
@@ -85,6 +85,10 @@ struct __attribute__ ((__packed__)) sdshdr64 {
#define SDS_TYPE_5_LEN(f) ((f)>>SDS_TYPE_BITS)

static inline size_t sdslen(const sds s) {
if (s == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the other functions in this file?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the bug is not there. And I prefer adding a dcheck in the code above instead of doing this.

null sds should not even be in sorted set. I am guessing the line above is the caller:
result_.emplace_back(string{ln->ele, sdslen(ln->ele)}, ln->score); so how come ln is in the set?
Do you have a minimal reproducible example with redis-cli?
lets start with it, then, we can track how null got there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I didn't know there shouldn't be nulls there.

BTW adding a DCHECK() isn't possible as this is a .c file :(

Re/ repro: when attempting to debug this, I learned that the issue happens upon the insertion of the 129th element. What happens there is that when the size bypasses 128, the set is converted (via zsetConvert()) from LISTPACK to SKIPLIST. As part of this conversion, we are left with a zset that has a header pointer pointing to an element with ele == nullptr.

My understanding is that this is in the redis codebase, so I imagine that either this is a bug with the way we use their API, or this is working as intended and we should skip such entries (which sounds weird, but much of the redis code there is weird as well I must say).

Just to clarify, here's a simple way to repro it:

ZADD myzset 1 element:1
ZADD myzset 2 element:2
ZADD myzset 3 element:3
ZADD myzset 4 element:4
ZADD myzset 5 element:5
ZADD myzset 6 element:6
ZADD myzset 7 element:7
ZADD myzset 8 element:8
ZADD myzset 9 element:9
ZADD myzset 10 element:10
ZADD myzset 11 element:11
ZADD myzset 12 element:12
ZADD myzset 13 element:13
ZADD myzset 14 element:14
ZADD myzset 15 element:15
ZADD myzset 16 element:16
ZADD myzset 17 element:17
ZADD myzset 18 element:18
ZADD myzset 19 element:19
ZADD myzset 20 element:20
ZADD myzset 21 element:21
ZADD myzset 22 element:22
ZADD myzset 23 element:23
ZADD myzset 24 element:24
ZADD myzset 25 element:25
ZADD myzset 26 element:26
ZADD myzset 27 element:27
ZADD myzset 28 element:28
ZADD myzset 29 element:29
ZADD myzset 30 element:30
ZADD myzset 31 element:31
ZADD myzset 32 element:32
ZADD myzset 33 element:33
ZADD myzset 34 element:34
ZADD myzset 35 element:35
ZADD myzset 36 element:36
ZADD myzset 37 element:37
ZADD myzset 38 element:38
ZADD myzset 39 element:39
ZADD myzset 40 element:40
ZADD myzset 41 element:41
ZADD myzset 42 element:42
ZADD myzset 43 element:43
ZADD myzset 44 element:44
ZADD myzset 45 element:45
ZADD myzset 46 element:46
ZADD myzset 47 element:47
ZADD myzset 48 element:48
ZADD myzset 49 element:49
ZADD myzset 50 element:50
ZADD myzset 51 element:51
ZADD myzset 52 element:52
ZADD myzset 53 element:53
ZADD myzset 54 element:54
ZADD myzset 55 element:55
ZADD myzset 56 element:56
ZADD myzset 57 element:57
ZADD myzset 58 element:58
ZADD myzset 59 element:59
ZADD myzset 60 element:60
ZADD myzset 61 element:61
ZADD myzset 62 element:62
ZADD myzset 63 element:63
ZADD myzset 64 element:64
ZADD myzset 65 element:65
ZADD myzset 66 element:66
ZADD myzset 67 element:67
ZADD myzset 68 element:68
ZADD myzset 69 element:69
ZADD myzset 70 element:70
ZADD myzset 71 element:71
ZADD myzset 72 element:72
ZADD myzset 73 element:73
ZADD myzset 74 element:74
ZADD myzset 75 element:75
ZADD myzset 76 element:76
ZADD myzset 77 element:77
ZADD myzset 78 element:78
ZADD myzset 79 element:79
ZADD myzset 80 element:80
ZADD myzset 81 element:81
ZADD myzset 82 element:82
ZADD myzset 83 element:83
ZADD myzset 84 element:84
ZADD myzset 85 element:85
ZADD myzset 86 element:86
ZADD myzset 87 element:87
ZADD myzset 88 element:88
ZADD myzset 89 element:89
ZADD myzset 90 element:90
ZADD myzset 91 element:91
ZADD myzset 92 element:92
ZADD myzset 93 element:93
ZADD myzset 94 element:94
ZADD myzset 95 element:95
ZADD myzset 96 element:96
ZADD myzset 97 element:97
ZADD myzset 98 element:98
ZADD myzset 99 element:99
ZADD myzset 100 element:100
ZADD myzset 101 element:101
ZADD myzset 102 element:102
ZADD myzset 103 element:103
ZADD myzset 104 element:104
ZADD myzset 105 element:105
ZADD myzset 106 element:106
ZADD myzset 107 element:107
ZADD myzset 108 element:108
ZADD myzset 109 element:109
ZADD myzset 110 element:110
ZADD myzset 111 element:111
ZADD myzset 112 element:112
ZADD myzset 113 element:113
ZADD myzset 114 element:114
ZADD myzset 115 element:115
ZADD myzset 116 element:116
ZADD myzset 117 element:117
ZADD myzset 118 element:118
ZADD myzset 119 element:119
ZADD myzset 120 element:120
ZADD myzset 121 element:121
ZADD myzset 122 element:122
ZADD myzset 123 element:123
ZADD myzset 124 element:124
ZADD myzset 125 element:125
ZADD myzset 126 element:126
ZADD myzset 127 element:127
ZADD myzset 128 element:128
ZADD myzset 129 element:129
ZPOPMIN myzset

If we were to move the ZPOPMIN command one line above it would not crash, as the zset has not yet been converted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

definitely should be added to zset_family_test.cc to reproduce.

  1. it could be my bug when I was incorporating t_zset.c code into dragonfly.
  2. It could be a bug in redis in the version I took the code from.I am pretty sure Redis will not crash on this use-case.

in any case if you add it to the unit-test, I will be able to deep dive further. I prefer fixing the tzset.c if possible.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure thing. I added a test to repro the bug (it should be part of the PR). It does sound like the bug is in t_zset.c code, but I was under the impression that it's been very stable for a long time, so maybe it's the way we use it.

@ashotland
Copy link
Contributor

Please also uncomment

# ('ZPOPMIN {k} 1', ValueType.ZSET), https://github.com/dragonflydb/dragonfly/issues/949

There's some leftover debugging statements, they're somewhat useful so I
kept them as the bug is not yet fixed.

Signed-off-by: chakaz <chakaz@chakaz>
@romange
Copy link
Collaborator

romange commented Mar 23, 2023

Hi @chakaz
this is the fix.
the bug was in zset_family code.

While searching for a root-cause, I compared our t_zset.c with redis 7.0 code, so I imported another bug fix from their code.

diff --git a/src/redis/t_zset.c b/src/redis/t_zset.c
index 2f4a0e1..9240288 100644
--- a/src/redis/t_zset.c
+++ b/src/redis/t_zset.c
@@ -1157,9 +1157,10 @@ void zsetConvert(robj *zobj, int encoding) {
         zs->zsl = zslCreate();
 
         eptr = lpSeek(zl,0);
-        serverAssertWithInfo(NULL,zobj,eptr != NULL);
-        sptr = lpNext(zl,eptr);
-        serverAssertWithInfo(NULL,zobj,sptr != NULL);
+        if (eptr != NULL) {
+            sptr = lpNext(zl,eptr);
+            serverAssertWithInfo(NULL,zobj,sptr != NULL);
+        }
 
         while (eptr != NULL) {
             score = zzlGetScore(sptr);
diff --git a/src/server/zset_family.cc b/src/server/zset_family.cc
index b079185..e51527b 100644
--- a/src/server/zset_family.cc
+++ b/src/server/zset_family.cc
@@ -576,7 +576,7 @@ void IntervalVisitor::PopSkipList(ZSetFamily::TopNScored sc) {
   if (params_.reverse) {
     ln = zsl->tail;
   } else {
-    ln = zsl->header;
+    ln = zsl->header->level[0].forward;
   }
 
   while (ln && sc--) {
diff --git a/src/server/zset_family_test.cc b/src/server/zset_family_test.cc
index e565648..941eef5 100644
--- a/src/server/zset_family_test.cc
+++ b/src/server/zset_family_test.cc
@@ -485,6 +485,6 @@ TEST_F(ZSetFamilyTest, ZAddPopCrash) {
   }
 
   auto resp = Run({"zpopmin", "key"});
-  EXPECT_THAT(resp, IntArg(1));
+  EXPECT_EQ(resp, "element:0");
 }
 }  // namespace dfly

Also re-enable (uncomment) the test in utility.py.

Signed-off-by: chakaz <chakaz@chakaz>
@chakaz
Copy link
Collaborator Author

chakaz commented Mar 24, 2023

Nice fix! It would have taken me a long while to find it, as I have very little idea re/ the zsl API :)

Signed-off-by: Chaka <chakaz@users.noreply.github.com>
@romange
Copy link
Collaborator

romange commented Mar 24, 2023

I do not know zsl API as well :)

I looked at Redis' corresponding code because if zsetConvert is the same for us and them, it means they probably handle it when reading the data.
https://github.com/redis/redis/blob/9e15b42fda8fa622edf6a1e15cba0fc066a35407/src/t_zset.c#L3937

@chakaz chakaz changed the title Fix crash in ZPOPMIN bug(server): Fix crash in ZPOPMIN Mar 25, 2023
@romange romange merged commit 7be98d9 into dragonflydb:main Mar 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dragonfly crash in ZPOPMIN
3 participants