Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory corruption while calling SysV msgsnd & msgrcv #551

Merged
merged 1 commit into from
Nov 27, 2023

Conversation

arajkumar
Copy link
Contributor

Here is a snippet of man page for msgsnd & msgrcv,

      int msgsnd(int msqid, const void *msgp, size_t msgsz, int msgflg);

       ssize_t msgrcv(int msqid, void *msgp, size_t msgsz, long msgtyp,
                      int msgflg);

DESCRIPTION
       The  msgsnd() and msgrcv() system calls are used to send messages to, and receive messages from, a System V message queue.  The calling process must have write
       permission on the message queue in order to send a message, and read permission to receive a message.

       The msgp argument is a pointer to a caller-defined structure of the following general form:

           struct msgbuf {
               long mtype;       /* message type, must be > 0 */
               char mtext[1];    /* message data */
           };

       The mtext field is an array (or other structure) whose size is specified by msgsz, a nonnegative integer value.  Messages of zero length (i.e., no mtext field)
       are  permitted.   The  mtype field must have a strictly positive integer value.  This value can be used by the receiving process for message selection (see the
       description of msgrcv() below).

According to the man page, the 3rd argument(i.e.msgsz) of both msgsnd and msgrcv must pass the size of mtext i.e. excluding the size of mtype.

If we apply this logic into our context, we must pass the sizeof(QMessage->data), not sizeof(QMessage).

We can also find the corruption using valgrind like below,

valgrind --trace-children=yes --tool=memcheck --leak-check=no pgcopydb
copy table-data

It gives the following summary,

==640986== Syscall param msgsnd(msgp->mtext) points to uninitialised byte(s)
==640986==    at 0x4AC92AA: msgsnd (msgsnd.c:25)
==640986==    by 0x14E178: queue_send (queue_utils.c:100)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:403)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:295)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:232)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:209)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:117)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:89)
==640986==    by 0x162643: copydb_copy_all_table_data (table-data.c:57)
==640986==    by 0x11406E: cli_copy_table_data (cli_copy.c:395)
==640986==    by 0x1118B2: UnknownInlinedFun (commandline.c:71)
==640986==    by 0x1118B2: main (main.c:142)
==640986==  Address 0x1ffeeec5f0 is on thread 1's stack
==640986==  in frame #2, created by copydb_copy_all_table_data (table-data.c:38)
==640986==
==640986== Syscall param msgsnd(msgp->mtext) points to uninitialised byte(s)
==640986==    at 0x4AC92AA: msgsnd (msgsnd.c:25)
==640986==    by 0x14E178: queue_send (queue_utils.c:100)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:380)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:307)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:232)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:209)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:117)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:89)
==640986==    by 0x162AD8: copydb_copy_all_table_data (table-data.c:57)
==640986==    by 0x11406E: cli_copy_table_data (cli_copy.c:395)
==640986==    by 0x1118B2: UnknownInlinedFun (commandline.c:71)
==640986==    by 0x1118B2: main (main.c:142)
==640986==  Address 0x1ffeeec5f0 is on thread 1's stack
==640986==  in frame #2, created by copydb_copy_all_table_data (table-data.c:38)
==640986==

...

The above error disappers with our fix.

Here is a snippet of man page for msgsnd & msgrcv,

```
      int msgsnd(int msqid, const void *msgp, size_t msgsz, int msgflg);

       ssize_t msgrcv(int msqid, void *msgp, size_t msgsz, long msgtyp,
                      int msgflg);

DESCRIPTION
       The  msgsnd() and msgrcv() system calls are used to send messages to, and receive messages from, a System V message queue.  The calling process must have write
       permission on the message queue in order to send a message, and read permission to receive a message.

       The msgp argument is a pointer to a caller-defined structure of the following general form:

           struct msgbuf {
               long mtype;       /* message type, must be > 0 */
               char mtext[1];    /* message data */
           };

       The mtext field is an array (or other structure) whose size is specified by msgsz, a nonnegative integer value.  Messages of zero length (i.e., no mtext field)
       are  permitted.   The  mtype field must have a strictly positive integer value.  This value can be used by the receiving process for message selection (see the
       description of msgrcv() below).
```

According to it, the 3rd arg of both msgsnd and msgrcv must pass the
size of mtext i.e. excluding the mtype.

If we apply this logic into our context, we must pass the sizeof(QMessage->data), not sizeof(QMessage).

We can also find the corruption using valgrind like below,

```
valgrind --trace-children=yes --tool=memcheck --leak-check=no pgcopydb
copy table-data
```

It gives the following summary,

```
==640986== Syscall param msgsnd(msgp->mtext) points to uninitialised byte(s)
==640986==    at 0x4AC92AA: msgsnd (msgsnd.c:25)
==640986==    by 0x14E178: queue_send (queue_utils.c:100)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:403)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:295)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:232)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:209)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:117)
==640986==    by 0x162643: UnknownInlinedFun (table-data.c:89)
==640986==    by 0x162643: copydb_copy_all_table_data (table-data.c:57)
==640986==    by 0x11406E: cli_copy_table_data (cli_copy.c:395)
==640986==    by 0x1118B2: UnknownInlinedFun (commandline.c:71)
==640986==    by 0x1118B2: main (main.c:142)
==640986==  Address 0x1ffeeec5f0 is on thread 1's stack
==640986==  in frame #2, created by copydb_copy_all_table_data (table-data.c:38)
==640986==
==640986== Syscall param msgsnd(msgp->mtext) points to uninitialised byte(s)
==640986==    at 0x4AC92AA: msgsnd (msgsnd.c:25)
==640986==    by 0x14E178: queue_send (queue_utils.c:100)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:380)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:307)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:232)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:209)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:117)
==640986==    by 0x162AD8: UnknownInlinedFun (table-data.c:89)
==640986==    by 0x162AD8: copydb_copy_all_table_data (table-data.c:57)
==640986==    by 0x11406E: cli_copy_table_data (cli_copy.c:395)
==640986==    by 0x1118B2: UnknownInlinedFun (commandline.c:71)
==640986==    by 0x1118B2: main (main.c:142)
==640986==  Address 0x1ffeeec5f0 is on thread 1's stack
==640986==  in frame #2, created by copydb_copy_all_table_data (table-data.c:38)
==640986==

...
```

The above error disappers with the fix.

Signed-off-by: Arunprasad Rajkumar <ar.arunprasad@gmail.com>
@arajkumar
Copy link
Contributor Author

arajkumar commented Nov 24, 2023

I think we should also revert #534 and part of #533 which converts QMessage allocation to heap.

CC: @shubhamdhama @dimitri

@dimitri dimitri merged commit ac9c72f into dimitri:main Nov 27, 2023
16 checks passed
@dimitri
Copy link
Owner

dimitri commented Nov 27, 2023

Well I don't know if we should revert previous patches. It seems like we put too much pressure on the stack allocation everywhere in pgcopydb and should use more of the heap instead...

@arajkumar
Copy link
Contributor Author

@dimitri I see lots of noise created by free due to heap allocation for smaller objects. IMHO, stack is always better for smaller short lived objects.

@dimitri
Copy link
Owner

dimitri commented Nov 27, 2023

That's my opinion too. Can you create a PR that gets back to using the stack for these small objects?

@shubhamdhama
Copy link
Contributor

@arajkumar I'm really happy that the root cause for this issue is discovered. And thanks for the detailed description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants