Fix some incorrect memory handling #5317
Conversation
@mahipv, @nikkhils: please review this pull request.
Codecov Report
@@ Coverage Diff @@
## main #5317 +/- ##
==========================================
+ Coverage 90.70% 90.89% +0.18%
==========================================
Files 225 225
Lines 52020 45792 -6228
==========================================
- Hits 47183 41621 -5562
+ Misses 4837 4171 -666
@@ -156,8 +156,13 @@ ts_metadata_insert(const char *metadata_key, Datum metadata_value, Oid type,
}

/* We have to copy the key here because heap_form_tuple will copy NAMEDATALEN
 * into the tuple instead of checking length. */
strlcpy(key_data, metadata_key, NAMEDATALEN);
Could you please remove this part, as strlcpy should take care of the trailing zero? I've checked with valgrind and this doesn't cause/fix any issue.
I'd like to defend this change.
Firstly, the first patch and the second patch solve different problems; they are only grouped in one issue because they were found by valgrind together.
The thing with strlcpy is that it only copies bytes up to the length of the source string and then terminates the string with a zero.
In this case, key_data is a char array of size NAMEDATALEN (which is 64), and the metadata_key being copied into it isn't always NAMEDATALEN long. key_data is allocated on the stack and can contain junk, or whatever was at the top of the stack. So it's better to fill it with zeroes first and only then copy something into it.
For me, valgrind still shows the first error stated in #5311, and pageinspect still shows junk. With the patch (the initial one or the ones linked below) this junk is gone and there are only zeroes.
Reproduction is simple: start postgres under valgrind and execute 'CREATE EXTENSION timescaledb;' with psql.
I can provide at least two alternative patches that will first fill the string with zeroes; see the attachments:
metadata_memory_fix_memcpy.patch
metadata_memory_fix_namedata.patch
If you like any of the above patches, I'll replace the current code in the commit with that patch.
If you're still not convinced and can't reproduce it, I'll remove the changes in metadata.c.
The thing with strlcpy is that it only copies bytes up to the length of the source string and then terminates the string with a zero.
In this case, key_data is a char array of size NAMEDATALEN (which is 64), and the metadata_key being copied into it isn't always NAMEDATALEN long. key_data is allocated on the stack and can contain junk, or whatever was at the top of the stack. So it's better to fill it with zeroes first and only then copy something into it.
Not sure I follow the reasoning here: strlcpy() copies the bytes from source to target and is guaranteed to null-terminate the string as long as the size is larger than 0.
From the manual of strlcpy():
Unlike [strncpy(3) and strncat(3)], strlcpy() and strlcat() take the full size of the buffer (not just the length) and guarantee to NUL-terminate the result (as long as size is larger than 0 ...)
So, if the source string is shorter, it will be copied fully into the target, and if it is longer, just the initial prefix will be copied and the string null-terminated, so it basically does the same as the code that you're replacing it with.
Okay, maybe we aren't on the same page here. When the manual of strlcpy says 'guarantee to NUL-terminate', it means the following: the sequence of characters starting at the buffer's address will end with '\0', guaranteed. If the buffer were a dynamically allocated char pointer, that would be fine. But in this case, key_data isn't allocated dynamically. It's an array of size NAMEDATALEN (64), and its contents are initially not guaranteed to be zeroed! When we copy metadata_key (which isn't guaranteed to be exactly 64 chars long) into it with strlcpy, we get something like this inside: "metadata_key_value\0junk_symbols_until_64th...". key_data is a char pointer, yes, and if we passed it to, for example, strlen after copying something into it with strlcpy, it would return the string's current length (the number of chars before '\0'), not 64.
But you say here in the comment (metadata.c, lines 158-159):
/* We have to copy the key here because heap_form_tuple will copy NAMEDATALEN
* into the tuple instead of checking length. */
And that means that any data left after the terminating zero will be pasted onto the page (as can be seen in #5311):
-[ RECORD 3 ]-----------------------------------------------------------------
lp | 3
lp_off | 7816
lp_flags | 1
lp_len | 126
t_xmin | 755
t_xmax | 0
t_field3 | 0
t_ctid | (0,3)
t_infomask2 | 3
t_infomask | 2306
t_hoff | 24
t_bits |
t_oid |
t_data | \x6578706f727465645f7575696400000000000000000000006822cd0400000000080000000000000010000000000000007827510900000000e0235109000000004b31383030623539332d383261362d346365382d623561652d66626235363234666162316401
This can be dangerous: when we calculate checksums of this page, if this junk is not the same every time, it can be problematic and lead to checksum mismatches.
The simplest solution, in my opinion, is to memset key_data with zeroes before copying anything into it.
You actually have a very good point: the value for the Name field is a fixed-size char array (effectively, at least), which means that random data will be added to the tuple.
Note that this has nothing to do with whether the field is dynamically allocated or not; it is just a side effect of copying NAMEDATALEN bytes from the source.
However, your patch does not solve the problem in that case, since all you do is replicate the behavior of strlcpy with this change.
I agree that a working change would be what you suggest: initialize the key_data variable with zeroes (it is an auto variable, so it is not automatically zero-initialized) and then use strlcpy. You need to use strlcpy since you need to guarantee that the string is null-terminated even if metadata_key is too long for the buffer.
Please let me chime in. As strncpy(3) says:
If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written.
So, in general, strncpy and strlcpy are not equivalent. The Postgres function namestrcpy deliberately uses strncpy instead of strlcpy to zero-pad the destination.
void
namestrcpy(Name name, const char *str)
{
/* NB: We need to zero-pad the destination. */
strncpy(NameStr(*name), str, NAMEDATALEN);
NameStr(*name)[NAMEDATALEN - 1] = '\0';
}
I think it would be better to replace strlcpy here with namestrcpy to avoid potential confusion in the future (please see the alternative patch metadata_memory_fix_namedata.patch mentioned above). I can imagine someone replacing strncpy with strlcpy again, because it's considered a safer alternative. But they are not really interchangeable.
Please let me chime in. As strncpy(3) says:
If the length of src is less than n, strncpy() writes additional null bytes to dest to ensure that a total of n bytes are written.
Thank you, I missed that comment. In that case, moving away from strlcpy is an improvement, but I agree that using namestrcpy is the better alternative.
src/process_utility.c
Name relname = palloc0(NAMEDATALEN);
namestrcpy(relname, NameStr(((Form_pg_class) GETSTRUCT(tuple))->relname));
Function namestrcpy will automatically null-terminate the buffer, so it is unnecessary to use palloc0 here; palloc suffices.
As @shinderuk pointed out, you do not need palloc0 here.
Force-pushed from 499302f to 9cec610.
@Medvecrab could you please squash the commits?
Force-pushed from 88680a0 to 15c5c7b.
@Medvecrab Before merging this, we need to add changelog entries to the commit, so either you allow changes to a pull request branch created from a fork and we can add this as an additional commit, or you add the lines to the changelog yourself.
I've allowed edits by maintainers, @mkindahl
Force-pushed from 15c5c7b to f426122.
While running TimescaleDB under valgrind I found two cases of incorrect memory handling.

Case 1: When creating the timescaledb extension, during the insertion of metadata, there is some junk in memory that is not zeroed before being written out. The changes in metadata.c fix this.

Case 2: When executing GRANT ... ON ALL TABLES IN SCHEMA some_schema and deconstructing this statement into grants on individual tables, the copying of those table names is wrong. Currently, the code does not copy the data itself, but an address pointing at data on a page in some buffer. The problem is that when the page in this buffer changes, the copied address leads to wrong data. The changes in process_utility.c fix this by allocating memory and copying the needed relname into it.

Fixes timescale#5311
Force-pushed from f426122 to 0b8d94a.
Force-pushed from 0b8d94a to c770a86.
Automated backport to 2.10.x not done: cherry-pick failed.
This release contains bug fixes since the 2.10.0 release. We recommend that you upgrade at the next available opportunity.

**Bugfixes**
* #5159 Support Continuous Aggregates names in hypertable_(detailed_)size
* #5226 Fix concurrent locking with chunk_data_node table
* #5317 Fix some incorrect memory handling
* #5336 Use NameData and namestrcpy for names
* #5343 Set PortalContext when starting job
* #5360 Fix uninitialized bucket_info variable
* #5362 Make copy fetcher more async
* #5364 Fix num_chunks inconsistency in hypertables view
* #5367 Fix column name handling in old-style continuous aggregates
* #5378 Fix multinode DML HA performance regression
* #5384 Fix Hierarchical Continuous Aggregates chunk_interval_size

**Thanks**
* @justinozavala for reporting an issue with PL/Python procedures in the background worker
* @Medvecrab for discovering an issue with copying NameData when forming heap tuples
* @pushpeepkmonroe for discovering an issue in upgrading old-style continuous aggregates with renamed columns
Disable-check: commit-count