Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Flaky test, test dict popteim ...get: wrong variant type #2866

Closed
gabrieldemarmiesse opened this issue May 28, 2024 · 6 comments
Closed
Labels
bug Something isn't working help wanted Extra attention is needed mojo-repo Tag all issues with this label [stdlib] core-library-q2-2024

Comments

@gabrieldemarmiesse
Copy link
Contributor

gabrieldemarmiesse commented May 28, 2024

Bug description

Test test dict popteim ...get: wrong variant type

--
Command Output (stderr):
--
RUN: at line 13: mojo /Users/runner/work/mojo/mojo/stdlib/test/collections/test_dict.mojo
+ mojo /Users/runner/work/mojo/mojo/stdlib/test/collections/test_dict.mojo
Please submit a bug report to https://github.com/modularml/mojo/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.	Program arguments: mojo /Users/runner/work/mojo/mojo/stdlib/test/collections/test_dict.mojo
 #0 0x00000001002d64b8 llvm_strlcpy (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x1000be4b8)
 #1 0x00000001002d47a4 llvm_strlcpy (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x1000bc7a4)
 #2 0x00000001002d6b58 llvm_strlcpy (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x1000beb58)
 #3 0x000000018b99f584 (/usr/lib/system/libsystem_platform.dylib+0x180477584)
 #4 0x000000030013051c 
 #5 0x0000000100669b90 __jit_debug_register_code (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x100[45](https://github.com/modularml/mojo/actions/runs/9267159499/job/25492971477?pr=2865#step:9:46)1b90)
 #6 0x0000000100236e00 _mh_execute_header (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x10001ee00)
 #7 0x00000001002367f4 _mh_execute_header (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x10001e7f4)
 #8 0x000000010021f48c _mh_execute_header (/Users/runner/.modular/pkg/packages.modular.com_nightly_mojo/bin/mojo+0x10000748c)
 #9 0x000000018b5e60e0 
mojo crashed!
Please file a bug report.
[7850:3[46](https://github.com/modularml/mojo/actions/runs/9267159499/job/25492971477?pr=2865#step:9:47)89:20240528,094957.676839:WARNING crash_report_exception_handler.cc:257] UniversalExceptionRaise: (os/kern) failure (5)
/Users/runner/work/mojo/mojo/build/stdlib/collections/Output/test_dict.mojo.script: line 1:  78[48](https://github.com/modularml/mojo/actions/runs/9267159499/job/25492971477?pr=2865#step:9:49) Trace/BPT trap: 5       mojo /Users/runner/work/mojo/mojo/stdlib/test/collections/test_dict.mojo

--

Steps to reproduce

See https://github.com/modularml/mojo/actions/runs/9267159499/job/25492971477?pr=2865

System information

macos
modular v0.8.0
mojo 2024.5.2805
@gabrieldemarmiesse gabrieldemarmiesse added bug Something isn't working mojo-repo Tag all issues with this label labels May 28, 2024
@JoeLoser JoeLoser added help wanted Extra attention is needed [stdlib] core-library-q2-2024 labels May 28, 2024 — with Linear
@gabrieldemarmiesse
Copy link
Contributor Author

gabrieldemarmiesse commented May 28, 2024

Since popitem uses reverse(self.items()) and that reverse is already buggy (see #2369) I guess that the issue is either in the compiler or the implementation of reverse(Dict). But if it's in the compiler, reverse(Dict) triggers it.

@jayzhan211
Copy link
Contributor

The compiler is not open-source, so I have no idea what is going on :(

@JoeLoser
Copy link
Collaborator

This is likely a stdlib bug I think. It was flaky all weekend and we actually reverted the popitem PR yesterday internally as you'll see in this morning nightly release. FYI @Mogball

@JoeLoser
Copy link
Collaborator

I think we should close this GitHub issue since we reverted it and then can reapproach popitem implementation entirely and make it not flake.

@msaelices
Copy link
Contributor

@JoeLoser I had this other approach of implementation: #2794

I initially closed it as there was a previous implementation and I did not know. Maybe this another approach is not causing flaky tests to fail

@JoeLoser
Copy link
Collaborator

@JoeLoser I had this other approach of implementation: #2794

I initially closed it as there was a previous implementation and I did not know. Maybe this another approach is not causing flaky tests to fail

I'd recommend we re-open your PR and proceed with that. I commented on the PR directly. What do you think?

modularbot pushed a commit that referenced this issue May 31, 2024
…(Dict.items())` (#40974)

[External] [stdlib] Fix UB in `reversed(Dict.values())` and
`reversed(Dict.items())`

Finally found the culprit in the flakyness that plagued us since a few
week in the `test_reversed.mojo`.

### The actual bug:

When iterating over a list in reverse order, we should start at
`len(my_list) - 1` not at `len(my_list)`.
That triggered an out of bounds access and thus was undefined behavior.

### The effect on our CI
As you know, we have been seeing flakyness lately. It was documented a
number of times and always related to `reverse`:
* #2866 (comment)
* #2369

### Why was it passing sometimes?
This is because there were `Optional[...]` in the List. Thus if the flag
of the `Optional` says that no element is present, it's just skipped
(the dict doesn't have an entry at this index). So the list of the Dict
would often look like this: `["a", "b", "c", "d"] None`
but the last `None` is actually memory that we don't have access to.
Sometimes it's then skipped in the iteration making the tests pass.
Sometimes it would cause segfaults because the test dict worked with
strings. Sometimes we would get `wrong variant type` since we don't know
what happens to the memory between None check and access.

### Why wasn't it found earlier?

First of all, our Dict implementation is too complexe for what it does
and thus is very good at hiding bugs.

Well we did have `debug_assert` before getting the element of the
`List`, but this `debug_assert` looked like this in the dict iterator:
```mojo
            @parameter
            if forward:
                debug_assert(
                    self.index < self.src[]._reserved, "dict iter bounds"
                )
            else:
                debug_assert(self.index >= 0, "dict iter bounds")
```
So one bound was checked when reading in one direction and the other
bound was checked in the other direction. A better `debug_assert` would
have been
```mojo
debug_assert(0 <= self.index < self.src[]._reserved, "dict iter bounds")
```
When I worked on my PR #2718 the
condition `self.index < self.src[]._reserved` didn't trigger anything
since it was in the wrong branch, it was never executed.

Also before, `__get_ref` didn't have any bounds checks, even when
assertions were enabled.

A recent commit
8d0870e
adds `unsafe_get()` in List and make `__get_ref` use it. It also adds
`debug_assert` to `unsafe_get()`, which means that now `__get_ref` has
bounds checks if run with assertions enabled. This allowed me to catch
the out of bounds access when updating
#2718 making the fail
deterministic and debuggable.

Since we have this, the `debug_assert` in `dict.mojo` isn't necessary
anymore.

### Consequences on ongoing work:
* This fix have been also added to
#2718
* The PR #2701 that we did with
@jayzhan211 was actually correct. It was just using
`reverse(Dict.items())` which was buggy at the time. After the fix is
merged, we can re-revert this PR.
* #2794 is not necessary anymore
since the implementation by @jayzhan211 was correct.
* The real cause of #2866 was
found, the issue has already been closed though.
* #2369 can be closed for good.
* #2832 can be closed for good.

### Closing thoughts
* We really need to run the unit tests with assertions enabled and add
assertions whenever necessary
* The dict implementation is a bit too complicated. For example,
`self._reserved` is the length of the internal list. There is no need to
store the length of the list twice. Let's drop this variable and use
`len(self._entries)` instead. I guess this is a relic of the time when
`List` wasn't completely flushed out. If had done so, it would have been
ovious that we can't do `my_list.__get_ref(len(my_list))`
* Iterating manually over a list like this is bug-prone. The
implementation we have especially is, since
```mojo
                @parameter
                if forward:
                    self.index += 1
                else:
                    self.index -= 1
```
is done twice in the code, it should only be done once. While there is
no bug, code duplication and complexity hides bugs.
* We should iterate over the list with a list iterator, not with a
custom-made iterator. This will remove a lot of code in the `dict.mojo`.

Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com>
Closes #2896
MODULAR_ORIG_COMMIT_REV_ID: b65009dc51f1e3027f91b5b61a5b7003cb022b87
modularbot pushed a commit that referenced this issue Jun 7, 2024
…(Dict.items())` (#40974)

[External] [stdlib] Fix UB in `reversed(Dict.values())` and
`reversed(Dict.items())`

Finally found the culprit in the flakyness that plagued us since a few
week in the `test_reversed.mojo`.

### The actual bug:

When iterating over a list in reverse order, we should start at
`len(my_list) - 1` not at `len(my_list)`.
That triggered an out of bounds access and thus was undefined behavior.

### The effect on our CI
As you know, we have been seeing flakyness lately. It was documented a
number of times and always related to `reverse`:
* #2866 (comment)
* #2369

### Why was it passing sometimes?
This is because there were `Optional[...]` in the List. Thus if the flag
of the `Optional` says that no element is present, it's just skipped
(the dict doesn't have an entry at this index). So the list of the Dict
would often look like this: `["a", "b", "c", "d"] None`
but the last `None` is actually memory that we don't have access to.
Sometimes it's then skipped in the iteration making the tests pass.
Sometimes it would cause segfaults because the test dict worked with
strings. Sometimes we would get `wrong variant type` since we don't know
what happens to the memory between None check and access.

### Why wasn't it found earlier?

First of all, our Dict implementation is too complexe for what it does
and thus is very good at hiding bugs.

Well we did have `debug_assert` before getting the element of the
`List`, but this `debug_assert` looked like this in the dict iterator:
```mojo
            @parameter
            if forward:
                debug_assert(
                    self.index < self.src[]._reserved, "dict iter bounds"
                )
            else:
                debug_assert(self.index >= 0, "dict iter bounds")
```
So one bound was checked when reading in one direction and the other
bound was checked in the other direction. A better `debug_assert` would
have been
```mojo
debug_assert(0 <= self.index < self.src[]._reserved, "dict iter bounds")
```
When I worked on my PR #2718 the
condition `self.index < self.src[]._reserved` didn't trigger anything
since it was in the wrong branch, it was never executed.

Also before, `__get_ref` didn't have any bounds checks, even when
assertions were enabled.

A recent commit
8d0870e
adds `unsafe_get()` in List and make `__get_ref` use it. It also adds
`debug_assert` to `unsafe_get()`, which means that now `__get_ref` has
bounds checks if run with assertions enabled. This allowed me to catch
the out of bounds access when updating
#2718 making the fail
deterministic and debuggable.

Since we have this, the `debug_assert` in `dict.mojo` isn't necessary
anymore.

### Consequences on ongoing work:
* This fix have been also added to
#2718
* The PR #2701 that we did with
@jayzhan211 was actually correct. It was just using
`reverse(Dict.items())` which was buggy at the time. After the fix is
merged, we can re-revert this PR.
* #2794 is not necessary anymore
since the implementation by @jayzhan211 was correct.
* The real cause of #2866 was
found, the issue has already been closed though.
* #2369 can be closed for good.
* #2832 can be closed for good.

### Closing thoughts
* We really need to run the unit tests with assertions enabled and add
assertions whenever necessary
* The dict implementation is a bit too complicated. For example,
`self._reserved` is the length of the internal list. There is no need to
store the length of the list twice. Let's drop this variable and use
`len(self._entries)` instead. I guess this is a relic of the time when
`List` wasn't completely flushed out. If had done so, it would have been
ovious that we can't do `my_list.__get_ref(len(my_list))`
* Iterating manually over a list like this is bug-prone. The
implementation we have especially is, since
```mojo
                @parameter
                if forward:
                    self.index += 1
                else:
                    self.index -= 1
```
is done twice in the code, it should only be done once. While there is
no bug, code duplication and complexity hides bugs.
* We should iterate over the list with a list iterator, not with a
custom-made iterator. This will remove a lot of code in the `dict.mojo`.

Co-authored-by: Gabriel de Marmiesse <gabriel.demarmiesse@datadoghq.com>
Closes #2896
MODULAR_ORIG_COMMIT_REV_ID: b65009dc51f1e3027f91b5b61a5b7003cb022b87
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working help wanted Extra attention is needed mojo-repo Tag all issues with this label [stdlib] core-library-q2-2024
Projects
None yet
Development

No branches or pull requests

4 participants