Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SEGFAULT on macOS with the latest version of watchdog #762

Closed
bstaletic opened this issue Feb 11, 2021 · 17 comments · Fixed by #763
Closed

SEGFAULT on macOS with the latest version of watchdog #762

bstaletic opened this issue Feb 11, 2021 · 17 comments · Fixed by #763

Comments

@bstaletic
Copy link
Contributor

Unfortunately, this will be tough to describe, since I don't have a mac and only my CI is complaining.

I'm using watchdog as a submodule and today tried to update it. One specific test consistently makes pytest segfault.

This is the log from a SEGFAULTING run: https://dev.azure.com/borisstaletic/3ce92110-caa5-4c49-b8c3-44a433da676b/_apis/build/builds/1535/logs/64
The way I'm using watchdog is by talking to, in this case, jdt.ls. When I open this file it sends... these patterns:

[
  {
    "globPattern": "**/*.java"
  },
  {
    "globPattern": "**/.project"
  },
  {
    "globPattern": "**/.classpath"
  },
  {
    "globPattern": "**/.settings/*.prefs"
  },
  {
    "globPattern": "**/src/**"
  },
  {
    "globPattern": "**/*.gradle"
  },
  {
    "globPattern": "**/gradle.properties"
  },
  {
    "globPattern": "**/pom.xml"
  },
  {
    "globPattern": "/home/bstaletic/work/ycmd/ycmd/tests/java/testdata/simple_eclipse_project",
    "kind": 4
  }
]

Which we then edit, like this: https://github.com/ycm-core/ycmd/blob/master/ycmd/completers/language_server/language_server_completer.py#L590-L591

And finally use that with this handler: https://github.com/ycm-core/ycmd/blob/master/ycmd/completers/language_server/language_server_completer.py#L3219

Any further information, as well as running the CI, I can gladly provide.

@BoboTiG
Copy link
Collaborator

BoboTiG commented Feb 12, 2021

Thanks for the report.

For the history, you are talking about the release v2.0.0.

Now, that will be a though one to debug 🤔 Do you think you can reduce the code to ease reproducting the issue? I guess this is releated to some C code that was modified in FSEvents. Maybe @samschott or @CCP-Aporia could have a look at changes introduced in v2.0.0 to see if there is a missing refcount change or something like that?

@bstaletic
Copy link
Contributor Author

I do have an idea how to reduce the code. What I think would repro:

  1. Have directory structure as the simple_eclipse_project
  2. Prepare watchdog with hardcoded patterns that jdt.ls would use.
  3. Start jdt.ls, so that it triggers some events.
  4. Wait.
  5. Hopefully segfault.

I'll try that later today.

@samschott
Copy link
Contributor

@bstaletic, could you possibly get the crash report from the CI runner? It should be the latest .crash file in ~/Library/Logs/DiagnosticReports/.

@puremourning
Copy link

I repro'd this locally. this is what crash reporter says: http://vpaste.net/XouKt

Python_2021-02-12-104711_BeniMac2020.txt

Python_2021-02-12-105901_BeniMac2020.txt

@puremourning
Copy link

BTW the reproduction occurs when we add raise RuntimeError( 'test' ) in a thread that's handling IO events from a subprocess stdout. not sure if that helps explain it.

@samschott
Copy link
Contributor

samschott commented Feb 12, 2021

Thanks @puremourning, that's very helpful! Could you also provide a minimum example for us to reproduce the crash?

If I am reading the report correctly, the segfault happens because NULL gets passed to CFString_AsPyUnicode here:

path = CFString_AsPyUnicode(cf_path);

@CCP-Aporia, would you agree?

Now from the FSEvents docs, I would not have expected the event path to ever be null. But maybe the actual issue lies earlier in the code at:

py_event_paths = PyList_New(num_events);
py_event_inodes = PyList_New(num_events);
py_event_flags = PyList_New(num_events);
py_event_ids = PyList_New(num_events);
if (G_NOT(event_path_info_array_ref && py_event_flags && py_event_ids))
{
Py_XDECREF(py_event_ids);
Py_XDECREF(py_event_flags);
return /*NULL*/;
}

This check looks a bit odd to me. Shouldn't it be something like the following instead:

    /* Convert event flags and paths to Python ints and strings. */
    py_event_paths = PyList_New(num_events);
    py_event_inodes = PyList_New(num_events);
    py_event_flags = PyList_New(num_events);
    py_event_ids = PyList_New(num_events);
    if (G_NOT(py_event_paths && py_event_inodes && py_event_flags && py_event_ids))
    {
        Py_XDECREF(py_event_paths);
        Py_XDECREF(py_event_inodes);
        Py_XDECREF(py_event_flags);
        Py_XDECREF(py_event_ids);
        return /*NULL*/;
    }

@CCP-Aporia
Copy link
Contributor

Hey, that's a great a great bug report - thanks a lot for all the details!

@samschott - I agree that the crash likely happens because of cf_path being a nullptr. CFDictionaryGetValue may return NULL "if no key-value pair matching key exists". Presumably there is no kFSEventStreamEventExtendedDataPathKey item available for the given path info, and we need to cater for that scenario.

About the if (G_NOT( block ... yeah, that also looks like it's not covering everything in its condition and is not freeing up everything. But at the same time, if PyList_New fails then we have bigger problems anyway; in general it only fails when we're out of memory on a machine.

@samschott
Copy link
Contributor

@CCP-Aporia, thanks for the explanation. I was wondering which scenario the G_NOT check was guarding against.

Presumably there is no kFSEventStreamEventExtendedDataPathKey item available for the given path info, and we need to cater for that scenario.

It appears like that. I am still confused as to why this would happen and how this is related to raising a RuntimeError in a different thread which handles the stdout from a subprocess.

But if there is a scenario where cf_path can be NULL, we clearly need to accommodate that.

@puremourning
Copy link

It appears like that. I am still confused as to why this would happen and how this is related to raising a RuntimeError in a different thread which handles the stdout from a subprocess.

Turns out that it's probably not.

The thread that is throwing the exception is actually the same thread that created the observer objects. In this case the throwing of the exception causes us to destroy the observers:

        observer = Observer()
        observer.schedule( self._watchdog_factory( globs ),
                           self._project_directory,
                           recursive = True )
        observer.start()
        self._observers.append( observer )

later (triggered by the exception)...

    for observer in self._observers:
      observer.stop()
      observer.join()

@puremourning
Copy link

TL;DR - I think the crash happens when there are multiple events in the callback, but I don't know why.

I am able to reproduce this with a debug version of watchdog and a debugger attached.

It seems that it crashes every time it enters this code:

    if (G_IS_NULL(c_string_ptr)) {
        cf_string_length = CFStringGetLength(cf_string_ref);
        CFStringGetCString(cf_string_ref, c_string_buff, cf_string_length, 0);
        py_string = PyUnicode_FromString(c_string_buff);
    } else {

Neither cf_path nor cf_inode are NULL:

Watches: ----
Expression: i
  - Result: 1
Expression: path_info_dict
  - Result: 0x00007ff8815e6f70
    + [0] (__lldb_autogen_nspair): {key:0x00007fff88f72fb8, value:0x00007ff8815e0ef0}
    + [1] (__lldb_autogen_nspair): {key:0x00007fff88f72fd8, value:0x428cef0e47b0896d}
    + [raw] (): CFDictionaryRef
Expression: cf_path
  - Result: 0x00007ff8815e0ef0
Expression: cf_inode
  - Result: 0x428cef0e47b0896d
Expression: cf_string_length
 *- Result: 156
Expression: num_events
 *- Result: 17

Screenshot 2021-02-16 at 11 58 31

If it helps, these are the flags for the one that causes it to break:

Expression: event_flags[i]
 *- Result: 66048

The only useful other thing that I notice is that it crashes consistently when i > 0 (actually always when i == 1). Perhaps this is useful, I couldn't see anything specifically in the code that meant this was obviously broken for event lists with more than 1 entry, but you guys would have a better idea than me.

@samschott
Copy link
Contributor

samschott commented Feb 16, 2021

@puremourning, this is very helpful, thanks! It also explains why the segfault didn't show up in our testing: the code block which you show is only called as a backup when CFStringGetCStringPtr returns NULL. I suspect that the string length is incorrectly calculated there.

Could you try applying this patch and see if it fixes the problem:

diff --git a/src/watchdog_fsevents.c b/src/watchdog_fsevents.c
index e9f4e6aced800936af2778cc046fa6382dceae13..92f3ed0c7e067bda0f362cf022d7af2f99c5f69e 100644
--- a/src/watchdog_fsevents.c
+++ b/src/watchdog_fsevents.c
@@ -293,17 +293,27 @@ static void watchdog_pycapsule_destructor(PyObject *ptr)
  */
 PyObject * CFString_AsPyUnicode(CFStringRef cf_string_ref)
 {
-    CFIndex cf_string_length;
-    char *c_string_buff = NULL;
+
+    if (G_IS_NULL(cf_string_ref)) {
+        return PyUnicode_FromString("");
+    }
+
     const char *c_string_ptr;
     PyObject *py_string;
 
-    c_string_ptr = CFStringGetCStringPtr(cf_string_ref, 0);
+    c_string_ptr = CFStringGetCStringPtr(cf_string_ref, kCFStringEncodingUTF8);
 
     if (G_IS_NULL(c_string_ptr)) {
-        cf_string_length = CFStringGetLength(cf_string_ref);
-        CFStringGetCString(cf_string_ref, c_string_buff, cf_string_length, 0);
-        py_string = PyUnicode_FromString(c_string_buff);
+        CFIndex length = CFStringGetLength(cf_string_ref);
+        CFIndex max_length = CFStringGetMaximumSizeForEncoding(length, kCFStringEncodingUTF8) + 1;
+        char *buffer = (char *)malloc(max_length);
+        if (CFStringGetCString(cf_string_ref, buffer, max_length, kCFStringEncodingUTF8)) {
+            py_string = PyUnicode_FromString(buffer);
+        }
+        else {
+            py_string = PyUnicode_FromString("");
+        }
+        free(buffer);
     } else {
         py_string = PyUnicode_FromString(c_string_ptr);
     }

@puremourning
Copy link

applying that patch does indeed resolve the crash for the test case ! thanks.

@samschott
Copy link
Contributor

samschott commented Feb 16, 2021

Great! I'll submit a PR with the changes.

It would still be good to have an example which we can test against, if only to prevent future regressions. But this might be difficult.

Do you know which file path caused the segfault? It will likely fulfil two criteria:

  1. The number of bytes to hold the encoded string exceeds the number of unicode characters.
  2. CFStringGetCStringPtr returns NULL.

The second criterion is pretty vague. According to the documentation for CFStringGetCStringPtr:

Whether or not this function returns a valid pointer or NULL depends on many factors, all of which depend on how the string was created and its properties. In addition, the function result might change between different releases and on different platforms. So do not count on receiving a non-NULL result from this function under any circumstances.

@bstaletic
Copy link
Contributor Author

The number of bytes to hold the encoded string exceeds the number of unicode characters.

If I'm not mistaken, the watched directory is this one. When our tests are run, jdt.ls creates a directory in there named target in which there's only one file with unicode characters.

simple_eclipse_project/target/classes/com/youcompleteme/Test$TéstClass.class

That thing is the compiled version of this class.

@samschott
Copy link
Contributor

samschott commented Feb 16, 2021

Yes, that is it. I can reproduce the segfault with that path. Many thanks for your help in debugging this :)

BoboTiG pushed a commit that referenced this issue Feb 17, 2021
…762)

* fix converting CFStringRef to PyUnicode

* fix handling when creation of PyList fails

* added test case for string conversion

* declare c_string_ptr in-line

* remove trailing slash from tmpdir

* use decomposed e-acute in filename for tests

this prevents unicode normalisation on HFS+
@BoboTiG
Copy link
Collaborator

BoboTiG commented Feb 17, 2021

Thanks everyone for quick analysis and resolution, I love OSS for that kind of actions. ❤️ 🍾

I will release v2.0.1 with the patch.

@BoboTiG
Copy link
Collaborator

BoboTiG commented Feb 17, 2021

Release v2.0.1 online.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants