Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telemetry server randomly crashes with segmentation fault #51

Closed
nunoguedelha opened this issue Oct 25, 2021 · 29 comments
Closed

Telemetry server randomly crashes with segmentation fault #51

nunoguedelha opened this issue Oct 25, 2021 · 29 comments
Assignees
Labels
bug Something isn't working

Comments

@nunoguedelha
Copy link
Collaborator

Error message log:

$ npm start

> iCubTelemVizServer@1.0.0 start /Users/nunoguedelha/dev/yarp-openmct/iCubTelemVizServer
> . ${NVM_DIR}/nvm.sh; nvm use v4.2.2; node ${NODE_DEBUG_OPTION} iCubTelemVizServer.js

Now using node v4.2.2 (npm v2.14.7)
iCub Telemetry server launched!
[INFO] |yarp.os.Port| Port /yarpjs/inertial:i active at tcp://192.168.1.100:10117/
[INFO] |yarp.os.impl.PortCoreInputUnit| Receiving input from /icubSim/inertial to /yarpjs/inertial:i using tcp
[INFO] |yarp.os.Port| Port /yarpjs/left_leg/stateExt:o active at tcp://192.168.1.100:10118/
[INFO] |yarp.os.impl.PortCoreInputUnit| Receiving input from /icubSim/left_leg/stateExt:o to /yarpjs/left_leg/stateExt:o using tcp
[INFO] |yarp.os.Port| Port /yarpjs/camLeftEye:i active at tcp://192.168.1.100:10119/
[INFO] |yarp.os.Port| Port /yarpjs/camRightEye:i active at tcp://192.168.1.100:10120/
[INFO] |yarp.os.Port| Port /yarpjs/left_leg/cartesianEndEffectorWrench:i active at tcp://192.168.1.100:10121/
[INFO] |yarp.os.impl.PortCoreInputUnit| Receiving input from /wholeBodyDynamics/left_leg/cartesianEndEffectorWrench:o to /yarpjs/left_leg/cartesianEndEffectorWrench:i using tcp
[INFO] |yarp.os.Port| Port /yarpjs/right_leg/cartesianEndEffectorWrench:i active at tcp://192.168.1.100:10122/
[INFO] |yarp.os.impl.PortCoreInputUnit| Receiving input from /wholeBodyDynamics/right_leg/cartesianEndEffectorWrench:o to /yarpjs/right_leg/cartesianEndEffectorWrench:i using tcp
[INFO] |yarp.os.Port| Port /yarpjs/battery/data:i active at tcp://192.168.1.100:10123/
[INFO] |yarp.os.Port| Port /yarpjs/sysCmdsGenerator/rpc active at tcp://192.168.1.100:10124/
{ status: 'OK',
  message: 'Opem-MCT static server process started.' }
ICubTelemetry History hosted at http://localhost:8081/history
ICubTelemetry Realtime hosted at ws://localhost:8081/realtime
listening on http://localhost:3000
[OPEN-MCT STATIC SERVER] stdout: Now using node v14.17.0 (npm v6.14.13)

[OPEN-MCT STATIC SERVER] ipc: {"pid":53686}
[OPEN-MCT STATIC SERVER] stdout: iCub Telemetry Visualizer (Open MCT based) hosted at http://localhost:8080

sh: line 1: 53399 Segmentation fault: 11  node ${NODE_DEBUG_OPTION} iCubTelemVizServer.js

npm ERR! Darwin 19.6.0
npm ERR! argv "/Users/nunoguedelha/.nvm/versions/node/v4.2.2/bin/node" "/Users/nunoguedelha/.nvm/versions/node/v4.2.2/bin/npm" "start"
npm ERR! node v4.2.2
npm ERR! npm  v2.14.7
npm ERR! code ELIFECYCLE
npm ERR! iCubTelemVizServer@1.0.0 start: `. ${NVM_DIR}/nvm.sh; nvm use v4.2.2; node ${NODE_DEBUG_OPTION} iCubTelemVizServer.js`
npm ERR! Exit status 139
npm ERR!
npm ERR! Failed at the iCubTelemVizServer@1.0.0 start script '. ${NVM_DIR}/nvm.sh; nvm use v4.2.2; node ${NODE_DEBUG_OPTION} iCubTelemVizServer.js'.
npm ERR! This is most likely a problem with the iCubTelemVizServer package,
npm ERR! not with npm itself.
npm ERR! Tell the author that this fails on your system:
npm ERR!     . ${NVM_DIR}/nvm.sh; nvm use v4.2.2; node ${NODE_DEBUG_OPTION} iCubTelemVizServer.js
npm ERR! You can get their info via:
npm ERR!     npm owner ls iCubTelemVizServer
npm ERR! There is likely additional logging output above.

npm ERR! Please include the following file with any support request:
npm ERR!     /Users/nunoguedelha/dev/yarp-openmct/iCubTelemVizServer/npm-debug.log
@nunoguedelha nunoguedelha self-assigned this Oct 25, 2021
@nunoguedelha nunoguedelha added the bug Something isn't working label Oct 25, 2021
@nunoguedelha
Copy link
Collaborator Author

Today it has been difficult to reproduce the issue.

recurrent error:

Every time it occurred, the terminal error log is always the same as the one in the description.

npm ERR! Exit status 139

Node.js exited with error 139. As per the exit codes in https://nodejs.org/api/process.html#exit-codes, the error code = (128+11),

Signal Exits: If Node.js receives a fatal signal such as SIGKILL or SIGHUP, then its exit code will be 128 plus the value of the signal code.

...so the actual signal code is 11, which matches the same segmentation we see in the terminal log.

Which process died?

That segmentation fault occurred in the Node.js main process running the script iCubTelemVizServer.js. Actually, after the fatal error shuts down the process, the child process running the Open MCT Static Server openmctStaticServer/server.js is still on and we can still switch between Telemetry Domain Elements in the visualizer interface.

npm log content

The generated npm-debug.log is always the same, so the very same problem is causing this segmentation fault.

9 verbose stack Exit status 139
9 verbose stack     at EventEmitter.<anonymous> (/Users/nunoguedelha/.nvm/versions/node/v4.2.2/lib/node_modules/npm/lib/utils/lifecycle.js:214:16)
9 verbose stack     at emitTwo (events.js:87:13)
9 verbose stack     at EventEmitter.emit (events.js:172:7)
9 verbose stack     at ChildProcess.<anonymous> (/Users/nunoguedelha/.nvm/versions/node/v4.2.2/lib/node_modules/npm/lib/utils/spawn.js:24:14)
9 verbose stack     at emitTwo (events.js:87:13)
9 verbose stack     at ChildProcess.emit (events.js:172:7)
9 verbose stack     at maybeClose (internal/child_process.js:818:16)
9 verbose stack     at Process.ChildProcess._handle.onexit (internal/child_process.js:211:5)

The stack log isn't deep enough as it starts at the callback onexit (9 verbose stack at Process.ChildProcess._handle.onexit...), so too late. We'll try to get more insight using breakpoint.

@nunoguedelha
Copy link
Collaborator Author

I wasn't able to break at the segmentation fault using a breakpoint in internal/child_process.js. the breakpoint works when the intermediate child process (running runModule.sh I suppose) stops, but later on, when the segmentation fault occurs, the process just ends without stopping at the breakpoint.

Here is the breakpoint location w.r.t. the function call tree we expect when the segmentation fault occurs:

Process.ChildProcess._handle.onexit (internal/child_process.js:211:5)
    => maybeClose(subprocess) (internal/child_process.js:818:16)
        => subprocess.emit('close', subprocess.exitCode, subprocess.signalCode);
            => ChildProcess.emit (events.js:172:7)
               ...
                  => procError(er) (lifecycle.js:212)

Another way is to add a listener to the signal SIGSEGV to display stack information or other context information. Although it is not recommended to ad such a listener (https://nodejs.org/api/process.html#signal-events)...

'SIGBUS', 'SIGFPE', 'SIGSEGV' and 'SIGILL', when not raised artificially using kill(2), inherently leave the process in a state from which it is not safe to call JS listeners. Doing so might cause the process to stop responding.

...it's worth trying.

There is actually a package that kind of does the same thing: segfault-handler (https://httptoolkit.tech/blog/how-to-debug-node-segfaults/). We first install the segmentation fault handler:

npm install segfault-handler

then add the following lines to the very beginning of the main application script:

const SegfaultHandler = require('segfault-handler');
SegfaultHandler.registerHandler('crash.log');

The module listens for any SIGSEGV signal, and reports the detailed stack trace that caused it before the process shuts down.

@nunoguedelha
Copy link
Collaborator Author

Segfault Handler output

After the segmentation fault occurs, we get the following output log:

PID 61103 received SIGSEGV for address: 0x1ca4be280
0   segfault-handler.node               0x00000001015d40b0 _ZL16segfault_handleriP9__siginfoPv + 304
1   libsystem_platform.dylib            0x00007fff6d0685fd _sigtramp + 29
2   ???                                 0x00000001030288c8 0x0 + 4345465032
3   libYARP_os.3.4.3.dylib              0x0000000153392252 _ZNK4yarp2os6Bottle8toStringEv + 18
4   YarpJS.node                         0x00000001059120f2 _ZN26YarpJS_BufferedPort_Bottle16_callback_onReadERNSt3__16vectorIN2v85LocalINS2_5ValueEEENS0_9allocatorIS5_EEEE + 50
5   YarpJS.node                         0x0000000105914a9a _ZN15YarpJS_CallbackI19YarpJS_BufferedPortIN4yarp2os6BottleEEE20_internal_worker_endEP9uv_work_si + 90
6   node                                0x0000000100768f02 uv__work_done + 175
7   node                                0x000000010076a9f4 uv__async_event + 65
8   node                                0x000000010076ab77 uv__async_io + 136
9   node                                0x0000000100778f96 uv__io_poll + 1554
10  node                                0x000000010076b02b uv_run + 321
11  node                                0x000000010064825c _ZN4node5StartEiPPc + 583
12  node                                0x0000000100001834 start + 52
13  ???                                 0x0000000000000002 0x0 + 2
sh: line 1: 61103 Segmentation fault: 11  node ${NODE_DEBUG_OPTION} iCubTelemVizServer.js

So, the issue occurs in YarpJS.node, which is the C++ Addon generated by the yarp.js module implementing the Yarp Javascript bindings and socket bridge.

More precisely, the issue occurs in:

<yarp.js repo>/YarpJS/src/YarpJS_BufferedPort_Bottle.cpp:
    --> YarpJS_BufferedPort_Bottle::_callback_onRead(...)
        --> YarpJS_Bottle::ToString()

@nunoguedelha
Copy link
Collaborator Author

nunoguedelha commented Nov 25, 2021

I realised that the problem might be related to the Yarp version used during the server installation and the one used to run the server are different.

YarpJS.node was generated installation of the servers (using a conda environment "A")and depends on the library libYARP_os.3.4.3.dylib, from Yarp 3.4.3. At least in my case, I'm running the same telemetry server with a different conda environment "B").

I'm doing some further checks:

  • Run the setup with only one environment ("A")
  • Get more information from the stack trace (*)
  • Check the library dependencies in YarpJS.node => no specific YARP version specified in the CMakeLists.txt

(*) Using https://www.npmjs.com/package/segfault-handler (bsed on the project https://github.com/JochenKalmbach/StackWalker) and objdump command to disassemble and understand the stack.

@traversaro
Copy link
Contributor

Not sure if it can be useful to try to run the server under valgrind or DrMemory to understand if there is some wrong memory access that with low probability results in a segfault.

@nunoguedelha
Copy link
Collaborator Author

nunoguedelha commented Nov 30, 2021

I'm doing in parallel:

  • The analysis of the YarpJS_BufferedPort_Bottle::_callback_onRead and yarp::os::Bottle::toString()
  • Memory leak analysis using Valgrind or DrMemory

@nunoguedelha
Copy link
Collaborator Author

WIP: Analyse YarpJS_BufferedPort_Bottle::_callback_onRead

4   YarpJS.node                         0x00000001059120f2 _ZN26YarpJS_BufferedPort_Bottle16_callback_onReadERNSt3__16vectorIN2v85LocalINS2_5ValueEEENS0_9allocatorIS5_EEEE + 50

Disassemble YarpJS.node:

$ objdump -dS --line-numbers ./node_modules/YarpJS/build/Release/YarpJS.node

=> address 0x120c0, __ZN26YarpJS_BufferedPort_Bottle16_callback_onReadERNSt3__16vectorIN2v85LocalINS2_5ValueEEENS0_9allocatorIS5_EEEE

Demangled format...

$ objdump -dS --demangle --line-numbers ./node_modules/YarpJS/build/Release/YarpJS.node
00000000000120c0 YarpJS_BufferedPort_Bottle::_callback_onRead(std::__1::vector<v8::Local<v8::Value>, std::__1::allocator<v8::Local<v8::Value> > >&):
   120c0: 55                            pushq   %rbp
   120c1: 48 89 e5                      movq    %rsp, %rbp
   120c4: 41 57                         pushq   %r15
   120c6: 41 56                         pushq   %r14
   ...
0000000000007030 non-virtual thunk to yarp::os::Bottle::toString() const:
    7030: 55                            pushq   %rbp
    7031: 48 89 e5                      movq    %rsp, %rbp
    7034: 53                            pushq   %rbx
    7035: 50                            pushq   %rax
    7036: 48 89 fb                      movq    %rdi, %rbx
    7039: 48 8b 76 28                   movq    40(%rsi), %rsi
    703d: e8 4e b0 06 00                callq   438350 <__ZNK4yarp2os4impl10BottleImpl8toStringEv>
    7042: 48 89 d8                      movq    %rbx, %rax
    7045: 48 83 c4 08                   addq    $8, %rsp
    7049: 5b                            popq    %rbx
    704a: 5d                            popq    %rbp
    704b: c3                            retq
    704c: 0f 1f 40 00                   nopl    (%rax)

Don't know yet how to generate the source code to better understand the failure...

@nunoguedelha
Copy link
Collaborator Author

nunoguedelha commented Nov 30, 2021

First Memory leak analysis using Valgrind or DrMemory

It was easier to go for Valgrind.

It seems that Valgrind official release (latest is not yet supported on MacOS Catalina (10.15). But there is a modified fork fixing that problem (https://github.com/LouisBrunner/valgrind-macos), which can be installed from source by downloading the repository and building or via homebrew package.

Installation from source:

  1. git clone https://github.com/LouisBrunner/valgrind-macos.git
  2. cd valgrind
  3. sudo ./autogen.sh
  4. sudo ./configure --prefix=/where/you/want/it/installed --enable-only64bit
  5. copy the [contents][1] to ./coregrind/m_mach/mach_basics.c
  6. sudo make
  7. sudo make install

Installation via homebrew:

brew tap LouisBrunner/valgrind
brew install --HEAD LouisBrunner/valgrind/valgrind

Using Valgring requires installing the target modules (YARP) with debugging symbols. So, after building the whole superbuild in Release mode, we build YARP in Debug mode...

cmake --build . --config Debug --target YARP

  • Update Homebrew on machine.
  • Update Conda installation and install latest packages dependencies of the superbuild.
    conda update -n base conda
    conda activate base
    conda update all
    
  • Install the superbuild (tag 2021.08.1) Release config.
  • Re-build and re-install YARP sub-project with debug symbols.
    cmake --build . --config Debug --target YARP
    
  • Reproduce the issue with this build.
  • Install Valgrind
  • Re-install YarpJS in debug mode:
    • Install yarp-openmct dependency YarpJS from local repository. Change package.json from yarp-openmct acordingly:
    "dependencies": {
        ...
        "YarpJS": "<local-path-to-yarp.js>"
    },
    • Set debug flag in cmake-js command in YarpJS's package.json:
    "scripts": {
        "install": "cmake-js -D"
    },
    • Install iCubTelemetry module npm install.
  • Run the memory analysis:
    valgrind --leak-check=full --show-leak-kinds=all -s node iCubTelemVizServer.js
    

⚠️ running valgrind on npm start instead of node iCubTelemVizServer.js will give no results as npm start runs just an intermediate shell which in turn runs the node process.

Valgrind output

...
...
==97351==
==97351== LEAK SUMMARY:
==97351==    definitely lost: 2,192 bytes in 62 blocks
==97351==    indirectly lost: 2,074 bytes in 22 blocks
==97351==      possibly lost: 1,288 bytes in 8 blocks
==97351==    still reachable: 5,249 bytes in 130 blocks
==97351==         suppressed: 33,290 bytes in 140 blocks
==97351==
==97351== ERROR SUMMARY: 16 errors from 16 contexts (suppressed: 10 from 9)
--97351--
--97351-- used_suppression:     16 OSX1015:dyld-7 /usr/local/Cellar/valgrind/HEAD-3568e0b/libexec/valgrind/default.supp:781 suppressed: 19,696 bytes in 102 blocks
--97351-- used_suppression:      3 OSX1015:dyld-4 /usr/local/Cellar/valgrind/HEAD-3568e0b/libexec/valgrind/default.supp:751 suppressed: 10,296 bytes in 3 blocks
--97351-- used_suppression:     20 OSX1015:dyld-8 /usr/local/Cellar/valgrind/HEAD-3568e0b/libexec/valgrind/default.supp:791 suppressed: 2,754 bytes in 25 blocks
--97351-- used_suppression:      5 OSX1015:dyld-5 /usr/local/Cellar/valgrind/HEAD-3568e0b/libexec/valgrind/default.supp:761 suppressed: 544 bytes in 10 blocks
--97351-- used_suppression:      2 OSX1015:dyld-2 /usr/local/Cellar/valgrind/HEAD-3568e0b/libexec/valgrind/default.supp:733
==97351==
==97351== ERROR SUMMARY: 16 errors from 16 contexts (suppressed: 10 from 9)
Abort trap: 6

I have to rule out false errors, so for now the output doesn't seem very helpful.

@nunoguedelha
Copy link
Collaborator Author

ping @traversaro @S-Dafarra

@traversaro
Copy link
Contributor

If we are blocked on this, could it make sense to work on robotology/yarp.js#19 and hope that the situation changes/improves with that? Not really a rational move, but sometimes working on something else leads to fixes to other issues (so-called "serendipity").

@nunoguedelha
Copy link
Collaborator Author

Yes, good point. I actually prepared for that as I duplicated my setups, having one for resuming the other development/fixes, and one for running the memory tests on the background.

@nunoguedelha
Copy link
Collaborator Author

I'm currently configuring CLion setup for debugging and memory leek analysis. This tool has the approximately the same workflow than Webstorm and integrates already Valgrind, so it's quite promising, even for future eventual issues.

@nunoguedelha
Copy link
Collaborator Author

I'm currently configuring CLion setup for debugging and memory leek analysis. This tool has the approximately the same workflow than Webstorm and integrates already Valgrind, so it's quite promising, even for future eventual issues.

Refer to issue robotology/yarp.js#31 and particularly robotology/yarp.js#31 (comment).

@nunoguedelha
Copy link
Collaborator Author

Valgrind Configuration

Preferences > Build, Execution, Deployment > Dynamic Analysis Tools > Valgrind:

image

@nunoguedelha
Copy link
Collaborator Author

nunoguedelha commented Jan 4, 2022

Valgrind Memceck Analysis of YarpJS Target

Configuration

We setup a Run/Debug configuration of type "CMake Application" with the following settings:

  • target library code YarpJS.node C++ adon implementing the Javascript-C++ bindings
  • executable (has to be the node.js process running the javascript server application): ~/.nvm/versions/node/v14.17.0/bin/node
  • node argument iCubTelemVizServer.js
  • Remove the default "before launch" operation build for not building at each run.

image

Run

Run the YarpJS target with Valgrind Memcheck (Run > Run 'YarpJS' with Valgring Memcheck or second button from the right as depicted below):

image

After a couple of hours running, no segmentation fault occurred so the program was stopped manually and take a look at the warnings in the Valgrind pane. We can see warnings in the "InvalidRead" and "Leak_DefinitelyLost" sections.

  • "Leak_DefinitelyLost" section:
    image
  • "InvalidRead" section:
    image

We can see that the memory leak and access problem occurs in the function call YarpJS_BufferedPort_Bottle::_callback_onRead() -> yarp::os::Bottle::toString().

@S-Dafarra
Copy link
Member

S-Dafarra commented Jan 4, 2022

I think that a memory leak does not explain a segmentation fault (unless you run out of memory I suppose). What other warnings are there? Anything related to uninitialized memory for example?

Edit:

The "InvalidRead" seems more juicy instead.

Also it might be worth compiling in RelWithDebInfo. If you are running in debug it is possible that the memory is initialized with zeros, that might explain why the segfault is not happening after hours of usage.

@nunoguedelha
Copy link
Collaborator Author

Yes, the invalid read was the warning that most worried me, I just did't complete the description in the above comment yet. Anyway, the point is that the issue is occurring in that function call sequence.

Also it might be worth compiling in RelWithDebInfo. If you are running in debug it is possible that the memory is initialized with zeros, that might explain why the segfault is not happening after hours of usage.

Indeed, but I had already reproduced the segfault in the past with the debug build. I'll try anyway with RelWithDebInfo to see if it improves. It might also be related to how the memory is aligned.

@nunoguedelha
Copy link
Collaborator Author

nunoguedelha commented Jan 10, 2022

I was able to reproduce the segmentation fault after loading the Yarp project under Xcode and attaching to the Node.js process. The debugger breaked on the line causing the segmentation fault occurring after the call sequence:

YarpJS_BufferedPort_Bottle::_callback_onRead() -> yarp::os::Bottle::toString() -> yarp::os::Impl::BottleImpl::toString()

image

Analysing...

@nunoguedelha
Copy link
Collaborator Author

I will open later an issue describing the procedure to setup Xcode.

@nunoguedelha
Copy link
Collaborator Author

I was able to reproduce the segmentation fault after loading the Yarp project under Xcode and attaching to the Node.js process. The debugger breaked on the line causing the segmentation fault occurring after the call sequence:

YarpJS_BufferedPort_Bottle::_callback_onRead() -> yarp::os::Impl::BottleImpl::toString()
image

Analysing...

Quick initial analysis...

In line 322 we are dereferencing an invalid address since content size is 0.

Storable& s = *content[i];

content[0] should contain a valid pointer to yarp::os::impl::Storable, but instead has a some random value.

By the way, in line 318

for (unsigned int i = 0; i < content.size(); i++)

i=0 implies that i does fulfill the for loop condition. It's been a while I don't do C++, so I have a doubt here, but I think we shouldn't go into the loop right? @S-Dafarra @traversaro

@traversaro
Copy link
Contributor

I was able to reproduce the segmentation fault after loading the Yarp project under Xcode and attaching to the Node.js process. The debugger breaked on the line causing the segmentation fault occurring after the call sequence:

YarpJS_BufferedPort_Bottle::_callback_onRead() -> yarp::os::Impl::BottleImpl::toString()
image Analysing...

Quick initial analysis...

In line 322 we are dereferencing an invalid address since content size is 0.

Storable& s = *content[i];

content[0] should contain a valid pointer to yarp::os::impl::Storable, but instead has a some random value.

By the way, in line 318

for (unsigned int i = 0; i < content.size(); i++)

i=0 implies that i does fulfill the for loop condition. It's been a while I don't do C++, so I have a doubt here, but I think we shouldn't go into the loop right? @S-Dafarra @traversaro

This is correct, at least as long as content.size() is 0.

@S-Dafarra
Copy link
Member

S-Dafarra commented Jan 10, 2022

I think there is something fishy. You wrote that in #51 (comment) the size is 0, but still you get into the for loop, and you shouldn't. Infact i=0 does not satisfy i < content.size() if content.size() == 0, hence, you should not get into the foor loop. If you do, maybe the method size() is also returning garbage, meaning that also content is garbage. So probably, there is where it crashes, but probably the problem is a level above in the stack trace. I wonder if the BottleImpl object on which you are calling toString is valid or not.

@nunoguedelha
Copy link
Collaborator Author

I think there is something fishy. You wrote that in #51 (comment) the size is 0, but still you get into the for loop, and you shouldn't. Infact i=0 does not satisfy i < content.size() if content.size() == 0, hence, you should not get into the foor loop.

yes that was what stroke me most.

If you do, maybe the method size() is also returning garbage...I wonder if the BottleImpl object on which you are calling toString is valid or not.

I have the full function call sequence before the segfault in Xcode environment, so we can check all that.

@nunoguedelha
Copy link
Collaborator Author

Just a clarification with respect to a previous discussion we had during the last software engineering update:

the problem is not related to argv being in a local variable used in the function call tmp_arguments.push_back(cons->NewInstance(Nan::GetCurrentContext(), argc, argv).ToLocalChecked());, as the segfault occurs before that:

image

@nunoguedelha
Copy link
Collaborator Author

nunoguedelha commented Jan 11, 2022

I was observing a bottle content always of size 0 even before the segfault, just randomly breaking on the Read callback, so I crosschecked the setup by debugging the code in CLion through the project YarpJS, while using the conda view where I built YARP alone. This time CLion gets the full trace and breaks in the BottleImpl::toString() function as expected:

image

We can indeed see garbage in the content vector.

But this was a nested print s.toStringNested() -> ... -> BottleImpl::toString(). Earlier in the stack, there is another call to BottleImpl::toString():

image

  • Although the content size is 12, the index reaches 18.
"(0.0132306088536312625031 -0.0174941918296984133541 0.00310490714804459054174 0.00594535884887671412652 -0.00481470955559468605928 -0.00200226795433027654481) [ok] (9.28886226916454660341 -4.38552476668789292802 -0.413779535225290651113 -3.32214321589641325971 -0.747183437784831094852 1.41577756042523827951) [ok] (0.0 0.0 0.0 0.0 0.0 0.0) [ok] (0.0 0.0 0.0 0.0 0.0 0.0) [fail] (0.0 0.0 0.0 0.0 0.0 0.0) [fail] (0.0 0.0 0.0 0.0 0.0 0.0) [fail] (-22.1830123044724203396 -3.65663312748801505592 -4.52793511387240332766 -13.0736765879589000861 3.06625141844129212032 -1.79803874509720595931) [ok] (-22.1830123044724203396 -3.65663312748801505592 -4.52793511387240332766 -13.0736765879589000861 3.06625141844129212032 -1.79803874509720595931) [ok] (-22.1830123044724203396 -3.65663312748801505592 -4.52793511387240332766 -13.0736765879589000861) DR9\U00000001\0\0\0\xd0d\xa1\U00000010\U00000001\0\0\0P\v\xa1\U00000010\U00000001\0\0\0 \x8a\xa9\U00000004\U00000001\0\0\0\xd6r\xe2q\xff\U0000007f\0\0\xe0 "
  • we can see corrupted data in the last two element groups 17 (nested structure):
(-22.1830123044724203396 -3.65663312748801505592 -4.52793511387240332766 -13.0736765879589000861) DR9\U00000001\0\0\0\xd0d\xa1\U00000010\U00000001\0\0\0P\v\xa1\U00000010\U00000001\0\0\0 \x8a\xa9\U00000004\U00000001\0\0\0\xd6r\xe2q\xff\U0000007f\0\0\xe0
  • as per what we get from running in the terminal:

    $ node
    Welcome to Node.js v14.17.0.
    Type ".help" for more information.
    > var yarp = require('./yarp')
    undefined
    > var port = yarp.portHandler.open('/yarpjs/left_leg/stateExt:i')
    [INFO] |yarp.os.Port| Port /yarpjs/left_leg/stateExt:i active at tcp://xxx.xxx.xxx.xxx:xxxxx/
    undefined
    > port.onRead(function (bottle){console.log(bottle.toArray())})
    undefined
    > yarp.Network.connect('/icubSim/left_leg/stateExt:o','/yarpjs/left_leg/stateExt:i')
    [INFO] |yarp.os.impl.PortCoreInputUnit| Receiving input from /icubSim/left_leg/stateExt:o to /yarpjs/left_leg/stateExt:i using tcp
    undefined
    > 
    

    ...there are 20 elements to read:

    • 1: (0.0132306088536312625031 -0.0174941918296984133541 0.00310490714804459054174 0.00594535884887671412652 -0.00481470955559468605928 -0.00200226795433027654481)
    • 2: [ok]
    • 3: (9.28886226916454660341 -4.38552476668789292802 -0.413779535225290651113 -3.32214321589641325971 -0.747183437784831094852 1.41577756042523827951)
    • ...
      So I was expecting to see content size 20 and max index 19.
  • content[18] returns float value 0 instead of a Bottle and we end up calling NULL->toString and SIGFAULT.

I wonder why content size varies in this simple test. We are reading the port from iCub model in a Gazebo simulation, so I would expect a fixed size.

Analyzing now what size() returns...

@nunoguedelha
Copy link
Collaborator Author

Another occurrence of the issue gives:
image

image

image

The elements of the content vector should always be, in this case, one of the following:

  • yarp::os::impl::StoreList for a nested Bottle,
  • yarp::os::impl::StoreVocab for a VOCAB,
  • yarp::os::impl::StoreInt32 or yarp::os::impl::StoreFloat64 for numeric values.

...but neve yarp::os::impl::Storable whcih is the parent more generic class.

We shall analyze if the data was corrupted when the process actually reads the port.

@nunoguedelha
Copy link
Collaborator Author

Note

I'm reproducing the issue with a minimalist javascript code executed under the node command line interface:

var yarp = require('./yarp')
var port = yarp.portHandler.open('/yarpjs/left_leg/stateExt:i')
port.onRead(function (bottle){console.log(bottle.toArray())})
yarp.Network.connect('/icubSim/left_leg/stateExt:o','/yarpjs/left_leg/stateExt:i')

@nunoguedelha
Copy link
Collaborator Author

Analyzing now what size() returns...

This became irrelevant as we can have corrupted data at the very beginning of the content vector.

@nunoguedelha
Copy link
Collaborator Author

Moving the issue to robotology/yarp.js#32 since we can reproduce it without running yarp-openmct server.

@traversaro @S-Dafarra @RiccardoGrieco

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants