Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Domain usage broken in 0.12.x and later #4359

Closed
okeuday opened this issue Dec 20, 2015 · 21 comments
Closed

Domain usage broken in 0.12.x and later #4359

okeuday opened this issue Dec 20, 2015 · 21 comments
Labels
domain Issues and PRs related to the domain subsystem.

Comments

@okeuday
Copy link

okeuday commented Dec 20, 2015

I understand the contents of #66 . This issue is not meant to focus on the removal of domain. Instead, it is meant to report broken behavior when using domain with node.js versions 0.12.x and later when using the source code https://github.com/CloudI/CloudI/blob/develop/src/api/javascript/CloudI.js (the integration test which is unable to function is https://github.com/CloudI/CloudI/blob/develop/src/tests/msg_size/msg_size.js).

I do not see a way to avoid domain usage (https://github.com/strongloop/zone doesn't support cluster/crypto/fs/process/tls/https/udp/zlib, async-listener doesn't support exceptions (nodejs/node-v0.x-archive#6011 (comment))) and with domain usage being broken, I do not see a way to fix the source code for node.js versions 0.12.x and higher.

If you recreate the failure in the msg_size integration test, you will see in the cloudi.log output that forwards are unable to take place. The exception created by the forward_ function call (in msg_size.js) is not seen as it was before, with domain in 0.10.x. The more complicated messaging integration test (https://github.com/CloudI/CloudI/blob/develop/src/tests/messaging/messaging.js) does not have problems running with node.js versions 0.12.x and higher.

@mscdex mscdex added the domain Issues and PRs related to the domain subsystem. label Dec 20, 2015
@Fishrock123
Copy link
Contributor

cc @misterdjules

@misterdjules
Copy link

@okeuday I haven't had the time to take a closer look at this yet, I'll do that in a few hours. Is this a regression from v0.12.8 to v0.12.9, or does this not work for any v0.12 version?

@okeuday
Copy link
Author

okeuday commented Dec 21, 2015

@misterdjules I have checked to determine the msg_size test doesn't work with v0.12.8. I am not sure about v0.12.x versions previous to 8. The output when the msg_size test doesn't work is similar to:

$ grep forward PREFIX/var/log/cloudi/cloudi.log 
 forward #1 erlang to /tests/msg_size/cxx (with timeout 600000 ms)
 forward #1 erlang to /tests/msg_size/java (with timeout 600000 ms)
  forward #2 c++ to /tests/msg_size/erlang (with timeout 599399 ms)
 forward #3 erlang to /tests/msg_size/cxx (with timeout 599298 ms)
  forward #4 c++ to /tests/msg_size/erlang (with timeout 599196 ms)
  forward #2 java to /tests/msg_size/erlang (with timeout 599398 ms)
 forward #5 erlang to /tests/msg_size/java (with timeout 599096 ms)
 forward #3 erlang to /tests/msg_size/javascript (with timeout 599292 ms)
  forward #6 java to /tests/msg_size/erlang (with timeout 598987 ms)
 forward #7 erlang to /tests/msg_size/perl (with timeout 598887 ms)
  forward #4 javascript to /tests/msg_size/erlang (with timeout 599190 ms)
  forward #8 perl to /tests/msg_size/erlang (with timeout 598787 ms)
 forward #9 erlang to /tests/msg_size/javascript (with timeout 598678 ms)

(when the msg_size test is working, the forward output is much larger, since it forwards until the timeout value expires) The PREFIX above is the prefix provided to the configure script for CloudI. The msg_size test is forwarding service requests between all the different CloudI API programming language implementations and it shows that it is unable to handle the exception used to return back to polling on its socket, within nodejs. CloudI is an application server for all the services, including the nodejs services.

@okeuday
Copy link
Author

okeuday commented Dec 21, 2015

@misterdjules I have confirmed that v0.12.0 works fine with the msg_size test. So, the problem is somewhere between 1 and 8.

@okeuday
Copy link
Author

okeuday commented Dec 21, 2015

@misterdjules The problem occurs with v0.12.2 but not with v0.12.1, so it is related to a change that was put into the v0.12.2 release.

@evanlucas
Copy link
Contributor

Could it have been f2a45ca?

That was landed right before 0.12.1 though

@okeuday
Copy link
Author

okeuday commented Dec 21, 2015

@evanlucas It shouldn't be, since v0.12.1 works dependably for the msg_size test. The domain.js file didn't change between the v0.12.1 tag and the v0.12.2 tag, so this must be related to C++ source code.

@okeuday
Copy link
Author

okeuday commented Dec 21, 2015

@evanlucas Perhaps it is f2a45ca , I misread the version information about it.

@okeuday
Copy link
Author

okeuday commented Dec 21, 2015

@evanlucas I really am not familiar with the node.js source code, however, I don't see how f2a45ca would be responsible due to the differences in the github repo based on the version tags v0.12.1 and v0.12.2. The main change that I have seen is moving deps/uv from version 1.0.2 to version 1.4.2 which would relate to socket use in the CloudI API. The other changes for the v0.12.2 release don't seem like they should be related, but only based on a brief glance without previous knowledge of the node.js source code.

@misterdjules
Copy link

@okeuday How can I run the msg_size.js test once I built CloudI with ./configure && make?

@okeuday
Copy link
Author

okeuday commented Dec 21, 2015

@misterdjules You need to do make install. The msg_size.js test is an integration test and they are all installed and ran by default, if all the programming languages are enabled. All the programming languages are enabled by default (you are not disabling them with configure flags). So, make install and cloudi start, then /usr/local/var/log/cloudi/cloudi.log has the output from the integration tests (msg_size included).

@misterdjules
Copy link

@okeuday When running cloudi start with node v0.12.1, here's what I get in the log file:

➜  src git:(develop) ✗ grep forward /usr/local/var/log/cloudi/cloudi.log
 forward #1 erlang to /tests/msg_size/cxx (with timeout 600000 ms)
 msg_size 0 requests/second forwarded for
 msg_size 4566.2 requests/second (during 0.0 seconds) forwarded for
 msg_size 0 requests/second forwarded for
➜  src git:(develop) ✗ 

Here's what I get with node v0.12.2:

➜  src git:(develop) ✗ grep forward /usr/local/var/log/cloudi/cloudi.log
 forward #1 erlang to /tests/msg_size/cxx (with timeout 600000 ms)
 forward #1 erlang to /tests/msg_size/java (with timeout 600000 ms)
  forward #2 c++ to /tests/msg_size/erlang (with timeout 599399 ms)
  forward #2 java to /tests/msg_size/erlang (with timeout 599399 ms)
 forward #3 erlang to /tests/msg_size/javascript (with timeout 599298 ms)
  forward #4 javascript to /tests/msg_size/erlang (with timeout 599197 ms)
 forward #3 erlang to /tests/msg_size/perl (with timeout 599289 ms)
  forward #4 perl to /tests/msg_size/erlang (with timeout 599188 ms)
 forward #5 erlang to /tests/msg_size/php (with timeout 599082 ms)
  forward #6 php to /tests/msg_size/erlang (with timeout 598981 ms)
➜  src git:(develop) ✗

So the output with v0.12.1 doesn't seem "larger" than with v0.12.2. It will be very time consuming for me to make sense of the whole cloudi code base, and to figure out what is expected and what is a bug.

So instead of having me use the whole cloudi code base, can you please isolate the msg_size test into the minimum amount of code that reproduces the regression that was introduced between v0.12.1 and v0.12.2, without using any dependency?

@okeuday
Copy link
Author

okeuday commented Dec 21, 2015

@misterdjules Your output from node v0.12.1 shows that CloudI didn't stay up due to one or more of the services failing initialization, so it was probably a file path problem for node as you were switching versions (since the file path to node is detected by the configure script and used in /usr/local/etc/cloudi/cloudi.conf).

There isn't an easy way to do this test outside of CloudI, since it is testing CloudI and its integration with the various programming languages. It is easy to remove all the integration tests, but msg_size within the configuration file (/usr/local/etc/cloudi/cloudi.conf) to avoid problems with other tests, if any occur (though there shouldn't be any problems with the other tests).

If it is too problematic to reproduce, I can try to narrow it down further. I have been compiling the node repository tagged at v0.12.2 with specific sections from the v0.12.1 tag, to try and see what is responsible for the problem. So far, it doesn't look like the deps/uv update is related to the problem.

@misterdjules
Copy link

If it is too problematic to reproduce, I can try to narrow it down further.

Yes, that would be very much appreciated.

@okeuday
Copy link
Author

okeuday commented Dec 22, 2015

@misterdjules The commit that causes the problem is 7e9d2f8 and is related to socket usage. Domain usage may be related to reproducing this problem, but it should be mainly a socket problem. I am not sure about how to fix it and have not attempted to revert the change on a particular (local) branch (to test).

@misterdjules
Copy link

@okeuday I'm not sure I understand how 7e9d2f8 is related to socket usage. Would you mind clarifying?

The Watchdog class is used to implement a watchdog that makes sure that a JavaScript script started with the vm module doesn't run indefinitely. AFAIK it doesn't have anything to do with sockets per se.

@okeuday
Copy link
Author

okeuday commented Dec 22, 2015

@misterdjules Both https://github.com/nodejs/node/commits/v0.12.2-release and git log --pretty=oneline v0.12.1...v0.12.2 show the commit sequence goes 0b5731a right before 7e9d2f8 . I tested 0b5731a and other commits before that, which allow the msg_size test to work without problems and 7e9d2f8 was the first commit in the release to break the msg_size test, if that sequence is correct.

I assumed it was related to sockets due to uv usage. However the file src/node_watchdog.cc didn't exist in the revision 0b5731a so I must not be looking at this properly?

@misterdjules
Copy link

@okeuday 0b5731a is a commit that was merged in the v0.12 branch as part of regular merges we used to do of the v0.10 branch into the v0.12 branch. That commit points to a tree that is node v0.10.x, not v0.12.x, which is why the file src/node_watchdog.cc doesn't exist in the tree for that commit. The commit that merged it into v0.12 is 0b4e67f.

When bisecting commits to find the one that caused a regression in v0.12 you should not use these v0.10 commits, only the merge commits that merged them into v0.12.

@okeuday
Copy link
Author

okeuday commented Dec 22, 2015

@misterdjules I used git bisect to find the commit between the tags v0.12.1 and v0.12.2 to find that 7a37910 is where the msg_size test breaks. So, this should be related to socket usage facilitated by libuv.

The main socket usage by the msg_size test is initialized at https://github.com/CloudI/CloudI/blob/develop/src/api/javascript/CloudI.js#L220-L223 (CloudI has an OS process forking the process used for executing node.js, creating a socket file descriptor for the javascript source code to use). Separate objects are being used for the same file descriptor due to problems doing it as a single socket object (which I don't remember). If this approach is not meant to work going forward (beyond the 7a37910 commit), the https://github.com/CloudI/CloudI/blob/develop/src/api/javascript/CloudI.js#L220-L223 code could change, as long as some approach may work.

@misterdjules
Copy link

@okeuday It's great that you found the commit that is responsible for the regression.

Now, libuv is responsible for many things other than providing an abstraction for network sockets. So if the problem was triggered by a libuv upgrade, I wouldn't narrow it down to a network socket issue just yet.

What would be great is to have a small program that reproduces the problem outside of Cloudi. Without such a thing, it will be difficult for us to help you find the source of the problem and fix it.

Because it's not clear what the problem is, I'll close this issue until we have more information. We can still assist you in coming up with that minimal script that reproduces the issue by continuing the discussion here or in #node-dev on Freenode.

@okeuday
Copy link
Author

okeuday commented Dec 23, 2015

I used the commit CloudI/CloudI@071d716 to fix the problem in the Javascript CloudI API and that works with node.js 4.2.3. I won't have time to write a separate test, but this information should be enough to help someone else do so. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain Issues and PRs related to the domain subsystem.
Projects
None yet
Development

No branches or pull requests

5 participants