
UX: rework stdout error when ray fails to start #35378

Open
mattip wants to merge 1 commit into master

Conversation

@mattip (Contributor) commented May 16, 2023

Why are these changes needed?

When ray fails to start the dashboard, it prints out the last 20 lines of the dashboard.log file. Here is what it printed when I had a problem with grpcio¹:

2023-05-16 00:38:43,039	ERROR services.py:1197 -- Failed to start the dashboard , return code -11
2023-05-16 00:38:43,040	ERROR services.py:1222 -- Error should be written to 'dashboard.log' 
or 'dashboard.err'. We are printing the last 20 lines for you. See 
'https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure'
to find where the log file is.
2023-05-16 00:38:43,040	ERROR services.py:1266 -- 
The last 20 lines of /tmp/ray/session_2023-05-16_00-38-41_610459_85992/logs/dashboard.log 
(it contains the error message from the dashboard): 
2023-05-16 00:38:42,899	WARNING head.py:255 -- `prometheus_client` not found, so metrics will not be exported.

2023-05-16 00:38:43,157	INFO worker.py:1625 -- Started a local Ray instance.
[2023-05-16 00:38:43,781 E 85992 85992] core_worker.cc:191: Failed to register worker 0100000
0ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable
to register worker with raylet. No such file or directory

There are a few UX tweaks that can be made to the message:

  • The log file was only 2 lines long, not 20, so the hard-coded number is not helpful, and it appears twice in the message.
  • The first line of the log file was not printed.
  • The code knows whether it read 'dashboard.log' or 'dashboard.err', so there is no need to be indecisive (see the sketch after this list).
  • The link is not needed; the full path of the log file is already printed.

Note that in my case the error was elsewhere, so perhaps this whole effort to print the dashboard.log should be a little less certain that the error is actually there.
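For illustration only, here is a minimal sketch of the intended wording; the helper name report_dashboard_failure and the variables returncode, log_path, and lines are placeholders of mine, not the actual diff in this PR:

import logging

logger = logging.getLogger(__name__)


def report_dashboard_failure(returncode, log_path, lines):
    # Sketch only (not the PR's code): name the file that was actually read
    # and report the real number of lines instead of a hard-coded 20; the
    # full path is shown, so no separate documentation link is needed.
    logger.error("Failed to start the dashboard, return code %s", returncode)
    logger.error(
        "The last %d lines of %s (it may contain an error message from the dashboard):\n%s",
        len(lines),
        log_path,
        "\n".join(lines),
    )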

After this PR, the same failure prints:

2023-05-16 10:04:17,920 ERROR services.py:1197 -- Failed to start the dashboard , return code -11
2023-05-16 10:04:17,920 ERROR services.py:1248 -- Error should be written to 'dashboard.log' We are
printing the last lines
2023-05-16 10:04:17,920 ERROR services.py:1268 -- 
The last 2 lines of /tmp/ray/session_2023-05-16_10-04-16_493412_122666/logs/dashboard.log (it may
contain an error message from the dashboard): 
2023-05-16 10:04:17,780 INFO head.py:135 -- Dashboard head grpc address: 0.0.0.0:4289
2023-05-16 10:04:17,782 WARNING head.py:255 -- `prometheus_client` not found, so metrics will not be
exported.
2023-05-16 10:04:18,038 INFO worker.py:1625 -- Started a local Ray instance.
[2023-05-16 10:04:18,662 E 122666 122666] core_worker.cc:191: Failed to register worker
01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable
to register worker with raylet. No such file or directory

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Footnotes

  1. This appeared while updating the conda recipe for v2.4.0 in resync grpcio requirements (conda-forge/ray-packages-feedstock#98). I am still searching for the root cause; it is probably related to the last line, with a bogus worker number and an IOError.

@xieus commented May 25, 2023

@rkooo567 could you review this PR?

@@ -1258,6 +1254,11 @@ def read_log(filename, lines_to_read):
"Ray github. "
"https://github.com/ray-project/ray/issues"
)
logger.error(
"Error should be written to 'dashboard.log' "
"We are printing the last lines"
Contributor:

can you keep the link to the log directory?

Contributor:

Suggested change
"We are printing the last lines"
"We are printing the last {len(lines)} lines"

Contributor Author (mattip):

Both the log directory and the number of lines are already present a few lines lower. I think they should only appear once:

The last 2 lines of /tmp/ray/session_2023-05-16_10-04-16_493412_122666/logs/dashboard.log
(it may contain an error message from the dashboard): 

lines = []
with open(dashboard_log, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        end = mm.size()
        for _ in range(lines_to_read):
            sep = mm.rfind(b"\n", 0, end - 1)
            if sep == -1:
                if end > 1:
                    lines.append(mm[: end - 1].decode("utf-8"))
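The snippet above is truncated in the diff view. For context, a self-contained sketch of the same backwards-reading idea looks roughly like this; the helper name tail_lines is mine, not the PR's, and the empty-file guard is an assumption added on top of the code shown above:

import mmap
import os


def tail_lines(path, lines_to_read):
    """Return up to the last `lines_to_read` lines of `path`, oldest first.

    Assumes every line, including the last, ends with a newline,
    which is how log files are normally written.
    """
    if os.path.getsize(path) == 0:
        return []  # mmap cannot map an empty file
    lines = []
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = mm.size()
            for _ in range(lines_to_read):
                # Search backwards for the newline that precedes the current tail.
                sep = mm.rfind(b"\n", 0, end - 1)
                if sep == -1:
                    # No earlier newline: the remaining prefix is the first line.
                    if end > 1:
                        lines.append(mm[: end - 1].decode("utf-8"))
                    break
                lines.append(mm[sep + 1 : end - 1].decode("utf-8"))
                end = sep + 1
    return list(reversed(lines))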
Contributor:

can you add a unit test for this case?

Contributor Author (mattip):

The point is to include the first line of the log, which was missing before this PR. Note that the line

2023-05-16 10:04:17,780 INFO head.py:135 -- Dashboard head grpc address: 0.0.0.0:4289

is missing. Where should I add a test? Is there currently a test for this part of the code that I can extend?
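For what it's worth, a narrowly scoped unit test against a temporary file (rather than a full failed-startup scenario) could check exactly the behaviour discussed here, i.e. that the first line is included. This pytest-style layout and the tail_lines helper from the sketch above are assumptions, not existing Ray test code:

def test_tail_includes_first_line(tmp_path):
    # tmp_path is pytest's built-in temporary-directory fixture.
    log = tmp_path / "dashboard.log"
    log.write_text(
        "Dashboard head grpc address: 0.0.0.0:4289\n"
        "`prometheus_client` not found, so metrics will not be exported.\n"
    )

    lines = tail_lines(str(log), lines_to_read=20)

    # The file is shorter than the requested 20 lines; both lines,
    # including the very first one, must be returned.
    assert len(lines) == 2
    assert lines[0].startswith("Dashboard head grpc address")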

@rkooo567 added the @author-action-required label on Jun 1, 2023.
stale bot commented Jul 19, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale bot added the stale label on Jul 19, 2023.
@rkooo567 removed the stale label on Jul 19, 2023.
@rkooo567 (Contributor) commented:

@mattip is it ready to review again?

@mattip (Contributor Author) commented Jul 19, 2023

Yes, I am still waiting for guidance on how to write a test. I would need to check the output of the worker when it starts up and cannot access the HEAD node.

stale bot commented Sep 17, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale bot added the stale label on Sep 17, 2023.
@h-vetinari commented Oct 13, 2023

It would be great if this could be merged (test or not); it would help debug a long-standing issue (#35383) where we're stuck on not knowing what's going wrong at all. Perhaps after debugging that issue, it'll be easier to write a test.

Signed-off-by: mattip <matti.picus@gmail.com>
stale bot removed the stale label on Oct 13, 2023.
@mattip (Contributor Author) commented Oct 13, 2023

I rebased off master; maybe that will help move the review forward.


stale bot commented Dec 15, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

stale bot added the stale label on Dec 15, 2023.
Labels
@author-action-required (the PR author is responsible for the next step; remove the tag to send the PR back to the reviewer) · stale (the issue is stale; it will be closed within 7 days unless there is further conversation)
4 participants