Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: OSError: [Errno 24] Too many open files #769

Open
charliedream1 opened this issue Apr 29, 2024 · 5 comments
Open

BUG: OSError: [Errno 24] Too many open files #769

charliedream1 opened this issue Apr 29, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@charliedream1
Copy link

Describe the bug

A clear and concise description of what the bug is.

To Reproduce

To help us to reproduce this bug, please provide information below:

  1. Your Python version
  2. The version of Xorbits you use
  3. Versions of crucial packages, such as numpy, scipy and pandas
  4. Full stack of the error.
  5. Minimized code to reproduce the error.

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Add any other context about the problem here.

I met this issue again. Please help to solve.

===================================
I create a dataframe

N1 N2
query ori_txt
text len 10-50 300-5000

running code is as below:

import xorbits.pandas as pd
from xorbits.experimental import dedup

data_lst = [{'query': 'xxx',  'ori_txt': 'xxx'}, {'query': 'xxx',  'ori_txt': 'xxx'}]
df = pd.DataFrame(data_lst])
res = dedup(df, col="query", method="minhash", threshold=threshold,
            num_perm=num_perm, min_length=min_length, ngrams=ngrams, seed=seed,
            verbose=True)
print('ori len: ', len(data_lst))
print('dedup len: ', len(res['query'].tolist()))

data_lst contains 10k+ data.

===================================
I got error as below:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/forkserver.py", line 258, in main
    fds = reduction.recvfds(s, MAXFDS_TO_SEND + 1)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError
Traceback (most recent call last):
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xorbits/_mars/deploy/oscar/session.py", line 1954, in get_default_or_create
    session = new_session("127.0.0.1", init_local=True, **kwargs)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xorbits/_mars/deploy/oscar/session.py", line 1924, in new_session
    session = SyncSession.init(
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xorbits/_mars/deploy/oscar/session.py", line 1550, in init
    isolated_session = fut.result()
  File "/home//miniconda3/envs/train_py310/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home//miniconda3/envs/train_py310/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xorbits/_mars/deploy/oscar/session.py", line 775, in init
    await new_cluster_in_isolation(
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xorbits/_mars/deploy/oscar/local.py", line 101, in new_cluster_in_isolation
    await cluster.start()
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xorbits/_mars/deploy/oscar/local.py", line 344, in start
    await self._start_worker_pools()
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xorbits/_mars/deploy/oscar/local.py", line 386, in _start_worker_pools
    worker_pool = await create_worker_actor_pool(
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xorbits/_mars/deploy/oscar/pool.py", line 310, in create_worker_actor_pool
    return await create_actor_pool(
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/api.py", line 179, in create_actor_pool
    return await get_backend(scheme).create_actor_pool(
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/backends/indigen/backend.py", line 49, in create_actor_pool
    return await create_actor_pool(
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/backends/pool.py", line 1585, in create_actor_pool
    pool: MainActorPoolType = await pool_cls.create(
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/backends/pool.py", line 1282, in create
    processes, ext_addresses = await cls.wait_sub_pools_ready(tasks)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/backends/indigen/pool.py", line 221, in wait_sub_pools_ready
    process, status = await task
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/backends/indigen/pool.py", line 213, in start_sub_pool
    return await create_pool_task
  File "/home//miniconda3/envs/train_py310/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/backends/indigen/pool.py", line 203, in start_pool_in_process
    process.start()
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/context.py", line 300, in _Popen
    return Popen(process_obj)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/popen_forkserver.py", line 58, in _launch
    f.write(buf.getbuffer())
BrokenPipeError: [Errno 32] Broken pipe
2024-04-12 13:59:09,343 asyncio      671836 ERROR    Task was destroyed but it is pending!
task: <Task pending name='Task-10' coro=<MainActorPoolBase.monitor_sub_pools() running at /home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/backends/pool.py:1458> wait_for=<Future pending cb=[Task.task_wakeup()]>>
2024-04-12 13:59:09,344 asyncio      671836 ERROR    Task exception was never retrieved
future: <Task finished name='Task-402' coro=<MainActorPool.start_sub_pool() done, defined at /home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/backends/indigen/pool.py:180> exception=OSError(24, 'Too many open files')>
Traceback (most recent call last):
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/backends/indigen/pool.py", line 213, in start_sub_pool
    return await create_pool_task
  File "/home//miniconda3/envs/train_py310/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/site-packages/xoscar/backends/indigen/pool.py", line 203, in start_pool_in_process
    process.start()
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/context.py", line 300, in _Popen
    return Popen(process_obj)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/popen_forkserver.py", line 35, in __init__
    super().__init__(process_obj)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/popen_forkserver.py", line 51, in _launch
    self.sentinel, w = forkserver.connect_to_new_process(self._fds)
  File "/home//miniconda3/envs/train_py310/lib/python3.10/multiprocessing/forkserver.py", line 87, in connect_to_new_process
    with socket.socket(socket.AF_UNIX) as client:
  File "/home//miniconda3/envs/train_py310/lib/python3.10/socket.py", line 232, in __init__
    _socket.socket.__init__(self, family, type, proto, fileno)
OSError: [Errno 24] Too many open files

===================================
I did operations as below:

sudo sysctl -w fs.file-max=100000
ulimit -S -n 1048576

It still gives out error. Error comes from len(res['query'].tolist()), so how to parse the result.

Thanks for your reply.

@XprobeBot XprobeBot added the bug Something isn't working label Apr 29, 2024
@XprobeBot XprobeBot added this to the v0.7.3 milestone Apr 29, 2024
@qinxuye
Copy link
Contributor

qinxuye commented Apr 29, 2024

@Hank0626 Can you look at this issue?

@Hank0626
Copy link
Contributor

@charliedream1 May I ask if you are using multiple machines together for the task?

@charliedream1
Copy link
Author

only one machine

@Hank0626
Copy link
Contributor

Could you please provide me with your data for testing?

@charliedream1
Copy link
Author

charliedream1 commented Apr 29, 2024

I used private data. You may reproduce with some open source LLM sft data. Data format is as below:

I create a dataframe

N1 N2
query ori_txt
text len 10-50 300-5000

@XprobeBot XprobeBot modified the milestones: v0.7.3, v0.7.4 Aug 22, 2024
@luweizheng luweizheng removed this from the v0.7.4 milestone Dec 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

5 participants