Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_excel prevent from further save of the Excel file while python session is running #2465

Closed
GentilsTo opened this issue Nov 19, 2020 · 4 comments · Fixed by #2514
Closed
Assignees
Labels
bug 🦗 Something isn't working

Comments

@GentilsTo
Copy link

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 20H2
  • Modin version (modin.__version__): 0.8.2
  • Python version: Python 3.8.3

An really simple example file to reproduce:
test.xlsx

  • Code we can use to reproduce:
import modin.pandas as pd
df = pd.read_excel('test.xlsx')

Then open the Excel file (if not already open), and try to save it.

Describe the problem

Contrary to behavior in Pandas, using pd.read_excel with modin, prevent any further saving of the Excel files afterwards, until the python session is closed.
While trying to do it in Excel, get multiple error pop-up talking about "sharing issue", given some random number as Excel file then.
For my use case, it's quite handy to get the Excel file and the python session open in the same time, loading the data, working on it with pandas, and potentially (especially during developpement) making small change in Excel and reloading them right away in python. Also, that doesn't seem like a safe behavior and might hide something bigger ? don't know.

Source code / logs

image
image
image
(sorry it's in french)

@GentilsTo GentilsTo added the bug 🦗 Something isn't working label Nov 19, 2020
@devin-petersohn
Copy link
Collaborator

Hi @GentilsTo, thanks for posting!

We should make sure that Modin is not hanging on to any open file handles just in case, but I cannot see any cases of this at a glance in the code. It is a perfectly reasonable workflow to do this, and is safe as long as you are not saving a new copy and loading at the same time.

Out of curiosity, what kinds of changes do you make to the excel files? We are prototyping an excel interface that shares the execution, and I'm interested to understand your workflow because it seems like it would fit this interface quite well.

@vnlitvinov
Copy link
Collaborator

vnlitvinov commented Dec 5, 2020

I think this is caused by pandas-dev/pandas#29803, because in current implementation we just use pandas.read_excel() in each worker.

Even though it's still open, I think the issue itself was closed in pandas-dev/pandas#30096 and pandas-dev/pandas#32544.

All that said, this definitely needs further investigation, because we use Pandas 1.1.4 which should have the fix.

Scratch that, it only applies to single-threaded reading. We're having issues with parallel read here.

@vnlitvinov
Copy link
Collaborator

Self-contained reproducer:

import pandas
import modin.pandas as pd
from pandas.util._test_decorators import check_file_leaks
import os

@check_file_leaks
def reproduce(name):
    return pd.read_excel(name)

def main():
    df = pd.DataFrame({'a': [1,2], 'b': [3,4]})
    name = '2465-test.xlsx'
    df.to_excel(name)
    try:
        df2 = reproduce(name)
        assert df2.equals(df)
    finally:
        os.unlink(name)

if __name__ == '__main__':
    main()

Currently fails like this:

>python 2465.py
UserWarning: Distributing <class 'dict'> object. This may take some time.
UserWarning: `DataFrame.to_excel` defaulting to pandas implementation.
To request implementation, send an email to feature_requests@modin.org.
UserWarning: Distributing <class 'pandas.core.frame.DataFrame'> object. This may take some time.
UserWarning: Parallel `read_excel` is a new feature! Please email bug_reports@modin.org if you run into any problems.
Traceback (most recent call last):
  File "2465.py", line 15, in main
    df2 = reproduce(name)
  File "C:\miniconda3\envs\foo\lib\site-packages\pandas\util\_test_decorators.py", line 254, in new_func
    assert flist2 == flist
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "2465.py", line 21, in <module>
    main()
  File "2465.py", line 18, in main
    os.unlink(name)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '2465-test.xlsx'

@vnlitvinov
Copy link
Collaborator

vnlitvinov commented Dec 5, 2020

After digging further I believe this is a bug / design decision (as implied by SO post) of openpyxl: https://foss.heptapod.net/openpyxl/openpyxl/-/issues/1450

vnlitvinov added a commit to vnlitvinov/modin that referenced this issue Dec 5, 2020
…eak files

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>
vnlitvinov added a commit to vnlitvinov/modin that referenced this issue Dec 5, 2020
… if needed

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>
vnlitvinov added a commit to vnlitvinov/modin that referenced this issue Dec 7, 2020
Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>
vnlitvinov added a commit to vnlitvinov/modin that referenced this issue Dec 7, 2020
… psutil limitation

Signed-off-by: Vasilij Litvinov <vasilij.n.litvinov@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🦗 Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants