When entering an excel document and extracting the document, if there are multiple sheets in excel, other sheets cannot be read #12895

Closed
5 tasks done
dshwei opened this issue Jan 21, 2025 · 2 comments
Labels
🌚 invalid (This doesn't seem right), outdated (version smaller than 0.13.2)

Comments

dshwei commented Jan 21, 2025

Self Checks

  • This is only for bug reports. If you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit the issue in English, otherwise it will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.11.2

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Upload an Excel document that contains multiple sheets and run document extraction. Only the first sheet is read; the other sheets are not.

✔️ Expected Behavior

When an Excel document is loaded and parsed, all sheets in the document should be extracted.

❌ Actual Behavior

Currently, document extraction reads only the first sheet of an Excel file, so workbooks with multiple sheets are not handled.


dosubot bot commented Jan 21, 2025

To extract and read all sheets from an Excel document with multiple sheets, you can use the ExcelExtractor class. This class is designed to handle both .xlsx and .xls file formats and iterates over all sheets in the Excel file. Here's a brief overview of how it works:

For .xlsx files:

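import pandas as pd
from openpyxl import load_workbook

# Excerpt from inside the extractor; self._file_path is the Excel file's path
wb = load_workbook(self._file_path, data_only=True)
for sheet_name in wb.sheetnames:
    sheet = wb[sheet_name]
    data = sheet.values
    try:
        # The first row of the sheet is used as the column header
        cols = next(data)
    except StopIteration:
        # Empty sheet: skip it and move on to the next one
        continue
    df = pd.DataFrame(data, columns=cols)
    # Drop rows in which every cell is empty
    df.dropna(how="all", inplace=True)
    # Process each row in the DataFrame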

For .xls files:

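import pandas as pd

# Excerpt from inside the extractor; legacy .xls files are read through xlrd
excel_file = pd.ExcelFile(self._file_path, engine="xlrd")
for excel_sheet_name in excel_file.sheet_names:
    df = excel_file.parse(sheet_name=excel_sheet_name)
    # Drop rows in which every cell is empty
    df.dropna(how="all", inplace=True)
    # Process each row in the DataFrame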

This code iterates over each sheet in the workbook, reads the data into a DataFrame, and processes each row, allowing you to extract data from all sheets in the Excel document [1].
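
The per-sheet logic described above (use the first row as the header, skip sheets with no rows, drop rows whose cells are all empty) can be sketched with the standard library only. This is an illustration, not Dify's implementation: `extract_sheets` and the sample `workbook` are hypothetical names, and plain tuples stand in for what `sheet.values` would yield.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Row = Tuple[Any, ...]


def extract_sheets(
    sheets: Iterable[Tuple[str, Iterable[Row]]],
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (sheet_name, record) for every non-empty row of every sheet.

    Hypothetical helper: each sheet is a (name, rows) pair, where rows
    plays the role of openpyxl's sheet.values.
    """
    for sheet_name, rows in sheets:
        data = iter(rows)
        try:
            cols = next(data)  # first row is treated as the header
        except StopIteration:
            continue  # sheet has no rows: skip it, keep iterating
        for row in data:
            if all(cell is None for cell in row):
                continue  # mirrors dropna(how="all")
            yield sheet_name, dict(zip(cols, row))


# Sample data (made up for illustration): two populated sheets and one empty
workbook = [
    ("Sheet1", [("name", "qty"), ("apple", 3), (None, None)]),
    ("Empty", []),
    ("Sheet2", [("name", "qty"), ("pear", 5)]),
]
records = list(extract_sheets(workbook))
```

The point of the sketch is that the empty sheet hits `continue` rather than ending the loop, so rows from sheets after it are still yielded.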


@dosubot (bot) added the 🐞 bug (Something isn't working) label Jan 21, 2025
@crazywoola changed the title from "When entering an excel document and extracting the document, if there are multiple sheets in excel, other sheets cannot be read, only the first sheet can be read" (with the same sentence repeated in Chinese) to "When entering an excel document and extracting the document, if there are multiple sheets in excel, other sheets cannot be read" Jan 21, 2025
crazywoola (Member) commented Jan 21, 2025

This feature is already supported; it was added in PR #11215.
This should be a feature request rather than a bug report. If you are submitting a bug report, please list the necessary steps to reproduce the error and the expected behavior instead of typing the same thing three times.

@crazywoola added the 🌚 invalid (This doesn't seem right) and outdated (version smaller than 0.13.2) labels and removed the 🐞 bug (Something isn't working) label Jan 21, 2025