
ENH: Having pandas.read_excel FASTER (with an available proof of concept) #47290

Closed
Sanix-Darker opened this issue Jun 9, 2022 · 4 comments
Labels: Enhancement · IO Excel (read_excel, to_excel) · Performance (memory or execution speed)

Comments

@Sanix-Darker commented Jun 9, 2022

Is your feature request related to a problem?

I wish pandas.read_excel were faster.

Describe the solution you'd like

pandas.read_excel could get faster if we use the engines' iterators. For these tests I introduced a new offset parameter that is passed all the way down to the engines' iterator functions, so rows can be skipped inside the engine itself.

API breaking implications

It should not break the current API.

Describe alternatives you've considered

For now, this is the only approach I have found.

Additional context

Basically, offset keeps the column names and can be used like this:
[Screenshot: example call using the proposed offset parameter]
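
A minimal sketch of the intended call (the offset parameter only exists in the linked proof-of-concept branch; the file name and numbers are placeholders):

import pandas as pd

# Read 100 data rows starting 3000 rows into the sheet, while still
# using the sheet's first row as the column header.
df = pd.read_excel("benchmark_5000.xlsx", nrows=100, offset=3000)
print(df.head())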

LIVE DEMO

[Animated screen recording omitted]

BENCHMARKS

The benchmarks were run with:

import pandas as pd
from timeit import default_timer


def bench_mark_func():
    for ext in ["xls", "xlsx"]:
        print(f"\n[{ext}] old way, no nrows, nor skiprows :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(f"benchmark_5000.{ext}")
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] actual way, with nrows and sometime skiprows :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(
                f"benchmark_5000.{ext}", nrows=50 + i, skiprows=100 * (++i)
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] new way, with nrows and offset :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(
                f"benchmark_5000.{ext}", nrows=50 + i, offset=100 * (++i)
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print("==" * 50)


if __name__ == "__main__":
    bench_mark_func()
OUTPUT
===============================================================================================
[xls] old way, no nrows, nor skiprows :
[xls] done in 7.315936979022808
******************************

[xls] actual way, with nrows and skiprows :
[xls] done in 6.850192359997891
******************************

[xls] new way, with nrows and offset :
[xls] done in 5.909526361967437
******************************
===============================================================================================

[xlsx] old way, no nrows, nor skiprows :
[xlsx] done in 39.742386338009965
******************************

[xlsx] actual way, with nrows and skiprows :
[xlsx] done in 31.731780430010986
******************************

[xlsx] new way, with nrows and offset :
[xlsx] done in 25.8184904170339
******************************
==============================================================================================

I made a proof of concept for this change, available here: https://github.com/Sanix-Darker/pandas/pull/1

Sanix-Darker added the Enhancement and Needs Triage labels on Jun 9, 2022
Sanix-Darker changed the title from "ENH: Having pandas.read_excel FASTER by adding an offset attribute (with an available proof of concept)" to "ENH: Having pandas.read_excel FASTER (with an available proof of concept)" on Jun 9, 2022
@ahawryluk (Contributor) commented

If I understand correctly, there are two topics here:

  1. How do we skip the first n rows of data without losing the header?
  2. Can we improve the performance in this scenario?

For the first question I'd suggest pd.read_excel('myfile.xlsx', skiprows=range(1, 3001)) rather than adding a new option. And for the second question the answer is yes, the performance can be improved at least a little bit, as you've seen and shown. The current implementation stops reading the file once it has satisfied nrows (if supplied), but it passes all skipped rows to TextParser, which adds a little bit of extra work in the case you've shown here. It's certainly a reasonable enhancement idea to make read_excel smart enough to do all the row skipping itself, and then not even have to pass skip_rows on to the TextParser.
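
For example, a minimal sketch of that call (myfile.xlsx and the row counts are placeholders):

import pandas as pd

# Row 0 is kept as the header; data rows 1 through 3000 are skipped.
df = pd.read_excel("myfile.xlsx", skiprows=range(1, 3001))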

I'm a bit curious how you came across the scenario. You want to load a small portion of the data, and you're unhappy with the performance. So I hope this isn't a silly question, but are you by any chance calling read_excel multiple times on the same file for different ranges of rows? If so, then you should load the whole table once and access subsets of it with indexing methods such as df.iloc[3000:]. Again, sorry if that's a silly question.
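
For reference, that pattern would look something like this (a sketch with made-up row ranges):

import pandas as pd

# Parse the file once...
df = pd.read_excel("myfile.xlsx")

# ...then take whatever slices you need with positional indexing,
# which is cheap compared to re-reading the file.
last_rows = df.iloc[3000:]
middle_rows = df.iloc[1000:1500]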

@Sanix-Darker (Author) commented


Thanks for the quick response and your feedback.

  • For the first point, you're right; I figured that out too!

  • For the second point:

    It's certainly a reasonable enhancement idea to make read_excel smart enough to do all the row skipping itself, and then not even have to pass skip_rows on to the TextParser.

    Does this mean that an implementation in that direction would be accepted at this point (I can propose a pull request for it)?

    how you came across the scenario

    While working with pandas.read_excel, I wanted to see if I could make it faster: on large files I noticed some latency compared to using openpyxl directly, so I looked at the code to see whether it could be optimized.

    Again, sorry if that's a silly question.

    Don't worry, it's not a silly question :)

@ahawryluk (Contributor) commented


Yes, I think that a pull request that

  • improves performance and
  • preserves the current function signature and behaviour

would be welcome. I'd suggest that you compare the speed of the main branch and your feature branch in at least three cases:

  1. loading an entire worksheet
  2. loading just the first few rows
  3. loading a few rows near the end

and ideally for all four backends: xls, xlsx, xlsb, and ods. For example, #46894 was a recent change that improved case 2.
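
A sketch of such a comparison (to be run once on each branch; the test file names are placeholders, and the xlsb/ods cases assume the optional pyxlsb and odfpy engine dependencies are installed):

import pandas as pd
from timeit import default_timer


def time_case(label, **kwargs):
    # Time one read_excel scenario across all four backends.
    start = default_timer()
    for ext in ["xls", "xlsx", "xlsb", "ods"]:
        pd.read_excel(f"benchmark_5000.{ext}", **kwargs)
    print(f"{label}: {default_timer() - start:.3f}s")


time_case("1. entire worksheet")
time_case("2. first few rows", nrows=10)
time_case("3. rows near the end", skiprows=range(1, 4901), nrows=50)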

A more experienced contributor than me would have to judge if the performance improvement is big enough to justify any increase in code complexity. I've found the devs to be quite helpful for us newer contributors.

mzeitlin11 added the Performance and IO Excel labels and removed the Needs Triage label on Oct 20, 2022
@mroeschke (Member) commented

Thanks for the issue, but it appears this hasn't gotten traction in a while, so closing.
