ENH: Having pandas.read_excel FASTER (with an available proof of concept) #47290

Sanix-Darker · 2022-06-09T08:11:05Z

Is your feature request related to a problem?

I wish pandas read_excel could be faster.

Describe the solution you'd like

pandas.read_excel should get faster if we use engines iterators, here i created an offset variable for testing, that can go deeper to the engines, so for theese tests, there is a new parameter offset that goes deeply to the engine iterators functions.

API breaking implications

It should not break the actual API.

Describe alternatives you've considered

For now, this is the only way i found.

Additional context

Basically, the offset can keep the columns name and be use like this :

LIVE DEMO

BENCHMARKS

And as benchmarks are runned with :

import pandas as pd
from timeit import default_timer


def bench_mark_func():
    for ext in ["xls", "xlsx"]:
        print(f"\n[{ext}] old way, no nrows, nor skiprows :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(f"benchmark_5000.{ext}")
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] actual way, with nrows and sometime skiprows :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(
                f"benchmark_5000.{ext}", nrows=50 + i, skiprows=100 * (++i)
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print(f"\n[{ext}] new way, with nrows and offset :")
        start = default_timer()
        for i in range(100):
            df_xls = pd.read_excel(
                f"benchmark_5000.{ext}", nrows=50 + i, offset=100 * (++i)
            )
        print(f"[{ext}] done in {default_timer() - start}")
        print("*" * 30)

        print("==" * 50)


if __name__ == "__main__":
    bench_mark_func()

OUTPUT

===============================================================================================
[xls] old way, no nrows, nor skiprows :
[xls] done in 7.315936979022808
******************************

[xls] actual way, with nrows and skiprows :
[xls] done in 6.850192359997891
******************************

[xls] new way, with nrows and offset :
[xls] done in 5.909526361967437
******************************
===============================================================================================

[xlsx] old way, no nrows, nor skiprows :
[xlsx] done in 39.742386338009965
******************************

[xlsx] actual way, with nrows and skiprows :
[xlsx] done in 31.731780430010986
******************************

[xlsx] new way, with nrows and offset :
[xlsx] done in 25.8184904170339
******************************
==============================================================================================

I made a Proof Of Concept for this update available here : https://github.com/Sanix-Darker/pandas/pull/1

The text was updated successfully, but these errors were encountered:

ahawryluk · 2022-06-11T21:52:47Z

If I understand correctly, there are two topics here:

How do we skip the first n rows of data without losing the header?
Can we improve the performance in this scenario?

For the first question I'd suggest pd.read_excel('myfile.xlsx', skiprows=range(1, 3001)) rather than adding a new option. And for the second question the answer is yes, the performance can be improved at least a little bit, as you've seen and shown. The current implementation stops reading the file once it has satisfied nrows (if supplied), but it passes all skipped rows to TextParser, which adds a little bit of extra work in the case you've shown here. It's certainly a reasonable enhancement idea to make read_excel smart enough to do all the row skipping itself, and then not even have to pass skip_rows on to the TextParser.

I'm a bit curious how you came across the scenario. You want to load a small portion of the data, and you're unhappy with the performance. So I hope this isn't a silly question, but are you by any chance calling read_excel multiple times on the same file for different ranges of rows? If so, then you should load the whole table once and access subsets of it with indexing methods such as df.iloc[3000:]. Again, sorry if that's a silly question.

Sanix-Darker · 2022-06-13T07:40:17Z

Thanks for the quick response and your feedbacks.

For the first point, you're right, i figured out that too !
For the second point :

It's certainly a reasonable enhancement idea to make read_excel smart enough to do all the row skipping itself, and then not even have to pass skip_rows on to the TextParser.

Does this means, an implementation in that direction is a valid thing to do at this moment (i can propose a Pull request for that) ?

how you came across the scenario

While working/playing with pandas read_excel, i wanted to know if i can make it fast, because on large files, i noticed some latency compared to openpyxl, so i checked the code and i wanted to know if it can be optimize.

Again, sorry if that's a silly question.

Don't worry, it's not a silly question :)

ahawryluk · 2022-06-14T03:34:24Z

Yes, I think that a pull request that

improves performance and
preserves the current function signature and behaviour

would be welcome. I'd suggest that you compare the speed of the main branch and your feature branch in at least three cases:

loading an entire worksheet
loading just the first few rows
loading a few rows near the end

and ideally for all four backends: xls, xlsx, xlsb, ods. For example, #46894 was the recent change to improve case #2.

A more experienced contributor than me would have to judge if the performance improvement is big enough to justify any increase in code complexity. I've found the devs to be quite helpful for us newer contributors.

mroeschke · 2024-08-24T20:09:27Z

Thanks for the issue, but it appears this hasn't gotten traction in a while so closing

Sanix-Darker added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 9, 2022

Sanix-Darker changed the title ~~ENH: Having pandas.read_excel FASTER by adding an offset attribute (with an available proof of concept)~~ ENH: Having pandas.read_excel FASTER (with an available proof of concept) Jun 9, 2022

Sanix-Darker mentioned this issue Jun 15, 2022

PERF: pandas read_excel perf optimisations #47376

Closed

5 tasks

mzeitlin11 added Performance Memory or execution speed performance IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 20, 2022

mroeschke closed this as completed Aug 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Having pandas.read_excel FASTER (with an available proof of concept) #47290

ENH: Having pandas.read_excel FASTER (with an available proof of concept) #47290

Sanix-Darker commented Jun 9, 2022 •

edited

Loading

ahawryluk commented Jun 11, 2022

Sanix-Darker commented Jun 13, 2022

ahawryluk commented Jun 14, 2022

mroeschke commented Aug 24, 2024

ENH: Having pandas.read_excel FASTER (with an available proof of concept) #47290

ENH: Having pandas.read_excel FASTER (with an available proof of concept) #47290

Comments

Sanix-Darker commented Jun 9, 2022 • edited Loading

Is your feature request related to a problem?

Describe the solution you'd like

API breaking implications

Describe alternatives you've considered

Additional context

LIVE DEMO

BENCHMARKS

OUTPUT

ahawryluk commented Jun 11, 2022

Sanix-Darker commented Jun 13, 2022

ahawryluk commented Jun 14, 2022

mroeschke commented Aug 24, 2024

Sanix-Darker commented Jun 9, 2022 •

edited

Loading