ENH: pd.read_excel with table parameter #38937

samukweku · 2021-01-04T04:22:24Z

Is your feature request related to a problem?

Excel has tables , that makes data managing within Excel easier and offers some other features.

At the moment, pandas cannot access those tables, it simply reads in all the data. I think it would be helpful if a table parameter was added to pd.read_excel to capture table(s) defined in a sheet.

Sample Data:

Attached is a sample excel file: 016-MSPTDA-Excel.xlsx

With the current implementation of read_excel, we cannot select specific tables (dSalesReps, dProduct, dCategory, or dSupplier). Even if we read the Tables sheet, it becomes quite hard to separate the data into individual tables.

Describe the solution you'd like

pd.read_excel(io=filename, sheet=sheetname, table=tablename, ...)

The table parameter can have a default of None, in which case the entire sheet is read in; if however table is not None, then the table or lists of tables only are read in. The openpyxl library would be used to implement this.

API breaking implications

I am not aware of any API breaking implications

Describe alternatives you've considered

It could be done outside Pandas, where you read in the data first through openpyxl, before passing it to Pandas. I wrote a blog post about it; I feel, however, that it may be more convenient to provide that same functionality within Pandas and let the user worry less about the abstractions.

Additional context

I did check to see if this has been raised before, but did not find any. If it already has, kindly point me to that issue, and I will gladly close this.

Thanks.

The text was updated successfully, but these errors were encountered:

shraddhazpy · 2021-02-09T06:50:28Z

Hi, if @samukweku is not working on it and this is up for grabs, I'd would want to contribute.

samukweku · 2021-02-09T12:22:20Z

@shraddhazpy pls go ahead

scotscotmcc · 2022-02-20T00:42:35Z

hi, @shraddhazpy, are you working on this? This is functionality I would definitely appreciate. I haven't contributed to any public projects before, so it may be too big for me to tackle, but I would be interested in helping if I can.

samukweku · 2022-02-20T01:32:52Z

@scotscotmcc I think U can just type take and it will be assigned to u

scotscotmcc · 2022-02-21T02:46:17Z

I'm stalling a bit on taking this; I did some playing around with the code yesterday, but today couldn't get the development environment with docker set up...

That said, I want to raise a question around how a table_name parameter would function with the existing sheet_name. I can think of three different reasonable options.

Require sheet_name and table_name, grab the named table(s) from the named sheet (this is what was in the opening post).
If table_name is used, then don't require sheet_name at all, and just read the table(s).
If both are used, then read the entire sheet(s) and separately read the entire table(s). (my favorite)

Excel forces each table in a workbook to have a unique name, and so any table listed in the parameter could only exist once. Therefore, we don't need to know the sheet_name when the function is ran, we can find it, and it means one less thing for the user. Providing sheet_name would mean we don't have to search every sheet to have the name of the table, but it would also mean we would have an issue if the table is one a different sheet.

Requiring sheet_name when using table_name would also make it more difficult to read multiple tables that are on different sheets. You could require that, if you wanted to read multiple tables, you provide a list for sheet_name and a list for table_name, and the nth element of each list are pairs, but that seems bad...

Further, if you wanted to read all of Sheet1 and also Table2 that is within Sheet2 (but don't want to read Sheet2), it would be great to be able to do both of these at the time. To that end, allowing a new table_name to actually not really interact with the sheet_name parameter would allow you to do just that. Providing sheet_name reads sheets, and providing table_name reads tables, and they do the same thing regardless of what is up with the other.

scotscotmcc · 2022-03-17T01:07:14Z

I think that this enhancement is / should be dependent on this other to allow for passing engine_kwargs to pd.read_excel(). #43053

Currently, pd.read_excel() with engine=openpyxl (default if filetype is .xlsx) will default to setting the read_only = True in the openpyxl.load_workbook() call. However, you can't get table information out of the workbook object in openpyxl then; you need to have read_only = False.

The linked enhancement is to allow passing explicit arguments into the engine. That framework could then be used for this enhancement to change the read_only argument. The alternative would be that this enhancement creates a separate process to change the engine arguments, which seems clunky, inefficient, and cumbersome.

samukweku · 2022-04-12T13:23:47Z

if this helps, there is an implementation for reading xlsx tables here; hopefully it can be done within Pandas (and one less library for users to install 😁

Meadiocre · 2024-02-03T02:35:56Z

Hello @samukweku and @mroeschke,

My project team is looking for Pandas enhancement features for our grad school semester long project. We saw this task and would like to contribute if possible!

samukweku · 2024-02-03T06:21:33Z

Hi @Meadiocre type take and it will be assigned to you

iangainey · 2024-02-03T14:11:17Z

take
I'm in his project team, so thank you!

iangainey · 2024-02-15T23:49:38Z

After looking into how this would be implemented, I had a couple of thoughts/wanted to clarify some desired behavior:
While this can currently be done by using openpyxl or pyjanitor, the intent is for this to be a more straightforward pandas user experience to be able to specify tables to read in data from without the more complex method in openpyxl or having to include pyjanitor-correct?
As the read_excel function has some features that are already dependent on openpyxl, I just wanted to clarify if the preference is to implement this without any dependency on openpyxl, or to just make it more clear & easy to use by adding the parameter to the read_Excel function with the dependency on openpyxl.

The other question has to do with a post by @scotscotmcc. To reiterate what they said, for if a table name and a sheet name are specified, either these options could be done:

Only look for the table on the sheet name (If specified, otherwise search all sheets). This would result in faster reading of the table in files with many sheets, so not every sheet's tables has to be checked. But, it may be more "clunky" for a user if they mess up what sheet they put the table in, or in cases they want to read a specific sheet in as well as a table on a different sheet this would not be possible.
Look for the tables in any sheet, no matter if sheet name is specified. Therefore the user could chose any sheet(s) they want to read in, and any tables and exactly those would be read in, even if those do not line up with what sheets the tables are on. This would cause any reading in of a table to search every sheet, but allowing the user more versatility in data selection.

I'm planning on proceeding with #2 unless there's any feedback otherwise with better suggestions. Additionally, I'm planning on pursuing the route of being dependent on openpyxl as the read_excel function already has many parts that are, just making it an easier use to read in tables with this function instead of using more complex methods through openpyxl or including pyjanitor, again unless there's any better feedback.

samukweku · 2024-02-16T00:20:20Z

@iangainey hi. I wrote the pyjanitor implementation, because at the time there wasnt any in pandas. It would be nice if this was done within Pandas, and not dependent on pyjanitor. instead the dependency should directly be on openpyxl(at the moment, they are the only library offering this option). it'd give the pyjanitor team a valid reason to deprecate the function (we try not to replicate anything that Pandas already covers well).

On the other part, I'd suggest we go for option 1. The onus should be on the user to specify the correct sheet where the table name is stored. If the user does not provide a table name, then all tables from all sheets can be read in, or all tables from the specified sheet(s) can be read in.

as always, we can lean on the guidance of the pandas dev team here - @mroeschke @phofl @scotscotmcc

scotscotmcc · 2024-02-16T03:17:11Z

Thanks for taking this on, @iangainey. As just a regular-old pandas user (and I use excel a bunch), I don't like option 1. As I said in my earlier post on it, excel requires the table name to be unique, and so passing in a table name uniquely and sufficiently identifies what the user is trying to read. Requiring the sheet name as well feels like the user would be repeating themself (at least that's how I would feel when using it).

With option 1, there is the advantage of not having to loop through sheets, since excel doesn't index the tables but just lists them within sheets. This advantage doesn't seem all that great, though, I don't think. At worst, it requires looping through sheets and the tables within those sheets, but it doesn't require looping through cells (which could be real onerous).

To raise another question here now I'm thinking of base off of @samukweku's comment, though - what if the user wants to read all tables in the workbook? Currently, read_excel(...,sheet_name=None) will read in all sheets. That seems like a functionality that could be replicated (and would make sense to replicate) with reading tables. That is, a user could pass a string, a list, or None and they would function similarly to sheet_name, and so read_excel(...,table_name=None) would read in all tables..

samukweku · 2024-02-16T03:31:17Z

@scotscotmcc ahh that makes sense then if table name is unique. Totally missed that. Option 1 isn't needed then

iangainey · 2024-03-16T18:42:41Z

I've been making a lot of progress on this, and have a come up with some questions:
To explain current progress of what the implementation looks like:

table_name can be a string, a list of strings, or None (which reads all tables)
If table_name is not passed, no tables will be read and the output will be a dictionary of dataframes of the sheets (this is currently how pandas does this, this wasn't changed)
If table_name is passed, the output will be a dictionary of dataframes of the tables
If both table_name and sheet_name are passed, the result will be a dictionary of 2 nested dictionaries, one for sheets and one for tables. The sheets dictionary will have dataframes of the sheets, and the tables dictionary will have the dataframes of the tables
I did the above forms of output so that if you don't pass table_name, nothing about the output will change (existing code using read_excel will still work fine, shouldn't break anything). This seemed like the most clear to use way of returning the output, any issues on this or suggestions of better ways to do it?

Now, the main question:
Currently with sheets, you can specify the index of the sheet to be read. I have not implemented this for tables. Is this desired functionality?
If so, I run into some issues with determining what to set table_name to if it is not passed in. Currently, I set it to 0 and then no tables are read in/returned. None is already used for a case in which to read all tables.
Sheets approaches this by just setting it to 0 if sheet_name is not passed, which by default reads in the first sheet. Since I don't want to break existing uses of read_excel, I don't want to return the firs table if table_name is not passed. I'm not sure how to implement this so that I can use 0 for an index, but also set table_name to something that will prevent any tables from being read? My only thought was set it to -1, check if it's -1 and if so don't do any table reading/output, but that seems like an unconventional process.

Otherwise, I should have a PR up soon to see if there's any issues or comments on things that could or should be improved.
Thoughts, @samukweku @scotscotmcc @mroeschke ?

samukweku · 2024-03-16T20:52:22Z

As @scotscotmcc mentioned, sheets isn't required since table names are unique, so it can/should be ignored

iangainey · 2024-03-16T21:00:53Z

I think I'm misunderstanding what your referencing. Sheet names as ignored in regards to what tables are read in, you don't have to specify a sheet to read tables. But, my question is more that I had to format the output differently if you pass in sheet names and table names, and wondering if reading table by index is desired @samukweku

samukweku · 2024-03-16T21:04:47Z

@iangainey my bad. I don't think the nested approach for sheet names dictionary is necessary, a dictionary of table names and mapping sounds ok

scotscotmcc · 2024-03-21T15:29:54Z

I think it is fine to not read tables by index. I think that it is more straightforward for sheets to read by index since, in excel, there is already a single index of sheets at the workbook level. For tables (i.e. listobjects), they live within sheets and each sheet has its own list of them. One could then just sort of make a single list of them where the tables in sheet1 are first, then the tables in sheet2, etc. If I was trying to use this index system to grab a table out of excel, though, it seems like I'd be doing more work than just using the name (unless it is the first table).

All that said, I'm thinking a lot about my own use cases, where I have named tables that I have control over, and so I can read them by name just fine. If someone else is getting an excel file from someone else and trying to read it, and it has inconsistent table names, I could see it working less well for them. Even still, they could read all the tables and then filter the returning dictionary.

samukweku added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 4, 2021

lithomas1 added IO Excel read_excel, to_excel and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 2, 2021

lithomas1 added this to the Contributions Welcome milestone Feb 2, 2021

mroeschke added the Needs Discussion Requires discussion from core team before further action label Aug 14, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

github-actions bot assigned iangainey Feb 3, 2024

samukweku mentioned this issue Feb 17, 2024

sheet_name not required in jn.xlsx_table pyjanitor-devs/pyjanitor#1338

Closed

This was referenced Apr 28, 2024

ENH: Adding table name parameter to pandas.read_excel #58464

Closed

REF: Read excel parse refactor #58497

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: pd.read_excel with table parameter #38937

ENH: pd.read_excel with table parameter #38937

samukweku commented Jan 4, 2021 •

edited

Loading

shraddhazpy commented Feb 9, 2021

samukweku commented Feb 9, 2021

scotscotmcc commented Feb 20, 2022

samukweku commented Feb 20, 2022

scotscotmcc commented Feb 21, 2022

scotscotmcc commented Mar 17, 2022

samukweku commented Apr 12, 2022

Meadiocre commented Feb 3, 2024

samukweku commented Feb 3, 2024

iangainey commented Feb 3, 2024 •

edited

Loading

iangainey commented Feb 15, 2024

samukweku commented Feb 16, 2024 •

edited

Loading

scotscotmcc commented Feb 16, 2024 •

edited

Loading

samukweku commented Feb 16, 2024

iangainey commented Mar 16, 2024

samukweku commented Mar 16, 2024

iangainey commented Mar 16, 2024 •

edited

Loading

samukweku commented Mar 16, 2024

scotscotmcc commented Mar 21, 2024

ENH: pd.read_excel with table parameter #38937

ENH: pd.read_excel with table parameter #38937

Comments

samukweku commented Jan 4, 2021 • edited Loading

Is your feature request related to a problem?

Sample Data:

Describe the solution you'd like

API breaking implications

Describe alternatives you've considered

Additional context

shraddhazpy commented Feb 9, 2021

samukweku commented Feb 9, 2021

scotscotmcc commented Feb 20, 2022

samukweku commented Feb 20, 2022

scotscotmcc commented Feb 21, 2022

scotscotmcc commented Mar 17, 2022

samukweku commented Apr 12, 2022

Meadiocre commented Feb 3, 2024

samukweku commented Feb 3, 2024

iangainey commented Feb 3, 2024 • edited Loading

iangainey commented Feb 15, 2024

samukweku commented Feb 16, 2024 • edited Loading

scotscotmcc commented Feb 16, 2024 • edited Loading

samukweku commented Feb 16, 2024

iangainey commented Mar 16, 2024

samukweku commented Mar 16, 2024

iangainey commented Mar 16, 2024 • edited Loading

samukweku commented Mar 16, 2024

scotscotmcc commented Mar 21, 2024

samukweku commented Jan 4, 2021 •

edited

Loading

iangainey commented Feb 3, 2024 •

edited

Loading

samukweku commented Feb 16, 2024 •

edited

Loading

scotscotmcc commented Feb 16, 2024 •

edited

Loading

iangainey commented Mar 16, 2024 •

edited

Loading