-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: pd.read_excel with table parameter #38937
Comments
Hi, if @samukweku is not working on it and this is up for grabs, I'd would want to contribute. |
@shraddhazpy pls go ahead |
hi, @shraddhazpy, are you working on this? This is functionality I would definitely appreciate. I haven't contributed to any public projects before, so it may be too big for me to tackle, but I would be interested in helping if I can. |
@scotscotmcc I think U can just type |
I'm stalling a bit on taking this; I did some playing around with the code yesterday, but today couldn't get the development environment with docker set up... That said, I want to raise a question around how a
Excel forces each table in a workbook to have a unique name, and so any table listed in the parameter could only exist once. Therefore, we don't need to know the Requiring Further, if you wanted to read all of Sheet1 and also Table2 that is within Sheet2 (but don't want to read Sheet2), it would be great to be able to do both of these at the time. To that end, allowing a new |
I think that this enhancement is / should be dependent on this other to allow for passing Currently, The linked enhancement is to allow passing explicit arguments into the engine. That framework could then be used for this enhancement to change the read_only argument. The alternative would be that this enhancement creates a separate process to change the engine arguments, which seems clunky, inefficient, and cumbersome. |
if this helps, there is an implementation for reading xlsx tables here; hopefully it can be done within Pandas (and one less library for users to install 😁 |
Hello @samukweku and @mroeschke, My project team is looking for Pandas enhancement features for our grad school semester long project. We saw this task and would like to contribute if possible! |
Hi @Meadiocre type take and it will be assigned to you |
take |
After looking into how this would be implemented, I had a couple of thoughts/wanted to clarify some desired behavior: The other question has to do with a post by @scotscotmcc. To reiterate what they said, for if a table name and a sheet name are specified, either these options could be done:
I'm planning on proceeding with #2 unless there's any feedback otherwise with better suggestions. Additionally, I'm planning on pursuing the route of being dependent on openpyxl as the read_excel function already has many parts that are, just making it an easier use to read in tables with this function instead of using more complex methods through openpyxl or including pyjanitor, again unless there's any better feedback. |
@iangainey hi. I wrote the pyjanitor implementation, because at the time there wasnt any in pandas. It would be nice if this was done within Pandas, and not dependent on pyjanitor. instead the dependency should directly be on openpyxl(at the moment, they are the only library offering this option). it'd give the pyjanitor team a valid reason to deprecate the function (we try not to replicate anything that Pandas already covers well). On the other part, I'd suggest we go for option 1. The onus should be on the user to specify the correct sheet where the table name is stored. If the user does not provide a table name, then all tables from all sheets can be read in, or all tables from the specified sheet(s) can be read in. as always, we can lean on the guidance of the pandas dev team here - @mroeschke @phofl @scotscotmcc |
Thanks for taking this on, @iangainey. As just a regular-old pandas user (and I use excel a bunch), I don't like option 1. As I said in my earlier post on it, excel requires the table name to be unique, and so passing in a table name uniquely and sufficiently identifies what the user is trying to read. Requiring the sheet name as well feels like the user would be repeating themself (at least that's how I would feel when using it). With option 1, there is the advantage of not having to loop through sheets, since excel doesn't index the tables but just lists them within sheets. This advantage doesn't seem all that great, though, I don't think. At worst, it requires looping through sheets and the tables within those sheets, but it doesn't require looping through cells (which could be real onerous). To raise another question here now I'm thinking of base off of @samukweku's comment, though - what if the user wants to read all tables in the workbook? Currently, |
@scotscotmcc ahh that makes sense then if table name is unique. Totally missed that. Option 1 isn't needed then |
I've been making a lot of progress on this, and have a come up with some questions:
Now, the main question: Otherwise, I should have a PR up soon to see if there's any issues or comments on things that could or should be improved. |
As @scotscotmcc mentioned, sheets isn't required since table names are unique, so it can/should be ignored |
I think I'm misunderstanding what your referencing. Sheet names as ignored in regards to what tables are read in, you don't have to specify a sheet to read tables. But, my question is more that I had to format the output differently if you pass in sheet names and table names, and wondering if reading table by index is desired @samukweku |
@iangainey my bad. I don't think the nested approach for sheet names dictionary is necessary, a dictionary of table names and mapping sounds ok |
I think it is fine to not read tables by index. I think that it is more straightforward for sheets to read by index since, in excel, there is already a single index of sheets at the workbook level. For tables (i.e. listobjects), they live within sheets and each sheet has its own list of them. One could then just sort of make a single list of them where the tables in sheet1 are first, then the tables in sheet2, etc. If I was trying to use this index system to grab a table out of excel, though, it seems like I'd be doing more work than just using the name (unless it is the first table). All that said, I'm thinking a lot about my own use cases, where I have named tables that I have control over, and so I can read them by name just fine. If someone else is getting an excel file from someone else and trying to read it, and it has inconsistent table names, I could see it working less well for them. Even still, they could read all the tables and then filter the returning dictionary. |
Is your feature request related to a problem?
Excel has tables , that makes data managing within Excel easier and offers some other features.
At the moment, pandas cannot access those tables, it simply reads in all the data. I think it would be helpful if a
table
parameter was added topd.read_excel
to capture table(s) defined in a sheet.Sample Data:
Attached is a sample excel file: 016-MSPTDA-Excel.xlsx
With the current implementation of
read_excel
, we cannot select specific tables (dSalesReps, dProduct, dCategory, or dSupplier). Even if we read theTables
sheet, it becomes quite hard to separate the data into individual tables.Describe the solution you'd like
pd.read_excel(io=filename, sheet=sheetname, table=tablename, ...)
The table parameter can have a default of
None
, in which case the entire sheet is read in; if however table is notNone
, then the table or lists of tables only are read in. Theopenpyxl
library would be used to implement this.API breaking implications
I am not aware of any API breaking implications
Describe alternatives you've considered
It could be done outside Pandas, where you read in the data first through
openpyxl
, before passing it to Pandas. I wrote a blog post about it; I feel, however, that it may be more convenient to provide that same functionality within Pandas and let the user worry less about the abstractions.Additional context
I did check to see if this has been raised before, but did not find any. If it already has, kindly point me to that issue, and I will gladly close this.
Thanks.
The text was updated successfully, but these errors were encountered: