-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DOC: update wording about when xlrd engine can be used #38456
Conversation
pandas/io/excel/_base.py
Outdated
extension ``.xls``, or is an ``xlrd`` Book instance, then ``xlrd`` will | ||
be used. | ||
- Otherwise if `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed, | ||
then ``openpyxl`` will be used. | ||
- Otherwise ``xlrd`` will be used and a ``FutureWarning`` will be raised. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not made the code changes related to this line or the change on line 121 (these changes are echoed in the section around line ~900), but that's what I think the approach should be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, so that will be the approach in the future. But in pandas we decided to first raise a warning about it (in case the user only upgraded pandas, and not xlrd).
So for now we should keep this sentence (but maybe we should add something like "... and this ability will be removed in a future version" to make this more explicit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I respect the choice of approach, I can't really condone it. As I said above: I have concerns around security with xlrd<2, and so anything that can be done to stop people using it would be a very good thing.
If a user is upgrading such that they have pandas 1.2, that would be a very good time for them to also upgrade to xlrd 2 and whatever the latest openpyxl release is...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not made the code changes related to this..., but that's what I think the approach should be.
Just so I understand the intentions here, are you planning to make those code changes (taking into account @jorisvandenbossche request)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's an open question, see #38424 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for those edits!
doc/source/whatsnew/v1.2.0.rst
Outdated
when `defusedxml <https://github.com/tiran/defusedxml/>`_ is installed, and | ||
is anticipated to be unusable in the future. | ||
anything but ``.xls`` files when specifying ``engine="xlrd"`` to ``pd.read_excel`` will raise | ||
an exception. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's only the case with the latest xlrd, though, I think?
(while many users will still have xlrd < 2.0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, but if I could, I would totally enforce the use of only xlrd >= 2.0 now that it is out.
To be explicit: that comes from concerns I have, particularly around security, about code that hasn't been maintained for years being used to interact with formats (zip and xml, both of which are used in xlsx) that have known security issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cjw296 yes, but we cann't force anyone to update (and in all likely hood they will simply update pandas and continue using the current version of xlrd).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback - I mean, in terms of implementation, you can: raise an exception if xlrd.__version__ < '2'
. Given that people will only hit this when they're already upgrading at least one package, ie: pandas, this feels reasonable to me, but pandas is not my project, so while I'd be disappointed at the security risk you're prepared to expose your users to, I'd have to accept it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so while I'd be disappointed at the security risk you're prepared to expose your users to, I'd have to accept it.
believe me users are way more exposed to security risks that this. its hard to immediately just do something which causes current code to break. yes I get your point, but the reverse is true too. people update pandas and not other packages and expect things to work. we generally like to warn if at all possible first.
pandas/io/excel/_base.py
Outdated
extension ``.xls``, or is an ``xlrd`` Book instance, then ``xlrd`` will | ||
be used. | ||
- Otherwise if `openpyxl <https://pypi.org/project/openpyxl/>`_ is installed, | ||
then ``openpyxl`` will be used. | ||
- Otherwise ``xlrd`` will be used and a ``FutureWarning`` will be raised. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, so that will be the approach in the future. But in pandas we decided to first raise a warning about it (in case the user only upgraded pandas, and not xlrd).
So for now we should keep this sentence (but maybe we should add something like "... and this ability will be removed in a future version" to make this more explicit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these changes look fine to me. but don't we still have the question of what happens when > 2 xlrd is installed and our error conditions are triggered? IIUC this will simply raise the value error from inside xlrd, which is fine as the message is helpful, but are we also triggering the warning? maybe we should catch the exception and raise a nice warning say you should just use openpyxl
?
do we have testing with > 2 xlrd?
doc/source/whatsnew/v1.2.0.rst
Outdated
when `defusedxml <https://github.com/tiran/defusedxml/>`_ is installed, and | ||
is anticipated to be unusable in the future. | ||
anything but ``.xls`` files when specifying ``engine="xlrd"`` to ``pd.read_excel`` will raise | ||
an exception. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cjw296 yes, but we cann't force anyone to update (and in all likely hood they will simply update pandas and continue using the current version of xlrd).
Unfortunately, the discussion around this is a little fragmented, I think #38424 (comment) addresses this. If I worked up a PR for the code I suggest there, would that be accepted? Also, would it be okay to add it as a separate commit to this PR or would you prefer a separate one? |
I think this update is fine (and for 1.2), unless objections for THIS PR, @jorisvandenbossche @simonjayhawkins ok merging (as I think this is the only outstanding issue for 1.2) |
@jreback - This PR changes the documented behavior but does not change the code; it will make the code and docs out of sync. |
@jorisvandenbossche - I think #38522 is the place to bottom things out now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pending discussion
@cjw296 thanks for the update. I just removed the |
Ah sorry, yes, that was a mistake on my part. I do feel you've watered down the warning though: I feel people should be warned that opening xlsx with xlrd is explicitly unsupported, and to upgrade rather than logging issues about that. |
Is it about the "no longer" that I added? Happy to remove that again |
(for the rest I kept your sentence intact. But of course it's now within a larger warning box, instead of only that sentence, so therefore maybe a bit less standing out) |
Thus, it is strongly encouraged to install ``openpyxl`` to read Excel 2007+ | ||
(``.xlsx``) files. | ||
Please do not report issues when using ``xlrd`` to read ``.xlsx`` files. | ||
This is no longer supported, switch to using ``openpyxl`` instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer these two lines to be further up, around 2938, and in their own paragraph. I mean, I'd prefer them to be bright red and bold, maybe flashing too, but I doubt you'd agree to that ;-)
doc/source/whatsnew/v1.2.0.rst
Outdated
would result in using the ``xlrd`` engine in many cases. If | ||
`openpyxl <https://openpyxl.readthedocs.io/en/stable/>`_ is installed, | ||
Previously, the default argument ``engine=None`` to :func:`~pandas.read_excel` | ||
would result in using the ``xlrd`` engine in many cases, also for new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"including" might be better than "also for" here.
@jorisvandenbossche - not really sure why you're doing stuff on this PR, #38522 has been closed in favour of the, quite frankly, inferior #38571. I'm kinda done with helping pandas, this has been an extremely negative experience for me, I would welcome a discussion in private, but I couldn't find your email address anywhere. Mine is chris@python.org if you fancy a chat. |
doc/source/user_guide/io.rst
Outdated
would result in using the ``xlrd`` engine in many cases, also for new | ||
would result in using the ``xlrd`` engine in many cases, including for new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to be pedantic, but the change I suggested was "also for" -> "including", not "also" -> "including".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whoops, sorry, read to quickly
This PR is not tied to #38522 or #38571. The reason I am pushing to this PR is because I think there were valuable improvements here to the docs, which I wanted to get merged. But for that, the subset that didn't match the actual behaviour in master needed to be removed. So that's why I started pushing that (thinking you would be reluctant to do it yourself) |
@meeseeksdev backport 1.2.x |
…ine can be used
… used (#38660) Co-authored-by: Chris Withers <chris@simplistix.co.uk>
See #38424 (comment).