excel reader & skip row between data & header & docs #4631
@@ -2015,6 +2024,30 @@ def test_iteration_open_handle(self):
expected = Series(['DDD', 'EEE', 'FFF', 'GGG'])
tm.assert_series_equal(result, expected)

def test_infer_columns(self):
this should be in io/tests/test_excel.py
I thought it belonged there because it tests my modification of parsers.py.
@timmie let me have a detailed look
thanks.
@timmie if you tell me the specific files that need to be included just for this (in your opinion) I'd be willing to rebase your branch for you - I know that can be a bit complicated to do.
@jtratner I am almost done...
@jreback ah, okay.
Thanks. So this one shall just include all the *.py file changes and the respective tests.
The docs could then go elsewhere. The doc changes are:
|
Privately, I prefer BZR. With git I always feel that I have no control over my files (accidentally deleting changes) for everything except pull, commit, push ;-(
@timmie to be clear, I don't mean that this will (or will not) get accepted - I just mean that I'll help you take out all the changes that need to be separated from this PR
Uff, so it's on my desk again. I thought I was done ;-(
https://github.com/jreback/pandas/compare/timmie_4634?expand=1 @timmie this needs an additional test for what happens if the Excel files contain dates that you don't want converted with offset_datetime (I think it will not work). I guess that is OK, since you can always convert them later. Any thoughts here?
@jreback In case the hours are counted 0-23 then there is no need for conversion. Or do I misunderstand you here?
@timmie your Excel files were fine. Add a column with regular dates to one of them; I am not sure what it will do with those. It SHOULD NOT apply offset_datetime to them, but there is no easy way to tell it not to (well, you could make multiple passes over the file: in pass 1 use offset_datetime, in pass 2 handle the other dates).
Ah, you mean something like combining columns such as date and hour into one? Yes, this is currently not foreseen. But we still have a problem with creating a datetime index from 2 Excel columns. This was not in the scope of my fix but is still a deficiency of the Excel reader.
The issue I am getting at, is that the |
I agree. But how does read_csv handle it? I am going offline now; apparently you are in another timezone.
I think now we should have a test for both cases that you are aiming at. Here's how I handled the same problem with CSV input (back then with scikits.timeseries):
I would assume that this works with the pandas CSV reader as well. But not with the Excel reader.
@jreback Meanwhile, we could open a new issue with the points raised and then improve the Excel reader generally so that it behaves similarly to read_csv. But I also think that the Excel reader was not equipped with that much magic because the basic assumption is:
So any further actions could be done with the data frame read in, e.g.:
Now in my test case, where the time is not entered in ISO format (hours 1-24 instead of 0-23), the reader fails to recognise the dates at all. This was the main reason for the fix. The date converters are actually only "by-products" of the tests. But they will be useful in similar situations.
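To make the 1-24 problem concrete, here is a minimal standard-library sketch. It is not the converter from this PR; the function name `combine_date_and_hour` and the date format are hypothetical, and it assumes that hour 24 means midnight at the end of the given day.

```python
from datetime import datetime, timedelta


def combine_date_and_hour(date_str, hour, date_format="%Y-%m-%d"):
    """Combine a date string and an hour counted 1-24 into a datetime.

    Assumption: hour 24 denotes midnight at the end of the given day,
    so it rolls over to 00:00 of the next day.
    """
    hour = int(hour)
    if not 1 <= hour <= 24:
        raise ValueError("hour must be in the range 1-24, got %r" % hour)
    # Adding the hour as a timedelta lets hour 24 roll over cleanly,
    # whereas datetime(..., hour=24) would raise a ValueError.
    return datetime.strptime(date_str, date_format) + timedelta(hours=hour)


first = combine_date_and_hour("2013-08-21", 1)
last = combine_date_and_hour("2013-08-21", 24)
```

The point of the sketch is only that a plain datetime parser rejects hour 24, so the offset has to be applied separately from the date part.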
At So there you have them separate. @jreback
Maybe this will make things easier? |
@timmie try this
this will give you a starting point of the branch I created |
|
As I said, you are free to submit doc corrections, but as a separate PR (unless the issue you are fixing actually needs a doc change).
Ok, I'll do that. We can add this to the development FAQ wiki.
@timmie the new git workflow sounds good. As Jeff says, you should do separate PRs for each feature (and that's a good git mentality to have in general): don't do lots of things in one PR, concentrate on one thing. (E.g. a PR should include: a test for a bug, the fix for that bug, and a release note about the fixed bug, ideally all in one commit; or a PR should contain improvements to the docs on a certain thing.) @jreback is there any kind of standard for timedeltas (I can't recall coming across any), or are you suggesting it only works with a user-defined parser?
@hayd timedeltas are not supported right now for writing to CSV (though you can kind of hack it), nor for reading at all (though again, if you write them as integers and then coerce them back in, it works). I am thinking that
@jreback seems reasonable; it will be more useful once we have Timestamp-like efficient timedeltas. Don't know if there is a market for it... I wonder if it makes sense for users to put in multiple arbitrary parsers, i.e. not just for timedeltas. (tbh I usually just munge after the fact as it's so easy..)
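The "write it as an integer, then coerce it back" workaround can be sketched with the standard library alone. This is only an illustration of the round trip, not pandas code; the column name `duration_s` is hypothetical, and storing whole seconds is an assumption that drops any sub-second precision.

```python
import csv
import io
from datetime import timedelta

durations = [timedelta(hours=1, minutes=30), timedelta(days=2), timedelta(seconds=45)]

# Write each duration as a whole number of seconds.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["duration_s"])
for duration in durations:
    writer.writerow([int(duration.total_seconds())])

# Read the integers back and coerce them to timedelta again.
buffer.seek(0)
reader = csv.DictReader(buffer)
restored = [timedelta(seconds=int(row["duration_s"])) for row in reader]
```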
@hayd that's a nice idea....sort of like the |
@hayd maybe open a new issue for that one? |
What do you mean by standard (python or global)? |
Did you see the example correction function above? I think it also works with pandas. So I summarise for my issues:
|
res = TextFileReader(*args, **kwds)

return res
think you should revert this
FYI our 'standard' now is this, but really a display format:
|
Again, regarding your comment about the keywords: so it would be a pure Excel issue, which was fixed with my change.
@jreback This sounds magical, not sure how this would work
@timmie there is far too much going on in this PR. I guess you know that, but the git idiom is to keep one feature (set) per branch (we should add this to the wiki to make it clearer). Then they can be inspected and discussed independently... and small parts are easier. Please separate these into different PRs moving forward.
@timmie you are welcome to start with the PR that I modified and add the additional tests I talked about. This current PR just has too much going on to be acceptable.
OK, right. I got the idea of using multiple branches and PRs. The only thing still uncertain for me at this moment:
|
@timmie start with the PR that I showed you. Add a column of regular datetimes to the test files, and leave in your test so that the converter ONLY works on that particular column.
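One way to picture "the converter ONLY works on that particular column" is a per-column mapping of converter functions. The sketch below uses only the standard library; `read_with_converters`, `hour_ending_to_datetime`, the column names, and the sample data are all hypothetical and are not the reader code under review.

```python
import csv
import io
from datetime import datetime, timedelta


def hour_ending_to_datetime(text):
    """Parse 'YYYY-MM-DD HH' where HH is counted 1-24 (assumed format)."""
    date_part, hour_part = text.split()
    return datetime.strptime(date_part, "%Y-%m-%d") + timedelta(hours=int(hour_part))


def read_with_converters(handle, converters):
    """Read CSV rows as dicts, converting only the columns named in `converters`."""
    rows = []
    for row in csv.DictReader(handle):
        for column, convert in converters.items():
            row[column] = convert(row[column])
        rows.append(row)
    return rows


sample = io.StringIO(
    "logged,measured\n"
    "2013-08-21 24,2013-08-21 10:15:00\n"
)
# Only the 'logged' column gets the 1-24 correction; 'measured' is left alone.
rows = read_with_converters(sample, {"logged": hour_ending_to_datetime})
```

A test along these lines would check both columns: the corrected one rolls over to the next day, and the regular datetime column comes back unchanged.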
Yes, for the test it's all ok. But which keyword shall trigger the correction in line |
you should handle the |
@cpcloud |
@timmie we're pretty sure this is a Github issue where you pushed something that matched current master and it automatically closed the PR since it matched master. You can just push your branch again and open a new pull-request. |
the merge conflict chaos was too hard ;-(
This is a continuation of:
#4404
Because:
I shall fix:
datemodes: (see ENH: Excel to support reading Timedeltas #4332)
additional rows (to be skipped) between a header and the data (see: xls.parse: fails to skip lines #4340)
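For readers unfamiliar with the second item, the layout in question is a header row followed by one or more rows (units, comments) that must be skipped before the data starts. A minimal standard-library sketch of that behaviour follows; `read_with_gap`, its `rows_to_skip` parameter, and the sample data are hypothetical and do not reflect the Excel reader's actual keywords.

```python
import csv
import io


def read_with_gap(handle, rows_to_skip):
    """Take the first row as header, skip `rows_to_skip` rows, then read data."""
    reader = csv.reader(handle)
    header = next(reader)
    for _ in range(rows_to_skip):
        # Discard the rows that sit between the header and the data.
        next(reader, None)
    return [dict(zip(header, row)) for row in reader]


sample = io.StringIO(
    "date,temperature\n"
    "-,degC\n"
    "2013-08-21,18.5\n"
    "2013-08-22,19.1\n"
)
records = read_with_gap(sample, rows_to_skip=1)
```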