excel reader & skip row between data & header & docs #4631

Merged Sep 25, 2013

Conversation

timmie
Contributor

@timmie timmie commented Aug 21, 2013

the merge conflict chaos was too hard ;-(

This is a continuation of:
#4404

Because:

  • we apparently hit a GitHub bug
  • the branches diverged too much.

I shall fix:

@@ -2015,6 +2024,30 @@ def test_iteration_open_handle(self):
expected = Series(['DDD', 'EEE', 'FFF', 'GGG'])
tm.assert_series_equal(result, expected)

def test_infer_columns(self):
Contributor

this should be in io/tests/test_excel.py

Contributor Author

I thought it belonged there because it tests my modification of parsers.py

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback @jtratner

So Travis says we can merge. I've done my best to clean up the code.

What are the (last) steps to get this through?

@jreback
Contributor

jreback commented Aug 22, 2013

@timmie let me have a detailed look

@timmie
Contributor Author

timmie commented Aug 22, 2013

thanks.

@jtratner
Contributor

@timmie if you tell me the specific files that need to be included just for this (in your opinion) I'd be willing to rebase your branch for you - I know that can be a bit complicated to do.

@jreback
Contributor

jreback commented Aug 22, 2013

@jtratner I am almost done...

@jtratner
Contributor

@jreback ah, okay.

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jtratner

Thanks.

So this one should just include the *.py file changes:
io/excel.py
io/parsers.py
io/date_converters.py

& the respective tests.

@timmie
Contributor Author

timmie commented Aug 22, 2013

The docs could then go elsewhere.

The doc changes are:

  • document the fix for this issue
  • add shortcut links to handy references in the docs
  • bring all integer indices methods together
  • add section about working with the datetime index to timeseries.rst

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jtratner

Privately, I prefer BZR. So with git I always feel like I don't have control over my files (accidentally deleting changes) for everything except pull, commit, and push ;-(
But it seems I have learned from you guys how to write simple tests. The rest may come with time.

@jtratner
Contributor

@timmie yeah, git can be complicated, you might want to check out scm_breeze, which might make it easier.

I'm sure @jreback will have additional comments, but you need to split out the date_parser changes from the row skipping changes and the doc changes.

@jtratner
Contributor

@timmie to be clear, I don't mean that this will (or will not) get accepted - I just mean that I'll help you take out all the changes that need to be separated from this PR

@timmie
Contributor Author

timmie commented Aug 22, 2013

Uff, so it's on my desk again. I thought I was done ;-(

@jreback
Contributor

jreback commented Aug 22, 2013

https://github.com/jreback/pandas/compare/timmie_4634?expand=1

@timmie this needs an additional test for what happens if you have dates in the Excel files that you don't want converted with offset_datetime (I think it will not work), which I guess is ok since you can always convert them later.
Essentially what you want is to apply the date_parser to only certain columns (which you can do with `parse_dates=['A']`), BUT you cannot then parse the 'other' dates.

any thoughts here?
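
For reference, a minimal read_csv sketch of the limitation described above; offset_datetimes here is only a stand-in for the converter from this PR (its real definition lives in io/date_converters.py), and the file and column names are illustrative:

import datetime as dt
import pandas as pd

# stand-in for the converter added in this PR (exact signature is an assumption):
# shift end-of-interval timestamps back by one hour
def offset_datetimes(values):
    return pd.to_datetime(values) - dt.timedelta(hours=1)

# date_parser is applied to every column listed in parse_dates, so a second
# column of regular dates in the same file cannot get a different parser
df = pd.read_csv("data.csv", parse_dates=["A"], date_parser=offset_datetimes)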

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback
The Excel file included is also a corner case, counting hours 1-24.

If the hours are counted 0-23, there is no need for conversion.

Or do I misunderstand you here?

@jreback
Contributor

jreback commented Aug 22, 2013

@timmie your Excel files were fine

add a column to one of them with regular dates; I am not sure what it will do with those (e.g. it SHOULD NOT apply the offset_datetime to them), but there is no easy way to tell it not to (well, you could make multiple passes on the file: in pass 1 use the offset_datetime, in pass 2 do the other dates)

@timmie
Contributor Author

timmie commented Aug 22, 2013

essentially what you want is to apply the date_parser to only certain columns (which you can do with `parse_dates=['A']`), BUT you cannot then parse 'other' dates

Ah, you mean like combining columns such as date and hour into one?

Yes, this is currently not supported.
Would it be bad to convert that later?
We already assume that datetime / time is read in.

But we still have a problem with creating a datetime index from two Excel columns.

This was not the scope of my fix, but it is still a deficiency of the Excel reader.

@jreback
Contributor

jreback commented Aug 22, 2013

The issue I am getting at is that the offset_datetime cannot 'know' that a column is meant for timedelta conversion (rather than a plain old datetime conversion). I am thinking that we need another set of keywords, e.g. parse_timedelta, timedelta_parser (which we actually need in another case anyway)

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback

I agree, but how does read_csv handle it?
Don't we want similar behaviour?

I am going offline now -- apparently it's a different timezone at your place.
Thanks so far. Let's try to get this into main pandas by Saturday at the latest. If not, I will have to wait some weeks until I can look at it again.

@jreback
Contributor

jreback commented Aug 22, 2013

main read_csv doesn't handle this...have to think about it

@cpcloud @jtratner @hayd

parse_timedelta and timedelta_parser keywords for read_csv?

@timmie
Contributor Author

timmie commented Aug 22, 2013

I think we should now have tests for both cases that you are aiming at.

Here's how I handled the same problem with CSV input (back then with scikits.timeseries):

import datetime as dt
import scikits.timeseries as ts


def dc_date_time_1to24(date_str, time_str, freq='T'):
    """
    .. csv-table:: Normal Time Series: 00:00-23:59
           :header: "Date", "Time"
           :delim: ;

           01.10.2008;00:10
           01.10.2008;00:20

           [...];[...];[...];[...];[...]
           01.10.2008;23:50
           01.10.2008;00:00
           02.10.2008;00:10

    Note
    ----
    assumed datecols::

        datecols = (0, 1)

    """
    # parse the date and time columns separately
    date_dt = dt.datetime.strptime(date_str, "%d.%m.%Y")
    time_dt = dt.datetime.strptime(time_str, "%H:%M")

    # shift the time back by one 10-minute interval and keep only the time part
    time_dt = time_dt - dt.timedelta(minutes=10)
    time_dt = time_dt.time()

    # combine date and time, then wrap as a scikits.timeseries Date at freq
    dt_concat = dt.datetime.combine(date_dt, time_dt)
    ts_date = ts.Date(freq, datetime=dt_concat)

    return ts_date

I would assume that this works with the pandas CSV reader as well, but not with the Excel reader.

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback
I think a solution would be to merge this PR to get the current situation fixed.

Meanwhile, we open a new issue with the points raised, and then improve the Excel reader generally so it behaves similarly to read_csv.

But I also think that the Excel reader was not equipped with that much magic because the basic assumption is:

  • The data provider would have formatted cells in Excel the right way
  • entered the data according to standards (e.g. ISO 8601 for datetime)

So any further actions could be done with the data frame read in, e.g.:

  • merge date and hour column to datetime
  • set index

Now, in my test case where the time is not entered in ISO form (1-24 instead of 0-23), the reader fails to recognise the dates at all. This was the main reason for the fix.

The date converters are actually only "by-products" of the tests, but they will be useful in similar situations.
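
For illustration, a minimal sketch of the kind of converter this is about; the name, date format, and signature below are assumptions, not the actual offset_datetime added in this PR:

import datetime as dt

# hypothetical converter: interpret an hour counted 1-24 as hour - 1 in the
# usual 0-23 range, so "24" on a given day becomes 23:00 of that day
def offset_hours_1to24(date_str, hour_str):
    day = dt.datetime.strptime(date_str, "%d.%m.%Y")
    return day + dt.timedelta(hours=int(hour_str) - 1)

For example, offset_hours_1to24("01.10.2008", "24") gives 2008-10-01 23:00 instead of failing on the out-of-range hour.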

@timmie
Contributor Author

timmie commented Aug 22, 2013

At
https://travis-ci.org/timmie/pandas/builds/10491379
you can see that #4634 passes.

So there you have them separate.

@jreback
I have changed my git workflow:

  • keep master always in sync with upstream
  • branch off master for each new feature
  • regularly rebase / pull

Maybe this will make things easier?

@jreback
Contributor

jreback commented Aug 22, 2013

@timmie try this

git remote add jreback https://github.com/jreback/pandas.git && git fetch jreback && git checkout -b timmie_4634 --track jreback/timmie_4634

this will give you a starting point of the branch I created

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback

@jreback
Contributor

jreback commented Aug 22, 2013

@timmie

as I said, you are free to submit doc corrections, but as a separate PR (unless the issue you are fixing actually needs a doc change).

@timmie
Contributor Author

timmie commented Aug 22, 2013

Ok, I'll do that.

we can add this to the development FAQ wiki.

@hayd
Contributor

hayd commented Aug 22, 2013

@timmie the new git workflow sounds good. As Jeff says, you should do separate PRs for each feature (and that's a good git mentality to have in general); don't do lots of things in one PR, concentrate on one thing (e.g. a PR should include a test for a bug, the fix for that bug, and a release note about the fixed bug, ideally all in one commit; or a PR should contain improvements to the docs on a certain thing).

@jreback is there any kind of standard for timedeltas (I can't recall coming across any), or are you suggesting it only works with a user-defined parser?

@jreback
Contributor

jreback commented Aug 22, 2013

@hayd timedeltas are not supported right now for writing to csv (though you can kind of hack it), nor for reading at all (though again, if you write them as integers and then coerce back, it works)

I am thinking that read_csv needs parse_timedelta= and timedelta_parser= in order to really support it
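
A minimal sketch of that round-trip hack, assuming the timedeltas are stored as integer seconds; the file and column names are illustrative:

import numpy as np
import pandas as pd

# write: store the timedelta column as integer seconds
df = pd.DataFrame({"td": [np.timedelta64(90, "s"), np.timedelta64(3600, "s")]})
df["td_seconds"] = df["td"].values.astype("timedelta64[s]").astype(np.int64)
df[["td_seconds"]].to_csv("out.csv", index=False)

# read: coerce the integers back into timedeltas
back = pd.read_csv("out.csv")
back["td"] = back["td_seconds"].values.astype("timedelta64[s]")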

@hayd
Contributor

hayd commented Aug 22, 2013

@jreback seems reasonable; it will be more useful once we have Timestamp-like efficient timedeltas. Don't know if there is a market for it...

I wonder if it makes sense for users to put in multiple arbitrary parsers, i.e. not just for timedeltas.

(tbh I usually just munge after the fact as it's so easy..)

@jreback
Contributor

jreback commented Aug 22, 2013

@hayd that's a nice idea....sort of like the dtype keyword...in fact, why couldn't we hijack that? (accept a callable for the dtype, which would result in the column being parsed from a string using that callable?)

@jreback
Contributor

jreback commented Aug 22, 2013

@hayd maybe open a new issue for that one?

@timmie
Contributor Author

timmie commented Aug 22, 2013

@hayd

is there any kind of standard for timedeltas (I can't recall coming across any) or

What do you mean by standard (Python or global)?
ISO 8601 has something on this.

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback

I am thinking that read_csv needs parse_timedelta= and timedelta_parser= in order to really support it

Did you see the example correction function above? I think it also works with pandas.

So I summarise for my issues:

res = TextFileReader(*args, **kwds)


return res
Contributor

think you should revert this

@hayd
Contributor

hayd commented Aug 22, 2013

@timmie Yeah, I meant some kind of global standard for timedeltas in csv. I see the ISO wiki mentions it briefly (either start and end times, or something like "P1Y2M10DT2H30M"... I haven't seen that before).

@jreback
Contributor

jreback commented Aug 22, 2013

FYI our 'standard' now is this, but it's really a display format:

<n> days, hh:mm:ss.fraction

1 days, 3:05:02.0003

@timmie
Contributor Author

timmie commented Aug 23, 2013

@jreback

again to your comment regarding the keywords:
can we not achieve the timedelta by applying shift?

So it would be a pure Excel issue, which was fixed with my change.
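
A minimal sketch of that idea, assuming the data come back with an end-of-interval DatetimeIndex; the frame and column name below are only illustrative:

import pandas as pd

# illustrative frame whose index labels the end of each hourly interval
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]},
                  index=pd.date_range("2008-10-01 01:00", periods=3, freq="H"))

# with freq given, shift moves the index (not the data) back by one hour,
# so the labels mark the start of each interval instead of the end
df = df.shift(-1, freq="H")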

@hayd
Contributor

hayd commented Aug 23, 2013

@jreback This sounds magical, not sure how this would work

that's a nice idea....sort of like the dtype keyword...in fact, why couldn't we hijack that? (accept a callable for the a dtype, which will result in it being parsed using that callable? from a string)?

@timmie there is far too much going on in this PR, I guess you know that, but the git idiom is to keep one feature (set) per branch (we should add this to the wiki to make it clearer). Then they can be inspected and discussed independently, and small parts are easier. If you can, separate these into different PRs moving forward.

@jreback
Contributor

jreback commented Aug 23, 2013

@hayd apparently there is a converters argument to read_csv that does exactly this (applies a function to parse a specific column)

so @timmie in this case calling read_csv with

converters={'column_name' : offset_datetimes}

will do the right thing
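
A minimal sketch of that call; the file name, column name, and the converter body below are purely illustrative, not the actual offset_datetimes from this PR:

import pandas as pd

# read_csv applies a converter to every cell of the named column (as a string)
def fix_hour(value):
    return int(value) % 24   # e.g. map an hour of "24" to 0

df = pd.read_csv("data.csv", converters={"hour": fix_hour})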

@jreback
Contributor

jreback commented Aug 23, 2013

@timmie you are welcome to start with the PR that I modified and add the additional tests I talked about

this current PR just has too much going on to be acceptable

@timmie
Contributor Author

timmie commented Aug 23, 2013

@jreback

OK, right. I got the idea of using multiple branches & PRs.

The only things I'm still uncertain about at the moment:

  • Which keyword do we use for the datemode fix in Excel reading, for the case of my test data?
    • My tests cover the parser fix. Can this be implemented as well, since it is done as an additional condition and thus should not break other code (tests pass)?

@jreback
Contributor

jreback commented Aug 23, 2013

@timmie start with the PR that I showed you

add a column of regular datetimes to the test files, leave parse_dates=True and don't use date_parser arguments; instead use converters={ 'column_name/number' : offset_datetimes }

in your test

so that the converter ONLY works on that particular column
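
A rough sketch of what such a test could look like, assuming the converters keyword gets wired through the Excel reader as proposed; the file, sheet, and column names are illustrative:

import pandas as pd

def test_converter_only_touches_one_column():
    # 'times_1to24.xls' stands for a test file with a 1-24 hour column
    # ('hour') next to a column of regular datetimes
    xls = pd.ExcelFile("times_1to24.xls")
    df = xls.parse("Sheet1",
                   parse_dates=True,
                   converters={"hour": lambda v: int(v) - 1})

    # the converter is applied only to 'hour'; the regular datetime
    # column is left to the normal date parsing
    assert df["hour"].max() <= 23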

@timmie
Contributor Author

timmie commented Aug 23, 2013

Yes, for the test it's all ok.

But which keyword should trigger the correction in line
https://github.com/timmie/pandas/blob/excel_read_day-hours/pandas/io/excel.py#L216
in case the condition in
https://github.com/timmie/pandas/blob/excel_read_day-hours/pandas/io/excel.py#L205
requires a datemode correction?

@jreback
Contributor

jreback commented Aug 23, 2013

you should handle the converters argument instead; if the column is defined (by number or name) then apply its converter

@cpcloud cpcloud merged commit fac7d1d into pandas-dev:master Sep 25, 2013
@timmie
Contributor Author

timmie commented Sep 25, 2013

@cpcloud
What happened?
Did you merge the PR despite the outstanding comments?
I fully reset the branch this afternoon to account for your comments and redo my changes from scratch based on the discussion ;-(

@jtratner
Contributor

@timmie we're pretty sure this is a GitHub issue where you pushed something that matched current master, and it automatically closed the PR since it matched master. You can just push your branch again and open a new pull request.
