excel reader & skip row between data & header & docs #4631

Merged Sep 25, 2013

Conversation

timmie
Contributor

@timmie timmie commented Aug 21, 2013

the merge conflict chaos was too hard ;-(

This is a continuation of:
#4404

Because:

  • we apparently hit a GitHub bug
  • the branches diverged too much.

I shall fix:

@@ -2015,6 +2024,30 @@ def test_iteration_open_handle(self):
expected = Series(['DDD', 'EEE', 'FFF', 'GGG'])
tm.assert_series_equal(result, expected)

def test_infer_columns(self):
Contributor

this should be in io/tests/test_excel.py

Contributor Author

I thought it belonged there because it tests my modification of parsers.py

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback @jtratner

So Travis says we can merge. I've done my best to clean up the code.

What are the (last) steps to get this through?

@jreback
Contributor

jreback commented Aug 22, 2013

@timmie let me have a detailed look

@timmie
Contributor Author

timmie commented Aug 22, 2013

thanks.

@jtratner
Contributor

@timmie if you tell me the specific files that need to be included just for this (in your opinion) I'd be willing to rebase your branch for you - I know that can be a bit complicated to do.

@jreback
Contributor

jreback commented Aug 22, 2013

@jtratner I am almost done...

@jtratner
Contributor

@jreback ah, okay.

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jtratner

Thanks.

So this one should just include the *.py file changes:
io/excel.py
io/parsers.py
io/date_converters.py

& the respective tests.

@timmie
Contributor Author

timmie commented Aug 22, 2013

The docs could then go elsewhere.

The doc changes are:

  • document the fix for this issue
  • add shortcut links to handy references in the docs
  • bring all integer indices methods together
  • add section about working with the datetime index to timeseries.rst

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jtratner

Privately, I prefer BZR. So with git I always feel like I don't have control over my files (accidentally deleting changes) for everything except pull, commit, and push ;-(
But it seems I have learned from you guys how to write simple tests. The rest may come with time.

@jtratner
Contributor

@timmie yeah, git can be complicated, you might want to check out scm_breeze, which might make it easier.

I'm sure @jreback will have additional comments, but you need to split out the date_parser changes from the row skipping changes and the doc changes.

@jtratner
Contributor

@timmie to be clear, I don't mean that this will (or will not) get accepted - I just mean that I'll help you take out all the changes that need to be separated from this PR

@timmie
Contributor Author

timmie commented Aug 22, 2013

Uff, so it's on my desk again. I thought I was done ;-(

@jreback
Contributor

jreback commented Aug 22, 2013

https://github.com/jreback/pandas/compare/timmie_4634?expand=1

@timmie this needs an additional test for what happens if you have dates in the Excel files that you don't want converted with offset_datetime (I think it will not work), which I guess is ok since you can always convert them later.
Essentially what you want is to apply the date_parser to only certain columns (which you can do with `parse_dates=['A']`), BUT you cannot then parse the 'other' dates.

any thoughts here?
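
For reference, a minimal read_csv sketch of the limitation described above; offset_datetimes here is only a stand-in for the converter from this PR (its real definition lives in io/date_converters.py), and the file and column names are illustrative:

import datetime as dt
import pandas as pd

# stand-in for the converter added in this PR (exact signature is an assumption):
# shift end-of-interval timestamps back by one hour
def offset_datetimes(values):
    return pd.to_datetime(values) - dt.timedelta(hours=1)

# date_parser is applied to every column listed in parse_dates, so a second
# column of regular dates in the same file cannot get a different parser
df = pd.read_csv("data.csv", parse_dates=["A"], date_parser=offset_datetimes)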

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback
The Excel file included is also a corner case, counting hours 1-24.

If the hours are counted 0-23, there is no need for conversion.

Or do I misunderstand you here?

@jreback
Contributor

jreback commented Aug 22, 2013

@timmie your Excel files were fine

add a column to one of them with regular dates; I am not sure what it will do with those (e.g. it SHOULD NOT apply the offset_datetime to them), but there is no easy way to tell it not to (well, you could make multiple passes on the file: in pass 1 use the offset_datetime, in pass 2 do the other dates)

@timmie
Contributor Author

timmie commented Aug 22, 2013

essentially what you want is to apply the date_parser to only certain columns (which you can do with `parse_dates=['A']`), BUT you cannot then parse 'other' dates

Ah, you mean like combining columns such as date and hour into one?

Yes, this is currently not supported.
Would it be bad to convert that later?
We already assume that datetime / time is read in.

But we still have a problem with creating a datetime index from two Excel columns.

This was not the scope of my fix, but it is still a deficiency of the Excel reader.

@jreback
Contributor

jreback commented Aug 22, 2013

The issue I am getting at is that the offset_datetime cannot 'know' that a column is meant for timedelta conversion (rather than a plain old datetime conversion). I am thinking that we need another set of keywords, e.g. parse_timedelta, timedelta_parser (which we actually need in another case anyway)

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback

I agree, but how does read_csv handle it?
Don't we want similar behaviour?

I am going offline now -- apparently it's a different timezone at your place.
Thanks so far. Let's try to get this into main pandas by Saturday at the latest. If not, I will have to wait some weeks until I can look at it again.

@jreback
Contributor

jreback commented Aug 22, 2013

main read_csv doesn't handle this...have to think about it

@cpcloud @jtratner @hayd

parse_timedelta and timedelta_parser keywords for read_csv?

@timmie
Contributor Author

timmie commented Aug 22, 2013

I think we should now have tests for both cases that you are aiming at.

Here's how I handled the same problem with CSV input (back then with scikits.timeseries):

import datetime as dt
import scikits.timeseries as ts


def dc_date_time_1to24(date_str, time_str, freq='T'):
    """
    .. csv-table:: Normal Time Series: 00:00-23:59
           :header: "Date", "Time"
           :delim: ;

           01.10.2008;00:10
           01.10.2008;00:20

           [...];[...];[...];[...];[...]
           01.10.2008;23:50
           01.10.2008;00:00
           02.10.2008;00:10

    Note
    ----
    assumed datecols::

        datecols = (0, 1)

    """
    # parse the date and time columns separately
    date_dt = dt.datetime.strptime(date_str, "%d.%m.%Y")
    time_dt = dt.datetime.strptime(time_str, "%H:%M")

    # shift the time back by one 10-minute interval and keep only the time part
    time_dt = time_dt - dt.timedelta(minutes=10)
    time_dt = time_dt.time()

    # combine date and time, then wrap as a scikits.timeseries Date at freq
    dt_concat = dt.datetime.combine(date_dt, time_dt)
    ts_date = ts.Date(freq, datetime=dt_concat)

    return ts_date

I would assume that this works with the pandas CSV reader as well, but not with the Excel reader.

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback
I think a solution would be to merge this PR to get the current situation fixed.

Meanwhile, we open a new issue with the points raised, and then improve the Excel reader generally so it behaves similarly to read_csv.

But I also think that the Excel reader was not equipped with that much magic because the basic assumption is:

  • The data provider would have formatted cells in Excel the right way
  • entered the data according to standards (e.g. ISO 8601 for datetime)

So any further actions could be done with the data frame read in, e.g.:

  • merge date and hour column to datetime
  • set index

Now, in my test case where the time is not entered in ISO form (1-24 instead of 0-23), the reader fails to recognise the dates at all. This was the main reason for the fix.

The date converters are actually only "by-products" of the tests, but they will be useful in similar situations.
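
For illustration, a minimal sketch of the kind of converter this is about; the name, date format, and signature below are assumptions, not the actual offset_datetime added in this PR:

import datetime as dt

# hypothetical converter: interpret an hour counted 1-24 as hour - 1 in the
# usual 0-23 range, so "24" on a given day becomes 23:00 of that day
def offset_hours_1to24(date_str, hour_str):
    day = dt.datetime.strptime(date_str, "%d.%m.%Y")
    return day + dt.timedelta(hours=int(hour_str) - 1)

For example, offset_hours_1to24("01.10.2008", "24") gives 2008-10-01 23:00 instead of failing on the out-of-range hour.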

@timmie
Contributor Author

timmie commented Aug 22, 2013

At
https://travis-ci.org/timmie/pandas/builds/10491379
you can see that #4634 passes.

So there you have them separate.

@jreback
I have changed my git workflow:

  • keep master always in sync with upstream
  • branch off master for each new feature
  • regularly rebase / pull

Maybe this will make things easier?

@jreback
Contributor

jreback commented Aug 22, 2013

@timmie try this

git remote add jreback https://github.com/jreback/pandas.git && git fetch jreback && git checkout -b timmie_4634 --track jreback/timmie_4634

this will give you a starting point of the branch I created

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback

@jreback
Contributor

jreback commented Aug 22, 2013

@timmie

as I said, you are free to submit doc corrections, but as a separate PR (unless the issue you are fixing actually needs a doc change).

@timmie
Contributor Author

timmie commented Aug 22, 2013

Ok, I'll do that.

we can add this to the development FAQ wiki.

@hayd
Contributor

hayd commented Aug 22, 2013

@timmie the new git workflow sounds good. As Jeff says, you should do separate PRs for each feature (and that's a good git mentality to have in general); don't do lots of things in one PR, concentrate on one thing (e.g. a PR should include a test for a bug, the fix for that bug, and a release note about the fixed bug, ideally all in one commit; or a PR should contain improvements to the docs on a certain thing).

@jreback is there any kind of standard for timedeltas (I can't recall coming across any), or are you suggesting it only works with a user-defined parser?

@jreback
Contributor

jreback commented Aug 22, 2013

@hayd timedeltas are not supported right now for writing to csv (though you can kind of hack it), nor for reading at all (though again, if you write them as integers and then coerce back, it works)

I am thinking that read_csv needs parse_timedelta= and timedelta_parser= in order to really support it
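
A minimal sketch of that round-trip hack, assuming the timedeltas are stored as integer seconds; the file and column names are illustrative:

import numpy as np
import pandas as pd

# write: store the timedelta column as integer seconds
df = pd.DataFrame({"td": [np.timedelta64(90, "s"), np.timedelta64(3600, "s")]})
df["td_seconds"] = df["td"].values.astype("timedelta64[s]").astype(np.int64)
df[["td_seconds"]].to_csv("out.csv", index=False)

# read: coerce the integers back into timedeltas
back = pd.read_csv("out.csv")
back["td"] = back["td_seconds"].values.astype("timedelta64[s]")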

@hayd
Contributor

hayd commented Aug 22, 2013

@jreback seems reasonable; it will be more useful once we have Timestamp-like efficient timedeltas. Don't know if there is a market for it...

I wonder if it makes sense for users to put in multiple arbitrary parsers, i.e. not just for timedeltas.

(tbh I usually just munge after the fact as it's so easy..)

@jreback
Contributor

jreback commented Aug 22, 2013

@hayd that's a nice idea....sort of like the dtype keyword...in fact, why couldn't we hijack that? (accept a callable for the dtype, which would result in the column being parsed from a string using that callable?)

@jreback
Contributor

jreback commented Aug 22, 2013

@hayd maybe open a new issue for that one?

@timmie
Contributor Author

timmie commented Aug 22, 2013

@hayd

is there any kind of standard for timedeltas (I can't recall coming across any) or

What do you mean by standard (Python or global)?
ISO 8601 has something on this.

@timmie
Contributor Author

timmie commented Aug 22, 2013

@jreback

I am thinking that read_csv needs parse_timedelta= and timedelta_parser= in order to really support it

Did you see the example correction function above? I think it also works with pandas.

So I summarise for my issues:

res = TextFileReader(*args, **kwds)


return res
Contributor

think you should revert this

@hayd
Contributor

hayd commented Aug 22, 2013

@timmie Yeah, I meant some kind of global standard for timedeltas in csv. I see the ISO wiki mentions it briefly (either start and end times, or something like "P1Y2M10DT2H30M"... I haven't seen that before).

@jreback
Contributor

jreback commented Aug 22, 2013

FYI our 'standard' now is this, but it's really a display format:

<n> days, hh:mm:ss.fraction

1 days, 3:05:02.0003

@timmie
Contributor Author

timmie commented Aug 23, 2013

@jreback

again to your comment regarding the keywords:
can we not achieve the timedelta by applying shift?

So it would be a pure Excel issue, which was fixed with my change.
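
A minimal sketch of that idea, assuming the data come back with an end-of-interval DatetimeIndex; the frame and column name below are only illustrative:

import pandas as pd

# illustrative frame whose index labels the end of each hourly interval
df = pd.DataFrame({"value": [1.0, 2.0, 3.0]},
                  index=pd.date_range("2008-10-01 01:00", periods=3, freq="H"))

# with freq given, shift moves the index (not the data) back by one hour,
# so the labels mark the start of each interval instead of the end
df = df.shift(-1, freq="H")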

@hayd
Contributor

hayd commented Aug 23, 2013

@jreback This sounds magical, not sure how this would work

that's a nice idea....sort of like the dtype keyword...in fact, why couldn't we hijack that? (accept a callable for the a dtype, which will result in it being parsed using that callable? from a string)?

@timmie there is far too much going on in this PR, I guess you know that, but the git idiom is to keep one feature (set) per branch (we should add this to the wiki to make it clearer). Then they can be inspected and discussed independently, and small parts are easier. If you can, separate these into different PRs moving forward.

@jreback
Contributor

jreback commented Aug 23, 2013

@hayd apparently there is a converters argument to read_csv that does exactly this (applies a function to parse a specific column)

so @timmie in this case calling read_csv with

converters={'column_name' : offset_datetimes}

will do the right thing
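
A minimal sketch of that call; the file name, column name, and the converter body below are purely illustrative, not the actual offset_datetimes from this PR:

import pandas as pd

# read_csv applies a converter to every cell of the named column (as a string)
def fix_hour(value):
    return int(value) % 24   # e.g. map an hour of "24" to 0

df = pd.read_csv("data.csv", converters={"hour": fix_hour})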

@jreback
Contributor

jreback commented Aug 23, 2013

@timmie you are welcome to start with the PR that I modified and add the additional tests I talked about

this current PR just has too much going on to be acceptable

@timmie
Contributor Author

timmie commented Aug 23, 2013

@jreback

OK, right. I got the idea of using multiple branches & PRs.

The only things I'm still uncertain about at the moment:

  • Which keyword do we use for the datemode fix in Excel reading, for the case of my test data?
    • My tests cover the parser fix. Can this be implemented as well, since it is done as an additional condition and thus should not break other code (tests pass)?

@jreback
Contributor

jreback commented Aug 23, 2013

@timmie start with the PR that I showed you

add a column of regular datetimes to the test files, leave parse_dates=True and don't use date_parser arguments; instead use converters={ 'column_name/number' : offset_datetimes }

in your test

so that the converter ONLY works on that particular column
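
A rough sketch of what such a test could look like, assuming the converters keyword gets wired through the Excel reader as proposed; the file, sheet, and column names are illustrative:

import pandas as pd

def test_converter_only_touches_one_column():
    # 'times_1to24.xls' stands for a test file with a 1-24 hour column
    # ('hour') next to a column of regular datetimes
    xls = pd.ExcelFile("times_1to24.xls")
    df = xls.parse("Sheet1",
                   parse_dates=True,
                   converters={"hour": lambda v: int(v) - 1})

    # the converter is applied only to 'hour'; the regular datetime
    # column is left to the normal date parsing
    assert df["hour"].max() <= 23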

@timmie
Contributor Author

timmie commented Aug 23, 2013

Yes, for the test it's all ok.

But which keyword should trigger the correction in line
https://github.com/timmie/pandas/blob/excel_read_day-hours/pandas/io/excel.py#L216
in case the condition in
https://github.com/timmie/pandas/blob/excel_read_day-hours/pandas/io/excel.py#L205
requires a datemode correction?

@jreback
Contributor

jreback commented Aug 23, 2013

you should handle the converters argument instead; if the column is defined (by number or name) then apply its converter

@cpcloud cpcloud merged commit fac7d1d into pandas-dev:master Sep 25, 2013
@timmie
Contributor Author

timmie commented Sep 25, 2013

@cpcloud
What happened?
Did you merge the PR despite the outstanding comments?
I fully reset the branch this afternoon to account for your comments and redo my changes from scratch based on the discussion ;-(

@jtratner
Contributor

@timmie we're pretty sure this is a GitHub issue where you pushed something that matched current master, and it automatically closed the PR since it matched master. You can just push your branch again and open a new pull request.
