
Bloomberg Hackathon #8323

Closed
jreback opened this issue Sep 19, 2014 · 41 comments

@jreback
Contributor

jreback commented Sep 19, 2014

Contributing Guidelines / Help:
https://github.com/pydata/pandas/wiki

Dev Docs
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html

Docs:

Perf:

Tests:

Bugs:

Enhancements:

IO:

Excel Oriented:

SQL:

More advanced:

Collaborative Efforts:

@jorisvandenbossche @cpcloud @TomAugspurger @hayd
cc @shoyer
cc @immerrr

@jreback jreback added this to the 0.15.0 milestone Sep 19, 2014
@jreback
Contributor Author

jreback commented Sep 19, 2014

cc @seth-p
cc @rockg

@jreback
Contributor Author

jreback commented Sep 19, 2014

Most of these are doc/testing things. I looked through the "Good as first PR" list. Does anyone have any issues to add that are not on that list?

@jtratner
Contributor

More customization of Excel input/output could be great, i.e. making it easier to specify per-column colors/formatting, float formats, etc. The code base isn't too complicated there (just a mixture of the formatter and the ExcelWriter stuff), and you could make rapid progress because it's really easy to test and create samples. I think the result would be immediately rewarding (better-looking output, easier report generation, etc.). Plus, for #4679 and #8272 you'd get a better sense of pandas internals too.
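
A rough sketch of how this kind of per-column styling can be done today by reaching into the xlsxwriter objects underneath ExcelWriter (the file name, sheet name, and formats below are made up for illustration); the enhancement would be to expose something like this through to_excel itself:

import pandas as pd

df = pd.DataFrame({'price': [1.2345, 2.5], 'qty': [3, 4]})

# write with the xlsxwriter engine, then grab the underlying workbook/worksheet
writer = pd.ExcelWriter('report.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
workbook = writer.book
worksheet = writer.sheets['Sheet1']

# per-column number format and fill colour for the 'price' column
# (column B, since the index is written in column A)
money = workbook.add_format({'num_format': '0.00', 'bg_color': '#DDEBF7'})
worksheet.set_column('B:B', 18, money)

writer.save()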

List of PRs (ordered from most interesting/most impact to least interesting):

@jreback
Contributor Author

jreback commented Sep 19, 2014

@jtratner thanks! i'll update!

@jtratner
Contributor

One other great (but self-contained) project would be to convert a pandas DataFrame into a new BigQuery table when writing. I've been working with BigQuery quite a bit; it would be pretty simple to do and would be a nice way to dig into dealing with column metadata. I'll put up an issue right now with more details.
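
For reference, a minimal sketch of the usage this would enable; the call below follows the interface that later landed as DataFrame.to_gbq / pandas-gbq, and the project, dataset, and table names are invented:

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})

# create a brand-new BigQuery table from the frame (requires Google credentials)
df.to_gbq('my_dataset.new_table', project_id='my-gcp-project', if_exists='fail')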

@jreback
Contributor Author

jreback commented Sep 19, 2014

@jtratner thanks! that would be great!

@TomAugspurger
Contributor

@jtratner
Contributor

@jreback - I put it up, should be pretty simple to implement, except for how to handle int columns that gain NaN values. #8325
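
(The int/NaN wrinkle is easy to demonstrate: NumPy integer dtypes have no NaN, so introducing a missing value silently upcasts the column to float.)

import pandas as pd

s = pd.Series([1, 2, 3])    # dtype: int64
s.reindex([0, 1, 2, 3])     # dtype: float64, and the new row is NaN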

@rockg
Contributor

rockg commented Sep 19, 2014

These are things that I would like to see:

Allow reindex to work without passing a completely new multi-index (i.e., reindexing a level copies the other levels) #7895 (see the sketch after this list)
Some HDFStore enhancements which should be straightforward #6857
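
To make the reindex request concrete, a small sketch with made-up data: today you rebuild the whole MultiIndex by hand instead of targeting a single level.

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['a', 'b'], [0, 1]], names=['l1', 'l2'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=['x', 'y'])

# desired: something like df.reindex([0, 1, 2], level='l2')
# current workaround: construct the complete new index and reindex on it
new_idx = pd.MultiIndex.from_product([['a', 'b'], [0, 1, 2]], names=['l1', 'l2'])
df.reindex(new_idx)   # the rows for l2 == 2 are filled with NaN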

@jreback
Contributor Author

jreback commented Sep 19, 2014

thanks @TomAugspurger @rockg @jtratner

@jorisvandenbossche
Member

For the doc issues, maybe also #3705 and #1967?

@jreback
Contributor Author

jreback commented Sep 20, 2014

@jorisvandenbossche thanks!

@jorisvandenbossche
Member

I was also thinking that some utility function that can 'read' the printed output of a DataFrame back in would be nice (for simple situations you can use read_csv, or better read_fwf, but for more complex things (index names, multi-indexes, ...) this no longer works, I think)

@jreback
Contributor Author

jreback commented Sep 20, 2014

@jorisvandenbossche not sure what you mean. Except for the column index losing its name (not a multi-index though), csv round-tripping preserves everything.

@jorisvandenbossche
Member

but I do not mean csv roundtripping, I mean console print roundtripping

@jorisvandenbossche
Member

Is there an easy way to read this in (the output as a string)?

In [1]: df = pd.DataFrame(np.random.randn(4,4), index=pd.MultiIndex.from_product([['a', 'b'],[0,1]], names=['l1', 'l2']), columns=['a', 'b', 'c', 'd'])

In [2]: df
Out[2]: 
              a         b         c         d
l1 l2                                        
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500

Dealing with the multi-index, the sparse index, index names, ... (or, to start with, not failing on those)

@jreback
Contributor Author

jreback commented Sep 20, 2014

I think the clipboard is pretty robust (it's just read_csv underneath). It needs various options specified, but csv is not a completely fungible format anyhow (unlike, say, HDF5, where you CAN store the meta-data).

In [25]: df.to_clipboard()

In [26]: pd.read_clipboard()
Out[26]: 
  l1  l2         a         b         c         d
0  a   0 -0.114687 -0.111372  1.116020 -1.127915
1  a   1  1.493011 -0.208416 -0.129818 -0.023854
2  b   0  0.904737 -0.213157 -0.214423  0.300431
3  b   1  0.043716 -0.027796 -0.462323  0.298288

In [29]: pd.read_clipboard(index_col=[0,1])
Out[29]: 
              a         b         c         d
l1 l2                                        
a  0  -0.114687 -0.111372  1.116020 -1.127915
   1   1.493011 -0.208416 -0.129818 -0.023854
b  0   0.904737 -0.213157 -0.214423  0.300431
   1   0.043716 -0.027796 -0.462323  0.298288

@jorisvandenbossche
Member

Yes, but what I mean is: if you have this output as a string, or you can copy it (e.g. from an example in the docs, from a question on Stack Overflow, ...), can you easily convert it to a DataFrame in a new session? And using read_clipboard on my example above, e.g., gives CParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6

@jreback
Contributor Author

jreback commented Sep 20, 2014

@jorisvandenbossche hmm works for me on master.

I usually just copy-paste from a question and do this:

data = """

here is the copied data exactly.....



"""
df = read_csv(StringIO(data))

FYI, I tried making this work from just a string (e.g. passing it straight to read_csv); it's a bit non-trivial to figure this out, actually.

@jorisvandenbossche
Member

Yep, that is what I also do, but still, I mostly have to adapt something in the original data to get it working. It would be nice if there were some utility that could read any such output.

data = """              a         b         c         d
l1 l2                                        
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500"""

pd.read_csv(StringIO(data), sep='\s+')

Can you read this in with read_csv without tweaking something? (but we're deviating a bit from the original issue here ...)

@jreback
Contributor Author

jreback commented Sep 20, 2014

@jorisvandenbossche I suppose you could have a wrapper that 'tries' various things, but it's non-trivial to simply guess; well, you can, but there are so many edge cases that it's MUCH easier to just have the user specify it.

@jorisvandenbossche
Member

Are there that many edge cases? The output of the pandas __repr__ is rather well defined, or not?

@jreback
Contributor Author

jreback commented Sep 20, 2014

ahh, you are proposing a pd.read_csv(data, repr=True) (so that we don't have ANOTHER top-level function!) that basically figures out the options. Hmm, interesting.

@jreback
Contributor Author

jreback commented Sep 20, 2014

@jorisvandenbossche I updated in the Enhancements section.

@jorisvandenbossche
Member

maybe it doesn't need to be top-level; another possibility is something like pd.util.read_repr. Seems like a nice little project for somebody to hack on that could be useful
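
A minimal sketch of what such a helper could look like for the two-level example earlier in this thread; the name read_repr and the whitespace-based parsing are assumptions, and it only handles this sparse, whitespace-separated case (index labels stay strings):

import pandas as pd

def read_repr(text):
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()          # column labels from the first line
    index_names = lines[1].split()      # index level names from the second line
    n_levels = len(index_names)

    rows, tuples = [], []
    last = [''] * n_levels
    for line in lines[2:]:
        parts = line.split()
        values = parts[-len(columns):]                   # the data values
        labels = parts[:len(parts) - len(columns)]       # whichever index labels were printed
        labels = last[:n_levels - len(labels)] + labels  # re-fill the sparse repr
        last = labels
        tuples.append(tuple(labels))
        rows.append([float(v) for v in values])

    index = pd.MultiIndex.from_tuples(tuples, names=index_names)
    return pd.DataFrame(rows, index=index, columns=columns)

data = """              a         b         c         d
l1 l2
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500"""

read_repr(data)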

@jorisvandenbossche
Member

#8336

@TomAugspurger
Contributor

Oh, #5563 would be a good one (Series HTML repr)
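
The core of #5563 is giving Series the _repr_html_ hook that IPython already picks up on DataFrame; a crude sketch of the idea (rendering via a one-column frame is a shortcut for illustration, not the proposed implementation):

import pandas as pd

def series_repr_html(s):
    # render the Series as a one-column HTML table, like DataFrame._repr_html_ does
    return pd.DataFrame(s).to_html()

s = pd.Series([1, 2, 3], name='x')
html = series_repr_html(s)   # what a Series._repr_html_ could return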

@jreback
Contributor Author

jreback commented Sep 21, 2014

@TomAugspurger

nice posts you have here: tomaugspurger.github.io/blog/2014/09/04/practical-pandas-part-2-more-tidying-more-data-and-merging

About halfway down you pass method='table' to to_hdf, which is ignored (and means you get a perf warning); use format='table' and it will work.
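
i.e. (file and key names here are just placeholders):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_hdf('data.h5', 'df', format='table')   # format=, not method=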

@shoyer
Member

shoyer commented Sep 22, 2014

Would #8162 (allowing the index to be referenced by name, like a column) be doable?

I would love to see something like this happen in SF!

@cpcloud
Member

cpcloud commented Sep 22, 2014

wow this is great everyone bravo!

@jreback
Contributor Author

jreback commented Sep 22, 2014

let's see how much gets done!

of course the point of this list was to get as much dev time as possible (at the expense of other projects of course) :)

@cpcloud
Member

cpcloud commented Sep 22, 2014

in case a brave soul would like to venture into the land of numpy internals:

Missing data support in numpy: #8350

@cpcloud
Member

cpcloud commented Sep 22, 2014

nice-to-have:

pandas + airspeed velocity

demo: http://mdboom.github.io/astropy-benchmark/

adding to the top list
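
For anyone who hasn't used asv: benchmarks are plain Python classes with a setup method and time_*-prefixed methods that asv times. A minimal, made-up sketch of what a pandas benchmark could look like:

# benchmarks/groupby.py  (hypothetical file in an asv benchmark suite)
import numpy as np
import pandas as pd

class GroupBySum(object):
    def setup(self):
        n = 100000
        self.df = pd.DataFrame({'key': np.random.randint(0, 100, n),
                                'val': np.random.randn(n)})

    def time_groupby_sum(self):
        self.df.groupby('key')['val'].sum()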

@jreback
Contributor Author

jreback commented Sep 22, 2014

that looks sexy - can you create a new issue for asv? vbench-like

@cpcloud
Member

cpcloud commented Sep 22, 2014

yep. vbench was actually mentioned in the asv talk at scipy 2014

@jankatins
Contributor

Implementing a CategoricalIndex #7629?

@jreback
Contributor Author

jreback commented Sep 23, 2014

@JanSchulz I think out of scope for a 1-day event

@jasongrout

in case a brave soul would like to venture into the land of numpy internals:

I should mention that Mark Wiebe (who knows a lot of numpy internals) will be there. Additionally (re: airspeed velocity), Michael Droettboom will be there.

@immerrr
Contributor

immerrr commented Oct 4, 2014

Is there a summary of this hackathon available online?

@jreback
Contributor Author

jreback commented Oct 4, 2014

no summary - a few issues worked on / closed

@jorisvandenbossche
Member

@ all
I am going to update this in the coming days for the upcoming Bloomberg Hackathon this weekend, 29-30 November. But if you have new things to add or updates to the list above, please post/edit!
