
Bloomberg Hackathon #8323

Closed
jreback opened this issue Sep 19, 2014 · 41 comments

@jreback
Contributor

jreback commented Sep 19, 2014

Contributing Guidelines / Help:
https://github.com/pydata/pandas/wiki

Dev Docs
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html

Docs:

Perf:

Tests:

Bugs:

Enhancements:

IO:

Excel Oriented:

SQL:

More advanced:

Collaborative Efforts:

@jorisvandenbossche @cpcloud @TomAugspurger @hayd
cc @shoyer
cc @immerrr

@jreback jreback added this to the 0.15.0 milestone Sep 19, 2014
@jreback
Contributor Author

jreback commented Sep 19, 2014

cc @seth-p
cc @rockg

@jreback
Contributor Author

jreback commented Sep 19, 2014

Most of these are doc/testing things. I looked through the "Good as first PR" list. Does anyone have any issues to add that are not on that list?

@jtratner
Contributor

More customization of Excel input/output could be great, i.e. making it easier to specify per-column colors/formatting, float formats, etc. The code base isn't too complicated there (just a mixture of the formatter and the ExcelWriter stuff), and you could make rapid progress because it's really easy to test and create samples. I think the result would be immediately rewarding (better-looking output, easier report generation, etc.). Plus, for #4679 and #8272 you'd get a better sense of pandas internals too.
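
A rough sketch of how this kind of per-column styling can be done today by reaching into the xlsxwriter objects underneath ExcelWriter (the file name, sheet name, and formats below are made up for illustration); the enhancement would be to expose something like this through to_excel itself:

import pandas as pd

df = pd.DataFrame({'price': [1.2345, 2.5], 'qty': [3, 4]})

# write with the xlsxwriter engine, then grab the underlying workbook/worksheet
writer = pd.ExcelWriter('report.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
workbook = writer.book
worksheet = writer.sheets['Sheet1']

# per-column number format and fill colour for the 'price' column
# (column B, since the index is written in column A)
money = workbook.add_format({'num_format': '0.00', 'bg_color': '#DDEBF7'})
worksheet.set_column('B:B', 18, money)

writer.save()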

List of PRs (ordered from most interesting/most impact to least interesting):

@jreback
Contributor Author

jreback commented Sep 19, 2014

@jtratner thanks! i'll update!

@jtratner
Contributor

One other great (but self-contained) project would be to convert a pandas DataFrame into a new BigQuery table when writing. I've been working with BigQuery quite a bit; it would be pretty simple to do and would be a nice way to dig into dealing with column metadata. I'll put up an issue right now with more details.
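
For reference, a minimal sketch of the usage this would enable; the call below follows the interface that later landed as DataFrame.to_gbq / pandas-gbq, and the project, dataset, and table names are invented:

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'value': [1, 2]})

# create a brand-new BigQuery table from the frame (requires Google credentials)
df.to_gbq('my_dataset.new_table', project_id='my-gcp-project', if_exists='fail')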

@jreback
Contributor Author

jreback commented Sep 19, 2014

@jtratner thanks! that would be great!

@TomAugspurger
Contributor

@jtratner
Contributor

@jreback - I put it up, should be pretty simple to implement, except for how to handle int columns that gain NaN values. #8325
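
(The int/NaN wrinkle is easy to demonstrate: NumPy integer dtypes have no NaN, so introducing a missing value silently upcasts the column to float.)

import pandas as pd

s = pd.Series([1, 2, 3])    # dtype: int64
s.reindex([0, 1, 2, 3])     # dtype: float64, and the new row is NaN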

@rockg
Contributor

rockg commented Sep 19, 2014

These are things that I would like to see:

Allow reindex to work without passing a completely new multi-index (i.e., reindexing a level copies the other levels) #7895 (see the sketch after this list)
Some HDFStore enhancements which should be straightforward #6857
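
To make the reindex request concrete, a small sketch with made-up data: today you rebuild the whole MultiIndex by hand instead of targeting a single level.

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['a', 'b'], [0, 1]], names=['l1', 'l2'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=['x', 'y'])

# desired: something like df.reindex([0, 1, 2], level='l2')
# current workaround: construct the complete new index and reindex on it
new_idx = pd.MultiIndex.from_product([['a', 'b'], [0, 1, 2]], names=['l1', 'l2'])
df.reindex(new_idx)   # the rows for l2 == 2 are filled with NaN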

@jreback
Contributor Author

jreback commented Sep 19, 2014

thanks @TomAugspurger @rockg @jtratner

@jorisvandenbossche
Member

For the doc issues, maybe also #3705 and #1967?

@jreback
Contributor Author

jreback commented Sep 20, 2014

@jorisvandenbossche thanks!

@jorisvandenbossche
Member

I was also thinking that some utility function that can 'read' the printed output of a DataFrame back in would be nice (for simple situations you can use read_csv, or better read_fwf, but for more complex things (index names, multi-indexes, ...) this no longer works, I think)

@jreback
Contributor Author

jreback commented Sep 20, 2014

@jorisvandenbossche not sure what you mean. Except for the column index losing its name (not a multi-index though), csv round-tripping preserves everything.

@jorisvandenbossche
Member

but I do not mean csv roundtripping, I mean console print roundtripping

@jorisvandenbossche
Member

Is there an easy way to read this in (the output as a string)?

In [1]: df = pd.DataFrame(np.random.randn(4,4), index=pd.MultiIndex.from_product([['a', 'b'],[0,1]], names=['l1', 'l2']), columns=['a', 'b', 'c', 'd'])

In [2]: df
Out[2]: 
              a         b         c         d
l1 l2                                        
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500

Dealing with the multi-index, the sparse index, index names, ... (or, to start with, not failing on those)

@jreback
Contributor Author

jreback commented Sep 20, 2014

I think the clipboard is pretty robust (it's just read_csv underneath). It needs various options specified, but csv is not a completely fungible format anyhow (unlike, say, HDF5, where you CAN store the meta-data).

In [25]: df.to_clipboard()

In [26]: pd.read_clipboard()
Out[26]: 
  l1  l2         a         b         c         d
0  a   0 -0.114687 -0.111372  1.116020 -1.127915
1  a   1  1.493011 -0.208416 -0.129818 -0.023854
2  b   0  0.904737 -0.213157 -0.214423  0.300431
3  b   1  0.043716 -0.027796 -0.462323  0.298288

In [29]: pd.read_clipboard(index_col=[0,1])
Out[29]: 
              a         b         c         d
l1 l2                                        
a  0  -0.114687 -0.111372  1.116020 -1.127915
   1   1.493011 -0.208416 -0.129818 -0.023854
b  0   0.904737 -0.213157 -0.214423  0.300431
   1   0.043716 -0.027796 -0.462323  0.298288

@jorisvandenbossche
Member

Yes, but what I mean is: if you have this output as a string, or you can copy it (e.g. from an example in the docs, from a question on Stack Overflow, ...), can you easily convert it to a DataFrame in a new session? And using read_clipboard on my example above, e.g., gives CParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6

@jreback
Contributor Author

jreback commented Sep 20, 2014

@jorisvandenbossche hmm works for me on master.

I usually just copy-paste from a question and do this:

data = """

here is the copied data exactly.....



"""
df = read_csv(StringIO(data))

FYI, I tried making this work from just a string (e.g. passing it straight to read_csv); it's a bit non-trivial to figure this out, actually.

@jorisvandenbossche
Member

Yep, that is what I also do, but still, I mostly have to adapt something in the original data to get it working. It would be nice if there were some utility that could read any such output.

data = """              a         b         c         d
l1 l2                                        
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500"""

pd.read_csv(StringIO(data), sep='\s+')

Can you read this in with read_csv without tweaking something? (but we're deviating a bit from the original issue here ...)

@jreback
Contributor Author

jreback commented Sep 20, 2014

@jorisvandenbossche I suppose you could have a wrapper that 'tries' various things, but it's non-trivial to simply guess; well, you can, but there are so many edge cases that it's MUCH easier to just have the user specify it.

@jorisvandenbossche
Member

Are there that many edge cases? The output of the pandas __repr__ is rather well defined, or not?

@jreback
Contributor Author

jreback commented Sep 20, 2014

ahh, you are proposing a pd.read_csv(data, repr=True) (so that we don't have ANOTHER top-level function!) that basically figures out the options. Hmm, interesting.

@jreback
Contributor Author

jreback commented Sep 20, 2014

@jorisvandenbossche I updated in the Enhancements section.

@jorisvandenbossche
Member

maybe it doesn't need to be top-level; another possibility is something like pd.util.read_repr. Seems like a nice little project for somebody to hack on that could be useful
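
A minimal sketch of what such a helper could look like for the two-level example earlier in this thread; the name read_repr and the whitespace-based parsing are assumptions, and it only handles this sparse, whitespace-separated case (index labels stay strings):

import pandas as pd

def read_repr(text):
    lines = [line for line in text.splitlines() if line.strip()]
    columns = lines[0].split()          # column labels from the first line
    index_names = lines[1].split()      # index level names from the second line
    n_levels = len(index_names)

    rows, tuples = [], []
    last = [''] * n_levels
    for line in lines[2:]:
        parts = line.split()
        values = parts[-len(columns):]                   # the data values
        labels = parts[:len(parts) - len(columns)]       # whichever index labels were printed
        labels = last[:n_levels - len(labels)] + labels  # re-fill the sparse repr
        last = labels
        tuples.append(tuple(labels))
        rows.append([float(v) for v in values])

    index = pd.MultiIndex.from_tuples(tuples, names=index_names)
    return pd.DataFrame(rows, index=index, columns=columns)

data = """              a         b         c         d
l1 l2
a  0   0.426860  0.691807 -1.499024 -0.761304
   1   0.610488 -0.185976  0.788957 -0.952540
b  0   0.527709  0.239897 -0.842122  0.613876
   1   0.401288  1.689590  1.004487 -0.064500"""

read_repr(data)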

@jorisvandenbossche
Member

#8336

@TomAugspurger
Contributor

Oh, #5563 would be a good one (Series HTML repr)
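
The core of #5563 is giving Series the _repr_html_ hook that IPython already picks up on DataFrame; a crude sketch of the idea (rendering via a one-column frame is a shortcut for illustration, not the proposed implementation):

import pandas as pd

def series_repr_html(s):
    # render the Series as a one-column HTML table, like DataFrame._repr_html_ does
    return pd.DataFrame(s).to_html()

s = pd.Series([1, 2, 3], name='x')
html = series_repr_html(s)   # what a Series._repr_html_ could return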

@jreback
Contributor Author

jreback commented Sep 21, 2014

@TomAugspurger

nice posts you have here: tomaugspurger.github.io/blog/2014/09/04/practical-pandas-part-2-more-tidying-more-data-and-merging

About halfway down you pass method='table' to to_hdf, which is ignored (and means you get a perf warning); use format='table' and it will work.
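
i.e. (file and key names here are just placeholders):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df.to_hdf('data.h5', 'df', format='table')   # format=, not method=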

@shoyer
Member

shoyer commented Sep 22, 2014

Would #8162 (allowing the index to be referenced by name, like a column) be doable?

I would love to see something like this happen in SF!

@cpcloud
Member

cpcloud commented Sep 22, 2014

wow this is great everyone bravo!

@jreback
Contributor Author

jreback commented Sep 22, 2014

let's see how much gets done!

of course the point of this list was to get as much dev time as possible (at the expense of other projects of course) :)

@cpcloud
Member

cpcloud commented Sep 22, 2014

in case a brave soul would like to venture into the land of numpy internals:

Missing data support in numpy: #8350

@cpcloud
Member

cpcloud commented Sep 22, 2014

nice-to-have:

pandas + airspeed velocity

demo: http://mdboom.github.io/astropy-benchmark/

adding to the top list
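
For anyone who hasn't used asv: benchmarks are plain Python classes with a setup method and time_*-prefixed methods that asv times. A minimal, made-up sketch of what a pandas benchmark could look like:

# benchmarks/groupby.py  (hypothetical file in an asv benchmark suite)
import numpy as np
import pandas as pd

class GroupBySum(object):
    def setup(self):
        n = 100000
        self.df = pd.DataFrame({'key': np.random.randint(0, 100, n),
                                'val': np.random.randn(n)})

    def time_groupby_sum(self):
        self.df.groupby('key')['val'].sum()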

@jreback
Contributor Author

jreback commented Sep 22, 2014

that looks sexy - can you create a new issue for asv? vbench-like

@cpcloud
Member

cpcloud commented Sep 22, 2014

yep. vbench was actually mentioned in the asv talk at scipy 2014

@jankatins
Contributor

Implementing a CategoricalIndex #7629?

@jreback
Contributor Author

jreback commented Sep 23, 2014

@JanSchulz I think out of scope for a 1-day event

@jasongrout

in case a brave soul would like to venture into the land of numpy internals:

I should mention that Mark Wiebe (who knows a lot of numpy internals) will be there. Additionally (re: airspeed velocity), Michael Droettboom will be there.

@immerrr
Contributor

immerrr commented Oct 4, 2014

Is there a summary of this hackathon available online?

@jreback
Contributor Author

jreback commented Oct 4, 2014

no summary - a few issues worked on / closed

@jorisvandenbossche
Member

@ all
I am going to update this in the coming days for the upcoming Bloomberg Hackathon this weekend, 29-30 November. But if you have new things to add or updates to the list above, please post/edit!
