
REF/INT: concat blocks of same type with preserving block type #17728

Merged
13 commits merged on Oct 12, 2017

Conversation

jorisvandenbossche
Member

Related to #17283

The goal is to get `pd.concat([list of Series])` working with Series that have an external block type. Currently the values are always converted to arrays, concatenated, and converted back to a block using block type inference.

This is a proof-of-concept to check whether something like this would be approved. Tests will still break because I didn't yet specialize concat_same_type for categorical data.
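The idea can be sketched with a small, dependency-free toy (all names here, such as `Block`, `CategoricalBlock`, and `concat`, are hypothetical stand-ins for illustration, not the actual pandas internals):

```python
# Sketch: a block type knows how to concatenate values of its own kind,
# so concat can preserve the type instead of round-tripping through
# plain arrays and re-inferring the block type afterwards.

class Block:
    """Minimal stand-in for a pandas Block holding a list of values."""

    def __init__(self, values):
        self.values = values

    @classmethod
    def concat_same_type(cls, to_concat):
        # All inputs are assumed to be the same block type, so the
        # result keeps that type without any dtype inference.
        values = [v for blk in to_concat for v in blk.values]
        return cls(values)


class CategoricalBlock(Block):
    """A specialised block type that an array round-trip would lose."""


def concat(blocks):
    # Take the type-preserving path only when all blocks share one type.
    if all(type(b) is type(blocks[0]) for b in blocks[1:]):
        return type(blocks[0]).concat_same_type(blocks)
    # Fallback: the generic path (array conversion + inference) would go here.
    return Block.concat_same_type(blocks)


a = CategoricalBlock([1, 2])
b = CategoricalBlock([3])
result = concat([a, b])
assert type(result) is CategoricalBlock
assert result.values == [1, 2, 3]
```

The point of the sketch is only the dispatch: same-typed inputs stay in their block type, while mixed inputs fall back to the generic path.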

@pep8speaks

pep8speaks commented Sep 30, 2017

Hello @jorisvandenbossche! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on October 10, 2017 at 23:27 UTC

@jorisvandenbossche jorisvandenbossche added the Internals Related to non-user accessible pandas implementation label Sep 30, 2017
@codecov

codecov bot commented Oct 1, 2017

Codecov Report

Merging #17728 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17728      +/-   ##
==========================================
- Coverage   91.27%   91.26%   -0.01%     
==========================================
  Files         163      163              
  Lines       49765    49793      +28     
==========================================
+ Hits        45421    45442      +21     
- Misses       4344     4351       +7
Flag Coverage Δ
#multiple 89.05% <100%> (+0.01%) ⬆️
#single 40.32% <53.33%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals.py 94.41% <100%> (+0.03%) ⬆️
pandas/core/reshape/concat.py 97.68% <100%> (+0.07%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.1%) ⬇️
pandas/core/dtypes/inference.py 98.36% <0%> (+0.02%) ⬆️
pandas/core/dtypes/concat.py 99.13% <0%> (+0.86%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8467c0...d4ce3df. Read the comment docs.

@codecov

codecov bot commented Oct 1, 2017

Codecov Report

Merging #17728 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17728      +/-   ##
==========================================
+ Coverage   91.22%   91.24%   +0.01%     
==========================================
  Files         163      163              
  Lines       50014    50043      +29     
==========================================
+ Hits        45627    45661      +34     
+ Misses       4387     4382       -5
Flag Coverage Δ
#multiple 89.05% <100%> (+0.03%) ⬆️
#single 40.25% <85.36%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals.py 94.45% <100%> (+0.06%) ⬆️
pandas/core/reshape/concat.py 97.57% <100%> (-0.04%) ⬇️
pandas/core/dtypes/concat.py 99.12% <100%> (+0.86%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 727ea20...bb5a100. Read the comment docs.

@jorisvandenbossche
Member Author

@jreback this PR introduces some additional code without removing any (due to other complexities, like the index not being based on blocks, I don't think I can remove any existing code), but the addition seems not too excessive for what it achieves.

It is still a bit messy where I need to check for the SparseBlock.

It also seems to give a modest speed-up, from around 900ms to around 700ms on this quick test (although performance was not the original motivation):

In [1]: s = pd.Series(np.arange(10000))

In [2]: %timeit pd.concat([s]*10)

@@ -2684,6 +2705,19 @@ def shift(self, periods, axis=0, mgr=None):
return [self.make_block_same_class(new_values,
placement=self.mgr_locs)]

def concat_same_type(self, to_concat):
Contributor

we should probably make a dispatch function in `_concat` to avoid all of this boilerplate

def _concat_for_type(to_concat, typ='sparse|datetime|cat')

# concat Series with length to keep dtype as much
non_empties = [x for x in self.objs if len(x) > 0]

# check if all series are of the same block type:
Contributor

this is the purpose of `concat_compat`; why are you rewriting it?

Member Author

`concat_compat` works at the level of values; here I want to work at the level of blocks. I could also try to do that in `concat_compat`, but I am not sure I can fully switch it to using blocks (e.g. for concatenating an index it still needs to work on values).

Contributor

Yeah, we just pushed the sparse handling for this down into the blocks to make `.unstack` cleaner. It's not that hard actually: you would just have a `BlockManager.concat` which does the work for you (e.g. give it a list of other block managers).

Contributor

Right, see my comment. This would be ideal to actually do inside the BlockManager (not objecting to the actual code, more to its location).

Member Author

For Series this would be `SingleBlockManager` then?

Moving this there would be OK for me, but in `SingleBlockManager.concat` I would still need to dispatch to the actual Blocks (as I now have the methods on the individual blocks), since my original goal is enabling external blocks to override how their values are concatenated.

Contributor

Sure, that's fine; your code would be pretty similar. The idea is you just have a `Block.concat` that works for most cases and can be overridden as needed (e.g. in your custom block).

Member Author

yeah, the sparse unstack PR is indeed a good example for this

Member Author

So I made an attempt to move the logic inside `SingleBlockManager`, see the last commit. I'm not fully sure I find it cleaner (e.g. the `new_axis` is passed into the function, as the logic to calculate the new axis values is currently quite ingrained in the `_Concatenator` class).

@jorisvandenbossche jorisvandenbossche added this to the 0.21.0 milestone Oct 3, 2017
@jorisvandenbossche jorisvandenbossche changed the title from "[WIP] REF series concat: concat blocks if all of same type" to "REF/INT: concat blocks of same type with preserving block type" on Oct 6, 2017
@jorisvandenbossche
Copy link
Member Author

I added similar logic to `concatenate_block_managers` to also deal with concatenating frames (it reuses the `concat_same_type` methods of the Blocks).
@jreback could you have a look?

@jreback (Contributor) left a comment

looks pretty reasonable.

to_concat = [blk.values for blk in to_concat]
values = _concat._concat_categorical(to_concat, axis=self.ndim - 1)

if is_categorical_dtype(values.dtype):
Contributor

I think you could dispense with all of these if/thens and just use make_block

Member Author

Yep, I could, but I wanted to take advantage of the fact that for those cases it is categorical, so I don't need to re-infer that.
Not sure how important that is for the internal ones though, as the if/else statements in make_block will not add that much overhead (I added the concat_same_type machinery to be able to do this in an external Block, but that does not mean I necessarily need to use it internally).

Contributor

It's just added code and more complexity; trying to reduce code here.

Member Author

So I removed this if/else check in the categorical and datetimetz block's concat_same_type.

But if you want, I can also completely remove all overriding of Block.concat_same_type and use a generic one in the base Block type (using _concat._concat_compat and make_block instead of np.concatenate and self.make_block_same_class). That way we would only keep the ability of external block types (like geopandas) to override this, without using that ability internally. But doing that would remove the performance improvement I mentioned above (which is not the main objective of this PR, so not too bad, but having it is still nice).


"""
return (
# all blocks need to have the same type
Contributor

a mouthful!

can you put blank lines in between statements

blk_mgr = BlockManager([block], [['col'], range(3)])
df = pd.DataFrame(blk_mgr)
assert repr(df) == ' col\n0 Val: 0\n1 Val: 1\n2 Val: 2'


def test_concat_series():
Contributor

can you add the issue number here

@jreback
Contributor

jreback commented Oct 9, 2017

ping when ready to have a look

@@ -314,6 +314,15 @@ def ftype(self):
def merge(self, other):
return _merge_blocks([self, other])

def concat_same_type(self, to_concat, placement=None):
Contributor

actually you could combine all of these concat_same_type into a single routine if you did this

def concat_same_type(self, to_concat, placement=None):
    """
    Concatenate list of single blocks of the same type.
    """
    values = self._concatenator([blk.values for blk in to_concat],
                                axis=self.ndim - 1)
    return self.make_block_same_class(
        values, placement=placement or slice(0, len(values), 1))

Then add to Block

_concatenator = np.concatenate

Categorical

_concatenator = _concat._concat_categorical

etc
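The class-attribute pattern suggested above can be illustrated with a dependency-free sketch (plain lists stand in for ndarrays, and all names are simplified and hypothetical, not the actual pandas internals):

```python
# A single concat_same_type on the base class looks up a per-class
# concatenator, so subclasses only swap out one attribute.

def _concat_lists(arrs):
    # Stand-in for np.concatenate in this dependency-free sketch.
    out = []
    for a in arrs:
        out.extend(a)
    return out


def _concat_categorical_lists(arrs):
    # Stand-in for a categorical-aware concatenator.
    return _concat_lists(arrs)


class Block:
    # staticmethod keeps the plain function from being bound as a method.
    _concatenator = staticmethod(_concat_lists)

    def __init__(self, values):
        self.values = values

    def concat_same_type(self, to_concat):
        values = self._concatenator([blk.values for blk in to_concat])
        return type(self)(values)


class CategoricalBlock(Block):
    # Only the concatenator changes; concat_same_type is inherited.
    _concatenator = staticmethod(_concat_categorical_lists)


blocks = [CategoricalBlock([1]), CategoricalBlock([2, 3])]
merged = blocks[0].concat_same_type(blocks)
assert type(merged) is CategoricalBlock
assert merged.values == [1, 2, 3]
```

The appeal of this shape is that every subclass shares one implementation of `concat_same_type` and differs only in the data-level concatenation function it points at.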

Member Author

Ah, that would actually be nice. The only problem with this is that (for the reasons I had the if/else statements originally) self.make_block_same_class will not always work. E.g. `_concat_categorical` can return either categorical values or object values, and depending on that it should return a CategoricalBlock or another type of Block.

Contributor

So override it for the categorical block; otherwise you end up repeating lots of code.

Member Author

But I'd need to override it for DatetimeTZ as well, so I'd end up with almost as many overridden ones as now (only for Sparse it would not be needed then).

Contributor

This is still repeating way too much code. You can just do it this way:

in Block

def concat_same_type(self, to_concat, constructor=None, placement=None):
    values = self._concatenator(......)
    if constructor is None:
        constructor = make_block
    return constructor(....)

then where needed

def concat_same_type(.......):
    return super(Categorical, self).concat_same_type(
        ....., constructor=self.make_block_same_class)

That way, for an overridden class you are not repeating everything.
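The constructor-parameter variant can also be sketched in runnable form (hypothetical, simplified names: `make_block` is a stand-in for pandas' inference-based factory, and `type(self)` stands in for `make_block_same_class`):

```python
# Base method accepts an optional constructor; a subclass only forwards
# its own constructor via super() instead of repeating the whole body.

def make_block(values):
    # Stand-in for pandas' make_block: this is where block-type
    # inference would happen in the real internals.
    return Block(values)


class Block:
    def __init__(self, values):
        self.values = values

    def concat_same_type(self, to_concat, constructor=None):
        values = [v for blk in to_concat for v in blk.values]
        if constructor is None:
            constructor = make_block
        return constructor(values)


class CategoricalBlock(Block):
    def concat_same_type(self, to_concat, constructor=None):
        # Keep the block type instead of re-inferring it.
        return super().concat_same_type(to_concat,
                                        constructor=type(self))


blocks = [CategoricalBlock([1]), CategoricalBlock([2])]
assert type(blocks[0].concat_same_type(blocks)) is CategoricalBlock
assert type(Block([0]).concat_same_type([Block([0])])) is Block
```

Compared with the class-attribute version, this keeps the generic inference path as the default while letting a subclass opt out with a one-line override.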

@jorisvandenbossche jorisvandenbossche mentioned this pull request Oct 10, 2017
64 tasks
if len(non_empties) > 0:
blocks = [obj.blocks[0] for obj in non_empties]

if all([type(b) is type(blocks[0]) for b in blocks[1:]]): # noqa
Contributor

could some of this logic be moved to concat_same_type?

Member Author

Which logic do you mean exactly?
For now, I coded concat_same_type such that it assumes only blocks of the same type are passed (so I don't do the checking there, but before the method is called).

@TomAugspurger (Contributor) left a comment

This looks good to me. OK to merge, and then iterate as needed (basically as driven by geopandas)?

@jorisvandenbossche
Member Author

So as long as we agree on the "external interface" (currently a concat_same_type method on Block that can be overridden) and keep it stable, the internal implementation can be iterated on.
The question then is mainly: are we OK with this way of overriding the behaviour?

@jorisvandenbossche
Member Author

Just to note that I have no time until Friday to do edits (work-related coding event). But I think the requested changes are mostly internal implementation cosmetics, more clean-up than fundamental code changes. So depending on when you want to release, I would propose to merge this, and I promise to do a follow-up PR.

@TomAugspurger
Contributor

TomAugspurger commented Oct 12, 2017 via email

@jreback jreback merged commit eac4d3f into pandas-dev:master Oct 12, 2017
@jreback
Contributor

jreback commented Oct 12, 2017

thanks @jorisvandenbossche

yep a clean up would be great.

mrocklin added a commit to mrocklin/geopandas that referenced this pull request Oct 14, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
mrocklin added a commit to mrocklin/geopandas that referenced this pull request Oct 16, 2017
mrocklin added a commit to geopandas/geopandas that referenced this pull request Oct 28, 2017
* Add _concatenator method to GeometryBlock

Following on pandas-dev/pandas#17728

* Use None for missing values

Previously we used `Empty Polygon` for missing values.  Now we revert to
using NULL in GeometryArray (as before) and Python None when we convert
to shapely objects.

* fix align and fillna tests

* implement dropna

* remove comments

* Redefine isna

This makes it so that only Nones and NaNs are considered missing.

* clean up tests

* respond to comments

* remove unsupported kwargs
@jorisvandenbossche jorisvandenbossche deleted the internals-block-concat branch January 11, 2018 17:29
jorisvandenbossche pushed a commit to jorisvandenbossche/geopandas that referenced this pull request Jul 1, 2019
jorisvandenbossche pushed a commit to jorisvandenbossche/geopandas that referenced this pull request Sep 18, 2019
jorisvandenbossche pushed a commit to jorisvandenbossche/geopandas that referenced this pull request Sep 30, 2019