
REF/INT: concat blocks of same type with preserving block type #17728

Merged
13 commits merged on Oct 12, 2017

Conversation

jorisvandenbossche
Member

Related to #17283

The goal is to get `pd.concat([list of Series])` working with Series that have an external block type. Currently the values are always converted to arrays, concatenated, and converted back to a block using block type inference.

This is a proof-of-concept to check whether something like this would be approved. Tests will still break because I didn't yet specialize concat_same_type for categorical data.
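The idea can be sketched with a small, dependency-free toy (all names here, such as `Block`, `CategoricalBlock`, and `concat`, are hypothetical stand-ins for illustration, not the actual pandas internals):

```python
# Sketch: a block type knows how to concatenate values of its own kind,
# so concat can preserve the type instead of round-tripping through
# plain arrays and re-inferring the block type afterwards.

class Block:
    """Minimal stand-in for a pandas Block holding a list of values."""

    def __init__(self, values):
        self.values = values

    @classmethod
    def concat_same_type(cls, to_concat):
        # All inputs are assumed to be the same block type, so the
        # result keeps that type without any dtype inference.
        values = [v for blk in to_concat for v in blk.values]
        return cls(values)


class CategoricalBlock(Block):
    """A specialised block type that an array round-trip would lose."""


def concat(blocks):
    # Take the type-preserving path only when all blocks share one type.
    if all(type(b) is type(blocks[0]) for b in blocks[1:]):
        return type(blocks[0]).concat_same_type(blocks)
    # Fallback: the generic path (array conversion + inference) would go here.
    return Block.concat_same_type(blocks)


a = CategoricalBlock([1, 2])
b = CategoricalBlock([3])
result = concat([a, b])
assert type(result) is CategoricalBlock
assert result.values == [1, 2, 3]
```

The point of the sketch is only the dispatch: same-typed inputs stay in their block type, while mixed inputs fall back to the generic path.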

@pep8speaks

pep8speaks commented Sep 30, 2017

Hello @jorisvandenbossche! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on October 10, 2017 at 23:27 UTC

@jorisvandenbossche jorisvandenbossche added the Internals Related to non-user accessible pandas implementation label Sep 30, 2017
@codecov

codecov bot commented Oct 1, 2017

Codecov Report

Merging #17728 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17728      +/-   ##
==========================================
- Coverage   91.27%   91.26%   -0.01%     
==========================================
  Files         163      163              
  Lines       49765    49793      +28     
==========================================
+ Hits        45421    45442      +21     
- Misses       4344     4351       +7
Flag Coverage Δ
#multiple 89.05% <100%> (+0.01%) ⬆️
#single 40.32% <53.33%> (-0.08%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals.py 94.41% <100%> (+0.03%) ⬆️
pandas/core/reshape/concat.py 97.68% <100%> (+0.07%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.1%) ⬇️
pandas/core/dtypes/inference.py 98.36% <0%> (+0.02%) ⬆️
pandas/core/dtypes/concat.py 99.13% <0%> (+0.86%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b8467c0...d4ce3df. Read the comment docs.

@codecov

codecov bot commented Oct 1, 2017

Codecov Report

Merging #17728 into master will increase coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17728      +/-   ##
==========================================
+ Coverage   91.22%   91.24%   +0.01%     
==========================================
  Files         163      163              
  Lines       50014    50043      +29     
==========================================
+ Hits        45627    45661      +34     
+ Misses       4387     4382       -5
Flag Coverage Δ
#multiple 89.05% <100%> (+0.03%) ⬆️
#single 40.25% <85.36%> (-0.06%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals.py 94.45% <100%> (+0.06%) ⬆️
pandas/core/reshape/concat.py 97.57% <100%> (-0.04%) ⬇️
pandas/core/dtypes/concat.py 99.12% <100%> (+0.86%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️
pandas/plotting/_converter.py 65.2% <0%> (+1.81%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 727ea20...bb5a100. Read the comment docs.

@jorisvandenbossche
Member Author

@jreback this PR introduces some additional code without removing any (due to other complexities, like the index not being based on blocks, I don't think I can remove any existing code), but the addition seems not too excessive for what it achieves.

It is still a bit messy where I need to check for the SparseBlock.

It also seems to give a modest speed-up, from around 900ms to around 700ms on this quick test (although performance was not the original motivation):

In [1]: s = pd.Series(np.arange(10000))

In [2]: %timeit pd.concat([s]*10)

@@ -2684,6 +2705,19 @@ def shift(self, periods, axis=0, mgr=None):
return [self.make_block_same_class(new_values,
placement=self.mgr_locs)]

def concat_same_type(self, to_concat):
Contributor

we should probably make a dispatch function in `_concat` to avoid all of this boilerplate

def _concat_for_type(to_concat, typ='sparse|datetime|cat')

# concat Series with length to keep dtype as much
non_empties = [x for x in self.objs if len(x) > 0]

# check if all series are of the same block type:
Contributor

this is the purpose of `concat_compat`; why are you rewriting it?

Member Author

`concat_compat` works at the level of values; here I want to work at the level of blocks. I could also try to do that in `concat_compat`, but I am not sure I can fully switch it to using blocks (e.g. for concatenating an index it still needs to work on values).

Contributor

Yeah, we just pushed the sparse handling for this down into the blocks to make `.unstack` cleaner. It's not that hard actually: you would just have a `BlockManager.concat` which does the work for you (e.g. give it a list of other block managers).

Contributor

Right, see my comment. This would be ideal to actually do inside the BlockManager (not objecting to the actual code, more to its location).

Member Author

For Series this would be `SingleBlockManager` then?

Moving this there would be OK for me, but in `SingleBlockManager.concat` I would still need to dispatch to the actual Blocks (as I now have the methods on the individual blocks), since my original goal is enabling external blocks to override how their values are concatenated.

Contributor

Sure, that's fine; your code would be pretty similar. The idea is you just have a `Block.concat` that works for most cases and can be overridden as needed (e.g. in your custom block).

Member Author

yeah, the sparse unstack PR is indeed a good example for this

Member Author

So I made an attempt to move the logic inside `SingleBlockManager`, see the last commit. I'm not fully sure I find it cleaner (e.g. the `new_axis` is passed into the function, as the logic to calculate the new axis values is currently quite ingrained in the `_Concatenator` class).

@jorisvandenbossche jorisvandenbossche added this to the 0.21.0 milestone Oct 3, 2017
@jorisvandenbossche jorisvandenbossche changed the title from "[WIP] REF series concat: concat blocks if all of same type" to "REF/INT: concat blocks of same type with preserving block type" on Oct 6, 2017
@jorisvandenbossche
Copy link
Member Author

I added similar logic to `concatenate_block_managers` to also deal with concatenating frames (it reuses the `concat_same_type` methods of the Blocks).
@jreback could you have a look?

@jreback (Contributor) left a comment

looks pretty reasonable.

to_concat = [blk.values for blk in to_concat]
values = _concat._concat_categorical(to_concat, axis=self.ndim - 1)

if is_categorical_dtype(values.dtype):
Contributor

I think you could dispense with all of these if/thens and just use make_block

Member Author

Yep, I could, but I wanted to take advantage of the fact that for those cases it is categorical, so I don't need to re-infer that.
Not sure how important that is for the internal ones though, as the if/else statements in make_block will not add that much overhead (I added the concat_same_type machinery to be able to do this in an external Block, but that does not mean I necessarily need to use it internally).

Contributor

It's just added code and more complexity; trying to reduce code here.

Member Author

So I removed this if/else check in the categorical and datetimetz block's concat_same_type.

But if you want, I can also completely remove all overriding of Block.concat_same_type and use a generic one in the base Block type (using _concat._concat_compat and make_block instead of np.concatenate and self.make_block_same_class). That way we would only keep the ability of external block types (like geopandas) to override this, without using that ability internally. But doing that would remove the performance improvement I mentioned above (which is not the main objective of this PR, so not too bad, but having it is still nice).


"""
return (
# all blocks need to have the same type
Contributor

a mouthful!

can you put blank lines in between statements

blk_mgr = BlockManager([block], [['col'], range(3)])
df = pd.DataFrame(blk_mgr)
assert repr(df) == ' col\n0 Val: 0\n1 Val: 1\n2 Val: 2'


def test_concat_series():
Contributor

can you add the issue number here

@jreback
Contributor

jreback commented Oct 9, 2017

ping when ready to have a look

@@ -314,6 +314,15 @@ def ftype(self):
def merge(self, other):
return _merge_blocks([self, other])

def concat_same_type(self, to_concat, placement=None):
Contributor

actually you could combine all of these concat_same_type into a single routine if you did this

def concat_same_type(self, to_concat, placement=None):
    """
    Concatenate list of single blocks of the same type.
    """
    values = self._concatenator([blk.values for blk in to_concat],
                                axis=self.ndim - 1)
    return self.make_block_same_class(
        values, placement=placement or slice(0, len(values), 1))

Then add to Block

_concatenator = np.concatenate

Categorical

_concatenator = _concat._concat_categorical

etc
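The class-attribute pattern suggested above can be illustrated with a dependency-free sketch (plain lists stand in for ndarrays, and all names are simplified and hypothetical, not the actual pandas internals):

```python
# A single concat_same_type on the base class looks up a per-class
# concatenator, so subclasses only swap out one attribute.

def _concat_lists(arrs):
    # Stand-in for np.concatenate in this dependency-free sketch.
    out = []
    for a in arrs:
        out.extend(a)
    return out


def _concat_categorical_lists(arrs):
    # Stand-in for a categorical-aware concatenator.
    return _concat_lists(arrs)


class Block:
    # staticmethod keeps the plain function from being bound as a method.
    _concatenator = staticmethod(_concat_lists)

    def __init__(self, values):
        self.values = values

    def concat_same_type(self, to_concat):
        values = self._concatenator([blk.values for blk in to_concat])
        return type(self)(values)


class CategoricalBlock(Block):
    # Only the concatenator changes; concat_same_type is inherited.
    _concatenator = staticmethod(_concat_categorical_lists)


blocks = [CategoricalBlock([1]), CategoricalBlock([2, 3])]
merged = blocks[0].concat_same_type(blocks)
assert type(merged) is CategoricalBlock
assert merged.values == [1, 2, 3]
```

The appeal of this shape is that every subclass shares one implementation of `concat_same_type` and differs only in the data-level concatenation function it points at.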

Member Author

Ah, that would actually be nice. The only problem with this is that (for the reasons I had the if/else statements originally) self.make_block_same_class will not always work. E.g. `_concat_categorical` can return either categorical values or object values, and depending on that it should return a CategoricalBlock or another type of Block.

Contributor

So override it for the categorical block; otherwise you end up repeating lots of code.

Member Author

But I'd need to override it for DatetimeTZ as well, so I'd end up with almost as many overridden ones as now (only for Sparse it would not be needed then).

Contributor

This is still repeating way too much code. You can just do it this way:

in Block

def concat_same_type(self, to_concat, constructor=None, placement=None):
    values = self._concatenator(......)
    if constructor is None:
        constructor = make_block
    return constructor(....)

then where needed

def concat_same_type(.......):
    return super(Categorical, self).concat_same_type(
        ....., constructor=self.make_block_same_class)

That way, for an overridden class you are not repeating everything.
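The constructor-parameter variant can also be sketched in runnable form (hypothetical, simplified names: `make_block` is a stand-in for pandas' inference-based factory, and `type(self)` stands in for `make_block_same_class`):

```python
# Base method accepts an optional constructor; a subclass only forwards
# its own constructor via super() instead of repeating the whole body.

def make_block(values):
    # Stand-in for pandas' make_block: this is where block-type
    # inference would happen in the real internals.
    return Block(values)


class Block:
    def __init__(self, values):
        self.values = values

    def concat_same_type(self, to_concat, constructor=None):
        values = [v for blk in to_concat for v in blk.values]
        if constructor is None:
            constructor = make_block
        return constructor(values)


class CategoricalBlock(Block):
    def concat_same_type(self, to_concat, constructor=None):
        # Keep the block type instead of re-inferring it.
        return super().concat_same_type(to_concat,
                                        constructor=type(self))


blocks = [CategoricalBlock([1]), CategoricalBlock([2])]
assert type(blocks[0].concat_same_type(blocks)) is CategoricalBlock
assert type(Block([0]).concat_same_type([Block([0])])) is Block
```

Compared with the class-attribute version, this keeps the generic inference path as the default while letting a subclass opt out with a one-line override.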

@jorisvandenbossche jorisvandenbossche mentioned this pull request Oct 10, 2017
64 tasks
if len(non_empties) > 0:
blocks = [obj.blocks[0] for obj in non_empties]

if all([type(b) is type(blocks[0]) for b in blocks[1:]]): # noqa
Contributor

could some of this logic be moved to concat_same_type?

Member Author

Which logic do you mean exactly?
For now, I coded concat_same_type such that it assumes only blocks of the same type are passed (so I don't do the checking there, but before the method is called).

@TomAugspurger (Contributor) left a comment

This looks good to me. OK to merge, and then iterate as needed (basically as driven by geopandas)?

@jorisvandenbossche
Member Author

So as long as we agree on the "external interface" (currently a concat_same_type method on Block that can be overridden) and keep it stable, the internal implementation can be iterated on.
The question then is mainly: are we OK with this way of overriding the behaviour?

@jorisvandenbossche
Member Author

Just to note that I have no time until Friday to do edits (work-related coding event). But I think the requested changes are mostly internal implementation cosmetics, more clean-up than fundamental code changes. So depending on when you want to release, I would propose to merge this, and I promise to do a follow-up PR.

@TomAugspurger
Contributor

TomAugspurger commented Oct 12, 2017 via email

@jreback jreback merged commit eac4d3f into pandas-dev:master Oct 12, 2017
@jreback
Contributor

jreback commented Oct 12, 2017

thanks @jorisvandenbossche

yep a clean up would be great.

mrocklin added a commit to mrocklin/geopandas that referenced this pull request Oct 14, 2017
ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 16, 2017
mrocklin added a commit to mrocklin/geopandas that referenced this pull request Oct 16, 2017
mrocklin added a commit to geopandas/geopandas that referenced this pull request Oct 28, 2017
* Add _concatenator method to GeometryBlock

Following on pandas-dev/pandas#17728

* Use None for missing values

Previously we used `Empty Polygon` for missing values.  Now we revert to
using NULL in GeometryArray (as before) and Python None when we convert
to shapely objects.

* fix align and fillna tests

* implement dropna

* remove comments

* Redefine isna

This makes it so that only Nones and NaNs are considered missing.

* clean up tests

* respond to comments

* remove unsupported kwargs
@jorisvandenbossche jorisvandenbossche deleted the internals-block-concat branch January 11, 2018 17:29
jorisvandenbossche pushed a commit to jorisvandenbossche/geopandas that referenced this pull request Jul 1, 2019
jorisvandenbossche pushed a commit to jorisvandenbossche/geopandas that referenced this pull request Sep 18, 2019
jorisvandenbossche pushed a commit to jorisvandenbossche/geopandas that referenced this pull request Sep 30, 2019