
ENH: Consolidation and further optimization of take functions in common #2867

Merged · 4 commits merged into pandas-dev:master on Feb 14, 2013

Conversation

stephenwlin (Contributor)

I've consolidated take_nd, ndtake, and take_fast in common into a single function, take_nd, which has cleaner semantics (documented in a docstring), at least preserves the existing performance properties in all cases, and improves them (sometimes significantly) in some. (In particular, computation of intermediate arrays like boolean masks is still short-circuited in the same way as before, in the appropriate situations.) The operation that used to be take_fast was also broken for non-NA fill_value; this is fixed too.
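For reference, the consolidated behavior amounts to a take along one axis where -1 entries in the indexer are filled with fill_value, skipping the mask work when filling isn't needed. The following is a minimal NumPy sketch of those semantics (take_nd_sketch is a hypothetical name, not the pandas implementation, which uses preallocated output buffers and dtype-specialized Cython paths):

    import numpy as np

    def take_nd_sketch(values, indexer, axis=0, fill_value=np.nan, allow_fill=True):
        # Hypothetical illustration only -- not pandas' take_nd.
        indexer = np.asarray(indexer, dtype=np.int64)
        out = np.take(values, indexer, axis=axis)
        if allow_fill:
            mask = indexer == -1              # intermediate boolean mask
            if mask.any():                    # short-circuit when nothing to fill
                out = out.astype(np.result_type(out.dtype, fill_value), copy=False)
                # set fill_value along the rows/columns indexed with -1
                out[(slice(None),) * axis + (mask,)] = fill_value
        return out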

In addition, I've optimized the Cython implementations of 2-D takes to use row-wise or column-wise memmoves automatically where appropriate (same input and output dtype, non-object dtype, both arrays C-contiguous if the take axis is 0, both F-contiguous if the take axis is 1, ...). In theory Cython and/or the C compiler could do this automatically, but apparently they don't, because performance does improve when this path is triggered, at least with gcc on 32-bit Linux.
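Roughly, the condition gating the block-copy path looks like the following (my own sketch of the checks described above, expressed with NumPy flags rather than the actual Cython template code):

    import numpy as np

    def can_block_copy(values, out, axis):
        # Hypothetical helper: same dtype on both sides, not object dtype,
        # and contiguity that matches the take axis, so each copied
        # row/column is one contiguous block of memory.
        if values.dtype != out.dtype or values.dtype == object:
            return False
        if axis == 0:
            return values.flags.c_contiguous and out.flags.c_contiguous
        if axis == 1:
            return values.flags.f_contiguous and out.flags.f_contiguous
        return False

When this holds, each out[i] = values[indexer[i]] assignment in the axis-0 case is effectively a single memmove of itemsize * ncols bytes instead of an inner element-by-element loop.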

Tests with measurably improved performance (on top of the performance improvements in #2819):

Results:
                                            t_head t_baseline      ratio
name
frame_reindex_axis0                         0.6845     3.2839     0.2084
frame_reindex_axis1                         4.0088     5.4203     0.7396
frame_boolean_row_select                    0.2932     0.3939     0.7443

All other ratios are apparently just statistical noise within the ±10% range that varies from run to run. (I was hoping it would help more, but I guess this is OK.)

jreback (Contributor) commented Feb 13, 2013

I don't think this needs the dtypes_bug branch; can you rebase off master?

I think that would make it easier for you.

stephenwlin (Contributor, Author)

Yeah, I rebased already; I didn't realize I had pulled that into this branch.

ghost commented Feb 13, 2013

Could you take a look at:
numpy/numpy#324
#2490

and confirm that a similar issue isn't a problem here?

Thank you for all your recent work, great stuff.

stephenwlin (Contributor, Author)

I'm pretty sure the functions already have undefined behavior if the buffers overlap, with or without the memcpy, because one buffer is being written while the other is being read and the relative order depends on the indexer passed in. But it's probably not good to pass memcpy overlapping buffers anyway, since that is undefined behavior and technically anything could happen, so I'll add a check.

stephenwlin (Contributor, Author)

Actually, I decided just to change memcpy to memmove, which should be fine: I'm sure the latter does a check internally that is as fast as or faster than one I could write myself.

It will still be undefined behavior if the two buffers overlap, which is true of all the Cython algos as far as I can tell, but at least we won't be calling a C function with invalid inputs.

ghost commented Feb 14, 2013

IIRC there's a performance penalty associated with memmove vs. memcpy; better make sure being cautious didn't nullify the perf gain.

stephenwlin (Contributor, Author)

Sure, I'll check, but I can't imagine it would make a noticeable difference with a sane implementation. All it has to do is some arithmetic on the pointers and the length before deciding what to do: if there's no overlap, it can do whatever memcpy does.

ghost commented Feb 14, 2013

I would think so too, but if that's all there was to it, I don't see why such a small overhead would be left out of memcpy in the first place. But I really don't know what [X]-libc does in practice.

ghost commented Feb 14, 2013

I do remember toliphant of numpy fame being concerned with the associated perf hit in a related discussion.

stephenwlin (Contributor, Author)

Well, that's surprising re: toliphant... I guess I can revert to memcpy if it makes any difference; it's not like any of the Cython code has well-defined behavior on overlapping buffers anyway. If we really wanted to be careful, we could check whether ((out if out.base is None else out.base) is (values if values.base is None else values.base)) in every single algo first, but that seems like overkill. (Btw, it's kind of odd that arr.base is None rather than arr when arr is not a view... it really complicates things for no reason.)

stephenwlin (Contributor, Author)

Actually, apparently that doesn't even work: arr.base isn't guaranteed to be the ultimate owner of the buffer (http://projects.scipy.org/numpy/ticket/1232), so you'd have to walk the base chain until you find something with flags.owndata set. That, or do pointer arithmetic on the strides, etc. (and the views may not be contiguous, so it wouldn't be easy). Yuck either way.
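For what it's worth, the base-walking idea would look something like this (buffer_owner is a hypothetical helper, not anything in pandas, and per the ticket above the chain isn't always well-behaved):

    import numpy as np

    def buffer_owner(arr):
        # Hypothetical helper: follow .base until reaching something that
        # owns its data buffer, since arr.base alone is not guaranteed to
        # be the ultimate owner.
        base = arr
        while isinstance(base, np.ndarray) and not base.flags.owndata and base.base is not None:
            base = base.base
        return base

    a = np.zeros((3, 4))
    v = a[1:, ::2]              # non-contiguous view of a
    assert buffer_owner(v) is a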

ghost commented Feb 14, 2013

I think so too; wes has the last word, though.

Not sure about the Cython code; I suppose in some places it delegates to numpy, for example, which does handle this, at least in the new release.

stephenwlin (Contributor, Author)

There's numpy.may_share_memory, apparently.
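For anyone following along, the check is conservative (it compares memory bounds, so it can report false positives but not false negatives), e.g.:

    import numpy as np

    a = np.zeros(10)
    b = a[2:5]                      # view into the same buffer
    c = np.zeros(10)                # completely separate buffer

    print(np.may_share_memory(a, b))   # True
    print(np.may_share_memory(a, c))   # False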

stephenwlin (Contributor, Author)

Pretty much the same results...

Results:
                                            t_head t_baseline      ratio
name
frame_reindex_axis0                         0.6879     3.2483     0.2118
frame_reindex_axis1                         4.0164     5.3709     0.7478
frame_boolean_row_select                    0.2907     0.3873     0.7505

jreback (Contributor) commented Feb 14, 2013

How does frame_reindex_cast do? I think parts of that DO NOT hit this optimization, because it's an int64 take on int16/32/64 data.

stephenwlin (Contributor, Author)

"frame_reindex_upcast" you mean? it seems to improve but the amount varies run to run so I don't know how much of it is noise:

frame_reindex_upcast                       21.4190    24.5636     0.8720
frame_reindex_upcast                       23.4064    24.2167     0.9665

jreback (Contributor) commented Feb 14, 2013

OK, that looks fine then.

jreback (Contributor) commented Feb 14, 2013

Can you put a mention in RELEASE.rst (and for your converts_branch as well)? Thanks!

jreback added a commit that referenced this pull request on Feb 14, 2013:
ENH: Consolidation and further optimization of take functions in common

thanks!

jreback merged commit 5b5e532 into pandas-dev:master on Feb 14, 2013