
PyArray conversion speedups and PyArrayFromBuffer #487

Merged — 22 commits merged into JuliaPy:master on Nov 13, 2018

Conversation

@JobJob (Contributor) commented Apr 4, 2018

This is the last one for now 😄

Speedups for PyArray conversion using Python's Buffer interface. PyArrayFromBuffer could become the default constructor for PyArray, if this looks good.

  • ~15x speedup for PyArray(o::PyObject)
  • PyArrayFromBuffer(o::PyObject): a further 4x faster PyArray creation (~60x faster than master)
  • setdata! for fast reassignment of the data pointer in a PyArray when you know the array has the right shape (~500x faster than creating a new PyArray on master; not an apples-to-apples comparison, but still good to have)
  • The workhorse is PyArrayInfoFromBuffer(o::PyObject), a faster way to get a PyArray_Info from numpy than extracting values from the __array_interface__
  • (speculative) ArrayFromBuffer: no-copy conversion to a Julia Array
  • PyArray_Info parameterised on T and N, and made immutable
  • PyArray_Info size and strides changed to Tuples (from Vectors)
  • PyArray_Info data changed from Ptr{Cvoid} to Ptr{T}
  • moved PyBuffer tests to a separate file

In no major rush for review, so please don't feel pressured.
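For context on what the buffer-based path works with: the metadata that a Py_buffer-based converter reads from the C struct (format, shape, strides) is the same information Python exposes through memoryview. A minimal stdlib-only sketch of that metadata (illustrative only, not PyCall code):

```python
import struct

# Pack six float64 values and view them as a 2x3 C-contiguous array.
# The resulting memoryview carries the same format/shape/strides
# metadata that a Py_buffer-based converter reads at the C level.
raw = bytearray(struct.pack("6d", *range(6)))
m = memoryview(raw).cast("d", shape=[2, 3])

print(m.format)   # 'd' -> 8-byte float (Float64 on the Julia side)
print(m.shape)    # (2, 3)
print(m.strides)  # (24, 8): row stride is 3*8 bytes, element stride is 8
print(m.ndim)     # 2
```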


master results:

Mean times
----------
nprand_pyo(2, 2)    : TrialEstimate(17.451 μs)
convert_pyarr(2, 2) : TrialEstimate(403.306 μs)
nprand_pyo(100, 100): TrialEstimate(152.220 μs)
convert_pyarr(100, 100): TrialEstimate(418.158 μs)

This branch results:

Mean times
----------
nprand_pyo(2, 2)           : TrialEstimate(16.688 μs)
convert_pyarr(2, 2)        : TrialEstimate(21.455 μs)
PyArrayInfoFromBuffer(2, 2): TrialEstimate(3.519 μs)
convert_pyarrbuf(2, 2)     : TrialEstimate(5.450 μs)
convert_arr(2, 2)          : TrialEstimate(100.200 μs)
convert_arrbuf(2, 2)       : TrialEstimate(10.339 μs)
setdata!(2, 2)             : TrialEstimate(745.279 ns)
setdata! bufprealloc(2, 2) : TrialEstimate(469.487 ns)
nprand_pyo(100, 100)       : TrialEstimate(158.003 μs)
convert_pyarr(100, 100)    : TrialEstimate(21.049 μs)
PyArrayInfoFromBuffer(100, 100): TrialEstimate(3.402 μs)
convert_pyarrbuf(100, 100) : TrialEstimate(5.024 μs)
convert_arr(100, 100)      : TrialEstimate(4.559 ms)
convert_arrbuf(100, 100)   : TrialEstimate(10.037 μs)
setdata!(100, 100)         : TrialEstimate(745.049 ns)
setdata! bufprealloc(100, 100): TrialEstimate(465.865 ns)

Benchmark code for master.

using PyCall, BenchmarkTools, DataStructures

results = OrderedDict{String,Any}() 
let 
    np = pyimport("numpy") 
    nprand = np["random"]["rand"] 
    # nparray_pyo(x) = pycall(np["array"], PyObject, x) 
    # pytestarray(sz::Int...) = pycall(np["reshape"], PyObject, nparray_pyo(1:prod(sz)), sz) 

    nprand_pyo(sz...)   = pycall(nprand, PyObject, sz...) 

    for arr_size in [(2,2), (100,100)] 
        pyo_arr = nprand_pyo(arr_size...) 
        results["nprand_pyo$arr_size"] = @benchmark $nprand_pyo($arr_size...) 
        println("nprand_pyo $arr_size:\n"); display(results["nprand_pyo$arr_size"]) 
        println("--------------------------------------------------") 

        results["convert_pyarr$arr_size"] = @benchmark $convert(PyArray, $pyo_arr) 
        println("convert_pyarr $arr_size:\n"); display(results["convert_pyarr$arr_size"]) 
        println("--------------------------------------------------")
    end
end

println("")
println("Mean times")
println("----------")
foreach((r)->println(rpad(r[1],20), ": ", mean(r[2])), results)

@stevengj (Member) commented Apr 4, 2018

Thanks for working on this! My hope for a long time (see #70) has been to ditch numpy.jl completely in favor of something based entirely on PyBuffer. Can you remove numpy.jl completely in this PR?

@stevengj (Member) commented Apr 4, 2018

That is, PyArray (the no-copy conversion type) should just use the buffer API too.

@JobJob (Contributor, Author) commented Apr 4, 2018

Yeah, I think you can remove all the stuff related to Python array -> PyArray conversion. But this doesn't do anything for Julia array -> Python array.

@JobJob (Contributor, Author) commented Apr 4, 2018

as in, everything related to __array_interface__

@stevengj (Member) commented Apr 4, 2018

Right.... for that we'd need to create an object that implements the buffer API. One concern is that I don't know how well this would work with Python functions expecting a NumPy array.

Okay, it makes sense to replace the Python -> Julia array code (__array_interface__ and PyArray) with buffer code in this PR, but leave the Julia -> Python stuff using NumPy for now and work on that in a separate PR.

@JobJob (Contributor, Author) commented Apr 4, 2018

The other issue with replacing PyArray(o::PyObject) with PyArrayFromBuffer is if something implements __array_interface__ but not the buffer interface. Not sure which is better supported in the wild, but we could just add a fallback to __array_interface__ if the buffer interface isn't implemented.
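A hedged sketch of that fallback idea, in Python for illustration (the actual PyCall logic would live in Julia; array_metadata and FakeArray are hypothetical names): try the buffer protocol first, and consult __array_interface__ only when memoryview() fails.

```python
def array_metadata(o):
    """Return basic array metadata, preferring the buffer protocol."""
    try:
        m = memoryview(o)  # fast path: the buffer protocol
        return {"format": m.format, "shape": tuple(m.shape)}
    except TypeError:
        # legacy fallback: a numpy-style __array_interface__ dict
        ai = getattr(o, "__array_interface__", None)
        if ai is None:
            raise TypeError("object supports neither protocol")
        return {"format": ai["typestr"], "shape": tuple(ai["shape"])}

class FakeArray:
    # hypothetical object implementing only __array_interface__
    __array_interface__ = {"typestr": "<f8", "shape": (2, 2), "version": 3}

print(array_metadata(bytearray(4)))  # buffer path: format 'B', shape (4,)
print(array_metadata(FakeArray()))   # fallback path: '<f8', (2, 2)
```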

@stevengj (Member) commented Apr 4, 2018

> if something implements __array_interface__ but not the buffer interface.

I'm guessing that this never happens in practice.

My preference would be to remove __array_interface__ entirely, and only worry about adding it back as a fallback if someone encounters it in the wild.

@JobJob (Contributor, Author) commented Apr 4, 2018

> I'm guessing that this never happens in practice.

Ok, let's not support it then unless someone tells us it does 😄

@JobJob (Contributor, Author) commented Apr 8, 2018

What's the issue with having numpy as a test dependency? Seems that using numpy will allow us to cover more cases, more easily.

I thought I might be able to get by with Python's array module, but then I did a little digging and found out that it doesn't support the buffer interface on Python 2.7 🤦. You can get typed byte data with struct.pack, but you can't reshape a memoryview using cast in Python 2 (you can in Python 3). So the only way I could see to create a multi-dim array without using numpy in Python 2.7 is messing around in ctypes.

It could be done, but it would take effort, especially for more exotic arrays (e.g. non-native endian and f_contiguous), and since I'm guessing the majority of the arrays these functions will see in the wild will be coming from numpy anyway, it could be good to test with it.
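For reference, the Python 3-only reshape-by-cast mentioned above looks like this (a sketch; Python 2.7's memoryview has no cast method, which is why numpy is the practical choice there):

```python
import struct

raw = struct.pack("6d", *range(6))           # 48 raw bytes of float64 data
m = memoryview(raw).cast("d", shape=[2, 3])  # 2x3 typed view, Python 3 only

print(m[1, 2])  # 5.0: row-major, the element at offset (1*3 + 2) * 8 bytes
```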

@stevengj (Member) commented Apr 8, 2018

NumPy as a test dependency is fine; we should definitely test with NumPy arrays and operations anyway.

JobJob added 6 commits April 30, 2018 17:05
* PyArray_Info parameterised T,N and immutable
* PyArray_Info size and stride changed to Tuples (from Vectors)
* PyArray_Info data changed from Ptr{Cvoid} to Ptr{T}
* `PyArrayInfoFromBuffer(o::PyObject)` faster way to get PyArray_Info from numpy than numpy's __array_interface__
* `PyArrayFromBuffer(o::PyObject)` 4x faster PyArray conversion
* ArrayFromBuffer no copy conversion to Julia Array
* moved PyBuffer tests to separate file
rename ArrayFromBuffer to NoCopyArray, and make indexing of NoCopyArray match py indexing for row major arrays too
@JobJob mentioned this pull request on Jun 28, 2018 (3 tasks)
@stevengj (Member) commented:

Bump.

@JobJob (Contributor, Author) commented Aug 7, 2018

Assuming the tests pass, I think this is pretty good to go. (They pass for me on Mac with Python 2/3, Julia 0.6.3/0.7-dev-xxx.)

There are things in PEP 3118 that it doesn't handle, but I think it handles all the cases of arrays (strides and types) that current master does.

Oh, the one thing I wanted to ask about: there's a deprecation warning about strides on 0.7. Should PyArray and PyBuffer implement strides for 0.6?
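As a rough guide to the format-string territory involved (a stdlib sketch; PEP 3118 format strings are a superset of the struct module's, so this only covers the basic codes):

```python
import struct

# Native sizes of some basic format codes a buffer-based converter must
# map to element types, e.g. 'd' -> 8-byte float, 'q' -> 8-byte signed int.
for code in ["b", "h", "i", "q", "f", "d"]:
    print(code, struct.calcsize(code))

# Byte-order prefixes mark endianness: '>d' is a big-endian 8-byte float,
# the kind of non-native-endian case mentioned above.
print(struct.calcsize(">d"))  # 8
```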

@JobJob (Contributor, Author) commented Aug 7, 2018

Umm nightly failures seem unrelated :/

@stevengj (Member) commented:

Nightly failures on master should be fixed now if you want to rebase.

Review thread on src/pybuffer.jl (outdated):
and pass a PyPtr_NULL instead of a C_NULL for Py_buffer.obj when creating a new PyBuffer
@codecov-io commented Oct 22, 2018

Codecov Report

Merging #487 into master will decrease coverage by 0.08%.
The diff coverage is 44.51%.


@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
- Coverage   53.11%   53.03%   -0.09%     
==========================================
  Files          19       20       +1     
  Lines        1525     1567      +42     
==========================================
+ Hits          810      831      +21     
- Misses        715      736      +21
Impacted Files Coverage Δ
src/PyCall.jl 59.51% <ø> (ø) ⬆️
src/numpy.jl 67.24% <100%> (+16.33%) ⬆️
src/pyarray.jl 39.02% <39.02%> (ø)
src/conversions.jl 52.17% <50%> (-0.61%) ⬇️
src/pybuffer.jl 51.31% <65.51%> (+13.81%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update 0e3f2ac...dbc9caa.

Review thread on src/PyCall.jl (outdated):
pyarr = convert(PyArray, ao1)
ao2 = arrpyo(11.0:20.0, "d")
setdata!(pyarr, ao2)
@test all(pyarr[1:10] .== 11.0:20.0)
@JobJob (Contributor, Author) commented on the diff:

using PyCall
np = pyimport("numpy")
pyarr = pycall(np["array"], PyArray, [1:10;])
pyarr[1:2]

The above errors on current master on Julia 1.0.1 🙀

@JobJob (Contributor, Author) commented Nov 2, 2018

Bumpity bump? @stevengj

@JobJob (Contributor, Author) commented Nov 13, 2018

Hello?

@stevengj stevengj merged commit e74bf61 into JuliaPy:master Nov 13, 2018
@stevengj (Member) commented:

Thanks for keeping at this!

@JobJob (Contributor, Author) commented Nov 13, 2018

🎉 🎉 🎉
Thanks to you @stevengj you brilliant beast 😄
Sorry the diff became so big - will try to keep things more manageable in future.

@tkf (Member) commented Nov 15, 2018

There are failures on Julia 0.7 after this was merged: https://travis-ci.org/JuliaPy/PyCall.jl/builds/454473965

(One failure is on nightly, but it was terminated due to "No output has been received in the last 10m0s" during installation, so it's probably irrelevant.)

The build before the merge was all green: https://travis-ci.org/JuliaPy/PyCall.jl/builds/450430878
So the source tree was identical and the tests should pass, unless there were upstream updates. But there are no updates for Julia 0.7.

Also, Travis reports "Job errored" rather than "Job failed", even though there seems to be a segmentation fault during the test:
https://travis-ci.org/JuliaPy/PyCall.jl/jobs/454473973#L756

What is happening?

@JobJob (Contributor, Author) commented Nov 15, 2018

I think it could be related to 48d730f.
I can replicate it on 0.7 with master locally, but when I reset to 17dcb37 on the branch of this PR (array-perf on https://github.com/JobJob/PyCall.jl) things are fine, i.e. no segfault at the end of testing.

@JobJob (Contributor, Author) commented Nov 15, 2018

N.b. I didn't merge that commit into the branch of this PR.

@JobJob (Contributor, Author) commented Nov 15, 2018

Simple fix in #615

@marius311 (Contributor) commented:

Apologies, I haven't followed the discussion above, but I did bisect a failure in my code when upgrading to PyCall#master down to this commit. It seems that before this commit,

julia> using PyPlot
julia> subplots(1, 1; squeeze=false)[2]
1×1 Array{PyCall.PyObject,2}:
 PyObject <matplotlib.axes._subplots.AxesSubplot object at 0x7f75f95c7610>

but after

julia> using PyPlot
julia> subplots(1, 1; squeeze=false)[2]
PyObject array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7ff0846ed610>]],
      dtype=object)

Was this intended or not?

In either case, is there a workaround to force conversion to a Julia Matrix in the second case?

@stevengj (Member) commented:

Probably not intended. Have you tried convert(Matrix, ans) on the result?

@marius311 (Contributor) commented Jan 24, 2019

Thanks. Neither of these two attempts on current master seems to work:

julia> convert(Matrix, subplots(1,1;squeeze=false)[2])
ERROR: MethodError: Cannot `convert` an object of type PyObject to an object of type Array{T,2} where T
Closest candidates are:
  convert(::Type{T<:Array}, ::AbstractArray) where T<:Array at array.jl:474
  convert(::Type{T<:AbstractArray}, ::T<:AbstractArray) where T<:AbstractArray at abstractarray.jl:14
  convert(::Type{T<:AbstractArray}, ::LinearAlgebra.Factorization) where T<:AbstractArray at /home/marius/src/julia/usr/share/julia/stdlib/v1.1/LinearAlgebra/src/factorization.jl:46
  ...
Stacktrace:
 [1] top-level scope at none:0

julia> PyArray(subplots(1,1;squeeze=false)[2])
ERROR: KeyError: key "O" not found
Stacktrace:
 [1] getindex at ./dict.jl:478 [inlined]
 [2] array_format(::PyBuffer) at /home/marius/.julia/dev/PyCall/src/pybuffer.jl:261
 [3] PyArray_Info(::PyObject) at /home/marius/.julia/dev/PyCall/src/pyarray.jl:17
 [4] PyArray(::PyObject) at /home/marius/.julia/dev/PyCall/src/pyarray.jl:123
 [5] top-level scope at none:0

But permutedims(getindex.(Ref(_), 1:m, (1:n)')) works fine, so I'm happy with that workaround.

5 participants