
WIP: exploratory branch for fast array views #5003

Closed
wants to merge 13 commits

Conversation

StefanKarpinski
Sponsor Member

The code gen for the core indexing computations – both multidimensional and linear – in c980bae makes me hopeful that we can get array views to be a very fast core component of the language. This approach is really making me wonder if we shouldn't have a simple single-dimensional Vec or Buffer type that underlies all arrays, in terms of which they can be defined. We could even hide it from sight for the most part, but it makes defining arrays in Julia much more doable.
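For concreteness, here is a minimal sketch of the layering being described: a flat one-dimensional buffer as the sole storage primitive, with an N-d array defined in Julia on top of it. All names are hypothetical and written in present-day syntax; nothing below is from the actual branch.

```julia
# Hypothetical sketch: a flat 1-d Buffer as the one primitive storage
# type, with N-d arrays defined in Julia on top of it.
struct Buffer{T}
    data::Vector{T}          # stand-in for raw, fixed-size storage
end

struct NDArray{T,N}
    buf::Buffer{T}
    dims::NTuple{N,Int}
end

# Column-major linear indexing goes straight through to the buffer.
Base.getindex(a::NDArray, i::Int) = a.buf.data[i]

# Multidimensional indexing is plain stride arithmetic over the buffer.
function Base.getindex(a::NDArray{T,2}, i::Int, j::Int) where {T}
    a.buf.data[i + (j - 1) * a.dims[1]]
end
```

With a single underlying storage type, arrays, views, and reshapes could all be thin wrappers that differ only in their index arithmetic.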

@ViralBShah
Member

We used to have a Buffer in the very early days, if you remember. :-)

@StefanKarpinski
Sponsor Member Author

I do recall. Then we decided tuple, buffer and array were just too many vector-like types to explain to people. I'm thinking more something that you use to implement other things, not a "user-facing" data type, if you see what I mean.

@ViralBShah
Member

There was also the performance issue of having all the metadata associated with arrays in a composite type, IIRC. That may no longer be the case with immutable types.

@StefanKarpinski
Sponsor Member Author

Yes, it might be more reasonable now. But in any case, it's not necessary for this. The thing that made me think of it is that this currently is written specifically to be a view of a vector, not a multidimensional array. I could make the type more general, but then I'd have to add a gratuitous type parameter for the parent array, which is annoying. I could also construct views of multidimensional arrays by reshaping them to 1-d and then taking a view of that, but that's a bit weird.

@kmsquire
Member

kmsquire commented Dec 4, 2013

divmod support (#4437) might be nice here.

@StefanKarpinski
Sponsor Member Author

LLVM does that automatically on x86 :-)

See the native instructions here: c980bae – only div, no mod.

@kmsquire
Member

kmsquire commented Dec 4, 2013

Nice. I'll have to brush up on my assembly to really understand what's happening, but nice!

@timholy
Sponsor Member

timholy commented Dec 4, 2013

Surprisingly, this branch doesn't seem to fix the performance problem, even though I can verify that the assembly is shorter.

On this branch:

julia> code_native(gtindex, (typeof(B2), Int))
        .text
Filename: /tmp/gtindex.jl
Source line: 2
        push    RBP
        mov     RBP, RSP
        push    R14
        push    RBX
        mov     RBX, RSI
        mov     R14, RDI
Source line: 2
        movabs  RAX, 139991579184432
        mov     ESI, 1
        call    RAX
        mov     RCX, RAX
Source line: 147
        mov     RSI, QWORD PTR [R14 + 8]
        test    RSI, RSI
        je      69
        mov     RDI, QWORD PTR [R14 + 32]
        test    RDI, RDI
        je      56
Source line: 3
        dec     RBX
Source line: 4
        mov     RAX, RBX
        cqo
        idiv    RCX
Source line: 147
        mov     RDI, QWORD PTR [RDI + 8]
        mov     RDX, QWORD PTR [RDI + 8]
        imul    RDX, RAX
Source line: 5
        imul    RAX, RCX
        sub     RBX, RAX
Source line: 147
        imul    RBX, QWORD PTR [RDI]
        add     RBX, QWORD PTR [R14 + 40]
        add     RBX, RDX
        mov     RAX, QWORD PTR [RSI + 8]
        movsd   XMM0, QWORD PTR [RAX + 8*RBX - 8]
Source line: 7
        pop     RBX
        pop     R14
        pop     RBP
        ret
Source line: 147
        movabs  RAX, 139991621867280
        mov     RDI, QWORD PTR [RAX]
        movabs  RAX, 139991607851584
        mov     ESI, 147
        call    RAX

On master:

julia> code_native(gtindex, (typeof(B2), Int))
        .text
Filename: /tmp/gtindex.jl
Source line: 2
        push    RBP
        mov     RBP, RSP
        push    R14
        push    RBX
        mov     RBX, RSI
        mov     R14, RDI
Source line: 2
        movabs  RAX, 139991602731920
        mov     ESI, 1
        call    RAX
        mov     RCX, RAX
        cmp     RCX, -1
Source line: 4
        setne   AL
Source line: 3
        dec     RBX
        movabs  RDX, -9223372036854775808
        cmp     RBX, RDX
Source line: 4
        setne   DL
Source line: 2
        test    RCX, RCX
        je      87
Source line: 4
        or      AL, DL
        je      79
Source line: 147
        mov     RSI, QWORD PTR [R14 + 8]
        test    RSI, RSI
        je      96
        mov     RDI, QWORD PTR [R14 + 32]
        test    RDI, RDI
        je      83
Source line: 4
        mov     RAX, RBX
        cqo
        idiv    RCX
Source line: 147
        mov     RDI, QWORD PTR [RDI + 8]
        mov     RDX, QWORD PTR [RDI + 8]
        imul    RDX, RAX
Source line: 5
        imul    RAX, RCX
        sub     RBX, RAX
Source line: 147
        imul    RBX, QWORD PTR [RDI]
        add     RBX, QWORD PTR [R14 + 40]
        add     RBX, RDX
        mov     RAX, QWORD PTR [RSI + 8]
        movsd   XMM0, QWORD PTR [RAX + 8*RBX - 8]
Source line: 7
        pop     RBX
        pop     R14
        pop     RBP
        ret
Source line: 4
        movabs  RAX, 139991645423408
        mov     RDI, QWORD PTR [RAX]
        movabs  RAX, 139991631408592
        mov     ESI, 4
        call    RAX
Source line: 147
        movabs  RAX, 139991645423376
        mov     RDI, QWORD PTR [RAX]
        movabs  RAX, 139991631408592
        mov     ESI, 147
        call    RAX

I've updated that gist to make sure that any differences in inlining (which seems to be prevented by assigning the output to a variable) are compensated for. The assembly generated by gcc is here.

@timholy
Sponsor Member

timholy commented Dec 4, 2013

Oh, and the profile suggests that the div statement is still the main issue.

@StefanKarpinski
Sponsor Member Author

Hmm. Maybe I'll have to ponder hard how we might be able to get rid of the div. It's a really hard problem though; I'm not sure it's even possible.

@timholy
Sponsor Member

timholy commented Dec 4, 2013

Here's a perhaps-interesting observation: with the @inbounds ret = ... (preventing inlining), the Julia timings are similar (not identical) to the non-optimized gcc timings (see the code in the gist). In particular, without optimization the C implementation of the subarray version is much slower than the C array version (by about the same margin as for Julia), whereas on -O2 they are nearly indistinguishable. Might there be some sophisticated trickery here, like hoisting some of the div-type operations out of the getindex call and into the containing loop? (I can't read assembly well enough to decipher this myself.)

@simonster
Member

@timholy I suspect that was with strd2 = size(s,1) and not @inbounds strd2 = s.dims[1]? With that change I get assembly that more closely resembles the C version, with two major differences: 1) the Julia code does checks to determine whether the fields of the SubArray type are undefined and 2) in the C version, you are passing the size and strides as arguments so they are being passed in registers, while in the Julia version they are being loaded from memory.

@timholy
Sponsor Member

timholy commented Dec 4, 2013

You're right that the gist did not include that optimization, but I noticed that too and tested it locally. As we noted before, it definitely makes a difference, but only about 10%. Performance-wise, there's still a 3-4x gap between non-inlined 2d array vs. subarray indexing in Julia.

The two differences you noted (nice detective work, thanks!) seem like potential explanations of the performance gap. One test might be to compare to cartesian indexing A[i,j], which has the same issues but avoids the div. Here I used getindex, which allows inlining in all cases except linear indexing with SubArrays. Here's what I get:

Array 2d: elapsed time: 0.186270968 seconds (64 bytes allocated)
SubArray 2d: elapsed time: 1.991903323 seconds (1664 bytes allocated)
Cartesian, Array 2d: elapsed time: 0.189886095 seconds (64 bytes allocated)
Cartesian, SubArray 2d: elapsed time: 0.431062165 seconds (64 bytes allocated)

So there's a 2.25-fold gap for cartesian indexing. Cartesian indexing has the same issues you point out for linear indexing, but the gap is smaller.

The tenfold gap for linear indexing, by comparison to Arrays, will hopefully be mostly fixed by @vtjnash's inlining work?

One last thing to note is the puzzling allocation of memory for linear-indexing SubArrays.

@toivoh
Contributor

toivoh commented Dec 4, 2013

The div is for linear indexing into array views, right?
One of my motivations to work on broadcasting was to help the case for array views. I think that most things that we currently implement with linear indexing would be more properly and powerfully done with broadcasting, and hopefully more efficiently as well. The major trouble I think is that it's a lot harder to find and implement good abstractions for broadcasting to use in everyday code as compared to linear indexing.

One simple observation though: I get the feeling that most uses of linear indexing involve simple iteration over the array (view) elements. It should be a lot simpler to do efficient linear iteration as compared to linear random access indexing.
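To illustrate this observation, here is a hypothetical 2-d strided view (modern syntax, not this branch's actual type): random linear access needs a divrem on every lookup, while straight iteration only needs additions.

```julia
# Hypothetical 2-d strided view; all names here are assumptions.
struct StridedView2
    parent::Vector{Float64}
    first::Int               # zero-based offset of the first element
    strides::NTuple{2,Int}   # stride along each dimension
    dims::NTuple{2,Int}
end

# Random access from a linear index i needs a divrem to split i into
# per-dimension subscripts.
function getindex_linear(v::StridedView2, i::Int)
    d, r = divrem(i - 1, v.dims[1])
    v.parent[v.first + r*v.strides[1] + d*v.strides[2] + 1]
end

# Iteration only needs additions: bump the offset by strides[1] within
# a column and by strides[2] between columns.
function sum_view(v::StridedView2)
    s = 0.0
    off = v.first
    for j = 1:v.dims[2]
        p = off
        for i = 1:v.dims[1]
            s += v.parent[p + 1]
            p += v.strides[1]      # step within a column: one add
        end
        off += v.strides[2]        # step to the next column: one add
    end
    return s
end
```

Most loops over all elements could, in principle, be lowered to the second form, which is what makes linear iteration so much cheaper than linear random access.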

@JeffBezanson
Sponsor Member

I have felt for a while that we need a different idiom for iterating over arrays than linear indexing. Linear indexing only makes sense for that application for simple dense arrays.

@timholy
Sponsor Member

timholy commented Dec 4, 2013

I agree too; of course, a natural alternative is Cartesian indexing. Basically, Cartesian.jl was born in part to handle this issue with the performance of SubArray indexing. (It's more than that, but it was a major motivation.)

Hence #1917 will almost certainly quickly become important in any discussion of alternatives to linear indexing.

@davidssmith
Contributor

+1 I'm very excited about this. This is half of what I do all day.

- only supports Vector{T} as the data array
- multidimensional indexing via an arbitrary linear transform
- linear indexing compatible with the multidimensional indexing
I'm not even sure why we were generating code to check for a zero
denominator – this seems to still raise an exception as expected.
The other special case just isn't worth it to emit more than one
instruction for div – and div/rem since x86 does both together.
This version lets you see how tight the code gen can be without the
gratuitous (now-removed) checks from the div intrinsic and the undef
checks from loading the fields from the ArrayView structure:

	julia> code_native(view2data,(Int,(Int,Int),(Int,Int)))
		.section	__TEXT,__text,regular,pure_instructions
	Filename: /Users/stefan/projects/julia/base/array/view.jl
	Source line: 17
		push	RBP
		mov	RBP, RSP
	Source line: 17
		mov	R8, QWORD PTR [RSI + 16]
		mov	RSI, QWORD PTR [RSI + 24]
		mov	RAX, QWORD PTR [RDX + 16]
		mov	RCX, QWORD PTR [RDX + 24]
		mov	RCX, QWORD PTR [RCX + 8]
		dec	RCX
		imul	RCX, QWORD PTR [RSI + 8]
		mov	RAX, QWORD PTR [RAX + 8]
		dec	RAX
		imul	RAX, QWORD PTR [R8 + 8]
		add	RAX, RDI
	Source line: 19
		lea	RAX, QWORD PTR [RCX + RAX + 1]
		pop	RBP
		ret

	julia> code_native(view2data,(Int,(Int,Int),(Int,Int),Int))
		.section	__TEXT,__text,regular,pure_instructions
	Filename: /Users/stefan/projects/julia/base/array/view.jl
	Source line: 23
		push	RBP
		mov	RBP, RSP
		mov	R8, RDX
	Source line: 23
		lea	RAX, QWORD PTR [RCX - 1]
	Source line: 25
		mov	RCX, QWORD PTR [RSI + 16]
		mov	R9, QWORD PTR [RSI + 24]
		cqo
		idiv	QWORD PTR [RCX + 8]
		mov	RCX, RDX
		mov	RDX, QWORD PTR [R8 + 16]
		mov	RSI, QWORD PTR [R8 + 24]
		imul	RCX, QWORD PTR [RDX + 8]
		add	RCX, RDI
		cqo
		idiv	QWORD PTR [R9 + 8]
		imul	RDX, QWORD PTR [RSI + 8]
	Source line: 28
		lea	RAX, QWORD PTR [RDX + RCX + 1]
		pop	RBP
		ret
Also support views into multidimensional arrays. This wasn't all that
annoying, actually; I realized I could just add an `m` parameter.
julia> include(joinpath(JULIA_HOME,"../../base/array/view.jl"))
benchmark (generic function with 4 methods)

julia> results = benchmark();

julia> data = squeeze(mapslices(median,results[:,:,2:3]./results[:,:,1],2),2)
10x2 Array{Float64,2}:
  6.64002     2.99833
  6.69273    16.3886
  8.59361    29.9632
 10.4218     48.2432
 12.1854     65.8061
 13.2618   2357.6
 16.1612   2534.79
 17.3616   2367.67
 17.1718   2754.53
 20.8452   3254.62

julia> data./[1:size(data,1)]
10x2 Array{Float64,2}:
 6.64002    2.99833
 3.34636    8.19429
 2.86454    9.98773
 2.60545   12.0608
 2.43708   13.1612
 2.21029  392.934
 2.30875  362.113
 2.1702   295.958
 1.90797  306.059
 2.08452  325.462
unfortunately, this takes a damned long time to run because the
multidimensional linear SubArray indexing is *soooo* slow.
@ViralBShah
Member

@lindahua Just pinging you on this thread, in case you haven't seen it already.

@StefanKarpinski
Sponsor Member Author

Yes, that's about right. The improvement is good but I'm still a bit sad that I can't get it even closer to direct linear indexing. It's pretty damned hard to compete with just lea + fetch.

@ViralBShah
Member

I do feel that we need to get array views as fast as array indexing. Perhaps we need to hack some support deep in the internals, but it would be worth it.

@StefanKarpinski
Sponsor Member Author

I'm not entirely sure it can be done, regardless of how internal you get. There's a certain amount of arithmetic that has to be done to turn a linear index into an offset into the original array and there's a fair amount of non-linearity to it, so you can't just collapse the computations down all the way. But we may be able to get it within 2x if we're very careful.

@ViralBShah
Member

Can't we get the offset with pointer_to_array, and then do the same arithmetic that is currently done for Array indexing? That was the approach @lindahua was trying here:

https://github.com/JuliaLang/julia/pull/3503/files

@StefanKarpinski
Sponsor Member Author

You can do that, but you still have to do more arithmetic. Like I said, the relationship between the input index and the offset into the original array is not a simple linear relationship.

@StefanKarpinski
Sponsor Member Author

In particular, the amount of work you need to do is proportional to the number of dimensions.
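That proportionality can be made explicit with a hypothetical helper (not from this branch): peeling the linear index apart takes one divrem per dimension.

```julia
# Hypothetical helper: translate a linear index into a zero-based
# offset into the parent array. One divrem per dimension, so the work
# grows with the dimensionality.
function view_offset(i::Int, dims::NTuple{N,Int}, strides::NTuple{N,Int}) where {N}
    off = 0
    r = i - 1
    for d = 1:N-1
        r, s = divrem(r, dims[d])   # split off the d-th subscript
        off += s * strides[d]
    end
    off + r * strides[N]
end
```

For a 1-d view the loop disappears and this collapses to a single multiply, which is why vector views are so much cheaper.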

@StefanKarpinski
Sponsor Member Author

Ok, those undef checks are really killing us. I just manually hoisted the lookups for those out of the benchmarking loop and got this:

julia> squeeze(mapslices(median,10^9*results./sizes,2),2)
10x2 Array{Float64,2}:
 1.93126     6.10509
 1.47899     6.98912
 1.40065    13.3511
 1.50102    56.747
 1.49355    78.5422
 1.52309    93.0085
 0.421136   33.0249
 0.247773   21.4358
 1.76813   131.094
 1.5015    175.582

julia> squeeze(mapslices(median,results[:,:,2]./results[:,:,1],2),2)
10-element Array{Float64,1}:
   3.15233
   4.73653
   9.32549
  38.1872
  52.2354
  61.1526
  78.704
  84.6919
  75.6112
 117.906

I didn't run the SubArray benchmarks since they're slow and won't have changed. This is pretty respectable performance. I think that with some more performance tweaking and appropriate optimizations, we can get this really close to direct linear indexing, but unless someone comes up with a brilliant trick, there's always going to be a cost to linear indexing into high-dimensional array views.

@timholy
Sponsor Member

timholy commented Dec 29, 2013

@StefanKarpinski, this is amazing work.

With regards to what is ultimately possible, have you considered making more extensive comparisons against C? In particular, an analysis that I had done earlier seemed to suggest that for C, the arithmetic (at least for 2d) was inconsequential compared with cache misses. Are you running these tests with arrays too big to fit into L1 cache?

Of course, it's also possible that there's something devious going on, and I misinterpret the results of the C code.

Finally, there's another major issue: whether getindex itself is inlined within the caller. Inlining makes a huge difference in tight loops, and I suspect it's currently impossible to write SubArray linear indexing in a way that allows it to be inlined. In my own investigations I had come to the conclusion that perhaps we should wait to re-evaluate the performance of array-view indexing until after #3796 gets merged.

@lindahua
Contributor

@StefanKarpinski This is great work. Thanks for taking the lead here.

My two cents:

  • I have tried to tackle the problem of efficient indexing several times, but it ended up much more complex than I had thought. I still believe, however, that this problem is doable given enough time and effort.

  • I agree that array views (instead of copies) are the way to go.

  • I think there should be two kinds of views: contiguous views and non-contiguous strided views. The reason is that linear indexing of these two kinds of views is vastly different. For contiguous views, linear indexing is obviously trivial. But for non-contiguous views, linear indexing may involve integer division, which is considerably slower and more complex.

    The internals of these two kinds of views are also different. Contiguous views only have to maintain the memory address and the shape, while non-contiguous views also have to maintain the strides.

    It is easy to determine when to return contiguous views in a static way (with type stability). Suppose a is a contiguous array/view; then anything that looks like a[:], a[:,:], a[:,i], a[:,:,i], etc., where i is an integer or a contiguous range, will also be a contiguous view. Otherwise it is in general not contiguous, and a strided view should be returned.

  • Specialization is important to get efficiency for common cases. It takes some effort, but it is worth it. I think specializing things up to 2D/3D should cover at least 90% of everyday use of the language. Of course, a lot of test cases are needed to ensure that things are implemented correctly (including corner cases).

  • In my previous try, I found that index checking usually kills performance if implemented in a typical Julian way. So some internal support might be needed to address this issue.
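The static contiguity rule above can be sketched as a predicate on the index types alone, so the compiler can resolve it with type stability. This is a hypothetical helper in present-day syntax, not a proposed API:

```julia
# Hypothetical contiguity check, decided purely from index types:
# every leading index must be a full slice (Colon); only the last may
# be an integer or a contiguous (unit-step) range.
function iscontiguous(I::Type...)
    isempty(I) && return true
    for k = 1:length(I)-1
        I[k] === Colon || return false
    end
    I[end] === Colon || I[end] === Int || I[end] <: UnitRange
end
```

Note that a[i, :] fails the test: a partial leading index breaks contiguity in column-major storage, so a strided view would be returned instead.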

@StefanKarpinski
Sponsor Member Author

@timholy: trying out these techniques in C is a good idea since it will give a good sense of what the best performance we can possibly get would be. I worry that the C compiler might do things that wouldn't be valid for Julia's compilation to do because it can perform static global analysis that we can't.

@lindahua: I think you are absolutely right about contiguous views and it does seem like there are a lot of situations where we can statically determine that a view should be contiguous. I'll have to think about that a bit more – definitely a huge performance win.

@JeffBezanson
Sponsor Member

Although it might add additional challenges, I'd also like to explore moving away from linear indexing. It clearly isn't general enough. Tuples of integers can be fast now, so we have more options.

@StefanKarpinski
Sponsor Member Author

I definitely agree that leaning so hard on linear indexing is just causing a problem that doesn't need to exist. Now that we have a proposed hierarchy of array-like things, we should try to figure out a better general protocol for iterating over them that will be efficient for a wide range of implementations. I don't think that faster tuples help so much if you mean using tuples as state since it's not really the multiple pieces of state that we need but the multiple layers of control – we really need nested for loops, each of which has its own piece of state that it can update very simply. More pieces of state with a single control loop isn't any better because the update step still ends up being very complex – and it still happens on each iteration.

@JeffBezanson
Sponsor Member

That's a good point, but I was hoping that using a local tuple would allow LLVM's scalar conversion and hoisting to figure out which pieces of the indexing calculations are constant for each loop. We could also end up with each loop variable being an integer, but the programmer seeing a single tuple as each index.

We need a for i = 1:length(n) for the twenty-first century. It would be great if a simple loop like this could work roughly equally well on local and distributed arrays. In no way do I expect a silver bullet, but there could at least be something better than the fallback implementation individually fetching every element from a remote processor and taking 1000x longer.
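One shape such a protocol could take is an iterator that yields tuples of integers, letting each container choose a traversal that is cheap for it. This is only a sketch in modern syntax with hypothetical names (Julia's Base did eventually grow CartesianIndices along these lines):

```julia
# Hypothetical iterator over all index tuples of a given shape,
# first dimension fastest (column-major order).
struct IndexTuples{N}
    dims::NTuple{N,Int}
end

Base.length(r::IndexTuples) = prod(r.dims)

Base.iterate(r::IndexTuples{N}) where {N} =
    any(d -> d < 1, r.dims) ? nothing : (ntuple(_ -> 1, N), ntuple(_ -> 1, N))

function Base.iterate(r::IndexTuples{N}, state::NTuple{N,Int}) where {N}
    s = collect(state)
    for d = 1:N                # increment with carry
        if s[d] < r.dims[d]
            s[d] += 1
            t = ntuple(i -> s[i], N)
            return (t, t)
        end
        s[d] = 1               # carry into the next dimension
    end
    return nothing             # wrapped past the last index: done
end
```

A dense array could lower this to a plain counter, while a distributed array could walk its local chunks in bulk rather than fetching elements one by one.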

@timholy
Sponsor Member

timholy commented Dec 30, 2013

I agree that more flexible iteration is highly desirable.

@StefanKarpinski, on linear indexing and C: depending on how much work it is for you, one way to find out would be to look at the assembly generated by the code in that gist I linked to. That's 2d only, but it would be a start.

@lindahua lindahua mentioned this pull request Jan 6, 2014
@jiahao jiahao mentioned this pull request Jan 8, 2014
@dmbates
Member

dmbates commented Jan 10, 2014

Could someone clarify for me the state of the art regarding ArrayView, NumericExtensions.UnsafeVectorView, etc.? I am interested in present facilities and the near term future.

The context is that I am writing a new edition of Bates and Watts, 1988, "Nonlinear Regression Analysis and Its Applications". This will be a large rewrite under the working title of "Nonlinear Regression with R" but I now propose changing that to "Nonlinear Regression with R and Julia".

For generality, the model function evaluation for partially linear models (almost all nonlinear regression models have one or more parameters that occur linearly in the model) has a signature of

### Michaelis-Menten model for enzyme kinetics
function MicMenmmf(nlpars::StridedVector,x::StridedVector,
                   tg::StridedVector,MMD::StridedMatrix)
    x1 = x[1]
    denom = nlpars[1] + x1
    MMD[1,1] = -(tg[1] =  x1/denom)/denom
end

because I am using sub on vectors and matrices when calling such a model function.

On thinking about it, I am probably better off using duck typing as advocated by Steven Johnson and not using a signature.

I still need to decide how to call such a function. All the arrays involved in the evaluation are configured so that the array views passed to the function will be contiguous views. Right now I am leaning toward using the unsafe views from the NumericExtensions package for these, because these are contiguous. Should I expect to continue to use those, or will ArrayViews be preferred?

@vtjnash
Sponsor Member

vtjnash commented Jan 11, 2014

@dmbates off-topic, but I would avoid using the NumericExtensions.Unsafe* functions. I've only read through the code quickly, but it makes a number of assumptions which would make segfaults and incorrect answers likely. If performance is really that critical, IMHO, you are probably better off writing your code in C, where at least the trade-off is obvious.

A much better solution is to combine limited and judicious usage of @inbounds with non-copying array indexing (e.g. sub, which will eventually become the default), as you are doing now.

@lindahua
Contributor

Ok, let me make this more explicit. unsafe_views do make several assumptions: (1) through the lifetime of the view, the underlying array must not be resized (so that the cached pointer remains valid); (2) no bounds checking: it assumes the input indices are within bounds. I think the word unsafe in the names should have given the user proper warning about these assumptions.

These views should be used with caution, but I don't think they are as risky as @vtjnash suggests. If one uses them within a local scope (i.e. not passing them around) and carefully ensures that the indices are actually in bounds, there won't be a lot of risk.

I introduced unsafe_views basically as a stopgap. For example, if you want to efficiently process columns one by one, you currently have two choices in Julia base:

  1. use a[:, i] -- it makes an unnecessary copy
  2. use sub(a, 1:m, i) -- this does not make the copy. But working with a sub-array is many times (sometimes more than one order of magnitude) slower than with an ordinary array.

unsafe_views addresses both issues -- but you have to be careful when using them.

When the array views in Julia base become fast enough (which I am eagerly looking forward to), one can just do a search & replace to replace unsafe_view with sub or view, and the code is likely to continue to work.

@vtjnash
Sponsor Member

vtjnash commented Jan 11, 2014

They aren't properly gc-rooted, so it is unsafe to assume that the original array will stick around, and you have to assume that any code that could get called on the array anywhere (esp. code that is expecting out-of-bounds conditions to throw an error) does not have typos. As I said before, if you want your code to require manual memory management and tracking of pointers, then just be honest about it and use C.

I generally have a strong dislike of creating stopgaps, I would rather see the effort put into writing correct code and enjoying the speed increases as the underlying issues get addressed.

@lindahua
Contributor

I agree with your general opinions. However, I don't think using unsafe_view is always a bad idea.

Admittedly, unsafe views are not a terribly elegant solution. I myself don't use them if there are other solutions with comparable performance (within 2x - 3x) and productivity.

However, when the performance gap is 5x - 10x or even bigger, I don't see why I would still stick to the recommended ways and have my program run 10 times slower. In the rare cases that I use unsafe_views in my code, I strictly follow the pattern below:

function bar(vec, ...)
    ....
end

function foo( a::Matrix, ... )
    for i = 1 : size(a, 2)
        col = unsafe_view(a, :, i);
        bar(col,  ....)
    end
end

I never throw the unsafe views around, and the code context makes it clear that the indices are within bounds. I don't see how the use of unsafe views under such restricted conditions would do any harm.

I can certainly implement all these in C. But that means I also have to migrate the related functions (such as bar in the example above) to C. Not to mention that there is much less support for numerical computation in C, so implementing these functions there would take much longer.

Generally, I suggest writing code the safe way in the first version of an implementation, and then using unsafe views, with caution, to improve the performance of critical parts (preferably under the guidance of a profiler).

In terms of efficient array views, that's a complex problem. People have been working on it for months, and still the progress is very limited. Therefore, such stopgaps are introduced to temporarily work around the problem.

Sometimes, we have to make tradeoffs from a practical standpoint.
