
Framework for symbolic optimizations #122

Closed
ViralBShah opened this issue Jul 16, 2011 · 35 comments
Assignees
Labels
performance Must go faster speculative Whether the change will be implemented is speculative

Comments

@ViralBShah
Member

We need a framework to express certain mathematical optimizations in julia itself. These may be expressed as rules that are run after types have been inferred. Examples are:

  1. A' * B, A' \ B: Can be computed without computing the transpose by calling BLAS/LAPACK routines
  2. A - B + C .* D: Can be computed without temporaries
  3. A[m:n, p:q] + B: Avoid computing the ref.
  4. A[m:n, p:q] * B: Avoid computing the sref, and call DGEMM directly in cases where this is possible.
  5. [ a b c; d f; e]: Compute the entire concatenation in one shot.
  6. A[p,q] = B[r,s]: Temporary can be avoided by calling assign(A, sub(B, r, s), p, q)

In all cases, temporaries can be avoided. Either we can do it with compiler optimizations, or have the parser pass such expressions down to the runtime. If no matching runtime implementation for such a case is found, the existing base case should certainly be used.
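As an illustrative sketch of case 1 (in later Julia syntax; the helper name `at_mul_b` and the `Float64`-only signature are assumptions, not the eventual Base implementation): `A' * B` can be lowered to a single BLAS call with the transpose flag, so the transpose is never materialized.

```julia
using LinearAlgebra

# Hypothetical fused method: computes A' * B in one BLAS gemm call using
# the 'T' flag, instead of materializing the transpose first.
at_mul_b(A::Matrix{Float64}, B::Matrix{Float64}) = BLAS.gemm('T', 'N', 1.0, A, B)

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]
at_mul_b(A, B) == A' * B   # true, with no explicit transpose of A
```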

@ghost ghost assigned JeffBezanson Jul 16, 2011
@JeffBezanson
Sponsor Member

#5 above is easy since it doesn't generalize (it only deals with 2 dimensions). How should it be called? Maybe

hvcat({a, b, c}, {d, f}, {e})

?

Of course a fallback definition can be provided so implementing it is only an optimization.
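A fallback along these lines, sketched against the `hvcat(rows, args...)` signature that was eventually committed (the name `hvcat_fallback` is illustrative), might look like:

```julia
# Fallback sketch: hcat each block row, then vcat the rows. `rows` gives
# the number of arguments in each block row, e.g. (3, 2, 1) for [a b c; d f; e].
function hvcat_fallback(rows::Tuple{Vararg{Int}}, xs...)
    i = 1
    blockrows = Any[]
    for n in rows
        push!(blockrows, hcat(xs[i:i+n-1]...))
        i += n
    end
    vcat(blockrows...)
end

hvcat_fallback((2, 2), [1 2], [3 4], [5 6], [7 8])  # == [1 2 3 4; 5 6 7 8]
```

With a generic path like this in place, a specialized one-shot `hvcat` is purely an optimization.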

Same idea applies to a'*b, a*b', etc. We could have aTb() and abT() methods.

For #2, we already parse runs of + and * to single calls, so +(as::Array...) can be implemented.
Is it worth having special methods for s*x+y etc.?
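For the n-ary elementwise case, a sketch of what `+(as::Array...)` buys (one output allocation instead of one temporary per pairwise `+`; the name `sum_arrays` is illustrative):

```julia
# n-ary elementwise sum: one pass, one output array, no intermediate temporaries.
function sum_arrays(as::Array{Float64}...)
    out = copy(as[1])
    for k in 2:length(as)
        for i in eachindex(out)
            out[i] += as[k][i]
        end
    end
    out
end

sum_arrays([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])  # == [9.0, 12.0], one allocation
```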

@ViralBShah
Member Author

Yes, #5 is easier - and the suggested syntax seems ok.

I guess aTb() and abT() are good starts for a first release, until a more general framework can be implemented. Ideally, BLAS allows you to efficiently do A[I,J]^T*B[I,J]^T + C[I,J]^T where I and J are ranges. I guess the indexing part can be handled through subarrays, but then the user has to use subarrays explicitly.

For #2, I was only thinking of element-wise operations for a start.

Think of a stencil operation. It will be essentially the element-wise stuff, but with some indexing thrown in. It would be nice to handle these cases with indexing. Perhaps we need to use subarrays more automatically in more places.

@JeffBezanson
Sponsor Member

Well I'm saying we can already handle element-wise a+b+c+d+....

What special methods do we want? I guess I could also make special versions of the operators where you know that the first and/or second argument can be overwritten.

@ViralBShah
Member Author

I was referring to element-wise a-b.*c+4*d.

@JeffBezanson
Sponsor Member

Seems to me passing this to something at run time amounts to handling arithmetic operators with an interpreter?

@ViralBShah
Member Author

Would be cool if there were a way to do this. I guess that for Dense Arrays, one can just hack it into the compiler as a special case.

@StefanKarpinski
Sponsor Member

The A*B' and A'*B cases could be handled by allowing cheap transposes, by storing strides for each dimension. That way A' would be a trivial operation that uses the same data as A and just puts a new matrix object around it with different strides. Then the multiply call could just pass dgemm the correct stride values and it would work.

Unfortunately, this presumes that you can do T = A' without any copying, which in the current implementation you can't. This brings us back to immutability vs. mutability issues...
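A minimal sketch of the zero-copy idea (the `StridedT` type here is illustrative only; Julia much later adopted a similar lazy wrapper, `LinearAlgebra.Transpose`):

```julia
# A transposed "view": shares the parent's data and swaps the index/stride
# metadata, so constructing it costs nothing and copies nothing.
struct StridedT{T} <: AbstractMatrix{T}
    parent::Matrix{T}
end
Base.size(t::StridedT) = reverse(size(t.parent))
Base.getindex(t::StridedT, i::Int, j::Int) = t.parent[j, i]

A = [1 2 3; 4 5 6]
At = StridedT(A)      # no copy of A's data
At[3, 2] == A[2, 3]   # true
```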

@ViralBShah
Member Author

I have often thought of this but am not sure the resulting complexity is worthwhile. Also, it annoys the user who knows what they are doing.

@StefanKarpinski
Sponsor Member

It really only annoys the user who knows what they're doing in Matlab and doesn't know what they're doing in Julia. There could be ways to force an actual transpose to be done. With the mutability vs. immutability stuff, this approach wouldn't apply just to transpose, but also to subarray refs — since those could safely be references into the same data as the original array with different strides — and many other cases too.

@ViralBShah
Member Author

Here's a suggestion for this: how about extending * to allow the following?

For A*B:
*(A,B)

For A'*B'+C', where each of the transposes may or may not happen.
*(A, transA, B, transB, C, transC)

We'd also want one more form for x'*A'*y'. Perhaps * can take a 7th argument that says what type of operation it is: a'b'+c' or x'A'y'.
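A hedged sketch of how such a flag-taking form could bottom out in a single BLAS call (handling only the transA/transB flags; `mul_trans` is a hypothetical name, and the transC case and the 7th argument are omitted):

```julia
using LinearAlgebra

# Proposed *(A, transA, B, transB, C) form mapped onto one gemm! call;
# the flags are 'N'/'T', exactly as in BLAS itself.
mul_trans(A, tA::Char, B, tB::Char, C) = BLAS.gemm!(tA, tB, 1.0, A, B, 1.0, copy(C))

A = [1.0 2.0; 3.0 4.0]
B = [1.0 0.0; 0.0 1.0]              # identity, so A' * B + 0 is just A'
mul_trans(A, 'T', B, 'N', zeros(2, 2)) == A'   # true
```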

@StefanKarpinski
Sponsor Member

Ahh! These all go away if we deal with mutability and allow cheap transposes.

@ViralBShah
Member Author

I don't like the idea of the cheap transpose. Cheap transposes are not always desirable, as they just move the problem elsewhere. Most fast matlab code is written with an assumption on stride and storage, and the cheap transpose just messes up that performance model. Also, every single operation has to be made stride-aware, and also will become a bit slower due to dealing with strides everywhere.

I'd be ok with a cheap transpose that is evaluated lazily, but not one that just changes the stride.

@JeffBezanson
Sponsor Member

We can have functions for A'*B'+C', x'*A'*y', etc., but I'd rather they not be part of * since they don't just multiply. We need other names for them.

@ViralBShah
Member Author

muladd() and bilinear()?

@ViralBShah
Member Author

I tried a simple experiment on randmatstat, minimizing the array allocations and concatenation with the following change:

    ab = randn(n, 2*n)
    cd = randn(n, 2*n)
    P = [ab cd]
    Q = [ab; cd]

It gives a 10% improvement right away, and we can perhaps get a bit more. Optimizing P'*P and Q'*Q will possibly give a bit more improvement.

JeffBezanson added a commit that referenced this issue Aug 11, 2011
  called as hvcat((rows...), arguments...) where the first argument is
  a tuple giving the number of arguments in each block row
addresses issue #122
also related to issue #38
@ViralBShah
Member Author

commit 361588d adds transposed multiplication.

@ViralBShah
Member Author

Things such as A[m:n, p:q] + B and A[m:n, p:q] * B can be done efficiently with SubArrays, but not automatically.

@o-jasper

Common Lisp has compiler macros; these should be fairly easy to implement. Basically, they are just macros, except that they may choose to do nothing, and they are intended for optimization.

If you do this, it might actually be more desirable to allow the compiler macros to run after type inference, so that that information is available. But then they might have to be careful not to break whatever has been inferred, or type inference would have to run again. The disadvantage, of course, is that such macros can do anything whatsoever, so care has to be taken in writing them. (And of course you can make a macro that turns conversion rules into compiler macros.)

Also, do you care about float-behavior specifics? If you wanted to, maybe using the types Real32/Real64 could 'mean' that something like log(1+x) is converted to log1p(x), or exp(x)-1 to expm1(x).
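The numerical point behind such a rewrite is real: for x below machine epsilon, forming 1 + x rounds away exactly the information log1p keeps.

```julia
# For tiny x, 1 + x rounds to exactly 1.0 in Float64, so log(1 + x)
# returns 0.0; log1p(x) computes log(1 + x) without forming 1 + x.
x = 1e-16
log(1 + x)   # 0.0: the small term is lost in the rounding
log1p(x)     # ≈ 1.0e-16, accurate to full precision
```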

@StefanKarpinski
Sponsor Member

I find it quite comical that no matter what language feature X is, the statement "Common Lisp has X" always holds.

@fph

fph commented Oct 28, 2012

We can have functions for A'*B'+C' [...] We need other names for them.

I think the standard name for this operation is axpy.

@tknopp
Contributor

tknopp commented Dec 29, 2013

Given that array views seem to be the way to go (see #5003), wouldn't it be consistent to also let the transpose operator return a view? I am not sure what to do when assigning the transpose to a new variable (i.e. x=y'), but this should be handled the same way as array views (which are the right choice IMHO).

@johnmyleswhite
Member

I believe there's another issue open for making transpose operations produce a new immutable Transposed type.

@jiahao
Member

jiahao commented Dec 29, 2013

#4774

@lindahua
Contributor

FWIW, I have been exploring something along these lines in Accelereval.jl. I am considering using it to replace Devectorize.jl. Progress is not very fast, as I only work on it when I have some free time.

This work is inspired by Theano, but takes advantage of Julia's multiple dispatch and metaprogramming to simplify the interface. The goal is to automatically devectorize expressions or turn them into BLAS calls whenever appropriate.

@jiahao
Member

jiahao commented Dec 29, 2013

We already have generic axpy! (#5189).

#4950 provides syr2k!/her2k! for rank-2k updates of the form C = a*A*B' + conj(a)*B*A' + b*C. The possibility of rewrite rules to turn expressions of this form into calls to syr2k!/her2k! was mentioned.

@tknopp
Contributor

tknopp commented Dec 29, 2013

#4774 is about how to handle transpose for 1d arrays. I would propose that transpose in general returns a view where the strides are reversed.

@lindahua your metaprogramming work is impressive!

@StefanKarpinski
Sponsor Member

Having slicing return views may actually help alleviate the issue that v'' is not of the same dimensionality as v'. If you follow along with my ArrayView pull request (#5003), when v::Array{T,1} you could have v'::ArrayView{T,2,1} and then know that since the original v is one-dimensional, v'' should also be one-dimensional. It's a bit sketchy and @lindahua's proposed contiguous array views may be necessary to make it work.

@malmaud
Contributor

malmaud commented Jan 5, 2014

Neal has been implementing some of these features in R recently: http://radfordneal.wordpress.com/2014/01/01/new-version-of-pqr-now-with-task-merging/. Might be worth taking some inspiration.

The new version extends use of this mechanism to merge transpose and matrix multiply operations — now t(A)%*%B does not actually compute the transpose of A, but instead calls a routine that directly computes the transpose of A times B.

@JeffBezanson
Sponsor Member

This is well-trod ground. I'm not sure I've heard this optimization called "task merging" before; it is usually called "stream fusion".

We could hack something into the compiler to recognize and replace certain patterns, I just don't find it very satisfying. Library-defined rewrite rules are probably the best option I'm aware of, but they still feel like a hack. An easier case might be high-latency operations like on DArrays, where you could do delayed evaluation at run time, all at the library level.

@mlubin
Member

mlubin commented Jan 6, 2014

What would these library-defined rewrite rules look like? Right now it's hard to implement something like this with a macro because inferred types aren't available at compile time. Solving this issue in a sufficiently general way will directly help JuMP.

@lindahua
Contributor

lindahua commented Jan 6, 2014

@StefanKarpinski mentioned the staged function idea on the mailing list.

@lindahua
Contributor

lindahua commented Jan 6, 2014

This might be a complicated issue: suppose the compiler managed to make the inferred types available to the "later-staged" macro; what if that macro somehow modifies the types?

@mlubin
Member

mlubin commented Jan 6, 2014

Operator overloading with lazy objects is a very intuitive way to do this "task merging". If only we could have some interface where these lazy objects can be computed once at compile time, so long as the compiler can guarantee the types of the objects from type inference (otherwise it's done at runtime).
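A toy sketch of that lazy-overloading approach (all names here are illustrative; elementwise * for vectors is assumed for the sketch, and a real design would fuse broadcast-style):

```julia
# Operators build a lazy expression tree; materialize() walks it once,
# fusing the whole expression into a single loop with no temporaries.
struct Lazy{F,A,B}
    f::F
    a::A
    b::B
end
Base.:*(a::Vector{Float64}, b::Vector{Float64}) = Lazy(*, a, b)  # elementwise, for this sketch
Base.:+(a::Vector{Float64}, b::Lazy) = Lazy(+, a, b)

ith(x::Vector{Float64}, i) = x[i]                 # leaf: plain element access
ith(x::Lazy, i) = x.f(ith(x.a, i), ith(x.b, i))   # node: apply op elementwise
len(x::Vector{Float64}) = length(x)
len(x::Lazy) = len(x.a)
materialize(x::Lazy) = [ith(x, i) for i in 1:len(x)]

a = [1.0, 2.0]; b = [3.0, 4.0]; c = [5.0, 6.0]
materialize(a + b * c)   # == [16.0, 26.0], computed in one fused loop
```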

@mlubin
Member

mlubin commented Feb 5, 2014

I keep coming back to this for JuMP. I essentially need a post-macro step that lets me write code based on the inferred types of objects in an expression tree. It's not feasible to write code that dispatches on the types of the objects or generates isa branches, because there's an exponential number of possible type combinations. How crazy is it to make this work in Julia?

@ViralBShah
Member Author

Much of this issue has been addressed in different ways: with staged functions, array views, and the various mul operators.

Keno pushed a commit that referenced this issue Oct 9, 2023
Reorganize the source files (fixes #79)