Operator fusion #607
Comments
Does anyone know how other frameworks do it? My naive first thought is to add some …
I have some ideas for this, but they would require a pretty major overhaul to tensor_ops, and some changes to how tensors are represented. I think that we should target folding for reshapes (stack, concat, slice, permute, select, gather), unary operations, binary operations, and reductions. Effectively planning folds for these operations requires access to the full graph of tensor operations, and for tensors to only sometimes contain their data. I do not think we can reasonably fold convolutions or matrix multiplications, as each requires calling external kernels. In this scheme, kernels would consist of the following four phases:
On the CPU, folded operations can be represented with Rust structs/enums that contain information on all reshape operations and all mathematical operations, with the mathematical operations represented something like:

```rust
struct BinaryOperation {
    input_register1: usize,
    input_register2: usize,
    output_register: usize,
    op: Box<dyn Fn(f32, f32) -> f32>, // or &'static dyn
    cuda_representation: String,
}
```

or with a specialized trait. This format should be designed to be directly translated into a CUDA kernel, and executed or JIT compiled on the CPU. Blocking questions:
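To make the CPU execution half of this concrete, here is a heavily simplified sketch of how a fused chain of such register-based binary ops could be interpreted element-wise. `run_fused`, the register layout, and the example op chain are illustrative assumptions, not dfdx code:

```rust
// Sketch only: interpret a folded chain of element-wise binary ops on the CPU,
// reusing the register-based representation from the comment above.
struct BinaryOperation {
    input_register1: usize,
    input_register2: usize,
    output_register: usize,
    op: Box<dyn Fn(f32, f32) -> f32>,
    cuda_representation: String, // e.g. "r2 = r0 + r1;" for the CUDA path
}

/// Runs every op in the chain per element; registers 0 and 1 hold the inputs.
fn run_fused(ops: &[BinaryOperation], a: &[f32], b: &[f32], num_registers: usize) -> Vec<f32> {
    let mut regs = vec![0.0f32; num_registers];
    let mut out = Vec::with_capacity(a.len());
    for (&x, &y) in a.iter().zip(b.iter()) {
        regs[0] = x;
        regs[1] = y;
        let mut last = 0;
        for op in ops {
            regs[op.output_register] = (op.op)(regs[op.input_register1], regs[op.input_register2]);
            last = op.output_register;
        }
        out.push(regs[last]);
    }
    out
}

fn main() {
    // (a + b) * a, fused into one pass with no intermediate tensor allocations.
    let ops = vec![
        BinaryOperation {
            input_register1: 0,
            input_register2: 1,
            output_register: 2,
            op: Box::new(|x, y| x + y),
            cuda_representation: "r2 = r0 + r1;".into(),
        },
        BinaryOperation {
            input_register1: 2,
            input_register2: 0,
            output_register: 3,
            op: Box::new(|x, y| x * y),
            cuda_representation: "r3 = r2 * r0;".into(),
        },
    ];
    let out = run_fused(&ops, &[1.0, 2.0], &[3.0, 4.0], 4);
    assert_eq!(out, vec![4.0, 12.0]);
}
```

The same op list could be lowered to a CUDA kernel by concatenating the `cuda_representation` strings instead of interpreting the closures.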
Do you know what other frameworks fuse? Each op we could fuse would save on allocation/data movement, but I think a whole rewrite of tensor ops seems like a big price to pay. I also want to add to the discussion: when should we not fuse? If a tensor is only used once, it makes sense to fuse and this does save computation time. However if a tensor is re-used, does fusing cost twice as much computation?
In my head we would have:

```rust
struct Tensor {
    ...
    lazy_fns: Arc<Vec<D::Func>>,
}
```

after … then after … Interestingly enough, ln/exp are opposites, so we could actually just remove these two 😁 but for the sake of example, whatever would be doing the fusing would compile this to … If we had a case of the tensor being used multiple times (e.g. the tensor.ln() was also used for another operation like …
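To make the `lazy_fns` idea concrete, here is a minimal sketch with illustrative names only (`LazyTensor`, `materialize` are not dfdx types, and real code would go through the device's `Func` type rather than plain fn pointers): deferred unary fns are queued on the tensor and applied in a single fused pass when the data is needed.

```rust
// Sketch only: queue unary ops lazily and apply them in one fused pass.
type UnaryFn = fn(f32) -> f32;

struct LazyTensor {
    data: Vec<f32>,
    lazy_fns: Vec<UnaryFn>,
}

impl LazyTensor {
    fn ln(mut self) -> Self {
        self.lazy_fns.push(f32::ln);
        self
    }

    fn exp(mut self) -> Self {
        self.lazy_fns.push(f32::exp);
        self
    }

    /// One pass over the buffer instead of one allocation + loop per op. A
    /// fusion pass could also scan `lazy_fns` here and drop cancelling pairs.
    fn materialize(mut self) -> Vec<f32> {
        for x in self.data.iter_mut() {
            for f in &self.lazy_fns {
                *x = f(*x);
            }
        }
        self.data
    }
}

fn main() {
    let t = LazyTensor { data: vec![1.0, 2.0], lazy_fns: Vec::new() };
    // tensor.ln().exp() becomes two queued fns and a single loop at the end.
    let out = t.ln().exp().materialize();
    assert!((out[1] - 2.0).abs() < 1e-6);
}
```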
What about using something similar to Rust's iterators? We could even go as far as to do various reductions such as the one you mentioned (removing an ln/exp pair that cancels out). Another topic related to this that is worth discussing is CUDA Graphs #360. If we go with this lazy evaluation pattern then we could also generate CUDA graphs automatically (reducing the overhead of calling individual kernels).
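As a rough sketch of the iterator-style idea (purely illustrative, not an actual dfdx API): each op wraps the previous one in a new type, so the whole chain monomorphizes into a single loop, much like `Iterator` adapters.

```rust
// Sketch only: type-based lazy op chaining in the spirit of Iterator adapters.
trait LazyOp {
    fn eval(&self, x: f32) -> f32;
}

struct Id;
impl LazyOp for Id {
    fn eval(&self, x: f32) -> f32 { x }
}

struct Ln<P: LazyOp>(P);
impl<P: LazyOp> LazyOp for Ln<P> {
    fn eval(&self, x: f32) -> f32 { self.0.eval(x).ln() }
}

struct Exp<P: LazyOp>(P);
impl<P: LazyOp> LazyOp for Exp<P> {
    fn eval(&self, x: f32) -> f32 { self.0.eval(x).exp() }
}

/// Single fused pass over the buffer; the chain inlines like iterator adapters.
fn apply<P: LazyOp>(op: &P, data: &mut [f32]) {
    for x in data.iter_mut() {
        *x = op.eval(*x);
    }
}

fn main() {
    let chain = Exp(Ln(Id)); // analogous to tensor.ln().exp()
    let mut data = vec![1.0f32, 2.0, 3.0];
    apply(&chain, &mut data);
    assert!((data[2] - 3.0).abs() < 1e-6);
}
```

One wrinkle, which the next comment also raises: with this encoding the op structure lives entirely in the types, so any rewriting (like dropping the ln/exp pair) would have to happen at compile time rather than on a runtime op list.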
The type-based system would need to be at compile time, right? That's how Rust's iterators work. Would the vec-based system be implementable as a special tape? Since all operations are recorded on the tape, it could build up this vec of operations during some initial pass, then combine the eligible ops, generate the kernels, and then produce a new forward function?
Another idea I'll add to the mix is adding some wrapper type around the device, like:

```rust
struct FusedVec<E> {
    data: Vec<E>,
    ops: Vec<Box<dyn Fn()>>,
}

impl DeviceStorage<E> for Fusing<Cpu> {
    type Storage = FusedVec<E>;
}

...

let dev: Fusing<Cpu> = Default::default();
```

This would require moving away from the GATs that were introduced in #633.
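A hedged sketch of how such a `FusedVec` storage might queue and then flush its ops (`push_op` and `flush` are made-up names, and the real `DeviceStorage` integration would be more involved):

```rust
// Sketch only: storage that records element-wise ops and runs them in one go.
struct FusedVec<E> {
    data: Vec<E>,
    // Deferred ops recorded instead of being run eagerly.
    ops: Vec<Box<dyn Fn(&mut [E])>>,
}

impl<E> FusedVec<E> {
    /// Record an op without touching the data yet.
    fn push_op(&mut self, op: impl Fn(&mut [E]) + 'static) {
        self.ops.push(Box::new(op));
    }

    /// Run every queued op right before the data is actually read.
    fn flush(&mut self) {
        for op in self.ops.drain(..) {
            op(self.data.as_mut_slice());
        }
    }
}

fn main() {
    let mut storage = FusedVec { data: vec![1.0f32, 2.0], ops: Vec::new() };
    storage.push_op(|buf| buf.iter_mut().for_each(|x| *x = x.ln()));
    storage.push_op(|buf| buf.iter_mut().for_each(|x| *x = x.exp()));
    storage.flush(); // both ops could be fused or cancelled here before running
    assert!((storage.data[1] - 2.0).abs() < 1e-6);
}
```

The interesting property is that fusion decisions live in the device/storage type rather than on each tensor, which is why it bumps into the GAT-based storage design mentioned above.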
For future reference: https://live.juliacon.org/talk/9RFTHY, https://github.com/PumasAI/SimpleChains.jl
I gener(ic)ally like this idea, but reducing something like …
I think the way it would be done is with some trait like …
I'll throw in an idea I've been thinking about when reading through tinygrad code. There's essentially a spectrum of ways to run DL computation.

**Eager**

On one end, there's Pytorch (or at least Pytorch 1.x, and currently dfdx), where everything is eager. When …

**Static**

On the opposite end is Tensorflow 1.x, where everything is fully static. The entire network gets built as a huge DAG of operations, and nothing gets run until the model is compiled and executed. This means when you write the operations in the model, they can be reordered, changed, or entirely deleted so long as the end behavior is the same. This allows the TF compiler to work with the network at a global level, and understand every single thing going on all at once. Of course, this means the limit to optimization is the power of the compiler and the creativity of the people programming it. It results in the fastest models with the most aggressive optimization. The downside is that it's really hard to debug, as no prints can be put in the middle, operations aren't straightforward, and the network is difficult to program, with things like …

**Hybrid**

In tinygrad, the goal is to compute everything lazily, basically only running the computations when the data is actually needed. Which means that when …

However, in typical Python fashion, this is all super implicit and handled behind the scenes. If one dynamic line is added somewhere deep down in the module tree, a potentially very large graph that could have been well optimized gets split, perf goes down, and it's hard to understand why without going through every line of code.

**Explicitly Hybrid**

I think this can be done in an explicit (albeit less developer-friendly) way. We can still keep the current Tensor with the eager operations, which means all modules can be directly ported over. But for modules that have a high performance cost or get run often, we can instead define a local graph:

```rust
/// Feedforward layer for a transformer
struct FF<I, M, O> {
    lin1: Linear<I, M>,
    lin2: Linear<M, O>,
}

/// Typical eager forward
fn forward(&self, in: Tensor<I>) -> Tensor<O> {
    let mut mid = self.lin1.forward(in);
    mid = mid.relu();
    self.lin2.forward(mid)
}

/// Graph-ified forward
fn forward(&self, in: Tensor<I>) -> Tensor<O> {
    let graph = Graph::new(in).apply(self.lin1.forward).relu().apply(self.lin2.forward);
    graph.compute()
}
```

In this second forward, we create a graph object, which then wraps the tensor and goes through the same ops, only now we don't actually execute them, only track them. This API still preserves the type safety / tensor shape safety, but the computation only happens when compute is called. At that point, the graph is optimized and run (and the optimized graph can be cached).

We can go much further and allow the forward function to take in something that turns into a Graph, which would be either a graph or a tensor, and output a graph so that this module can be part of a larger graph, rather than only the small graph we demonstrated above:

```rust
/// Maybe part of a larger graph!
fn forward<G: Into<Graph<Tensor<O>>>>(&self, in: G) -> Graph<Tensor<O>> {
    let graph = in.into(); // If "in" is already a graph, this is a no-op
    let graph = graph.apply(self.lin1.forward).relu().apply(self.lin2.forward);
    graph // Notice no compute here, we're not doing any computation, just passing it back to the caller!
}
```

Apologies for the wall of text, just a view of what I think the future of DL libs will look like. I think dfdx is uniquely positioned to take the lead in perf if these graphs can be optimized well enough. Safe and fast!
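For completeness, here is a bare-bones sketch of what such a `Graph` wrapper could look like. `Graph`, `apply`, and `compute` follow the pseudocode above and are not an existing dfdx API; a real implementation would inspect and rewrite the recorded ops (fuse, reorder, cache) before running them, and ops like `relu` would just be more recorded nodes.

```rust
// Illustrative only: a deferred-execution wrapper in the spirit of the
// `Graph` pseudocode above (not part of dfdx).
struct Graph<T> {
    input: T,
    ops: Vec<Box<dyn FnOnce(T) -> T>>,
}

impl<T> Graph<T> {
    fn new(input: T) -> Self {
        Graph { input, ops: Vec::new() }
    }

    /// Record an op without executing it.
    fn apply(mut self, op: impl FnOnce(T) -> T + 'static) -> Self {
        self.ops.push(Box::new(op));
        self
    }

    /// This is where a fusion pass could rewrite `ops`; here we just run them.
    fn compute(self) -> T {
        let mut value = self.input;
        for op in self.ops {
            value = op(value);
        }
        value
    }
}

fn main() {
    let out = Graph::new(2.0_f32)
        .apply(|x| x * 3.0)    // stand-in for lin1.forward
        .apply(|x| x.max(0.0)) // stand-in for relu
        .compute();
    assert_eq!(out, 6.0);
}
```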
@coreylowman I know you're busy, but when you get a chance I would love to hear some thoughts.
Another idea (that fails to address the "automatic" fusing problem, but is probably simpler) would be to implement "manual" fusing.

```rust
let f = |x: Resolve<f32>, y: Resolve<f32>| x.add(y).mul(3.6).sub(y);

let a = f(4f32.to_val(), 3f32.to_val());
assert_eq!(a.eval(), 22.2);

let src = f("x".to_marker(), "y".to_marker()).to_cl_source();
assert_eq!(src, "(((x + y) * 3.6) - y)");
```

The user would then need to specify the operations applied to the tensor in a similar closure.
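If it helps to see the mechanism, here is a hedged toy version of that idea: the same generic expression can either be evaluated to a number or turned into kernel source, depending on the type it is instantiated with. This is only an illustration, not the actual crate behind `Resolve` / `to_cl_source`, and `MathOps`, `Val`, and `Marker` are made-up names.

```rust
// Sketch only: one expression, two interpretations (eager value vs. source).
trait MathOps: Clone {
    fn add(self, rhs: Self) -> Self;
    fn mul(self, rhs: f32) -> Self;
    fn sub(self, rhs: Self) -> Self;
}

/// Eager interpretation: just do the arithmetic.
#[derive(Clone, Copy)]
struct Val(f32);
impl MathOps for Val {
    fn add(self, rhs: Self) -> Self { Val(self.0 + rhs.0) }
    fn mul(self, rhs: f32) -> Self { Val(self.0 * rhs) }
    fn sub(self, rhs: Self) -> Self { Val(self.0 - rhs.0) }
}

/// Source-building interpretation: record the expression as a string.
#[derive(Clone)]
struct Marker(String);
impl MathOps for Marker {
    fn add(self, rhs: Self) -> Self { Marker(format!("({} + {})", self.0, rhs.0)) }
    fn mul(self, rhs: f32) -> Self { Marker(format!("({} * {})", self.0, rhs)) }
    fn sub(self, rhs: Self) -> Self { Marker(format!("({} - {})", self.0, rhs.0)) }
}

/// The "manually fused" expression, written once.
fn f<T: MathOps>(x: T, y: T) -> T {
    x.add(y.clone()).mul(3.6).sub(y)
}

fn main() {
    let a = f(Val(4.0), Val(3.0));
    assert!((a.0 - 22.2).abs() < 1e-5);

    let src = f(Marker("x".into()), Marker("y".into())).0;
    assert_eq!(src, "(((x + y) * 3.6) - y)");
}
```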
I've been working on a DL library that does fully static computation, which allows it to do aggressive fusion / compilation before running: https://github.com/jafioti/luminal. Llama now runs on it! The approach I took is pretty incompatible with dfdx, which relies on eager execution, but it might be another useful approach to look at.
Awesome work! Any benchmarks to report? I'm super curious what the performance benefits of this are. I'm leaning towards not approaching this because it's very complex, and as noted in other places, writing custom fused kernels is pretty standard these days and lets you take advantage of things that automatic fusion maybe won't be able to (see flash attention & flash attention 2).
@coreylowman Perf right now is absolute dogwater because not much in the way of fusion optimizers has been written yet. I've been trying to speedrun achieving llama with the smallest set of primitive operators possible (11!). The next step will be to write optimizers to start fusing the primitive graph down to be reasonably fast. Good news is that this should be super straightforward, since optimizers take a global view of the graph, and so optimizations can pretty much go as far as your imagination takes you. Also, since optimizers just take in a global graph and mutate it, both manual kernels and automatic fusion are possible. In a week or so I expect to have some decent benchmarks.
Keep us posted 👍
Let's discuss how operator fusion might work in dfdx. I suspect it will require a lot of work. On the CUDA side of things it will at least require JIT compiling kernels.
Originally posted by @jafioti in #590 (comment)
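On the "JIT compiling kernels" point: dfdx's CUDA backend is built on the cudarc crate, whose nvrtc module can compile CUDA C source at runtime. Below is a hedged sketch of turning a fused element-wise chain into a single kernel; `fused_kernel_src` is an illustrative helper, and cudarc function names/signatures may differ between versions.

```rust
// Sketch only: build CUDA C for a fused element-wise chain and JIT-compile it
// with cudarc's nvrtc bindings. Launching the PTX would go through cudarc's
// driver API and is omitted here.
use cudarc::nvrtc::compile_ptx;

/// e.g. ops = ["logf(x)", "expf(x)"] -> one kernel applying both in sequence.
fn fused_kernel_src(ops: &[&str]) -> String {
    let mut body = String::new();
    for op in ops {
        body.push_str(&format!("        x = {};\n", op));
    }
    format!(
        r#"extern "C" __global__ void fused(float *buf, int n) {{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {{
        float x = buf[i];
{body}        buf[i] = x;
    }}
}}"#
    )
}

fn main() {
    let src = fused_kernel_src(&["logf(x)", "expf(x)"]);
    // JIT compile to PTX at runtime; a fusion pass would decide `ops` per graph.
    let _ptx = compile_ptx(src).expect("nvrtc compilation failed");
}
```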