- I could help to sync up the model implementations, but I don't know about the GGML stuff. Your plan sounds reasonable, but the faster we can move towards a new format/backend, the better.
- What do you mean by this?
- G'day, everyone!
In the last week, we've seen two quantization format changes. Our strategy was to bridge them by dequantizing old models and requantizing in the new format at launch, but I didn't implement it in time, and there are a couple of minor technical issues remaining. At the time of writing, we do not yet support version 3/qnt2 (#252).
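For context, a dequantize-then-requantize bridge could look something like the minimal sketch below. Everything here is hypothetical: `Tensor`, `FormatVersion`, `dequantize_v2`, and `quantize_v3` are invented names for illustration, not existing APIs in this codebase or in `ggml-sys`.

```rust
/// Raw tensor bytes as read from a model file. Hypothetical type for this
/// sketch; the real loader has its own representation.
#[derive(Clone)]
struct Tensor {
    name: String,
    data: Vec<u8>,
}

/// Quantization format versions we know about.
#[derive(Clone, Copy, PartialEq)]
enum FormatVersion {
    V2,
    V3,
}

// Stub codecs: real implementations would call into the appropriate ggml
// bindings for each format version.
fn dequantize_v2(_bytes: &[u8]) -> Result<Vec<f32>, String> {
    Err("not implemented in this sketch".to_string())
}
fn quantize_v3(_floats: &[f32]) -> Result<Vec<u8>, String> {
    Err("not implemented in this sketch".to_string())
}

/// Bridge a tensor from an older quantization version to the current one by
/// round-tripping through f32. Lossy on top of lossy, but it keeps old model
/// files loadable after a format change.
fn bridge_tensor(tensor: &Tensor, from: FormatVersion) -> Result<Tensor, String> {
    if from == FormatVersion::V3 {
        // Already in the current format; nothing to do.
        return Ok(tensor.clone());
    }

    let floats = dequantize_v2(&tensor.data)?;
    let data = quantize_v3(&floats)?;

    Ok(Tensor {
        name: tensor.name.clone(),
        data,
    })
}
```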
Going forward, we'll need a better way to handle this. I've been thinking about potential solutions, and this is what I'm considering:

- Abstract over the `ggml` backend so that we can support others (backend #31 / Build and execute our own computation graph #137). This would mean that there would be a mapping between the N backends and M format versions, with each backend supporting different formats. We would likely select the best backend for the user by default, with an option for the user to override the backend.
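To make the backend/format mapping concrete, here's a minimal sketch of what the selection logic could look like. Everything in it (`Backend`, `FormatVersion`, `select_backend`) is invented for illustration and is not a proposal for the actual trait design.

```rust
/// Format versions a backend might understand. Hypothetical for this sketch.
#[derive(Clone, Copy, PartialEq, Debug)]
enum FormatVersion {
    V1,
    V2,
    V3,
}

/// A compute backend (ggml, a future Rust-native graph executor, etc.).
trait Backend {
    fn name(&self) -> &'static str;
    /// The format versions this backend can load and run.
    fn supported_formats(&self) -> &[FormatVersion];
}

/// Pick a backend for a given model format. Backends are assumed to be
/// registered in preference order, so the first supporting backend is the
/// "best" default; an explicit user override takes priority when it matches.
fn select_backend<'a>(
    backends: &'a [Box<dyn Backend>],
    format: FormatVersion,
    user_override: Option<&str>,
) -> Option<&'a dyn Backend> {
    backends
        .iter()
        .map(|b| b.as_ref())
        .filter(|b| b.supported_formats().contains(&format))
        .find(|b| user_override.map_or(true, |name| b.name() == name))
}
```

The default-with-override behaviour falls out of the ordering: the filter keeps only backends that can run the format, and the find either honours the override or takes the first (most preferred) match.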
In terms of how we'll actually do this:

- Pin `ggml`, then introduce new versions of `ggml-sys` that provide defined last-known-good versions of each quantization format (see the sketch after this list).
- Add support for `.pt`, ONNX, and whatever else the ML world may throw at us.

Does this sound like a reasonable plan of attack? What should we watch out for? Is there anything you'd like to see prioritised?
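To illustrate the pinned-`ggml-sys` item: the idea is roughly one set of bindings per last-known-good upstream revision, so older codecs remain callable after upstream moves on. The module names, function names, and signatures below are all invented for this sketch; the real ggml quantization functions are C FFI with different shapes.

```rust
// Hypothetical layout: one module (or crate version) of bindings per pinned
// ggml revision, each exposing that era's quantization codecs.

mod ggml_sys_v2 {
    /// Codec as it existed at the revision that defined format version 2.
    pub fn dequantize_q4_0(_src: &[u8], dst: &mut [f32]) {
        // Real code would FFI into the pinned v2 ggml; stubbed out here.
        dst.fill(0.0);
    }
}

mod ggml_sys_v3 {
    /// Codec as it existed at the revision that defined format version 3.
    pub fn quantize_q4_0(_src: &[f32], dst: &mut [u8]) {
        // Real code would FFI into the pinned v3 ggml; stubbed out here.
        dst.fill(0);
    }
}

/// The load-time bridge described earlier would then pair an old decoder
/// with the current encoder: v2 bytes -> f32 -> v3 bytes.
fn upgrade_q4_0(old_bytes: &[u8], n_elements: usize, new_size: usize) -> Vec<u8> {
    let mut floats = vec![0.0f32; n_elements];
    ggml_sys_v2::dequantize_q4_0(old_bytes, &mut floats);

    let mut new_bytes = vec![0u8; new_size];
    ggml_sys_v3::quantize_q4_0(&floats, &mut new_bytes);
    new_bytes
}
```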
Let us know!