Add RWKV2 (fast) #17230

leondz · 2022-05-13T10:48:16Z

Model description

I would like to implement a new model architecture.

Short description

RWKV v2 is an "RNN with transformer-level performance, without using attention. Similar to Apple's Attention Free Transformer. All trained models open-source. Inference is very fast (even on CPUs) and might work on cell phones. There's also a GPT-type implementation." -- (Hochreiter's description)

RWKV v2 is parallelizable because the time-decay of each channel is data-independent (and trainable). For example, in usual RNN you can adjust the time-decay of a channel from say 0.8 to 0.5 (these are called "gates"), while in RWKV v2 you simply move the information from a W-0.8-channel to a W-0.5-channel to achieve the same effect. RWKV can leverage GPUs, but doesn't need to.

Open source status

The model implementation is available
The model weights are available

Provide useful links for the implementation

Implementation and weights

There's an implementation at BlinkDL/RWKV-LM which also gives a detailed description of the model internals and some performance benchmarks. Model weights currently are being trained for a few datasets, including the Pile (see e.g. BlinkDL/RWKV-v2-RNN-Pile) and Danish Gigaword by me. Both will be openly available - some checkpoints for the Pile already are, even though it's an ongoing process.

Status

The model seems quite exciting and I'm able to replicate preliminary results. I'm already talking with @BlinkDL about the implementation. I'm happy to implement/port the model architecture (for both RNN and GPT variants), tokenizer, and tests myself (and have already started) and would appreciate help and advice.

leondz · 2022-05-16T06:49:52Z

-- on second thoughts: it's not immediately clear to me how many people will use this particular model, or how it will perform. What I'd really like to do is implement and develop it on Hub, and see if it's useful/popular there. I spent an amount of time with the docs, and the route to adding new model architectures seems to preferentially support adding directly to transformers. Tooling for new model architectures that worked on Hub (e.g. cookiecutter, class organisation, and tests) would be super neat. Is that something there's any interest in?

mrseeker · 2022-05-16T12:11:11Z

-- on second thoughts: it's not immediately clear to me how many people will use this particular model, or how it will perform.

To answer your question: If it performs better than the other CausalLM models out there, it will most likely get used. Make a PR, build an initial version that can be run on HF, and see if any of the HF devs are willing to chime in. I am interested in this work, particularly because it solves a problem I haven't seen before: Be able to run CasualLM models on CPU. And my work stretches beyond the KoboldAI team, I know there are more out there that seem to benefit from the usage of CPU models because of the high prices that GPU models currently have.

leondz · 2022-05-20T19:41:23Z

Work is going OK. We're porting the GPT-like part to Transformers first, for training and induction, and will work out the fast RNN induction-only part after the GPT part passes tests.

xloem · 2022-06-26T12:33:40Z

Where is your work at? I have worked on this model and would like to contribute. I'm also experienced now at troubleshooting the parts of this model (mostly inference accuracy though), and have spent time understanding the cuda kernels. I have some experience with adjusting new codebases to unexpected featureset combinations.

jbmaxwell · 2022-09-09T01:08:37Z

I'm also curious how this one is coming along. (I just saw the original paper today. Not sure how I missed it...)

ArEnSc · 2022-10-04T17:33:15Z

@leondz are you guys still working on this? I am looking to get into this if this can work on edge devices

xloem · 2022-10-05T05:24:17Z

Some time ago I looked a little into continuing this, but other things came up.
After that experience, I would recommend that future implementers start a new fork, rather than working off the existing one, because very little has been done, so it can take extra effort to learn the existing situation without much return.
For the record:
leondz's branch is at https://github.com/leondz/transformers/tree/rwkv-v2 .
I added smidges to it at https://github.com/xloem/transformers/tree/rwkv-v2 and https://github.com/xloem/transformers/tree/rwkv-v2-disable_non_clm_for_now .

Since that work, RWKV is on version 4 now (although the changes between versions are not generally complex): https://github.com/BlinkDL/RWKV-LM

mrconter1 · 2022-11-21T09:17:27Z

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

xloem · 2022-11-21T09:35:42Z

You could ask the same about any model or technology near the top of a leaderboard. Things happen because people do the work or make the business decisions behind them happening. There are scads and scads of things better than the original transformer paper, but they're not normative yet.

BlinkDL · 2022-11-22T13:25:06Z

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

This is better but GPT is good enough for most applications.
I will just keep training larger models. RWKV 14B release soon.

ArEnSc · 2022-11-22T17:12:39Z

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

It's not presented well and clearly, I am working on a fork or huggingface integration that answers questions, this is pretty much a breakthrough model imo, I am just making sure the runtimes are true. It still in R and D phase adoption phase comes soon after

leondz · 2022-11-22T17:43:13Z

I spent about a month working on this but the code wasn't stable and wasn't version controlled in the normal way, which made refactoring really tricky. Then time ran out. I think if the engineering side of things is fixed, and there's a stable release, it's a great model - definitely more data-efficient than competitors, which is really the core factor now.

henk717 · 2022-11-30T17:55:29Z

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc?

For our own project we have kind of basic support for it workarounded in with the original base, but the reason we don't finetune it or don't support it properly is because Huggingface support is missing and we are tightly integrated with huggingface. I assume other providers / projects have the same issue. For adoption I'd love to see RWKV land in huggingface so we can begin to offer it to our users the proper way, without them relying on manual steps, and without missing features for this model.

mrconter1 · 2022-11-30T18:27:37Z

Yeah but why doesn't OpenAI literally just spend one month on this with 10 guys and use this? It think this has some drawback but no one can tell me what it is... It's feel reasonable that all new papers from Google, OpenAI should use this. Den ons 30 nov. 2022 18:55henk717 ***@***.***> skrev:

…

I can't understand why this hasn't seen wider adoption. It makes me a bit skeptical. If it's better in all ways compared to the original transformer paper why wouldn't we see adoption from Meta, OpenAI, DeepMind etc? For our own project we have kind of basic support for it workarounded in with the original base, but the reason we don't finetune it or don't support it properly is because Huggingface support is missing and we are tightly integrated with huggingface. I assume other providers / projects have the same issue. For adoption I'd love to see RWKV land in huggingface so we can begin to offer it to our users the proper way, without them relying on manual steps, and without missing features for this model. — Reply to this email directly, view it on GitHub <#17230 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHYLDTWSJQDOOINSE5GVFUDWK6IJZANCNFSM5V275BWA> . You are receiving this because you commented.Message ID: ***@***.***>

BlinkDL · 2022-11-30T20:45:58Z

Yeah but why doesn't OpenAI literally just spend one month on this with 10 guys and use this? It think this has some drawback but no one can tell me what it is... It's feel reasonable that all new papers from Google, OpenAI should use this.

There are a number of papers with similar "exponential moving average" design now.

For example, S4D is using slightly fancier kernels: https://github.com/HazyResearch/state-spaces (while I find simple kernels are enough).

RWKV is weaker at LAMBADA (comparing with GPT) when the model is small (< 3B), but I find adding one single tiny QKV attention is enough to solve it (helps a small model to copy words in prompt).

Moreover, it's reasonable to expect a competitive linear-time attention model, because when human novelists write very long stories the speed is consistent (except GRRM lol).

ArEnSc · 2022-12-01T16:01:01Z

I don't think this project is well known, theres a huge eco system based of just what works right now i.e T5 and GPTx. For example percievers io, and percievers AR by deepmind seems to do something similar to get linear attention. To get this project to that level of popularity we have to build various production level proofs, most people already understand the challenges of T5 and GPTx series. Second the models from a product perspective isn't as important, it's the data that is important. People are making the bets that its smarter to deploy a product with shitty AI and wait for the improvement before investing in the R and D. They build the product and make it easy to replace the AI portion of it in 10 minutes. These factors make it difficult to get projects and indepdent researchers to get the spotlight they need.

mrconter1 · 2022-12-01T16:45:14Z

I understand. But this is the only architecture that has infinite context length. Den tors 1 dec. 2022 17:01Michael Chung ***@***.***> skrev:

…

I don't think this project is well known, theres a huge eco system based of just what works right now i.e T5 and GPT*x. For example percievers, and percievers AR by deepmind seems to do something similar to get linear attention. To get this project to that level of popularity we have to build various production level proofs, most people already understand the challenges of T5 and GPT*x series. Second the models from a product perspective isn't as important, it's the data that is important. People are making the bets that its smarter to deploy a product with shitty AI and wait for the improvement before investing in the R and D. These factors make it difficult to get projects and indepdent researchers to get the spotlight they need. — Reply to this email directly, view it on GitHub <#17230 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHYLDTTICKMR7YJCRZTPKO3WLDDUTANCNFSM5V275BWA> . You are receiving this because you commented.Message ID: ***@***.***>

jbmaxwell · 2022-12-01T17:18:44Z

"...this is the only architecture that has infinite context length."

Wait, really?... How did I miss that? I thought it was just a faster, more efficient approach.

mrconter1 · 2022-12-01T17:24:26Z

"So it's combining the best of RNN and transformer - great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding."

https://www.reddit.com/r/MachineLearning/comments/umq908/_/

Den tors 1 dec. 2022 18:18jbm ***@***.***> skrev:

…

"...this is the only architecture that has infinite context length." Wait, really?... How did I miss that? I thought it was just a faster, more efficient approach? — Reply to this email directly, view it on GitHub <#17230 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHYLDTTLSQHYBSKLYA5BRX3WLDMYBANCNFSM5V275BWA> . You are receiving this because you commented.Message ID: ***@***.***>

xloem · 2022-12-01T17:27:49Z

The context length is presently limited by the accuracy of the floating point representation, due to the heavily simplified and unified architecture. RWKV is a strong combination of speed and long-context.

jbmaxwell · 2022-12-01T17:27:52Z

Right, okay. Well, that's pretty compelling, for sure...

ArEnSc · 2022-12-01T22:25:01Z

The context length is presently limited by the accuracy of the floating point representation, due to the heavily simplified and unified architecture. RWKV is a strong combination of speed and long-context.

I think its also limited by the memory as well

xloem · 2022-12-02T09:55:28Z

There is no memory limit associated with context length that I am aware of with these models. State can be retained in a recurrent manner, providing for using only however much memory is available for accelerated parallel operation.

ArEnSc · 2022-12-02T16:01:20Z

There is no memory limit associated with context length that I am aware of with these models. State can be retained in a recurrent manner, providing for using only however much memory is available for accelerated parallel operation.

So you are telling me, that the context is effectively encoded into the state. I am reffering to the context length of the model consumes. I guess what you are trying to say is that because we have a state, the model can look into that state for any context size? as a result it has an infinite context length? I looked into the code and it says

  T_MAX = 1024 # increase this if your ctx_len is long [NOTE: TAKES LOTS OF VRAM!]

so it appears to have a limit based off memory @BlinkDL can you clearify ?

xloem · 2022-12-02T16:18:35Z

I should let Blink clarify, but regarding T_MAX: https://github.com/BlinkDL/RWKV-LM/blob/a268cd2e40351ee31c30c5f8a5d1266d35b41829/RWKV-v4neo/src/model.py#L34

henk717 · 2022-12-03T02:25:20Z

Since the model support for this stalled, perhaps someone on HF's side such as @younesbelkada can help get this model supported?

BlinkDL · 2022-12-03T05:25:49Z

There is no memory limit associated with context length that I am aware of with these models. State can be retained in a recurrent manner, providing for using only however much memory is available for accelerated parallel operation.

So you are telling me, that the context is effectively encoded into the state. I am reffering to the context length of the model consumes. I guess what you are trying to say is that because we have a state, the model can look into that state for any context size? as a result it has an infinite context length? I looked into the code and it says
  T_MAX = 1024 # increase this if your ctx_len is long [NOTE: TAKES LOTS OF VRAM!]
so it appears to have a limit based off memory @BlinkDL can you clearify ?

I am not using the correct method to train it because I am lazy. But you can always finetune the model to support longer ctxlen. For example, fine-tuned to 4096 here:

https://huggingface.co/BlinkDL/rwkv-4-pile-3b

With the correct training method, I estimate the effective ctx_len can at least be 100K.

mrconter1 · 2022-12-03T08:51:32Z

So it doesn't have "infinite" ctx_len. Den lör 3 dec. 2022 06:26PENG Bo ***@***.***> skrev:

…

There is no memory limit associated with context length that I am aware of with these models. State can be retained in a recurrent manner, providing for using only however much memory is available for accelerated parallel operation. So you are telling me, that the context is effectively encoded into the state. I am reffering to the context length of the model consumes. I guess what you are trying to say is that because we have a state, the model can look into that state for any context size? as a result it has an infinite context length? I looked into the code and it says T_MAX = 1024 # increase this if your ctx_len is long [NOTE: TAKES LOTS OF VRAM!] so it appears to have a limit based off memory @BlinkDL <https://github.com/BlinkDL> can you clearify ? I am not using the correct method to train it because I am lazy. But you can always finetune the model to support longer ctxlen. For example, fine-tuned to 4096 here: https://huggingface.co/BlinkDL/rwkv-4-pile-3b — Reply to this email directly, view it on GitHub <#17230 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AHYLDTTHM2NCFZJFFG4JF63WLLKWTANCNFSM5V275BWA> . You are receiving this because you commented.Message ID: ***@***.***>

xloem · 2022-12-03T09:58:43Z

I suspect technically if you used a rational number representation rather than floating point it would have infinite context length.

Aside: I’m not an ML researcher, but I don’t know why downscaling like this doesn’t get more attention. It seems context length could be fully infinite by re-encoding past information for what is helpful for future states, and a network wired to discover its own architecture would quickly find this.

younesbelkada · 2022-12-05T19:46:23Z

Wow very cool @leondz !
Would be also very keen to have a look at the tutorial you made, we can also ultimately have them on the HF blog to announce the release of this architecture (ofc once we figure out everything about the integration & happy to help you on the post too), how does that sound?

leondz · 2022-12-05T19:51:32Z

It's absolutely BlinkDL's project, so up to them and they get the headline credit, but that sounds lovely - I'm down :)

BlinkDL · 2022-12-05T20:19:17Z

It's absolutely BlinkDL's project, so up to them and they get the headline credit, but that sounds lovely - I'm down :)

Can you share your slides? :)

Consider this a community project, and we can build an ecosystem on top of RWKV, like what happens to Stable Diffusion.

I will focus on improving the algorithm & model - now training RWKV-4a with one single tiny extra attention (just a few extra lines comparing with RWKV-4) to further improve some difficult zeroshot tasks (such as LAMBADA) for smaller models.

ArEnSc · 2022-12-05T20:20:08Z

Hey,

This integration went fine, until two snags wre hit:

the code for reading input couldn't be reproduced

the code for training couldn't be reproduced

I would love to see these stable & independent in their own branch. There was no hope of getting RWKV2 to pass the HF model implementation requirements (esp. the model weights precisely matching!) without these being established, but maybe things are better now.

Re: uptake - this model kicks ass, imo the restrictions have only been the difficulty of re-using/reproducing the codebase while it was under development, and that the paper hadn't been written. The math all checks out (I even wrote some tutorial slides for teaching the model) and the implementations have been elegant, it's just engineering issues in the way. Once a reproducible training codebase & paper are out, it's 🚀 time!

-- also would be super cool to have integrated the fast RNN inference if that's still working, but again the implementation and interface was fluid last time I tried to integrate this, and you can't integrate a moving implementation.

Can I also get the slides perhaps a google docs link for them would be the quickest there are a few parts of this architecture that are still fuzzy to me

xloem · 2022-12-05T20:52:14Z

the code for reading input couldn't be reproduced

the code for training couldn't be reproduced

I wasn’t aware. It’s too bad we didn’t take these things farther; I was having the opposite issue. @ArEnSc , please let us know if there are any snags preventing opening a PR so somebody else can step in too.

leondz · 2022-12-05T20:55:12Z

the code for reading input couldn't be reproduced

the code for training couldn't be reproduced

I wasn’t aware. It’s too bad we didn’t take these things farther; I was having the opposite issue. @ArEnSc , please let us know if there are any snags preventing opening a PR so somebody else can step in too.

It's important to say that this was due to the pace and mode of development, not the model's quality!

harrisonvanderbyl · 2022-12-05T21:31:04Z

Might not be fully helpful, but I have a repository with a bunch of different variations on inference

https://github.com/harrisonvanderbyl/rwkv_chatbot/blob/main/src/model_run_onnx.py for example is a file where I have made the code compatible with onnx, tensorflow, and Iree inference converters (with only some minor tweaking)

ArEnSc · 2022-12-07T17:18:28Z

@ArthurZucker
Hey I am getting issues setting up the dev environment.
I am on python 3.8.10, updated to the latest pip3. I create a venv using 3.8.10 and then run this command
I am on OSX Monterey, M1 Pro.
Which version of python should I be developing on ?

 pip3 install -e ".[dev]"
ERROR: Could not find a version that satisfies the requirement tensorflow-text; extra == "dev" (from transformers[dev]) (from versions: none)
ERROR: No matching distribution found for tensorflow-text; extra == "dev

younesbelkada · 2022-12-08T10:55:26Z

Hi @ArEnSc
Indeed it's a bit tricky to install dev environment on a MAC M1.
Could you please replace your setup.py by this one: https://gist.github.com/younesbelkada/ce24f0b517db46502792c4b638d4f5b9 and run your command again

After that, you need to run pip3 install numpy --upgrade and everything should work fine

ArEnSc · 2022-12-10T15:53:54Z

@younesbelkada

(.env) michaelchung@michaels-mbp transformers % pip install -e ".[dev]"

Obtaining file:///Users/michaelchung/Code/transformers
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting packaging>=20.0
  Using cached packaging-22.0-py3-none-any.whl (42 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Using cached tokenizers-0.13.2.tar.gz (359 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting requests
  Using cached requests-2.28.1-py3-none-any.whl (62 kB)
Collecting numpy>=1.17
  Using cached numpy-1.23.5-cp38-cp38-macosx_11_0_arm64.whl (13.3 MB)
Collecting tqdm>=4.27
  Using cached tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
Collecting regex!=2019.12.17
  Using cached regex-2022.10.31-cp38-cp38-macosx_11_0_arm64.whl (287 kB)
Collecting filelock
  Using cached filelock-3.8.2-py3-none-any.whl (10 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Using cached huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting pyyaml>=5.1
  Using cached PyYAML-6.0-cp38-cp38-macosx_12_0_arm64.whl
Collecting pytest-xdist
  Using cached pytest_xdist-3.1.0-py3-none-any.whl (36 kB)
Collecting rjieba
  Using cached rjieba-0.1.11-cp36-abi3-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (5.7 MB)
Collecting unidic>=1.0.2
  Using cached unidic-1.1.0.tar.gz (7.7 kB)
  Preparing metadata (setup.py) ... done
Collecting phonemizer
  Using cached phonemizer-3.2.1-py3-none-any.whl (90 kB)
Collecting jaxlib<=0.3.6,>=0.1.65
  Using cached jaxlib-0.3.5-cp38-none-macosx_11_0_arm64.whl (61.3 MB)
Collecting codecarbon==1.2.0
  Using cached codecarbon-1.2.0-py3-none-any.whl (135 kB)
Collecting pyctcdecode>=0.4.0
  Using cached pyctcdecode-0.4.0-py2.py3-none-any.whl (45 kB)
Collecting flake8>=3.8.3
  Using cached flake8-6.0.0-py2.py3-none-any.whl (57 kB)
Collecting sacremoses
  Using cached sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Collecting tensorflow-metal
  Using cached tensorflow_metal-0.7.0-cp38-cp38-macosx_12_0_arm64.whl (1.4 MB)
Collecting GitPython<3.1.19
  Using cached GitPython-3.1.18-py3-none-any.whl (170 kB)
Collecting datasets!=2.5.0
  Using cached datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting scikit-learn
  Using cached scikit_learn-1.2.0-cp38-cp38-macosx_12_0_arm64.whl (8.2 MB)
Collecting sudachidict-core>=20220729
  Using cached SudachiDict-core-20221021.tar.gz (9.0 kB)
  Preparing metadata (setup.py) ... done
Collecting sacrebleu<2.0.0,>=1.4.12
  Using cached sacrebleu-1.5.1-py3-none-any.whl (54 kB)
Collecting Pillow
  Using cached Pillow-9.3.0-cp38-cp38-macosx_11_0_arm64.whl (2.9 MB)
Collecting tf2onnx
  Using cached tf2onnx-1.13.0-py3-none-any.whl (442 kB)
Collecting sentencepiece!=0.1.92,>=0.1.91
  Using cached sentencepiece-0.1.97-cp38-cp38-macosx_11_0_arm64.whl (1.1 MB)
Collecting evaluate>=0.2.0
  Using cached evaluate-0.3.0-py3-none-any.whl (72 kB)
Collecting fugashi>=1.0
  Using cached fugashi-1.2.1.tar.gz (337 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [8 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/private/var/folders/jn/8d33s3c55jv5pctdc6wdnm2h0000gn/T/pip-install-xf18599w/fugashi_18a210c9f68f4c1fb6ece4f85f9f7479/setup.py", line 15, in <module>
          output, data_files = check_libmecab()
        File "/private/var/folders/jn/8d33s3c55jv5pctdc6wdnm2h0000gn/T/pip-install-xf18599w/fugashi_18a210c9f68f4c1fb6ece4f85f9f7479/fugashi_util.py", line 58, in check_libmecab
          raise RuntimeError("Could not configure working env. Have you installed MeCab?")
      RuntimeError: Could not configure working env. Have you installed MeCab?
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
(.env) michaelchung@michaels-mbp transformers %

closer! but still problems

younesbelkada · 2022-12-10T22:20:39Z

I think here you need to install Mecab through brew - can you try to run:

brew install mecab
brew install mecab-ipadic

and re-run pip install -e "[dev]" again?

ArthurZucker · 2022-12-12T08:38:12Z

I had the same issue when installing, you should make sure to install fugashi==1.1.2a6 ( ignore the mecab part).
You can also follow the short guide from #18355

xloem · 2022-12-12T11:15:24Z

Is a full dev environment needed to start with? Personally it would be quite inspiring to see a PR even if it didn't pass tests.

younesbelkada · 2022-12-12T11:19:58Z

@ArEnSc did you managed to open a PR? I think it's ok to leave it as a draft even if the test does not even pass (i.e. eventually no need to install the dev env, at least for the beginning, we can in the worst case take over the PR if any). Let us know what do you think!

ArEnSc · 2022-12-12T15:56:16Z

Yeah hey sorry guys! probably sometime this week or today, my day job is iOS Development, it isn't in MLE. I just a moon light side job in NLP and Speech Synthesis in the media creation domain. Looking to transition eventually, hopefully this PR will proof of my capabilities so I won't abandon it =)

ArEnSc · 2022-12-12T17:10:28Z

#20737 here is the draft, probably generating all the scaffolding soon

xloem · 2023-01-05T11:00:28Z

There is recent active work for interfacing multiple backends to rwkv at https://github.com/harrisonvanderbyl/rwkv_chatbot/blob/main/src/rwkvops.py#L914 (list down at end of file)
EDIT: dev discussion happens in the rwkv discord, where unfortunately I am not active

ArEnSc · 2023-01-05T14:35:15Z

yeah we will be looking into that as soon as I figure out how the architecture works from a high level I might have some questions but Iam tracing the model now

oobabooga · 2023-03-05T13:43:58Z

I have made a very simple and dumb wrapper for RWKV including RWKVModel.from_pretrained and RWKVModel.generate functions that could maybe serve as inspiration: RWKV.py

This depends on the rwkv library: pip install rwkv==0.0.6

I'd like to tag @zphang. He recently implemented LLaMA support in transformers. Maybe adding RWKV would interest him as well.

fblgit · 2023-04-10T18:18:20Z

this is by far some of the best models right now, the performance of 7B is outstanding.
How come the best model is not supported by HF ?

mrseeker · 2023-04-10T19:40:52Z

Because nobody tried implementing it?

fblgit · 2023-04-11T06:05:21Z

We want to have a positive impact on the AI field. We think the direction of more responsible AI is through openly sharing models, datasets, training procedures, evaluation metrics and working together to solve issues. We believe open source and open science bring trust, robustness, reproducibility, and continuous innovation. With this in mind, we are leading [BigScience](https://bigscience.huggingface.co/), a collaborative workshop around the study and creation of very large language models gathering more than 1,000 researchers of all backgrounds and disciplines.

Thats HF mission, so I was wondering how come HF has missed the best model in the industry. Making me think about bias behind what this "Open" platform says vs what they do.

And because of that, i was wondering how come HF teams are not giving a hand to port this in.
I saw LlaMA integration going in at flash speed with HF coverage.. and why this hasnt ??

flozi00 · 2023-04-11T07:46:18Z

There is already an open PR by @ArEnSc

mrseeker · 2023-04-11T08:00:23Z

Two things:
If there are open PR, mention their number so we can keep track of what is stale, duplicate etc.

Llama was so fast because people actively wanted to use it. Meta releases something, HF jumps in line and puts a PR together to support it. Since RWKV is not that big, no support. I am waiting eagerly for support...

younesbelkada · 2023-04-11T08:18:46Z

Hi there,
I am also super excited about this model, I think that PR will go on stale as there has been no activity since a while. If someone wants to take the lead on it, I would be happy to assist with @ArthurZucker !

fblgit · 2023-04-11T09:07:25Z

Well, I won't go into politics wether big or not big company should get community support or not.. having in mind their resources and manpower.

Projects like this, which are highly relevant, gets unsupported. Its the trending of github.. what else are we looking for ?

#17230
#20809
#21875
#20737

leondz added the New model label May 13, 2022

ArEnSc mentioned this issue Dec 12, 2022

RWKV4neo #20737

Closed

4 tasks

oobabooga mentioned this issue Mar 5, 2023

Making this library more like Hugging Face BlinkDL/ChatRWKV#25

Closed

younesbelkada mentioned this issue May 3, 2023

Add RWKV-4 #22797

Merged

7 tasks

sgugger closed this as completed in #22797 May 9, 2023

Add RWKV2 (fast) #17230

Add RWKV2 (fast) #17230

Comments

leondz commented May 13, 2022

Model description

Short description

Open source status

Provide useful links for the implementation

Implementation and weights

Status

leondz commented May 16, 2022

mrseeker commented May 16, 2022

leondz commented May 20, 2022 • edited Loading

xloem commented Jun 26, 2022 • edited Loading

jbmaxwell commented Sep 9, 2022

ArEnSc commented Oct 4, 2022

xloem commented Oct 5, 2022

mrconter1 commented Nov 21, 2022

xloem commented Nov 21, 2022

BlinkDL commented Nov 22, 2022 • edited Loading

ArEnSc commented Nov 22, 2022

leondz commented Nov 22, 2022

henk717 commented Nov 30, 2022

mrconter1 commented Nov 30, 2022 via email

BlinkDL commented Nov 30, 2022 • edited Loading

ArEnSc commented Dec 1, 2022 • edited Loading

mrconter1 commented Dec 1, 2022 via email

jbmaxwell commented Dec 1, 2022 • edited Loading

mrconter1 commented Dec 1, 2022 via email

xloem commented Dec 1, 2022

jbmaxwell commented Dec 1, 2022

ArEnSc commented Dec 1, 2022

xloem commented Dec 2, 2022

ArEnSc commented Dec 2, 2022

xloem commented Dec 2, 2022

henk717 commented Dec 3, 2022

BlinkDL commented Dec 3, 2022 • edited Loading

mrconter1 commented Dec 3, 2022 via email

xloem commented Dec 3, 2022

younesbelkada commented Dec 5, 2022 • edited Loading

leondz commented Dec 5, 2022

BlinkDL commented Dec 5, 2022 • edited Loading

ArEnSc commented Dec 5, 2022

xloem commented Dec 5, 2022

leondz commented Dec 5, 2022

harrisonvanderbyl commented Dec 5, 2022

ArEnSc commented Dec 7, 2022

younesbelkada commented Dec 8, 2022 • edited Loading

ArEnSc commented Dec 10, 2022

younesbelkada commented Dec 10, 2022

ArthurZucker commented Dec 12, 2022 • edited Loading

xloem commented Dec 12, 2022

younesbelkada commented Dec 12, 2022

ArEnSc commented Dec 12, 2022

ArEnSc commented Dec 12, 2022 • edited Loading

xloem commented Jan 5, 2023 • edited Loading

ArEnSc commented Jan 5, 2023

oobabooga commented Mar 5, 2023

fblgit commented Apr 10, 2023

mrseeker commented Apr 10, 2023

fblgit commented Apr 11, 2023

flozi00 commented Apr 11, 2023

mrseeker commented Apr 11, 2023

younesbelkada commented Apr 11, 2023

fblgit commented Apr 11, 2023 • edited Loading

leondz commented May 20, 2022 •

edited

Loading

xloem commented Jun 26, 2022 •

edited

Loading

BlinkDL commented Nov 22, 2022 •

edited

Loading

BlinkDL commented Nov 30, 2022 •

edited

Loading

ArEnSc commented Dec 1, 2022 •

edited

Loading

jbmaxwell commented Dec 1, 2022 •

edited

Loading

BlinkDL commented Dec 3, 2022 •

edited

Loading

younesbelkada commented Dec 5, 2022 •

edited

Loading

BlinkDL commented Dec 5, 2022 •

edited

Loading

younesbelkada commented Dec 8, 2022 •

edited

Loading

ArthurZucker commented Dec 12, 2022 •

edited

Loading

ArEnSc commented Dec 12, 2022 •

edited

Loading

xloem commented Jan 5, 2023 •

edited

Loading

fblgit commented Apr 11, 2023 •

edited

Loading