
FAQ

Question 1: What's the difference between this project and the first-gen project?

Answer: The first-generation project, Chinese-LLaMA-Alpaca, was developed on the "not fully open-source" first-generation LLaMA; its models cannot be used commercially and carry many restrictions on distribution. This project is an upgrade based on the techniques of the first-gen project, developed on Llama-2 (which permits commercial use and distribution), and its models show a significant performance advantage over the first generation.

Question 2: Can the model be used commercially?

Answer: Yes, but please carefully read the commercial licensing requirements of the original Llama-2 in advance. Developers are responsible for ensuring compliance when using the related models and should seek legal support when necessary. This project is not responsible for any results or losses arising from the use of the related models.

Question 3: Do you accept third-party Pull Requests?

Answer: We welcome Pull Requests. This project mainly accepts PRs for new tool adaptation, script enhancements, bug fixes, usage tutorials, and the like. For now, we do not accept PRs that do not affect normal use of the model, such as those that only correct one or two typos (though we still appreciate your comments).

Question 4: Why not perform full pre-training but use LoRA instead?

Answer: Considering factors such as training cost and efficiency, we chose to train Llama-2 with LoRA (embedding/lm_head are fully trained; see the training wiki). We believe that Llama-2 already has some understanding of Chinese, and incremental training with LoRA can quickly supplement its Chinese understanding and generation capabilities. As for whether full pre-training on Llama-2 would outperform LoRA, there is currently no conclusive reference. The use of LoRA in this project is therefore the result of weighing multiple factors, not just model performance.
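The following is a minimal sketch (not the project's actual training code) of how such a setup can be expressed with the peft library: LoRA adapters on the attention/MLP projections while embed_tokens and lm_head stay fully trainable via modules_to_save. The base model ID and all hyperparameters are illustrative only.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # base model; assumed ID for illustration
    torch_dtype=torch.float16,
)
# If the tokenizer has been extended with Chinese tokens, resize the embeddings
# before attaching adapters, e.g. model.resize_token_embeddings(new_vocab_size).

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "down_proj", "up_proj"],
    # These layers are trained in full rather than through LoRA:
    modules_to_save=["embed_tokens", "lm_head"],
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows LoRA + embedding/lm_head parameters
```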

Question 5: Do tools that support the first-gen LLaMA also support the Llama-2 series?

Answer: The tools we adapted in the first-gen project will gradually be migrated to this project, but this will take some time. In the meantime, we strongly recommend keeping an eye on how the corresponding third-party tools are adapting to Llama-2. The main differences between the first- and second-generation models are: 1) the instruction template of our Alpaca-2 differs from the first generation's; 2) the 34B/70B models require GQA support (this project does not currently cover these two sizes); there are also some other minor differences. In short, users comfortable with hands-on work can adapt the tools themselves, or refer to how third-party tools have been adapted to Llama-2.

Question 6: Is Chinese-Alpaca-2 trained from Llama-2-Chat?

Answer: No. All of our models are based on Meta's Llama-2 (the non-chat version). Specifically, Chinese-LLaMA-2 is trained from Llama-2 on large-scale Chinese text; Chinese-Alpaca-2 is further trained from Chinese-LLaMA-2 on selected instruction data to make it capable of chatting and following instructions. All models are trained with LoRA, a parameter-efficient training method.

Question 7: Why does training with 24GB VRAM lead to an OOM error when fine-tuning chinese-alpaca-2-7b?

Answer: You can troubleshoot the issue from the perspectives of dependency versions, data, and trainable parameters:

  • The required peft version is 0.3.0.dev0, which can be installed using pip install git+https://github.com/huggingface/peft.git@13e53fc.
  • For the sft script, if max_seq_length is set to 1024 (or for the pt script, if block_size is set to 1024), delete the existing data cache and reduce the length to 512 before continuing training.
  • Set per_device_train_batch_size to 1; larger values will result in an OOM error (see the sketch after this list).
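As a quick reference for the last two points, here is a hedged sketch of the corresponding memory-related settings expressed as HuggingFace TrainingArguments; the project's sft/pt scripts expose these as command-line flags, and gradient checkpointing is an additional standard memory saver not mentioned above. All values are illustrative.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,   # larger values tend to OOM on 24GB cards
    gradient_accumulation_steps=8,   # keeps the effective batch size reasonable
    fp16=True,
    gradient_checkpointing=True,     # trades compute for a large memory saving
    logging_steps=10,
)

# For the sequence length, pass --max_seq_length 512 (sft) or --block_size 512 (pt)
# to the training script and delete the previously built data cache so the
# dataset is re-tokenized at the shorter length.
```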

Question 8: Can the 16K long-context version model replace the standard version model?

Answer: This depends on the usage scenario. If you're mainly dealing with contexts within 4K, it's recommended to use the standard version model. If you're primarily dealing with 8K-16K contexts, then the 16K long-context version model is recommended. Both of the above models can be further extended in context length using the NTK method (without additional training). It should be noted that using the 16K model requires inference scripts or third-party tools that support the customized RoPE feature, rather than using it directly like the standard model. It is advisable to carefully read the project's Wiki to ensure the 16K model is loaded and used correctly.
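For reference, the sketch below shows one way to apply NTK-style (dynamic) RoPE scaling when loading a model, assuming a transformers release (>= 4.31) that supports the rope_scaling argument for LLaMA models; the model ID and scaling factor are assumptions, and the project's own inference scripts remain the recommended way to run the 16K models.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hfl/chinese-alpaca-2-7b"  # assumed Hugging Face ID; check the README
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    # Dynamic NTK scaling stretches RoPE at inference time, extending the usable
    # context beyond the trained length without further tuning.
    rope_scaling={"type": "dynamic", "factor": 2.0},
)
```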

Question 9: How should the results of third-party benchmarks be interpreted?

Answer: We have noticed that there are many benchmarks for LLM evaluation, each providing a different view of model quality. As a kind reminder, to date we have only officially submitted to the C-Eval benchmark (results without asterisks). We did not submit our models to other benchmarks, so we cannot guarantee the correctness of those results. If you need reproducible results (especially for academic purposes), you are advised to refer to the results in our project's README (and technical report). We provide reproducible inference scripts for those numbers to ensure that our models are evaluated correctly and perform as expected.

Question 10: Will you release 34B or 70B models?

Answer: Meta has not released a 34B model yet, so we will decide once it is available. In the meantime, please consider using the Chinese-LLaMA-Plus-33B and Chinese-Alpaca-Pro-33B models from our first-generation project. Considering training cost and effectiveness, we are not planning to work on a 70B model at the moment.

Question 11: Why is the long-context model 16K rather than 32K or 100K?

Answer: Long-context models are indeed useful for daily usage. Considering OpenAI's gpt-3.5-turbo-16k and gpt-4-32k APIs, we think 16K-32K is an appropriate range for long-context usage, and we released the 16K model series in this spirit. Note that if you need longer context support, you can always use our script to extend the context size via the NTK method (to 24K-32K) without further tuning.

Question 12: Why does the Alpaca model reply that it is ChatGPT?

Answer: We did not add identity information during the training of the Alpaca models, so the output largely depends on the content of the training data. Since we used large-scale instruction data obtained from ChatGPT, the model tends to behave as if it were ChatGPT. You can construct identity data yourself and further fine-tune our models to customize the responses you expect.
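Below is a hypothetical example of what such identity data might look like, assuming the Stanford-Alpaca-style instruction/input/output format; check the project's training wiki for the exact schema the SFT script expects, and treat the file name and contents as placeholders.

```python
import json

# Hand-written identity examples (hypothetical content and file name).
identity_examples = [
    {
        "instruction": "Who are you?",
        "input": "",
        "output": "I am a Chinese assistant fine-tuned from Chinese-Alpaca-2 by the XX team.",
    },
    {
        "instruction": "Are you ChatGPT?",
        "input": "",
        "output": "No. I am based on Chinese-Alpaca-2, not ChatGPT.",
    },
]

with open("identity_data.json", "w", encoding="utf-8") as f:
    json.dump(identity_examples, f, ensure_ascii=False, indent=2)
```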

Question 13: Why is the adapter_model.bin in the pt_lora_model or sft_lora_model folder only a few hundred KB?

Answer: This is caused by training with DeepSpeed ZeRO-3. Suppose the directory of the intermediate checkpoint at step 50 looks like this (the numbers after the file names are file sizes):

|-- checkpoint-50
    |-- global_step50
    |-- sft_lora_model
    |   |-- adapter_config.json 471
    |   |-- adapter_model.bin 127K 
    |   |-- special_tokens_map.json 435 
    |   |-- tokenizer.model 825K
    |   |-- tokenizer_config.json 766 
    |-- adapter_config.json 471
    |-- adapter_model.bin 1.2G
    |-- latest
    |-- rng_state_0.pth
    ···

You can use the adapter_model.bin located directly in the checkpoint-50 folder (the 1.2G file) for merging, as sketched below.
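Treat the following only as a generic peft-based sketch of the merging step; the base model ID and paths are assumptions, and you should prefer the merging procedure documented in the project's wiki.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model that training started from (assumed ID for illustration).
base = AutoModelForCausalLM.from_pretrained(
    "hfl/chinese-llama-2-7b", torch_dtype=torch.float16
)

# Point at the checkpoint directory holding the ~1.2G adapter_model.bin,
# not the few-hundred-KB copy under sft_lora_model.
model = PeftModel.from_pretrained(base, "checkpoint-50")
merged = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained("checkpoint-50/sft_lora_model")
merged.save_pretrained("merged_model")
tokenizer.save_pretrained("merged_model")
```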
