[Bug Fixed] Difficulties to reproduce the KnowEdit results in the survey paper #390
If you need more details on the experiments, such as the results.json or log file, please tell me and I'll upload them. I'm looking forward to your reply! Thanks!
Thank you very much for your attention to EasyEdit. We will address your issue shortly.
Apologies for the issue regarding the results. EasyEdit has been under maintenance to improve its editing performance, and we have optimized the EasyEdit code, which has led to better results (possibly an improvement for AdaLoRA). Computational resources have been limited recently, so we will update the paper as soon as possible (aiming for 1-2 weeks; we will notify you). As for the other issues, we will address them as soon as possible.
You can check the code here:
Thank you very much for pointing out this issue. There is indeed a small problem here because the method of calculating the average can lead to slight discrepancies in the results. For the Fluency metric, we multiplied the results by 100 for better presentation.
Clarification of Misunderstanding
There has been some misunderstanding regarding our evaluation metrics. Here's a breakdown:
Why Rewrite Accuracy Can Be Greater Than 0 Before Editing
For example, suppose we want to change the name of the President of the United States from "Joe Biden" to "Trump Biden." When calculating the accuracy, the target_new is "Trump Biden," but the original response was "Joe Biden." Since both names share the surname "Biden," the token-level accuracy for that part will match. Therefore, even before the edit, the accuracy may not be 0.
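To make the token-level matching concrete, here is a minimal sketch (an illustration of the idea, not EasyEdit's exact implementation) of how such an accuracy could be computed:

```python
# Sketch of token-level rewrite accuracy: compare prediction and target
# position by position over the target span.
def token_accuracy(pred_tokens, target_tokens):
    matches = sum(p == t for p, t in zip(pred_tokens, target_tokens))
    return matches / len(target_tokens)

# Suppose the tokenizer splits the names as ["Joe", "Biden"] / ["Trump", "Biden"]:
# the shared "Biden" token matches, so pre-edit accuracy is 0.5, not 0.
print(token_accuracy(["Joe", "Biden"], ["Trump", "Biden"]))  # 0.5
```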
Thank you very much for your interest in EasyEdit. If it's convenient, could you add me on WeChat? That way, we can communicate promptly if any issues arise.
Hello, as for the error you met in ROME, I will try to reproduce it and let you know. Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email (yyztodd@zju.edu.cn) or WeChat (yaoxiaoyun12).
I got it, by the default value of
Thanks for your clarification! I previously mistakenly thought that
Thanks for your timely reply! So, is my understanding of the method of averaging metrics and the bug reporting in the
Thanks for your continuous efforts in updating, and I look forward to your new progress. I will also try to read through the relevant code and get the latest, correct results based on the current EasyEdit. I just took a quick look at the updated
Thanks for your enthusiasm. If further discussion is needed, I will contact you via WeChat. |
I just found a little bug in the
You're right, I have updated the code here. |
Hello, the main update in the code is the loss. We previously followed ROME's code to compute the loss, which uses the last-token representation to calculate the loss on the target sequence (the FT-L setting). We have since updated it to use the conditional output to compute the loss (FT-M). We changed the loss computation for both the FT and AdaLoRA methods.
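For readers mapping this onto code, here is a minimal sketch of an FT-M-style objective (an illustration under the usual Hugging Face conventions, not EasyEdit's exact code): cross-entropy is computed on the target answer only, with prompt tokens masked out of the labels.

```python
import torch

# Sketch: -100 labels are ignored by Hugging Face loss functions, so only
# the target span contributes to the cross-entropy.
def ftm_loss(model, tokenizer, prompt: str, target: str) -> torch.Tensor:
    full = tokenizer(prompt + " " + target, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    labels = full.input_ids.clone()
    labels[:, :prompt_len] = -100  # mask the prompt; train only on the target
    # Caveat: with sentencepiece tokenizers the prompt/target split point
    # needs care -- see the tokenizer discussion later in this thread.
    out = model(input_ids=full.input_ids,
                attention_mask=full.attention_mask,
                labels=labels)
    return out.loss
```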
Hello! I'm pleased to inform you that I have retested some of these settings (AdaLoRA, ROME and FT-L on ZSRE) and achieved metrics quite similar to your updated results. Again, looking forward to your next update and greatly appreciating your ongoing efforts. |
Thanks for your explanation; I'll check the relevant code and papers for more details. I noticed that although the Edit Succ. of AdaLoRA increases (from 69.86 to 100.0 on ZSRE) after this update, its Locality also decreases a lot (from 72.21 to 35.16 on ZSRE). Is this phenomenon explainable or predictable?
From my point of view, the original loss failed to learn the updated knowledge and largely left the model unchanged. But it's truly an interesting phenomenon, as FT-L and FT-M do not show this trend. Maybe we need more tests here.
Hello, |
Thanks for your timely results update. I have also conducted experiments based on the latest code, and here are my results:
As you can see, the results on ZSRE are quite similar to yours; however, there are notable differences on Wikidata_counterfact, particularly in the Locality metric.
Certainly, this is an interesting phenomenon worthy of study. |
While reviewing the code, I came across a confusing section in the |
Quite weird; I will double-check my results to see if I made a mistake when I typed the results into the table.
The collate function is not used; I will delete it. I construct the input here
Is it OK for you to share the results.json? Maybe I can check where the discrepancy comes from.
Yes, here are the AdaLoRA and ROME results.json files on Wikidata_counterfact
In fact, I am just a beginner in the field of knowledge editing. I happened to be reading the ROME paper today, and I have just reached the relevant section of ROME ("Here W is the original matrix, C = KK^T is a constant that we pre-cache by estimating the uncentered covariance of k from a sample of Wikipedia text"). :)
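For reference, the closed-form rank-one update that section of the ROME paper builds up to can be written as:

```latex
% ROME's closed-form update, in the paper's notation: W is the original
% weight matrix, C = K K^T the pre-cached covariance, and (k_*, v_*) the
% key/value pair encoding the new fact.
\hat{W} = W + \Lambda \left(C^{-1} k_*\right)^{\top},
\qquad
\Lambda = \frac{v_* - W k_*}{\left(C^{-1} k_*\right)^{\top} k_*}
```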
Well, I can reproduce your results using the provided environments! |
Can we determine which package has the greatest impact? I guess the transformers and peft packages may be the reason. A control-variate method combined with "binary search" may help us find the package(s), as in the sketch below.
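A rough sketch of that bisection idea (reproduce.py is a hypothetical placeholder script that reruns one editing config and prints a single metric to stdout; the version list and tolerance are illustrative):

```python
import subprocess

versions = ["0.7.0", "0.8.2", "0.9.0", "0.10.0", "0.11.1", "0.12.0"]

def metric_for(version: str) -> float:
    # Install one candidate version, rerun the experiment, read the metric.
    subprocess.run(["pip", "install", "--quiet", f"peft=={version}"], check=True)
    out = subprocess.run(["python", "reproduce.py"],
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

new_behavior = metric_for(versions[-1])
lo, hi = 0, len(versions) - 1
while hi - lo > 1:
    mid = (lo + hi) // 2
    # If mid already matches the new numbers, the change happened at or
    # before mid; otherwise it happened after mid.
    if abs(metric_for(versions[mid]) - new_behavior) < 1.0:
        hi = mid
    else:
        lo = mid
print(f"behavior changes between peft {versions[lo]} and {versions[hi]}")
```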
Yeah, the peft version would influence the performance of LoRA. I changed from 0.7.0 to 0.12.0 and can get your results.
I can also reproduce the AdaLoRA result on Wikidata_counterfact with a new environment following your requirement.txt, but the result of ROME is still different; I'll check the latest reason you just mentioned. Anyhow, wish you a good sleep.
To help you better find potential bugs, these .py code files should be helpful (I may have added some comments to help myself read and check the code, but as far as I remember, I did not change the essence).
I neglected to rerun the experiment last night. Today, I plan to assess whether setting use_fast=True yields a different outcome. Upon obtaining the new results, I will promptly reach out to you.
UPDATE: Due to limited computation resources, I still need more time to check ROME.
I can get your ROME results when I make sure the tokenizer can correctly build the prompt in
That is inspiring news, and may I have a short summary of the discussion here: we can finally get the same results on FT-L, AdaLoRA, and ROME as I reported?
Yes |
Hello, I think the BUG in ROME has nothing to do with the fast tokenizer, and here is my experiment. Firstly, I changed the relevant code (the before/after snippets were not preserved here) and reran it. As you can see from the output (also not preserved):
So I think the problem should be at the string (input) level, not the model/tokenizer level. Besides, there are also some discussions about the Llama tokenizer's token id '29871'.
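For anyone who wants to check this locally, here is a minimal sketch of the string-level pitfall (the checkpoint path is a placeholder; behavior may vary with your transformers version and tokenizer_config.json):

```python
from transformers import AutoTokenizer

# Placeholder path: point this at your local Llama2-7b-chat checkpoint.
path = "./Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(path, use_fast=False)

prompt = "The capital of France is"
target = " Paris"

# Tokenizing prompt and target separately vs. tokenizing the joined string
# can disagree: sentencepiece may emit a bare space token (id 29871 for
# Llama-2) at the start of the separately-tokenized target.
joined = tok(prompt + target, add_special_tokens=False).input_ids
separate = (tok(prompt, add_special_tokens=False).input_ids
            + tok(target, add_special_tokens=False).input_ids)
print(joined)
print(separate)  # a stray 29871 here means the prompt was built wrong
```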
It's not the reason; I will check the tokenizer config to see what happened. Maybe something is wrong with the tokenizer setting.
It turns out that the reason is from
Good, it seems that we have resolved all the issues.
Well, we didn't conduct experiments on these two methods under KnowEdit. But in my view, this would not affect their performance, as they do not use the peft module and the tokenizer does not involve decoding in their methods.
Well, I plan to read through the code as well as the papers of these methods, and then I'll try to integrate MEND and SERAC into the current framework.
Great, we will update the new results on arXiv this week.
Dear StarLooo,
Thank you very much for raising this issue. After a long and thorough debugging process, the problem has finally been identified, and we will work on an update as soon as possible. EasyEdit will continue to optimize, improve, and add new features. We look forward to more collaboration and communication with you.
EasyEdit Team
Update some results using your provided environment. You can check again; I think a little difference is acceptable. (Result tables followed for WikiData_recent, ZsRE, WikiData_counterfact, and WikiBio.)
Within the allowable randomness range, these AdaLoRA/ROME/FT-L/FT-M results are consistent with what I previously reproduced on Wikidata_recent, Wikidata_counterfact, and ZSRE. I'll also try other models and datasets.
I'll provisionally close this issue; you can reopen it if you meet other problems.
Dear StarLooo,
We have fixed the bug and will update the paper on arXiv tomorrow (the README has been updated). We have written a pinned issue statement explaining the cause of this issue and included an announcement in the News. Thank you very much for your help! The following is the statement.
Dear all:
Recently, with help from the community (special thanks to @StarLooo), we will update the KnowEdit results (Llama2-7b-chat) in Table 4 of the paper 'A Comprehensive Study of Knowledge Editing for Large Language Models'. Overall, the results have improved, primarily for the following reasons:
1. AdaLoRA optimization: we follow FT-M instead of FT-L. FT-M trains the same FFN layer as FT-L, but uses cross-entropy loss on the target answer while masking the original text. This approach not only yields better results but also highlights the optimal performance of AdaLoRA. Meanwhile, the peft version also affects the performance.
2. ROME and MEMIT updates: the results are updated after identifying missing components in the local version of the Llama2-7b-chat files (specifically, the legacy feature in tokenizer_config.json). If you are using the official Llama2-7b-chat model downloaded directly from HF, this issue should not affect your results. We also fixed a bug related to the padding_side for these two methods, which influences the results when you compute them for batch inputs.
We deeply apologize for any inconvenience caused by this bug. We will continue improving EasyEdit and updating this paper, and we welcome everyone to engage in discussions and share ideas.
EasyEdit Team
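As a generic illustration of the padding pitfall mentioned above (a sketch, not the exact EasyEdit fix; the path is a placeholder): decoder-only models should be left-padded for batched generation, otherwise pad tokens end up between the prompt and the continuation.

```python
from transformers import AutoTokenizer

# Placeholder path: point this at your local Llama2-7b-chat checkpoint.
tok = AutoTokenizer.from_pretrained("./Llama-2-7b-chat-hf")
tok.pad_token = tok.eos_token  # Llama-2 has no pad token by default

# Left padding keeps every prompt flush against its generated continuation,
# which is what batched evaluation of decoder-only models needs.
tok.padding_side = "left"
batch = tok(["Short prompt", "A much longer prompt than the first"],
            padding=True, return_tensors="pt")
# batch.input_ids / batch.attention_mask can now be passed to model.generate(...)
```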
Hello! This EasyEdit framework and your survey paper "A Comprehensive Study of Knowledge Editing for Large Language Models" are really valuable works. However, I still had some difficulties reproducing the results on the KnowEdit benchmark, and below are some of my questions:
I noticed that in the EasyEdit code, the generation-related code does not explicitly specify the generation_config settings (including do_sample, temperature, top_p, etc.). This may result in the default generation method not using greedy decoding, potentially affecting the reproducibility of the results.
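For a fully deterministic comparison, one way to force greedy decoding (a sketch using the Hugging Face generate API, not EasyEdit's own config; the checkpoint path is a placeholder) is:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./Llama-2-7b-chat-hf"  # placeholder: the model under evaluation
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16)

inputs = tok("The capital of France is", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=False,   # disable sampling, so temperature/top_p are ignored
    num_beams=1,       # plain greedy decoding, no beam search
    max_new_tokens=20,
)
print(tok.decode(out[0], skip_special_tokens=True))
```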