
[Bug Fixed] Difficulties to reproduce the KnowEdit results in the survey paper #390

Closed
StarLooo opened this issue Oct 17, 2024 · 89 comments
Labels: bug (Something isn't working)

@StarLooo:

Hello! This EasyEdit framework and your survey paper "A Comprehensive Study of Knowledge Editing for Large Language Models" are really valuable works. However, I still had some difficulties reproducing the results on the KnowEdit benchmark, and below are some of my questions:

  1. concern about the generation config
    I noticed that the generation-related code in EasyEdit does not explicitly specify the generation_config settings (including do_sample, temperature, top_p, etc.). This may result in the default generation method not using greedy decoding, potentially affecting the reproducibility of the results.
@StarLooo (Author):

  1. possible bug in the summary_metrics function
    According to the discussions in issue #279 (the computing methods of Edit Succ., Portability, Locality, Fluency), the summary method for all_metrics can be elaborated as below (based on my understanding, taking the post-edit metrics on Wikidata_recent as an example):
    for Edit Succ.: compute the average rewrite_acc over all edit samples.
    for Portability: first compute the average Subject_Aliasing_acc, reasoning_acc and Logical_Generalization_acc of each edit sample; then average the three sub-scores to get the Portability of each edit sample; finally compute the average Portability over all edit samples.
    for Locality: first compute the average Relation_Specificity_acc and Forgetfulness_acc of each edit sample; then average the two sub-scores to get the Locality of each edit sample; finally compute the average Locality over all edit samples.
    for Fluency: compute the average ngram_entropy over all edit samples.
    If the above summarization method is correct, the implementation of the summary_metrics function has a minor bug that causes ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. This is due to this line of code: mean_metrics[eval][key][lkey] = np.mean(metrics). When summarizing Portability, this line tries to compute np.mean() over a list of sub-lists of variable length (since the number of Portability evaluation samples differs across edit samples), which is not allowed. I think the correct modification is to change the original code metrics = [metric[eval][key][lkey] for metric in all_metrics if lkey in metric[eval][key].keys()] into metrics = [np.mean(metric[eval][key][lkey]) for metric in all_metrics if lkey in metric[eval][key].keys()]. The modified code first computes the average sub-score of each edit and then uses mean_metrics[eval][key][lkey] = np.mean(metrics) to summarize (a sketch follows below).
    Besides, I find that the summary_metrics function does not summarize the Fluency metric, and the magnitude of ngram_entropy is much lower than the reported Fluency metric (5.x~6.x vs ~500).
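To make the proposed fix concrete, here is a minimal, self-contained sketch of the two-level averaging described above (the dictionary layout and key names are assumed from this description, not copied from the EasyEdit source):

    import numpy as np

    # Sketch of the averaging scheme: average the variable-length sub-score list
    # of each edit first, then average those per-edit means across all edits.
    def summarize(all_metrics, eval="post", key="portability", lkey="reasoning_acc"):
        per_edit_means = [
            np.mean(metric[eval][key][lkey])   # average within one edit first
            for metric in all_metrics
            if lkey in metric[eval][key]
        ]
        # then average across edits; guard against "no edit has this sub-metric"
        return float(np.mean(per_edit_means)) if per_edit_means else float("nan")

    # toy usage with sub-score lists of different lengths
    all_metrics = [
        {"post": {"portability": {"reasoning_acc": [1.0, 0.0]}}},
        {"post": {"portability": {"reasoning_acc": [1.0]}}},
    ]
    print(summarize(all_metrics))  # (0.5 + 1.0) / 2 = 0.75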

@StarLooo (Author):

  1. the meaning of pre-edit metrics
    I'm uncertain about the meaning of the pre-edit metrics. My speculation: we can compare the Fluency before and after the edit to evaluate whether the edit affects the generation ability of the model; we can also compare the Portability before and after the edit to evaluate whether the edit makes the knowledge relatively more or less portable; and the pre-edit Locality is meaningless. However, I don't understand the meaning of the pre-edit rewrite_acc, how it is computed, and why it can be greater than 0 before we conduct any editing.

@StarLooo (Author):

  1. cannot reproduce the KnowEdit results when using AdaLoRA or ROME
    I tried to reproduce the KnowEdit results in your survey paper following your tutorial, with the same hyper-parameters and command. I use Llama2-7b-chat as the base model and AdaLoRA/ROME as the edit method. For AdaLoRA, it's very strange that I always get post_rewrite_acc=1.0 on Wikibio/Wikidata_recent/Wikidata_counterfact, and the values of the other metrics also differ from the paper. For ROME, I encountered the same error as in RuntimeError: probability tensor contains either inf, nan or element < 0 #290 when editing Wikidata_recent, and did not hit this error when running Wikidata_counterfact. However, the ROME results on Wikidata_counterfact also differ from the survey paper, with post_rewrite_acc=0.9857. I'm not sure whether this is due to the fp16 setting mentioned in #290, and I'll try again using fp16=false.

If you need more details on the experiments, such as the results.json or log file, please tell me and I'll upload them. I'm looking forward to your reply! Thanks!

@XeeKee (Collaborator) commented Oct 17, 2024:

Thank you very much for your attention to EasyEdit. We will address your issue shortly.

@zxlzr (Contributor) commented Oct 17, 2024:

  1. cannot reproduce the KnowEdit results when using AdaLoRA or ROME
    I tried to reproduce the KnowEdit results in your survey paper following your tutorial, with the same hyper-parameters and command. I use Llama2-7b-chat as the base model and AdaLoRA/ROME as the edit method. For AdaLoRA, it's very strange that I always get post_rewrite_acc=1.0 on Wikibio/Wikidata_recent/Wikidata_counterfact, and the values of the other metrics also differ from the paper. For ROME, I encountered the same error as in RuntimeError: probability tensor contains either inf, nan or element < 0 #290 when editing Wikidata_recent, and did not hit this error when running Wikidata_counterfact. However, the ROME results on Wikidata_counterfact also differ from the survey paper, with post_rewrite_acc=0.9857. I'm not sure whether this is due to the fp16 setting mentioned in #290, and I'll try again using fp16=false.

If you need more details on the experiments, such as the results.json or log file, please tell me and I'll upload them. I'm looking forward to your reply! Thanks!

Apologies for the issue regarding the results. EasyEdit has been under maintenance to improve its editing performance, and we have optimized the EasyEdit code, which has led to better results (possibly an improvement for AdaLoRA). Due to limited computational resources recently, we will update the paper as soon as possible (aiming for 1-2 weeks, and we will notify you). As for the other issues, we will address them as soon as possible.

@zxlzr added the "question (Further information is requested)" label on Oct 17, 2024
@XeeKee (Collaborator) commented Oct 17, 2024:

Hello! This EasyEdit framework and your survey paper "A Comprehensive Study of Knowledge Editing for Large Language Models" are really valuable works. However, I still had some difficulties reproducing the results on the KnowEdit benchmark, and below are some of my questions:

  1. concern about the generation config
    I noticed that the generation-related code in EasyEdit does not explicitly specify the generation_config settings (including do_sample, temperature, top_p, etc.). This may result in the default generation method not using greedy decoding, potentially affecting the reproducibility of the results.

You can check the code here:

def test_prediction_acc(model, tok, hparams, prompts, targets, device, locality=False, vanilla_generation=False):
We did not pass in any additional parameters. According to https://huggingface.co/docs/transformers/generation_strategies, Hugging Face uses greedy decoding by default. Additionally, I noticed that on our local server, the model's generation_config does not include parameters for top_p and temperature. Therefore, I believe our code is reproducible. Of course, your issue has also made us aware that this part of the code is not perfect, and we will set the generation parameters to greedy decoding in the near future.
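For readers who want to rule this out locally, a minimal sketch of pinning greedy decoding explicitly at generation time (the checkpoint name is only an example; this is not the EasyEdit evaluation code itself):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Sketch: pass do_sample=False explicitly so a local generation_config.json
    # with sampling defaults cannot silently change the decoding strategy.
    model_name = "meta-llama/Llama-2-7b-chat-hf"  # example checkpoint
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tok = AutoTokenizer.from_pretrained(model_name)

    inputs = tok("The capital of France is", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=False,     # greedy decoding
        num_beams=1,         # no beam search
        max_new_tokens=10,
    )
    print(tok.decode(outputs[0], skip_special_tokens=True))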

@XeeKee (Collaborator) commented Oct 17, 2024:

  1. possible bug in the summary_metrics function
    According to the discussions in issue #279 (the computing methods of Edit Succ., Portability, Locality, Fluency), the summary method for all_metrics can be elaborated as below (based on my understanding, taking the post-edit metrics on Wikidata_recent as an example):
    for Edit Succ.: compute the average rewrite_acc over all edit samples.
    for Portability: first compute the average Subject_Aliasing_acc, reasoning_acc and Logical_Generalization_acc of each edit sample; then average the three sub-scores to get the Portability of each edit sample; finally compute the average Portability over all edit samples.
    for Locality: first compute the average Relation_Specificity_acc and Forgetfulness_acc of each edit sample; then average the two sub-scores to get the Locality of each edit sample; finally compute the average Locality over all edit samples.
    for Fluency: compute the average ngram_entropy over all edit samples.
    If the above summarization method is correct, the implementation of the summary_metrics function has a minor bug that causes ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. This is due to this line of code: mean_metrics[eval][key][lkey] = np.mean(metrics). When summarizing Portability, this line tries to compute np.mean() over a list of sub-lists of variable length (since the number of Portability evaluation samples differs across edit samples), which is not allowed. I think the correct modification is to change the original code metrics = [metric[eval][key][lkey] for metric in all_metrics if lkey in metric[eval][key].keys()] into metrics = [np.mean(metric[eval][key][lkey]) for metric in all_metrics if lkey in metric[eval][key].keys()]. The modified code first computes the average sub-score of each edit and then uses mean_metrics[eval][key][lkey] = np.mean(metrics) to summarize.
    Besides, I find that the summary_metrics function does not summarize the Fluency metric, and the magnitude of ngram_entropy is much lower than the reported Fluency metric (5.x~6.x vs ~500).

Thank you very much for pointing out this issue. There is indeed a small problem here because the method of calculating the average can lead to slight discrepancies in the results. For the Fluency metric, we multiplied the results by 100 for better presentation.

@XeeKee (Collaborator) commented Oct 17, 2024:

  1. the meaning of pre-edit metrics
    I'm uncertain about the meaning of the pre-edit metrics. My speculation: we can compare the Fluency before and after the edit to evaluate whether the edit affects the generation ability of the model; we can also compare the Portability before and after the edit to evaluate whether the edit makes the knowledge relatively more or less portable; and the pre-edit Locality is meaningless. However, I don't understand the meaning of the pre-edit rewrite_acc, how it is computed, and why it can be greater than 0 before we conduct any editing.

Clarification of Misunderstanding

There has been some misunderstanding regarding our evaluation metrics. Here's a breakdown:

  1. Fluency
    This metric evaluates whether the smoothness of the output has changed after the model performs an edit.

  2. Portability
    This assesses the robustness of the knowledge model after the edit, checking how well the model retains its overall stability.

  3. Locality
    This metric examines whether the edit affects unrelated knowledge, ensuring that the changes remain localized.

  4. Rewrite Accuracy (rewrite_acc)
    This measures the reliability of the edit.

Why Rewrite Accuracy Can Be Greater Than 0 Before Editing

For example, suppose we want to change the name of the President of the United States from "Joe Biden" to "Trump Biden." When calculating the accuracy, the target_new is "Trump Biden," but the original response was "Joe Biden." Since both names share the surname "Biden," the token-level accuracy for that part will match. Therefore, even before the edit, the accuracy may not be 0.
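A toy illustration of this point (the word-level split below is purely hypothetical and simplified; the real metric works on the tokenizer's ids, not on words):

    # Token-level accuracy vs. exact match: shared tokens already count as
    # correct before any edit, so the pre-edit rewrite_acc can exceed 0.
    def token_accuracy(predicted, target):
        matches = sum(p == t for p, t in zip(predicted, target))
        return matches / len(target)

    pre_edit_answer = ["Joe", "Biden"]    # model output before editing
    target_new = ["Trump", "Biden"]       # desired answer after editing

    print(token_accuracy(pre_edit_answer, target_new))  # 0.5, not 0.0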

@XeeKee (Collaborator) commented Oct 17, 2024:

Thank you very much for your interest in EasyEdit. If it's convenient, could you add me on WeChat? That way, we can communicate promptly if any issues arise.
My WeChat ID: xi1786594371

@littlefive5 (Collaborator) commented Oct 17, 2024:

Hello,
I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework.
Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements.
The results you currently get are correct.
I am in the process of running additional experiments to gather the correct data.
Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

As to the error you met in ROME, I will try to reproduce it and will let you know.

Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email (yyztodd@zju.edu.cn) or WeChat (yaoxiaoyun12).

@StarLooo (Author):

You can check the code here:

def test_prediction_acc(model, tok, hparams, prompts, targets, device, locality=False, vanilla_generation=False):

We did not pass in any additional parameters. According to https://huggingface.co/docs/transformers/generation_strategies, Hugging Face uses greedy decoding by default. Additionally, I noticed that on our local server, the model's generation_config does not include parameters for top_p and temperature. Therefore, I believe our code is reproducible. Of course, your issue has also made us aware that this part of the code is not perfect, and we will set the generation parameters to greedy decoding in the near future.

I got it: with the default do_sample=False, EasyEdit uses greedy decoding by default. But as you mentioned, if the user's local generation_config.json file contains specific generation hyper-parameters, it may cause warnings or an unexpected generation strategy. Thanks for your timely reply!

@StarLooo (Author):

Clarification of Misunderstanding

There has been some misunderstanding regarding our evaluation metrics. Here's a breakdown:

  1. Fluency
    This metric evaluates whether the smoothness of the output has changed after the model performs an edit.
  2. Portability
    This assesses the robustness of the knowledge model after the edit, checking how well the model retains its overall stability.
  3. Locality
    This metric examines whether the edit affects unrelated knowledge, ensuring that the changes remain localized.
  4. Rewrite Accuracy (rewrite_acc)
    This measures the reliability of the edit.

Why Rewrite Accuracy Can Be Greater Than 0 Before Editing

For example, suppose we want to change the name of the President of the United States from "Joe Biden" to "Trump Biden." When calculating the accuracy, the target_new is "Trump Biden," but the original response was "Joe Biden." Since both names share the surname "Biden," the token-level accuracy for that part will match. Therefore, even before the edit, the accuracy may not be 0.

Thanks for your clarification! I previously mistakenly thought that rewrite_acc was a metric based on exact match. By the way, why was token-level accuracy chosen instead of exact match when designing the Edit Succ. metric? Was it for a smoother measurement?

@StarLooo (Author):

Thank you very much for pointing out this issue. There is indeed a small problem here because the method of calculating the average can lead to slight discrepancies in the results. For the Fluency metric, we multiplied the results by 100 for better presentation.

Thanks for your timely reply! So, are my understanding of the averaging method, my bug report on the summary_metrics function, and my attempted modification correct?

@StarLooo (Author):

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are correct. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.
Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email (yyztodd@zju.edu.cn) or WeChat (yaoxiaoyun12).

Thanks for your continuous efforts in updating, and I look forward to your new progress. I will also try to read through the relevant code and get the latest, correct results based on the current EasyEdit. I just took a quick look at the updated run_knowedit_llama2.py, and the eval function seems to be largely consistent with the modified summary_metrics method I mentioned earlier.
Besides, can you further explain or locate which updates (bug fixes, code refactoring, ...) in EasyEdit led to the change in edit performance?

@StarLooo (Author):

Thank you very much for your interest in EasyEdit. If it's convenient, could you add me on WeChat? That way, we can communicate promptly if any issues arise. My WeChat ID: xi1786594371

Thanks for your enthusiasm. If further discussion is needed, I will contact you via WeChat.

@StarLooo (Author):

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are correct. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

I just found a small bug in the eval function of the updated run_knowedit_llama2.py: before appending np.mean(case_list) to the Portability_list (and, for robustness, the Locality_list), we should first check whether case_list is empty; otherwise np.mean(case_list) will raise RuntimeWarning: Mean of empty slice and produce nan entries in Portability_list for some datasets such as Wikidata_counterfact. I guess the reason is that some edit samples do not have a related portability evaluation sample.
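A minimal sketch of the guard described above (the variable names follow this description and are illustrative, not necessarily the exact script):

    import numpy as np

    # Guard against edit samples without portability probes: only append the
    # per-edit mean when the sub-score list is non-empty, so np.mean never runs
    # on an empty list and the final average never becomes nan.
    Portability_list = []
    for case_list in ([1.0, 0.0], [], [1.0]):   # toy per-edit sub-score lists
        if len(case_list) > 0:
            Portability_list.append(np.mean(case_list))

    print(np.mean(Portability_list))  # 0.75, averaged over edits that have probes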

@littlefive5 (Collaborator):

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are correct. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

I just found a small bug in the eval function of the updated run_knowedit_llama2.py: before appending np.mean(case_list) to the Portability_list (and, for robustness, the Locality_list), we should first check whether case_list is empty; otherwise np.mean(case_list) will raise RuntimeWarning: Mean of empty slice and produce nan entries in Portability_list for some datasets such as Wikidata_counterfact. I guess the reason is that some edit samples do not have a related portability evaluation sample.

You're right, I have updated the code here.

@littlefive5 (Collaborator):

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are correct. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.
Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email (yyztodd@zju.edu.cn) or WeChat (yaoxiaoyun12).

Thanks for your continuous efforts in updating, and I look forward to your new progress. I will also try to read through the relevant code and get the latest, correct results based on the current EasyEdit. I just took a quick look at the updated run_knowedit_llama2.py, and the eval function seems to be largely consistent with the modified summary_metrics method I mentioned earlier. Also, can you explain more or locate which updates (bug fixes, code refactoring, ...) in EasyEdit led to the change in edit performance?

Hello, the main update in the code is the loss. We previously followed ROME's code to compute the loss, which uses the last-token representation to calculate the loss on the target sequence (the FT-L setting). We then updated it to use the conditional output to compute the loss (FT-M). We have changed the loss computation for both the FT and AdaLoRA methods.
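For context, a minimal sketch of the FT-M-style objective described here: cross-entropy on the target-answer tokens only, with the prompt positions masked out of the loss (shapes and names are illustrative assumptions, not the EasyEdit implementation):

    import torch

    # FT-M-style loss sketch: labels equal the input ids, but prompt positions
    # are set to -100 so the cross-entropy is computed on the target answer only.
    def ft_m_loss(model, input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
        labels = input_ids.clone()
        labels[:, :prompt_len] = -100            # ignore prompt tokens in the loss
        out = model(input_ids=input_ids, labels=labels)
        return out.loss                          # mean CE over unmasked (target) tokens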

@StarLooo (Author):

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are correct. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

Hello! I'm pleased to inform you that I have retested some of these settings (AdaLoRA, ROME and FT-L on ZSRE) and achieved metrics quite similar to your updated results. Again, looking forward to your next update and greatly appreciating your ongoing efforts.

@StarLooo (Author):

Hello, the main update in the code is the loss. We previously followed ROME's code to compute the loss, which uses the last-token representation to calculate the loss on the target sequence (the FT-L setting). We then updated it to use the conditional output to compute the loss (FT-M). We have changed the loss computation for both the FT and AdaLoRA methods.

Thanks for your explanation, I'll check the relevant code and papers for more details. I noticed that although the Edit Succ. of AdaLoRA increases (from 69.86 to 100.0 on ZSRE) after this update, its Locality also decreases a lot (from 72.21 to 35.16 on ZSRE). Is this phenomenon explainable or predictable?

@littlefive5 (Collaborator) commented Oct 20, 2024:

Hello, the main update in the code is the loss. We previously followed ROME's code to compute the loss, which uses the last-token representation to calculate the loss on the target sequence (the FT-L setting). We then updated it to use the conditional output to compute the loss (FT-M). We have changed the loss computation for both the FT and AdaLoRA methods.

Thanks for your explanation, I'll check the relevant code and papers for more details. I noticed that although the Edit Succ. of AdaLoRA increases (from 69.86 to 100.0 on ZSRE) after this update, its Locality also decreases a lot (from 72.21 to 35.16 on ZSRE). Is this phenomenon explainable or predictable?

From my point of view, the original loss failed to learn the updated knowledge and left the model largely unchanged. But it's truly an interesting phenomenon, as FT-L and FT-M do not show this trend. Maybe we need more tests here.

@littlefive5 (Collaborator) commented Oct 20, 2024:

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get are correct. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

DataSet              Metric       AdaLoRA  ROME    MEMIT   FT-L
WikiData_recent      Edit Succ.   100.00   97.40   97.05   55.74
                     Portability  65.51    40.21   41.49   40.78
                     Locality     53.71    37.33   35.57   43.70
                     Fluency      577.26   578.72  573.72  524.83
ZsRE                 Edit Succ.   100.00   97.22   95.33   53.93
                     Portability  58.03    51.99   52.70   45.64
                     Locality     35.16    65.66   63.49   38.07
                     Fluency      548.08   571.60  563.31  492.77
WikiBio              Edit Succ.   100.00   96.05   95.07   66.33
                     Locality     38.49    67.99   68.77   40.27
                     Fluency      617.70   617.33  617.22  606.58
WikiData_counterfact Edit Succ.   100.00   98.63   98.02   45.15
                     Portability  73.04    42.81   45.72   33.57
                     Locality     64.36    28.08   25.32   50.50
                     Fluency      580.29   582.46  575.96  527.58
As to the error you met in ROME, I will try to reproduce it and will let you know.

Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email (yyztodd@zju.edu.cn) or WeChat (yaoxiaoyun12).

Hello,
I have updated these results and the arxiv paper will be updated this week.
If you meet other problems or bugs, leave a message. Thanks again for your help!

@StarLooo (Author):

Hello, I have updated these results and the arxiv paper will be updated this week. If you meet other problems or bugs, leave a message. Thanks again for your help!

Thanks for your timely results update. I have also conducted experiments based on the latest code, and here are my results:

Dataset Metric AdaLoRA ROME FT-L FT-M
ZSRE Edit Succ. 100.0 96.61 53.93 99.99
Portability 56.52 52.65 45.67 60.31
Locality 38.05 65.26 38.05 45.29
Fluency 563.57 573.44 493.01 552.26
Wikidata_counterfact Edit Succ. 100.0 98.58 45.15 100.0
Portability 69.89 43.72 33.60 74.37
Locality 34.02 67.62 25.67 36.59
Fluency 584.68 584.05 528.28 575.55

As you can see, the results on the ZSRE are quite similar to yours; however, there are notable differences on the Wikidata_counterfact, particularly in the Locality metric.

@StarLooo (Author):

From my point of view, the original loss failed to learn the updated knowledge and left the model largely unchanged. But it's truly an interesting phenomenon, as FT-L and FT-M do not show this trend. Maybe we need more tests here.

Certainly, this is an interesting phenomenon worthy of study.

@StarLooo (Author):

While reviewing the code, I came across a confusing section in the collate_fn method of the KnowEditDataset class within KnowEdit.py. Specifically, in this line of code it seems that relation_specificity and forgetfulness are mutually exclusive when constructing the batch. Although, to my understanding, the collate_fn method has no effect when we do a normal single edit with the default setting, and perhaps also has no effect in the sequential-edit setting, it can indeed influence the batch-edit setting. I don't understand the purpose of this line of code.

@littlefive5 (Collaborator):

Hello, I have updated these results and the arxiv paper will be updated this week. If you meet other problems or bugs, leave a message. Thanks again for your help!

Thanks for your timely results update. I have also conducted experiments based on the latest code, and here are my results:

Dataset Metric AdaLoRA ROME FT-L FT-M
ZSRE Edit Succ. 100.0 96.61 53.93 99.99
Portability 56.52 52.65 45.67 60.31
Locality 38.05 65.26 38.05 45.29
Fluency 563.57 573.44 493.01 552.26
Wikidata_counterfact Edit Succ. 100.0 98.58 45.15 100.0
Portability 69.89 43.72 33.60 74.37
Locality 34.02 67.62 25.67 36.59
Fluency 584.68 584.05 528.28 575.55
As you can see, the results on the ZSRE are quite similar to yours; however, there are notable differences on the Wikidata_counterfact, particularly in the Locality metric.

Quite weird, I will double-check my results to see if I made a mistake when I typed the results in the table.

@littlefive5 (Collaborator):

While reviewing the code, I came across a confusing section in the collate_fn method of the KnowEditDataset class within KnowEdit.py. Specifically, in this line of code it seems that relation_specificity and forgetfulness are mutually exclusive when constructing the batch. Although, to my understanding, the collate_fn method has no effect when we do a normal single edit with the default setting, and perhaps also has no effect in the sequential-edit setting, it can indeed influence the batch-edit setting. I don't understand the purpose of this line of code.

The collate function is not used; I will delete it. I construct the input here.

@littlefive5 (Collaborator):

Hello, I have updated these results and the arxiv paper will be updated this week. If you meet other problems or bugs, leave a message. Thanks again for your help!

Thanks for your timely results update. I have also conducted experiments based on the latest code, and here are my results:

Dataset Metric AdaLoRA ROME FT-L FT-M
ZSRE Edit Succ. 100.0 96.61 53.93 99.99
Portability 56.52 52.65 45.67 60.31
Locality 38.05 65.26 38.05 45.29
Fluency 563.57 573.44 493.01 552.26
Wikidata_counterfact Edit Succ. 100.0 98.58 45.15 100.0
Portability 69.89 43.72 33.60 74.37
Locality 34.02 67.62 25.67 36.59
Fluency 584.68 584.05 528.28 575.55
As you can see, the results on the ZSRE are quite similar to yours; however, there are notable differences on the Wikidata_counterfact, particularly in the Locality metric.

Is it ok for you to share the results.json? Maybe I can check where the discrepancy comes from?

@StarLooo (Author):

Is it ok for you to share the results.json? Maybe I can check where the discrepancy comes from?

Yes, here are the AdaLoRA and ROME results.json files on Wikidata_counterfact:
LoRA_counterfact_7b-chat_results.json
ROME_counterfact_7b-chat_results.json

@StarLooo (Author) commented Nov 4, 2024:

Yes, I remember that before I updated the environment, the ROME results were also good, but it seems that after I updated the environment, the results are not that good. Can you provide me with your current environment, including the torch and python versions?

Also, how do you compute the covariance of Wikipedia?

  1. Here is my current environment, obtained via pip freeze:
    myrequirements.txt

  2. Sorry, but I do not understand what "the covariance of Wikipedia" means. What is it used for, and which parts of the code are related to it?

@StarLooo (Author) commented Nov 4, 2024:

Yes, I remember that before I updated the environment, the ROME results were also good, but it seems that after I updated the environment, the results are not that good. Can you provide me with your current environment, including the torch and python versions?
Also, how do you compute the covariance of Wikipedia?

  1. Here is my current environment, obtained via pip freeze:
    myrequirements.txt
  2. Sorry, but I do not understand what "the covariance of Wikipedia" means. What is it used for, and which parts of the code are related to it?

In fact, I am just a beginner in the field of knowledge editing. I happened to be reading the ROME paper today, and I have just reached the relevant section ("Here W is the original matrix, C = KK^T is a constant that we pre-cache by estimating the uncentered covariance of k from a sample of Wikipedia text"). :)

@littlefive5 (Collaborator):

Well, I can reproduce your results using the provided environments!
So the reason here is the environment.

@StarLooo (Author) commented Nov 4, 2024:

Well, I can reproduce your results using the provided environments! So the reason here is the environment.

Can we determine which package has the greatest impact? I guess the transformers and peft packages may be the reason. A controlled-variable approach combined with a "binary search" over versions may help us find the responsible package(s).

@littlefive5 (Collaborator):

Yeah, the peft version influences the performance of LoRA. I changed it from 0.7.0 to 0.12.0 and can get your results.
For ROME, I will check again.

@littlefive5 (Collaborator) commented Nov 4, 2024:

I finally found the reason for the difference in ROME and MEMIT.
I guess that in your editor.py, when loading the LlamaTokenizer, the code is

self.tok = AutoTokenizer.from_pretrained(self.model_name)

while in the current code, it is

self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)

Well, in ROME and MEMIT, we need to set use_fast=True, which is the default setting here, as we need to keep the blank for this step of the computation in compute_v in ROME:
[screenshot of the correctly tokenized prompt]

When we set it to False, the tokenizer removes the blank here, which leads to incorrect computation:
[screenshot of the incorrectly tokenized prompt]

Hence, when it is set to True, the computation is correct:
[screenshot of the correct computation]

So, we will add a check here to make sure the tokenizer is set correctly for the different knowledge editing methods. I think your results should be correct, and I will update the results soon.

Thanks for your help, I really need to sleep now🤣.
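A tiny string-level illustration of the failure mode described above (the prompt text is hypothetical; compute_v concatenates the formatted prompt with the decoded target prefix):

    # compute_v builds each rewriting prompt roughly as:
    #   context.format(request["prompt"]) + tok.decode(target_ids[:-1])
    # so if decode() drops the leading blank, prompt and target fuse together.
    prompt = "The country in which Aleppo is located is"   # hypothetical prompt
    with_blank = " Syria"      # decode keeps the leading blank -> correct
    without_blank = "Syria"    # decode drops the blank -> corrupted prompt

    print(prompt + with_blank)     # "... is Syria"
    print(prompt + without_blank)  # "... isSyria"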

@StarLooo (Author) commented Nov 4, 2024:

Well, I can reproduce your results using the provided environments! So the reason here is the environment.

I can also reproduce the AdaLoRA result on Wikidata_counterfact with a new environment following your requirement.txt, but the ROME result is still different; I'll check the latest cause you just mentioned. Anyhow, I wish you a good sleep.

@StarLooo (Author) commented Nov 4, 2024:

I finally found the reason for the difference in ROME and MEMIT. I guess that in your editor.py, when loading the LlamaTokenizer, the code is

self.tok = AutoTokenizer.from_pretrained(self.model_name)

while in the current code, it is

self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)

Well, in ROME and MEMIT, we need to set use_fast=True, which is the default setting here, as we need to keep the blank for this step of the computation in compute_v in ROME: [screenshot]

When we set it to False, the tokenizer removes the blank here, which leads to incorrect computation: [screenshot]

Hence, when it is set to True, the computation is correct: [screenshot]

So, we will add a check here to make sure the tokenizer is set correctly for the different knowledge editing methods. I think your results should be correct, and I will update the results soon.

Thanks for your help, I really need to sleep now🤣.

Perhaps the bad news is that in my code, the initialization of the llama tokenizer is also set with use_fast=False in editor.py:

            elif 'llama' in self.model_name.lower():
                self.model = AutoModelForCausalLM.from_pretrained(self.model_name, **model_kwargs)
                self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)
                self.tok.pad_token_id = self.tok.eos_token_id

So, I do not know whether that is the only reason for these differences, and I'll try ROME with use_fast=True tonight.
Another question: when I rerun with use_fast=True or change the package version, should I remove the original pre-edit files?

@StarLooo (Author) commented Nov 4, 2024:

To help you better find potential bugs, these .py code files should be helpful (I may have added some comments to help myself read and check the code, but as far as I remember, I did not change the essence):
some_of_my_py_files.zip

@littlefive5 (Collaborator) commented Nov 4, 2024:

I finally found the reason for the difference in ROME and MEMIT. I guess that in your editor.py, when loading the LlamaTokenizer, the code is

self.tok = AutoTokenizer.from_pretrained(self.model_name)

while in the current code, it is

self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)

Well, in ROME and MEMIT, we need to set use_fast=True, which is the default setting here, as we need to keep the blank for this step of the computation in compute_v in ROME: [screenshot]
When we set it to False, the tokenizer removes the blank here, which leads to incorrect computation: [screenshot]
Hence, when it is set to True, the computation is correct: [screenshot]
So, we will add a check here to make sure the tokenizer is set correctly for the different knowledge editing methods. I think your results should be correct, and I will update the results soon.
Thanks for your help, I really need to sleep now🤣.

Perhaps the bad news is that in my code, the initialization of the llama tokenizer is also set with use_fast=False in editor.py:

            elif 'llama' in self.model_name.lower():
                self.model = AutoModelForCausalLM.from_pretrained(self.model_name, **model_kwargs)
                self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)
                self.tok.pad_token_id = self.tok.eos_token_id

So, I do not know whether that is the only reason for these differences, and I'll try ROME with use_fast=True tonight. Another question: when I rerun with use_fast=True or change the package version, should I remove the original pre-edit files?

You do not need to remove the pre_edit files.
I need your ROME log file and your new output.

@StarLooo (Author) commented Nov 5, 2024:

You do not need to remove the pre_edit files. I need your ROME log file and your new output.

I neglected to rerun the experiment last night. Today, I plan to assess whether setting use_fast=True yields a different outcome. Upon obtaining the new results, I will promptly reach out to you.

UPDATE: Due to limited computational resources, I still need more time to check ROME.

@littlefive5 (Collaborator):

I can get your ROME results when I make sure the tokenizer correctly builds the prompt in compute_v.
I think you can just check whether your code adds the blank here. If it is set correctly, the results are good, but if it fails to build the prompt, the results are not that good.

@StarLooo (Author) commented Nov 5, 2024:

I can get your ROME results when I make sure the tokenizer correctly builds the prompt in compute_v. I think you can just check whether your code adds the blank here. If it is set correctly, the results are good, but if it fails to build the prompt, the results are not that good.

That is inspiring news. Let me give a short summary of the discussion here.
After:

  1. updating the environment to newer package versions, especially of the peft package,
  2. making sure left padding is used in test_prediction_acc (see the sketch after this list),
  3. making sure the tokenizer correctly builds the prompt in compute_v of ROME,

we can finally get the same results on FT-L, AdaLoRA, and ROME as I reported.
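A minimal sketch of point 2, left padding for batched evaluation of a decoder-only model (the checkpoint name is only an example; the pad-token idiom mirrors the editor.py snippet quoted earlier in this thread):

    from transformers import AutoTokenizer

    # Decoder-only models should be left-padded for batched generation, so each
    # prompt's last real token sits right before the generated continuation.
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", use_fast=False)
    tok.pad_token_id = tok.eos_token_id
    tok.padding_side = "left"

    batch = tok(["The capital of France is", "Messi plays for"],
                return_tensors="pt", padding=True)
    print(batch["input_ids"].shape)   # sequences padded on the left side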

@littlefive5 (Collaborator):

Yes

@StarLooo (Author) commented Nov 6, 2024:

Hello, I think the bug in ROME has nothing to do with the fast tokenizer, and here is my experiment:

First, I changed this code in compute_v for easier debugging:
from

    rewriting_prompts, kl_prompts = [
        context.format(request["prompt"]) + tok.decode(target_ids[:-1])
        for context in context_templates
    ], ["{} is a"]
    all_prompts = rewriting_prompts + kl_prompts

to

kl_prompts = ["{} is a"]
rewriting_prompts = []
for context in context_templates:
    part_1 = context.format(request["prompt"])
    part_2 = tok.decode(target_ids[:-1])
    rewriting_prompt = context.format(request["prompt"]) + tok.decode(target_ids[:-1])
    print(f"use fast tokenizer? {tok.is_fast}")
    print(f"target_ids[:-1]:\n{target_ids[:-1]}")
    print(f"context:\n#{context}#")
    print(f"part_1:\n#{part_1}#")
    print(f"part_2:\n#{part_2}#")
    print(f"rewriting_prompt:\n{rewriting_prompt}")
    input("press any key to confirm")
    rewriting_prompts.append(rewriting_prompt)
all_prompts = rewriting_prompts + kl_prompts

After running with use_fast=False, I get this output:
[screenshot of the debug output]

As you can see:

  1. tok.is_fast returns False, which verifies that we are not using the fast tokenizer
  2. the token id '29871' plays the role of the blank, which is what we need
  3. the ' Syria' enclosed in the red box originally has a leading blank

So I think the problem is at the string (input) level, not at the model/tokenizer level.

Besides, there are also some discussions about the llama tokenizer on token id '29871'.

@littlefive5 (Collaborator):

Interesting, if I run your code in your provided environment with use_fast=False, it looks like this:
[screenshot of the output]

@StarLooo (Author) commented Nov 6, 2024:

Interesting, if I run your code in your provided environment with use_fast=False, it looks like this: [screenshot]

It's quite weird. Your target_ids[:-1] is also [29871, 8713], but after decoding, the blank disappears.

But if I directly decode [29871, 8713]:

    print(tokenizer.is_fast)
    x = [29871, 8713]
    print(f"#{tokenizer.decode(x)}#")

the result also includes the blank:
[screenshot of the output]

Could it be an issue with the console font display?

@littlefive5 (Collaborator):

That's not the reason. I will check the tokenizer config to see what happened; maybe something is wrong with the tokenizer setting.

@littlefive5 (Collaborator):

It turns out that the reason is legacy, a property of the original LlamaTokenizer.
In my tokenizer, legacy is set to true (I don't know why; maybe our group's tokenizer config is too old and missing it).
Since LlamaTokenizerFast does not have the legacy feature, setting it to true does not influence its behavior.

[screenshots of the tokenizer config and the resulting decoding]
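To see this locally, one could compare how the slow tokenizer decodes the ids discussed above with legacy toggled (a sketch; the checkpoint path is only an example, the legacy kwarg requires a recent transformers version, and the exact output depends on that version):

    from transformers import AutoTokenizer

    ids = [29871, 8713]  # the ids from the screenshots above (blank marker + "Syria")

    for legacy in (True, False):
        # use_fast=False selects the slow LlamaTokenizer, which accepts `legacy`
        tok = AutoTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf", use_fast=False, legacy=legacy
        )
        print(f"legacy={legacy}: {tok.decode(ids)!r}")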

@StarLooo (Author) commented Nov 6, 2024:

It turns out that the reason is legacy, a property of the original LlamaTokenizer. In my tokenizer, legacy is set to true (I don't know why; maybe our group's tokenizer config is too old and missing it). Since LlamaTokenizerFast does not have the legacy feature, setting it to true does not influence its behavior.

[screenshots of the tokenizer config and the resulting decoding]

Good, it seems that we have resolved all the issues.
Another concern of mine is whether other methods like SERAC and GRACE, and scenarios considering sequential editing such as WISE, are also affected. I noticed that run_knowedit_llama2.py imports MENDHyperParams and SERACHparams but doesn't contain any handling of these two edit methods.

@littlefive5 (Collaborator):

Well, we didn't conduct experiments on these two methods under KnowEdit. But from my view, this would not affect the performance, as they do not use the peft module and the tokenizer does not involve decoding in these methods.
Also, the padding_side in these methods is set to 'left'.

@StarLooo (Author) commented Nov 6, 2024:

Well, we didn't conduct experiments on these two methods under KnowEdit. But from my view, this would not affect the performance, as they do not use the peft module and the tokenizer does not involve decoding in these methods. Also, the padding_side in these methods is set to 'left'.

Well, I plan to read through the code as well as the papers of these methods, and then I'll try to integrate MEND and SERAC into the current framework.

@littlefive5 (Collaborator):

Great, we will update the new results on arXiv this week.

@zxlzr (Contributor) commented Nov 7, 2024:

Dear StarLooo,

Thank you very much for raising this issue. After a long and thorough debugging process, the problem has finally been identified, and we will work on an update as soon as possible.

EasyEdit will continue to optimize, improve, and add new features. We look forward to more collaboration and communication with you.

EasyEdit Team

@zxlzr changed the title from "[Fixing bug in progress] Difficulties to reproduce the KnowEdit results in the survey paper" to "[Bug Fixed] Difficulties to reproduce the KnowEdit results in the survey paper" on Nov 7, 2024
@littlefive5 (Collaborator) commented Nov 8, 2024:

I have updated some results using your provided environment. You can check again; I think a small difference is acceptable.

WikiData_recent

Method Edit Succ. Portability Locality Fluency
AdaLoRA 100.00 64.69 56.42 579.57
ROME 97.18 55.25 54.77 579.66
MEMIT 97.05 56.37 52.15 573.89
FT-L 55.75 40.86 43.70 529.24
FT-M 100.00 65.44 64.33 574.32

ZsRE

Method Edit Succ. Portability Locality Fluency
AdaLoRA 100.00 58.03 75.76 563.56
ROME 96.77 52.63 53.67 573.75
MEMIT 95.37 52.67 48.32 563.56
FT-L 53.93 45.64 73.42 493.01
FT-M 99.98 60.31 89.78 552.26

WikiData_counterfact

Method Edit Succ. Portability Locality Fluency
AdaLoRA 100.00 69.89 70.31 580.29
ROME 98.57 55.92 51.97 584.04
MEMIT 98.05 58.56 46.62 575.96
FT-L 45.15 33.60 50.48 528.26
FT-M 100.00 74.36 76.76 575.62

WikiBio

Method Edit Succ. Locality Fluency
AdaLoRA 100.00 81.28 618.45
ROME 96.08 62.74 617.69
MEMIT 94.40 61.51 616.65
FT-L 66.33 79.86 606.95
FT-M 100.00 93.38 612.69

@StarLooo (Author) commented Nov 9, 2024:

I have updated some results using your provided environment. You can check again; I think a small difference is acceptable.

Within the allowable range of randomness, these AdaLoRA/ROME/FT-L/FT-M results are consistent with what I previously reproduced on Wikidata_recent, Wikidata_counterfact, and ZSRE. I'll also try other models and datasets.
Thanks for your timely update.

@littlefive5 (Collaborator):

I will provisionally close this issue; you can reopen it if you run into other problems.

@zxlzr (Contributor) commented Nov 18, 2024:

Dear StarLooo,

We have fixed the bug and will update the paper on arXiv tomorrow (the README has been updated). We have written a pinned issue statement explaining the cause of this issue and included an announcement in the News. Thank you very much for your help!

The following is a statement.

Dear all:

Recently, with the help from the community (special thanks to @StarLooo), we will update the KnowEdit results (Llama2-7b-chat) in Table 4 of the paper ‘A Comprehensive Study of Knowledge Editing for Large Language Models’. Overall, the results have improved, primarily due to the following reasons:

1. AdaLoRA Optimization: we follow FT-M instead of FT-L, which trains the same FFN layer as FT-L using cross-entropy loss on the target answer while masking the original text. This approach not only yields better results but also highlights the optimal performance of AdaLoRA. Meanwhile, the peft version also affects the performance.

2. ROME and MEMIT Updates: The results are updated after identifying missing components in the local version of the Llama2-7b-chat files (specifically, the legacy feature in the tokenizer_config.json). If you are using the official Llama2-7b-chat model downloaded directly from HF, this issue should not affect your results. Also, we fixed a bug related to the padding_side for these two methods, which influences the performance when you compute the results for batch inputs.

We deeply apologize for any inconvenience caused by this bug.

We will continue improving EasyEdit, updating this paper, and welcome everyone to engage in discussions and share ideas.

EasyEdit Team
