
Evaluation Metrics #4

Open
WWEISONG opened this issue Nov 11, 2024 · 13 comments

@WWEISONG

Hi Shilin,

This is great work, and thanks for releasing it to the public.

I was confused about TPR@0.1%FPR: from your code, it seems only bit accuracy is computed. Could you please indicate how you calculate the TPR? Does it require the decoded watermark to exactly match the secret (i.e., 100% bit accuracy)?

@Shilin-LU
Owner

Hi, thank you for your interest!

  • TPR@0.1%FPR is a statistical metric that applies to a set of images rather than individual images. Therefore, it isn’t included in our single-image demo. However, we plan to release this metric alongside our full benchmark once our work is accepted, so please stay tuned!

  • Decoded accuracy does not need to be 100%. To put it simply:
    If a model decodes unwatermarked images with about 50% bit accuracy and watermarked images with about 75% accuracy, statistical testing can likely classify the images correctly.
    However, if a model frequently achieves high bit accuracy on unwatermarked images, then even high accuracy on watermarked images is accompanied by a high false positive rate, indicating poor performance as a watermarking model.

  • Why some methods achieve high bit accuracy but low TPR@0.1%FPR: This occurs when methods often misclassify unwatermarked images as watermarked, mistakenly decoding a message from unwatermarked images.

  • ROC may be of interest to you for understanding TPR@0.1%FPR. For a simple implementation: compute the per-image decoding accuracies of both unwatermarked and watermarked images, plot the ROC curve, and use the following code to determine the TPR at various FPRs. (Alternatively, you can wait for our upcoming benchmark, which will include the complete metric calculation code.)

import numpy as np
from sklearn import metrics

def compute_auroc_tpr_fpr(unwmed, wmed):
    # unwmed / wmed: per-image decoding (bit) accuracies for unwatermarked
    # and watermarked images, respectively.
    labels = np.concatenate([np.zeros(len(unwmed)), np.ones(len(wmed))])
    scores = np.concatenate([unwmed, wmed])
    fpr, tpr, thresholds = metrics.roc_curve(labels, scores, pos_label=1)
    auc = metrics.auc(fpr, tpr)
    tpr_at_1_fpr = tpr[np.where(fpr < 0.01)[0][-1]]    # TPR at 1% FPR
    tpr_at_01_fpr = tpr[np.where(fpr < 0.001)[0][-1]]  # TPR at 0.1% FPR
    return auc, tpr_at_1_fpr, tpr_at_01_fpr
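
As a quick sanity check, the function above can be called with per-image bit accuracies; the values below are synthetic and purely illustrative:

# Synthetic per-image bit accuracies (illustrative only): unwatermarked images
# decode near chance level, watermarked images decode with high accuracy.
rng = np.random.default_rng(0)
unwmed_acc = np.clip(rng.normal(0.50, 0.05, size=2000), 0.0, 1.0)
wmed_acc = np.clip(rng.normal(0.95, 0.03, size=2000), 0.0, 1.0)

auc, tpr_at_1_fpr, tpr_at_01_fpr = compute_auroc_tpr_fpr(unwmed_acc, wmed_acc)
print(f"AUROC={auc:.4f}, TPR@1%FPR={tpr_at_1_fpr:.4f}, TPR@0.1%FPR={tpr_at_01_fpr:.4f}")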

@WWEISONG
Author

Thank you very much. This makes a lot of sense to me.

@WWEISONG
Author

By the way, when will the W-Bench be released? Thanks in advance!

@Shilin-LU
Owner

Shilin-LU commented Nov 14, 2024

By the way, when will the W-Bench be released? Thanks in advance!

Hi, W-Bench will be released once our work is published. Thanks for your interest!

@WWEISONG
Author

Hi Shilin, thank you very much.

Could you please check the links for "VINE-B-Enc" and "VINE-B-Dec"? Both currently navigate to the encoder; there is no link to the decoder. Thanks in advance!

@Shilin-LU
Owner

Hi Shilin, thank you very much.

Could you please check the links for "VINE-B-Enc" and "VINE-B-Dec"? Both currently navigate to the encoder; there is no link to the decoder. Thanks in advance!

Thank you for your reminder! I have corrected it!

@WWEISONG
Author

WWEISONG commented Nov 28, 2024

Hi Shilin,

I used a diffusion model as the regeneration attack (https://github.com/XuandongZhao/WatermarkAttacker) to test VINE. However, it shows that VINE only reaches around 17% TPR@0.1%FPR, which is quite different from the results reported in the VINE paper. Did you try this existing attack before? And could you please share which diffusion model/pipeline you used to evaluate VINE under the regeneration attack? Thanks in advance.

@Shilin-LU
Owner

Hi, thank you for your interest in VINE.

  • For deterministic regeneration, we have released the corresponding implementation. For stochastic regeneration, we used the same repository and pipeline that you referenced (WatermarkAttacker); a minimal sketch of this kind of attack is included after this list. Our experiments indicate that this approach should not significantly affect our watermarks in terms of TPR@0.1%FPR under 200+ noise steps.

  • If you're observing a TPR@0.1%FPR of around 17%, there might be an issue with the implementation. Possible factors to consider include the number of images used or the statistical testing process. It would be helpful if you could provide more details about your setup.

  • Additionally, please stay tuned, as we will be releasing the complete W-Bench later, which should offer more comprehensive evaluation tools.
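
For reference, a minimal sketch of such a stochastic regeneration attack using the diffusers image-to-image pipeline is shown below. This is a generic approximation rather than the exact WatermarkAttacker code; the model ID, file names, strength, and step count are illustrative assumptions.

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load a Stable Diffusion img2img pipeline (the model ID is an illustrative choice).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Load a watermarked image (hypothetical file name).
wm_img = Image.open("watermarked.png").convert("RGB").resize((512, 512))

# `strength` controls how far the image is pushed into the noise schedule before
# being denoised back; it plays a role analogous to the "noise step" setting.
regenerated = pipe(
    prompt="",
    image=wm_img,
    strength=0.2,
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
regenerated.save("regenerated.png")

# Decode the watermark from the regenerated image and recompute TPR@0.1%FPR
# over the full test set using the ROC code shown earlier in this thread.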

@WWEISONG
Author

Thanks for your responses.

I strictly followed the implementation (without any modification, just using the code they uploaded) and the diffusion models referenced in that work. I tried v2-1 and v1-4 with noise steps of 30, 60, and 100 on both the OpenImage and COCO datasets (2000 samples). The TPR@0.1%FPR is consistently much lower.

@Shilin-LU
Owner

Thanks for your responses.

There might be an issue with your statistical testing process. I recommend trying other image editing methods, such as UltraEdit or image inversion, which we have provided. This can help verify the accuracy of your statistical tests and determine whether the TPR@0.1%FPR is within the expected range.

@WWEISONG
Author

I also tested other image editing methods, including the VAE regeneration from the same reference. Overall, they reach performance similar to what is reported in VINE (~10% fluctuation, which is normal), so I assume the statistical testing process is fine.

Regarding the diffusion attack, which is also a fairly standard implementation from https://github.com/XuandongZhao/WatermarkAttacker, it seems VINE is not robust against it (I tested many times with different settings; performance reached only 15%-25% on 2000 images randomly selected from each of OpenImage and COCO). Could you share the samples or the diffusion pipeline with me?

@Shilin-LU
Owner

Could you share the samples or the diffusion pipeline with me?

We use the same diffusion pipeline as the one in the repository you mentioned. For sample results, you can refer to Appendix F.1 (Image Regeneration), specifically Figures 10 and 11, in our arXiv paper.

We’ve been quite busy recently, but we will be releasing W-Bench soon, which will include these image editing methods and statistical testing codes. You may want to wait for its release for a more comprehensive set of tools.

Thank you for your understanding.

@WWEISONG
Author

Many thanks. I am looking forward to seeing your implementation for W-Bench.
