
Evaluation Metrics #4

Open
WWEISONG opened this issue Nov 11, 2024 · 13 comments

@WWEISONG

Hi Shilin,

This is great work, and thanks for releasing it to the public.

I was confused about TPR@0.1%FPR: from your code, it seems only bit accuracy is computed. Could you please indicate how you calculate the TPR? Does it require the decoded watermark to exactly match the secret (i.e., 100% bit accuracy)?

@Shilin-LU
Owner

Hi, thank you for your interest!

  • TPR@0.1%FPR is a statistical metric that applies to a set of images rather than individual images. Therefore, it isn’t included in our single-image demo. However, we plan to release this metric alongside our full benchmark once our work is accepted, so please stay tuned!

  • Decoded accuracy does not need to be 100%. To put it simply:
    If a model decodes unwatermarked images with about 50% bit accuracy and watermarked images with about 75% accuracy, statistical testing can likely classify the images correctly.
    However, if a model frequently achieves high bit accuracy on unwatermarked images, then even high accuracy on watermarked images is accompanied by a high false positive rate, indicating poor performance as a watermarking model.

  • Why some methods achieve high bit accuracy but low TPR@0.1%FPR: This occurs when methods often misclassify unwatermarked images as watermarked, mistakenly decoding a message from unwatermarked images.

  • ROC may be of interest to you for understanding TPR@0.1%FPR. For a simple implementation: compute the per-image decoding accuracies of both unwatermarked and watermarked images, plot the ROC curve, and use the following code to determine the TPR at various FPRs. (Alternatively, you can wait for our upcoming benchmark, which will include the complete metric calculation code.)

import numpy as np
from sklearn import metrics

def compute_auroc_tpr_fpr(unwmed, wmed):
    # unwmed / wmed: per-image decoding (bit) accuracies for unwatermarked
    # and watermarked images, respectively.
    labels = np.concatenate([np.zeros(len(unwmed)), np.ones(len(wmed))])
    scores = np.concatenate([unwmed, wmed])
    fpr, tpr, thresholds = metrics.roc_curve(labels, scores, pos_label=1)
    auc = metrics.auc(fpr, tpr)
    tpr_at_1_fpr = tpr[np.where(fpr < 0.01)[0][-1]]    # TPR at 1% FPR
    tpr_at_01_fpr = tpr[np.where(fpr < 0.001)[0][-1]]  # TPR at 0.1% FPR
    return auc, tpr_at_1_fpr, tpr_at_01_fpr
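
As a quick sanity check, the function above can be called with per-image bit accuracies; the values below are synthetic and purely illustrative:

# Synthetic per-image bit accuracies (illustrative only): unwatermarked images
# decode near chance level, watermarked images decode with high accuracy.
rng = np.random.default_rng(0)
unwmed_acc = np.clip(rng.normal(0.50, 0.05, size=2000), 0.0, 1.0)
wmed_acc = np.clip(rng.normal(0.95, 0.03, size=2000), 0.0, 1.0)

auc, tpr_at_1_fpr, tpr_at_01_fpr = compute_auroc_tpr_fpr(unwmed_acc, wmed_acc)
print(f"AUROC={auc:.4f}, TPR@1%FPR={tpr_at_1_fpr:.4f}, TPR@0.1%FPR={tpr_at_01_fpr:.4f}")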

@WWEISONG
Author

Thank you very much. This makes a lot of sense to me.

@WWEISONG
Author

By the way, when will the W-Bench be released? Thanks in advance!

@Shilin-LU
Owner

Shilin-LU commented Nov 14, 2024

By the way, when will the W-Bench be released? Thanks in advance!

Hi, W-Bench will be released once our work is published. Thanks for your interest!

@WWEISONG
Author

Hi Shilin, thank you very much.

Could you please check the links for "VINE-B-Enc" and "VINE-B-Dec"? Both currently navigate to the encoder; there is no link to the decoder. Thanks in advance!

@Shilin-LU
Owner

Hi Shilin, thank you very much.

Could you please check the links for "VINE-B-Enc" and "VINE-B-Dec"? Both currently navigate to the encoder; there is no link to the decoder. Thanks in advance!

Thank you for your reminder! I have corrected it!

@WWEISONG
Author

WWEISONG commented Nov 28, 2024

Hi Shilin,

I used a diffusion model as the regeneration attack (https://github.com/XuandongZhao/WatermarkAttacker) to test VINE. However, it shows that VINE only reaches around 17% TPR@0.1%FPR, which is quite different from the results reported in the VINE paper. Did you try this existing attack before? And could you please share which diffusion model/pipeline you used to evaluate VINE under the regeneration attack? Thanks in advance.

@Shilin-LU
Owner

Hi, thank you for your interest in VINE.

  • For deterministic regeneration, we have released the corresponding implementation. For stochastic regeneration, we used the same repository and pipeline that you referenced (WatermarkAttacker); a minimal sketch of this kind of attack is included after this list. Our experiments indicate that this approach should not significantly affect our watermarks in terms of TPR@0.1%FPR under 200+ noise steps.

  • If you're observing a TPR@0.1%FPR of around 17%, there might be an issue with the implementation. Possible factors to consider include the number of images used or the statistical testing process. It would be helpful if you could provide more details about your setup.

  • Additionally, please stay tuned, as we will be releasing the complete W-Bench later, which should offer more comprehensive evaluation tools.
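
For reference, a minimal sketch of such a stochastic regeneration attack using the diffusers image-to-image pipeline is shown below. This is a generic approximation rather than the exact WatermarkAttacker code; the model ID, file names, strength, and step count are illustrative assumptions.

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# Load a Stable Diffusion img2img pipeline (the model ID is an illustrative choice).
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

# Load a watermarked image (hypothetical file name).
wm_img = Image.open("watermarked.png").convert("RGB").resize((512, 512))

# `strength` controls how far the image is pushed into the noise schedule before
# being denoised back; it plays a role analogous to the "noise step" setting.
regenerated = pipe(
    prompt="",
    image=wm_img,
    strength=0.2,
    guidance_scale=7.5,
    num_inference_steps=50,
).images[0]
regenerated.save("regenerated.png")

# Decode the watermark from the regenerated image and recompute TPR@0.1%FPR
# over the full test set using the ROC code shown earlier in this thread.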

@WWEISONG
Author

Thanks for your responses.

I strictly followed the implementation (without any modification, just using the code they uploaded) and the diffusion models referenced in that work. I tried v2-1 and v1-4 with noise steps of 30, 60, and 100 on both the OpenImage and COCO datasets (2000 samples). The TPR@0.1%FPR is consistently much lower.

@Shilin-LU
Owner

Thanks for your responses.

There might be an issue with your statistical testing process. I recommend trying other image editing methods, such as UltraEdit or image inversion, which we have provided. This can help verify the accuracy of your statistical tests and determine whether the TPR@0.1%FPR is within the expected range.

@WWEISONG
Author

I also tested other image editing methods, including the VAE regeneration from the same reference. Overall, they reach performance similar to what is reported in VINE (~10% fluctuation, which is normal), so I assume the statistical testing process is fine.

Regarding the diffusion attack, which is also a fairly standard implementation from https://github.com/XuandongZhao/WatermarkAttacker, it seems VINE is not robust against it (I tested many times with different settings; performance reached only 15%-25% on 2000 images randomly selected from each of OpenImage and COCO). Could you share the samples or the diffusion pipeline with me?

@Shilin-LU
Owner

Could you share the samples or the diffusion pipeline with me?

We use the same diffusion pipeline as the one in the repository you mentioned. For sample results, you can refer to Appendix F.1 (Image Regeneration), specifically Figures 10 and 11, in our arXiv paper.

We’ve been quite busy recently, but we will be releasing W-Bench soon, which will include these image editing methods and statistical testing codes. You may want to wait for its release for a more comprehensive set of tools.

Thank you for your understanding.

@WWEISONG
Author

Many thanks. I am looking forward to seeing your implementation for W-Bench.
