
Holistic Evaluation of Text-to-Image Models (HEIM) #1939

Merged: 109 commits into main from heim_merge, Dec 21, 2023
Conversation

Member

@teetone commented Oct 24, 2023

Merging HEIM changes (scenarios, models, and metrics) into HELM.

@teetone requested a review from yifanmai, October 24, 2023 20:44
Collaborator

@yifanmai left a comment

Looks promising, much better than the last PR!

Partial review, since 14k loc is too much for me to review at once.

Also, can we arrange a mini design review meeting for the schema changes?

Review threads (resolved):
scripts/offline_eval/deepfloyd/deepfloyd.py (outdated)
setup.cfg
setup.cfg
src/helm/benchmark/adaptation/adapter_spec.py (outdated)
src/helm/benchmark/adaptation/adapter_spec.py (outdated)
src/helm/benchmark/scenarios/scenario.py (outdated)
src/helm/common/request.py
src/helm/benchmark/metrics/statistic.py (outdated)
scripts/offline_eval/deepfloyd/deepfloyd.py (outdated)
setup.cfg (outdated)
setup.cfg
Collaborator

@yifanmai left a comment

Partial review, will do the rest in a couple of hours.

Review threads (resolved):
src/helm/proxy/clients/clip_score_client.py
src/helm/proxy/clients/auto_client.py (outdated)
src/helm/config/model_metadata.yaml
src/helm/config/model_deployments.yaml
src/helm/proxy/clients/image_generation/__init__.py (outdated)
src/helm/benchmark/run_expander.py
src/helm/benchmark/test_model_properties.py (outdated)
src/helm/proxy/services/remote_service.py
src/helm/common/file_caches/file_cache.py (outdated)
src/helm/common/file_caches/local_file_cache.py (outdated)
Comment on lines +71 to +73
# Initialize `FileCache` for text-to-image model APIs
local_file_cache_path: str = os.path.join(self.cache_path, "output", host_organization)
file_cache: FileCache = LocalFileCache(local_file_cache_path, file_extension="png")
Collaborator

Make file_cache a provider_binding rather than a constant_binding, i.e. lazily create the folder in a lambda.

Member Author

Do you mind explaining provider_binding vs constant_binding a bit more, and sharing what the lambda would look like?

Collaborator

For provider_binding, the value should be a parameter-less function that returns the value, rather than the value itself. It can either be a lambda or a named method.
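The distinction can be illustrated with a toy injector. This is a sketch only: the `Injector` class and its method names here are stand-ins for illustration, not the actual HELM dependency-injection API.

```python
# Toy illustration of constant_binding vs provider_binding.
# The Injector class here is hypothetical, not HELM's real API.
import os
import tempfile


class Injector:
    def __init__(self):
        self._constants = {}
        self._providers = {}

    def constant_binding(self, key, value):
        # The value is computed eagerly, at registration time.
        self._constants[key] = value

    def provider_binding(self, key, provider):
        # `provider` is a parameter-less callable (lambda or named method);
        # the value is computed lazily, on first lookup.
        self._providers[key] = provider

    def get(self, key):
        if key in self._constants:
            return self._constants[key]
        return self._providers[key]()


def make_cache_dir(path):
    os.makedirs(path, exist_ok=True)
    return path


injector = Injector()
cache_dir = os.path.join(tempfile.mkdtemp(), "output", "openai")

# Lazy: the folder is only created when "file_cache" is first requested.
injector.provider_binding("file_cache", lambda: make_cache_dir(cache_dir))

assert not os.path.exists(cache_dir)              # nothing created at binding time
assert os.path.isdir(injector.get("file_cache"))  # created on first use
```

With a constant_binding, `make_cache_dir` would run at registration time and create the folder even if no client ever needs it; the provider defers that side effect.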

Collaborator

I'll send you a PR for this.

Collaborator

Opened PR #2175 for this.

with open(file_path, "wb") as f:
    f.write(compute())

return file_path
Collaborator

I think this should be relative to base_path, as opposed to relative to the current working directory? Currently this means that I can't change my cache from "prod_env/cache" to anything else without also having to migrate all my request strings.

Member Author

The problem is we need the images when computing metrics, but the metrics are not aware of base_path. They just have the paths in ScenarioState.

Collaborator

In this pull request, the MediaObject in RequestResponse in the ScenarioState requires a path. The HEIM pull request populates this path with things like prod_env/cache/openai/77437ea482144bf7b9275a0acee997db.png. I have a bunch of issues with this:

  1. The name file cache, and the location of the cache in prod_env/cache, suggests that it is temporary storage. However, it is not actually temporary - prod_env/cache contains the only copy of generated images, and deleting prod_env/cache results in loss of data that comprises part of the scenario_state.json files.
  2. The paths in the responses will have the prod_env/cache prefix baked in which means that if the user sets the --local-path flag to something else, this breaks all paths in the responses cache and the scenario_state.json files, and the user has to do some manual migration.
  3. It's not really clear what relative file paths should be relative to - this pull request takes the position that they should be relative to the current working directory, but there are advantages to making them relative to prod_env/cache, or to benchmark_output.
  4. We do not upload anything in prod_env/cache, which means that the publicly uploaded scenario_state.json files will contain paths to unavailable files.
  5. Even if I wanted to upload the files in prod_env/cache to a web server, I would be forced to upload them with prod_env/cache as part of the URL so that the links keep working.
  6. Images from different run specs and different run suites are mixed, so I can't upload the images for a specific suite or run - I'd have to write a script to do this.

Some possible alternatives:

  1. Remove FileCache: Store the image in the response cache. Both SQLite and MongoDB have binary blob types that do this (or we can base64 encode, which is less efficient).
  2. Write out the files to the run folder, i.e. /benchmark_output/run/some_scenario:model=some_model/media/77437ea482144bf7b9275a0acee997db.png. In the MediaObject in RequestResponse in the ScenarioState, the paths will be media/77437ea482144bf7b9275a0acee997db.png - relative to the location of scenario_state.json.
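Under alternative 2, consumers would resolve a MediaObject path against the directory containing scenario_state.json rather than the current working directory. A minimal sketch (the helper name is hypothetical, not part of HELM):

```python
# Sketch of resolving a media path recorded relative to scenario_state.json
# (alternative 2 above). resolve_media_path is a hypothetical helper.
import os


def resolve_media_path(scenario_state_path: str, media_path: str) -> str:
    """Resolve a MediaObject path against the run folder, not the CWD."""
    run_dir = os.path.dirname(scenario_state_path)
    return os.path.normpath(os.path.join(run_dir, media_path))


resolved = resolve_media_path(
    "benchmark_output/run/some_scenario:model=some_model/scenario_state.json",
    "media/77437ea482144bf7b9275a0acee997db.png",
)
# The resolved path lands inside the run's own media folder.
assert os.path.basename(resolved) == "77437ea482144bf7b9275a0acee997db.png"
```

Because the recorded path never mentions prod_env/cache or benchmark_output, the run folder can be moved, renamed, or uploaded wholesale without rewriting any paths.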

Summarizing some stuff from a Slack conversation:

@teetone:

I thought of another solution that is decent, quick to implement, and addresses most of the issues you mentioned. We can cache the images in prod_env when the clients receive a request. When we generate the ScenarioState, we can copy the files into benchmark_output and record the copied path in the ScenarioState, so benchmark_output will have its own copy of the images. The downside is that we will now have an extra copy of the images.
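This copy step can be sketched roughly as follows; `copy_media_to_run` is a hypothetical helper for illustration, not HELM code:

```python
# Rough sketch of the proposal: the client keeps its cached file under
# prod_env/cache, and at run time we copy it into the run's media folder,
# recording the copied path in the ScenarioState. Hypothetical helper.
import os
import shutil


def copy_media_to_run(cache_file: str, run_dir: str) -> str:
    media_dir = os.path.join(run_dir, "media")
    os.makedirs(media_dir, exist_ok=True)
    dest = os.path.join(media_dir, os.path.basename(cache_file))
    shutil.copyfile(cache_file, dest)
    return dest  # this is the path that would go into the ScenarioState
```

The cache copy remains untouched, so deleting prod_env/cache no longer loses the only copy of the generated images.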

Me:

That sounds reasonable as well - a second copy is fine because the cache copy is temporary, so we only have one permanent copy.
It wouldn't address the baked-in paths in the response cache, but I would be okay with compromising on that.
Also, with this, we wouldn't need the base64 encoding any more - we can just link to the actual file. So we actually save on storage by eliminating the base64-encoded copy (binary is more efficient than base64).

Percy:

Having a temporary cache in prod_env/cache makes sense because it mirrors the way we cache everything else.
But at that point, why not just store in mongo or the sqlite db?

To understand the proposal for scenario_state.json, the files would just be parallel to that? It seems like prod_env/cache should not appear in the paths.
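On the storage point raised above: base64 encodes every 3 input bytes as 4 output characters, so a base64 copy is about 33% larger than the raw binary. A quick check:

```python
# base64 overhead: 4 output bytes per 3 input bytes (~33% larger),
# which is why linking to the binary file saves storage.
import base64
import os

raw = os.urandom(3000)          # e.g. a small image payload
encoded = base64.b64encode(raw)
assert len(encoded) == 4000     # 3000 bytes -> 4000 base64 bytes
```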

@teetone requested a review from yifanmai, December 20, 2023 16:03
@yifanmai merged commit 346f765 into main, Dec 21, 2023
6 checks passed
@yifanmai deleted the heim_merge branch, December 21, 2023 00:23