
VHELM conf #2641

Merged: 45 commits, May 29, 2024

Commits
0eeef8b  add earth mover similarity + white (teetone, May 12, 2024)
815cf3b  image2structure schema (teetone, May 12, 2024)
285a6a1  clean up run entries (teetone, May 12, 2024)
2384ae5  fix schema (teetone, May 14, 2024)
63997ad  fix schema (teetone, May 14, 2024)
cc5efb0  rename (teetone, May 14, 2024)
318c53c  fix schema (teetone, May 14, 2024)
100dc21  update instructions (teetone, May 14, 2024)
a3d2d65  clean up schema (teetone, May 14, 2024)
08177fe  Merge branch 'main' of https://github.com/stanford-crfm/helm into i2s… (teetone, May 15, 2024)
75f1619  update schema (teetone, May 16, 2024)
bb3200f  general information (teetone, May 17, 2024)
abe4d84  check (teetone, May 19, 2024)
dbb8247  check (teetone, May 19, 2024)
1582583  check (teetone, May 19, 2024)
e51530e  story (teetone, May 23, 2024)
c59cc82  add both mementos subset (teetone, May 23, 2024)
75ce76d  fix (teetone, May 23, 2024)
1e67307  split up (teetone, May 23, 2024)
cc5eed8  include all bingo (teetone, May 23, 2024)
ea9ec69  include all bingo (teetone, May 23, 2024)
04465b3  Merge branch 'main' of https://github.com/stanford-crfm/helm into i2s… (teetone, May 24, 2024)
f8394f9  remove (teetone, May 24, 2024)
18d21e0  max train instances=3 (teetone, May 24, 2024)
0fc7db8  fix name (teetone, May 24, 2024)
d89f806  compute general information (teetone, May 24, 2024)
bbb7349  basic metrics (teetone, May 24, 2024)
b8b116f  resize image (teetone, May 25, 2024)
bdfe66d  disable for now (teetone, May 25, 2024)
e0f5afe  fix (teetone, May 25, 2024)
fb8a548  schema update (teetone, May 25, 2024)
c10d80e  fix schema (teetone, May 25, 2024)
c756390  palyra vision (teetone, May 25, 2024)
5381060  use truncate logic (teetone, May 25, 2024)
b40b81a  use truncate logic (teetone, May 25, 2024)
f94f450  change mapping (teetone, May 25, 2024)
3e1b87c  fix type check (teetone, May 25, 2024)
91b61b7  fix flake8 (teetone, May 25, 2024)
c1fbac5  specify groups (teetone, May 27, 2024)
f61946e  specify groups (teetone, May 27, 2024)
cac88a8  schema metrics (teetone, May 27, 2024)
0c40835  remove unused metrics (teetone, May 27, 2024)
972c6fd  image2structure difficulties (teetone, May 28, 2024)
360b06f  resolve merge conflict (teetone, May 29, 2024)
44ad7ed  set max_train_instances to 0 (teetone, May 29, 2024)
@@ -37,4 +37,4 @@ entries: [
# wild examples
{description: "image2webpage:subset=real,model=vlm", priority: 1, groups: ["image2webpage"]}
{description: "image2latex:subset=real,model=vlm", priority: 1, groups: ["image2webpage"]}
]
195 changes: 195 additions & 0 deletions src/helm/benchmark/presentation/run_entries_vhelm.conf
@@ -0,0 +1,195 @@
# Conf file for VHELM: Holistic Evaluation of Vision-Language Models (VLMs)
entries: [

################################################# Main experiments #################################################

####################################################################################################################
# Accuracy: Is the output semantically correct, given the text and image inputs?
####################################################################################################################

# Questions about natural images
{description: "vqa:model=vlm", priority: 1, groups: ["vqa_base"]}
{description: "viz_wiz:model=vlm", priority: 1}

# Image captioning
{description: "flickr30k:model=vlm", priority: 1}

####################################################################################################################
# Reasoning: Does the model understand objects, counts, and spatial and temporal relations?
# Can the model reason about both the text (e.g., negation, word order, etc.) and image (e.g., visual
# understanding or detection), i.e., visio-linguistic compositional reasoning?
####################################################################################################################

# Real-world visual reasoning
{description: "gqa:model=vlm", priority: 1}

# MathVista
{description: "math_vista:grade=elementary_school,question_type=multi_choice,model=vlm", priority: 1}
{description: "math_vista:grade=elementary_school,question_type=free_form,model=vlm", priority: 1}

{description: "math_vista:grade=high_school,question_type=multi_choice,model=vlm", priority: 1}
{description: "math_vista:grade=high_school,question_type=free_form,model=vlm", priority: 1}

{description: "math_vista:grade=college,question_type=multi_choice,model=vlm", priority: 1}
{description: "math_vista:grade=college,question_type=free_form,model=vlm", priority: 1}

{description: "math_vista:grade=daily_life,question_type=multi_choice,model=vlm", priority: 1}
{description: "math_vista:grade=daily_life,question_type=free_form,model=vlm", priority: 1}

# Website
{description: "image2webpage:subset=css,model=vlm", priority: 1, groups: ["image2webpage"]}

# Seed bench
{description: "seed_bench:subject=visual-reasoning,model=vlm", priority: 1}
{description: "seed_bench:subject=instance-interaction,model=vlm", priority: 1}

####################################################################################################################
# Knowledge: Does the model have knowledge about the world or specific domains?
####################################################################################################################

# A-OKVQA tests for general world knowledge
{description: "a_okvqa:model=vlm", priority: 1, groups: ["a_okvqa_base"]}

# MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
{description: "mmmu:subject=Accounting,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Agriculture,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Architecture_and_Engineering,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Art,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Art_Theory,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Basic_Medical_Science,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Biology,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Chemistry,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Clinical_Medicine,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Computer_Science,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Design,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Diagnostics_and_Laboratory_Medicine,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Economics,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Electronics,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Energy_and_Power,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Finance,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Geography,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=History,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Literature,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Manage,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Marketing,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Materials,question_type=multiple-choice,model=vlm", priority: 1}
# Covered by MathVista
# {description: "mmmu:subject=Math,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Mechanical_Engineering,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Music,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Pharmacy,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Physics,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Psychology,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Public_Health,question_type=multiple-choice,model=vlm", priority: 1}
{description: "mmmu:subject=Sociology,question_type=multiple-choice,model=vlm", priority: 1}

# MME (fine-grained tasks)
{description: "mme:subject=posters,model=vlm", priority: 1}
{description: "mme:subject=celebrity,model=vlm", priority: 1}
{description: "mme:subject=artwork,model=vlm", priority: 1}
{description: "mme:subject=landmark,model=vlm", priority: 1}

####################################################################################################################
# Originality: Does the model generate creative content (e.g., poetry, art)?
####################################################################################################################

# {description: "mementos:subject=comics,num_respondents=1,model=vlm", priority: 1}
# {description: "mementos:subject=dailylife,num_respondents=1,model=vlm", priority: 1}

####################################################################################################################
# Bias: Are the generations biased in demographic representation (e.g., gender, skin tone)?
####################################################################################################################

{description: "pairs:model=vlm,subset=occupations,person=black_man", priority: 1}
{description: "pairs:model=vlm,subset=occupations,person=black_woman", priority: 1}
{description: "pairs:model=vlm,subset=occupations,person=white_man", priority: 1}
{description: "pairs:model=vlm,subset=occupations,person=white_woman", priority: 1}

{description: "pairs:model=vlm,subset=potential_crime,person=black_man", priority: 1}
{description: "pairs:model=vlm,subset=potential_crime,person=black_woman", priority: 1}
{description: "pairs:model=vlm,subset=potential_crime,person=white_man", priority: 1}
{description: "pairs:model=vlm,subset=potential_crime,person=white_woman", priority: 1}

{description: "pairs:model=vlm,subset=status,person=black_man", priority: 1}
{description: "pairs:model=vlm,subset=status,person=black_woman", priority: 1}
{description: "pairs:model=vlm,subset=status,person=white_man", priority: 1}
{description: "pairs:model=vlm,subset=status,person=white_woman", priority: 1}

####################################################################################################################
# Fairness: Does the model exhibit performance disparities across social groups (e.g., gender, dialect)?
####################################################################################################################

{description: "vqa:model=vlm,data_augmentation=dialect_deterministic", priority: 1, groups: ["vqa_dialect"]}
{description: "a_okvqa:model=vlm,data_augmentation=dialect_deterministic", priority: 1, groups: ["a_okvqa_dialect"]}

# The Crossmodal-3600 dataset can also measure geographic bias and robustness.
# Geographic bias refers to the tendency to favor or prioritize information, perspectives, resources,
# or experiences from certain geographic locations over others.
{description: "crossmodal_3600:model=vlm,location=english,language=english", priority: 1}
{description: "crossmodal_3600:model=vlm,location=spanish,language=english", priority: 1}
{description: "crossmodal_3600:model=vlm,location=chinese,language=english", priority: 1}
{description: "crossmodal_3600:model=vlm,location=hindi,language=english", priority: 1}

{description: "crossmodal_3600:model=vlm,location=cusco_quechua,language=english", priority: 1}
{description: "crossmodal_3600:model=vlm,location=maori,language=english", priority: 1}
{description: "crossmodal_3600:model=vlm,location=swahili,language=english", priority: 1}
{description: "crossmodal_3600:model=vlm,location=telugu,language=english", priority: 1}

####################################################################################################################
# Toxicity: Does the model generate toxic or inappropriate content? Can the model identify toxic
# or inappropriate content?
####################################################################################################################

{description: "hateful_memes:model=vlm", priority: 1}

{description: "mm_safety_bench:subset=illegal_activity,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=hate_speech,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=malware_generation,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=physical_harm,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=economic_harm,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=fraud,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=sex,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=political_lobbying,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=privacy_violence,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=legal_opinion,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=financial_advice,model=vlm", priority: 1}
{description: "mm_safety_bench:subset=health_consultation,model=vlm", priority: 1}
# Has some examples related to bias
{description: "mm_safety_bench:subset=government_decision,model=vlm", priority: 1}

####################################################################################################################
# Robustness: Is the model robust to invariant input (text/image) perturbations?
####################################################################################################################

{description: "vqa:model=vlm,data_augmentation=robustness", priority: 1, groups: ["vqa_robustness"]}
{description: "a_okvqa:model=vlm,data_augmentation=robustness", priority: 1, groups: ["a_okvqa_robustness"]}

{description: "unicorn:subject=OODCV-VQA,model=vlm", priority: 1}
{description: "unicorn:subject=Sketchy-VQA,model=vlm", priority: 1}

{description: "bingo:subject=Region,model=vlm", priority: 1}
{description: "bingo:subject=OCR,model=vlm", priority: 1}
{description: "bingo:subject=Factual,model=vlm", priority: 1}
{description: "bingo:subject=T2I,model=vlm", priority: 1}
{description: "bingo:subject=I2I,model=vlm", priority: 1}

{description: "pope:model=vlm", priority: 1}

####################################################################################################################
# Multilinguality: Are languages other than English supported?
####################################################################################################################
{description: "vqa:model=vlm,data_augmentation=chinese", priority: 1, groups: ["vqa_chinese"]}
{description: "vqa:model=vlm,data_augmentation=hindi", priority: 1, groups: ["vqa_hindi"]}
{description: "vqa:model=vlm,data_augmentation=spanish", priority: 1, groups: ["vqa_spanish"]}

{description: "a_okvqa:model=vlm,data_augmentation=chinese", priority: 1, groups: ["a_okvqa_chinese"]}
{description: "a_okvqa:model=vlm,data_augmentation=hindi", priority: 1, groups: ["a_okvqa_hindi"]}
{description: "a_okvqa:model=vlm,data_augmentation=spanish", priority: 1, groups: ["a_okvqa_spanish"]}


############################################## Additional experiments ##############################################

{description: "vqa:model=vlm,max_train_instances=all", priority: 1, groups: ["vqa_ablation_in_context"]}
{description: "a_okvqa:model=vlm,max_train_instances=all", priority: 1, groups: ["a_okvqa_ablation_in_context"]}

]
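Each run entry above packs a scenario name and its arguments into a single `description` string of the form `scenario:key=value,key=value`. HELM's actual parsing lives in its run-spec machinery; the following is only a hypothetical, minimal Python sketch of that string syntax (the function name `parse_description` is invented for illustration):

```python
# Hypothetical sketch of splitting a run entry description string into a
# scenario name and an argument dict. This illustrates the
# "scenario:key=value,key=value" syntax only; it is not HELM's real parser.

def parse_description(description: str) -> tuple[str, dict[str, str]]:
    """Split 'scenario:k1=v1,k2=v2' into ('scenario', {'k1': 'v1', 'k2': 'v2'})."""
    scenario, _, arg_str = description.partition(":")
    args: dict[str, str] = {}
    if arg_str:
        for pair in arg_str.split(","):
            key, _, value = pair.partition("=")
            args[key] = value
    return scenario, args


scenario, args = parse_description(
    "math_vista:grade=college,question_type=free_form,model=vlm"
)
print(scenario)       # math_vista
print(args["grade"])  # college
```

Under this reading, an entry like `"pope:model=vlm"` names the `pope` scenario and binds only the model group, while the longer MMMU entries bind `subject` and `question_type` as well; `priority` and `groups` live outside the string as ordinary HOCON fields.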
22 changes: 22 additions & 0 deletions src/helm/benchmark/presentation/run_entries_vhelm_debug.conf
@@ -0,0 +1,22 @@
# Conf file for VHELM: Holistic Evaluation of Vision-Language Models (VLMs)
entries: [
{description: "pope:model=vlm", priority: 1}

{description: "unicorn:subject=OODCV-VQA,model=vlm", priority: 1}
{description: "unicorn:subject=Sketchy-VQA,model=vlm", priority: 1}

{description: "bingo:subject=Region,model=vlm", priority: 1}
{description: "bingo:subject=OCR,model=vlm", priority: 1}
{description: "bingo:subject=Factual,model=vlm", priority: 1}
{description: "bingo:subject=T2I,model=vlm", priority: 1}
{description: "bingo:subject=I2I,model=vlm", priority: 1}

{description: "seed_bench:subject=visual-reasoning,model=vlm", priority: 1}
{description: "seed_bench:subject=instance-interaction,model=vlm", priority: 1}

{description: "mme:subject=posters,model=vlm", priority: 1}
{description: "mme:subject=celebrity,model=vlm", priority: 1}
{description: "mme:subject=artwork,model=vlm", priority: 1}
{description: "mme:subject=landmark,model=vlm", priority: 1}

]