diff --git a/README.md b/README.md
index 717a898..4e17a13 100644
--- a/README.md
+++ b/README.md
@@ -3,16 +3,16 @@
LLM Comparator is an interactive visualization tool for analyzing side-by-side
LLM evaluation results. It is designed to help people qualitatively analyze how
responses from two models differ at example- and slice-levels. Users can
-interactively discover insights like "Model A's responses are better than B's on
-email rewriting tasks because Model A tends to generate bulleted lists more
-often."
+interactively discover insights like *"Model A's responses are better than B's
+on email rewriting tasks because Model A tends to generate bulleted lists more
+often."*
![Screenshot of LLM Comparator interface](documentation/images/llm_comparator_screenshot.png)
## Using LLM Comparator
-You can open LLM Comparator at https://pair-code.github.io/llm-comparator/.
+You can play with LLM Comparator at https://pair-code.github.io/llm-comparator/.
You can either select one of the example files we provide, or you can upload
your own JSON file (e.g.,
@@ -25,19 +25,19 @@ that follows our format which we describe below.
We provide an example file for comparing
the model responses between [Gemma](https://ai.google.dev/gemma) 1.1 and 1.0
for prompts obtained from the
-[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations). You can click the link below to play with it:
+[Chatbot Arena Conversations dataset](https://huggingface.co/datasets/lmsys/chatbot_arena_conversations).
+You can click the link below to play with it:
https://pair-code.github.io/llm-comparator/?results_path=https://pair-code.github.io/llm-comparator/data/example_arena.json
The tool helps you analyze *when* and *why* Gemma 1.1 is better or worse than
-1.0 and *how* responses from two models qualitatively differ.
+1.0 and *how* responses from two models differ.
-- ***When***: The **Score Distribution** panel shows that the quality of
-responses from Model A (Gemma 1.1) is considered better than that from Model B
-(Gemma 1.0) (larger blue area than orange),
-according to the LLM-based evaluation method
+- ***When***: The **Score Distribution** and **Metrics by Prompt Category**
+panels show that the quality of responses from Model A (Gemma 1.1) is considered
+better than that from Model B (Gemma 1.0) (larger blue area than orange;
+>50% win rate), according to the LLM-based evaluation method
([LLM-as-a-judge](https://arxiv.org/abs/2306.05685)).
-This holds true for most prompt categories
-(as in **Metrics by Prompt Category** panel).
+This holds true for most prompt categories (e.g., Humanities, Math).
- ***Why***: The **Rationale Summary** panel dives into the reasons behind these
score differences.
In this case, the LLM judge focused mostly on the amount of details. It also
@@ -60,8 +60,8 @@ must follow the schema described below.
We assume that a user has a set of input prompts to test. For each prompt, they
need to prepare the responses to the prompt from two LLMs (i.e., Model A, Model
-B), and a numerical score obtained from automatic side-by-side evaluation (also
-known as [LLM-as-a-judge](https://arxiv.org/abs/2306.05685) or
+B), and a numerical score obtained from side-by-side evaluation (e.g.,
+[LLM-as-a-judge](https://arxiv.org/abs/2306.05685),
[AutoSxS](https://cloud.google.com/vertex-ai/generative-ai/docs/models/side-by-side-eval)).
A positive score represents that A's response is better than B's; a negative
score indicates B is better; and zero meaning a tie.
@@ -83,7 +83,7 @@ All the fields presented below are required.
"examples": [
{
"input_text": "This is a prompt.",
- "tags": ["Coding"], # A list of keywords for categorizing prompts
+ "tags": ["Math"], # A list of keywords for categorizing prompts
"output_text_a": "Response to the prompt from the first model (A)",
"output_text_b": "Response to the prompt from the other model (B)",
"score": -1.25, # Score from the judge LLM
@@ -100,13 +100,13 @@ All the fields presented below are required.
### Additional Data
-Users can optionally provide additional information to be analyzed in LLM
+You can optionally provide additional information to be analyzed in LLM
Comparator.
#### Custom Fields
If you have additional information about each prompt, it can be displayed as
-a column in the table and aggregated information is visualized as a chart
+columns in the table and aggregated information is visualized as charts
on the right side of the interface. It supports various data types, such as:
- `number`: Numeric data, visualized as histograms (e.g., word count for prompt,
@@ -231,18 +231,18 @@ npm run serve
## Citing LLM Comparator
-If you use LLM Comparator as part of your work, please cite our paper at
-https://arxiv.org/abs/2402.10524.
+If you use LLM Comparator as part of your work, please cite our research paper
+at https://arxiv.org/abs/2402.10524.
```
@inproceedings{kahng2024comparator,
- title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of
- Large Language Models},
+ title={{LLM Comparator}: Visual Analytics for Side-by-Side Evaluation of Large Language Models},
author={Kahng, Minsuk and Tenney, Ian and Pushkarna, Mahima and Liu, Michael Xieyang and Wexler, James and Reif, Emily and Kallarackal, Krystal and Chang, Minsuk and Terry, Michael and Dixon, Lucas},
- booktitle={Extended Abstracts of the CHI Conference on Human Factors in
- Computing Systems},
+ booktitle={Extended Abstracts of the CHI Conference on Human Factors in Computing Systems},
year={2024},
publisher={ACM},
+ doi={10.1145/3613905.3650755},
+ url={https://arxiv.org/abs/2402.10524}
}
```
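Note on the README and `client/app.ts` hunks: both list the properties an input file must contain, and `state_service.ts` still carries TODOs for validating uploaded files. The sketch below shows one way such a check could look. `findMissingFields`, `isRecord`, and the two constant arrays are hypothetical names introduced for illustration and are not part of the LLM Comparator codebase. The key lists are copied from the error message text in `client/app.ts`.

```typescript
// Hypothetical helper for illustration; not part of the LLM Comparator codebase.
// Key lists mirror the error message text in client/app.ts.
const REQUIRED_TOP_LEVEL_KEYS = ['metadata', 'models', 'examples'];
const REQUIRED_EXAMPLE_KEYS = [
  'input_text',
  'tags',
  'output_text_a',
  'output_text_b',
  'score',
];

// Narrow an unknown value to a plain (non-array) object.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Return the paths of all required properties that are absent.
// An empty array means every required key is present.
function findMissingFields(data: unknown): string[] {
  if (!isRecord(data)) {
    return ['(root is not a JSON object)'];
  }
  const missing: string[] = [];
  for (const key of REQUIRED_TOP_LEVEL_KEYS) {
    if (!(key in data)) {
      missing.push(key);
    }
  }
  const examples = data['examples'];
  if (Array.isArray(examples)) {
    examples.forEach((example: unknown, index: number) => {
      for (const key of REQUIRED_EXAMPLE_KEYS) {
        if (!isRecord(example) || !(key in example)) {
          missing.push(`examples[${index}].${key}`);
        }
      }
    });
  }
  return missing;
}
```

This only checks that keys are present. It does not check value types, so a `score` of `null` or a non-array `tags` value would still pass.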
diff --git a/client/app.ts b/client/app.ts
index 46879e0..df665bc 100644
--- a/client/app.ts
+++ b/client/app.ts
@@ -15,7 +15,6 @@
* limitations under the License.
*/
-// tslint:disable:g3-no-void-expression
// tslint:disable:no-new-decorators
import './components/charts';
import './components/custom_functions';
@@ -89,14 +88,14 @@ export class LlmComparatorAppElement extends MobxLitElement {
- The json file must contain these three properties: "metadata", "models",
- and "examples".
+ The json file must contain these three properties:
+ metadata,
+ models,
+ and examples.
- Each example must have "input_text", "tags", "output_text_a",
- "output_text_b", and "score".
+ Each example in examples must have
+ input_text,
+ tags,
+ output_text_a,
+ output_text_b,
+ and score.
Please refer to our document for details:
${documentationLink}
@@ -94,7 +100,7 @@ export class DatasetSelectionElement extends MobxLitElement {
const textareaPlaceholder = 'Enter a URL to load the json file from.';
const urlLoadPath =
- this.appState.appLink + '?results_path=https://.../results.json';
+ this.appState.appLink + '?results_path=https://.../...json';
const panelIntro = html`
Enter the URL path of a json file prepared for LLM Comparator.`;
const panelOutro = html`
diff --git a/client/components/example_details.ts b/client/components/example_details.ts
index 52ca96d..146e212 100644
--- a/client/components/example_details.ts
+++ b/client/components/example_details.ts
@@ -153,7 +153,7 @@ export class ExampleDetailsElement extends MobxLitElement {
`;
}
- // TODO(b/311725252): Create a separate data-table component.
+ // TODO: Create a separate data-table component.
private renderRaterTable() {
const selectedExample = this.selectedExample;
if (selectedExample == null) {
@@ -237,18 +237,17 @@ export class ExampleDetailsElement extends MobxLitElement {
`;
const styleScore = classMap({
'score': true,
@@ -467,8 +476,6 @@ export class ExampleTableElement extends MobxLitElement {
) :
html``;
- const styleHolder = this.styleHolder(example);
-
// Custom fields.
const renderCustomField = (field: Field, columnIndex: number) => {
if (field.type === FieldType.PER_RATING_PER_MODEL_CATEGORY) {
diff --git a/client/components/histogram.ts b/client/components/histogram.ts
index 0d06786..e9e4ac8 100644
--- a/client/components/histogram.ts
+++ b/client/components/histogram.ts
@@ -38,7 +38,7 @@ import {styles} from './histogram.css';
/**
* Component for histograms for the distribution of scores or custom funcs.
- * TODO(b/311744307): Extract common parts in the bar chart.
+ * TODO: Extract common parts in the bar chart.
*/
@customElement('comparator-histogram')
export class HistogramElement extends MobxLitElement {
diff --git a/client/components/metrics_by_slice.css b/client/components/metrics_by_slice.css
index 0b19e7f..a6de395 100644
--- a/client/components/metrics_by_slice.css
+++ b/client/components/metrics_by_slice.css
@@ -1,3 +1,8 @@
+thead {
+ position: sticky;
+ top: 0;
+}
+
th.score-avg {
width: 98px; /* width sum for score-avg-number and score-avg-chart */
}
@@ -111,9 +116,9 @@ rect.bar.win-rate-result-tie {
fill: var(--comparator-grey-400);
}
-.collapsed {
+.collapsed .table-container {
max-height: 220px;
- overflow-y: hidden;
+ overflow-y: scroll;
}
line.middle-point-vertical {
diff --git a/client/components/metrics_by_slice.ts b/client/components/metrics_by_slice.ts
index 5b2c5c6..d2a31cf 100644
--- a/client/components/metrics_by_slice.ts
+++ b/client/components/metrics_by_slice.ts
@@ -57,7 +57,7 @@ enum SortOrder {
}
/**
- * Component for visualizing Autorater scores.
+ * Component for visualizing avg score and win rate metrics by slice.
*/
@customElement('comparator-metrics-by-slice')
export class MetricsBySliceElement extends MobxLitElement {
@@ -257,6 +257,7 @@ export class MetricsBySliceElement extends MobxLitElement {
}
}
+ // Render a confidence interval chart for average scores.
private renderScoreConfIntervalChart(
avgScore: number | null,
intervalLeft: number,
@@ -296,7 +297,7 @@ export class MetricsBySliceElement extends MobxLitElement {
),
});
- // TODO(b/325506046): Use tooltip for confidence interval details.
+ // TODO: Use tooltip for confidence interval details.
const tooltipText = `${`95% CI: [${intervalLeft.toFixed(
3,
)}, ${intervalRight.toFixed(3)}]`}`;
@@ -356,10 +357,6 @@ export class MetricsBySliceElement extends MobxLitElement {
),
});
- // Only visualizes when individual rating data are available.
- // If unavailable, it is likely the cases where ratings are flattened and
- // same prompts are repeated (like in Bard Eval).
- // For these cases, confidence intervals would be misleading.
const renderScoreConfIntervalChart = this.renderScoreConfIntervalChart(
avgScore,
intervalLeft,
@@ -376,6 +373,7 @@ export class MetricsBySliceElement extends MobxLitElement {
`;
}
+ // Render a win rate chart using a stacked percentage bar chart.
private renderWinRateChart(
winRate: number,
entry: SliceWinRate,
@@ -419,7 +417,7 @@ export class MetricsBySliceElement extends MobxLitElement {
y2=${this.barHeight * 0.5} />`
: svg``;
- // TODO(b/325506046): Use tooltip for confidence interval details.
+ // TODO: Use tooltip for confidence interval details.
const tooltipText =
intervalLeft != null && intervalRight != null
? `${`95% CI: [${intervalLeft.toFixed(3)}, ${intervalRight.toFixed(
@@ -618,7 +616,6 @@ export class MetricsBySliceElement extends MobxLitElement {
}
override render() {
- // prettier-ignore
return html`${this.renderWinRateBySliceChart()}`;
}
}
diff --git a/client/components/rationale_summary.css b/client/components/rationale_summary.css
index b0579cc..21ce424 100644
--- a/client/components/rationale_summary.css
+++ b/client/components/rationale_summary.css
@@ -4,7 +4,7 @@ th.example-count {
}
th.remove {
- width: 32px;
+ width: 26px;
}
text.bar-count-text {
diff --git a/client/components/rationale_summary.ts b/client/components/rationale_summary.ts
index d7a0b2f..9b2e829 100644
--- a/client/components/rationale_summary.ts
+++ b/client/components/rationale_summary.ts
@@ -51,7 +51,7 @@ export class RationaleSummaryElement extends MobxLitElement {
private readonly widthOfNumberLabel = 10;
// Whether to show the "others" category (id=0).
- // TODO(kahng): Implemented, but decided not to display it for now.
+ // TODO: Implemented, but decided not to display it for now.
@observable showOthers = false;
@observable sortColumn = 'A'; // label, A, or B
@@ -297,10 +297,7 @@ export class RationaleSummaryElement extends MobxLitElement {
What are some clusters of the rationales used by the rater
- when it thinks
- ${this.sortColumn === 'A' || this.sortColumn === 'B'
- ? `${this.sortColumn}`
- : 'either A or B'} is better?
+ when it thinks A or B is better?
diff --git a/client/components/score_histogram.ts b/client/components/score_histogram.ts
index f7d194a..8e0b1d9 100644
--- a/client/components/score_histogram.ts
+++ b/client/components/score_histogram.ts
@@ -31,7 +31,7 @@ import {AppState} from '../services/state_service';
import {styles} from './score_histogram.css';
/**
- * Component for visualizing Autorater scores.
+ * Component for visualizing the score distribution as a histogram.
*/
@customElement('comparator-score-histogram')
export class ScoreHistogramElement extends MobxLitElement {
diff --git a/client/components/settings.ts b/client/components/settings.ts
index 43a60e5..81e6801 100644
--- a/client/components/settings.ts
+++ b/client/components/settings.ts
@@ -77,7 +77,7 @@ declare global {
}
/**
- * Renders the data table settings pop-up.
+ * Renders the data table settings pop-up on the left side.
*/
@customElement('comparator-settings')
export class ComparatorSettingsElement extends MobxLitElement {
diff --git a/client/components/toolbar.ts b/client/components/toolbar.ts
index 006daab..0a01f6c 100644
--- a/client/components/toolbar.ts
+++ b/client/components/toolbar.ts
@@ -28,7 +28,7 @@ import {AppState} from '../services/state_service';
import {styles} from './toolbar.css';
/**
- * Toolbar component.
+ * Toolbar component at the top of the main table.
*/
@customElement('comparator-toolbar')
export class ToolbarElement extends MobxLitElement {
@@ -260,39 +260,43 @@ export class ToolbarElement extends MobxLitElement {
`;
}
}
diff --git a/client/lib/types.ts b/client/lib/types.ts
index 545dfe1..f06c48b 100644
--- a/client/lib/types.ts
+++ b/client/lib/types.ts
@@ -46,7 +46,7 @@ export interface IndividualRating {
rating_label: string | null;
is_flipped: boolean | null;
rationale: string | null;
- // TODO(b/324469307): Support more types.
+ // TODO: Support more types.
custom_fields: {
[key: string]: string | Array;
};
@@ -100,10 +100,10 @@ export interface SequenceChunk {
// tslint:disable:enforce-name-casing
export interface Example {
index: number;
- input_text: string | SequenceChunk[];
- output_text_a: string | SequenceChunk[];
- output_text_b: string | SequenceChunk[];
+ input_text: string|SequenceChunk[];
tags: string[];
+ output_text_a: string|SequenceChunk[];
+ output_text_b: string|SequenceChunk[];
score: number | null;
individual_rater_scores: IndividualRating[];
rationale_list: RationaleListItem[];
@@ -171,7 +171,6 @@ export interface CustomFieldSchema {
export interface Metadata {
source_path: string;
custom_fields_schema: CustomFieldSchema[];
- sampling_step_size: number;
}
/**
@@ -343,7 +342,7 @@ export interface HistogramSpec {
/**
* Interface for a custom field for ratings selection.
* (only supporting per_rating_per_model_category for now)
- * TODO(b/324469307): Support more per-rating types.
+ * TODO: Support more per-rating types.
*/
export interface RatingChartSelection {
fieldId: string;
diff --git a/client/lib/utils.ts b/client/lib/utils.ts
index 6f216f0..77aa272 100644
--- a/client/lib/utils.ts
+++ b/client/lib/utils.ts
@@ -677,9 +677,8 @@ export function getBarFilterLabel(
* Helper for cleaning LLM-generated values.
*/
export function cleanValue(val: string | null) {
- // There exist many variants of "issues" (e.g., "No Issues", "No issues(s),
- // "Major issue(s)", etc. We detect them and replace with "issues".
- return val == null ? val : val.replace(/\bissues?\(?(s)?\)?$/i, 'issues');
+ // We can include some manual data cleaning pipelines.
+ return val;
}
/**
diff --git a/client/services/state_service.ts b/client/services/state_service.ts
index fbe62a9..76b56c2 100644
--- a/client/services/state_service.ts
+++ b/client/services/state_service.ts
@@ -21,7 +21,7 @@ import {computed, makeObservable, observable} from 'mobx';
import {BUILT_IN_DEMO_FILES, DEFAULT_COLUMN_LIST, DEFAULT_HISTOGRAM_SPEC, DEFAULT_NUM_EXAMPLES_TO_DISPLAY, DEFAULT_RATIONALE_CLUSTER_SIMILARITY_THRESHOLD, DEFAULT_SORTING_CRITERIA, DEFAULT_WIN_RATE_THRESHOLD, FIELD_ID_FOR_INPUT, FIELD_ID_FOR_OUTPUT_A, FIELD_ID_FOR_OUTPUT_B, FIELD_ID_FOR_RATIONALE_LIST, FIELD_ID_FOR_RATIONALES, FIELD_ID_FOR_SCORE, FIVE_POINT_LIKERT_HISTOGRAM_SPEC, INITIAL_CUSTOM_FUNCTIONS,} from '../lib/constants';
import type {ChartSelectionKey, CustomFieldSchema, CustomFunction, Example, Field, HistogramSpec, IndividualRating, Metadata, Model, RatingChartSelection, RationaleCluster, RationaleListItem, SortCriteria,} from '../lib/types';
import {AOrB, ChartType, CustomFuncReturnType, DataResponse, ErrorResponse, FieldType, SortColumn, SortOrder,} from '../lib/types';
-import {compareNumbersWithNulls, compareStringsWithNulls, computeSimilaritiesBetweenVectorAndNormalizedMatrix, convertToNumber, extractTextFromTextOrSequenceChunks, getFieldIdForCustomFunc, getHistogramBinIndexFromValue, getMinAndMax, groupByAndSortKeys, groupByValues, initializeCustomFuncSelections, isPerRatingFieldType, mergeTwoArrays, normalizeVector, searchText,} from '../lib/utils';
+import {compareNumbersWithNulls, compareStringsWithNulls, convertToNumber, extractTextFromTextOrSequenceChunks, getFieldIdForCustomFunc, getHistogramBinIndexFromValue, getMinAndMax, groupByAndSortKeys, groupByValues, initializeCustomFuncSelections, isPerRatingFieldType, mergeTwoArrays, searchText,} from '../lib/utils';
import {CustomFunctionService} from './custom_function_service';
import {Service} from './service';
@@ -35,13 +35,7 @@ export class AppState extends Service {
makeObservable(this);
}
- @observable datasetPath: string | null = null;
- @observable isDatasetPathUploadedFile = false;
- @observable isOpenDatasetSelectionPanel = true;
-
- @observable targetTeam = 'app'; // app, gemini, bard, etc.
- @observable exampleDatasetPaths: string[] = BUILT_IN_DEMO_FILES;
-
+ // Fields from data files.
@observable
metadata: Metadata = {
source_path: '',
@@ -52,11 +46,27 @@ export class AppState extends Service {
@observable examples: Example[] = [];
@observable rationaleClusters: RationaleCluster[] = [];
+ // Dataset path.
+ @observable datasetPath: string|null = null;
+ @observable isDatasetPathUploadedFile = false;
+ @observable isOpenDatasetSelectionPanel = true;
+
+ @observable exampleDatasetPaths: string[] = BUILT_IN_DEMO_FILES;
+
+ // Tags.
+ @observable selectedTag: string|null = null;
+
+ // Table sorting.
@observable currentSorting: SortCriteria = DEFAULT_SORTING_CRITERIA;
- @observable selectedExample: Example | null = null;
- @observable selectedTag: string | null = null;
+ // Example expansion (key: index). If a key does not exist, assume false.
+ @observable isExampleExpanded: {[key: number]: boolean} = {};
+ getIsExampleExpanded(index: number): boolean {
+ return this.isExampleExpanded[index] ?? false;
+ }
+ // Example details.
+ @observable selectedExample: Example|null = null;
@observable showSelectedExampleDetails = false;
@observable exampleDetailsPanelExpanded = false;
@@ -75,7 +85,7 @@ export class AppState extends Service {
DEFAULT_RATIONALE_CLUSTER_SIMILARITY_THRESHOLD;
// Columns.
- // TODO(b/315147299): Use a url service to sync the visibility state.
+ // TODO: Use a url service to sync the visibility state.
@observable columns: Field[] = DEFAULT_COLUMN_LIST;
// Charts.
@@ -88,7 +98,7 @@ export class AppState extends Service {
// For simple bar charts, a single-item array, e.g., [null] (non-selected);
// for grouped bar charts, a two-item array, e.g., ['sports', null]
// (if the bar for 'sports' is selected for A; no bars are selected for B).
- // TODO(b/315722619): Merge selection variables into one.
+ // TODO: Merge selection variables into one.
@observable selectedBarChartValues: {[key: string]: Array} =
{};
@@ -209,8 +219,6 @@ export class AppState extends Service {
);
}
- @observable sampleCountForCheckingRatingLevelDataAvailability = 10;
-
// Custom functions.
@observable customFunctions: {[key: number]: CustomFunction} = {};
@@ -281,7 +289,7 @@ export class AppState extends Service {
return (
this.columns
.filter((field: Field) => field.type === FieldType.PER_MODEL_NUMBER)
- // TODO(b/315388387): Will not need when custom functions are
+ // TODO: Will not need when custom functions are
// merged.
.filter((field: Field) => field.id.startsWith('custom_field:')));
}
@@ -297,7 +305,7 @@ export class AppState extends Service {
field.type === FieldType.PER_RATING_PER_MODEL_CATEGORY,
)
// Exclude custom functions
- // TODO(b/315388387): Will not need when custom functions are
+ // TODO: Will not need when custom functions are
// merged.
.filter((field: Field) => field.id.startsWith('custom_field:')));
}
@@ -489,7 +497,7 @@ export class AppState extends Service {
return examples;
}
- // TODO(b/326139568): Merge with the side-by-side histograms.
+ // TODO: Merge with the side-by-side histograms.
private applyHistogramFilterForCustomFuncs(
examplesBeforeThisFilter: Example[],
excludeId: number|null = null,
@@ -991,7 +999,9 @@ export class AppState extends Service {
this.selectedTag = null;
this.selectedCustomFuncId = null;
- // TODO(b/315722619) Merge selection variables.
+ this.isExampleExpanded = {};
+
+ // TODO: Merge selection variables.
this.selectedHistogramBinForScores = null;
this.selectedHistogramBinForCustomFields = {};
this.selectedBarChartValues = {};
@@ -1035,8 +1045,7 @@ export class AppState extends Service {
params[key] = decodeURIComponent(value);
}
- // Get results_path (and cns_path) parameter from url.
- // The cns_path is for those who have used the older versions.
+ // Get path parameters from url.
if (params.hasOwnProperty('results_path')) {
const datasetPath = params['results_path'];
// Get max examples parameter from url.
@@ -1058,9 +1067,6 @@ export class AppState extends Service {
samplingStepSize,
columnsToHide,
);
- } else if (params.hasOwnProperty('cns_path')) {
- const datasetPath = params['cns_path'];
- this.loadData(datasetPath, null);
}
}
@@ -1129,11 +1135,11 @@ export class AppState extends Service {
} else {
this.histogramSpecForScores = DEFAULT_HISTOGRAM_SPEC;
}
- // TODO(b/338112225): Support custom higher ranges (e.g., 5.0 to -5.0).
+ // TODO: Support custom higher ranges (e.g., 5.0 to -5.0).
}
// Add histogram spec for custom functions with return type number.
- // TODO(b/326139568): Merge with the side-by-side histograms.
+ // TODO: Merge with the side-by-side histograms.
private addHistogramSpecForCustomFunc(customFunc: CustomFunction) {
if (customFunc.returnType === CustomFuncReturnType.NUMBER) {
const fieldId = getFieldIdForCustomFunc(customFunc.id);
@@ -1258,7 +1264,7 @@ export class AppState extends Service {
});
}
- // Load data either from the server or uploaded file.
+ // Load data either from a specified path or uploaded file.
async loadData(
datasetPath: string,
fileObject: File | null = null,
@@ -1274,7 +1280,7 @@ export class AppState extends Service {
try {
// Load data from the uploaded file.
const fileContent = await this.readFileContent(fileObject);
- // TODO(b/333119821): Validate the format of the uploaded file.
+ // TODO: Validate the format of the uploaded file.
const jsonResponse = JSON.parse(fileContent);
dataResponse = jsonResponse as DataResponse;
} catch (error) {
@@ -1299,7 +1305,7 @@ export class AppState extends Service {
throw new Error(errorMessage);
}
if (response.status === 502) {
- // TODO(b/316021912): Use a corp domain url.
+ // TODO: Use a corp domain url.
const errorMessage =
'Failed to load the dataset. The server may not exist anymore, ' +
'possibly with updated URLs. Try opening this URL ' +
@@ -1332,7 +1338,7 @@ export class AppState extends Service {
// Assign indices to examples.
example.index = index;
- // TODO(b/338112784): Check if all the required fields exist.
+ // TODO: Check if all the required fields exist.
// Assign indices to individual ratings.
example.individual_rater_scores.forEach(
@@ -1443,7 +1449,7 @@ export class AppState extends Service {
this.customFieldsOfPerRatingType.forEach((ratingField: Field) => {
// Change the key from field name to field id.
this.examples.forEach((ex: Example) => {
- // TODO(b/324469307): Support more per-rating types.
+ // TODO: Support more per-rating types.
if (ratingField.type === FieldType.PER_RATING_STRING) {
ex.individual_rater_scores.forEach((rating: IndividualRating) => {
// Change the key from field name to field id.
@@ -1468,7 +1474,7 @@ export class AppState extends Service {
// Perform group-by aggregations over ratings.
this.examples.forEach((ex: Example) => {
- // TODO(b/324469307): Support more per-rating types.
+ // TODO: Support more per-rating types.
if (ratingField.type === FieldType.PER_RATING_STRING) {
// Simply concatenate strings.
ex.custom_fields[ratingField.id] = ex.individual_rater_scores
@@ -1537,17 +1543,12 @@ export class AppState extends Service {
this.runCustomFunction(this.examples, customFunc);
});
- const statusMessage = `Loaded the dataset of ${
- this.examples.length
- } examples.${
- this.metadata.sampling_step_size > 1
- ? ` Because of the large size, we sampled data from every ${this.metadata.sampling_step_size} examples.`
- : ''
- }`;
+ const statusMessage =
+ `Loaded the dataset of ${this.examples.length} examples.`;
this.updateStatusMessage(statusMessage, true);
// Update URL.
- // TODO(b/315147299): Create a URL service to keep URL and app in sync.
+ // TODO: Create a URL service to keep URL and app in sync.
const url = new URL(window.location.href);
if (this.isDatasetPathUploadedFile === false) {
url.searchParams.set('results_path', this.datasetPath);
@@ -1656,6 +1657,7 @@ export class AppState extends Service {
}
}
+ // Remove a rationale cluster row.
removeCluster(clusterId: number) {
if (clusterId === this.selectedRationaleClusterId) {
this.selectedRationaleClusterId = null;
diff --git a/docs/data/example_tiny.json b/docs/data/example_tiny.json
index 79c9204..dd13b2a 100644
--- a/docs/data/example_tiny.json
+++ b/docs/data/example_tiny.json
@@ -1,30 +1,52 @@
{
"metadata": {
"source_path": "n/a: synthetic data for LLM Comparator demo",
- "custom_fields_schema": []
+ "custom_fields_schema": [
+ {"name": "language", "type": "per_model_category"}
+ ]
},
"models": [
- {"name": "ABC 1.1"},
- {"name": "ABC 1.0"}
+ {"name": "ABC v0.6"},
+ {"name": "ABC v0.5"}
],
"examples": [
{
- "input_text": "Which city should I visit in South Korea?",
- "tags": ["Travel"],
- "output_text_a": "You can visit Seoul, the capital of South Korea.",
- "output_text_b": "You can visit Seoul, Busan, and Jeju.",
+ "input_text": "What is LLM Comparator?",
+ "tags": ["Technology"],
+ "output_text_a": "LLM Comparator is an interactive tool for analyzing results from side-by-side LLM evaluation. It visualizes model performance and helps users explore individual responses.\n\nIt has been developed by the People + AI Research Team at Google. The code is available at https://github.com/PAIR-code/llm-comparator.",
+ "output_text_b": "LLM Comparator is a tool for comparing LLM responses from two different models.",
"score": 0.5,
- "individual_rater_scores": [],
- "custom_fields": {}
- },
+ "individual_rater_scores": [
+ {"score": 1.0, "rating_label": "A is better", "is_flipped": false, "rationale": "Response A is more detailed."},
+ {"score": 1.5, "rating_label": "A is much better", "is_flipped": false, "rationale": "Response A provides more information."},
+ {"score": -0.5, "rating_label": "B is slightly better", "is_flipped": true, "rationale": "Response B more succinctly answers the question."},
+ {"score": 0.0, "rating_label": "same", "is_flipped": true, "rationale": "Both provide correct information."}
+ ],
+ "custom_fields": {
+ "language": ["English", "English"]
+ }
+ },
{
"input_text": "How to draw bar charts using Python?",
- "tags": ["Coding"],
- "output_text_a": "I don't know it.",
- "output_text_b": "You can use some data visualization libraries.",
+ "tags": ["Technology"],
+ "output_text_a": "Bar charts can be created by using data visualization libraries.",
+ "output_text_b": "You can draw bar charts using data visualization libraries.\n\n- Matplotlib is a very popular, established library primarily for creating static plots.\n- Plotly is a web-based visualization library for creating a variety of interactive charts.\n- Altair is a declarative visualization library based on a simple, expressive grammar.",
"score": -1.0,
"individual_rater_scores": [],
- "custom_fields": {}
+ "custom_fields": {
+ "language": ["English", "English"]
+ }
+ },
+ {
+ "input_text": "Which city should I visit in South Korea?",
+ "tags": ["Travel"],
+ "output_text_a": "You can visit Seoul.\n\nSeoul is the capital of South Korea. It is the country's largest city with a population of nearly ten million people.",
+ "output_text_b": "Sure, I can tell you. You can visit Seoul, Busan, and Jeju.\n\n- Seoul is the capital city with historic palaces and skyscrapers\n- Busan is the second-largest city with beautiful beaches\n- Jeju is a volcanic island and famous for its natural scenery.",
+ "score": 0.5,
+ "individual_rater_scores": [],
+ "custom_fields": {
+ "language": ["English", "English"]
+ }
},
{
"input_text": "Hi, how are you?",
@@ -32,8 +54,23 @@
"output_text_a": "Good, how are you?",
"output_text_b": "Hi, how are you?",
"score": 0.0,
+ "individual_rater_scores": [
+ {"score": 0.0, "rationale": "There is no meaningful difference."}
+ ],
+ "custom_fields": {
+ "language": ["English", "English"]
+ }
+ },
+ {
+ "input_text": "How to say hello in Korean?",
+ "tags": [],
+ "output_text_a": "안녕하세요?",
+ "output_text_b": "I don't speak Korean.",
+ "score": 1.5,
"individual_rater_scores": [],
- "custom_fields": {}
+ "custom_fields": {
+ "language": ["Korean", "English"]
+ }
}
]
}
\ No newline at end of file
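Note on the `example_tiny.json` hunks: the README states that a positive score means A's response is better, a negative score means B's is better, and zero is a tie. The sketch below applies that sign convention to the five scores visible in the updated example file. `winnerFromScore` and `summarizeScores` are hypothetical names introduced for illustration. The app itself imports a `DEFAULT_WIN_RATE_THRESHOLD` constant whose value does not appear in this diff, so its win-rate computation may differ from this strict sign test.

```typescript
// Hypothetical helpers for illustration; not part of the LLM Comparator codebase.
type Winner = 'A' | 'B' | 'tie';

// Sign convention from the README: positive means A is better,
// negative means B is better, zero is a tie.
function winnerFromScore(score: number): Winner {
  if (score > 0) return 'A';
  if (score < 0) return 'B';
  return 'tie';
}

interface ScoreSummary {
  counts: Record<Winner, number>;
  averageScore: number | null;
}

// Count winners and average the scores, skipping examples without a score
// (the Example interface in client/lib/types.ts allows score to be null).
function summarizeScores(scores: Array<number | null>): ScoreSummary {
  const counts: Record<Winner, number> = {A: 0, B: 0, tie: 0};
  let sum = 0;
  let numScored = 0;
  for (const score of scores) {
    if (score == null) continue;
    counts[winnerFromScore(score)] += 1;
    sum += score;
    numScored += 1;
  }
  return {counts, averageScore: numScored > 0 ? sum / numScored : null};
}

// Scores of the five examples visible in the updated example_tiny.json.
const tinyScores: Array<number | null> = [0.5, -1.0, 0.5, 0.0, 1.5];
const tinySummary = summarizeScores(tinyScores);
```

For these five scores, `tinySummary.counts` has three wins for A, one for B, and one tie, and the average score is 0.3.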
diff --git a/docs/dev_sources.concat.js b/docs/dev_sources.concat.js
index 7d87c8a..cef5ac3 100644
--- a/docs/dev_sources.concat.js
+++ b/docs/dev_sources.concat.js
@@ -11382,11 +11382,6 @@ rect.clickable-transparent-area.selected:hover {
constructor(customFunctionService) {
super();
this.customFunctionService = customFunctionService;
- this.datasetPath = null;
- this.isDatasetPathUploadedFile = false;
- this.isOpenDatasetSelectionPanel = true;
- this.targetTeam = "app";
- this.exampleDatasetPaths = BUILT_IN_DEMO_FILES;
this.metadata = {
source_path: "",
custom_fields_schema: [],
@@ -11395,9 +11390,14 @@ rect.clickable-transparent-area.selected:hover {
this.models = [{ name: "" }, { name: "" }];
this.examples = [];
this.rationaleClusters = [];
+ this.datasetPath = null;
+ this.isDatasetPathUploadedFile = false;
+ this.isOpenDatasetSelectionPanel = true;
+ this.exampleDatasetPaths = BUILT_IN_DEMO_FILES;
+ this.selectedTag = null;
this.currentSorting = DEFAULT_SORTING_CRITERIA;
+ this.isExampleExpanded = {};
this.selectedExample = null;
- this.selectedTag = null;
this.showSelectedExampleDetails = false;
this.exampleDetailsPanelExpanded = false;
this.hasRationaleClusters = false;
@@ -11428,7 +11428,6 @@ rect.clickable-transparent-area.selected:hover {
this.isShowTagChips = true;
this.isShowSidebar = true;
this.numberOfLinesPerOutputCell = 7;
- this.sampleCountForCheckingRatingLevelDataAvailability = 10;
this.customFunctions = {};
this.histogramSpecForCustomFuncs = {};
this.histogramSpecForCustomFuncsOfDiff = {};
@@ -11438,6 +11437,9 @@ rect.clickable-transparent-area.selected:hover {
this.valueDomainsForCustomFields = {};
makeObservable(this);
}
+ getIsExampleExpanded(index) {
+ return this.isExampleExpanded[index] ?? false;
+ }
resetSearchFilter(fieldId) {
this.searchFilters[fieldId] = "";
this.searchFilterInputs[fieldId] = "";
@@ -11636,7 +11638,7 @@ rect.clickable-transparent-area.selected:hover {
}
return examples;
}
- // TODO(b/326139568): Merge with the side-by-side histograms.
+ // TODO: Merge with the side-by-side histograms.
applyHistogramFilterForCustomFuncs(examplesBeforeThisFilter, excludeId = null, excludeModel = null) {
let examples = examplesBeforeThisFilter;
Object.values(this.customFunctions).filter(
@@ -11930,6 +11932,7 @@ rect.clickable-transparent-area.selected:hover {
this.selectedExample = null;
this.selectedTag = null;
this.selectedCustomFuncId = null;
+ this.isExampleExpanded = {};
this.selectedHistogramBinForScores = null;
this.selectedHistogramBinForCustomFields = {};
this.selectedBarChartValues = {};
@@ -11972,9 +11975,6 @@ rect.clickable-transparent-area.selected:hover {
samplingStepSize,
columnsToHide
);
- } else if (params.hasOwnProperty("cns_path")) {
- const datasetPath = params["cns_path"];
- this.loadData(datasetPath, null);
}
}
// Update the sorting option.
@@ -12032,7 +12032,7 @@ rect.clickable-transparent-area.selected:hover {
}
}
// Add histogram spec for custom functions with return type number.
- // TODO(b/326139568): Merge with the side-by-side histograms.
+ // TODO: Merge with the side-by-side histograms.
addHistogramSpecForCustomFunc(customFunc) {
if (customFunc.returnType === "Number" /* NUMBER */) {
const fieldId = getFieldIdForCustomFunc(customFunc.id);
@@ -12130,7 +12130,7 @@ rect.clickable-transparent-area.selected:hover {
reader.readAsText(file);
});
}
- // Load data either from the server or uploaded file.
+ // Load data either from a specified path or uploaded file.
async loadData(datasetPath, fileObject = null, maxNumExamplesToDisplay = null, samplingStepSize = null, columnsToHide = []) {
this.isOpenDatasetSelectionPanel = false;
this.updateStatusMessage("Loading the dataset... Please wait...");
@@ -12317,7 +12317,7 @@ rect.clickable-transparent-area.selected:hover {
this.selectionsFromCustomFuncResults[newId] = initializeCustomFuncSelections();
this.runCustomFunction(this.examples, customFunc);
});
- const statusMessage = `Loaded the dataset of ${this.examples.length} examples.${this.metadata.sampling_step_size > 1 ? ` Because of the large size, we sampled data from every ${this.metadata.sampling_step_size} examples.` : ""}`;
+ const statusMessage = `Loaded the dataset of ${this.examples.length} examples.`;
this.updateStatusMessage(statusMessage, true);
const url = new URL(window.location.href);
if (this.isDatasetPathUploadedFile === false) {
@@ -12402,6 +12402,7 @@ rect.clickable-transparent-area.selected:hover {
this.updateStatusMessage(error, false);
}
}
+ // Remove a rationale cluster row.
removeCluster(clusterId) {
if (clusterId === this.selectedRationaleClusterId) {
this.selectedRationaleClusterId = null;
@@ -12428,40 +12429,40 @@ rect.clickable-transparent-area.selected:hover {
};
__decorateClass([
observable
- ], AppState.prototype, "datasetPath", 2);
+ ], AppState.prototype, "metadata", 2);
__decorateClass([
observable
- ], AppState.prototype, "isDatasetPathUploadedFile", 2);
+ ], AppState.prototype, "models", 2);
__decorateClass([
observable
- ], AppState.prototype, "isOpenDatasetSelectionPanel", 2);
+ ], AppState.prototype, "examples", 2);
__decorateClass([
observable
- ], AppState.prototype, "targetTeam", 2);
+ ], AppState.prototype, "rationaleClusters", 2);
__decorateClass([
observable
- ], AppState.prototype, "exampleDatasetPaths", 2);
+ ], AppState.prototype, "datasetPath", 2);
__decorateClass([
observable
- ], AppState.prototype, "metadata", 2);
+ ], AppState.prototype, "isDatasetPathUploadedFile", 2);
__decorateClass([
observable
- ], AppState.prototype, "models", 2);
+ ], AppState.prototype, "isOpenDatasetSelectionPanel", 2);
__decorateClass([
observable
- ], AppState.prototype, "examples", 2);
+ ], AppState.prototype, "exampleDatasetPaths", 2);
__decorateClass([
observable
- ], AppState.prototype, "rationaleClusters", 2);
+ ], AppState.prototype, "selectedTag", 2);
__decorateClass([
observable
], AppState.prototype, "currentSorting", 2);
__decorateClass([
observable
- ], AppState.prototype, "selectedExample", 2);
+ ], AppState.prototype, "isExampleExpanded", 2);
__decorateClass([
observable
- ], AppState.prototype, "selectedTag", 2);
+ ], AppState.prototype, "selectedExample", 2);
__decorateClass([
observable
], AppState.prototype, "showSelectedExampleDetails", 2);
@@ -12558,9 +12559,6 @@ rect.clickable-transparent-area.selected:hover {
__decorateClass([
computed
], AppState.prototype, "isScoreDivergingScheme", 1);
- __decorateClass([
- observable
- ], AppState.prototype, "sampleCountForCheckingRatingLevelDataAvailability", 2);
__decorateClass([
observable
], AppState.prototype, "customFunctions", 2);
@@ -13263,7 +13261,7 @@ line.axis {
.numExamples=${filteredExamples.length}>
`;
}
- // TODO(b/326139568): Merge into the side-by-side histogram code in charts.ts.
+ // TODO: Merge into the side-by-side histogram code in charts.ts.
renderChartForNumberType(customFunc) {
const getHistogramSpec = () => this.appState.histogramSpecForCustomFuncs[customFunc.id];
const getHistogramSpecForDiff = () => this.appState.histogramSpecForCustomFuncsOfDiff[customFunc.id];
@@ -13578,8 +13576,8 @@ line.axis {
}
.panel-instruction {
- color: #555;
- line-height: 16px;
+ color: var(--comparator-grey-800);
+ line-height: 18px;
margin: 5px 0;
padding: 2px 0;
}
@@ -13654,11 +13652,17 @@ input, button {
const documentationLink = "https://github.com/PAIR-code/llm-comparator";
return x`
- The json file must contain these three properties: "metadata", "models",
- and "examples".
+ The json file must contain these three properties:
+ metadata,
+ models,
+ and examples.
- Each example must have "input_text", "tags", "output_text_a",
- "output_text_b", and "score".
+ Each example in examples must have
+ input_text,
+ tags,
+ output_text_a,
+ output_text_b,
+ and score.
Please refer to our document for details:
${documentationLink}
@@ -13681,7 +13685,7 @@ input, button {
"selected": this.appState.datasetPath === datasetPath
});
const textareaPlaceholder = "Enter a URL to load the json file from.";
- const urlLoadPath = this.appState.appLink + "?results_path=https://.../results.json";
+ const urlLoadPath = this.appState.appLink + "?results_path=https://.../...json";
const panelIntro = x`
Enter the URL path of a json file prepared for LLM Comparator.`;
const panelOutro = x`
@@ -13982,7 +13986,7 @@ td.rationale {
.isFlipXAxis=${() => this.appState.isFlipScoreHistogramAxis}>
`;
}
- // TODO(b/311725252): Create a separate data-table component.
+ // TODO: Create a separate data-table component.
renderRaterTable() {
const selectedExample = this.selectedExample;
if (selectedExample == null) {
@@ -14047,10 +14051,7 @@ td.rationale {