# Guide: Adding a New Task to the MEDIC Leaderboard
This guide provides step-by-step instructions for adding a new evaluation task to the leaderboard system.
## Overview
Adding a new task requires changes in several files:
- `src/about.py` - Define column structure and enums
- `src/display/utils.py` - Register columns and create column lists
- `src/populate_optimized.py` - Add sorting logic for the new task
- `src/ui/leaderboard_v2.py` - Add UI components for the task
- Data files - Ensure evaluation results are in the correct format
## Step 1: Define Column Structure in `src/about.py`

### 1.1 Create a Column Dataclass

Define a dataclass for your task's columns:
```python
from dataclasses import dataclass
from enum import Enum

@dataclass
class MyTaskColumn:
    benchmark: str  # Key in the JSON data file
    metric: str     # Metric type (usually "score")
    col_name: str   # Display name in the leaderboard

class MyTaskColumns(Enum):
    mytask_column0 = MyTaskColumn("metric1", "score", "Metric 1")
    mytask_column1 = MyTaskColumn("metric2", "score", "Metric 2")
    mytask_column2 = MyTaskColumn("metric3", "score", "Metric 3")
```
Example from existing code:
```python
@dataclass
class ACIColumn:
    benchmark: str
    metric: str
    col_name: str

class ACIColumns(Enum):
    aci_column0 = ACIColumn("coverage", "score", "Coverage")
    aci_column1 = ACIColumn("conform", "score", "Conformity")
    aci_column2 = ACIColumn("fact", "score", "Consistency")
```
### 1.2 Import Your Enum

The enum is defined in `src/about.py`, so nothing extra is needed there; any file that consumes it (such as `src/display/utils.py`) must import it explicitly.
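For example, a minimal import in `src/display/utils.py` might look like this (a sketch; adjust the import path to match the repository layout):

```python
# In src/display/utils.py -- import the task enum defined in Step 1.1
from src.about import MyTaskColumns
```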
## Step 2: Update Column Definitions in `src/display/utils.py`

### 2.1 Add a Column Flag to the `ColumnContent` Dataclass

Add a new boolean flag for your task in the `ColumnContent` dataclass:
```python
@dataclass
class ColumnContent:
    # ... existing fields ...
    mytask_col: bool = False  # Add this line
```

Location: around lines 43-74 in `src/display/utils.py`
### 2.2 Register Columns in `auto_eval_column_dict`

Add your task's columns to the `auto_eval_column_dict` list:
```python
for column in MyTaskColumns:
    auto_eval_column_dict.append(
        [
            column.name,
            ColumnContent,
            ColumnContent(
                column.value.col_name,
                "number",
                True,   # displayed_by_default
                False,  # hidden
                mytask_col=True,
                invariant=False,
            ),
        ]
    )
```
Example from existing code:
```python
for column in ACIColumns:
    auto_eval_column_dict.append(
        [
            column.name,
            ColumnContent,
            ColumnContent(column.value.col_name, "number", True, False, aci_col=True, invariant=False),
        ]
    )
```

Location: around lines 130-250 in `src/display/utils.py`
### 2.3 Create Column Lists

Create two lists for your task:

- Column list for filtering:

```python
MYTASK_COLS = [
    c.name for c in fields(AutoEvalColumn)
    if not c.hidden and (c.mytask_col or c.invariant)
]
```

- Benchmark column list:

```python
MYTASK_BENCHMARK_COLS = [t.value.col_name for t in MyTaskColumns]
```

Location: around lines 494-595 in `src/display/utils.py`
Example from existing code:
```python
ACI_COLS = [c.name for c in fields(AutoEvalColumn) if not c.hidden and (c.aci_col or c.invariant)]
ACI_BENCHMARK_COLS = [t.value.col_name for t in ACIColumns]
```
### 2.4 Export Column Lists (if needed)

If you need these lists elsewhere, import them explicitly in the files that use them.
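A sketch of such an import, assuming the list names from Step 2.3:

```python
# In src/populate_optimized.py -- pull in the column lists created in Step 2.3
from src.display.utils import MYTASK_COLS, MYTASK_BENCHMARK_COLS
```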
## Step 3: Add Sorting Logic in `src/populate_optimized.py`

Add sorting logic for your task in the `_process_single_subset` function:
```python
elif subset == "mytask":
    df = df.sort_values(by=["Primary Metric Name"], ascending=False)
```

Location: around lines 73-106 in `src/populate_optimized.py`
Example from existing code:
```python
elif subset == "aci":
    df = df.sort_values(by=[AutoEvalColumn.overall.name], ascending=False)
```

Note: replace `"Primary Metric Name"` with the actual column name from your data that should be used for sorting.
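If you prefer not to hard-code the display name, one option (a sketch, reusing `MyTaskColumns` from Step 1) is to derive the sort column from the enum:

```python
elif subset == "mytask":
    # Sort by the first metric's display name ("Metric 1" in Step 1's example)
    df = df.sort_values(
        by=[MyTaskColumns.mytask_column0.value.col_name],
        ascending=False,
    )
```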
## Step 4: Add UI Component in `src/ui/leaderboard_v2.py`

### 4.1 Add the Task to the Categories List

Add your task to the `categories` list:
```python
categories = [
    "Medical Summarization",
    # ... existing tasks ...
    "My New Task",  # Add your task here
    # ... rest of the tasks ...
]
```

Location: around lines 23-35 in `src/ui/leaderboard_v2.py`
### 4.2 Create the UI Column

Add a new column for your task:
```python
# X. My New Task
with gr.Column(visible=False) as col_mytask:
    category_columns["My New Task"] = col_mytask

    # Optional: add a benchmark information accordion
    with gr.Accordion("ℹ️ Benchmark Information", open=False, elem_classes="markdown-text"):
        gr.Markdown("", elem_classes="markdown-text")  # Add a description if needed

    # Create the leaderboard UI
    create_leaderboard_ui_v2(
        subset_name="mytask",  # Must match the subset name used in data files
        column_choices=[
            c.name
            for c in fields(AutoEvalColumn)
            if not c.hidden and not c.never_hidden and (c.invariant or c.mytask_col)
        ],
        default_columns=[
            c.name
            for c in fields(AutoEvalColumn)
            if c.displayed_by_default
            and not c.hidden
            and not c.never_hidden
            and (c.invariant or c.mytask_col)
        ],
    )

    # Optional: add a generation templates accordion if applicable
    with gr.Accordion("💬 Generation templates", open=False):
        with gr.Accordion("Response generation", open=False):
            render_generation_templates(task="mytask", generation_type="response_generation")
```

Location: add this section around lines 200-500 in `src/ui/leaderboard_v2.py`, following the pattern of existing tasks.
### 4.3 Handle Subtasks (Optional)

If your task has subtasks (like MedCalc with Direct Answer, One Shot CoT, and Zero Shot CoT), wrap the UI in tabs:
```python
with gr.Column(visible=False) as col_mytask:
    category_columns["My New Task"] = col_mytask

    with gr.Accordion("ℹ️ Benchmark Information", open=False, elem_classes="markdown-text"):
        gr.Markdown("", elem_classes="markdown-text")

    with gr.Tabs(elem_classes="tab-buttons2", visible=True):
        with gr.TabItem("Subtask 1", id=0):
            create_leaderboard_ui_v2(
                subset_name="mytask_subtask1",
                column_choices=[...],
                default_columns=[...],
            )
        with gr.TabItem("Subtask 2", id=1):
            create_leaderboard_ui_v2(
                subset_name="mytask_subtask2",
                column_choices=[...],
                default_columns=[...],
            )
```

Example reference: see the MedCalc implementation around lines 201-256 in `src/ui/leaderboard_v2.py`.
## Step 5: Data File Requirements

### 5.1 Data Structure

Your evaluation results JSON files should follow this structure:
```json
{
    "full_model": "model_name",
    "revision": "commit_sha",
    "subset": "mytask",
    "metric1": 85.5,
    "metric2": 92.3,
    "metric3": 88.1,
    ...
}
```
### 5.2 Column Names

The `benchmark` field in your `MyTaskColumns` enum must match the keys in your JSON data files.
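A quick way to catch mismatches early (a sketch; the result-file path is hypothetical):

```python
import json

from src.about import MyTaskColumns

# Every benchmark key declared in the enum should exist in each result file.
with open("eval-results/mytask/example.json") as f:  # hypothetical path
    result = json.load(f)

for column in MyTaskColumns:
    assert column.value.benchmark in result, f"Missing key: {column.value.benchmark}"
```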
### 5.3 File Location

Place your evaluation result files in the directory the system reads from (typically configured via `EVAL_RESULTS_PATH`).
## Step 6: Update `load_all_datasets_parallel` (if needed)

If your task needs special handling in the data loading function, update `src/populate_optimized.py`:
```python
def load_all_datasets_parallel(results_path: str, requests_path: str, max_workers: int = 4):
    # ... existing code ...

    # Add your task to the datasets dictionary
    datasets["mytask"] = _process_single_subset(
        results_path,
        requests_path,
        MYTASK_COLS,
        MYTASK_BENCHMARK_COLS,
        "your_evaluation_metric",  # e.g., "score", "accuracy", etc.
        "mytask",
    )
```

Location: around lines 116-230 in `src/populate_optimized.py`

Note: the evaluation metric should match what's used in your data files.
## Step 7: Testing Checklist

After making all changes, verify the following (a quick smoke-test sketch follows this list):
- Task appears in the category selector
- Task can be selected and displays correctly
- Table shows all expected columns
- Data loads correctly from JSON files
- Sorting works correctly
- Search and filter functionality works
- Column selection works
- Benchmark information accordion displays (if added)
- Generation templates work (if added)
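A minimal smoke test for the first few items (a sketch; the paths and the shape of `load_all_datasets_parallel`'s return value are assumptions based on Step 6):

```python
from src.display.utils import MYTASK_BENCHMARK_COLS
from src.populate_optimized import load_all_datasets_parallel

# Load everything the way the app does, then inspect the new subset.
datasets = load_all_datasets_parallel("eval-results", "eval-requests")  # hypothetical paths
df = datasets["mytask"]

missing = [c for c in MYTASK_BENCHMARK_COLS if c not in df.columns]
assert not missing, f"Missing columns: {missing}"
print(df.head())
```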
## Common Patterns

### Pattern 1: Simple Task (No Subtasks)

Example: Medical Summarization, Med Safety

- Single UI component
- No tabs needed
- Direct `create_leaderboard_ui_v2` call
### Pattern 2: Task with Subtasks

Example: MedCalc, EHRSQL, MedEC

- Wrap in `gr.Tabs(elem_classes="tab-buttons2")`
- Each subtask has its own `gr.TabItem`
- Each subtask needs its own column flag and enum
### Pattern 3: Task with Multiple Languages

Example: Open Ended Evaluation

- Use the `create_open_ended_evaluation_tabs` pattern
- Create a separate enum for each language
- Each language needs its own column flag
### Pattern 4: Task Sharing Columns

Example: ACI and SOAP both use the `medical_summarization_col` pattern

- Tasks can share column flags if their columns are identical
- Use the same column flag in multiple tasks (see the sketch below)
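A sketch of the sharing idea (the `SHARED_COLS` name is illustrative; the filtering style follows Step 2.3):

```python
# Build one column list from the shared flag...
SHARED_COLS = [
    c.name
    for c in fields(AutoEvalColumn)
    if not c.hidden and (c.medical_summarization_col or c.invariant)
]

# ...and reuse it in both tasks' UI sections.
create_leaderboard_ui_v2(subset_name="aci", column_choices=SHARED_COLS, default_columns=SHARED_COLS)
create_leaderboard_ui_v2(subset_name="soap", column_choices=SHARED_COLS, default_columns=SHARED_COLS)
```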
## Important Notes

- **Subset Name Consistency:** the `subset_name` parameter in `create_leaderboard_ui_v2` must match:
  - the `subset` field in your JSON data files
  - the key used in `load_all_datasets_parallel`
  - the sorting logic in `_process_single_subset`
- **Column Flag Naming:** use `snake_case` for column flags (e.g., `mytask_col`, `my_new_task_col`)
- **Column Display Names:** use human-readable names in `col_name` (e.g., "Overall Score", "Coverage", "Accuracy")
- **Default Display:** set `displayed_by_default=True` for important metrics that should be visible by default
- **Invariant Columns:** columns like "Model" and "Revision" have `invariant=True` and appear in all tasks
- **Hidden Columns:** use `hidden=True` for columns that shouldn't appear in the UI
- **Never Hidden:** use `never_hidden=True` for essential columns like "Model" that can't be hidden

The sketch after this list shows how these flags combine in practice.
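To make the flag interactions concrete (a sketch; the positional arguments follow the `ColumnContent(col_name, type, displayed_by_default, hidden, ...)` order from Step 2.2, and the type strings and column names are illustrative):

```python
# Always visible, present in every task, cannot be deselected.
ColumnContent("Model", "markdown", True, False, never_hidden=True, invariant=True)

# Present in every task, but only shown when the user selects it.
ColumnContent("Revision", "str", False, False, invariant=True)

# Task-specific metric, shown by default for its task only.
ColumnContent("Coverage", "number", True, False, aci_col=True)

# Internal column that should never appear in the UI (hypothetical name).
ColumnContent("raw_payload", "str", False, True)
```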
## Example: Complete Task Implementation
Here's a complete example for a hypothetical task called "Clinical Reasoning":
### 1. `src/about.py`

```python
@dataclass
class ClinicalReasoningColumn:
    benchmark: str
    metric: str
    col_name: str

class ClinicalReasoningColumns(Enum):
    cr_column0 = ClinicalReasoningColumn("accuracy", "score", "Accuracy")
    cr_column1 = ClinicalReasoningColumn("reasoning_quality", "score", "Reasoning Quality")
    cr_column2 = ClinicalReasoningColumn("safety_score", "score", "Safety Score")
```
### 2. `src/display/utils.py`

```python
# In the ColumnContent dataclass
clinical_reasoning_col: bool = False

# In auto_eval_column_dict
for column in ClinicalReasoningColumns:
    auto_eval_column_dict.append(
        [
            column.name,
            ColumnContent,
            ColumnContent(column.value.col_name, "number", True, False, clinical_reasoning_col=True, invariant=False),
        ]
    )

# Column lists
CLINICAL_REASONING_COLS = [c.name for c in fields(AutoEvalColumn) if not c.hidden and (c.clinical_reasoning_col or c.invariant)]
CLINICAL_REASONING_BENCHMARK_COLS = [t.value.col_name for t in ClinicalReasoningColumns]
```
### 3. `src/populate_optimized.py`

```python
elif subset == "clinical_reasoning":
    df = df.sort_values(by=["Accuracy"], ascending=False)
```
### 4. `src/ui/leaderboard_v2.py`

```python
categories = [
    # ... existing tasks ...
    "Clinical Reasoning",
    # ... rest ...
]

# In the UI section
with gr.Column(visible=False) as col_clinical_reasoning:
    category_columns["Clinical Reasoning"] = col_clinical_reasoning

    with gr.Accordion("ℹ️ Benchmark Information", open=False, elem_classes="markdown-text"):
        gr.Markdown("Clinical Reasoning evaluates...", elem_classes="markdown-text")

    create_leaderboard_ui_v2(
        subset_name="clinical_reasoning",
        column_choices=[
            c.name
            for c in fields(AutoEvalColumn)
            if not c.hidden and not c.never_hidden and (c.invariant or c.clinical_reasoning_col)
        ],
        default_columns=[
            c.name
            for c in fields(AutoEvalColumn)
            if c.displayed_by_default
            and not c.hidden
            and not c.never_hidden
            and (c.invariant or c.clinical_reasoning_col)
        ],
    )
```
## Troubleshooting

### Issue: Task doesn't appear in the UI

- Check that the task is added to the `categories` list
- Verify the column is created in the `category_columns` dictionary
- Ensure the `switch_category` function includes your task
### Issue: No data shows up

- Verify `subset_name` matches the data file's `subset` field
- Check that data is loaded in `load_all_datasets_parallel`
- Verify column names match between the enum and the data files
### Issue: Columns don't appear

- Check the column flag is set correctly in `ColumnContent`
- Verify the columns are registered in `auto_eval_column_dict`
- Check the column list includes your task's columns
### Issue: Sorting doesn't work

- Verify sorting logic is added in `_process_single_subset`
- Check that the column name used for sorting matches the data file key
## Additional Resources

See existing task implementations for reference:

- Simple task: `med_safety` (lines 460-483 in `leaderboard_v2.py`)
- Task with subtasks: `medcalc` (lines 201-256 in `leaderboard_v2.py`)
- Task with info: `medical_summarization` (lines 172-198 in `leaderboard_v2.py`)

Key source locations:

- Column definitions: `src/display/utils.py`, lines 43-595
- Task definitions: `src/about.py`, lines 1-300
- UI implementation: `src/ui/leaderboard_v2.py`, lines 172-500