
Guide: Adding a New Task to the MEDIC Leaderboard

This guide provides step-by-step instructions for adding a new evaluation task to the leaderboard system.

Overview

Adding a new task requires changes in several files:

  1. src/about.py - Define column structure and enums
  2. src/display/utils.py - Register columns and create column lists
  3. src/populate_optimized.py - Add sorting logic for the new task
  4. src/ui/leaderboard_v2.py - Add UI components for the task
  5. Data files - Ensure evaluation results are in the correct format

Step 1: Define Column Structure in src/about.py

1.1 Create a Column Dataclass

Define a dataclass for your task's columns:

@dataclass
class MyTaskColumn:
    benchmark: str  # Key in the JSON data file
    metric: str     # Metric type (usually "score")
    col_name: str   # Display name in the leaderboard

class MyTaskColumns(Enum):
    mytask_column0 = MyTaskColumn("metric1", "score", "Metric 1")
    mytask_column1 = MyTaskColumn("metric2", "score", "Metric 2")
    mytask_column2 = MyTaskColumn("metric3", "score", "Metric 3")

Example from existing code:

@dataclass
class ACIColumn:
    benchmark: str
    metric: str
    col_name: str

class ACIColumns(Enum):
    aci_column0 = ACIColumn("coverage", "score", "Coverage")
    aci_column1 = ACIColumn("conform", "score", "Conformity")
    aci_column2 = ACIColumn("fact", "score", "Consistency")

1.2 Import Your Enum

The enum is defined in src/about.py, so nothing extra is needed there. Any file that uses it (such as src/display/utils.py in the next step) must import it from src.about, as shown below.
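
A minimal import sketch, assuming the same import style the codebase uses for its existing enums (e.g. ACIColumns):

# In src/display/utils.py, alongside the existing enum imports:
from src.about import MyTaskColumns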


Step 2: Update Column Definitions in src/display/utils.py

2.1 Add Column Flag to ColumnContent Dataclass

Add a new boolean flag for your task in the ColumnContent dataclass:

@dataclass
class ColumnContent:
    # ... existing fields ...
    mytask_col: bool = False  # Add this line

Location: Around line 43-74 in src/display/utils.py

2.2 Register Columns in auto_eval_column_dict

Add your task's columns to the auto_eval_column_dict list:

for column in MyTaskColumns:
    auto_eval_column_dict.append(
        [
            column.name,
            ColumnContent,
            ColumnContent(
                column.value.col_name, 
                "number", 
                True,  # displayed_by_default
                False,  # hidden
                mytask_col=True, 
                invariant=False
            ),
        ]
    )

Example from existing code:

for column in ACIColumns:
    auto_eval_column_dict.append(
        [
            column.name,
            ColumnContent,
            ColumnContent(column.value.col_name, "number", True, False, aci_col=True, invariant=False),
        ]
    )

Location: Around line 130-250 in src/display/utils.py

2.3 Create Column Lists

Create two lists for your task:

  1. Column list for filtering:

MYTASK_COLS = [
    c.name for c in fields(AutoEvalColumn)
    if not c.hidden and (c.mytask_col or c.invariant)
]

  2. Benchmark column list:

MYTASK_BENCHMARK_COLS = [
    t.value.col_name for t in MyTaskColumns
]

Location: Around line 494-595 in src/display/utils.py

Example from existing code:

ACI_COLS = [c.name for c in fields(AutoEvalColumn) if not c.hidden and (c.aci_col or c.invariant)]
ACI_BENCHMARK_COLS = [t.value.col_name for t in ACIColumns]

2.4 Export Column Lists (if needed)

If you need these lists elsewhere (for example in src/populate_optimized.py), import them at the top of the files that use them.
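
A minimal sketch of such an import; the module path mirrors this guide's examples, so adjust it to the actual file layout:

# In src/populate_optimized.py, alongside the existing column-list imports:
from src.display.utils import MYTASK_COLS, MYTASK_BENCHMARK_COLS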


Step 3: Add Sorting Logic in src/populate_optimized.py

Add sorting logic for your task in the _process_single_subset function:

elif subset == "mytask":
    df = df.sort_values(by=["Primary Metric Name"], ascending=False)

Location: Around line 73-106 in src/populate_optimized.py

Example from existing code:

elif subset == "aci":
    df = df.sort_values(by=[AutoEvalColumn.overall.name], ascending=False)

Note: Replace "Primary Metric Name" with the actual column name from your data that should be used for sorting.


Step 4: Add UI Component in src/ui/leaderboard_v2.py

4.1 Add Task to Categories List

Add your task to the categories list:

categories = [
    "🏅 Medical Summarization",
    # ... existing tasks ...
    "🏅 My New Task",  # Add your task here
    # ... rest of tasks ...
]

Location: Around line 23-35 in src/ui/leaderboard_v2.py

4.2 Create UI Column

Add a new column for your task:

# X. My New Task
with gr.Column(visible=False) as col_mytask:
    category_columns["🏅 My New Task"] = col_mytask
    
    # Optional: Add benchmark information accordion
    with gr.Accordion("ℹ️ Benchmark Information", open=False, elem_classes="markdown-text"):
        gr.Markdown("", elem_classes="markdown-text")  # Add description if needed
    
    # Create the leaderboard UI
    create_leaderboard_ui_v2(
        subset_name="mytask",  # Must match the subset name used in data files
        column_choices=[
            c.name
            for c in fields(AutoEvalColumn)
            if not c.hidden and not c.never_hidden and (c.invariant or c.mytask_col)
        ],
        default_columns=[
            c.name
            for c in fields(AutoEvalColumn)
            if c.displayed_by_default
            and not c.hidden
            and not c.never_hidden
            and (c.invariant or c.mytask_col)
        ],
    )
    
    # Optional: Add generation templates accordion if applicable
    with gr.Accordion("💬 Generation templates", open=False):
        with gr.Accordion("Response generation", open=False):
            render_generation_templates(task="mytask", generation_type="response_generation")

Location: Add this section around line 200-500 in src/ui/leaderboard_v2.py, following the pattern of existing tasks.

4.3 Handle Subtasks (Optional)

If your task has subtasks (like MedCalc with Direct Answer, One Shot CoT, Zero Shot CoT), wrap the UI in tabs:

with gr.Column(visible=False) as col_mytask:
    category_columns["🏅 My New Task"] = col_mytask
    with gr.Accordion("ℹ️ Benchmark Information", open=False, elem_classes="markdown-text"):
        gr.Markdown("", elem_classes="markdown-text")
    with gr.Tabs(elem_classes="tab-buttons2", visible=True):
        with gr.TabItem("Subtask 1", id=0):
            create_leaderboard_ui_v2(
                subset_name="mytask_subtask1",
                column_choices=[...],
                default_columns=[...],
            )
        with gr.TabItem("Subtask 2", id=1):
            create_leaderboard_ui_v2(
                subset_name="mytask_subtask2",
                column_choices=[...],
                default_columns=[...],
            )

Example reference: See MedCalc implementation around line 201-256 in src/ui/leaderboard_v2.py.


Step 5: Data File Requirements

5.1 Data Structure

Your evaluation results JSON files should follow this structure:

{
    "full_model": "model_name",
    "revision": "commit_sha",
    "subset": "mytask",
    "metric1": 85.5,
    "metric2": 92.3,
    "metric3": 88.1,
    ...
}

5.2 Column Names

The benchmark field of each entry in your MyTaskColumns enum must match a key in your JSON data files; the sketch after Section 5.3 shows one way to check this.

5.3 File Location

Place your evaluation result files in the appropriate directory that the system reads from (typically configured in EVAL_RESULTS_PATH).
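
As a quick sanity check before touching the UI, a script along these lines can verify both the file location and the key matching from Section 5.2. This is a hypothetical helper, not part of the leaderboard code; the results path and subset value are assumptions to adjust:

# Hypothetical sanity check for new result files; not part of the leaderboard code.
import json
from pathlib import Path

from src.about import MyTaskColumns  # the enum defined in Step 1

EVAL_RESULTS_PATH = "eval-results"  # assumption: use your configured results path

# The "benchmark" field of each enum entry must appear as a key in the JSON files.
expected_keys = {c.value.benchmark for c in MyTaskColumns}

for path in Path(EVAL_RESULTS_PATH).rglob("*.json"):
    record = json.loads(path.read_text())
    if record.get("subset") != "mytask":
        continue  # only validate files belonging to the new task
    missing = expected_keys - record.keys()
    if missing:
        print(f"{path}: missing benchmark keys {sorted(missing)}")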


Step 6: Update load_all_datasets_parallel (if needed)

If your task needs special handling in the data loading function, update src/populate_optimized.py:

def load_all_datasets_parallel(results_path: str, requests_path: str, max_workers: int = 4):
    # ... existing code ...
    
    # Add your task to the datasets dictionary
    datasets["mytask"] = _process_single_subset(
        results_path, 
        requests_path, 
        MYTASK_COLS, 
        MYTASK_BENCHMARK_COLS, 
        "your_evaluation_metric",  # e.g., "score", "accuracy", etc.
        "mytask"
    )

Location: Around line 116-230 in src/populate_optimized.py

Note: The evaluation metric should match what's used in your data files.


Step 7: Testing Checklist

After making all changes, verify the following (a smoke-test sketch follows the checklist):

  • Task appears in the category selector
  • Task can be selected and displays correctly
  • Table shows all expected columns
  • Data loads correctly from JSON files
  • Sorting works correctly
  • Search and filter functionality works
  • Column selection works
  • Benchmark information accordion displays (if added)
  • Generation templates work (if added)
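
For a quick local smoke test, something like the sketch below can confirm that the new subset loads at all. The paths are placeholders and the loader signature is taken from Step 6; treat both as assumptions to verify against the actual code:

# Hypothetical smoke test; paths are placeholders and the loader signature
# is assumed from Step 6.
from src.populate_optimized import load_all_datasets_parallel

datasets = load_all_datasets_parallel("eval-results", "eval-requests")
df = datasets["mytask"]  # raises KeyError if Step 6 did not register the subset
print(df.columns.tolist())  # expect the display names from MyTaskColumns
print(df.head())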

Common Patterns

Pattern 1: Simple Task (No Subtasks)

Example: Medical Summarization, Med Safety

  • Single UI component
  • No tabs needed
  • Direct create_leaderboard_ui_v2 call

Pattern 2: Task with Subtasks

Example: MedCalc, EHRSQL, MedEC

  • Wrap in gr.Tabs(elem_classes="tab-buttons2")
  • Each subtask has its own gr.TabItem
  • Each subtask needs its own column flag and enum

Pattern 3: Task with Multiple Languages

Example: Open Ended Evaluation

  • Use the create_open_ended_evaluation_tabs pattern
  • Create a separate enum for each language (see the sketch below)
  • Each language needs its own column flag
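
A hedged illustration of the per-language enums; the names here are hypothetical, and the actual Open Ended Evaluation enums may differ:

from enum import Enum
# MyTaskColumn is the dataclass defined in Step 1.

class MyTaskEnColumns(Enum):
    mytask_en_column0 = MyTaskColumn("accuracy_en", "score", "Accuracy (EN)")

class MyTaskArColumns(Enum):
    mytask_ar_column0 = MyTaskColumn("accuracy_ar", "score", "Accuracy (AR)")

# Each language also gets its own flag in ColumnContent,
# e.g. mytask_en_col: bool = False and mytask_ar_col: bool = False.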

Pattern 4: Task Sharing Columns

Example: ACI and SOAP both follow the medical_summarization_col pattern

  • Tasks can share a column flag if their columns are identical
  • Use the same column flag in multiple tasks (see the sketch below)
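
A minimal sketch of the sharing, using the hypothetical mytask_col flag from the earlier steps (fields and AutoEvalColumn are as in Step 2.3):

# Two subsets can reuse one flag when their column sets are identical;
# only the subset_name passed to create_leaderboard_ui_v2 differs.
MYTASK_A_COLS = [c.name for c in fields(AutoEvalColumn) if not c.hidden and (c.mytask_col or c.invariant)]
MYTASK_B_COLS = MYTASK_A_COLS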

Important Notes

  1. Subset Name Consistency: The subset_name parameter in create_leaderboard_ui_v2 must match:

    • The subset field in your JSON data files
    • The key used in load_all_datasets_parallel
    • The sorting logic in _process_single_subset (the sketch after these notes shows one way to keep all three in sync)
  2. Column Flag Naming: Use snake_case for column flags (e.g., mytask_col, my_new_task_col)

  3. Column Display Names: Use human-readable names in col_name (e.g., "Overall Score", "Coverage", "Accuracy")

  4. Default Display: Set displayed_by_default=True for important metrics that should be visible by default

  5. Invariant Columns: Columns like "Model", "Revision" are invariant=True and appear in all tasks

  6. Hidden Columns: Use hidden=True for columns that shouldn't appear in the UI

  7. Never Hidden: Use never_hidden=True for essential columns like "Model" that can't be hidden
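
One way to keep the subset name in sync across those three places is a shared constant; this is a hypothetical convention, not something the current codebase is known to use:

# Hypothetical convention: define the subset name once and reuse it.
MYTASK_SUBSET = "mytask"

# In src/ui/leaderboard_v2.py:
#     create_leaderboard_ui_v2(subset_name=MYTASK_SUBSET, ...)
# In src/populate_optimized.py:
#     datasets[MYTASK_SUBSET] = _process_single_subset(..., MYTASK_SUBSET)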


Example: Complete Task Implementation

Here's a complete example for a hypothetical task called "Clinical Reasoning":

1. src/about.py

@dataclass
class ClinicalReasoningColumn:
    benchmark: str
    metric: str
    col_name: str

class ClinicalReasoningColumns(Enum):
    cr_column0 = ClinicalReasoningColumn("accuracy", "score", "Accuracy")
    cr_column1 = ClinicalReasoningColumn("reasoning_quality", "score", "Reasoning Quality")
    cr_column2 = ClinicalReasoningColumn("safety_score", "score", "Safety Score")

2. src/display/utils.py

# In ColumnContent dataclass
clinical_reasoning_col: bool = False

# In auto_eval_column_dict
for column in ClinicalReasoningColumns:
    auto_eval_column_dict.append(
        [
            column.name,
            ColumnContent,
            ColumnContent(column.value.col_name, "number", True, False, clinical_reasoning_col=True, invariant=False),
        ]
    )

# Column lists
CLINICAL_REASONING_COLS = [c.name for c in fields(AutoEvalColumn) if not c.hidden and (c.clinical_reasoning_col or c.invariant)]
CLINICAL_REASONING_BENCHMARK_COLS = [t.value.col_name for t in ClinicalReasoningColumns]

3. src/populate_optimized.py

elif subset == "clinical_reasoning":
    df = df.sort_values(by=["Accuracy"], ascending=False)

4. src/ui/leaderboard_v2.py

categories = [
    # ... existing tasks ...
    "🏅 Clinical Reasoning",
    # ... rest ...
]

# In the UI section
with gr.Column(visible=False) as col_clinical_reasoning:
    category_columns["🏅 Clinical Reasoning"] = col_clinical_reasoning
    with gr.Accordion("ℹ️ Benchmark Information", open=False, elem_classes="markdown-text"):
        gr.Markdown("Clinical Reasoning evaluates...", elem_classes="markdown-text")
    create_leaderboard_ui_v2(
        subset_name="clinical_reasoning",
        column_choices=[
            c.name
            for c in fields(AutoEvalColumn)
            if not c.hidden and not c.never_hidden and (c.invariant or c.clinical_reasoning_col)
        ],
        default_columns=[
            c.name
            for c in fields(AutoEvalColumn)
            if c.displayed_by_default
            and not c.hidden
            and not c.never_hidden
            and (c.invariant or c.clinical_reasoning_col)
        ],
    )

Troubleshooting

Issue: Task doesn't appear in UI

  • Check that the task is added to the categories list
  • Verify the column is created in the category_columns dictionary
  • Ensure the switch_category function handles your task

Issue: No data shows up

  • Verify subset_name matches data file subset field
  • Check that data is loaded in load_all_datasets_parallel
  • Verify column names match between enum and data files

Issue: Columns don't appear

  • Check column flag is set correctly in ColumnContent
  • Verify columns are registered in auto_eval_column_dict
  • Check column list includes your task's columns

Issue: Sorting doesn't work

  • Verify sorting logic is added in _process_single_subset
  • Check column name used for sorting matches data file key

Additional Resources

  • See existing task implementations for reference:

    • Simple task: med_safety (line 460-483 in leaderboard_v2.py)
    • Task with subtasks: medcalc (line 201-256 in leaderboard_v2.py)
    • Task with info: medical_summarization (line 172-198 in leaderboard_v2.py)
  • Column definitions: src/display/utils.py lines 43-595

  • Task definitions: src/about.py lines 1-300

  • UI implementation: src/ui/leaderboard_v2.py lines 172-500