How to Evaluate AI-Generated Clinical Notes Reliably (Full Transcript)

A practical approach combining user feedback, automated metrics, and LLM-as-a-judge scoring against provider-signed EHR notes.

[00:00:00] Speaker 1: For us, a lot of it is user feedback that we'll get, like, "Hey, this thing was spelled incorrectly," or "done this way," or "I had to edit, this wasn't picked up." So that's typically a lot of the user feedback we get on specifically the transcript part. Then for the final output part, we use a measure of F1 scores, ROUGE, BERTScore, LLM-as-a-judge workflows. Essentially, every note that we generate, we will compare it to the signed note that the provider submits and writes back to their electronic health record system. And we'll do essentially evals for F1, but also LLM as a judge. So every note that comes in, we'll process nightly and come up with a score based off our grading criteria for "this is a good medical note" versus not. So we'll find things, because just because the provider didn't edit it doesn't mean it's a perfect note either. So that's why you need this other signal to counteract in those cases.
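The reference-based metrics the speaker names (token-level F1 and ROUGE-1) can be sketched in a few lines. This is a minimal illustration against a signed reference note, not the speaker's production pipeline; the function names are illustrative.

```python
from collections import Counter


def token_f1(generated: str, reference: str) -> float:
    """Token-level F1 between a generated note and the signed reference note."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # shared tokens, counting multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rouge1_recall(generated: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of the reference note's unigrams recovered."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    return sum((gen & ref).values()) / sum(ref.values())
```

In practice these would run over every (generated note, signed note) pair pulled from the EHR write-back, with BERTScore or another embedding-based measure added for semantic rather than surface overlap.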

AI Insights
Summary
The speaker explains how they evaluate transcript quality and final medical note generation. Transcript issues are often identified through user feedback (misspellings, missed edits). For the final note, they run nightly evaluations comparing generated notes to provider-signed notes using metrics like F1 and ROUGE, plus embedding/BERT-style measures and an "LLM-as-a-judge" workflow. They also note that provider non-edits don’t guarantee perfection, so additional scoring signals are needed to detect issues.
Title
Evaluating Medical Transcripts and Notes with Metrics and LLM Judging
Keywords
user feedback
transcription quality
medical note generation
F1 score
ROUGE
BERTScore
LLM as a judge
evaluation workflow
provider-signed notes
electronic health record
Key Takeaways
  • Transcript quality problems are primarily surfaced via direct user feedback (e.g., misspellings, missed edits).
  • Final note quality is evaluated by comparing generated notes to provider-signed notes in the EHR.
  • Automated metrics such as F1 and ROUGE are used alongside semantic measures and LLM-as-a-judge grading.
  • Evaluations are run nightly to score every note against defined medical-note criteria.
  • Provider acceptance (not editing a note) is not a reliable proxy for correctness, so independent evaluation signals are necessary.
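The nightly LLM-as-a-judge pass described above could be wired up roughly as below. The rubric text, the criteria, and the `call_llm` hook are all hypothetical placeholders (the transcript does not reveal the actual grading criteria); the sketch only illustrates scoring each generated note against the provider-signed one.

```python
import json

# Hypothetical rubric; the real grading criteria are not given in the transcript.
RUBRIC = (
    "Grade the GENERATED clinical note against the SIGNED note on a 1-5 scale "
    "for each criterion: completeness, factual_accuracy, clarity. "
    'Reply with JSON only, e.g. {"completeness": 4, "factual_accuracy": 5, "clarity": 3}.'
)


def judge_note(generated: str, signed: str, call_llm) -> dict:
    """Score one generated note against the provider-signed note.

    `call_llm` is an injected function (prompt -> str) so the judge model
    can be swapped out, or stubbed in tests.
    """
    prompt = f"{RUBRIC}\n\nGENERATED:\n{generated}\n\nSIGNED:\n{signed}"
    scores = json.loads(call_llm(prompt))  # parse the judge's JSON verdict
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores
```

Running this nightly over all notes, alongside the F1/ROUGE metrics, gives the independent signal the speaker describes for catching imperfect notes that providers signed without edits.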
Sentiments
Neutral: The tone is practical and process-focused, describing evaluation methods and safeguards without strong positive or negative emotion.