How to Evaluate AI-Generated Clinical Notes Reliably (Full Transcript)

A practical approach combining user feedback, automated metrics, and LLM-as-a-judge scoring against provider-signed EHR notes.

[00:00:00] Speaker 1: For us, a lot of it is user feedback that we'll get, like, "Hey, this thing was spelled incorrectly," or "done this way," or "I had to edit, this wasn't picked up." So that's typically a lot of the user feedback we get on specifically the transcript part. Then for the final output part, we use a measure of F1 scores, ROUGE, BERTScore, LLM-as-a-judge workflows. Essentially, every note that we generate, we will compare it to the signed note that the provider submits and writes back to their electronic health record system. And we'll do essentially evals for F1, but also LLM as a judge. So every note that comes in, we'll process nightly and come up with a score based off our grading criteria for "this is a good medical note" versus not. So we'll find things, because just because the provider didn't edit it doesn't mean it's a perfect note either. So that's why you need this other signal to counteract in those cases.
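The reference-based metrics the speaker names (token-level F1 and ROUGE-1) can be sketched in a few lines. This is a minimal illustration against a signed reference note, not the speaker's production pipeline; the function names are illustrative.

```python
from collections import Counter


def token_f1(generated: str, reference: str) -> float:
    """Token-level F1 between a generated note and the signed reference note."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((gen & ref).values())  # shared tokens, counting multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rouge1_recall(generated: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of the reference note's unigrams recovered."""
    gen = Counter(generated.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    return sum((gen & ref).values()) / sum(ref.values())
```

In practice these would run over every (generated note, signed note) pair pulled from the EHR write-back, with BERTScore or another embedding-based measure added for semantic rather than surface overlap.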

AI Insights
Summary
The speaker explains how they evaluate transcript quality and final medical note generation. Transcript issues are often identified through user feedback (misspellings, missed edits). For the final note, they run nightly evaluations comparing generated notes to provider-signed notes using metrics like F1 and ROUGE, plus embedding/BERT-style measures and an "LLM-as-a-judge" workflow. They also note that provider non-edits don’t guarantee perfection, so additional scoring signals are needed to detect issues.
Title
Evaluating Medical Transcripts and Notes with Metrics and LLM Judging
Keywords
user feedback
transcription quality
medical note generation
F1 score
ROUGE
BERTScore
LLM as a judge
evaluation workflow
provider-signed notes
electronic health record
Key Takeaways
  • Transcript quality problems are primarily surfaced via direct user feedback (e.g., misspellings, missed edits).
  • Final note quality is evaluated by comparing generated notes to provider-signed notes in the EHR.
  • Automated metrics such as F1 and ROUGE are used alongside semantic measures and LLM-as-a-judge grading.
  • Evaluations are run nightly to score every note against defined medical-note criteria.
  • Provider acceptance (not editing a note) is not a reliable proxy for correctness, so independent evaluation signals are necessary.
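The nightly LLM-as-a-judge pass described above could be wired up roughly as below. The rubric text, the criteria, and the `call_llm` hook are all hypothetical placeholders (the transcript does not reveal the actual grading criteria); the sketch only illustrates scoring each generated note against the provider-signed one.

```python
import json

# Hypothetical rubric; the real grading criteria are not given in the transcript.
RUBRIC = (
    "Grade the GENERATED clinical note against the SIGNED note on a 1-5 scale "
    "for each criterion: completeness, factual_accuracy, clarity. "
    'Reply with JSON only, e.g. {"completeness": 4, "factual_accuracy": 5, "clarity": 3}.'
)


def judge_note(generated: str, signed: str, call_llm) -> dict:
    """Score one generated note against the provider-signed note.

    `call_llm` is an injected function (prompt -> str) so the judge model
    can be swapped out, or stubbed in tests.
    """
    prompt = f"{RUBRIC}\n\nGENERATED:\n{generated}\n\nSIGNED:\n{signed}"
    scores = json.loads(call_llm(prompt))  # parse the judge's JSON verdict
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores
```

Running this nightly over all notes, alongside the F1/ROUGE metrics, gives the independent signal the speaker describes for catching imperfect notes that providers signed without edits.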
Sentiments
Neutral: The tone is practical and process-focused, describing evaluation methods and safeguards without strong positive or negative emotion.