Hacker News • 75 days ago

60% of medical AI scribes evaluated in Ontario, Canada, got prescription drugs wrong

Key summary

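An audit in Ontario, Canada, found that the "AI Scribe" systems that automatically write up doctors' clinical notes made serious errors, such as recording the wrong prescription drugs in patient notes and fabricating content that never came up in the consultation. The audit points to a lopsided scoring scheme, in which the accuracy of medical notes counted for only 4% of the evaluation score, as a reason inaccurate systems could be selected. The findings underscore the need for thorough validation when adopting AI in healthcare.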

Translated text

AI + ML: Sick and wrong: Ontario auditors find doctors' AI note takers routinely blow basic facts

60% of evaluated AI Scribe systems mixed up prescribed drugs in patient notes, auditors say.

By Brandon Vigliarolo | Published Thursday, 14 May 2026 // 21:50 UTC

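The AI systems approved for healthcare providers in Ontario, Canada, routinely missed critical details, inserted incorrect information, and hallucinated content that neither patients nor clinicians had mentioned, according to a provincial audit of 20 approved vendors' systems.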

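The findings come from the Office of the Auditor General of Ontario and are part of a larger report on the state of AI use by public services in the province. They specifically address the AI Scribe program that the Ontario Ministry of Health initiated for physicians, nurse practitioners, and other healthcare professionals across the broader health sector.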

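As part of the procurement process, officials ran evaluations using simulated doctor-patient recordings. Medical professionals then reviewed the original recordings alongside the AI-generated notes to assess their accuracy.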

What they found was, frankly, shocking for anyone concerned about the accuracy of AI in critical situations.

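Nine of the 20 AI systems reportedly "fabricated information and made suggestions to patients' treatment plans" that were never discussed in the recordings. According to the report, evaluators spotted potentially devastating incorrect information in the sample reports, such as statements that no masses were found or that patients were anxious, even though neither was ever discussed in the recordings.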

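Twelve of the 20 systems evaluated (60 percent) inserted incorrect drug information into patient notes, while 17 of the systems "missed key details about the patients' mental health issues" that were discussed in the recordings. Six of the systems "missed the patients' mental health issues fully or partially or were missing key details," per the report.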

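OntarioMD, a group that supports physicians in adopting new technologies and was involved in the AI Scribe procurement process, has recommended that doctors manually review their AI notes for accuracy. The report notes, however, that none of the approved AI Scribe systems has a mandatory attestation feature.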

Bad evaluations don't help, either

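AI systems making mistakes isn't exactly shocking. As previously reported, consumer-focused AI tends to give users bad medical information, and some studies have found that large language models (LLMs) failed to produce appropriate differential diagnoses in roughly 80 percent of tested cases. But the tools evaluated here are meant for doctors, not consumers, and such poor performance demands an explanation.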

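A good portion of the report criticizes how the systems were evaluated. According to the report, the weights given to the various categories of AI Scribe performance were skewed. While 30 percent of a platform's evaluation score depended solely on whether the vendor had a domestic presence in Ontario, the accuracy of medical notes contributed only 4 percent of the total score.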

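Bias controls accounted for only 2 percent of the total evaluation score; threat, risk, and privacy assessments counted for another 2 percent; and SOC 2 Type 2 compliance contributed an additional 4 percentage points. In other words, criteria tied to accuracy, bias controls, and key security and privacy safeguards made up only a small portion of the total evaluation score for the AI Scribe systems.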
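
The weights quoted above can be tallied in a few lines. What follows is a minimal Python sketch, not the auditors' scoring method: it uses only the five weights reported in the article, which sum to 42 percent (the remaining 58 percent is not itemized here). The criterion key names, the two example vendors, and the `weighted_points` helper are hypothetical and exist only for illustration.

```python
# Weights reported in the article, in percent of the total evaluation score.
# The dictionary keys are illustrative labels, not names taken from the audit.
REPORTED_WEIGHTS = {
    "domestic_presence": 30,
    "medical_note_accuracy": 4,
    "soc2_type2_compliance": 4,
    "bias_controls": 2,
    "threat_risk_privacy_assessment": 2,
}

# The criteria the article groups as accuracy, bias, security and privacy safeguards.
SAFEGUARD_CRITERIA = (
    "medical_note_accuracy",
    "soc2_type2_compliance",
    "bias_controls",
    "threat_risk_privacy_assessment",
)


def weighted_points(weights, ratings):
    """Sum weight * rating, where each rating is a fraction between 0 and 1.

    Criteria missing from `ratings` are treated as a rating of 0.
    """
    return sum(weights[name] * ratings.get(name, 0.0) for name in weights)


# Combined weight of the safeguard criteria versus domestic presence alone.
safeguard_weight = sum(REPORTED_WEIGHTS[name] for name in SAFEGUARD_CRITERIA)

# Share of the total score that the article does not break down.
unreported_weight = 100 - sum(REPORTED_WEIGHTS.values())

# Hypothetical vendor 1: based in Ontario, zero marks on every safeguard.
local_but_inaccurate = {"domestic_presence": 1.0}

# Hypothetical vendor 2: full marks on every safeguard, no Ontario presence.
accurate_but_remote = {name: 1.0 for name in SAFEGUARD_CRITERIA}

local_points = weighted_points(REPORTED_WEIGHTS, local_but_inaccurate)
remote_points = weighted_points(REPORTED_WEIGHTS, accurate_but_remote)

print(f"Safeguard criteria combined: {safeguard_weight}%")
print(f"Domestic presence alone: {REPORTED_WEIGHTS['domestic_presence']}%")
print(f"Weight not itemized in the article: {unreported_weight}%")
print(f"Local but inaccurate vendor: {local_points} points")
print(f"Accurate but remote vendor: {remote_points} points")
```

Under these assumptions, a vendor with full marks on all four safeguard criteria but no Ontario presence collects 12 points from the reported categories, against 30 points for a vendor with a local presence and zero marks on every safeguard. How the unreported 58 percent was distributed would change the totals, so the comparison only illustrates the relative pull of the reported weights.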

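"Inaccurate weightings could result in the selection of vendors whose AI tools may produce inaccurate or biased medical records or lack adequate protection to safeguard sensitive personal health information," the report said of the scoring regime.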

Original article (English)

AI + ML

Sick and wrong: Ontario auditors find doctors' AI note takers routinely blow basic facts

60% of evaluated AI Scribe systems mixed up prescribed drugs in patient notes, auditors say

Brandon Vigliarolo
Published Thu 14 May 2026 // 21:50 UTC

The AI systems approved for Ontario healthcare providers routinely missed critical details, inserted incorrect information, and hallucinated content that neither patients nor clinicians mentioned, according to a provincial audit of 20 approved vendors’ systems.

The findings come from the Office of the Auditor General of Ontario, Canada, and are included in a larger report about the state of AI usage by public services in the province. They specifically address the AI Scribe program the Ontario Ministry of Health initiated for physicians, nurse practitioners, and other healthcare professionals across the broader health sector.

As part of the procurement process, officials conducted evaluations using simulated doctor-patient recordings. Medical professionals then reviewed the original recordings alongside the AI-generated notes to evaluate their accuracy.

What they found was, frankly, shocking for anyone concerned about the accuracy of AI in critical situations.

Nine out of 20 AI systems reportedly “fabricated information and made suggestions to patients' treatment plans” that weren’t discussed in the recordings. According to the report, evaluators spotted potentially devastating incorrect information in the sample reports, such as no masses being found, or patients being anxious, even though these things were never discussed in the recordings.

Twelve of the 20 systems evaluated inserted incorrect drug information into patient notes, while 17 of the systems “missed key details about the patients’ mental health issues” that were discussed in the recordings. Six of the systems “missed the patients’ mental health issues fully or partially or were missing key details,” per the report.

OntarioMD, a group that offers support for physicians in adopting new technologies and was involved in the AI Scribe procurement process, has recommended that doctors manually review their AI notes for accuracy, but the report notes there’s no mandatory attestation feature in any of the AI Scribe-approved systems.

Bad evaluations don’t help, either

AI systems making mistakes isn’t exactly shocking. As we’ve reported previously, consumer-focused AI has a tendency to provide bad medical information to users, and some studies have found large language models failed to produce appropriate differential diagnoses in roughly 80 percent of tested cases. But the tools evaluated here are for doctors, not consumers, and such poor performance necessitates explanation.

A good portion of the report blames how the systems were evaluated. According to the report, the weight given to various categories of AI Scribe performances was wonky. While 30 percent of a platform’s evaluation score depended solely on whether they had a domestic presence in Ontario, the accuracy of medical notes contributed only 4 percent to the total score.

Bias controls accounted for only 2 percent of the total evaluation score; threat, risk, and privacy assessments counted for another 2 percent; and SOC 2 Type 2 compliance contributed an additional 4 percentage points. In other words, criteria tied to accuracy, bias controls, and key security and privacy safeguards made up only a small portion of the total evaluation score for the AI Scribe systems.

“Inaccurate weightings could result in the selection of vendors whose AI tools may produce inaccurate or biased medical records or lack adequate protection to safeguard sensitive personal health information,” the report said of the scoring regime.

The Register reached out to the Ontario Health Ministry for its take on the report, and whether it was going to conform to its recommendations for the AI Scribe program, but we didn’t immediately hear back. A spokesperson for the Ministry told the CBC on Wednesday that more than 5,000 physicians in Ontario are participating in the AI Scribe program and there have been no known reports of patient harms associated with the technology.

Medical AI • AI hallucination • AI regulation • AI evaluation criteria