The Decoder • 89 days ago

Google DeepMind's AI beats GPT-5.4 but still falls short of experienced physicians

IMP

8/10

Key Summary

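Google DeepMind's "AI co-clinician" outperformed existing AI models, including GPT-5.4, in blind evaluations, demonstrating strong performance on diagnostic and medication questions. In a more complex simulation resembling real clinical practice, however, it still fell short of experienced physicians in core areas such as catching critical warning signs and guiding physical examinations. The study suggests that AI cannot replace physicians and should instead assist clinical teams under close clinical supervision.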

Translated Article

Google DeepMind unveils "AI co-clinician" that beats GPT-5.4, but it still trails experienced physicians

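Google DeepMind is developing an "AI co-clinician" to assist with patient care. The system shows promising results in simulation studies but still trails experienced physicians. The research also shows why ChatGPT's current voice mode is not yet suited to serious, complex tasks, let alone medical consultations.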

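The AI co-clinician is built around a concept the researchers call "triadic care": AI agents help patients through their treatment while physicians retain clinical authority and oversight. The goal is an AI system that works as a member of the medical team, supporting patients under a physician's direction.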

To evaluate the system from a clinician's perspective, the team worked with academic physicians to adapt the NOHARM framework, checking for two types of mistakes: errors of commission and errors of omission.

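In a blind comparison using 98 realistic primary care queries, physicians consistently preferred the AI co-clinician's answers over those from an existing clinical AI system and from GPT-5.4-thinking-with-search. It won 67 to 26 against the existing clinical AI system and 63 to 30 against the GPT-5.4 model. In the objective analysis, the system made a critical error in only one of the 98 cases.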
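For readers who want the head-to-head figures as shares, the following is a rough sketch of the arithmetic. It assumes that "67 to 26" and "63 to 30" are counts of cases out of the 98 queries, which the article does not state explicitly. The article also does not say what happened in the cases not covered by those counts, so the sketch simply labels them "unaccounted". The function name `preference_shares` is made up for this illustration.

```python
def preference_shares(wins: int, losses: int, total: int) -> dict:
    """Turn a head-to-head tally into shares.

    Assumption: wins and losses are case counts out of `total` queries.
    """
    unaccounted = total - wins - losses  # cases the article does not break down
    return {
        "unaccounted": unaccounted,
        "win_share_all": wins / total,                # share of all queries
        "win_share_decided": wins / (wins + losses),  # share of decided cases only
    }


vs_clinical_tool = preference_shares(wins=67, losses=26, total=98)
vs_gpt = preference_shares(wins=63, losses=30, total=98)

for label, result in (("existing clinical AI", vs_clinical_tool), ("GPT-5.4", vs_gpt)):
    print(
        f"vs {label}: {result['win_share_all']:.3f} of all queries, "
        f"{result['win_share_decided']:.3f} of decided cases, "
        f"{result['unaccounted']} unaccounted"
    )
```

Under that reading, the co-clinician was preferred in roughly 68% and 64% of all queries, or about 72% and 68% of the cases where a preference was recorded.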

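The lead was even larger on medication questions. The RxQA benchmark consists of 600 questions on active ingredients, interactions, and dosages, drawn from the national drug directories of two countries and verified by licensed pharmacists. These questions are difficult even for primary care physicians, who answered 61.3% correctly with reference books and only 48.3% without. The AI co-clinician scored 73.3%, narrowly ahead of the GPT-5.4 model at 72.7%. The gap widened when the questions were posed as open-ended rather than multiple choice, the way physicians actually look up medications on the job: here the AI co-clinician reached a quality score of 95.0%, compared with 90.9% for OpenAI's model.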

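Multimodal telemedicine puts AI in the exam room

Beyond text-based support, Google DeepMind is testing how the AI co-clinician handles real-time audio and video in telemedicine. Working with physicians at Harvard and Stanford, the researchers ran a randomized simulation study with 20 synthetic clinical scenarios, 10 physicians playing patient actors, and 120 hypothetical telemedicine visits in total. The system showed capabilities beyond what text-only systems can do. For example, it corrected a patient's inhaler technique and guided patients through a shoulder exam to identify a rotator cuff injury.

In patient-facing conversations, the AI co-clinician runs on a "dual-agent" setup: a "Planner" module watches the conversation and keeps the "Talker" agent within safe clinical limits. When physicians use the system, it prioritizes solid clinical evidence and runs verification and citation checks while looking up medication information.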
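The Planner and Talker arrangement can be pictured as a supervisor that reviews each drafted reply before it reaches the patient. The sketch below is purely illustrative and is not DeepMind's implementation: the article does not say how the Planner enforces limits, so a simple phrase check stands in for it, canned replies stand in for a model, and every name here (`Planner`, `Talker`, `respond`, `FALLBACK`, the blocked phrases) is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical reply used whenever the planner rejects a draft.
FALLBACK = "I want to check that with your clinician before advising you."


@dataclass(frozen=True)
class Planner:
    """Hypothetical supervisor that approves or rejects a drafted reply."""

    # Assumption: a phrase list stands in for whatever safety logic the real system uses.
    blocked_phrases: tuple[str, ...] = ("stop taking", "double your dose")

    def approves(self, draft: str) -> bool:
        lowered = draft.lower()
        return not any(phrase in lowered for phrase in self.blocked_phrases)


@dataclass(frozen=True)
class Talker:
    """Hypothetical conversational agent; canned replies stand in for a model."""

    canned_replies: dict[str, str]

    def draft(self, patient_message: str) -> str:
        return self.canned_replies.get(patient_message, "Could you tell me more?")


def respond(planner: Planner, talker: Talker, patient_message: str) -> str:
    """Return the talker's draft only if the planner approves it."""
    draft = talker.draft(patient_message)
    return draft if planner.approves(draft) else FALLBACK


talker = Talker({
    "How do I use my inhaler?": "Shake it, breathe out fully, then inhale slowly as you press.",
    "My pills make me dizzy.": "You should stop taking them today.",
})
planner = Planner()

safe = respond(planner, talker, "How do I use my inhaler?")
blocked = respond(planner, talker, "My pills make me dizzy.")
print(safe)
print(blocked)
```

The design point the sketch tries to capture is the separation of roles: the agent that talks never sends text directly, and a second component decides whether the draft stays inside agreed limits.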

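Experienced physicians still come out on top

The study scored more than 140 aspects of consultation quality across seven areas: triage, history taking, clinical reasoning, communication and counseling, treatment steps, spotting warning signs, and physical exams. The conclusion may be sobering for anyone hoping AI could replace physicians. Experienced physicians still outperformed the AI system, especially in catching critical warning signs and guiding physical examinations.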


Original Article (English)

Google Deepmind's "AI co-clinician" beats GPT-5.4 in blind doctor tests but still trails experienced physicians

Matthias Bastian
May 1, 2026

Key Points

Google Deepmind is building an "AI co-clinician" designed to assist doctors in diagnosing and treating patients during everyday medical care. In blind evaluations using realistic general practitioner scenarios, physicians rated the system's responses higher than those from other AI tools, including GPT-5.4-thinking-with-search. Despite these promising results, a simulation study showed that experienced doctors still outperformed the AI, particularly when it came to identifying critical warning signs and conducting physical examinations.

Google Deepmind is building an "AI co-clinician" to help doctors care for patients. The system shows promising results in simulation studies but still trails experienced physicians. The research also shows why ChatGPT's voice mode isn't ready for serious tasks, let alone medical consultations.

The "AI co-clinician" is built around what the researchers call "triadic care": AI agents help patients through their treatment while doctors keep clinical authority and oversight. The idea is to have an AI system that works as a member of the medical team, supporting patients under a clinician's supervision.

To evaluate the system from a clinician's perspective, the team worked with academic physicians to adapt the NOHARM framework, checking for two types of mistakes: errors of commission and errors of omission.

In a blind comparison using 98 realistic primary care queries, doctors consistently picked the AI co-clinician's answers over leading evidence synthesis tools. It won 67 to 26 against an existing clinical AI system and 63 to 30 against GPT-5.4-thinking-with-search. In the objective analysis, the system logged a critical error in one of the 98 cases.

The lead was even bigger on medication questions. The RxQA benchmark covers 600 questions on active ingredients, interactions, and dosages, drawn from national drug directories in two countries and vetted by licensed pharmacists. These questions are tough for primary care doctors: with reference books, they got 61.3 percent right, and just 48.3 percent without. The AI co-clinician scored 73.3 percent, just ahead of GPT-5.4-thinking-with-search at 72.7 percent. The gap widened when questions were asked open-ended rather than as multiple choice, the way doctors actually look things up on the job. Here the AI co-clinician hit a quality score of 95.0 percent, compared to 90.9 percent for OpenAI's model.

Multimodal telemedicine puts AI in the exam room

Beyond text-based support, Google Deepmind is testing how the AI co-clinician handles real-time audio and video for telemedicine. Working with physicians at Harvard and Stanford, the team ran a randomized simulation study with 20 synthetic clinical scenarios, 10 doctors playing patient actors, and 120 hypothetical telemedicine visits in total. The AI co-clinician showed capabilities that go beyond what text-only systems can do. It corrected a patient's inhaler technique and walked patients through shoulder exams to spot a rotator cuff injury.

For patient-facing conversations, the AI co-clinician runs on a dual-agent setup: a "Planner" module watches the conversation to make sure the "Talker" agent stays within safe clinical limits. When doctors use the system, it prioritizes solid clinical evidence and runs verification and citation checks during lookups.

Experienced doctors still come out on top

The study scored more than 140 aspects of consultation quality across seven areas: triage, history taking, clinical reasoning, communication and counseling, treatment steps, spotting warning signs, and physical exams.

The takeaway is sobering for anyone hoping AI can replace a doctor: experienced physicians beat the AI overall, especially when it came to catching "red flags" and guiding critical physical exams. Still, the AI co-clinician matched or beat primary care physicians in 68 of the 140 areas evaluated. OpenAI's GPT-realtime trailed both in all seven domains.

The researchers conclude that systems like this work best as support tools for doctors, not as a replacement for clinical judgment. It's still unclear whether the research project will turn into an actual product. The results show progress in AI-driven evidence synthesis and telemedicine consultations, but they also make clear there's still a gap to close with experienced doctors, especially on safety-critical tasks like catching warning signs.

"While it's early days, the promise is clear," says Deepmind researcher Alan Karthikesalingam.

Source: Google Deepmind

AI in healthcare, Google DeepMind, GPT-5.4, AI co-clinician, clinical trials