The Decoder • 97 days ago

OpenAI's Medical ChatGPT Outscores Doctors on a Clinical Benchmark

IMP

9/10

Key Summary

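OpenAI has launched ChatGPT for Clinicians, a free tool for medical professionals. On the HealthBench Professional benchmark released alongside it, a customized GPT-5.4 model scored 59.0 points, well above the 43.7 points scored by practicing physicians who had unlimited time and internet access. The result is a notable milestone, suggesting that AI has strong potential to go beyond basic support and assist with physicians' core work in clinical practice.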

Translated Article

OpenAI has launched ChatGPT for Clinicians, a free version of ChatGPT for medical professionals. According to a new benchmark, GPT-5.4 outperforms human doctors on clinical tasks, even when those doctors have unlimited time and internet access.

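OpenAI has released a version of ChatGPT built specifically for clinical work. The tool is free for verified physicians, nurses with advanced clinical qualifications, physician assistants, and pharmacists in the US. Alongside it, the company published HealthBench Professional, a new benchmark for clinical AI tasks. According to OpenAI, GPT-5.4 outperforms human doctors on this benchmark.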

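A benchmark built to be hard

HealthBench Professional measures AI performance across three clinical areas: consultations, writing and documentation, and medical research. It builds on the earlier HealthBench and uses doctor-written conversations, multi-level physician scoring, and targeted data filtering.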

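OpenAI says the benchmark was designed to be very demanding. About a third of all examples come from targeted red teaming, in which doctors actively tried to find weaknesses in the models. The hardest conversations were overrepresented by a factor of 3.5.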

GPT-5.4 running in the ChatGPT for Clinicians environment scored 59.0 overall on HealthBench Professional. Doctor-written responses reached only 43.7, even with unlimited time and internet access. Every other model tested scored well below the Clinicians version: base GPT-5.4 scored 48.1, Anthropic's Claude Opus 4.7 scored 47.0, Google's Gemini 3.1 Pro scored 43.8, and xAI's Grok 4.2 scored 36.1.

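GPT-5.4 in the clinical workspace scored about 11 points higher than base GPT-5.4 (59.0 vs. 48.1). Whether that gap comes from the clinical setup itself or from the way the benchmark was built is unclear, and benchmark scores do not necessarily carry over to real clinical practice.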
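The gaps quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary labels and the helper `gaps_to` are hypothetical names introduced here, and the numbers are simply the overall scores reported in the article.

```python
# Overall HealthBench Professional scores as reported in the article.
# The labels are illustrative, not official names.
scores = {
    "GPT-5.4 (ChatGPT for Clinicians)": 59.0,
    "GPT-5.4 (base)": 48.1,
    "Claude Opus 4.7": 47.0,
    "Gemini 3.1 Pro": 43.8,
    "Physicians": 43.7,
    "Grok 4.2": 36.1,
}


def gaps_to(reference, table):
    """Return each entry's point difference relative to the reference entry."""
    ref = table[reference]
    return {
        name: round(value - ref, 1)
        for name, value in table.items()
        if name != reference
    }


# Point gap of every model relative to the physician baseline.
gaps = gaps_to("Physicians", scores)
for name, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gap:+.1f} points vs. physicians")

# Gain of the Clinicians workspace over the base model.
workspace_gain = round(
    scores["GPT-5.4 (ChatGPT for Clinicians)"] - scores["GPT-5.4 (base)"], 1
)
print(f"Clinicians workspace vs. base GPT-5.4: {workspace_gain:+.1f} points")
```

Run as written, this puts the Clinicians configuration 15.3 points ahead of physicians and 10.9 points ahead of base GPT-5.4, which the article rounds to about 11. It also shows Gemini 3.1 Pro only 0.1 points above the physician baseline.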

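99.6 percent of responses rated reliable

There is an obvious methodological issue here: OpenAI built the benchmark and tested its own models on it. The company points to third-party evaluations such as Stanford's MedHELM and MedMarks, where OpenAI models also rank at the top, and adds that the benchmark and dataset are publicly available.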

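OpenAI says it developed ChatGPT for Clinicians with hundreds of medical advisors. Before launch, doctors tested 6,924 conversations in their everyday clinical work, and 99.6 percent of the responses were rated safe and accurate, according to Karan Singhal of OpenAI's Health unit.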
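To put that percentage in absolute terms, a rough back-of-the-envelope calculation helps. The sketch below assumes one rated response per conversation and that 99.6 percent is rounded to one decimal place; the article states neither, so the result is an estimate rather than a reported figure. All variable names are hypothetical.

```python
import math

conversations = 6924   # conversations tested before launch (from the article)
rated_ok_share = 0.996  # share rated safe and accurate (from the article)

# Assumption: exactly one rated response per conversation.
not_ok_estimate = round(conversations * (1 - rated_ok_share))

# Assumption: 99.6% is a rounded figure, so the true share of responses
# not rated safe and accurate lies between 0.35% and 0.45%.
not_ok_low = math.ceil(conversations * 0.0035)
not_ok_high = math.floor(conversations * 0.0045)

print(f"Estimated responses not rated safe and accurate: {not_ok_estimate}")
print(f"Plausible range under rounding: {not_ok_low} to {not_ok_high}")
```

Under these assumptions, roughly 28 of the 6,924 responses (somewhere between 25 and 31) were not rated safe and accurate.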

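In a subset of 355 cases where three independent doctors each specified the correct sources, ChatGPT for Clinicians cited those sources more often than human doctors did. In total, more than 700,000 model responses have been reviewed by physicians so far. OpenAI stresses that the tool is meant to support clinicians, not to replace their judgment.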

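Clinical search, reusable workflows, and CME credits

According to OpenAI, ChatGPT for Clinicians offers free access to the company's latest top-tier models, a clinical search function that draws on millions of peer-reviewed publications, templates for recurring workflows, and automatic recognition of continuing medical education (CME) credits.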

Original Article (English)

OpenAI says its new ChatGPT for Clinicians outperforms doctors on clinical tasks even when they have unlimited time and web access

Matthias Bastian
Apr 23, 2026

Key Points

OpenAI has launched "ChatGPT for Clinicians," a free AI tool designed specifically for everyday medical practice, available to verified healthcare professionals in the USA.

The system includes features like real-time clinical searches across specialist literature, templates for recurring workflows, and automatic recognition of continuing medical education credits.

Alongside the launch, OpenAI published the "HealthBench Professional" benchmark, where the customized GPT-5.4 version scored 59.0 points, outperforming human doctors, who scored 43.7 points despite having unlimited time and internet access.

OpenAI is rolling out ChatGPT for Clinicians, a free version of its chatbot for medical professionals. A new benchmark claims GPT-5.4 beats human doctors on clinical tasks, even when those doctors have unlimited time and internet access.

OpenAI has launched a version of ChatGPT built specifically for clinical work. It's free for verified physicians, nurses with advanced clinical qualifications, physician assistants, and pharmacists in the US. Alongside it, the company is releasing HealthBench Professional, a new benchmark for clinical AI tasks. According to OpenAI, GPT-5.4 outperforms human doctors on it.

A benchmark built to be hard

HealthBench Professional measures AI performance across three clinical areas: consultations, writing and documentation, and medical research. It builds on the earlier HealthBench and uses doctor-written conversations, multi-level physician scoring, and targeted data filtering.

OpenAI says the benchmark was designed to be tough. About a third of the examples come from targeted "red teaming," where doctors actively tried to find weaknesses in the models. The hardest conversations were overrepresented by a factor of 3.5.

GPT-5.4 running in the ChatGPT for Clinicians workspace scored 59.0 overall on HealthBench Professional. Doctor-written responses came in at 43.7, even with unlimited time and internet access. Every other model tested scored well below the Clinicians version: the base GPT-5.4 hit 48.1, Anthropic's Claude Opus 4.7 reached 47.0, Google's Gemini 3.1 Pro scored 43.8, and xAI's Grok 4.2 landed at 36.1.

GPT-5.4 in the Clinicians workspace scores about 11 points higher than the base GPT-5.4 (59.0 vs. 48.1). How much of that comes from the clinical setup itself versus the way the benchmark is built is unclear, and benchmark scores don't necessarily translate to real clinical practice.

99.6 percent of answers rated reliable

There's an obvious methodological wrinkle here: OpenAI built the benchmark and tested its own models on it. The company points to third-party evaluations like Stanford's MedHELM and MedMarks, where OpenAI models also rank at the top, and the benchmark and dataset are openly available.

OpenAI says ChatGPT for Clinicians was developed with hundreds of medical advisors. Before launch, doctors tested 6,924 conversations in their everyday clinical work, and 99.6 percent of the responses were rated safe and accurate, according to Karan Singhal from OpenAI's Health unit.

In a subset of 355 examples where three independent doctors each specified correct sources, ChatGPT for Clinicians cited those sources more often than human doctors did. In total, more than 700,000 model responses have been reviewed by physicians so far. OpenAI stresses that the tool is meant to support clinicians, not replace their judgment.

Clinical search, reusable workflows, and CME credits

According to OpenAI, ChatGPT for Clinicians comes with free access to the company's current frontier models, a clinical search function that pulls from millions of peer-reviewed sources with real-time citations, and a deep research feature for medical literature. There are also "skills," which let clinicians turn recurring workflows, like referral letters, prior authorizations, or patient instructions, into reusable templates. One unusual feature: clinical research done in ChatGPT can count toward continuing medical education (CME) credits in the US.

On privacy, OpenAI says conversations won't be used for model training. Optional HIPAA compliance through a Business Associate Agreement is available for users handling protected health information.

US-only for now, global rollout planned

ChatGPT for Clinicians is launching only for verified clinicians in the US. OpenAI plans to expand internationally and is working with the Better Evidence Network on pilot projects outside the country. The company is also publishing a Health Blueprint with recommendations for responsibly integrating AI into the US healthcare system.

The push comes as AI adoption in medicine accelerates. A 2026 survey from the American Medical Association found that 72 percent of US doctors now use AI in clinical practice, up from 48 percent the year before. OpenAI says millions of clinicians worldwide already use ChatGPT weekly, with usage more than doubling over the past year.

Earlier this year, OpenAI launched ChatGPT for Healthcare for organizations, giving health systems institutional-level compliance and administrative controls. Anthropic, Microsoft, and Google are all pushing into the medical market with their own AI models too, with Google focusing especially on drug development through Google Deepmind.
Source: OpenAI

OpenAI, Medical AI, GPT-5.4, Benchmark, ChatGPT