r/singularity • 89 days ago

AI Outperforms Emergency Room Doctors at Diagnosis, but 'Collaboration' Is Key

Key summary

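According to a study recently published in the journal Science, a state-of-the-art large language model (LLM) matched or exceeded human physicians on diagnosis, triage, and follow-up decisions based on real emergency room patient data. The AI performed especially well in early-stage triage, where only limited information is available, and in situations with high uncertainty. The researchers stressed, however, that AI is not about to replace doctors: it should first go through rigorous clinical trials and then be used as a collaborative, assistive tool alongside human physicians.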

Translated text

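Have you ever wondered how artificial intelligence would perform compared with a human physician in an emergency care setting? New research published Thursday gives reason to think about that question again.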

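The study, published in the journal Science, found that a state-of-the-art large language model outperformed human doctors on a range of common clinical tasks. Using real emergency room data and comparisons with hundreds of physicians, the model matched or exceeded human clinicians in diagnostic choices, emergency triage, and deciding the next steps in patient management.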

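The study's authors said these results do not mean AI models are ready to replace human doctors. Instead, the results suggest that healthcare industry professionals need faster, more rigorous standards for evaluating AI, along with rules for using AI in medicine.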

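The researchers tested OpenAI's o1 series large language model, released in 2024, across six experiments that combined standardized clinical cases with a randomly selected sample of real emergency room patients at a medical center in Massachusetts.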

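The model's advantage was most pronounced in early-stage triage, when quick decisions must be made with limited information. Both the human clinicians and the AI model improved as more data became available, but the study found that the LLM handled uncertainty far better and made more effective use of fragmented or unstructured health data and medical notes.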

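These findings build on decades of work evaluating medical computing systems with difficult diagnostic cases. Earlier generations of LLMs had already outperformed older algorithmic approaches, but what sets this study apart is its scale and its direct head-to-head comparison between human doctors and AI in real clinical scenarios.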

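The authors stressed that these results should still be viewed critically. Real clinical work in hospitals and emergency rooms often relies on visual and auditory cues rather than text-based reasoning, and AI cannot interpret those cues fully and accurately. "Future work is needed to assess how humans and machines may effectively collaborate in the use of nontext signals," the study notes.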

When considering AI-assisted medical care, it is also critical to assess safety, equity, and cost-effectiveness, none of which were tested in this study.

Read also: If AI Health Advice From Apple Is Coming, I Want to Be Ready

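"Long story short, the model outperformed our very large physician baseline. You'll see this in detail, but this included board-certified, actively practicing physicians and real messy cases," Arjun Manrai, an assistant professor of biomedical informatics at Harvard Medical School, said during a virtual press briefing.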

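"I don't think our findings mean that AI replaces doctors, despite what some companies are likely to say, and how they're likely to use these results," Manrai said. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine, and that we need to evaluate this technology now," he added, calling for rigorous validation in prospective clinical trials.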

Regulators, hospitals, and healthcare providers should work together to test these tools thoroughly before deploying them, to ensure safety and equity for all patients.

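In a commentary also published Thursday in Science, Ashley M. Hopkins and Eric Cornelisse, researchers at Flinders University in Australia, wrote that the study is a step toward better evaluation of AI systems in healthcare, but that medicine is a complex field requiring rigorous oversight to ensure patients receive the best possible care.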

"We do not allow doctors to practice without supervision and evaluation, and AI should be held to comparable standards," Cornelisse said in a statement.

Read also: AI Chatbots Miss More Than Half of Medical Diagnoses, Study Finds

Original text (English)

Have you ever thought about how artificial intelligence compares to a human physician in an emergency diagnostic setting? New research published Thursday might have you thinking over this question.

The study, published in the journal Science, found that a state-of-the-art large language model outperformed human doctors on a range of common clinical tasks. Using real emergency department data and hundreds of physician comparisons, the model matched or even exceeded human clinician performance in diagnostic choices, emergency triage and determining next steps in management.

The authors of the study said those results do not mean AI models are ready to replace human doctors. Instead, the results indicate that industry professionals need faster, more rigorous standards for evaluation and rules for using AI in medicine.

The researchers tested OpenAI's o1 series large language model, released in 2024, across six experiments that blended standardized clinical cases with a real-world sample of randomly selected emergency room patients at a medical center in Massachusetts.

The model's advantage was most evident in early-stage triage, when decisions must be made with little information. Both the human clinicians and the AI model improved as more data became available to them, but the study found that the LLM handled uncertainty far better, using fragmented or unstructured health data and notes more effectively.

These findings build on decades of using difficult diagnostic cases to evaluate medical-computing systems. Earlier LLMs already outperformed older algorithmic approaches, but what sets this study apart is the scale and the head-to-head comparison between a human doctor and AI in a real clinical scenario.

The authors stressed that we should remain skeptical of these results. Real clinical work in hospitals and emergency rooms often relies on visual and auditory cues -- rather than text-based reasoning -- which AI cannot interpret fully and accurately.

"Future work is needed to assess how humans and machines may effectively collaborate in the use of nontext signals," the study notes.

When considering AI-assisted medical care, it's also critical to assess whether it will be safe, equitable and cost-effective, aspects that were not tested in this study.

Read also: If AI Health Advice From Apple Is Coming, I Want to Be Ready

"Long story short, the model outperformed our very large physician baseline. You'll see this in detail, but this included board-certified, actively practicing physicians and real messy cases," Arjun Manrai, an assistant professor of Biomedical Informatics at Harvard Medical School, said during a virtual press briefing call.

"I don't think our findings mean that AI replaces doctors, despite what some companies are likely to say, and how they're likely to use these results," Manrai said. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine, and that we need to evaluate this technology now, and rigorously conduct in prospective clinical trials."

Regulators, hospitals and healthcare providers should work together to test these tools thoroughly before they're deployed to ensure safety and equity for all patients.

In a commentary also published Thursday in Science, Ashley M. Hopkins and Eric Cornelisse, researchers at Flinders University in Australia, wrote that the study is a step toward better evaluation of AI systems in healthcare, but that medicine is a complex field that requires rigorous oversight to ensure patients receive the best possible care.

"We do not allow doctors to practice without supervision and evaluation, and AI should be held to comparable standards," Cornelisse said in a statement.

Read also: AI Chatbots Miss More Than Half of Medical Diagnoses, Study Finds

Medical AI • Large language models (LLM) • Medical diagnosis • AI collaborative care • Science