Hacker News • 91 days ago

I asked AI to count carbs 27,000 times, but

IMP

9/10

Key summary

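According to a recently published research preprint, AI-based carbohydrate counting, which feeds directly into insulin dosing for people with diabetes, shows severe errors and hallucinations. When food photos were submitted to the latest AI models more than 500 times each, the models returned a different carbohydrate figure each time for the same photo, with a spread of up to 429g. Because an error of that size can lead to a life-threatening insulin dose, deploying AI agents in medical and health settings calls for extreme caution.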

Translated text

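Ask ChatGPT to estimate the carbs in your lunch. Now ask again, and again. Repeat it 500 times. You would expect the same answer every time: same photo, same model, same question. But you will not get the same answer, or even a similar one. And the differences are large enough to cause a hypoglycaemic emergency.

That is the central finding of a study I have just published as a preprint, and it directly affects anyone who uses AI-based carb counting in a diabetes app.

For this study I submitted 13 food photographs (real meals photographed on a smartphone, the way they would actually be used) to four leading AI models: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro, and Google Gemini 3.1 Pro Preview. Each photo was sent to every model more than 500 times, with the same prompt, the same photo, and the same settings each time, for 26,904 queries in total. All tests ran at the lowest randomness setting these models offer.

The prompt was adapted from the one used in iAPS, an open-source automated insulin delivery system, so it is a real production prompt rather than a toy example.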

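The models disagree with themselves

Every model returned different carbohydrate estimates for the same photo across repeated queries, but the degree of disagreement varied considerably.

Figure: Each dot represents one of the 13 test images. The violin shape shows the spread of the data.

Claude's variation clusters below 5% for most images, while the Gemini models regularly exceed 10-20%.

The worst case? The paella photo. Here is what happened when I sent it to each model more than 500 times:

Figure: Every dot is one query. Same photo, same prompt, same model.

Gemini 2.5 Pro's estimates ranged from 55g to 484g. That is a 429g range, equivalent to 42.9 units of insulin at a 1:10 ICR (insulin-to-carb ratio). Claude's estimates, by comparison, clustered relatively tightly.

A single photo produces an insulin error of 42.9 units. That is not a rounding error. It is a potential fatality.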
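
The 42.9-unit figure is plain arithmetic: a 1:10 ICR means one unit of insulin per 10g of carbohydrate, so a 429g spread in estimates maps to 429 / 10 units. A minimal sketch of that conversion follows; the function name `carbs_to_units` and its parameters are illustrative and not taken from the study.

```python
def carbs_to_units(carbs_g: float, grams_per_unit: float = 10.0) -> float:
    """Convert grams of carbohydrate to insulin units for a given ICR.

    grams_per_unit=10 corresponds to a 1:10 insulin-to-carb ratio.
    """
    if grams_per_unit <= 0:
        raise ValueError("grams_per_unit must be positive")
    return carbs_g / grams_per_unit


# Paella photo, Gemini 2.5 Pro: lowest and highest estimates quoted in the article.
low_g, high_g = 55.0, 484.0
range_g = high_g - low_g                 # width of the estimate range in grams
range_units = carbs_to_units(range_g)    # same width expressed in insulin units
print(f"{range_g:.0f} g range = {range_units:.1f} U")
```

A user with a different ratio would pass their own `grams_per_unit`; the 1:10 value is only the example ratio used in the article.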

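This variation is invisible to the user

When you take a photo in a diabetes app, you get a single number back. You have no way of knowing whether that number is close to the model's typical consensus or an extreme outlier from a distribution you cannot see. With Claude, that single number is probably close to the model's consensus. With Gemini 2.5 Pro, there is no telling where the result will land.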

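The cheese sandwich that defeats AI

Here is one that should be very easy: two slices of thick white bread (carbs on the packet: 20g per slice) plus cheddar cheese (negligible carbs). The reference value is 40g. It is simple, unambiguous, and matches the packet label exactly.

Figure: Left: three models independently converged on about 28g, which is 12g too low. Right: GPT-5.4 estimated about 74g with high variability, which is 34g off in the other direction. The red dashed line is the actual value.

Three of the four models (Claude, Gemini 2.5 Pro, and Gemini 3.1 Pro) independently converged on about 28g for a 40g meal. Across 510 queries, Claude had a coefficient of variation (CV) of 0.3%, and every value was 12g below the actual value. The bread is right there in the photo, and the carb value is on the packet. This is the "precisely wrong" problem: high consistency does not guarantee accuracy.

A diabetes app user who gets 28g every time would underdose by about 1.2 units every time. GPT-5.4 went the other way: its mean estimate was 74g, nearly double the reference value, and it was highly variable as well.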
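
The difference between consistency and accuracy becomes concrete when spread and bias are computed separately. The sketch below uses made-up sample values chosen only to resemble the pattern described (tight clustering near 28g against a 40g reference). They are not data from the study, and the function names are illustrative.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a fraction of the mean (spread, i.e. precision)."""
    return statistics.stdev(values) / statistics.mean(values)


def mean_bias(values, reference):
    """Mean estimate minus the reference value (systematic error, i.e. accuracy)."""
    return statistics.mean(values) - reference


# Illustrative values only, NOT study data.
estimates_g = [28.0, 28.1, 27.9, 28.0, 28.0]
reference_g = 40.0
grams_per_unit = 10.0  # 1:10 ICR, the example ratio used in the article

cv = coefficient_of_variation(estimates_g)
bias_g = mean_bias(estimates_g, reference_g)
bias_units = bias_g / grams_per_unit

print(f"CV = {cv:.1%}, bias = {bias_g:+.1f} g = {bias_units:+.1f} U")
```

A very small CV alongside a bias of about -12g is exactly the "precisely wrong" case: the repeated answers agree with each other but not with the packet label.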

The models don't always know what they're looking at

I found food identification errors in 8 of the 13 test images:

Bakewell tart: Claude called it a "Linzer torte" in 100% of 510 queries. GPT-5.4 called it a "jam tart" or "cake bar". Only Gemini 3.1 Pro named it correctly (99.8%).
Crema catalana: Three of the four models called it "creme brulee" 100% of the time. Only Gemini 3.1 Pro answered "crema catalana", and only in 3.4% of queries.
Cheese sandwich: Gemini 3.1 Pro added non-existent "deli meat" in 17.4% of queries, hallucinating an ingredient that is not there. This could directly inflate carbohydrate estimates.

Some of these misidentifications have little nutritional impact. Others could change the carbohydrate estimate substantially.

Where does your insulin dose actually land?
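
The original English text grades each query by the size of the insulin error it would cause: safe (under 1U), moderate (1-2U), clinically significant (2-5U), and severe hypo risk (over 5U). A minimal sketch of that grading follows. The function name, the 1:10 ICR default, and the handling of values that fall exactly on a boundary are assumptions, since the article does not spell them out.

```python
def insulin_error_zone(estimate_g, reference_g, grams_per_unit=10.0):
    """Grade one carb estimate by the absolute insulin error it implies.

    Boundaries follow the zones named in the article; treating exactly
    2U as "moderate" and exactly 5U as "clinically significant" is an
    assumption made for this sketch.
    """
    error_units = abs(estimate_g - reference_g) / grams_per_unit
    if error_units < 1:
        return "safe"
    if error_units <= 2:
        return "moderate"
    if error_units <= 5:
        return "clinically significant"
    return "severe hypo risk"


# Cheese sandwich figures quoted in the article (40 g reference).
print(insulin_error_zone(28, 40))   # 12 g low at 1:10
print(insulin_error_zone(74, 40))   # 34 g high at 1:10
```

The grading uses the absolute error, so it does not distinguish overdosing from underdosing; a real tool would likely need to keep the sign as well.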


Original text (English)

Ask ChatGPT to estimate the carbs in your lunch. Now ask it again. And again. Five hundred times.

You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close, and the differences are large enough to cause a hypoglycaemic emergency.

That’s the central finding of a study I’ve just published as a preprint, and it has direct implications for anyone using AI-powered carb counting in a diabetes app.

The study

I submitted 13 food photographs (real meals, photographed on a phone, the way you’d actually use them) to four leading AI models: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro and Google Gemini 3.1 Pro Preview. Each photo was sent over 500 times to each model. Same prompt every time. Same photo. Same settings. 26,904 queries in total. All at the lowest randomness setting these models offer.

The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system. It’s a real production prompt, not a toy example.

The models disagree with themselves

Every model returned different carbohydrate estimates for the same photo across repeated queries. But the degree of disagreement varies enormously.

Figure: Each dot is one of the 13 test images. The violin shape shows the spread.

Claude’s variation clusters below 5% for most images; the Gemini models regularly exceed 10-20%.

The worst case? The paella photo. Here’s what happened when I sent it to each model 500+ times:

Figure: Every dot is one query. Same photo. Same prompt. Same model.

Gemini 2.5 Pro’s estimates span from 55g to 484g, a 429g range, equivalent to 42.9 units of insulin at a 1:10 ICR. Claude’s estimates cluster tightly by comparison.

42.9 units of insulin from a single photo. That’s not a rounding error. That’s a potential fatality.

This variation is invisible to you

When you take a photo in a diabetes app, you get one number back.
You have absolutely no way to know whether you received a typical estimate or a tail-end outlier from a distribution you can’t see. For Claude, that single number is probably close to the model’s consensus. For Gemini 2.5 Pro, you could be anywhere on the map.

The cheese sandwich that defeats AI

Here’s one that should be easy. Two slices of thick white bread (carbs on the packet: 20g per slice) plus cheddar cheese (negligible carbs). Reference value: 40g. Simple, unambiguous, packet-label accuracy.

Figure: Left: Three models independently converge on ~28g, consistently wrong by 12g. Right: GPT-5.4 estimates ~74g with high variability, wrong in the other direction by 34g. The red dashed line is the actual value.

Three of four models (Claude, Gemini 2.5 Pro and Gemini 3.1 Pro) independently converge on approximately 28g for a 40g meal. 510 queries from Claude, CV of 0.3%, and every single one is 12g below the actual value. The bread is right there in the photo. The carb value is on the packet. This is the “precisely wrong” problem: high consistency doesn’t guarantee accuracy.

A diabetes app user getting 28g every time would consistently underdose by ~1.2 units. GPT-5.4 goes the other way: mean estimate 74g, nearly double the reference, and highly variable on top of it.

The models don’t always know what they’re looking at

I found food identification errors in 8 of the 13 test images:

Bakewell tart: Claude called it a “Linzer torte” in 100% of 510 queries. GPT-5.4 called it a “jam tart” or “cake bar.” Only Gemini 3.1 Pro correctly named it (99.8%).
Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana”, in 3.4% of queries.
Cheese sandwich: Gemini 3.1 Pro added non-existent “deli meat” in 17.4% of queries, hallucinating an ingredient that isn’t there. This could directly inflate carbohydrate estimates.

Some of these misidentifications have modest nutritional impact.
Others could change the carbohydrate estimate substantially.

Where does your insulin dose actually land?

On the five images where I had the strongest reference values (packet labels and weighed portions), here’s how often each model’s individual queries would have pushed insulin doses into clinically dangerous territory:

Figure: Green is safe (<1U error). Yellow is moderate (1-2U). Orange is clinically significant (2-5U). Red is severe hypo risk (>5U). Claude is the only model with no queries in the orange or red zones.

Claude: 100% of queries in the safe or moderate zone. No single query would have caused more than a 2-unit insulin error.
GPT-5.4: 37% of queries would cause a clinically significant insulin error (>2U). That’s more than one in three queries landing in the danger zone.
Gemini 3.1 Pro Preview: 12% of queries would cause a clinically significant insulin error (>2U). Better than GPT-5.4.
Gemini 2.5 Pro: 12% of queries would cause a >5U error, the threshold associated with severe hypoglycaemia requiring third-party assistance.

Two types of risk

The study identifies two distinct failure modes:

Systematic bias (chronic risk): All four models overestimate carbs on average, meaning the dominant direction of error is toward too much insulin and hypoglycaemia. GPT-5.4 averages +1.2 units overdose per meal on strong-reference foods. Three meals a day, that’s 3.6 units of extra insulin per day.
Stochastic variability (acute risk): The within-image variation means a single unlucky query could produce a catastrophic outlier. Gemini 2.5 Pro’s worst single query on strong-reference data would have caused an 11.3 unit insulin overdose for a 34g meal. That’s a potential severe hypo.

“But the AI said it was confident”

The prompt I used asks each model to return a confidence score (0 to 1) for every food item it identifies. All four models dutifully returned confidence scores for 100% of items. Surely we can use those to filter out bad estimates?

No. We can’t.
Claude’s confidence has literally zero correlation with whether it’s right or wrong. It reports ~0.80 confidence whether it’s nailing the bakery cookie (MAE 1.9g) or getting the cheese sandwich wrong by 12g. Worse: when Claude reports high confidence (above 0.85), its estimates are actually less accurate (MAE 17.3g) than when it reports lower confidence (MAE 9.1g). The confidence score is worse than useless: it’s actively misleading.

Gemini 2.5 Pro reports confidence above 0.9 for 86% of all food items, and Gemini 3.1 Pro for 76%, regardless of whether the estimate is anywhere near correct. That’s not calibrated uncertainty. That’s a model saying “I’m very sure” about everything.

There is one faint signal: Claude’s mean confidence does vary slightly by image. The churros (its most misidentified food) get 0.65, while the crema catalana gets 0.92. But the range is so narrow and the calibration so poor that no diabetes app could meaningfully use these scores to protect users.

The bottom line: the only reliable uncertainty signal comes from querying multiple times and observing the spread. The model’s own confidence score is not a safety mechanism.

What this means for you

The DTN-UK stated earlier this year that generic LLMs must never be used as autonomous advisory calculators for insulin delivery. This data is the quantitative evidence base for that statement.

If you’re using AI carb counting in a diabetes app:

Don’t trust it blindly. No model tested is safe for unsupervised insulin dosing. Not even Claude.
Ask it more than once. Query 3-5 times and look at the spread. If the answers vary wildly, the model is uncertain, even if it doesn’t tell you that.
Check what it thinks it sees. If the model identifies “chicken with stuffing” for your grilled fish, you want to know that BEFORE it calculates your carbs.
Know your model. The four-fold difference in consistency is real. Claude Sonnet 4.6 is the safest single-query option today, but it can still be precisely wrong.
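
The "ask it more than once" advice can be sketched as a small spread check over a handful of repeated estimates. The helper name, the 1-unit tolerance, and the 1:10 ICR are illustrative assumptions rather than thresholds from the study, and a tight spread still says nothing about whether the answer is accurate.

```python
import statistics


def summarize_repeats(estimates_g, grams_per_unit=10.0, max_spread_units=1.0):
    """Summarize repeated carb estimates for one photo and flag wide disagreement."""
    if len(estimates_g) < 2:
        raise ValueError("need at least two estimates to measure spread")
    spread_g = max(estimates_g) - min(estimates_g)
    spread_units = spread_g / grams_per_unit
    return {
        "median_g": statistics.median(estimates_g),
        "spread_g": spread_g,
        "spread_units": spread_units,
        # True when the answers disagree by more than the chosen tolerance.
        "unstable": spread_units > max_spread_units,
    }


# Hypothetical repeated answers for one photo (not study data).
tight = summarize_repeats([28, 28, 29, 28, 28])
wide = summarize_repeats([55, 120, 484, 90, 210])
print(tight["unstable"], wide["unstable"])
```

The median is reported instead of the mean so that one extreme outlier does not drag the summary value; that choice is also an assumption of this sketch.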
An AI that gives you the same wrong answer 500 times is not more trustworthy than one that varies. Consi

AI errors, medical AI, hallucination, diabetes, large language models