Hacker News • 95 days ago

Agentic AI Breaks the Basic Premise of Database Design

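Key Summary

Traditional databases were designed on the implicit premise that they serve predictable queries written and reviewed by humans. Agentic AI, which reasons its way to ad hoc queries and writes relentlessly, breaks that premise completely. This article gives practical advice on what developers should do to prevent agent-induced database failures, such as setting session timeouts and adopting soft deletes.

Translated Text

Databases Were Not Designed For This - Arpit Bhayani (engineering, databases, and systems. always building.)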

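There is an implicit contract at the foundation of every database architecture decision you have ever made. You probably never wrote it down. Nobody does. It just… existed.

The contract goes something like this: the caller is a human-authored application, running deterministic code, issuing predictable queries that a developer reviewed before deployment. Writes are intentional. Connections are brief. When something goes wrong, a human notices. The database can be dumb and fast because the application layer is smart and careful.

For forty years, this contract held. It shaped how we designed schemas, sized connection pools, granted permissions, and thought about failure modes. It worked because the assumption was correct.

It is no longer correct.

Agentic AI systems violate this contract at every layer simultaneously. This article breaks down exactly which assumptions are failing, why they matter, and what to do about it, with concrete patterns and code. Let's dig right in...

Assumption - Deterministic Caller

In every application you deployed before agents, the queries hitting your database were authored by a human. A developer wrote the SQL, code-reviewed it, tested it, and deployed it. This assumption runs so deep that the tooling reflects it automatically: the Postgres query planner builds statistics around observed query patterns, caching layers warm up on repeated queries, and connection pools are tuned around the expected number of concurrent queries of a known complexity.

Agents work differently; they reason their way to queries. Different reasoning paths produce different queries against the same tables. An agent working on a customer analytics task might issue a join across five tables that has never been issued before, hold the connection while it thinks about the result, then issue a completely different follow-up. Your indexes cover the happy path. Your connection pool is sized for your observed peak. Neither of those holds when the agent can build any query depending on the data it needs.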

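Statement Timeouts

Statement timeouts are your first line of defense. A human-authored query that takes 30 seconds is a bug that someone will notice. An agent query that takes 30 seconds might be a reasoning loop that no one is watching. So, set timeouts at the role level, not just the application level.

CREATE ROLE agent_worker;
ALTER ROLE agent_worker SET statement_timeout = '5s';
ALTER ROLE agent_worker SET idle_in_transaction_session_timeout = '10s';

The idle_in_transaction_session_timeout setting is especially important. An agent that pauses mid-reasoning while holding an open transaction can be a legitimate situation, and this setting is what ends a session that sits idle inside a transaction for longer than the limit.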
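To make the effect of a statement timeout concrete from the caller's side, here is a minimal runnable sketch. It is an analogy, not the Postgres mechanism: SQLite (from the Python standard library) has no statement_timeout, so the sketch swaps in SQLite's progress handler to abort a statement once a deadline passes. The function name run_with_deadline and the one-second and 50-millisecond limits are illustrative assumptions.

```python
import sqlite3
import time


def run_with_deadline(conn: sqlite3.Connection, sql: str, seconds: float) -> list:
    """Run one statement and abort it if it runs longer than `seconds`.

    Stand-in for a server-side statement timeout; hypothetical helper.
    """
    deadline = time.monotonic() + seconds

    def past_deadline() -> int:
        # A non-zero return value tells SQLite to abort the running statement.
        return 1 if time.monotonic() > deadline else 0

    # Invoke the handler every 1000 SQLite virtual-machine instructions.
    conn.set_progress_handler(past_deadline, 1000)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        if time.monotonic() > deadline:
            raise TimeoutError(f"statement exceeded {seconds}s") from exc
        raise  # some other SQL error, not a timeout
    finally:
        conn.set_progress_handler(None, 1000)  # remove the handler


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(run_with_deadline(conn, "SELECT 1", 1.0))
```

The point for an agent loop is that a timed-out statement surfaces as an explicit error the agent has to handle, rather than a query that silently runs for as long as the reasoning loop keeps waiting.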

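Assumption - Writes are Intentional

The most dangerous assumption in database architecture is that every write was reviewed by a human before it happened. This was basically true for your entire career, but not anymore.

Agents write autonomously. They write based on their current understanding of the task, which may be wrong. Agents write in loops when their tools return unexpected results. Agents write on retries when a transient network error makes them 'think' the first attempt failed. Agents can even write thousands of rows in the time it takes you to get a Slack notification that something looks off.

Here is a real documented failure pattern: an agent calling a legacy API receives HTTP 200 with an empty result set. The API failed silently because the database connection pool was exhausted downstream. The agent interprets "no data" as "no problem" and proceeds to process 500 transactions with incomplete data. No exception was raised. No alert fired. The log showed "decision: approved" on every record.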
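One way to keep an agent from reading "no data" as "no problem" is to refuse any response that does not positively confirm the lookup completed. The sketch below is hypothetical: ApiResponse, its complete field, and rows_or_halt are illustrative names, and it assumes the upstream API can be made to return such a completeness signal, which the legacy API in the incident above evidently did not.

```python
from dataclasses import dataclass, field
from typing import Optional


class UpstreamUncertain(Exception):
    """The response cannot be told apart from a silent upstream failure."""


@dataclass
class ApiResponse:
    status: int
    rows: list = field(default_factory=list)
    # Hypothetical field: True only when the upstream confirms the
    # lookup actually ran to completion.
    complete: Optional[bool] = None


def rows_or_halt(resp: ApiResponse) -> list:
    """Return rows only when the response is positively confirmed."""
    if resp.status != 200:
        raise UpstreamUncertain(f"unexpected status {resp.status}")
    if resp.complete is not True:
        # HTTP 200 without a completeness signal is treated as unknown,
        # so the agent stops instead of approving on missing data.
        raise UpstreamUncertain("no explicit completeness signal")
    return resp.rows


if __name__ == "__main__":
    print(rows_or_halt(ApiResponse(200, [], complete=True)))  # confirmed empty
    try:
        rows_or_halt(ApiResponse(200, []))  # the silent-failure shape
    except UpstreamUncertain as exc:
        print("halted:", exc)
```

The design choice is to make the ambiguous case raise, so a silent failure turns into the exception and alert that were missing in the incident.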

The core fix here is to design your write paths assuming the caller might be wrong, might retry, and might not be watching the results.

Soft Deletes Everywhere

Never let an agent hard-delete data...

Original Text (English)

Databases Were Not Designed For This

Arpit Bhayani
engineering, databases, and systems. always building.

There is an implicit contract at the foundation of every database architecture decision you have ever made. You probably never wrote it down. Nobody does. It just… existed.

The contract goes something like this: the caller is a human-authored application, running deterministic code, issuing predictable queries, reviewed by a developer before deployment. Writes are intentional. Connections are brief. When something goes wrong, a human notices. The database can be dumb and fast because the application layer is smart and careful.

For forty years, this contract held. It shaped how we designed schemas, sized connection pools, granted permissions, and thought about failure modes. It worked because the assumption was correct.

It is no longer correct.

Agentic AI systems violate this contract at every layer simultaneously. In this article, I break down exactly which assumptions are failing, why they matter, and what to do about it - with concrete patterns and code. Let's dig right in…

Assumption - Deterministic Caller

In every application you have deployed before agents, the queries hitting your database were authored by a human. A developer wrote the SQL, code-reviewed it, tested it, and deployed it. This assumption runs so deep that the tooling reflects it automatically: the Postgres query planner builds statistics around observed query patterns, caching layers warm up on repeated queries, and connection pools are tuned around the expected number of concurrent queries of a known complexity.

Agents work differently; they reason their way to queries. Different reasoning paths produce different queries against the same tables. An agent working on a customer analytics task might issue a join across five tables that has never been issued before, hold the connection while it thinks about the result, then issue a completely different follow-up.
Your indexes cover the happy path. Your connection pool is sized for your observed peak. Neither of those holds when the agent can build any query depending on the data it needs.

Statement Timeouts

Statement timeouts are your first line of defense. A human-authored query that takes 30 seconds is a bug that someone will notice. An agent query that takes 30 seconds might be a reasoning loop that no one is watching. So, set timeouts at the role level, not just the application level.

    CREATE ROLE agent_worker;
    ALTER ROLE agent_worker SET statement_timeout = '5s';
    ALTER ROLE agent_worker SET idle_in_transaction_session_timeout = '10s';

The idle_in_transaction_session_timeout is especially important. An agent that pauses mid-reasoning while holding an open transaction can be a legitimate situation, and this setting is what ends a session that sits idle inside a transaction for longer than the limit.

Assumption - Writes are Intentional

The most dangerous assumption in database architecture is that every write was reviewed by a human before it happened. This was basically true for your entire career, but not anymore.

Agents write autonomously. They write based on their current understanding of the task, which may be wrong. Agents write in loops when their tools return unexpected results. Agents write on retries when a transient network error makes them 'think' the first attempt failed. Agents can even write thousands of rows in the time it takes you to get a Slack notification that something looks off.

Here's a real documented failure pattern - an agent calling a legacy API receives HTTP 200 with an empty result set. The API failed silently because the database connection pool was exhausted downstream. The agent interprets "no data" as "no problem" and proceeds to process 500 transactions with incomplete data. No exception was raised. No alert fired. The log showed "decision: approved" on every record.

The core fix here is to design your write paths assuming the caller might be wrong, might retry, and might not be watching the results.

Soft Deletes Everywhere

Never let an agent hard-delete anything. Use soft deletes as a baseline for any table an agent can write to:

    ALTER TABLE orders ADD COLUMN deleted_at TIMESTAMPTZ;
    ALTER TABLE orders ADD COLUMN deleted_by TEXT;  -- 'agent:customer-support-v2', 'user:abc123'
    ALTER TABLE orders ADD COLUMN delete_reason TEXT;

    -- Agents query this view; they never see deleted rows and can't accidentally undelete
    CREATE VIEW active_orders AS
    SELECT * FROM orders WHERE deleted_at IS NULL;

The deleted_by column is more important than it looks. When you are debugging what happened two hours ago, "show me everything agent X deleted" is a query you will want to run.

Append-only Event Logs

For operations where the stakes are higher - financial records, inventory changes, user state mutations - consider going further and making the table append-only. The agent never issues UPDATE or DELETE. It issues INSERT with a new state and a reason:

    CREATE TABLE order_state_log (
        id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
        order_id UUID NOT NULL REFERENCES orders(id),
        previous_status TEXT,
        new_status TEXT NOT NULL,
        changed_by TEXT NOT NULL,
        changed_at TIMESTAMPTZ DEFAULT now(),
        reason TEXT,
        idempotency_key TEXT UNIQUE
    );

This is the event sourcing pattern applied at the table level. A single append-only log table for your most sensitive entities gives you a complete audit trail and makes "undo" a projection query.

Idempotency Keys Are Not Optional

Agents retry, and this is by design. Every orchestration framework operates on at-least-once delivery semantics. If a step fails, it runs again. Your write paths need to be designed for this.

An idempotency key is a stable identifier that an agent includes with every write. The database rejects duplicates silently with a unique constraint. The agent gets a successful response either way. Running the operation twice produces the same result as running it once.
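The duplicate-rejection behavior described above can be tried end to end with a small runnable sketch. It uses SQLite from the Python standard library as a stand-in for Postgres, with a trimmed-down order_state_log table; the helper names open_log and record_status are hypothetical, and the key is passed in as an opaque string. The ON CONFLICT ... DO NOTHING clause is also valid Postgres syntax, but treat this as an illustration under those assumptions rather than the article's implementation.

```python
import sqlite3


def open_log() -> sqlite3.Connection:
    """Create an in-memory, trimmed-down version of order_state_log."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE order_state_log (
            id INTEGER PRIMARY KEY,
            order_id TEXT NOT NULL,
            new_status TEXT NOT NULL,
            changed_by TEXT NOT NULL,
            idempotency_key TEXT UNIQUE
        )
        """
    )
    return conn


def record_status(conn: sqlite3.Connection, order_id: str, new_status: str,
                  changed_by: str, key: str) -> bool:
    """Append one state change; return True only if this call inserted a row.

    A retry with the same key is ignored, so the caller succeeds either way.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "INSERT INTO order_state_log "
            "(order_id, new_status, changed_by, idempotency_key) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT (idempotency_key) DO NOTHING",
            (order_id, new_status, changed_by, key),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = open_log()
    print(record_status(conn, "o-1", "shipped", "agent:demo", "k-1"))  # first attempt
    print(record_status(conn, "o-1", "shipped", "agent:demo", "k-1"))  # retry
    print(conn.execute("SELECT count(*) FROM order_state_log").fetchone()[0])
```

Neither call raises, and the table ends up with a single row, which is the property an at-least-once orchestrator needs from the write path.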

    -- The agent generates this key from task_id + operation_type + target_id.
    -- It is deterministic for the same logical operation,
    -- so retries produce the same key.
    ALTER TABLE order_state_log
        ADD CONSTRAINT uq_idempotency_key UNIQUE (idempotency_key);

In practice, the agent constructs the key like this:

    import hashlib

    def make_idempotency_key(task_id: str, operation: str, target_id: str) -> str:
        raw = f"{task_id}:{operation}:{target_id}"
        return hashlib.sha256(raw.encode()).hexdigest()[:32]

The task ID comes from the orchestration layer and is stable across retries of the same logical task. This means the agent can retry as many times as it needs to, and your database sees exactly one write per logical operation.

Assumption - Connections are Brief

Traditional connection pool sizing follows a straightforward mental model. Your application handles N concurrent requests. Each request needs one database connection for a brief period. You size your pool to slightly above your expected concurrency peak, add a little headroom, and you are done.

Agents break this model in three ways.

Agents hold connections longer

A multi-step reasoning task may issue a query, pause to process the result with the LLM, issue another query, pause again, and repeat. Each pause holds the connection open. The connection time per task is no longer "query execution time" - it is "query execution time + LLM inference time x reasoning steps."

Agents fan out

A single high-level agent task often spawns sub-agents to work in parallel. One task becomes five simultaneous database sessions. This can exhaust connections: concurrent agent workflows hold db.session open across long IO waits until Postgres runs out of connection slots.

Agents multiply unexpectedly

In development, you had three agents. In production, you have thirty. Nobody updated the connection pool configuration.
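To see how much the hold-time formula above changes the picture, here is a small worked calculation. The numbers (50 ms of query time per task, 2 s of LLM inference per step, 6 reasoning steps, 5 tasks starting per second) are illustrative assumptions, not measurements from the article. The second helper applies Little's law (average connections in use = arrival rate x hold time), which is a standard queueing result brought in for this sketch, not something the article states.

```python
def connection_hold_seconds(query_s: float, inference_s: float, steps: int) -> float:
    """Hold time per task: query execution time + LLM inference time x reasoning steps."""
    return query_s + inference_s * steps


def avg_connections_in_use(tasks_per_second: float, hold_seconds: float) -> float:
    """Little's law: average busy connections = arrival rate x hold time."""
    return tasks_per_second * hold_seconds


if __name__ == "__main__":
    # Illustrative assumptions only.
    query_s, inference_s, steps, rate = 0.05, 2.0, 6, 5.0

    classic = avg_connections_in_use(rate, query_s)
    agentic = avg_connections_in_use(
        rate, connection_hold_seconds(query_s, inference_s, steps)
    )
    print(f"request/response workload: ~{classic:.2f} connections busy on average")
    print(f"agent holding across steps: ~{agentic:.2f} connections busy on average")
```

Under these assumed numbers the same arrival rate goes from a fraction of one busy connection to roughly sixty, which is why a pool sized for the observed request/response peak stops being a useful guide.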

The fix is a dedicated connection pool for agent workloads, sized independently from your human-facing transactional application traffic:

    # Rule of thumb: (num_agent_workers * avg_concurrent_steps * 0.5)
    # The 0.5 accounts for t

Agentic AI · Databases · System Architecture · Backend · Query Optimization