Hacker News • 70 days ago

Incident Report: Platform-Wide Railway Outage Caused by a Google Cloud Account Suspension (Resolved)

IMP

8/10

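Key Summary

Railway, a cloud hosting platform, suffered a major outage in which its entire service was down for about 8 hours after Google Cloud (GCP) mistakenly suspended its production account. The incident is a clear example of how a failure or policy error at a single upstream cloud provider can cascade through an entire platform architecture. Railway has accepted responsibility for its architecture and committed to preventive measures, including separating its infrastructure and network control plane.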

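Translated Text

Authors: Chandrika Khanduri & Cody De Arkland / May 20, 2026 🚅

This report reflects the facts known at the time of publication and may be updated pending Google Cloud's internal review.

Railway experienced a platform-wide service disruption after Google Cloud incorrectly placed our account in a suspended status. This caused a temporary loss of service for all infrastructure hosted on Google Cloud (GCP). That infrastructure supports our dashboard, our API, and parts of our network infrastructure. As cached network routes expired, the outage spread beyond GCP and affected all Railway workloads. Below, we describe what happened, how we responded, and what we are doing to prevent similar incidents.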

Impact

From 22:20 UTC on May 19, 2026 until approximately 06:14 UTC on May 20 (about 8 hours), Railway experienced a platform-wide outage after Google Cloud suspended services on our production account. This took our API, control plane, and databases offline, along with the compute infrastructure hosted on Google Cloud.

Users immediately saw 503 errors on the dashboard and API, including "no healthy upstream" and "unconditional drop overload" messages, and could not log in. All workloads hosted on Google Cloud compute were taken offline.

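Workloads on Railway Metal, our own infrastructure, and in our AWS burst-cloud environment stayed up. However, Railway's edge proxies rely on a control plane API hosted on Google Cloud to populate their routing tables, so the outage cascaded beyond Google Cloud. As the network route caches expired, workloads in these other environments also became unreachable: the network control plane could no longer resolve routes to active instances, so requests returned 404 errors. At peak impact, Railway workloads in every region were unreachable.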
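The failure mode described above, where healthy workloads become unreachable once cached routes expire, can be illustrated with a small TTL cache. This is a minimal sketch under stated assumptions, not Railway's implementation: the `RouteCache` class, its `fetch` callback, and the `serve_stale` flag are all hypothetical names introduced here. With `serve_stale=False` the cache returns nothing once an entry ages out and the control plane is down, which a proxy would surface as a 404. With `serve_stale=True` it keeps answering from the last known route.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class CachedRoute:
    backend: str        # address of the instance serving this hostname
    fetched_at: float   # clock reading when the control plane last confirmed it


class RouteCache:
    """Hypothetical edge-proxy route cache (illustration only)."""

    def __init__(self, fetch: Callable[[str], str], ttl: float,
                 serve_stale: bool = False,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # control-plane lookup; may raise ConnectionError
        self._ttl = ttl                  # seconds a cached route counts as fresh
        self._serve_stale = serve_stale  # policy when the control plane is down
        self._clock = clock
        self._routes: Dict[str, CachedRoute] = {}

    def resolve(self, host: str) -> Optional[str]:
        """Return a backend for host, or None (a proxy would answer 404)."""
        now = self._clock()
        entry = self._routes.get(host)
        if entry is not None and now - entry.fetched_at < self._ttl:
            return entry.backend         # fresh hit: control plane not consulted
        try:
            backend = self._fetch(host)
        except ConnectionError:
            # Control plane unreachable: fall back to the last known route
            # only if the policy allows it and we have ever seen one.
            if self._serve_stale and entry is not None:
                return entry.backend
            return None
        self._routes[host] = CachedRoute(backend, now)
        return backend
```

Serving stale routes is a trade-off rather than a fix: a stale entry can point at an instance that has since moved or stopped, so it extends reachability only for workloads whose placement has not changed.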

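While we recovered the Google Cloud environment, builds and deployments were blocked platform-wide as individual services were restored. Once all infrastructure was back, the large backlog of queued deploys was drained gradually to avoid overloading the platform.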
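One common way to drain a backlog gradually is to release work in batches that start small and grow up to a ceiling. The sketch below shows that idea only; the report does not say how Railway paced its queue, and `drain_with_ramp`, `initial`, and `maximum` are hypothetical names.

```python
from collections import deque
from typing import Deque, Iterator, List


def drain_with_ramp(backlog: Deque[str], initial: int,
                    maximum: int) -> Iterator[List[str]]:
    """Yield queued deploy IDs in FIFO batches that double in size up to maximum."""
    if initial < 1 or maximum < initial:
        raise ValueError("need 1 <= initial <= maximum")
    size = initial
    while backlog:
        take = min(size, len(backlog))
        # Pop from the left so the oldest queued deploys run first.
        yield [backlog.popleft() for _ in range(take)]
        size = min(size * 2, maximum)
```

A caller would start each batch, wait for it to finish or for system load to settle, and only then ask for the next one. With 20 queued deploys, `initial=2`, and `maximum=8`, the batch sizes are 2, 4, 8, and 6.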

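In parallel, GitHub began rate-limiting Railway's OAuth and webhook integrations, which temporarily blocked logins and builds. The volume of these calls had spiked because the Google Cloud outage cleared our caches. As a side effect, terms-of-service acceptance records were also reset, so users had to accept the terms again on their next visit to the dashboard.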
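The report attributes the rate-limiting to the volume and burst nature of retried requests. A standard way to make retries less bursty is exponential backoff with jitter, sketched below. This is a general technique, not something the report says Railway adopted, and `backoff_delay` and its parameters are hypothetical names.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    rng = rng if rng is not None else random.Random()
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempt, 32)))
    # Full jitter: pick uniformly in [0, ceiling] so clients that failed at
    # the same moment do not all retry at the same moment.
    return rng.uniform(0.0, ceiling)
```

The random spread matters as much as the exponential growth: without it, every client that lost its cache at the same instant would retry on the same schedule and recreate the burst.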

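We take full responsibility for the architectural decisions that allowed a single upstream provider's action to cascade into a platform-wide outage. Below, we detail what happened, how we recovered, and what we are changing to prevent a recurrence.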

Incident Timeline

May 19, 22:10 UTC - Automated monitoring detected API health check failures and paged the on-call engineers, who began investigating.
May 19, 22:11 UTC - The dashboard returned 503 errors. Users could not log in.
May 19, 22:19 UTC - Root cause identified: Google Cloud Platform had suspended Railway's production account.
May 19, 22:22 UTC - A top-severity (P0) ticket was filed with Google Cloud. Railway's GCP account manager engaged directly.
May 19, 22:29 UTC - Incident formally declared.
May 19, 22:29 UTC - GCP account access restored. All compute instances remained stopped, and persistent disks were inaccessible.
May 19, 22:35 UTC - Cached network routes began expiring. Workloads on Railway Metal and AWS also began returning 404 errors because the network could no longer resolve routes.
May 19, 23:09 UTC - The first persistent disk came back online.
May 19, 23:54 UTC - All persistent disks restored to the ready state. The network was still down.
May 20, 00:39 UTC - Disks confirmed ready. Recovery was blocked on Google Cloud restoring networking.
May 20, 01:30 UTC - Compute instances began recovering.
May 20, 01:38 UTC - Edge traffic being served again. Networking restored.
May 20, 01:57 UTC - Orchestration and build infrastructure restored. Deploys were temporarily paused so that queued work would not run all at once and overload the systems.
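The gaps between these milestones can be checked with a few lines of date arithmetic. The sketch below uses only timestamps stated in the report; the helper names `utc` and `fmt` are introduced here for illustration.

```python
from datetime import datetime, timedelta, timezone


def utc(day: int, hhmm: str) -> datetime:
    """Build a UTC timestamp in May 2026 from a day and an HH:MM string."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    return datetime(2026, 5, day, hour, minute, tzinfo=timezone.utc)


def fmt(delta: timedelta) -> str:
    """Format a non-negative duration as hours and minutes, e.g. '3h09m'."""
    minutes = int(delta.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


# Milestones copied from the report.
suspended = utc(19, "22:20")        # account suspended (Impact section)
access_restored = utc(19, "22:29")  # GCP account access restored
routes_expiring = utc(19, "22:35")  # cached routes begin expiring
disks_ready = utc(19, "23:54")      # all persistent disks ready
edge_serving = utc(20, "01:38")     # edge traffic served again
monitoring = utc(20, "06:14")       # incident moved to monitoring

print(fmt(routes_expiring - suspended))     # 0h15m
print(fmt(edge_serving - access_restored))  # 3h09m
print(fmt(edge_serving - disks_ready))      # 1h44m
print(fmt(monitoring - suspended))          # 7h54m
```

The last figure matches the "about 8 hours" stated in the Impact section. The second shows that more than three hours passed between account access being restored and edge traffic being served again.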

Original Text (English)

Chandrika Khanduri & Cody De Arkland / May 20, 2026 🚅

This report reflects what we know at time of publication and may be updated pending Google Cloud's internal review.

Railway experienced a platform-wide service disruption due to Google Cloud incorrectly placing our account in a suspended status. This resulted in a temporary loss of service for all GCP hosted infrastructure. This infrastructure supports our dashboard, API, and pieces of our network infrastructure. As cached network routes expired, the outage extended beyond GCP to affect all Railway workloads. Below, we walk through what happened, how we responded, and what we're doing to prevent a similar incident in the future.

Impact

On May 19, 2026 between 22:20 UTC and approximately 06:14 UTC on May 20 (~8 hours), Railway experienced a platform-wide outage after Google Cloud suspended services on our production account. This took our API, control plane and databases offline, along with compute infrastructure hosted on Google Cloud.

Users immediately experienced 503 errors on the dashboard and API, including "no healthy upstream" and "unconditional drop overload" messages, and were unable to log in. All workloads hosted on Google Cloud compute were taken offline.

While workloads on our own Railway Metal and AWS burst-cloud environments remained up, Railway's edge proxies rely on a Google Cloud-hosted control plane API to populate their routing tables, causing the outage to cascade beyond Google Cloud. As the route caches expired, these other workloads became unreachable, resulting in returning 404 errors as the network control plane could no longer resolve routes to active instances. At peak impact, all Railway workloads across all regions were rendered unreachable.

As we recovered our Google Cloud environment, builds and deployments were blocked platform-wide while we restored the individual services. Once the entirety of our infrastructure was restored, a significant backlog of queued deploys was gradually drained to avoid overwhelming the platform.

In parallel, GitHub began rate-limiting Railway's OAuth and webhook integrations, temporarily blocking logins and builds. The volume of these calls increased as a result of our caches being cleared from the Google Cloud outage. As a side effect, Terms-of-service acceptance records were also reset, prompting users to re-accept on their next visit to the dashboard.

We take full responsibility for the architectural decisions that allowed a single upstream provider action to cascade into a platform-wide outage, and detail below what happened, how we recovered, and the changes we are making to prevent this from happening again.

Incident Timeline

May 19, 22:10 UTC - Our automated monitoring detected API health check failures and paged our on-calls, who started investigating the issue.
May 19, 22:11 UTC - Dashboard returning 503 errors. Users unable to log in.
May 19, 22:19 UTC - Root cause identified: Google Cloud Platform has suspended Railway's production account.
May 19, 22:22 UTC - P0 ticket filed with Google Cloud. Railway's GCP account manager engaged directly.
May 19, 22:29 UTC - Incident declared.
May 19, 22:29 UTC - GCP account access restored. All compute instances remained stopped and persistent disks inaccessible.
May 19, 22:35 UTC - Cached network routes began expiring; workloads on Railway Metal and AWS began returning 404 errors as the networking could no longer resolve routes.
May 19, 23:09 UTC - First persistent disk comes back online.
May 19, 23:54 UTC - All persistent disks restored to ready state. Network still down.
May 20, 00:39 UTC - Disks confirmed ready. Recovery blocked on Google Cloud networking restoration.
May 20, 01:30 UTC - Compute instances began recovering.
May 20, 01:38 UTC - Edge traffic being served again. Networking restored.
May 20, 01:57 UTC - Orchestration and build infrastructure restored. Deploys temporarily paused to prevent overwhelming systems as queued work attempted to execute simultaneously.
May 20, 02:04 UTC - Compute hosts being brought back online incrementally.
May 20, 02:47 UTC - GitHub began rate-limiting Railway's OAuth and webhook integrations; some users unable to log in, builds blocked.
May 20, 02:55 UTC - Dashboard accessible again.
May 20, 03:59 UTC - Deployments beginning to process again across all tiers.
May 20, 04:00 UTC - API, dashboard, and OAuth endpoints confirmed operational. Remaining workloads continuing to restore.
May 20, 06:14 UTC - Incident moved to monitoring.
May 20, 07:58 UTC - Incident is resolved.

What Happened?

At 22:20 UTC on May 19, Google Cloud placed Railway’s production account into a suspended status incorrectly, as part of an automated action. This action extended to many accounts within Google Cloud. As this was a platform-wide action, there was no proactive outreach to individual customers prior to the restriction.

This suspended status disabled our GCP related infrastructure, which supports the Railway Dashboard, API and parts of our Network infrastructure, along with additional burst-compute infrastructure hosted on Google Cloud. Railway's control plane is a set of core dependencies that serves the dashboard, processes builds and deployments, and populates the routing tables used by our edge. The impact was immediate for all workloads on Google Cloud.

Railway's edge proxies maintain a cache of routing tables from the network control plane, which is hosted within Google Cloud. While that cache held, workloads on Railway Metal and AWS continued to serve traffic. Once the cache expired, the edge could no longer resolve routes to active instances, and workloads across all regions, including Metal and AWS, began returning 404 errors. This caused the network outage impact to cascade beyond Google Cloud, into these regions as well, even though the workloads themselves remained online.

Railway's infrastructure is designed for high availability. Our databases run across multiple availability zones, and our network uses redundant connections between AWS, GCP, and Railway Metal. However, restoring account access did not restore these individual services. Persistent disks, compute instances, and networking all required separate recovery.

Due to the nature of this recovery process, the outage was extended by several hours. Disks were restored to a ready state by 23:54 UTC, but core networking and edge routing did not fully restore until approximately 01:30 UTC on May 20. (We are awaiting confirmation to see if this delay and associated errors were on Google’s side.)

As networking was restored, recovery of Railway core services and validation of end user workloads proceeded layer by layer. To prevent overwhelming our build systems we temporarily paused deploys, and gradually allowed them to resume.

In parallel to our core system recovery, GitHub began rate-limiting Railway's OAuth and webhook integrations, due to the volume and burst nature of all retried requests, temporarily blocking user logins and builds. By approximately 04:00 UTC on May 20, the API, dashboard, and OAuth endpoints were confirmed operational, with remaining workloads continuing to restore.

Preventative Measures

Railway’s network control plane is designed for resilience. It is a multi-AZ, multi-zone control plane which can tolerate the loss of multiple machines and components, while still functioning with zero user impact. This has been tested in both staging as well as live traffic (prior to its rollout a few months ago).

We have invested in resiliency as a result of prior incidents which have assisted us in dealing with the impact. A prior example of these lessons was Railway being able to gracefully recover user GitHub installations without triggering secondary rate-limits.

However, many have asked over multiple forums, how could Railway have a single dependency that would affect all customer workloads? Railway’s network is a mesh ring, built up of high availability fiber interconnects between Metal <>

Cloud Infrastructure Server Outage Google Cloud (GCP) Incident Report

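Railway Hit by Major Outage After Google Cloud Account Block

Railway, a cloud hosting platform, is experiencing a major service outage caused by an account block at its upstream cloud provider, Google Cloud. Symptoms include user authentication failures and an inaccessible dashboard, and Railway is working with Google to restore core infrastructure such as its API and internal network control. The estimated time to recovery (ETA) is not yet confirmed, and the situation requires continued monitoring.

Cloud Outage Infrastructure Railway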