OpenAI, a leading name in the field of artificial intelligence, recently faced one of its longest outages in history. This incident had a significant impact on its various services, including the highly popular ChatGPT, the video generator Sora, and the developer-facing API. The outage began around 3 p.m. Pacific on Wednesday and took the company approximately three hours to resolve.
Unraveling OpenAI's Telemetry Service Outage
Understanding the Outage
OpenAI attributed the outage to a "new telemetry service" that went awry. In a postmortem published late Thursday, the company said that this service, deployed on Wednesday to collect Kubernetes metrics, unintentionally triggered resource-intensive Kubernetes API operations. Kubernetes, an open source program for managing containers, became overwhelmed, taking down the control plane in most of OpenAI's large clusters.
The damage then spread, because the overloaded control plane affected a resource that OpenAI's Kubernetes setup depends on for DNS resolution. DNS resolution converts domain names to IP addresses, allowing users to simply type "Google.com" instead of the numeric IP address. OpenAI's use of DNS caching further complicated matters: cached records kept services working for a while, which delayed visibility of the problem and allowed the rollout of the telemetry service to continue before its full extent was understood.
Detecting and Remedying the Issue
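The way DNS caching can delay visibility of a failure can be illustrated with a toy model. The sketch below is not OpenAI's code: the `CachingResolver` and `Upstream` classes, the 60-second TTL, the host name, and the IP address are all hypothetical, and time is passed in explicitly so the example stays deterministic. It shows lookups continuing to succeed from the cache after the upstream resolver has already failed, and only starting to fail once the cached record expires.

```python
class CachingResolver:
    """Toy resolver with a TTL cache in front of an upstream lookup (hypothetical)."""

    def __init__(self, upstream, ttl_seconds):
        self.upstream = upstream        # callable: name -> IP string, may raise
        self.ttl_seconds = ttl_seconds
        self._cache = {}                # name -> (ip, expires_at)

    def resolve(self, name, now):
        entry = self._cache.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]             # fresh cache hit: upstream is not consulted
        ip = self.upstream(name)        # raises if the upstream is unavailable
        self._cache[name] = (ip, now + self.ttl_seconds)
        return ip


class Upstream:
    """Stand-in for the real DNS backend; can be switched to a failing state."""

    def __init__(self):
        self.healthy = True

    def __call__(self, name):
        if not self.healthy:
            raise RuntimeError("upstream DNS unavailable")
        return "10.0.0.1"               # made-up address for illustration


def simulate():
    upstream = Upstream()
    resolver = CachingResolver(upstream, ttl_seconds=60)
    resolver.resolve("service.internal", now=0)   # warm the cache at t=0
    upstream.healthy = False                      # upstream breaks shortly after
    results = []
    for t in (10, 30, 59, 60, 90):
        try:
            resolver.resolve("service.internal", now=t)
            results.append((t, "ok"))
        except RuntimeError:
            results.append((t, "fail"))
    return results


if __name__ == "__main__":
    for t, status in simulate():
        print(f"t={t:>2}s lookup: {status}")
```

In this model the upstream is already broken at t=10, yet callers see no errors until the record expires at t=60. During that window a rollout that looks healthy could keep expanding.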
Fortunately, OpenAI was able to detect the issue "a few minutes" before customers started noticing an impact. However, due to the overwhelmed Kubernetes servers, it was not possible to quickly implement a fix. The company explained that this was a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways. Its tests did not catch the full impact of the change on the Kubernetes control plane, and remediation was slow due to the locked-out effect.
Preventive Measures
To prevent similar incidents in the future, OpenAI has announced several measures. These include improvements to phased rollouts with better monitoring for infrastructure changes and new mechanisms to ensure that OpenAI engineers can access the company's Kubernetes API servers in any circumstances. The company apologized to all its customers for the impact and acknowledged that it fell short of its own expectations.
In conclusion, OpenAI's outage serves as a lesson in the importance of careful monitoring and planning when implementing new services and systems. The company is committed to learning from this experience and taking steps to ensure the smooth operation of its services in the future.
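As a rough illustration of what a phased rollout with monitoring means in practice, the following sketch deploys a change to a growing fraction of clusters and halts at the first failed health check. It is a minimal example under stated assumptions, not a description of OpenAI's tooling: the `phased_rollout` function, the `deploy` and `is_healthy` callbacks, and the 10% / 50% / 100% phases are all hypothetical.

```python
def phased_rollout(clusters, deploy, is_healthy, phases=(0.1, 0.5, 1.0)):
    """Deploy to growing fractions of clusters, halting on the first unhealthy one.

    `deploy` and `is_healthy` are caller-supplied callbacks (hypothetical here).
    A real system would typically also wait between phases so that delayed
    effects, such as cached records expiring, have time to surface.
    """
    done = []
    for fraction in phases:
        # Number of clusters that should be updated by the end of this phase.
        target = max(1, int(len(clusters) * fraction))
        for cluster in clusters[len(done):target]:
            deploy(cluster)
            done.append(cluster)
            if not is_healthy(cluster):
                # Stop immediately so the remaining clusters are left untouched.
                return {"completed": False, "deployed": done, "halted_at": cluster}
    return {"completed": True, "deployed": done, "halted_at": None}


if __name__ == "__main__":
    clusters = [f"c{i}" for i in range(10)]
    # Pretend the change breaks cluster "c3".
    outcome = phased_rollout(clusters, lambda c: None, lambda c: c != "c3")
    print(outcome)
```

With ten clusters and a failure on the fourth, the rollout stops after four deployments and the other six are never changed. Limiting the damage in this way is the point of phasing.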