OpenAI's ChatGPT Outage Linked to New Telemetry Service

Dec 13, 2024 at 3:12 PM

OpenAI recently suffered one of the longest outages in its history. The incident affected several of its services, including the highly popular ChatGPT, the video generator Sora, and the developer-facing API. The outage began around 3 p.m. Pacific on Wednesday and took the company approximately three hours to resolve.

Unraveling OpenAI's Telemetry Service Outage

Understanding the Outage

OpenAI attributed the outage to a "new telemetry service" that went awry. In a postmortem published late Thursday, the company said the service, deployed on Wednesday to collect Kubernetes metrics, unintentionally triggered resource-intensive Kubernetes API operations. Kubernetes, an open source program for managing containers, became overwhelmed, taking down the control plane in most of OpenAI's large clusters.

The failure then cascaded. The overloaded control plane disrupted a resource that is crucial for DNS resolution. DNS resolution converts domain names to IP addresses, allowing users to simply type "Google.com" instead of the actual IP address. OpenAI's use of DNS caching further complicated matters: cached records kept answering lookups for a while, which delayed visibility into the problem and allowed the rollout of the telemetry service to continue before its full extent was understood.
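The way a cache can hide a broken resolver is easier to see in a small simulation. The Python sketch below is illustrative only and is not OpenAI's code or configuration: the `TTLCache` and `FlakyResolver` classes, the hostname `api.internal`, the address `10.0.0.7`, and the 60-second time-to-live are all hypothetical. It shows lookups continuing to succeed from cache after the resolver behind it has failed, with the failure only surfacing once the cached entry expires.

```python
class TTLCache:
    """Tiny time-to-live cache, a hypothetical stand-in for a DNS cache."""

    def __init__(self, ttl_seconds, resolver):
        self.ttl = ttl_seconds
        self.resolver = resolver
        self.entries = {}  # name -> (address, expiry time)

    def lookup(self, name, now):
        hit = self.entries.get(name)
        if hit is not None and now < hit[1]:
            return hit[0]  # still fresh: the resolver is never consulted
        address = self.resolver(name)  # raises if the resolver is down
        self.entries[name] = (address, now + self.ttl)
        return address


class FlakyResolver:
    """Hypothetical resolver that can be switched into a failed state."""

    def __init__(self):
        self.healthy = True

    def __call__(self, name):
        if not self.healthy:
            raise LookupError(f"cannot resolve {name}")
        return "10.0.0.7"  # made-up address for illustration


def simulate():
    resolver = FlakyResolver()
    cache = TTLCache(ttl_seconds=60, resolver=resolver)
    events = [("t=0", cache.lookup("api.internal", now=0))]

    resolver.healthy = False  # the resolver breaks shortly after t=0

    for t in (10, 59, 60):
        try:
            events.append((f"t={t}", cache.lookup("api.internal", now=t)))
        except LookupError:
            events.append((f"t={t}", "FAILED"))
    return events


if __name__ == "__main__":
    for label, result in simulate():
        print(label, result)
```

In this toy run the lookups at t=10 and t=59 still return the cached address even though the resolver is already down, and only the lookup at t=60 fails. A rollout that checks health during that window would see nothing wrong, which is the kind of delayed visibility the postmortem describes.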

Detecting and Remedying the Issue

OpenAI detected the issue "a few minutes" before customers started noticing an impact. However, because the Kubernetes servers were overwhelmed, it could not quickly implement a fix. The company described the incident as a confluence of multiple systems and processes failing simultaneously and interacting in unexpected ways. Its tests did not catch the full impact of the change on the Kubernetes control plane, and remediation was slow because engineers were effectively locked out of the overloaded servers they needed to reach in order to roll the change back.

Preventive Measures

To prevent similar incidents in the future, OpenAI has announced several measures. These include improvements to phased rollouts with better monitoring for infrastructure changes, and new mechanisms to ensure that OpenAI engineers can access the company's Kubernetes API servers under any circumstances. The company apologized to all its customers for the impact and acknowledged that it fell short of its own expectations.

In conclusion, OpenAI's outage serves as a lesson in the importance of careful monitoring and planning when implementing new services and systems. The company says it is committed to learning from this experience and taking steps to ensure the smooth operation of its services in the future.
