High latencies on all requests

Incident Report for Playground

Postmortem

Twice today, Playground had problems generating images and, at times, serving the website and API routes. The direct cause was a hostError on the VMs that manage our inference queueing. A hostError is a lower-level software or hardware error (source). As a result, generation requests sat in the queue for up to 300s, at which point they expired. While this primarily affected our inference queue, a cascading issue affected requests to our website as well.

The incident lasted 4 hours, though generation was affected for much less time. API users may have seen 429 errors.
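For API users who saw 429 errors, retrying with a growing delay is one common way to ride out a period like this. The following is only a minimal sketch, not an official Playground client: `send` is a hypothetical callable standing in for whatever function issues your request and returns a `(status, body)` pair, and the attempt count and delays are illustrative assumptions rather than documented limits.

```python
import random
import time
from typing import Callable, Tuple


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],  # hypothetical: performs one request
    max_attempts: int = 5,                # assumed value, not a documented limit
    base_delay: float = 1.0,              # seconds; assumed value
    max_delay: float = 30.0,              # seconds; assumed value
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-429 status or attempts run out."""
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 429:
            break
        # Exponential backoff with full jitter, capped at max_delay, so that
        # many clients retrying at once do not all hit the API together.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, delay))
        status, body = send()
    return status, body
```

If every attempt returns 429, the last 429 response is handed back to the caller, who can then decide whether to surface the error or try again later.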

Around 2pm PT, we finished migrating VMs to add redundancy to our inference queueing. Now, if one host fails, the queue should survive the failure and keep operating.

Posted May 20, 2024 - 21:21 UTC

Resolved

This incident has been resolved.

Posted May 20, 2024 - 21:06 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 20, 2024 - 20:39 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted May 20, 2024 - 18:57 UTC

Investigating

We are currently investigating this issue.

Posted May 20, 2024 - 17:56 UTC

Monitoring

Playground v2.5 has mostly recovered with a few remaining cases of high queue time.

Posted May 20, 2024 - 17:43 UTC

Update

Website and other models have been restored. There are still high queue times for Playground v2.5 for many users.

Posted May 20, 2024 - 16:45 UTC

Update

We are continuing to investigate this issue. Service to the website has been restored. Queue times remain high for some users.

Posted May 20, 2024 - 16:02 UTC

Investigating

We are currently investigating this issue.

Posted May 20, 2024 - 14:52 UTC

This incident affected: Website, Default model (Playground v2.5), and Other Models.