Impact:
We had a partial outage: some requests could not access the platform at all, and some builds were stuck in a pending state for 30 minutes.
Detection:
We detected this issue manually before our automated check, which runs every 10 minutes, alerted us.
Root Cause:
We had a parallel issue with Firebase logging, and the resulting combination of several small issues caused some pods to become unresponsive.
Resolution:
We reverted our last push to production to test whether the issue was code-related. Once the revert triggered a restart of the services, the issue was resolved.