Partial Outage: Pipeline builds are stuck in pending due to expired certificates
Incident Report for Codefresh
Postmortem

Impact:

We had a small number of hybrid runners (no more than 10) that were unable to communicate with our API for a day and were therefore unable to fetch and run pipelines.

Detection:

We were informed of this issue by customers.

Root Cause:

We identified an issue with our certificate rotation, which failed to generate the new certificates required for this subset of runners.

Resolution:

We resolved the issue by manually recreating the required certificates, which were then delivered to the runners on the next build, restoring service for all impacted customers. We also applied further mitigation to rectify the underlying issue with certificate rotation.

We are working on monitoring improvements in this area.
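
For illustration only, one common form of monitoring for this class of failure is a periodic check that flags certificates that have expired or are close to expiring, so that a failed rotation is noticed before runners lose connectivity. The sketch below is a generic example under stated assumptions, not a description of Codefresh's actual monitoring: the function name `expiring_soon`, the runner names, the dates, and the 14-day warning window are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(cert_expiries, now, warn_window=timedelta(days=14)):
    """Return (name, time_left) pairs for certificates that have already
    expired or will expire within warn_window, soonest first.

    cert_expiries maps a (hypothetical) runner name to the timezone-aware
    "not after" timestamp of its certificate.
    """
    flagged = []
    for name, not_after in cert_expiries.items():
        time_left = not_after - now  # negative means already expired
        if time_left <= warn_window:
            flagged.append((name, time_left))
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical inventory; the names and dates are made up for this example.
now = datetime(2024, 8, 1, tzinfo=timezone.utc)
expiries = {
    "runner-a": datetime(2024, 7, 31, tzinfo=timezone.utc),  # already expired
    "runner-b": datetime(2024, 8, 10, tzinfo=timezone.utc),  # inside the window
    "runner-c": datetime(2024, 12, 1, tzinfo=timezone.utc),  # healthy
}

for name, time_left in expiring_soon(expiries, now):
    status = "EXPIRED" if time_left < timedelta(0) else "expiring soon"
    print(f"{name}: {status}")
```

In a check like this, alerting ahead of expiry (rather than only at expiry) leaves time to regenerate certificates manually if automated rotation has silently failed.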

Posted Aug 14, 2024 - 01:54 UTC

Resolved
We had a small number of hybrid runners (no more than 10) that were unable to communicate with our API for a day and were therefore unable to fetch and run pipelines. We identified an issue with our certificate rotation, which failed to generate the new certificates required for this subset of runners. We resolved the issue by manually recreating the required certificates, which were then delivered to the runners on the next build, restoring service for all impacted customers.
Posted Aug 01, 2024 - 19:00 UTC