Impact: For one day, some accounts sporadically experienced longer-than-usual pending times on a portion of their builds.
Detection: The issue was reported by a customer and confirmed shortly afterward by Codefresh’s platform monitoring alerts.
Root Cause: This issue was caused by a bug in the MongoDB driver. The driver had been upgraded in Codefresh services as part of our efforts to improve performance, but the new version contained a bug that caused Mongoose queries to hang under heavy load, neither returning a result nor throwing an error. As a result, the Codefresh build manager intermittently got stuck once enough queries were hanging at the same time.
Resolution: A temporary mitigation that improved the behavior of the build query queue was implemented first to alleviate the issue for affected customers. The actual root cause was identified the following week, and the issue was resolved by downgrading the MongoDB driver to a version that did not contain the bug.
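
The failure mode described under Root Cause, a query that neither returns nor throws, is hard to notice because the caller simply waits forever. The sketch below shows one general way to make such a hang visible by bounding any promise with a timer. It is an illustration only, not the mitigation or fix Codefresh applied; `withTimeout` and `QueryTimeoutError` are hypothetical names introduced here, and the timeout value is left to the caller.

```typescript
// Hypothetical error type so callers can tell a timeout apart from a driver error.
class QueryTimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} did not settle within ${ms} ms`);
    this.name = "QueryTimeoutError";
  }
}

// Settle with the outcome of `work`, or reject with QueryTimeoutError if `work`
// has not settled after `ms` milliseconds.
// Note: this does not cancel the underlying operation; it only stops waiting for it.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new QueryTimeoutError(label, ms)), ms);
    work.then(
      (value) => {
        clearTimeout(timer); // work finished in time, so the timer is no longer needed
        resolve(value);
      },
      (err) => {
        clearTimeout(timer); // work failed on its own; pass the original error through
        reject(err);
      },
    );
  });
}
```

With Mongoose, a call such as `withTimeout(SomeModel.findOne(filter).exec(), 5000, "findOne")` (where `SomeModel` and `filter` are placeholders) would reject after five seconds instead of waiting indefinitely, which gives the caller a point at which to log, alert, or retry.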