On January 21, 2025, Vantage services experienced failures during pod launches because image pulls could not authenticate. The cause was a routine password reset for a service account. Because that service account was shared across multiple clusters, the reset unexpectedly affected production environments. After the authentication issue was mitigated, storage pods remained affected because of Node unavailability, which in turn disrupted several key services, including classification training and verification. This secondary issue was mitigated by manually removing the affected Nodes and creating new ones.
Cloud Instance
Incident Timeframe
Incident Status
Customer Impact
Customers experienced a complete halt of Mail import, Training, and Manual Verification tasks in their Vantage Cloud tenants.
Root Cause
The primary root cause was the reset of the Service Principal of the Sentry Cluster. Because all clusters used the same Service Principal, the reset caused an outage across all clusters.
As a secondary cause, the Storage Pod remained in the Pending state because of unavailable Nodes and Pod disruption budgets. This was resolved by manually deleting the affected Nodes and creating new ones.
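The shared-credential risk described in this section can be made visible with a simple inventory check. The following is a minimal sketch, not the tooling used during this incident: the cluster names, the principal IDs, and the `find_shared_principals` helper are all hypothetical, and the sketch assumes an inventory that maps each cluster to the ID of the Service Principal it uses.

```python
from collections import defaultdict


def find_shared_principals(cluster_to_principal):
    """Return {principal_id: sorted cluster names} for every principal
    that is used by more than one cluster."""
    by_principal = defaultdict(list)
    for cluster, principal in cluster_to_principal.items():
        by_principal[principal].append(cluster)
    # Keep only principals whose reset would affect several clusters at once.
    return {p: sorted(c) for p, c in by_principal.items() if len(c) > 1}


# Hypothetical inventory, for illustration only.
inventory = {
    "sentry": "sp-shared",
    "prod-a": "sp-shared",
    "prod-b": "sp-shared",
    "staging": "sp-staging",
}

shared = find_shared_principals(inventory)
for principal, clusters in shared.items():
    print(f"{principal} is shared by: {', '.join(clusters)}")
```

Any principal reported by such a check is a credential whose reset would affect every listed cluster at the same time.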
Mitigation Measures
The incident was mitigated by resolving the authentication issue across all regions, restoring Service Principal functionality, and manually addressing Node availability by cordoning, draining, and replacing the affected Nodes to stabilize storage and dependent services.
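Draining a Node evicts its pods, and a Pod disruption budget can block those evictions while too few replicas are healthy. The following is a simplified sketch of that rule, assuming a budget expressed as a minimum number of available pods; the function names and the replica counts are hypothetical and do not reflect the actual configuration of the affected clusters.

```python
def disruptions_allowed(current_healthy, min_available):
    """Number of pods that may be evicted right now without dropping
    below the budget. Pods in Pending state do not count as healthy."""
    return max(0, current_healthy - min_available)


def eviction_permitted(current_healthy, min_available):
    """True if evicting one more pod would still satisfy the budget."""
    return disruptions_allowed(current_healthy, min_available) >= 1


# Hypothetical example: 3 replicas with a budget of 2 available pods.
# With one replica stuck in Pending, only 2 are healthy, so no eviction
# is permitted and a drain waits until capacity is restored.
blocked = not eviction_permitted(current_healthy=2, min_available=2)
print("drain blocked:", blocked)
```

Under these assumptions a drain cannot make progress until replacement pods become healthy on other Nodes, which is why adding Node capacity is part of the manual procedure.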
Improvement Measures
We apologize for any inconvenience and, most of all, for the potential impact on your business. We are committed to preventing this issue from recurring and will continue to improve our infrastructure and monitoring solutions.