Vantage US

Incident Report for ABBYY

Postmortem

On January 21, 2025, Vantage services experienced failures during Pod launches because image pulls could not authenticate. The cause was a routine password reset for a service account that was shared across multiple clusters, so the reset unexpectedly impacted production environments. After this was mitigated, storage Pods were affected by Node unavailability, which disrupted several key services, including classification training and verification. This was later mitigated by manually removing the affected Nodes and creating new ones.
Cloud Instance

Vantage Cloud All Regions;

Incident Timeframe

18:52 UTC – Vantage Classification went down. Incident started;
18:58 UTC – Troubleshooting commenced with emphasis on generated logs;
19:17 UTC – Problem was identified as a mismatched service principal password;
19:18 UTC – Mitigation procedures were executed simultaneously in all regions to resolve the service principal issue. Performance was gradually restored;
19:54 UTC – Despite previous success with Classification, Vantage Mail import and training went down;
20:04 UTC – Troubleshooting of the new issue commenced;
20:15 UTC – The problem with Storage was identified and mitigation procedures commenced;
20:30 UTC – Troubleshooting and investigation were ongoing, and new issues with Pods were identified;
21:10 UTC – Additional measures were taken: old Nodes were manually cordoned, drained, and deleted, and the system started stabilizing;
22:30 UTC – System was fully restored and operational;

Incident Status

Fully Mitigated;

Customer Impact
Customers experienced a complete halt of Mail import, Training, and Manual Verification tasks in their Vantage Cloud tenants.
Root Cause
The primary root cause was the reset of the Service Principal password of the Sentry Cluster, which led to an outage of all clusters because they all used the same Service Principal.
As a secondary cause, the Storage Pod remained in the Pending state due to unavailable Nodes and Pod disruption budgets. This was resolved by manually deleting the old Nodes and scheduling new ones.
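To illustrate how a Pod disruption budget can hold up Node replacement, the sketch below models the general Kubernetes rule that voluntary evictions are allowed only while the number of healthy Pods exceeds the number the budget requires. The class and function names are hypothetical, the replica counts are invented for the example, and the actual budgets configured in Vantage are not stated in this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DisruptionBudget:
    """Simplified Pod disruption budget (integer values only, hypothetical model)."""
    min_available: Optional[int] = None
    max_unavailable: Optional[int] = None


def disruptions_allowed(budget: DisruptionBudget, expected_pods: int, healthy_pods: int) -> int:
    """Return how many Pods may be evicted voluntarily right now."""
    if budget.min_available is not None:
        desired_healthy = budget.min_available
    elif budget.max_unavailable is not None:
        desired_healthy = expected_pods - budget.max_unavailable
    else:
        raise ValueError("a budget needs min_available or max_unavailable")
    # Evictions are permitted only for Pods above the required healthy count.
    return max(0, healthy_pods - desired_healthy)


# Assumed example: 3 storage replicas, budget allows 1 unavailable,
# and one replica is already Pending, so only 2 are healthy.
budget = DisruptionBudget(max_unavailable=1)
print(disruptions_allowed(budget, expected_pods=3, healthy_pods=2))  # prints 0
```

With zero disruptions allowed, a drain of a Node hosting one of the remaining healthy Pods waits instead of evicting it, which is consistent with the manual intervention described above.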
Mitigation Measures
The incident was mitigated by resolving the authentication issue across all regions, restoring service principal functionality, and manually addressing Node availability by cordoning, draining, and replacing affected Nodes to stabilize storage and dependent services.
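The cordon, drain, and replace sequence can be sketched as an ordered list of standard kubectl commands. This is a minimal illustration, not the exact procedure that was run: the function name and the Node name are hypothetical, the drain flags are common choices rather than confirmed ones, and how replacement Nodes are created depends on the cluster's node pool setup, which is not shown.

```python
from typing import List


def node_replacement_plan(node: str) -> List[List[str]]:
    """Build (but do not run) the kubectl commands for retiring one Node."""
    return [
        # Stop new Pods from being scheduled onto the Node.
        ["kubectl", "cordon", node],
        # Evict running Pods; this step respects Pod disruption budgets.
        ["kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data"],
        # Remove the Node object so a replacement can take its place.
        ["kubectl", "delete", "node", node],
    ]


# "example-node-1" is a placeholder Node name.
for command in node_replacement_plan("example-node-1"):
    print(" ".join(command))
```

Building the plan as data before executing it makes the order explicit and lets an operator review the commands for each Node first.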
Improvement Measures

Each cluster will be configured to have its own service principal;
All Pod disruption budgets will be investigated and removed as required;
Automation will be set up to reschedule old and pending Nodes;
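The first measure above can be checked with a simple audit that flags any service principal used by more than one cluster. The sketch below assumes an inventory mapping cluster names to principal identifiers. The function name and all inventory values other than the Sentry cluster are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def shared_principals(cluster_to_principal: Dict[str, str]) -> Dict[str, List[str]]:
    """Return principals used by more than one cluster, with the clusters sharing them."""
    by_principal = defaultdict(list)
    for cluster, principal in cluster_to_principal.items():
        by_principal[principal].append(cluster)
    return {p: sorted(c) for p, c in by_principal.items() if len(c) > 1}


# Hypothetical inventory: two clusters still share one principal.
inventory = {
    "sentry": "sp-shared",
    "prod-us": "sp-shared",
    "prod-eu": "sp-prod-eu",
}
print(shared_principals(inventory))  # prints {'sp-shared': ['prod-us', 'sentry']}
```

An empty result means a password reset on any one principal can affect at most one cluster.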

We apologize for any inconvenience and, most of all, for the potential impact on your business. We are committed to preventing the issue in the future and will continue working on improving the infrastructure and our monitoring solutions.

Posted Jan 23, 2025 - 14:54 UTC

Resolved

This incident has been resolved.

Posted Jan 21, 2025 - 22:32 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jan 21, 2025 - 21:23 UTC

Identified

Unfortunately, additional issues were found that require the intervention of our specialists.
At the moment, they are busy implementing a fix.
Please stay tuned for updates.

Posted Jan 21, 2025 - 20:47 UTC

Monitoring

Users are experiencing errors when running skills.

Our specialists have identified the issue and resolved it.
The system is currently under monitoring.

We sincerely apologize for any inconvenience this may have caused.

Posted Jan 21, 2025 - 20:06 UTC

This incident affected: Vantage US.