Vantage US major outage
Incident Report for ABBYY
Postmortem

Dear Customer,

On May 31, 2024, ABBYY Vantage US experienced an interruption of the storage service, which caused an outage of the Vantage platform. The issue has been mitigated, and the service is fully functional again. The Root Cause Analysis (RCA) for the incident follows:

Cloud instance

  • United States

Incident timeframe

  • May 31, 2024

    • 14:30 – 16:30 UTC

Incident status

  • Fully mitigated

Customer impact

  • The platform was unable to complete new or existing processing transactions.
  • Transactions could fail with the error message “Original error: [BrokenCircuitException: The circuit is now open and is not allowing calls.]; Original error type: /app/bin/x86_64/OcrEngine.Worker.Base.dll”
  • Manual Review tasks could not be processed due to the error message “Polly.CircuitBreaker.BrokenCircuitException: The circuit is now open and is not allowing calls. ---> System.Net.Http.HttpRequestException: Connection refused (app-st-storage.app.svc.cluster.local:80)”
  • Skills that were being edited could become unresponsive, or edits could be lost.
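
For readers unfamiliar with the "circuit is now open" wording in the errors above: it comes from a circuit breaker, a component that stops forwarding calls to a dependency after repeated failures and rejects them immediately instead. The following is a minimal sketch of that pattern in Python, not the Polly implementation used by the platform. The class name, threshold, and timeout values are illustrative assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being attempted."""


class CircuitBreaker:
    """Hypothetical minimal breaker; names and defaults are assumptions for illustration."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open state: fail fast without touching the dependency.
                raise CircuitOpenError("circuit is open; call rejected")
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In this sketch, once the storage dependency refuses enough consecutive connections, later calls fail with CircuitOpenError until the timeout elapses and a trial call succeeds. This is why requests could keep failing even though they never reached the storage service.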

Incident history

  • 14:30 UTC: The service health monitoring system was triggered, notifying the on-duty team of an increased rate of storage authentication errors from one pod. The team began investigations to mitigate the incident.
  • 15:00 UTC: The team identified an issue with the credentials used for storage authentication. Attempts to resolve the issue and restart the affected pod were unsuccessful.
  • 16:07 UTC: Following internal playbooks for incident mitigation, all pods were restarted, which led to additional storage authentication errors from other pods.
  • 16:30 UTC: The root cause of the failing storage authentication was identified as pods receiving invalid storage credentials. After resolving the credentials issue and restarting the storage service, all pods started successfully, and the platform returned to normal operation. The incident was considered fully mitigated.

Root cause

  • After the service was reconfigured during a regular password rotation procedure, the pods received incorrect storage authentication credentials.

Mitigation measures

  • Corrected the configured source of credentials used for storage authentication.
  • Restarted the storage service and cleaned up the cached authentication credentials.

Prevention measures

  • In the short term, optimize internal playbooks and define additional confirmation steps to verify that credentials updated during the password rotation procedure are correct.
  • Implement passwordless authentication for connections to underlying service infrastructure components.
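
As an illustration of what a confirmation step during password rotation could look like, the sketch below checks a new credential before publishing it, re-checks the value that consumers would actually read afterwards, and rolls back if that check fails. It is a generic example under stated assumptions, not the procedure ABBYY uses. The function rotate_credentials and its parameters probe, publish, and read_published are hypothetical names.

```python
def rotate_credentials(candidate, probe, publish, read_published):
    """Publish `candidate` only if it is confirmed to authenticate.

    probe(cred)       -> True if `cred` authenticates against storage (assumed helper)
    publish(cred)     -> writes `cred` to the source that pods read from (assumed helper)
    read_published()  -> returns the credential pods would currently receive (assumed helper)

    Returns True if the rotation completed, False if it was skipped or rolled back.
    """
    previous = read_published()

    # Confirmation step 1: verify the candidate before any pod can receive it.
    if not probe(candidate):
        return False  # nothing was published; the previous credential stays live

    publish(candidate)

    # Confirmation step 2: verify what pods would actually read, not what we
    # intended to write, and restore the previous credential on failure.
    if not probe(read_published()):
        publish(previous)
        return False
    return True
```

The second check matters because a credential can be valid and still reach consumers incorrectly if the configured source is wrong. Checking the published value rather than the intended one is what catches that case.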

We apologize for any inconvenience and, most importantly, for the potential impact on your business. We are committed to preventing such issues in the future and will continue to improve our infrastructure and monitoring solutions.

Thank you for using ABBYY Vantage!

If you have any questions or feedback, please feel free to contact our support team via the Help Center portal.

Yours faithfully,

ABBYY Vantage Team

Posted Aug 12, 2024 - 16:28 UTC

Resolved
This incident has been resolved.
Posted May 31, 2024 - 16:31 UTC
Investigating
We observe partial degradation of processing on Vantage US. Some transactions and training can fail.
Posted May 31, 2024 - 15:56 UTC
This incident affected: Vantage US.