Redis outage on Vantage EU
Incident Report for ABBYY
Postmortem

Dear Customer,

On January 29 and January 30, 2024, the Redis service of ABBYY Vantage EU was interrupted, resulting in outages of the Vantage platform. The issues have been mitigated and the service is now fully functional. Please review the following Root Cause Analysis (RCA) for the incident:

Cloud instance

  • Western Europe

Incident timeframe

  • January 29, 2024

    • 15:42 – 17:20 UTC
  • January 30, 2024

    • 10:30 – 12:10 UTC

Incident status

  • Fully mitigated

Customer impact

  • The platform was unable to launch new or complete existing transactions.
  • Skills that were being edited during the outage could become unresponsive.
  • Edits made to skills during the outage could be lost.
  • Subscription status information was not accessible.

Incident history

  • January 29, 2024

    • 15:42 UTC: The service's health monitoring system reported a failure of the Redis master node. The on-duty operations team found unusually high memory consumption by the Redis service and performed a manual failover.
    • 16:02 UTC: The Redis master node again started consuming an unusually high amount of memory. The team performed another manual failover and cleaned up the Redis stream to reduce memory consumption.
    • 17:20 UTC: All internal services returned to normal operation. The incident was considered mitigated.
  • January 30, 2024

    • 10:30 UTC: The service's health monitoring system reported a new failure of the Redis master node due to high memory consumption. The on-duty operations team scaled the Transaction service down to zero to prevent new data from flowing to Redis, deleted all Workflow service objects on all Redis master nodes, and then scaled the Transaction service back up.
    • 12:10 UTC: All internal services returned to normal operation. The incident was considered mitigated.

Root cause

  • The Redis failures were caused by the memory limit configured for the Redis service on its hosting VMs being set too high.
  • As a result, the Redis master node could not process data requests but still reported itself as available, so automatic failover was not triggered.
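
The gap described above (a node that answers availability checks but cannot serve data) is what a deeper health probe is meant to catch. The following is a minimal sketch only, not the check used by Vantage: it assumes a client object exposing the redis-py style methods ping(), set(), get() and info(), and the probe key name and the 0.9 memory threshold are hypothetical values chosen for illustration.

```python
import uuid


def deep_health_check(client, probe_key="health:probe", max_memory_ratio=0.9):
    """Return (healthy, reason) for a Redis-like client.

    Goes beyond a plain PING by writing and reading back a short-lived
    value and by comparing used memory against the configured limit.
    """
    try:
        if not client.ping():
            return False, "ping failed"

        # Round-trip a unique token to prove the node can serve data requests.
        token = uuid.uuid4().hex
        client.set(probe_key, token, ex=30)  # key expires after 30 seconds
        stored = client.get(probe_key)
        if isinstance(stored, bytes):
            stored = stored.decode()
        if stored != token:
            return False, "write/read probe mismatch"

        # INFO memory reports used_memory and maxmemory (0 means no limit).
        mem = client.info("memory")
        limit = mem.get("maxmemory", 0)
        if limit and mem["used_memory"] / limit > max_memory_ratio:
            return False, "memory above threshold"
    except Exception as exc:
        # Any error on the data path counts as unhealthy.
        return False, f"probe error: {exc}"
    return True, "ok"
```

A monitoring system could treat a False result as grounds for failover even while the node still answers PING.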

Mitigation measures

  • Manual failover of the Redis service, followed by cleaning up the Redis stream to reduce the consumed memory.
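
As an illustration of what cleaning up a Redis stream can look like, the sketch below trims streams to a bounded length with the XTRIM command. It is not the procedure the operations team ran: the stream names, the entry cap, and the helper name trim_streams are assumptions, and the client is any object with a redis-py style xtrim() method.

```python
def trim_streams(client, stream_names, max_entries=10_000):
    """Trim each stream to roughly max_entries entries.

    Returns a dict mapping stream name to the number of entries removed,
    as reported by XTRIM.
    """
    removed = {}
    for name in stream_names:
        # Equivalent to: XTRIM <name> MAXLEN ~ <max_entries>
        # approximate=True lets Redis trim on internal node boundaries,
        # which is cheaper than an exact trim.
        removed[name] = client.xtrim(name, maxlen=max_entries, approximate=True)
    return removed
```

With approximate trimming, a stream may keep slightly more than max_entries entries, which is usually acceptable when the goal is to release memory quickly.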

Prevention measures

  • Optimize the implementation of communication with Redis for reading and writing data to reduce memory consumption.
  • Replace the VMs hosting the Redis service with a native fully managed Redis service on MS Azure.
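
One common way to reduce memory consumption on the write path is to cap stream length at the moment each entry is added, so a stream cannot grow without bound between cleanups. The sketch below shows that general technique and is not a description of the planned Vantage changes: the helper name publish_capped, the stream name, and the cap are hypothetical, and the client is any object with a redis-py style xadd() method.

```python
def publish_capped(client, stream, fields, max_entries=10_000):
    """Append one entry to a stream while bounding the stream's length.

    Equivalent to: XADD <stream> MAXLEN ~ <max_entries> * field value ...
    Returns the ID of the new entry.
    """
    return client.xadd(stream, fields, maxlen=max_entries, approximate=True)
```

For example, publish_capped(client, "workflow-events", {"state": "done"}) appends one entry and lets Redis evict the oldest entries once the cap is exceeded.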

We apologize for any inconvenience and, most importantly, for the potential impact on your business. We are committed to preventing such issues in the future and will continue improving our infrastructure and monitoring solutions.

Thank you for using ABBYY Vantage!

If you have any questions or feedback, please feel free to contact our support team via the Help Center portal.

Yours faithfully,

ABBYY Vantage Team

Posted Feb 07, 2024 - 09:43 UTC

Resolved
This incident has been resolved.
Posted Jan 29, 2024 - 17:20 UTC
Investigating
We are currently experiencing difficulties in completing transactions and encountering errors when attempting to create new skills within our platform.
Our team is actively investigating the issue and working towards a resolution. We apologize for any inconvenience caused and appreciate your patience.
Posted Jan 29, 2024 - 15:42 UTC
This incident affected: Vantage EU.