Mainnet API stalled
Incident Report for Hiro Systems
Postmortem

Impact

All clients that relied on the Hiro API were affected, including the Explorer, the Hiro desktop and web wallets, the stacks CLI, and others. Requests to all API endpoints were still being served, but they returned stale data.

Root cause

This incident occurred when the stacks-node that broadcasts events to the writer API suddenly lost all outbound peers around 10:30 AM ET. The stall went unnoticed until a message about it was posted in Discord around 11:50 AM ET. After the team was notified, traffic was switched to a different sub-environment, and the API was then able to return new block and transaction data.

Things that worked well

  • Quick resolution time. Traffic was switched to the other environment within a few minutes of the team being notified.

Areas of improvement

  • The alert that monitors block heights waits too long before it fires. This large buffer prolonged the time the API remained stalled before the stall could be noticed and addressed.

Action Items

  • Adjust internal alerting systems to decrease the time buffer used when determining whether the API is stalled.
  • Adjust internal alerting systems to track a sudden loss of outbound peers.
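
As an illustration only, the two checks described in the action items above could look roughly like the sketch below. This is not Hiro's actual alerting code: the names `StallDetector`, `max_stall_seconds`, and `lost_all_outbound_peers` are hypothetical, and how block heights and peer counts are collected is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StallDetector:
    """Flags a stall when the observed block height stops advancing.

    max_stall_seconds is a hypothetical threshold; a smaller value means
    a shorter time buffer before an alert fires.
    """

    max_stall_seconds: float
    last_height: Optional[int] = None
    last_change_at: Optional[float] = None

    def observe(self, height: int, now: float) -> bool:
        """Record one height sample taken at time `now` (in seconds).

        Returns True when the height has not advanced for at least
        max_stall_seconds, and False otherwise.
        """
        if self.last_height is None or height > self.last_height:
            # First sample, or the chain tip advanced: reset the timer.
            self.last_height = height
            self.last_change_at = now
            return False
        return (now - self.last_change_at) >= self.max_stall_seconds


def lost_all_outbound_peers(previous_count: int, current_count: int) -> bool:
    """True when the outbound peer count drops from a nonzero value to zero."""
    return previous_count > 0 and current_count == 0
```

In this sketch, a monitoring loop would call `observe` with each sampled block height and a timestamp, and would compare consecutive outbound peer counts with `lost_all_outbound_peers`. Lowering `max_stall_seconds` shortens the time buffer at the cost of more false alarms when blocks are naturally slow.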
Posted Jun 27, 2022 - 09:02 CDT

Resolved
Mainnet API stalled and unable to acquire new blocks and transactions for a few hours.
Posted May 21, 2022 - 09:30 CDT