Cross-chain Status | 2 July 24

Summary

Pyth cross-chain prices experienced intermittent downtime between 7:30 and 15:00 UTC, which resulted in prices becoming stale or unavailable. The network is now stable and we are actively monitoring its health.

The root cause of this outage was the rollout of an upgrade to Pythnet. While the upgrade itself was fine, many nodes rolled it out late Monday and early Tuesday. This put heavy load on all nodes as they synced their state, and caused the RPC nodes used by Wormhole guardians to fall behind and intermittently miss price updates. The issue resolved over time as nodes finished syncing and the additional load subsided.

Impact

As prices are streamed off-chain via Hermes providers, it is difficult to assess the impact on third-party Hermes providers. While there are 50 minutes for which cross-chain prices were not recorded, the actual downtime is less than that. The chart below shows the recorded cross-chain price updates per minute during the outage. Most of the downtime occurred in two windows: 9:00-10:40 UTC and 13:30-15:00 UTC.

Number of recorded price updates per minute
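
As an illustration of how a per-minute figure like this can be derived, the following is a minimal sketch that counts the minutes in a window with no recorded update. The function name, the timestamps, and the window are hypothetical values chosen for the example; they are not data from this incident and this is not the tooling used to produce the chart.

```python
from datetime import datetime, timedelta, timezone


def minutes_without_updates(update_times, window_start, window_end):
    """Return the start of every minute in [window_start, window_end)
    that contains no recorded update."""
    # Truncate each update timestamp to the minute it falls in.
    seen = {t.replace(second=0, microsecond=0) for t in update_times}
    missing = []
    minute = window_start.replace(second=0, microsecond=0)
    while minute < window_end:
        if minute not in seen:
            missing.append(minute)
        minute += timedelta(minutes=1)
    return missing


# Hypothetical sample: a five-minute window with updates in two of the minutes.
start = datetime(2024, 7, 2, 9, 0, tzinfo=timezone.utc)
end = start + timedelta(minutes=5)
updates = [
    start + timedelta(seconds=10),            # 09:00:10
    start + timedelta(seconds=40),            # 09:00:40
    start + timedelta(minutes=3, seconds=5),  # 09:03:05
]
gaps = minutes_without_updates(updates, start, end)
print(len(gaps), "minutes without a recorded update")
```

Note that this counts whole minutes with no recorded update, which is a coarse measure and not the same thing as continuous downtime.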

Root cause analysis

A new upgrade for Pythnet was announced on Thursday to improve Pythnet network performance. Wormhole guardians run their own Pythnet nodes to observe Pythnet prices and produce verified data from what they observe. When they updated their nodes, they started to miss price updates frequently. During an upgrade, some nodes may fall behind in their state, or connect to unhealthy peers and end up in a broken state. To catch up, they ask other nodes for the data they need to rebuild their state. This process puts a lot of load on other nodes in the network and is referred to as “repair traffic” in Solana. Many of the validators and RPCs upgraded late Monday and early Tuesday. As a result, Pythnet had more unhealthy or lagging nodes than it can tolerate, and the repair load intermittently slowed down Guardian nodes. This caused the Wormhole guardians’ RPCs to fall behind and fail to observe Pyth price updates. Over time, the nodes healed themselves and the Wormhole guardians started to reliably capture all the price updates.

Public Hermes instances faced additional problems due to improper failure handling, which resulted in longer downtimes. The failures were the result of inaccurate health-check probes, higher load, and the backoff time between service crashes. Some temporary mitigations were put in place and more will be developed in the future. The sponsored prices were not affected, as they failed over to third-party Hermes endpoints when the public Hermes instances were down.
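
The failover behaviour described above can be illustrated with a minimal sketch: try a list of price sources in order and accept the first one that returns a sufficiently fresh update. This is not Pyth's or Hermes's implementation and does not use any real Hermes API. `PriceUpdate`, `first_fresh_price`, the fetcher functions, and the 10-second freshness threshold are all hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class PriceUpdate:
    price: float
    publish_time: float  # Unix timestamp in seconds


def first_fresh_price(
    fetchers: Sequence[Callable[[], PriceUpdate]],
    max_age_s: float,
    now: Optional[float] = None,
) -> Optional[PriceUpdate]:
    """Try each fetcher in order and return the first update that is no
    older than max_age_s. Return None if every source fails or is stale."""
    now = time.time() if now is None else now
    for fetch in fetchers:
        try:
            update = fetch()
        except Exception:
            # Treat any error from this source as "unavailable" and move on.
            continue
        if now - update.publish_time <= max_age_s:
            return update
    return None


# Hypothetical sources standing in for a public instance and two fallbacks.
NOW = 1_000_000.0

def public_instance() -> PriceUpdate:
    raise ConnectionError("instance is down")

def stale_fallback() -> PriceUpdate:
    return PriceUpdate(price=100.0, publish_time=NOW - 120)  # two minutes old

def fresh_fallback() -> PriceUpdate:
    return PriceUpdate(price=101.5, publish_time=NOW - 2)    # two seconds old

result = first_fresh_price(
    [public_instance, stale_fallback, fresh_fallback], max_age_s=10, now=NOW
)
print(result)
```

Checking the age of the returned data, rather than only whether the endpoint responds, is a design choice made for this sketch: an endpoint that answers with stale prices is treated the same as one that is down.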

Action items to prevent future incidents

Revise the node upgrade process to ensure uninterrupted and unaffected price outputs.
Investigate why faster node sync mechanisms (such as fetching a snapshot) did not work properly to reduce the load.
Create a status page for consumer protocols and share updates there.
Improve the health-check probes of Hermes and Beacon.