Date: March 12, 2026 10:45 UTC
Subject: A transparent look at the March 9th service disruption
At Soundtrack Technologies, we know that when our service stops, your business is affected. On Monday, March 9th, we faced a complex technical challenge that tested our systems. We want to share the story of what happened, how our teams responded, and the concrete steps we are taking to ensure we stay ahead of similar issues in the future.
The incident began when one of our primary cloud infrastructure providers experienced an internal service disruption. This coincided with an automated update to our server clusters. While our systems have built-in fallback mechanisms, this specific combination hit an edge case that bypassed several of our standard redundancies.
As the cloud provider’s systems attempted the upgrade, they left our clusters in a degraded state, leading to intermittent connectivity across our core services. To our users, this manifested in three frustrating ways:
During the incident, we received a lot of feedback. We want to address the most important sentiments we heard:
Our Site Reliability Engineering (SRE) team operates on a 24/7 model, ensuring expert eyes are on our systems every second of the day. Within minutes of the first anomaly, our on-call engineers established a centralized "War Room."
Because the root cause sat within a third-party provider's infrastructure, our engineers took manual control of the "Control Plane" (the brain of our server clusters) to shield our customers, then restarted services in a specific sequence to restore traffic. Even as our systems were hit with 14x the normal traffic from devices trying to reconnect, our team stabilized the platform and fully restored services by 18:10 UTC, when the third-party cloud provider’s incident was resolved. Incidents rarely last several hours, as we are typically able to resolve issues very quickly, which makes this one of the longest-running incidents in Soundtrack history.
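For technically minded readers, the idea of restarting services in a specific sequence can be sketched as a dependency-ordered restart. This is only an illustration under stated assumptions, not a description of our production tooling: the service names, the dependency map, and the `restart` callback are all hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical service names and dependencies, for illustration only.
# Each key maps to the set of services that must be up before it starts.
DEPENDENCIES = {
    "control-plane": set(),
    "auth": {"control-plane"},
    "device-gateway": {"control-plane", "auth"},
    "playback-api": {"auth"},
}


def restart_order(dependencies):
    """Return services ordered so each one follows everything it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


def restart_all(dependencies, restart):
    """Call restart(name) for each service in dependency order.

    `restart` is a caller-supplied function (an assumption of this sketch);
    in practice it would also wait for a health check before returning.
    """
    order = restart_order(dependencies)
    for name in order:
        restart(name)
    return order
```

Calling `restart_all(DEPENDENCIES, print)` would list the services in an order where nothing starts before the services it relies on. A circular dependency makes `TopologicalSorter` raise `graphlib.CycleError`, so a sequence that cannot be satisfied is caught before anything is restarted.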
A mature SRE culture is defined by how it learns. We are not just "fixing" this incident; we are evolving because of it. We have initiated a comprehensive roadmap of over 15 high-priority action points to prevent a recurrence:
We pride ourselves on our technical maturity, but we pride ourselves more on the trust you place in us to provide the soundtrack to your day. Our 24/7 team remains vigilant, and we thank you for the candid feedback that helps us build a more robust, professional, and resilient platform.
Please subscribe to our Status page updates to be notified when an incident happens.
— The Soundtrack Support & Site Reliability Engineering Team