March 9th Service Disruption

Incident Report for Soundtrack

Postmortem

Incident Report: Resilience, Recovery and Our Commitment to You

Date: March 12, 2026 10:45 UTC

Subject: A transparent look at the March 9th service disruption

At Soundtrack Technologies, we know that when our service stops, your business is affected. On Monday, March 9th, we faced a complex technical challenge that tested our systems. We want to share the story of what happened, how our teams responded, and the concrete steps we are taking to ensure we stay ahead of similar issues in the future.

The incident story

The incident began when one of our primary cloud infrastructure providers experienced an internal service disruption. This coincided with an automated update to our server clusters. While our systems have built-in fallback mechanisms, this specific combination hit an edge case that bypassed several of our standard redundancies.

As the cloud provider’s systems attempted to upgrade, they left our platform in a degraded state, leading to intermittent connectivity across our core services. To our users, this manifested in three frustrating ways:

  1. The "White Screen": Many users saw a blank loading loop without an error message.
  2. Management Lockout: The tools used to manage music and business accounts were temporarily unreachable.
  3. Playback Interruptions: While our local caching kept music playing for many, some devices were disconnected because the authentication service could not verify sessions during the cloud update.

We Heard You

During the incident, we received a lot of feedback. We want to address the most important sentiments we heard:

  • "Is it me or you?" Many of you spent valuable time troubleshooting your own hardware—restarting iPads and checking Wi-Fi—only to find the issue was on our side.
  • The Dreaded Loading Loop: A blank screen without information causes significant confusion. You weren't sure if the app was broken or just slow.
  • The Awkward Silence: When the music stops and you cannot log in to fix it, it affects your guests and your atmosphere. We understand that responsibility.

Our 24/7 Response and Resolution

Our Site Reliability Engineering (SRE) team operates on a 24/7 model, ensuring expert eyes are on our systems every second of the day. Within minutes of the first anomaly, our on-call engineers established a centralized "War Room."

Because the root cause sat within a third-party provider’s infrastructure, our engineers took manual control of the "Control Plane" (the brain of our server clusters) and restarted services in a specific sequence to restore traffic. Even as our systems were hit with 14x the normal traffic from devices trying to reconnect, our team stabilized the platform and fully restored services by 18:10 UTC, when the third-party cloud provider’s incident was resolved. Incidents lasting several hours are unusual for us, as we are typically able to resolve issues very quickly; this was one of the longest-running incidents in Soundtrack history.
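For technically minded readers: a reconnect surge like the 14x spike described above is commonly softened by having each client wait a randomized, growing delay between attempts, so that devices do not all retry at the same moment. The sketch below illustrates that general technique (exponential backoff with full jitter). It is not a description of how Soundtrack devices actually retry; the function name `reconnect_delay` and the `base` and `cap` values are assumptions chosen for illustration.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base: float = 1.0,      # hypothetical: seconds before the first retry window
    cap: float = 300.0,     # hypothetical: never wait longer than five minutes
    rng: Optional[random.Random] = None,
) -> float:
    """Return a randomized wait (in seconds) before reconnect attempt `attempt`.

    The upper bound doubles with every attempt up to `cap`, and the actual
    delay is drawn uniformly from [0, bound] so clients spread out in time.
    """
    source = rng if rng is not None else random
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    bound = min(cap, base * (2 ** min(attempt, 32)))
    return source.uniform(0.0, bound)


if __name__ == "__main__":
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

Because each delay is random, the printed values differ from run to run. The only guarantee is that they stay within the growing bound and never exceed `cap`.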

Turning Lessons into Action

A mature SRE culture is defined by how it learns. We are not just "fixing" this incident; we are evolving because of it. We have initiated a comprehensive roadmap of over 15 high-priority action points to prevent a recurrence:

  • Enhanced Communication: While we currently display a Statuspage pop-up in the web app, we are developing a dedicated “Major Incident Error Page” for our apps. If an outage occurs, you’ll immediately know it’s on our side—so you won’t have to spend time troubleshooting your own Wi-Fi.
  • White Screen Issue: We are also improving the messaging shown during the white screen loading state that many of you have experienced.
  • Technical Resilience: We are updating the "Circuit Breakers" to handle massive traffic surges and investigating multi-region, High-Availability (HA) setups for our most critical services to reduce dependence on any single cloud provider zone.
  • Resilient Pairing: We are updating our internal logic to ensure that even if a service returns a temporary error, your devices won't "unpair" or log you out unnecessarily.
  • Smarter Maintenance: We are separating our maintenance windows so that infrastructure updates never happen simultaneously across different parts of our system, ensuring our monitoring tools remain online even during upgrades.
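
For readers curious what a "Circuit Breaker" does, here is a minimal sketch of the general pattern: after a number of consecutive failures, calls to an unhealthy dependency fail fast for a cooling-off period instead of adding load, and a single trial call is then allowed through to check whether the dependency has recovered. This is an illustration only, not Soundtrack's implementation. The class name, the thresholds, and the timeout are assumptions, and the sketch is single-threaded.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,    # hypothetical: failures before opening
        reset_timeout: float = 30.0,   # hypothetical: seconds to stay open
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # cooling-off period is over; allow a trial call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency marked unhealthy; failing fast")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures while closed,
            # opens (or re-opens) the circuit and restarts the timer.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to the dependency, for example `breaker.call(lambda: fetch_session(token))`, where `fetch_session` is a hypothetical function, and treat `CircuitOpenError` as a signal to fall back to cached data rather than retry immediately.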

Our Commitment

We pride ourselves on our technical maturity, but we pride ourselves more on the trust you place in us to provide the soundtrack to your day. Our 24/7 team remains vigilant, and we thank you for the candid feedback that helps us build a more robust, professional, and resilient platform.

Please subscribe to our Status page updates to be notified when an incident happens.

The Soundtrack Support & Site Reliability Engineering Team

Posted Mar 13, 2026 - 12:32 UTC

Resolved

The incident was caused by an issue with our third-party hosting provider. While the problem was ongoing, we actively explored alternative ways to mitigate the impact.

As a result of this incident, some devices may have become unpaired and will need to be re-paired. Music that was already played or cached should have worked, but there may have been difficulties updating to new schedules or playlists. With that said, everything should be back to normal now.

If you notice any further issues, please contact us at support@soundtrack.io
Posted Mar 09, 2026 - 18:10 UTC

Monitoring

We’ve applied the necessary backend configuration to address the issue and are currently monitoring progress. It may take a little time for the situation to be fully resolved, but we’ll keep you informed.

We understand this may be disruptive and sincerely appreciate your patience. We’ll continue to provide updates as we make progress.
Posted Mar 09, 2026 - 17:02 UTC

Identified

Update: We’ve narrowed down the issue and are starting to see improvements. We will continue to investigate and monitor the situation closely.

We understand this may be disruptive and sincerely appreciate your patience. We’ll keep you updated as we make progress.
Posted Mar 09, 2026 - 16:34 UTC

Update

Update: We are continuing to actively work on resolving the issue as quickly as possible.

We understand this may be disruptive and sincerely appreciate your patience. We will continue to provide updates as we make progress.
Posted Mar 09, 2026 - 14:26 UTC

Update

Update: We are actively working on alternative ways to resolve the issue as quickly as possible.

We understand this may be disruptive and sincerely appreciate your patience. We will continue to provide updates as we make progress.
Posted Mar 09, 2026 - 13:37 UTC

Update

Update: The previous solution didn’t fully resolve the issue. Our team is actively working on alternative ways to fix it as quickly as possible.

We understand this may be disruptive and sincerely appreciate your patience. We’ll continue to provide updates as we make progress.
Posted Mar 09, 2026 - 12:37 UTC

Update

Update: Things are now moving in the right direction and the situation is improving.

That said, we’re still dealing with some remaining effects from the incident and are continuing our work to fully stabilise the service.

Thank you for your patience. We’ll keep you updated as we make further progress.
Posted Mar 09, 2026 - 11:57 UTC

Update

We’re continuing to work on the issue, and current indications are that resolution could take some time.

If you have alternative ways to play music, we recommend using those in the meantime.

Our team is fully engaged and working with the highest priority to resolve this as quickly as possible. We’ll share updates as soon as we have more information.
Posted Mar 09, 2026 - 11:34 UTC

Investigating

We regret to inform you that our platform is currently experiencing a partial outage that is affecting some services. Users may encounter difficulties accessing any part of our platform, including the website, app, and customer support channels.
Posted Mar 09, 2026 - 10:45 UTC