Expedia Group Technology — Engineering

Lessons from a Rollback Gameday

Insights and best practices from a real-world rollback gameday

4 min read · Jan 9, 2025

A hiker stands on a rocky landscape at dawn. — Photo by NEOM on Unsplash

In software development and deployment, things don’t always go as planned. Even with rigorous testing and careful planning, new releases can introduce unexpected issues that affect user experience or system stability. At Expedia Group™, we recognize that a well-executed rollback can be the difference between a minor hiccup and a major crisis. This blog post shares key insights from our team’s recent rollback gameday, a proactive exercise designed to sharpen our incident response skills. The lessons learned can help any development team improve its rollback process and maintain service reliability.

Training and drills

Run gamedays on a regular schedule rather than as one-off events. You might opt for a monthly cadence if your CI/CD pipeline changes frequently, or adjust the frequency to fit your team’s needs. Schedule around other team activities and workload peaks to avoid conflicts and maximize participation.

Before the gameday, clearly define the metrics you need to collect. These might include:

Total Rollback Time (TRT): TRT = (timestamp of traffic first seen on the stable stack) minus (timestamp of rollback triggered)
SRE’s golden signals (latency, traffic, errors and saturation) during rollback
Success rate of rollback attempts: Success Rate = (number of successful rollbacks / total rollback attempts) × 100

Understanding these metrics in advance helps focus your team’s efforts and provides concrete data to analyze after the exercise.
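As a minimal sketch, the two formulas above could be computed like this. The function names and the sample timestamps are hypothetical and only illustrate the arithmetic; in practice the timestamps would come from your deployment tooling and traffic dashboards.

```python
from datetime import datetime, timedelta


def total_rollback_time(triggered_at: datetime, stable_traffic_at: datetime) -> timedelta:
    """TRT: time traffic is first seen on the stable stack minus time the rollback was triggered."""
    if stable_traffic_at < triggered_at:
        raise ValueError("stable traffic cannot precede the rollback trigger")
    return stable_traffic_at - triggered_at


def rollback_success_rate(successful: int, attempts: int) -> float:
    """Successful rollbacks as a percentage of all rollback attempts."""
    if attempts <= 0:
        raise ValueError("at least one rollback attempt is required")
    return 100 * successful / attempts


# Hypothetical gameday observations (not real data).
triggered = datetime(2025, 1, 9, 10, 0, 0)
stable = datetime(2025, 1, 9, 10, 4, 30)
trt = total_rollback_time(triggered, stable)
rate = rollback_success_rate(successful=9, attempts=10)
print(f"TRT: {trt.total_seconds():.0f}s, success rate: {rate:.1f}%")
```

Recording both values for every drill gives you a trend to compare across gamedays.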

It’s also important to review your service’s SLO (Service Level Objective) to set a target MTTR (Mean Time to Recovery) for rollbacks. The SLO defines the level of service your users expect, and MTTR measures how quickly you recover when that level is breached. Aligning your rollback gameday with these targets ensures that your simulations reflect real-world pressures and expectations.
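One way to relate an SLO to a recovery-time target is through the error budget. The sketch below assumes an availability SLO measured over a 30-day window and treats each incident as a full outage for the duration of the recovery; both are simplifying assumptions, and the function names are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for an availability SLO."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_target)


def incidents_within_budget(slo_target: float, recovery_minutes: float,
                            window_days: int = 30) -> int:
    """How many full-outage incidents of the given recovery time fit in the budget."""
    if recovery_minutes <= 0:
        raise ValueError("recovery_minutes must be positive")
    return int(error_budget_minutes(slo_target, window_days) // recovery_minutes)


# A 99.9% availability SLO over 30 days leaves roughly 43.2 minutes of budget.
budget = error_budget_minutes(0.999)
print(f"Error budget: {budget:.1f} minutes")
print("10-minute rollbacks that fit:", incidents_within_budget(0.999, 10))
```

If the rollback time measured during a gameday consumes most of the budget in a single incident, that is a signal to shorten the rollback path.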

Regular, measured drills keep the team familiar with rollback procedures and fit into your overall workflow. They also surface bottlenecks in your process before they affect live operations, which keeps your system resilient and responsive.

Choosing the right version for rollback

When it comes to rollbacks, a common misconception is that reverting to the immediately previous version is always the best course of action. However, this approach can sometimes lead to more problems than it solves. The reality is that the previous version might also be unstable or contain critical bugs, and blindly rolling back to it could compound existing issues or introduce new ones.

To address this challenge, it’s crucial to implement a “safe harbor” strategy in your Continuous Deployment pipeline. This involves creating a robust mechanism to identify and tag stable versions of your software, ensuring you always have reliable rollback points in case of severe issues.

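For Kubernetes applications, you can use liveness and readiness probes to evaluate your app’s health and stability: a liveness probe tells Kubernetes when to restart a container, and a readiness probe tells it when a container is ready to accept traffic.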

Beyond these Kubernetes built-in features, consider implementing performance benchmarks in your pipeline to ensure the application meets required response times and throughput. Once a version successfully passes all these checks, it can be tagged as a “stable” version. These tagged versions then serve as your “safe harbors.” They provide a crucial safety net for your team when things go awry, allowing you to quickly revert to a known good state while you diagnose and fix the problems in the newer version.
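The tagging gate described above can be sketched as a single predicate that a pipeline step evaluates before applying the “stable” tag. This is an illustration only: the `ValidationResult` fields and the threshold values are hypothetical, and how the probe and benchmark results are collected depends on your own pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationResult:
    """Check results gathered for one deployed version (hypothetical shape)."""
    liveness_ok: bool
    readiness_ok: bool
    p95_latency_ms: float
    throughput_rps: float


def qualifies_as_stable(result: ValidationResult,
                        max_p95_latency_ms: float = 300.0,
                        min_throughput_rps: float = 100.0) -> bool:
    """A version is tagged stable only if every check passes."""
    return (
        result.liveness_ok
        and result.readiness_ok
        and result.p95_latency_ms <= max_p95_latency_ms
        and result.throughput_rps >= min_throughput_rps
    )


# Illustrative values: one version passes every check, one misses the latency target.
good = ValidationResult(True, True, p95_latency_ms=180.0, throughput_rps=250.0)
slow = ValidationResult(True, True, p95_latency_ms=650.0, throughput_rps=250.0)
print(qualifies_as_stable(good), qualifies_as_stable(slow))
```

Requiring every check to pass keeps the tag conservative, which is the point of a safe harbor.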

The image shows a software deployment lifecycle with versions 1.1, 1.2 (green and labeled “Stable”), 1.3, and 1.4 (red). An arrow indicates a rollback from 1.4 to 1.2. Below 1.2, a gray box lists validated checks: “Liveness Probe,” “Readiness Probe,” and “Performance,” all with green checkmarks. It highlights 1.2 as a stable rollback point due to successful validation. — Roll back to a stable version
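The selection rule in the figure, rolling back to the most recent stable version instead of the immediately previous one, might look like the sketch below. The `Release` type and `rollback_target` function are hypothetical names, and marking 1.1 and 1.3 as not tagged stable is an assumption made for the example.

```python
from typing import NamedTuple, Optional, Sequence


class Release(NamedTuple):
    version: str
    stable: bool  # True once the version has passed all validation checks


def rollback_target(history: Sequence[Release], current: str) -> Optional[str]:
    """Return the newest stable version deployed before `current`.

    `history` is ordered oldest to newest. Returns None when no earlier
    version carries the stable tag.
    """
    versions = [release.version for release in history]
    if current not in versions:
        raise ValueError(f"unknown version: {current}")
    earlier = history[: versions.index(current)]
    for release in reversed(earlier):
        if release.stable:
            return release.version
    return None


history = [
    Release("1.1", stable=False),
    Release("1.2", stable=True),
    Release("1.3", stable=False),
    Release("1.4", stable=False),
]
print(rollback_target(history, "1.4"))  # prints 1.2, skipping the untagged 1.3
```

Returning `None` when no safe harbor exists forces the caller to handle that case explicitly instead of silently falling back to an unverified version.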

Rollback playbooks

A rollback playbook is your service’s emergency manual for reversing problematic deployments. When a new release causes unexpected issues, this playbook guides your team through the process of quickly reverting to a previous, stable version of your software. The playbook ensures that even under pressure, anyone on-call can act swiftly and confidently. It minimizes the impact of deployment issues by shortening recovery time.

Key components of an effective rollback playbook include:

Clear decision-making criteria for initiating a rollback
Step-by-step rollback procedures
Roles and responsibilities during a rollback
Communication protocols
Monitoring and observability during rollback
Verification steps to ensure successful rollback
Post-rollback analysis and reporting
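The first component, clear decision-making criteria, is easiest to follow under pressure when it is written down as an explicit rule. The sketch below is one possible form, assuming the playbook triggers a rollback when error rate or latency stays above a threshold for several consecutive monitoring samples. The thresholds, the `Sample` type and the function name are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One monitoring sample taken after a deployment (hypothetical shape)."""
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float


def should_roll_back(samples: Sequence[Sample],
                     max_error_rate: float = 0.05,
                     max_p95_latency_ms: float = 800.0,
                     consecutive: int = 3) -> bool:
    """True when each of the last `consecutive` samples breaches a threshold."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    if len(samples) < consecutive:
        return False  # not enough data to decide yet
    recent = samples[-consecutive:]
    return all(
        s.error_rate > max_error_rate or s.p95_latency_ms > max_p95_latency_ms
        for s in recent
    )


# Illustrative samples: a single spike does not trigger, a sustained breach does.
spike = [Sample(0.01, 200), Sample(0.20, 210), Sample(0.01, 190)]
sustained = [Sample(0.01, 200), Sample(0.12, 900), Sample(0.09, 400), Sample(0.02, 950)]
print(should_roll_back(spike), should_roll_back(sustained))
```

Requiring consecutive breaches is a design choice that trades a slightly slower trigger for fewer rollbacks caused by transient noise; a gameday is a good place to tune that trade-off.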

Remember, a rollback playbook is not a static document. It requires continuous improvement through regular reviews and updates. This is why scheduling frequent rollback gamedays is crucial. These simulations provide valuable opportunities to test, refine and enhance your playbook.

Conclusion

As we’ve explored throughout this blog post, a robust rollback strategy is an essential part of any modern deployment pipeline. A well-prepared team with a solid rollback strategy can turn potential disasters into manageable incidents. So, take the time to implement these practices, run those drills, and keep refining your approach.
