Tripadvisor Tech

Stories from the Product and Engineering Team at Tripadvisor

Meeting the scale challenge in the cloud: lessons from the CRM platform migration


As a developer on Tripadvisor’s CRM platform team, I found myself in the thick of a massive shift: moving our infrastructure to Amazon Web Services (AWS) while migrating to a new third-party marketing platform. In this post, I describe the challenges we faced while migrating to AWS, along with the lessons we learned throughout the process.


The task: migrating to both AWS and a new third-party marketing platform integration

Our team was tasked with moving from an internal marketing platform to a third-party provider platform. As you might guess, sending personalized marketing communications to millions of users every day comes with its own set of challenges. To complicate things further, not only were we introducing a new partner into the mix, but we also needed to build an API optimized for that partner, in line with our company’s ongoing shift away from on-prem infrastructure.

One of our key milestones was to move email communication to this partner platform. This shift required us to adapt and expand our expertise, especially in operating the new partner platform.

For some background, we typically personalize emails as they are being sent, so that recommendations are as timely as possible and our users get the most out of their travel plans. Since Tripadvisor is the world’s largest travel guidance platform, the massive scale of visitors means we only have a small window of time to render and deliver these communications to all of those users. Preloading data is a common approach here, but it limits our ability to tune communications to each user’s needs, and the third-party platform has limited capacity for such offline data anyway; it expects to make real-time calls to personalize communications.

The partner platform runs on highly distributed infrastructure that renders, personalizes, and delivers emails concurrently, and that concurrency is the crux of the problem. We needed infrastructure that could handle massive numbers of concurrent requests, especially requests that can’t be cached or that arrive during cold starts. On top of that, a significant part of the personalization data comes from other upstream Tripadvisor services, the same services that power the live site and app, which impose hard limits on the number of calls allowed for CRM needs.

Note: We discuss the challenges of building this system in more depth in separate posts.

Our existing in-house platform’s infrastructure was managed by our dedicated Operations teams, so we as developers could focus on building the application. With the transition to AWS, our team had to take responsibility for managing infrastructure ourselves. This turned out to be a significant adjustment.

This would also be our team’s first large-scale deployment in AWS. To help teams move to the cloud quickly, Tripadvisor provides an internal catalog of AWS Cloud Development Kit (CDK) solution constructs. This catalog made deploying the essential parts of our API much simpler. For example, we used a prebuilt construct built around Spring Cloud Gateway for routing, with Redis-backed rate limiting.
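
For illustration, here is a minimal sketch of the kind of route configuration that construct sets up, assuming a Spring Cloud Gateway application with the Redis rate limiter on the classpath. The route id, path, replenish/burst values, key resolver, and upstream URI are placeholders rather than our production settings.

```java
import org.springframework.cloud.gateway.filter.ratelimit.RedisRateLimiter;
import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import reactor.core.publisher.Mono;

@Configuration
public class GatewayRateLimitConfig {

    @Bean
    public RedisRateLimiter redisRateLimiter() {
        // replenishRate: tokens added per second; burstCapacity: maximum burst size.
        // Placeholder values; real limits come from load testing.
        return new RedisRateLimiter(1_000, 2_000);
    }

    @Bean
    public RouteLocator crmRoutes(RouteLocatorBuilder builder, RedisRateLimiter limiter) {
        return builder.routes()
            .route("crm-personalization", r -> r
                .path("/personalization/**")                     // placeholder path
                .filters(f -> f.requestRateLimiter(c -> c
                    .setRateLimiter(limiter)
                    // Use a single key so the limit applies across all callers
                    // rather than per client.
                    .setKeyResolver(exchange -> Mono.just("global"))))
                .uri("http://crm-personalization-service"))      // placeholder upstream
            .build();
    }
}
```

When a request exceeds the configured rate, the gateway responds with HTTP 429, which is exactly the behavior discussed below.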

The following diagram shows how requests are routed from the third-party platform to our services:

[Diagram: Request routing from the third-party platform to our services]

The hiccup: rate limiting and load challenges

Our partner platform allows marketers to set email send-rate limits on a per-campaign basis, but there is no way to define a global rate limit. As we added and migrated more campaigns, the number of concurrent requests kept increasing, which meant we had to enforce our own rate limits and return “Too Many Requests” (HTTP 429) status codes when overloaded. Moreover, the partner’s system enforces a 2-second timeout per API call, after which it closes the connection and retries with backoff. These retries do not count toward their email rate limit, which compounded the challenges we faced.

Our internal on-prem platform had supported loads of thousands of emails per second. To be safe, we quadrupled that capacity and performed extensive load testing to verify that our new AWS infrastructure could handle the required requests per second. We configured our rate limits accordingly and went live. Initially, things went smoothly, but then we hit a snag: several very large campaigns overlapped, and we suddenly started seeing traffic far beyond what our configuration could handle. On the old platform, load was spread out by its finite parallelization, so a lower throughput was enough. On the new platform, we could no longer control parallelization, so even quadrupling the capacity was not enough to guarantee we could meet our delivery target for a given hour on any given day.

The chaos: dealing with a snowball effect

It wasn’t long before our API started returning a flood of “Too Many Requests” (429) responses, which escalated into more severe “Internal Server Error” (5XX) responses. This created a snowball effect: each failed request prompted retries that compounded the problem, and ECS instances began failing health checks and being replaced repeatedly.

The immediate fix

Our first response was to add more ECS instances, hoping the additional capacity would ease the load. Unfortunately, this quick fix wasn’t enough: ECS kept restarting tasks because of failed health checks, which meant the new instance configuration never fully deployed.

To regain control, we temporarily activated AWS WAF (Web Application Firewall) rate limiting. This helped stabilize the system, since the request rate reaching it was now capped at its capacity, and it gave us time to figure out the real issue behind the chaos.
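
As a rough illustration, a WAF rate-based rule like the following blocks a source once it exceeds a request threshold over a five-minute window. This is a sketch using the CDK’s low-level wafv2 constructs in Java, written as code inside a CDK stack; the names and the limit value are placeholders, not our exact configuration.

```java
import software.amazon.awscdk.services.wafv2.CfnWebACL;
import java.util.List;

// Inside a CDK Stack: a WebACL with a single rate-based rule that blocks
// callers exceeding a fixed request count per 5-minute window.
CfnWebACL webAcl = CfnWebACL.Builder.create(this, "CrmApiWebAcl")
    .scope("REGIONAL") // attach to a regional resource such as an ALB
    .defaultAction(CfnWebACL.DefaultActionProperty.builder()
        .allow(CfnWebACL.AllowActionProperty.builder().build())
        .build())
    .visibilityConfig(CfnWebACL.VisibilityConfigProperty.builder()
        .cloudWatchMetricsEnabled(true)
        .metricName("CrmApiWebAcl")
        .sampledRequestsEnabled(true)
        .build())
    .rules(List.of(CfnWebACL.RuleProperty.builder()
        .name("GlobalRateLimit")
        .priority(0)
        .statement(CfnWebACL.StatementProperty.builder()
            .rateBasedStatement(CfnWebACL.RateBasedStatementProperty.builder()
                .limit(100_000)        // placeholder: requests per 5-minute window
                .aggregateKeyType("IP")
                .build())
            .build())
        .action(CfnWebACL.RuleActionProperty.builder()
            .block(CfnWebACL.BlockActionProperty.builder().build())
            .build())
        .visibilityConfig(CfnWebACL.VisibilityConfigProperty.builder()
            .cloudWatchMetricsEnabled(true)
            .metricName("GlobalRateLimit")
            .sampledRequestsEnabled(true)
            .build())
        .build()))
    .build();
```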

Uncovering the real problem

We noticed a recurring error in the logs: the file descriptor limit was being reached, which prevented new connection sockets from being created. This prompted us to increase the file descriptor limits on our ECS instances. While this helped temporarily, it didn’t fully resolve the underlying issue.
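
For reference, the open-file-descriptor limit of an ECS container can be raised in the task definition. Below is a hedged sketch using the CDK’s ECS constructs in Java, inside a CDK stack; the image name, resource sizes, and limit values are placeholders.

```java
import software.amazon.awscdk.services.ecs.ContainerDefinitionOptions;
import software.amazon.awscdk.services.ecs.ContainerImage;
import software.amazon.awscdk.services.ecs.FargateTaskDefinition;
import software.amazon.awscdk.services.ecs.Ulimit;
import software.amazon.awscdk.services.ecs.UlimitName;
import java.util.List;

// Inside a CDK Stack: a task definition whose container raises the "nofile"
// ulimit so the gateway can hold more open sockets under heavy concurrency.
FargateTaskDefinition taskDef = FargateTaskDefinition.Builder.create(this, "GatewayTaskDef")
    .cpu(1024)
    .memoryLimitMiB(2048)
    .build();

taskDef.addContainer("gateway", ContainerDefinitionOptions.builder()
    .image(ContainerImage.fromRegistry("example/crm-gateway")) // placeholder image
    .ulimits(List.of(Ulimit.builder()
        .name(UlimitName.NOFILE)
        .softLimit(65_536)   // placeholder: raised from the default to avoid socket exhaustion
        .hardLimit(65_536)
        .build()))
    .build());
```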

The root cause remained elusive until we took a closer look at the Redis configuration. Our Redis instance, supplied by the CDK construct, turned out to be an outdated version with limited network performance.

To fix this, we updated the CloudFormation node within the CDK code to use a newer Redis instance type with better network capabilities. Once the change was deployed, our system handled the workload much better, but this was a clear wake-up call that encouraged us to scrutinize every part of our tech stack carefully.
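
In CDK terms, that kind of change can be made either in the construct itself or, from the consuming stack, through a CloudFormation escape hatch. The sketch below assumes the catalog construct exposes its ElastiCache cluster as a child named "Redis"; that id and the node type are illustrative, not the construct’s actual internals.

```java
import software.amazon.awscdk.CfnResource;
import software.constructs.Construct;

final class RedisNodeTypeOverride {

    // Escape-hatch override: reach the underlying CloudFormation resource of the
    // Redis cluster created by the shared construct and swap its node type.
    static void apply(Construct rateLimiterConstruct) {
        CfnResource cacheCluster = (CfnResource) rateLimiterConstruct
            .getNode()
            .findChild("Redis")      // assumed child id of the Redis cache cluster
            .getNode()
            .getDefaultChild();
        // "cache.r6g.large" is an example of a newer, network-optimized node type.
        cacheCluster.addPropertyOverride("CacheNodeType", "cache.r6g.large");
    }
}
```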

Learning the rate-limiting lesson

After addressing the immediate problem, we came to a key realization: although our Spring Cloud Gateway rate limits were useful for typical traffic patterns, they fell short against unpredictable traffic spikes or potential DDoS scenarios. That’s where more robust solutions like AWS WAF proved extremely valuable.

Lessons learned and moving forward

In hindsight, we learned several important lessons that reshaped our understanding of infrastructure readiness and tactical resilience:

Leveraging a proven stack

  • We initially went live without WAF because the cost didn’t seem justified. We assumed that, since we had quadrupled our capacity and had a rate limiter in place, the worst case would be returning 429s and operating as usual. We overlooked the fact that during oversaturation even healthy nodes start returning 5XX errors, meaning we couldn’t meet the agreed requests-per-second SLA. Autoscaling wouldn’t have helped either, because the rate limiter was a single point of failure.
  • Interestingly, once we had WAF in place, we no longer needed to over-provision. The resulting reduction in our ECS costs far outweighed the WAF’s expense, so it more than paid for itself.

Enhancing the flexibility of CDK constructs

  • Our Redis issue prompted us to work closely with the team that manages our CDK constructs. Together, we updated the default Redis instance type and added extra configurations which helped to adapt the CDK construct to better suit our needs.

Embracing DevOps skills

  • Throughout this process, we understood the importance of incorporating DevOps skills into our toolkit. Getting comfortable with these concepts empowered us to troubleshoot problems, streamline workflows, and quickly adapt to unforeseen challenges.

Prioritizing continuous monitoring

  • Having effective monitoring in place became an obvious priority. While we used various dashboards to monitor the health of our services, the information required to make decisions was often fragmented. We’ve made monitoring more comprehensive across our systems, enabling us to spot potential issues before they escalate.

Conclusion

Navigating the complexities of scaling our CRM platform infrastructure on AWS while switching email providers taught us a lot about managing complex systems. By leaning into robust infrastructure practices, building out our DevOps capabilities, and remaining adaptable in the face of challenges, we are now better positioned to handle what the future might throw our way.
