Tripadvisor Tech

Stories from the Product and Engineering Team at Tripadvisor

Introducing Baldur: Tripadvisor’s self-serve experimentation tool for marketing


How we built a tool that enables us to make the right marketing decisions more often, through more frequent, robust and democratized experimentation.


At Tripadvisor we care deeply about the efficient and accurate measurement of our business initiatives. As we push forward with new strategies we must learn quickly from our prior successes and setbacks and, where necessary, be ready to adapt or change direction at a moment’s notice. This approach extends beyond product A/B testing to all areas of the business, including marketing. In marketing, the impact of a weak experimentation strategy is measured both in a direct loss of competitive advantage and in real short-term and potentially long-term profit loss.

For this reason, last year Tripadvisor decided to invest further in Causal Inference and Experimentation, in other words the ways in which, and extent to which, we test our business and design decisions, by updating our organisational and technical processes across the board. To support this, we gathered a cross-functional group of data scientists, analysts and engineers who were, at the time, working directly to support our product experimentation platform. Together we observed that while our in-house product A/B testing platform was mature and adoption rates were high, the same could not be said for marketing. Marketing experimentation (excluding search-engine optimisation (SEO)) was not democratized, meaning that every experiment required significant input from a data scientist to be designed, executed and measured. This was slowing down our rate of innovation, increasing the time between ideation and execution of experiments and decreasing the number of new ideas we could test. The marketing experimentation gap was caused by differences in the required technical implementation that our existing product and SEO experimentation tools could not manage. As a result, we decided to work together to build a new tool that could fill this gap.

By leveraging an existing statistics backend developed by data scientists on the Marketing team, a single data scientist and engineer built Baldur, a tool that democratizes and automates marketing experimentation. Marketing is a significant expense and one of the main drivers of user traffic and revenue for many businesses, including Tripadvisor. By making the right decisions more often through more frequent experimentation, we can potentially use this tool to drive millions of additional users to our site each year.

Marketing vs. product experimentation

To explain both why we needed a new tool and how it works, let’s consider why we can’t run normal A/B tests in marketing. In a standard A/B test, our experiment units are individual users allocated randomly into control and treatment such that, with a large enough sample size (number of users), we expect balance in terms of all important user characteristics. In causal inference terms, by doing this we are controlling for all confounding variables including unknown user characteristics and external factors in addition to information that we have available to us, such as device type. This independence of allocation into treatment or control from any other variable, combined with the fact that treatment exposure is at a user level, makes it possible to infer that the feature we’re testing is the sole cause of any change in our metrics during the experiment. Conceptually, this is relatively simple to achieve on your own website. As users arrive you simply bucket them at random into control and treatment groups and deliver the new experience to those in the treatment.
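As a concrete illustration of this kind of user-level bucketing, the minimal sketch below assigns each user to control or treatment by hashing a user ID together with an experiment-specific salt. This is a generic example, not the implementation used by Tripadvisor’s product experimentation platform, and the `user_id` and `experiment_salt` names are hypothetical.

```python
import hashlib

def assign_bucket(user_id: str, experiment_salt: str = "example-experiment") -> str:
    """Deterministically bucket a user into 'control' or 'treatment'.

    Hashing (rather than a fresh random draw) keeps the assignment stable
    across repeat visits while remaining effectively random across users.
    """
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

# The same user always lands in the same bucket:
assert assign_bucket("user-123") == assign_bucket("user-123")
```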

Marketing initiatives are more complex because they occur outside of the Tripadvisor site where we don’t control the targeting of individual users with specific campaigns (e.g. a creative) and therefore we cannot randomize at a user level to properly control for confounders¹. Instead, our experiment units are somewhat amorphous groups. They can, for example, be search engine query “categories” or searches for specific locations that we host on our site (e.g. a hotel). Alternatively, they can be regions that the user is observing our campaign from or searching from. For simplicity we will use designated market areas (DMAs) as our example experiment units, where the minimum campaign exposure is at a city-wide scale, delivered to many thousands or even millions of users at once. When a user observes our campaign while in NYC, they are counted in the NYC DMA.

¹ Google and other search providers do offer services that enable user-based testing in some situations but for a number of reasons we usually prefer to do in-house measurement even when this option is available.

There are three important observations to make as a result:

  1. As in a product A/B test there is a wide range of confounders influencing each experiment unit, but these are now aggregate effects, such as average user characteristics and the economic, social or environmental factors specific to users in that region. This leads to large differences in the way that aggregate metrics (clicks, cost, revenue etc.) for each DMA respond to campaigns. We clearly cannot isolate these effects at the level of the units we ultimately care about, our users, and control for them.
  2. It’s possible for users to move between the DMA experiment units and therefore theoretically possible for them to move between any treatment and control composed of groups of DMAs. In causal inference terms, this is referred to as a violation of the stable unit treatment value assumption (SUTVA). It is often simple for a user to observe a campaign intended for users based in NYC while they’re in New Jersey. This is due to travel between states but also because users will find other ways even without leaving their own homes. This is comparable to users influencing one another via word-of-mouth, which we assume has a negligibly small effect going forward.
  3. The grouping of many users into each experiment unit means that we have vastly fewer experiment units to allocate between treatment and control. The entire United States is split into around 200 different DMAs and many have very different traffic volumes. This means that a significant change in one high-volume DMA, such as NYC, can dominate the aggregated behavior of a group of DMAs. Clearly this presents a challenge if we are also trying to measure the effects of a campaign on the broader population such as the entire US.
Fig. 1. Why having few, highly different and sometimes indistinct groups makes experiment creation challenging. Localised seasonal and economic factors, alongside potential travel between states and large user-volume differences, mean that creating comparable treatment and control groups containing just the above seven states would be very difficult, or even impossible.

Ultimately, since we care about the effect of the treatment on our users but we can’t directly control an individual’s allocation into the treatment, a DMA-based experiment will never match the robustness of the assumptions met by a product A/B test. The challenge is this: we need to find a way to create comparable treatment and control groups while carefully relaxing the requirement for controlled confounders.

One widely used Causal Inference method, and the theoretical foundation for the statistical backend of our marketing experimentation tool, is called Difference-in-Difference (DiD).

Difference-in-Difference

DiD works on the assumption that two parallel time-series trends from the same data source will continue in parallel in the absence of a treatment. To leverage this assumption we randomly split our experiment units, which in this case can be a type of action or attribute (for example DMA), into two groups, then plot the aggregated daily behavior for each group up to the present day. If the aggregate metrics (clicks, cost, revenue etc.) for the two samples produce parallel time-series trends, we assume that the trends would have continued as such without any intervention (see point 1, previous section) and can use them as valid and comparable control and treatment groups. An illustration that compares A/B testing to the DiD methodology is shown in Fig. 2.

Fig.2. Illustrating the difference between an A/B test (top) vs. a marketing experiment leveraging DiD (bottom). In a product A/B test, millions of users are randomly assigned into control and treatment. This naturally leads to balanced aggregate behavior, which allows us to attribute any difference in means to the impact of the treatment. In DiD, 100s or 1000s of groups (regions, search locations, etc.) are randomly bucketed to create treatment and control. If we see parallel time-series trends (offset doesn’t matter) in the pre-period before the experiment, then we attribute any new difference between control and treatment to the impact of the treatment post experiment launch.

The result of this is that when we start our experiment and release something to our designated treatment group whilst doing nothing to the control group, we assume that any difference we observe in the time-series behavior of the treatment vs. control is due solely to the effect of our release on the treatment. In other words we assume, in the absence of a treatment, that the average difference between control and treatment would have remained constant. The theoretical result of the parallel trends assumption holding true in the post-period (after experiment launch) is called the counterfactual. The counterfactual is simply the post-period control trend plus a constant equal to the average of the difference between treatment and control in the pre-period. A simulated random split that closely meets these requirements is shown in the pre-period (up to 2024–12–15) of Fig.3, which includes the expected trend for the counterfactual.

Fig.3. Simulated time-series plot for control and treatment groups that meet the requirements for DiD in the pre-period between 2024–10–01 and 2024–12–15 i.e. approximately constant difference and correlation. In the post-period there is a 20% lift in the treatment (green) measured from the value above the counterfactual (green dashed line). The difference-in-difference is found by comparing the mean difference between treatment and control in the post-period to pre-period (before and after vertical red dashed-line). Fig.4. below shows a plot of the difference.
Fig.4. Time-series plot for the difference between treatment and control in the pre-period (blue, before red-dashed vertical line) and post-period (green, after red-dashed vertical line) from Fig.3. The average pre-period difference represents what would have happened on average if the intervention had had no effect on the treatment group and so continued in parallel (constant difference) vs. the control group trend.

The difference-in-difference is found by comparing the difference between treatment and control in the post-period to the difference in the pre-period (see Fig.3 and Fig.4). Note specifically that we are comparing the means of two differences, hence the name of the method. Writing this out mathematically we get,

DiD = (Ȳ_treatment,post − Ȳ_control,post) − (Ȳ_treatment,pre − Ȳ_control,pre)

where each term in the above equation represents an average, i.e. the mean daily value of the metric for the given group and period. From there, the measurement is approximately the same as for a product A/B test, but instead of comparing the distributions of aggregate user performance, we compare the distributions of the daily differences in the pre- and post-periods.
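To make the calculation concrete, here is a minimal sketch of the DiD estimate computed from daily aggregated data with pandas. It assumes a DataFrame with hypothetical columns 'date', 'control' and 'treatment' (metrics already summed per group per day) and an ISO-formatted launch date; Baldur’s actual statistics backend is more involved than this.

```python
import pandas as pd

def did_estimate(daily: pd.DataFrame, launch_date: str) -> float:
    """Return the difference-in-difference estimate from daily aggregated data.

    `daily` is assumed to have one row per day with columns 'date'
    (ISO strings or datetimes), 'control' and 'treatment'.
    """
    gap = daily["treatment"] - daily["control"]     # daily difference
    pre = gap[daily["date"] < launch_date]          # pre-period differences
    post = gap[daily["date"] >= launch_date]        # post-period differences

    # The post-period counterfactual is the control trend plus the average
    # pre-period gap, so the DiD reduces to comparing the two mean differences.
    return post.mean() - pre.mean()
```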

Fig.5. Frequency histogram of the daily differences from Fig.4 in the pre-period (blue) vs. post-period (green). As in an A/B test for metric averages, it is necessary to run the experiment long enough that we arrive at clear distributions for both the treatment and counterfactual differences. Like an A/B test, we also use a t-test, p-values and statistical power to compare them.

As briefly described earlier, to create an experiment that can be measured using DiD we take a set of experiment units (e.g. DMAs), randomly group them together, sum the metrics by day for each group and then plot the time-series for each. The key issue then stems from point 3 in the previous section: as long as we have a reasonable number of DMAs (>50) we can keep the allocation into treatment and control independent of other confounding variables via random bucketing, but how can we then ensure that we meet the assumption of parallel trends when we have so few experiment units? In a user-level A/B test, balance occurs naturally once we have enough user volume, but in marketing experiments we have relatively few units and thus cannot rely on this. As observed in point 1 in the previous section, randomly picking a treatment group consisting of DMAs that represent the vast majority of the most populous and wealthy regions means we will run a meaningless experiment. Fundamentally, we are very unlikely to pick control and treatment groups that meet the parallel trends assumption on our first attempt.

For this reason we repeat the randomization many times, each time computing a range of statistics that measure how closely the split meets the parallel trends assumption. Once complete, we rank the splits and pick the best one. In this way, the ultimate goal of our tool is to search for a “split” that separates our data into two time-series that meet the parallel trends assumption as closely as possible, allowing us to relax the requirement to control for confounders.

Automating marketing experimentation with Baldur

To enable our stakeholders to create marketing experiments that can be measured using DiD, our tool guides users through the following configuration steps:

1. Add useful metadata:

2. Write a SQL query that builds a table with the following columns:

  • ID, representing the experiment unit, e.g. a search or group of searches.
  • Date, containing one row of daily data for every ID.
  • Metrics, containing a column for each metric of interest (e.g. clicks, cost, revenue). A minimal illustration of the expected table is sketched after the configuration steps below.

3. Add the metadata for the experiment including:

  • Pre-period start date, the date from which to look for similar time-series
  • Experiment unit ID
  • Primary, Secondary and Guardrail metrics
  • Number of splits to create when searching for a split that meets the parallel trends assumption
  • The number of arms (variants) in the experiment and their names
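For illustration, the query output might look something like the hypothetical table below. The unit IDs, dates, metric names and values are invented for the example; real experiments use whatever units and metrics the stakeholder configures.

```python
import pandas as pd

# Hypothetical shape of the table the configuration query should return:
# one row per experiment unit per day, plus one column per metric.
example_table = pd.DataFrame({
    "id":      ["nyc", "nyc", "boston", "boston"],     # experiment unit ID
    "date":    ["2024-10-01", "2024-10-02", "2024-10-01", "2024-10-02"],
    "clicks":  [1200, 1350, 400, 380],
    "cost":    [310.5, 342.0, 95.2, 90.1],
    "revenue": [980.0, 1010.5, 300.4, 285.9],
})
```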

Once the experiment is configured, the tool executes the query, retrieves the resultant table and completes some pre-processing steps. It then:

  1. Creates a random split using stratified random sampling, which ensures that IDs are sampled evenly across the distribution of the source sample. This reduces the sampling error that can occur when a plain random sample misses important datapoints, giving an inaccurate representation of the original source data. For a simple experiment with two variants, the data is separated in this way into a control and a treatment group, with 50% of the IDs randomly assigned to the control and 50% to the treatment.
  2. Each group is aggregated by day to create a time-series for the control and the treatment.
  3. The time-series for the split are compared using a number of different metrics, most importantly Pearson’s correlation, mean absolute percentage error (MAPE) and “drift”. Drift uses a regression to measure the absolute change between treatment and control in a given metric over time. Generally, a drift of zero and a Pearson’s correlation of one would mean that the two time-series have perfectly parallel trends. We prefer lower values of MAPE because larger differences between time-series lead to a skew in the difference distribution: the variant with the higher starting value goes up by a larger absolute amount for the same percentage increase, and vice versa if it is lower, which will overestimate or underestimate the impact. (A simplified sketch of this split-and-score loop is shown after this list.)
  4. Steps 1–3 are repeated according to the number of splits requested by the user during experiment configuration, and the scores are appended to a table.
  5. Once all requested splits are complete, the results are compared using a ranking algorithm to identify the best split.
  6. A range of plots are output to the user, which they can use to evaluate the quality of the split they have created. If they aren’t satisfied, they can adjust their configuration, for example by modifying their query or other parameters, and then repeat steps 1–6.
  7. The table for the best split is output. It simply contains the variant assignment: an ID column plus a treatment column recording, as an integer, which variant each ID belongs to. For a simple experiment with a control and treatment, this column contains 0 for control and 1 for treatment.
  8. On submission, the table is joined to the original metric table with trends and results added to a dashboard on a daily basis.
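To make steps 1–5 more concrete, here is a heavily simplified sketch of the split-and-score loop. The stratification (sorting units by volume and sending one member of each neighbouring pair to the treatment arm), the scoring formulas and the naive rank combination are all illustrative assumptions, not Baldur’s actual implementation.

```python
import numpy as np
import pandas as pd
from scipy import stats

def score_split(daily: pd.DataFrame) -> dict:
    """Score how closely a candidate split meets the parallel trends assumption."""
    gap = daily["treatment"] - daily["control"]
    corr = daily["treatment"].corr(daily["control"])           # Pearson's r
    mape = (gap.abs() / daily["control"]).mean()               # rough MAPE
    drift = stats.linregress(np.arange(len(gap)), gap).slope   # trend in the gap
    return {"corr": corr, "mape": mape, "drift": abs(drift)}

def search_splits(df: pd.DataFrame, metric: str, n_splits: int = 500,
                  seed: int = 0) -> pd.DataFrame:
    """Repeatedly bucket units at random, score each split and rank the results.

    `df` has one row per unit per day with columns 'id', 'date' and `metric`.
    """
    rng = np.random.default_rng(seed)
    # Crude stratification: order units by total volume, pair neighbours,
    # then send one unit from each pair to the treatment arm.
    ordered = df.groupby("id")[metric].sum().sort_values().index.to_numpy()
    pairs = np.array_split(ordered, max(len(ordered) // 2, 1))

    scores = []
    for split in range(n_splits):
        treat_ids = {rng.choice(pair) for pair in pairs}
        arm = np.where(df["id"].isin(treat_ids), "treatment", "control")
        daily = (df.assign(arm=arm)
                   .pivot_table(index="date", columns="arm",
                                values=metric, aggfunc="sum"))
        scores.append({"split": split, **score_split(daily)})

    ranked = pd.DataFrame(scores)
    # Naive ranking: favour high correlation, low MAPE and low drift.
    ranked["rank"] = (ranked["corr"].rank(ascending=False)
                      + ranked["mape"].rank() + ranked["drift"].rank())
    return ranked.sort_values("rank")
```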

Once the experiment starts we begin to calculate the results, which we do by analyzing the resultant distributions as in Fig.5 and running a t-test on the DiD to get our p-value. Our daily results dashboard ends up looking like Fig.6.
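As a simplified stand-in for that measurement step, the sketch below runs a Welch’s t-test on the daily pre- vs. post-period differences, i.e. the two distributions in the histogram above. Baldur’s backend also handles details such as statistical power that are omitted here.

```python
import numpy as np
from scipy import stats

def did_significance(pre_diffs: np.ndarray, post_diffs: np.ndarray,
                     alpha: float = 0.05) -> dict:
    """Compare daily (treatment - control) differences before and after launch."""
    result = stats.ttest_ind(post_diffs, pre_diffs, equal_var=False)  # Welch's t-test
    return {
        "did": post_diffs.mean() - pre_diffs.mean(),  # the difference-in-difference
        "p_value": result.pvalue,
        "significant": result.pvalue < alpha,
    }
```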

Fig.6. A snapshot of part of our results dashboard, which appears once the measurement start date has passed. This is plotted alongside the time-series trends for the experiment in the same format as Fig.3 and Fig.4.

Conclusion

With a small amount of additional training on how to use Baldur effectively, our stakeholders have been empowered to create and measure their own marketing experiments in a reliable and robust manner, meaning that Tripadvisor is learning faster than ever. Baldur also continues to support a growing range of marketing experiments with different design demands. A key project for 2025 is the more effective combination of our SEO and marketing platforms so that we have a holistic view of our marketing experiments.

There are many other interesting things going on in the background that we haven’t covered in this post, but we hope that this has given you a high-level impression of our marketing experimentation platform. Further improvements are added to Baldur all the time, and we hope to post an update on these and other developments later in the year.

The Marketing DS Team (home of Causal Inference at TA) in close collaboration with Platform Engineering and Analytics at Tripadvisor supports a wide range of other related projects. Please contact us if you have any questions related to this article or want to find out more about Causal Inference at Tripadvisor!
