Data Science Collective

Advice, insights, and ideas from the Medium data science community

Propensity-Score Matching (PSM) Using Machine Learning in SQL


Causal Relationships

As data engineers, we often find ourselves analyzing web events, store transactions, or kiosk interactions to answer the fundamental question: “What has happened?” These time-series data sources help us generate performance indicators — whether for a specific product line or the business as a whole. We look for patterns in customer behavior, such as products added to the cart but not purchased, or items frequently returned, and try to correlate them with outcomes.

But should we stop at retrospective analysis? Instead of merely observing past performance, shouldn’t we be asking:

  • Why is this happening?
  • Are we unknowingly introducing selection bias?
  • How can we continuously adapt our insights to improve decision-making?

Moving Beyond Observation: Finding Causal Relationships

Performance indicators tell a story, but true insight comes from understanding the relationships between them. That’s where Propensity-Score Matching (PSM) comes in. PSM enables data engineers to analyze real-world, non-randomized data and uncover potential causal relationships.

In traditional A/B testing, we would set up controls and test groups to measure the impact of changes. However, in e-commerce and real-world settings, this approach is often impractical. Conducting controlled experiments at scale is costly, time-consuming, and by the time results arrive, they may already be outdated.

By leveraging PSM, we can extract meaningful insights from customer interactions — without relying on expensive, slow-moving experiments. Instead, we will align similar customer groups based on key attributes and analyze their behavior, helping us make data-driven decisions faster.

Applying PSM at Scale with Adobe Experience Platform — Machine Learning SQL Extension

In standard data science pipelines, data manipulation and modeling are typically performed using Python libraries such as pandas (for structured data operations), numpy (for numerical computing), and scikit-learn (for machine learning tasks including classification and regression). For large-scale data processing and distributed computation, frameworks like Apache Spark are employed, often necessitating cluster orchestration, resource management, and bespoke configuration to ensure optimal performance.

On Adobe Experience Platform, the Data Distiller Machine Learning SQL extension eliminates the need for external libraries and distributed system management. It enables native, in-platform execution of machine learning workflows through an extended SQL interface that supports distributed feature processing, model training, and inference directly against large-scale datasets. This approach preserves data locality, minimizes data movement, and enforces consistent access control, security, and governance policies as defined within the platform. By leveraging the native compute layer, users can execute machine learning pipelines at scale, within a governed environment, and without introducing additional operational overhead.

SQL-based feature engineering using Data Distiller

  • Our Event Source:

Our event source data is the raw data from which we extract our features. The dataset is expected to be high volume, with highly nested fields (struct, array, map) representing different aspects of user events. The event data in our example is an e-commerce dataset with the following characteristics:

Schema of our event source dataset
Column metadata sourced for feature query
  • Sample Data: a view of the sample data relevant for feature engineering.
  • Feature Engineering: features to track customer purchase behavior metrics and a customer activity flag (activity = 0 if the customer has not purchased within the last 45 days, 1 otherwise). Given the source, let's extract the relevant purchase metrics for our feature engineering using the SQL capability provided by Data Distiller in Adobe Experience Platform; a sketch of such a query follows below.
Feature query in SQL
Features used in Logistic Regression
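As a minimal sketch, assuming an event table named ecommerce_events with the columns shown (the table name, column names, and the Spark SQL-style date functions are all illustrative assumptions), the feature query could look like:

SELECT
  userid,
  SUM(order_total)                         AS total_revenue,            -- total spend per customer
  COUNT(DISTINCT order_id)                 AS num_purchases,            -- completed orders
  AVG(order_total)                         AS avg_order_value,          -- average basket size
  DATEDIFF(CURRENT_DATE, MAX(event_date))  AS days_since_last_purchase,
  CASE WHEN MAX(event_date) >= DATE_SUB(CURRENT_DATE, 45)
       THEN 1 ELSE 0 END                   AS activity                  -- 1 = purchased within the last 45 days
FROM ecommerce_events
WHERE event_type = 'purchase'
GROUP BY userid;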

Data Distiller Machine Learning SQL extension for Modeling & Prediction

Step 1: Model Training

A logistic regression can be trained on labeled data to predict the probability of conversion or churn. The logistic regression equation incorporating the given features would be:
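With the features above, the standard logistic form is (coefficient symbols and the exact feature list are illustrative):

$$P(\text{activity}=1 \mid x) = \frac{1}{1+e^{-z}}, \qquad z = \beta_0 + \beta_1\,\text{total\_revenue} + \beta_2\,\text{num\_purchases} + \beta_3\,\text{avg\_order\_value} + \beta_4\,\text{days\_since\_last\_purchase}$$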

As noted above, a data scientist would typically reach for pandas, numpy, and scikit-learn for regressions, plus Apache Spark (with its own cluster configuration) for data at scale. On Adobe Experience Platform, the Data Distiller Machine Learning SQL extension removes that setup: the data never leaves the platform, and the same SQL extension with built-in machine learning capability handles access, processing, training, and inference on distributed data at scale, under the platform's existing governance and access model, in a trusted environment.

Let’s dive into the details and build a propensity-to-purchase (or churn) model using logistic regression in SQL.

The ML-SQL extension provides a “CREATE MODEL” SQL operator to perform feature engineering and associate the features with the model (in our case, logistic regression). On executing the CREATE MODEL statement, the model is trained with the data provided and becomes available for evaluation and further iteration as needed. Using the event source data, we will build an instance of logistic regression called porpensity_purchase_behavior_log_reg, along the lines sketched below.
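A minimal sketch of such a statement, assuming the feature table purchase_features from the earlier step; the TRANSFORM clause, the vector_assembler transformer, and the option names follow the general shape of the Data Distiller CREATE MODEL syntax, but the exact names should be checked against the product documentation:

CREATE MODEL porpensity_purchase_behavior_log_reg
TRANSFORM (
  vector_assembler(array(total_revenue, num_purchases,
                         avg_order_value, days_since_last_purchase)) AS features
)
OPTIONS (MODEL_TYPE = 'logistic_reg', LABEL = 'activity')    -- label column from feature engineering
AS SELECT total_revenue, num_purchases, avg_order_value,
          days_since_last_purchase, activity
   FROM purchase_features;    -- restrict to the training time partition here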

Step 2: Model Evaluation

The Machine Learning SQL extension provides the model_evaluate function to assess the accuracy and effectiveness of the model before deploying it for predictions. The evaluation uses a dataset with the same structure as the training data to generate accuracy metrics, which lets us measure the model's predictive performance on the validation data.

We used a new time partition to capture test data for our evaluation.
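A sketch of the evaluation call, assuming a held-out table purchase_features_test and a model version of 1 (the table name and the argument order are assumptions; consult the Data Distiller reference for the exact signature):

SELECT *
FROM model_evaluate(
  porpensity_purchase_behavior_log_reg,    -- trained model
  1,                                       -- model version (assumed)
  SELECT total_revenue, num_purchases, avg_order_value,
         days_since_last_purchase, activity
  FROM purchase_features_test              -- new time partition used as test data
);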

Step 3: Model Prediction

After evaluation, let’s use the model to predict customer propensity to purchase.
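A sketch of the scoring call, mirroring the evaluation step (the table name purchase_features_current and the version argument are assumptions). The result set, persisted as the psm_data dataset used below, typically carries the predicted label and probability alongside the input features:

SELECT *
FROM model_predict(
  porpensity_purchase_behavior_log_reg,    -- trained model
  1,                                       -- model version (assumed)
  SELECT userid, total_revenue, num_purchases,
         avg_order_value, days_since_last_purchase
  FROM purchase_features_current           -- customers to score
);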

And the predicted output for propensity to purchase

Step 4: PSM using nearest neighbor matching

In evaluating the impact of an intervention (e.g., a marketing campaign, product feature, or loyalty program), we need to compare treated customers (activity = 0) to control customers (activity = 1). However, since these groups may have different characteristics, we use PSM to create a fair comparison by pairing similar customers.

Query to implement Nearest Neighbor Matching (NNM) approach within a Propensity Score Matching (PSM) framework in SQL

For our analysis, the query randomly picks 10,000 customers from each group (treated and control) from psm_data, where psm_data is the dataset generated as the output of model_predict.

Query for Nearest Neighbor Matching:

We implemented Nearest Neighbor Matching (NNM) with the following characteristics (a sketch of the full query follows the step-by-step breakdown below):

  • Used a composite score (normalized Euclidean distance) between the treated and control groups.
  • Normalized features using their standard deviation (STDDEV), preventing one feature from dominating due to scale differences.
  • Minimized bias by balancing features, i.e., all features contribute to the distance.
  • Ensured matching based on overall similarity, not just one feature.
  • Used the best match based on the lowest distance rather than an arbitrary ranking order.

Random Sampling of Customers (Control & Treated Groups)

  • Extracts 10,000 random customers each from activity = 0 (treated group) and activity = 1 (control group).

Cross Join (Potential Matches for Each Treated Customer)

  • Every treated customer is paired with all control customers (creating a Cartesian product).
  • This allows us to compute distances between treated and control customers.

Nearest Neighbor Matching — Find the closest control match

  • Uses ROW_NUMBER() to rank the control customers for each treated customer.
  • The closest control match (smallest distance) is ranked 1 and selected.
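Putting those steps together, a minimal sketch of the matching query; the feature columns and the probability column produced by model_predict are assumptions, and the random-sampling function (rand()) varies by SQL dialect:

WITH treated AS (
  SELECT * FROM psm_data WHERE activity = 0 ORDER BY rand() LIMIT 10000
),
control AS (
  SELECT * FROM psm_data WHERE activity = 1 ORDER BY rand() LIMIT 10000
),
stats AS (
  SELECT STDDEV(total_revenue)   AS sd_rev,     -- feature scales for normalization
         STDDEV(num_purchases)   AS sd_pur,
         STDDEV(avg_order_value) AS sd_aov,
         STDDEV(probability)     AS sd_score
  FROM psm_data
),
pairs AS (
  SELECT t.userid AS treated_id,
         c.userid AS control_id,
         SQRT(
           POWER((t.total_revenue   - c.total_revenue)   / s.sd_rev,   2) +
           POWER((t.num_purchases   - c.num_purchases)   / s.sd_pur,   2) +
           POWER((t.avg_order_value - c.avg_order_value) / s.sd_aov,   2) +
           POWER((t.probability     - c.probability)     / s.sd_score, 2)
         ) AS distance                          -- normalized Euclidean distance
  FROM treated t
  CROSS JOIN control c                          -- every treated x every control customer
  CROSS JOIN stats s
)
SELECT treated_id, control_id, distance
FROM (
  SELECT p.*,
         ROW_NUMBER() OVER (PARTITION BY p.treated_id ORDER BY p.distance) AS rnk
  FROM pairs p
) ranked
WHERE rnk = 1;                                  -- keep the closest control match per treated customer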
Sample output of paired customers

Observations:

  • Matched pairs have very similar customer behaviors in terms of revenue, purchases, and order values.
  • The only key difference is “days since last purchase”, which naturally separates treated and control groups.
  • Distance is small, meaning the control group customers are close matches to the treated customers.

We now have a list of comparable customers from whom we missed realizing revenue within the targeted 45-day window.

In line with the goal of PSM, which is to make observational data behave like an experiment, we have demonstrated an approach that matches treated customers (activity = 0) with the most similar control customers (activity = 1) based on propensity scores and other features. This method ensures a fair comparison by reducing selection bias, making the analysis more reliable. After matching, outcomes (e.g., revenue, engagement, recent purchase activity) for the matched customers can be used to measure the true effect on engagement, retention, or revenue.
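As a closing sketch, assuming the matched pairs above are stored in a matched_pairs table and post-period outcomes in customer_outcomes (both names are illustrative), the average effect on an outcome such as revenue can be estimated as the mean within-pair difference:

SELECT AVG(t.post_revenue - c.post_revenue) AS avg_treatment_effect   -- treated minus matched control
FROM matched_pairs m
JOIN customer_outcomes t ON t.userid = m.treated_id   -- treated: activity = 0
JOIN customer_outcomes c ON c.userid = m.control_id;  -- matched control: activity = 1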

Summary: Data Engineers — The New Data Scientists

As the lines between data engineering and data science continue to blur, data engineers are increasingly taking on responsibilities traditionally associated with data scientists. With advanced tools, automation, and ML-SQL integrations, they are not just preparing features, but are actively building, deploying, and optimizing machine learning models. The future of data-driven decision-making is no longer just about science — it’s about engineering intelligence at scale.

Big shoutout to the Data Distiller team for driving this innovation and making machine learning more accessible for all within SQL-based workflows!

