ADEO Tech Blog

Discover tech life at ADEO through our experts’ stories and our products. This is how we act to make home a positive place to live.

Digital twin for operational resilience


Imagine you could start a business or a project over and over, given the same initial conditions. Suppose you make the same decisions along the way: do you think the outcomes would always be perfectly identical? You may be tempted to answer yes, that success and failure are plainly deterministic!

If, like me, you are a bit more cautious and curious about the alternative courses of events that randomness could have produced, what follows is for you. Let me explain why we built a digital twin at Leroy Merlin, and how.

Taming randomness

It all began the day I started my new position as a Data Scientist in late October 2020. I was given a customer relationship management objective: halve the average waiting time before an incoming client call is answered.

My Data Scientist colleagues and I began by discussing the process with customer relation center managers. We learned that call distribution at Leroy Merlin is a complex system. A client may be answered by an employee of the store they called or by an agent in a contact center: during distribution, incoming calls may switch from one waiting queue to another depending on management parameters.

[Figure 1: multi-queued calls distribution with switches]

So, part of the mission is to anticipate the volume of calls in the coming days and to adapt the number of agents assigned to each queue. We quickly realised that for incoming phone calls, a great deal of the work was about taming randomness! Let me explain: a customer’s decision to make a phone call is the result of many variables. Some of them are predictable (e.g. sales, new products, changed opening hours) but others are not (e.g. product failures, shortages). In the end, we only observe a black box that either outputs calls or does not.

Signal versus Noise

Sure, when you take a quick glance at the company’s national statistics, the distribution of incoming calls on one day looks much the same as the previous day’s (thank the law of large numbers for that). But on a narrower scale, typically a single store, the trend is drowned in noise.

[Figure 2: one store’s incoming calls per 10 minutes (top) vs. national (bottom), a Tuesday of 2019 compared with surrounding days]

This level of noise in the core business should trigger skepticism. When I investigated stores’ past performance indicators, I always kept in mind that they may not be the right insights to assess whether a resource distribution strategy was well suited or not. On any given day, out of more than a hundred stores, some might have performed better than expected and others worse.

Inferring properly from data

Another issue rapidly arose when we tried to answer managers’ questions: how do we infer the improvement of performance indicators from a given change in resources? Let’s assume a 10% increase in agent availability: does this result in a 10% drop in waiting time?

Obviously, the dynamic system we observe here is more complex than mere linear relations, and trying to achieve ambitious performance targets may turn out to be far more expensive than expected.

So, as physicists sometimes do with complex dynamic systems, we decided to build a digital twin of the incoming call distribution. We relied on two hypotheses:

  • This system is just a sophisticated task allocation one: if we input the load and the resources of a given period, we should obtain performance indicators similar to the real ones.
  • By introducing randomness, we can explore different courses of events and assess which outcomes are most likely.

It is all about task allocation

Let’s see how we built such a digital twin! First, the paradigm: to best represent the call distribution, we chose to model it as a discrete-event dynamic system with a one-second increment. Objects in the system evolve in response to events and their current state. We identified three types of objects in our system: CALLS, AGENTS and CENTERS. Instances of these objects are linked to one another to mirror the real call distribution system: a call belongs to at least one queue, while an agent is in charge of only one center at a time.
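As a minimal Python sketch, the three object types and their links could look like this (the attribute names are illustrative assumptions, not the production model):

```python
# Illustrative sketch of the three object types of the discrete-event
# system; names and attributes are assumptions, not the production code.
from dataclasses import dataclass, field

@dataclass
class Call:
    date_start: int        # second at which the call enters the system
    patience: float        # maximum seconds the client will wait
    center_ids: list       # queues this call may be routed through
    state: str = "WAITING"

@dataclass
class Agent:
    agent_id: str
    center_id: str         # an agent serves one center at a time
    busy_until: int = 0    # second at which the agent becomes free again

@dataclass
class Center:
    center_id: str
    queue: list = field(default_factory=list)  # pending Call objects
```

Each simulated second, the engine would scan the centers, match free agents with the call at the top of their queue, and update object states accordingly.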

[Figure 3: model of the multi-queued distribution system]

As time goes by, available agents answer the call at the top of their center’s queue while new calls fill the queues (not necessarily from the bottom; it depends on priority rules). We now face several questions to answer before running proper simulations:

  • When are new calls generated?
  • Do clients wait endlessly before being answered?
  • How long does a conversation take?

Three attributes to characterize a call

To generate new calls in a center, we rely on its expected number of calls within an hour. Say we are simulating a Tuesday: from 8AM to 9AM, we expect 36 incoming calls. Thus, for each second within this hour, the probability of generating a call is 36/(60*60) = 0.01. On average we obtain 36 calls, sometimes fewer and sometimes more. This is the first means of introducing randomness into our model. For the expected number of calls, we can use past data for retrospective studies, or forecasts for future task allocation.
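The per-second rule can be sketched as one Bernoulli draw per second of the hour (`generate_calls` is a hypothetical helper written for this post, not the production code):

```python
import numpy as np

def generate_calls(expected_per_hour, rng):
    """One Bernoulli draw per second of the hour: a call is generated
    with probability expected_per_hour / 3600 (e.g. 36/3600 = 0.01)."""
    p = expected_per_hour / 3600
    return rng.random(3600) < p  # boolean array, True = a call this second

rng = np.random.default_rng(2020)
arrivals = generate_calls(36, rng)
# arrivals.sum() fluctuates around 36 from one run to the next
```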

To each of these calls, we assign a “patience”, which represents the maximum waiting time: beyond it, the call is abandoned. Once again, this patience is randomly drawn from a probability distribution. We obtained this distribution by isolating periods where no calls at all were answered (because of technical issues or a lack of agents). Notably, our data is best modeled by a log-normal distribution, contrary to the related literature, which broadly uses the exponential distribution.

In the same way, we modeled conversation durations based on our past data. Because we noticed behavior differences, we chose to have one distribution per call center. Thus, when a call is picked up in a simulation, its conversation duration is drawn from a distribution depending on its center.
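Both draws can be sketched with NumPy’s log-normal sampler; the parameters and center names below are placeholders, not the values fitted on our data:

```python
import numpy as np

def draw_patience(rng, mu=4.0, sigma=1.0):
    # Log-normal patience, as fitted on no-answer periods; mu and sigma
    # here are placeholder values, not the real fitted parameters.
    return rng.lognormal(mean=mu, sigma=sigma)

# One talk-time distribution per center (placeholder parameters too).
TALK_PARAMS = {"store_1": (5.2, 0.8), "contact_center_1": (5.6, 0.7)}

def draw_talk_time(center_id, rng):
    # When a call is picked up, its conversation duration is drawn
    # from the distribution of the center that answered it.
    mu, sigma = TALK_PARAMS[center_id]
    return rng.lognormal(mean=mu, sigma=sigma)
```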

On the agents’ side, things are a lot simpler: we input, for each of them, a record of logins and logouts. Whenever they are logged in and free, they answer the call at the top of their center’s queue. We added slight refinements to best match reality: a mandatory rest time between two calls and time allocated to after-call work (once again modeled on past data, but you get the point).

Event-triggered states: defining the flowchart

With these rules in place, we are only left to describe how objects switch from one state to another. Let’s recall the events of an incoming call and its associated variables: it is generated at DATE_START with a maximum waiting time of PATIENCE. If it is answered by an agent, the client has waited DISTRIBUTION_TIME and the conversation will last TALK_TIME. With these in mind, the state flowchart is plain logic. A similar flowchart describes agent states as well.

[Figure 4: call states flowchart]
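The call side of the flowchart can be read as a pure function of the current second and the call’s variables; this is a simplified sketch of that logic, not the production implementation:

```python
def call_state(t, date_start, patience, distribution_time, talk_time):
    """State of a call at second t. distribution_time is None while the
    call has not been answered (talk_time is then unused)."""
    if t < date_start:
        return "NOT_GENERATED"
    if distribution_time is None:
        # never answered: abandoned once patience runs out
        return "WAITING" if t < date_start + patience else "ABANDONED"
    if t < date_start + distribution_time:
        return "WAITING"
    if t < date_start + distribution_time + talk_time:
        return "TALKING"
    return "FINISHED"
```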

The show can now begin. Second after second, in more than a hundred centers, calls are generated, answered or abandoned while agents connect and disconnect. We keep track of the simulation by feeding a database table built exactly like the one containing the real data, so comparisons between reality and simulation are easy to implement and to grasp.

A last word about implementation

Let’s finish this technical part with a few implementation and execution details. We did not rely on queuing-theory libraries because, at one point or another, they all lacked features essential to us: dynamic resources, queue switching, and so on. So we designed our own engine in Python, using common libraries: collections, datetime, numpy.

Parallelism was easy thanks to one of our hypotheses: calls are independent. What happens on a given day has no influence on the others, so we can simply break the simulated period into smaller ones and run them simultaneously. It takes about 15 minutes to simulate a year for one store (an average of 50k calls) on a single thread.
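That splitting can be sketched as follows; `simulate_day` is a stand-in for the real one-day simulation, and since the real workload is CPU-bound a process pool would be the natural production choice (a thread pool keeps the sketch simple):

```python
from concurrent.futures import ThreadPoolExecutor

def simulate_day(day):
    # Stand-in for the real one-day simulation: days are independent,
    # so each one can run on its own worker.
    return {"day": day, "answered": 0, "abandoned": 0}

def simulate_period(days, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves the input order of the days
        return list(pool.map(simulate_day, days))
```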

A versatile sandbox

We have recently finished tuning our digital twin through retrospective simulations: over a given period, say a year, the genuine performance indicators lie within the range of the simulated ones. Because each period is simulated about a hundred times, we can now measure our resilience to randomness and the spread of possible outcomes.
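Measuring that spread can be as simple as taking percentile bounds over the repeated runs; the synthetic waiting times and the 90% level below are illustrative, not our actual figures:

```python
import numpy as np

def outcome_spread(simulated_kpis, level=0.9):
    """(low, high) percentile bounds covering `level` of the runs."""
    lo = (1 - level) / 2 * 100
    return tuple(np.percentile(simulated_kpis, [lo, 100 - lo]))

# e.g. average waiting times (seconds) from ~100 simulated runs
rng = np.random.default_rng(0)
waits = rng.normal(90, 10, size=100)
low, high = outcome_spread(waits)
# the real indicator should lie within [low, high] most of the time
```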

Our immediate next step is to offer stores a resource allocation decision-making tool. By plugging our digital twin into call forecasting, we can assess the staff needed to meet a target performance with a given level of confidence. Even if an unpredictable disruption occurs (e.g. a lockdown) and causes an unprecedented number of calls distributed throughout the day in a new fashion, we can adapt our strategy in real time.

In the long run, we will simulate various scenarios and quantify the gains in performance, customer satisfaction and cost. We could experiment with alternative systems such as shared queues, call priorities or specialized queues, or alter call properties, for example by increasing customers’ patience or decreasing conversation durations.

From a personal standpoint, I am thrilled to have taken part in this project. It was my first mission in my first job, and I was immediately given the opportunity to go from data exploration to concrete operational applications. Throughout these months of work, our operations colleagues have been very helpful and enthusiastic about the improvements to come. We can all be proud of the versatile tool we have built to best serve our customers.

Published in ADEO Tech Blog

Written by Antoine Dechappe

Data Scientist at Leroy Merlin FR
