
Building an Uber Like Cab Service

Manish Sombansh
9 min read · Jun 12, 2020


Leveraging the power of stream processing using Apache Kafka & Samza

Node.js trip visualization

Scenario

In a fictional world, NYCabs is an upcoming private cab/ride-share app. I was hired to implement the core part of the service: matching client requests to available drivers and displaying advertisements. Cab-hailing apps like Uber have each driver send a position update roughly every 5 seconds, which forms a large stream of data. After matching a client to a driver, the company chooses an advertisement to display to that client.

This part required some static profile data about the client as well as information about local businesses. The company could also infer a client's interests from their phone's browsing data. Combining all this information lets the company build a decision-making system for advertising.

I decided to use the stream-processing model of computation since it fits this use case very well.

What did I use?

  • Java
  • Apache Kafka & Samza
  • AWS EMR

Creating Streams using Kafka Producer

In this task I wrote Kafka producer code to create streams from a trace file provided to us. The task simulates the scenario where drivers update their locations regularly as they move through the city and client ride requests arrive over time. I developed a Java program that sends messages to Kafka topics on a remote Samza EMR cluster that I created on AWS.

Next, I created two streams named driver-locations and events using the shell command, then read the trace file and sent the trace records to the topics using the Java API. Each line in the trace file is a JSON string containing various fields. The fields can differ between stream types, but each line is guaranteed to have two fields: type and blockId. I routed each record to one of the two streams based on the value of type, and partitioned the streams on the value of blockId.
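The routing and partitioning step can be sketched as below. This is a minimal illustration, not the project's actual code: the class name TraceRouter and the partition count are my assumptions, and in a real producer the chosen topic and partition would be passed to Kafka's ProducerRecord before calling send().

```java
// Illustrative sketch: route a trace record to a topic by its "type" field
// and to a partition by its "blockId". Names and constants are assumptions.
public class TraceRouter {
    static final int NUM_PARTITIONS = 5; // assumed partition count for both topics

    // DRIVER_LOCATION records go to driver-locations; everything else to events.
    static String topicFor(String type) {
        return "DRIVER_LOCATION".equals(type) ? "driver-locations" : "events";
    }

    // Partition on blockId so all records for one block land in one partition,
    // which keeps per-block state local to a single consumer task.
    static int partitionFor(int blockId) {
        return Math.abs(Integer.hashCode(blockId)) % NUM_PARTITIONS;
    }
}
```

In the real program, each JSON line would be parsed, routed with these two functions, and sent via a KafkaProducer.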

driver-locations stream

This is a stream of driver locations as the drivers move through the city. I needed to publish this stream in Kafka under the topic driver-locations.

Each JSON string contains the following fields:

  • blockId: the block where the driver is currently moving. This is similar to a city block/neighborhood. A block can have multiple drivers. The stream is partitioned on this field.
  • driverId: unique identifier of the driver.
  • type: is set to DRIVER_LOCATION for this particular stream.
  • latitude, longitude: within a block a driver will be at a particular latitude and longitude. This latitude and longitude is what you will consider as a part of the match score to find a driver for a given client request.
  • example: {"driverId":131,"blockId":3214,"latitude":40.7519871,"longitude":-74.0047584,"type":"DRIVER_LOCATION"}

events stream

This is a stream of events separate from the driver location updates. It includes events from both clients and drivers and had to be published in Kafka under the topic events.

Each JSON string contains the following fields:

  • blockId: the block where the user (driver or rider) is currently present. The stream is partitioned on this field.
  • clientId/driverId: unique identifier of the driver or client. This will be clientId if the type of the event is RIDE_REQUEST. It will be driverId in all other cases.
  • latitude, longitude: within a block, a driver is at a particular latitude and longitude, so these two attributes appear in ENTERING_BLOCK, LEAVING_BLOCK and RIDE_COMPLETE events. A client also has a latitude and longitude when requesting a ride, so these attributes appear in RIDE_REQUEST events as well.
  • gender: this attribute specifies the gender of the driver. In this project, the gender of a driver is either “M” for male or “F” for female. You will see this attribute in the events of ENTERING_BLOCK and RIDE_COMPLETE.
  • gender_preference: the counterpart of the attribute gender. In the event of RIDE_REQUEST, a client will come with his or her driver gender_preference ("M" for male, "F" for female and "N" for no preference).
  • rating: each driver has a float number from 0.0 to 5.0 (inclusive) as his or her rating. Rating also contributes to the match score since a client always wants a driver with a high reputation. You can find this attribute in ENTERING_BLOCK and RIDE_COMPLETE.
  • salary: the amount of money this driver has made today. Salary is an integer from 0 to 100 (inclusive) and you can see it in the events of ENTERING_BLOCK and RIDE_COMPLETE.
  • status: This field is valid ONLY if type is LEAVING_BLOCK or ENTERING_BLOCK. It persists the state of driver (free or busy) across blocks. The valid values for this field are AVAILABLE and UNAVAILABLE.
  • user_rating: This field is valid ONLY if type is RIDE_COMPLETE. It is the client's rating for this ride. The driver's rating should be updated to the average of rating and user_rating. In other words, the formula to update rating is (old_rating + user_rating)/2.
  • type:

LEAVING_BLOCK - a driver is moving to a different block or is going offline. This event will come with the blockId of the old block. Use this to update your local state for the old block. For example, if a driver is moving from block 1 to block 2, this event will arrive with the blockId 1.

ENTERING_BLOCK - a driver is logging in or is entering a different block. This event will come with the blockId of the new block. Use this to update your local state for the new block. For example, if a driver is moving from block 1 to block 2, this event will arrive with the blockId 2.

RIDE_REQUEST - a client has requested a ride in a particular block. Find the driver with the highest match score and output that driverId to the output stream.

RIDE_COMPLETE - a ride has completed. It comes with the current location of the driver, meaning this driver is available again in this block.

Some examples:

{"blockId":5647,"driverId":7806,"latitude":40.7901188,"longitude":-73.9747985,"type":"ENTERING_BLOCK","status":"AVAILABLE","rating":2.14,"salary":11,"gender":"F"}

{"blockId":1930,"clientId":6343,"latitude":40.731471,"longitude":-73.9901805,"type":"RIDE_REQUEST","gender_preference":"N"}
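The local-state bookkeeping described above can be sketched as follows. This is an illustrative outline under my own naming (BlockState, DriverState are not from the project): LEAVING_BLOCK removes a driver from the old block, ENTERING_BLOCK adds them to the new one, and RIDE_COMPLETE frees the driver and applies the rating update rule (old_rating + user_rating)/2.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of per-block driver state kept by a consumer task.
public class BlockState {
    static class DriverState {
        double latitude, longitude, rating;
        int salary;
        String gender;
        boolean available;
    }

    // blockId -> (driverId -> driver state)
    final Map<Integer, Map<Integer, DriverState>> blocks = new HashMap<>();

    // ENTERING_BLOCK arrives with the NEW blockId; record the driver there.
    void enteringBlock(int blockId, int driverId, DriverState s, String status) {
        s.available = "AVAILABLE".equals(status);
        blocks.computeIfAbsent(blockId, b -> new HashMap<>()).put(driverId, s);
    }

    // LEAVING_BLOCK arrives with the OLD blockId; drop the driver from it.
    void leavingBlock(int blockId, int driverId) {
        Map<Integer, DriverState> b = blocks.get(blockId);
        if (b != null) b.remove(driverId);
    }

    // RIDE_COMPLETE: driver is free again here; rating becomes the average
    // of the current rating and the client's user_rating.
    void rideComplete(int blockId, int driverId, DriverState s, double userRating) {
        s.rating = (s.rating + userRating) / 2;
        s.available = true;
        blocks.computeIfAbsent(blockId, b -> new HashMap<>()).put(driverId, s);
    }
}
```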

Consume and analyze streams using Samza

Next, I wrote Samza code to consume the two streams generated previously by the Kafka stream generator and output the client/driver pair to the match-stream stream. The driver had to be the one with the highest match score for the client (see the algorithm description below) and within the same block as the client.

Cab matching service. There are 2 candidate drivers in block 1 but driver 1234 has a higher match score. Some fields in the JSON have been omitted.

Formula for the match score

Besides using the distance to find the closest driver for a client, we also took the gender_preference, the driver’s rating, and the driver’s salary into consideration. The weight of each component appears in the match_score formula below.

distance_score: We used the Euclidean distance between the driver and the client. Since for each client we only look for a driver in the same block, we want to convert the distance to a real-valued score between 0 and 1. The following formula is one possible way to do so, where client_driver_distance is the Euclidean distance between the driver’s and the client’s positions.

distance_score = e ^ (-client_driver_distance)

gender_score: Similar to the distance score, we can use a number to measure whether the driver’s gender matches the client’s preference. If the client’s preference equals the driver’s gender, the gender score for this client-driver pair is 1.0; otherwise it is 0.0. For a client whose gender preference is “N”, we assume he or she matches both gender “M” and “F”.

rating_score: A driver’s rating is a real number between 0.0 and 5.0, so we can standardize it by:

rating_score = rating / 5.0

salary_score: The idea here is that NYCabs wants the money earned to be evenly distributed among drivers so that they don’t lose drivers who earn little money every day. When two drivers are available and all other conditions are equal, the driver with the lower salary gets higher priority. In our simple model, every driver can make up to 100 USD per day. However, if the only driver in a block already has a salary above 100, you can still pick that driver to match a ride request in that block. To convert the salary to a measurable score:

salary_score = 1 - salary / 100.0

match_score: now let’s put the distance_score, gender_score, rating_score and salary_score together to calculate a match score based on the weights below. You need to match the driver with the highest match score to each incoming client request.

match_score = distance_score * 0.4 + gender_score * 0.1 + rating_score * 0.3 + salary_score * 0.2
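The four component scores and the weighted sum above can be combined into a small worked example. This is a sketch under my own naming (Matcher and Candidate are illustrative, not the project's types); it also shows the argmax step of picking the best available driver in the client's block.

```java
import java.util.List;

// Illustrative implementation of the match-score formulas from the post.
public class Matcher {
    record Candidate(int driverId, double latitude, double longitude,
                     String gender, double rating, int salary) {}

    static double matchScore(Candidate d, double clientLat, double clientLon,
                             String genderPreference) {
        // distance_score = e^(-client_driver_distance), maps [0, inf) into (0, 1]
        double dist = Math.hypot(d.latitude() - clientLat, d.longitude() - clientLon);
        double distanceScore = Math.exp(-dist);
        // gender_score: 1.0 on a match; preference "N" matches both genders
        double genderScore = "N".equals(genderPreference)
                || genderPreference.equals(d.gender()) ? 1.0 : 0.0;
        double ratingScore = d.rating() / 5.0;      // rating is in [0.0, 5.0]
        double salaryScore = 1 - d.salary() / 100.0; // can go negative past 100 USD
        return 0.4 * distanceScore + 0.1 * genderScore
             + 0.3 * ratingScore + 0.2 * salaryScore;
    }

    // Pick the driver in the client's block with the highest match score.
    static int bestDriver(List<Candidate> inBlock, double lat, double lon, String pref) {
        Candidate best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Candidate c : inBlock) {
            double s = matchScore(c, lat, lon, pref);
            if (s > bestScore) { bestScore = s; best = c; }
        }
        return best.driverId();
    }
}
```

A driver at the client's exact position with a 5.0 rating, a matching gender, and zero salary scores the maximum of 1.0.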

Output stream

match-stream

This stream was the output of my Samza job.

clientId: The Integer id of the client for whom a ride has been generated.

driverId: The Integer id of the free driver with the highest match score for this client, i.e. the driver who is assigned to this client.

Examples:

  • {"clientId":902,"driverId":434}

Targeted Advertising by Consuming and Analyzing Streams and Static Data Using Samza

IoT devices are pervasive; they can be found everywhere, and you can use them to infer users’ interests from their behavior. For example, a wearable device can monitor and infer a user’s current health status. Consider a future high-tech device that can detect a user’s blood sugar, mood, stress level, and how active the user is.

Armed with this timely information, NYCabs wanted us to advertise directly to specific individuals to make sure their target audience sees a suitable and timely advertisement. Previously, we matched a stream of riders with a stream of cars. After the matching process, I was tasked with using the users’ profiles, their recent browsing interests, and their current health status to display suitable and timely advertisements.

To do this, we processed the events stream to match a user with one potential business so that we could decide which advertisement to place in the car. We computed a match score between a user and a business using the distance between them, the user’s age, purchasing power, travel frequency, and interests. The output was the ad-stream.

We wrote Samza code to consume the events stream generated by the load generator developed earlier, and output the client/Yelp business pair to the ad-stream stream. The advertised Yelp business is the one with the highest match score for the client.

What did I learn?

  • Identify real-world scenarios with real-time data streams where stream processing is adopted.
  • Explore and understand the characteristics of IoT stream data.
  • Explain the data flow model in Kafka and Samza.
  • Describe the fault tolerance model of Kafka and Samza.
  • Generate streams of data using Kafka and make them available to a Samza consumer.
  • Practice test-driven development (TDD) with JUnit on Kafka when developing solutions for handling big data.
  • Design and implement a solution to join and process multiple streams of GPS data using the Samza API to set up a driver-matching service like Uber.
  • Design and implement a solution to join and process streams of IoT device data and static data using the Samza API to set up an advertising system.
  • Experiment with deploying a Kafka and Samza stream processing system on the YARN cluster provisioned on the cloud.
  • Experiment with Kafka replication capabilities to see how Kafka handles faults.
  • Manage and monitor Samza jobs on YARN. Explore different debugging techniques when running a Samza job on YARN.
  • Experiment with using Kafka in a frontend application to update the application’s state.

Academic integrity policies restrict me from going into detail about the implementation and our approach to the project. If you would like to learn more, feel free to contact me.

Source for project theory: Carnegie Mellon University, 15619 Cloud Computing
