Danny Ma’s SQL Case Study #2 Pizza Runner

Published in

Dev Genius

12 min read

Apr 12, 2023

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (3)

This article only focuses on solutions to sections A and B of this case study. This is done to make it easier to read. If you would like to check out my solutions to an earlier, you can find it here case study #1.

PostgreSQL v13 will be used for this analysis.

Did you know that over 115 million kilograms of pizza is consumed daily worldwide??? (Well according to Wikipedia anyway…)

Danny was scrolling through his Instagram feed when something really caught his eye — “80s Retro Styling and Pizza Is The Future!”

Danny was sold on the idea, but he knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire — so he had one more genius idea to combine with it — he was going to Uberize it — and so Pizza Runner was launched!

Danny started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Danny’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.

Because Danny had a few years of experience as a data scientist — he was very aware that data collection was going to be critical for his business’ growth.

He has prepared for us an entity relationship diagram of his database design but requires further assistance to clean his data and apply some basic calculations so he can better direct his runners and optimise Pizza Runner’s operations.

All datasets exist within the pizza_runner database schema - be sure to include this reference within your SQL scripts as you start exploring the data and answering the case study questions.

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (4)

You can find the all the details for this challenge including the datasets here.

Let’s view the current state of customer_orders

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (5)

If we inspect the exclusions and extras columns, we see that not every cell has a value. The cells where customers don’t request for exclusions or extras are represented in an inconsistent manner. They either have the Null data type (i.e. [null]), null as a string (i.e. null) or are left blank. This inconsistency can lead to errors when trying to retrieve data in future. So we want to stick with one method of representing cells that have no exclusions or extras. Now, because these columns are varchar columns that contain strings, it makes sense to represent a cell without an exclusion or extra as an empty string.

DROP TABLE IF EXISTS customer_orders_temp;
CREATE TABLE customer_orders_temp AS
 SELECT 
 order_id, 
 customer_id, 
 pizza_id, 
 CASE
 WHEN exclusions is null OR exclusions = ‘null’ THEN ‘’
 ELSE exclusions
 END AS exclusions,
 CASE
 WHEN extras is null OR extras = ‘null’ THEN ‘’
 ELSE extras
 END AS extras,
 order_time
FROM customer_orders;

The idea is to create a temp_table, customer_orders_temp, which would be a copy of our original table, customer_orders. The data cleaning or transformation will be done on the temporary table leaving our original table untouched so we will have a source to fallback to just incase.

This is how our new table looks:

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (6)

Now, let’s proceed to the runner_orders table. There’s work to be done there too.

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (7)

Inspecting runner_orders, we can see a number of issues. pickup_time, distance and duration columns were created with the character varying data type (i.e. varchar) when it should have been timestamp, numeric and integer respectively. So we will need to modify these columns to appropriate data types. But if we tried to do so now, we would get errors because there are some invalid entries. We should not have string(i.e. text) in those columns. We need to replace null(that’s written as a string) with [null] data type. In addition, we have to remove the ‘km’, ‘minute’, ‘minutes’, ‘mins’ from distanceand duration columns. The cancellation column has the same issues with the exclusions and extras columns in customer_orders. Hence, we would apply the same logic we applied to those columns to the cancellation column.

DROP TABLE IF EXISTS runner_orders_temp;
CREATE TEMP TABLE runner_orders_temp AS
 SELECT 
 order_id, 
 runner_id, 
 CASE
 WHEN pickup_time = ‘null’ THEN NULL
 ELSE pickup_time
 END AS pickup_time,
 CASE
 WHEN distance = ‘null’ THEN NULL
 WHEN distance LIKE ‘%km’ THEN TRIM(‘km’ from distance)
 ELSE distance 
 END AS distance,
 CASE
 WHEN duration = ‘null’ THEN NULL
 WHEN duration LIKE ‘%mins’ THEN TRIM(‘mins’ from duration)
 WHEN duration LIKE ‘%minute’ THEN TRIM(‘minute’ from duration)
 WHEN duration LIKE ‘%minutes’ THEN TRIM(‘minutes’ from duration)
 ELSE duration
 END AS duration,
 CASE
 WHEN cancellation IS NULL or cancellation = ‘null’ THEN ‘’
 ELSE cancellation
 END AS cancellation
 FROM runner_orders;

After effecting the above changes, we can now alter the data type as shown below.

ALTER TABLE runner_orders_temp 
ALTER COLUMN pickup_time TYPE timestamp USING pickup_time::timestamp without time zone, 
ALTER COLUMN distance TYPE numeric USING distance::numeric,
ALTER COLUMN duration TYPE integer USING duration::integer;

Let’s view our the current state of new table:

SELECT * FROM runner_orders_temp

We are now ready to analyze our data and get useful insights.

Now that we have cleaned data, we can proceed with our analysis

Q1. How many pizzas were ordered?

SELECT COUNT(pizza_id) AS ordered_pizza 
FROM customer_orders_temp;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (8)

Q2. How many unique customer orders were made?

Use COUNT function with DISTINCT to get the number of unique orders

SELECT COUNT(DISTINCT(order_id)) AS unique_customer_orders 
FROM customer_orders_temp;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (9)

Q3. How many successful orders were delivered by each runner?

If we observe runner_orders_temp table, a successful order has a non-null entry for both distance and duration. So we can use either column to identify a successful order. To get the successful orders for each runner we will filter our results by applying a where distance is not null clause and grouping the orders by runner_id and counting for each runner.

SELECT runner_id, COUNT(runner_id) AS order_count 
FROM runner_orders_temp
WHERE DISTANCE IS NOT NULL
GROUP BY runner_id
ORDER BY order_count DESC;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (10)

Runner 1 has delivered 4 orders.
Runner 2 has delivered 3 orders.
Runner 3 has delivered 1 order.

Q4. How many of each type of pizza was delivered?

While the runner_orders_temp gives information as regards delivered orders, it does not tell us the content of each order. Those details can be found in customer_orders_temp table. Hence, we’ll join these tables.
Apply where distance is not null clause to get only delivered orders, group by pizza name then take a count for each pizza type.

SELECT pizza_name, COUNT(pizza_id) 
FROM runner_orders_temp
JOIN customer_orders_temp
USING (order_id)
JOIN pizza_names
USING (pizza_id)
WHERE DISTANCE IS NOT NULL
GROUP BY pizza_name;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (11)

9 Meatlovers pizzas and 3 Vegetarian pizzas were delivered

Q5. How many Vegetarian and Meatlovers were ordered by each customer?

We need to get names of pizza so we would join customer_orders_temp and pizza_names tables.
Group by customer_id and pizza_name then count customer ids

SELECT customer_id, pizza_name, COUNT(customer_id) AS pizza_count
FROM customer_orders_temp
JOIN pizza_names
USING (pizza_id)
GROUP BY customer_id, pizza_name
ORDER BY customer_id;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (12)

Customer 101 ordered 2 Meatlovers pizzas and 1 Vegetarian pizza.
Customer 102 ordered 2 Meatlovers pizzas and 1 Vegetarian pizzas.
Customer 103 ordered 3 Meatlovers pizzas and 1 Vegetarian pizza.
Customer 104 ordered 3 Meatlovers pizzas.
Customer 105 ordered 1 Vegetarian pizza.

Q6. What was the maximum number of pizzas delivered in a single order?

We need to group by order_id, count and rank in descending order from highest count to the lowest. Then we write a second query to select the order_id with rank of 1.

With ranking AS (
 SELECT order_id, COUNT(order_id) AS pizza_count,
 RANK() OVER(ORDER BY COUNT(order_id) DESC)
 FROM customer_orders_temp
 JOIN runner_orders_temp
 USING (order_id)
 WHERE DISTANCE IS NOT NULL
 GROUP BY order_id
)
SELECT order_id, pizza_count FROM ranking
WHERE rank = 1;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (13)

Maximum number of pizzas delivered in a single order was 3.

Yet again we see another question that requires us to join runner_orders_temp to the customer_orders_temp. So let’s create a view called delivered_orders that joins these tables and applies WHERE DISTANCE IS NOT NULL filter so we don’t have to do these every time we encounter a question that requires us to work with successfully delivered orders.

Creating a view

CREATE VIEW delivered_orders AS
SELECT * FROM customer_orders_temp
JOIN runner_orders_temp 
USING (order_id)
WHERE distance IS NOT NULL;

Q7. For each customer, how many delivered pizzas had at least 1 change, and how many had no changes?

Every pizza type has its standard recipe as shown in the pizza_recipes table. When a customer requests for an exclusion or an extra, he/she is asking for changes to made to the standard recipe of that pizza.

So to answer this question we would include CASE statements to count:

(i) pizzas that had at least one exclusion or extra (i.e. at least one change)

(ii) pizzas that customers didn’t request for extras or exclusions (i.e. no changes) and

SELECT customer_id,
 COUNT(
 CASE 
 WHEN exclusions <> ‘’ OR extras <> ‘’ THEN 1
 END
 ) AS changed,
 COUNT(
 CASE 
 WHEN exclusions = ‘’ AND extras = ‘’ THEN 1
 END
 ) AS unchanged 
FROM delivered_orders
GROUP BY customer_id
ORDER BY customer_id;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (14)

Only 40% of the customers (i.e. Customer 101 and 102) took their pizzas with the standard set of toppings. The others were open to trying out other toppings.

Q8. How many pizzas were delivered that had both exclusions and extras?

We would be counting delivered pizzas where neither the exclusions column nor the extras column is empty for that order i.e., the customer asked for both

SELECT COUNT(*) AS pizza_having_exclusions_n_extras
FROM delivered_orders
WHERE exclusions <> ‘’ 
AND extras <> ‘’;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (15)

Just one pizza was delivered with both exclusions and extras

Q9. What was the total volume of pizzas ordered for each hour of the day?

We can use EXTRACT to get the hour of the day an order was made. This will allow us to group pizzas by hour of the day and count.

SELECT EXTRACT(HOUR FROM order_time) AS hour_of_day, 
 count(pizza_id) AS pizza_count
FROM customer_orders_temp
GROUP BY hour_of_day
ORDER BY hour_of_day;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (16)

From the above we can infer that the 13th(1pm), 18th(6pm), 21th(9pm) and 23rd(11pm) hours are the busiest of the day
11th(11am) and 19th(7pm) hours are the least busy

Q10. What was the volume of orders for each day of the week?

TO_CHAR gives us the day of week for the specified date e.g. Friday, Saturday etc. We can then group by day of the week and take a count of pizza ids for each day.

SELECT TO_CHAR(order_time, ‘day’) AS day_of_week,
 COUNT(pizza_id) AS pizza_count
FROM customer_orders_temp
GROUP BY day_of_week
ORDER BY day_of_week;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (17)

With 5 orders each, Saturdays and Wednesdays are the busiest days of the week while Friday is the least busy

Q1. How many runners signed up for each 1 week period? (i.e. week starts 2021-01-01)

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (18)

SELECT registration_date, 
 EXTRACT(WEEK FROM registration_date) AS week_of_year
FROM runners;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (19)

As seen in the calendar above, 1st January, 2021 is a Friday and a continuation of the last (or 53rd) week of 2020. It seems to explain why the EXTRACT function returns 53 when applied to the 1st January, 2021. Our first full week in 2021 starts 3 days after the 1st (i.e. 4th January).
Hence, we will need to offset our dates by 3 days so that the EXTRACT function sees 1st January as it would see 4th January. This ensures EXTRACT reads 1st January as the first week in 2021 and not a continuation of the last week of the previous year because our question says week starts 2021–01–01.

SELECT registration_date, 
 EXTRACT(WEEK FROM registration_date + 3) AS week_of_year
FROM runners;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (20)

Now we have that sorted we can now provide a solution for our question

SELECT EXTRACT(WEEK FROM registration_date + 3) AS week_of_year,
 COUNT(runner_id)
FROM runners
GROUP BY week_of_year
ORDER BY week_of_year;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (21)

Week 1 has the most runners (2) signed up

Q2. What was the average time in minutes it took for each runner to arrive at the Pizza Runner HQ to pick up the order?

Recall our view, delivered_orders, contains both the runner_orders_temp and customer_orders_temptables. So we can simply subtract the order_time from the pickup_time and find the average.
We then use DATE_TRUNC function to extract the minute component of the time. An offset of 30 seconds is required to round up our average to the next minute when the seconds component of the average time is 30 seconds or more (e.g. 15 minutes 30 seconds will be rounded up to 16 minutes)

WITH order_time_diff AS (
 SELECT DISTINCT order_id, (pickup_time — order_time) AS time_diff 
 FROM delivered_orders
 )
 SELECT DATE_TRUNC(‘minute’, AVG(time_diff) + ’30 second’ ) — round to nearest minute
 FROM order_time_diff;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (22)

It takes each runner 16 minutes on the average to pick up the order

Q3. Is there any relationship between the number of pizzas and how long the order takes to prepare?

First step is to get for each order, the number of pizzas and time it takes to prepare the pizzas
Then group orders that have similar pizza count together and find the average of preparation time for each group

WITH orders_group AS (
 SELECT order_id, count(order_id) AS pizza_count, 
 (pickup_time - order_time) AS time_diff 
 FROM delivered_orders
 GROUP BY order_id, pickup_time, order_time
 ORDER BY order_id
 )
 SELECT pizza_count, AVG(time_diff) 
 FROM orders_group
 GROUP BY pizza_count;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (23)

From the above, the more the pizzas contained in an order, the longer it takes for that order to be ready.

Q4. What was the average distance traveled for each customer?

Group by customer and calculate the average distance for each customer

SELECT customer_id, 
 ROUND(AVG(distance), 2) as avg_distance
FROM delivered_orders
GROUP BY customer_id
ORDER BY avg_distance DESC;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (24)

Customer 105 stays farthest (25km) while Customer 104 stays closest (10km).

Q5. What was the difference between the longest and shortest delivery times for all orders?

Use MAX and MIN functions to get the longest and shortest delivery times respectively.

SELECT MAX(duration) — MIN(duration) AS delivery_time_diff 
FROM runner_orders_temp;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (25)

The difference between the longest and shortest delivery times is 30 minutes

Q6. What was the average speed for each runner for each delivery and do you notice any trend for these values?

Speed (in km/hr) is calculated by dividing the distance covered in kilometers by the time spent in hours. Since our duration is in minutes, we have to divide it by 60 to get the equivalent in hours
Because every pizza in a single order will be delivered at the same time, we will use DISTINCT with the order_id so we would have only one entry for each order_id. Hence, using AVG is not necessary

SELECT DISTINCT order_id, runner_id, 
 round(distance / (duration::numeric/60), 2) AS average_speed
FROM delivered_orders
ORDER BY runner_id, average_speed

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (26)

Of concern is Runner 2’s speed. There is a large variance between the lowest(35.1km/hr) and highest speeds (93.6km/hr). This should be investigated.

Q7. What is the successful delivery percentage for each runner?

Group by runner and calculate the percentage of successful deliveries for each runner

SELECT runner_id, 
 round(count(distance)::numeric/ count(runner_id) * 100) AS delivery_percentage
FROM runner_orders_temp
GROUP BY runner_id;

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (27)

Runner 1 has highest percentage of successful deliveries (100%) while Runner 3 has the least (50%). But it’s important to note that it’s beyond the control of the runner as either the customer or the restaurant can cancel orders.

We have come to the end of the first part of this case study. Thank you so much for reading. If you found this post useful, please consider giving an upvote / follow.

I hope to write the concluding part of this case study soon but in the meantime you may want to check out my solutions to case study #3 .

Danny Ma’s SQL Case Study #2 Pizza Runner — Solutions (2024)

References