Published in · 12 min read · Apr 12, 2023
--
This article only focuses on solutions to sections A and B of this case study. This is done to make it easier to read. If you would like to check out my solutions to an earlier, you can find it here case study #1.
PostgreSQL v13 will be used for this analysis.
Did you know that over 115 million kilograms of pizza is consumed daily worldwide??? (Well according to Wikipedia anyway…)
Danny was scrolling through his Instagram feed when something really caught his eye — “80s Retro Styling and Pizza Is The Future!”
Danny was sold on the idea, but he knew that pizza alone was not going to help him get seed funding to expand his new Pizza Empire — so he had one more genius idea to combine with it — he was going to Uberize it — and so Pizza Runner was launched!
Danny started by recruiting “runners” to deliver fresh pizza from Pizza Runner Headquarters (otherwise known as Danny’s house) and also maxed out his credit card to pay freelance developers to build a mobile app to accept orders from customers.
Because Danny had a few years of experience as a data scientist — he was very aware that data collection was going to be critical for his business’ growth.
He has prepared for us an entity relationship diagram of his database design but requires further assistance to clean his data and apply some basic calculations so he can better direct his runners and optimise Pizza Runner’s operations.
All datasets exist within the pizza_runner
database schema - be sure to include this reference within your SQL scripts as you start exploring the data and answering the case study questions.
You can find the all the details for this challenge including the datasets here.
Let’s view the current state of customer_orders
If we inspect the exclusions
and extras
columns, we see that not every cell has a value. The cells where customers don’t request for exclusions
or extras
are represented in an inconsistent manner. They either have the Null data type (i.e. [null]
), null as a string (i.e. null
) or are left blank. This inconsistency can lead to errors when trying to retrieve data in future. So we want to stick with one method of representing cells that have no exclusions
or extras
. Now, because these columns are varchar columns that contain strings, it makes sense to represent a cell without an exclusion
or extra
as an empty string.
DROP TABLE IF EXISTS customer_orders_temp;
CREATE TABLE customer_orders_temp AS
SELECT
order_id,
customer_id,
pizza_id,
CASE
WHEN exclusions is null OR exclusions = ‘null’ THEN ‘’
ELSE exclusions
END AS exclusions,
CASE
WHEN extras is null OR extras = ‘null’ THEN ‘’
ELSE extras
END AS extras,
order_time
FROM customer_orders;
The idea is to create a temp_table, customer_orders_temp
, which would be a copy of our original table, customer_orders
. The data cleaning or transformation will be done on the temporary table leaving our original table untouched so we will have a source to fallback to just incase.
This is how our new table looks:
Now, let’s proceed to the runner_orders
table. There’s work to be done there too.
Inspecting runner_orders
, we can see a number of issues. pickup_time
, distance
and duration
columns were created with the character varying data type (i.e. varchar) when it should have been timestamp, numeric and integer respectively. So we will need to modify these columns to appropriate data types. But if we tried to do so now, we would get errors because there are some invalid entries. We should not have string(i.e. text) in those columns. We need to replace null
(that’s written as a string) with [null]
data type. In addition, we have to remove the ‘km’, ‘minute’, ‘minutes’, ‘mins’ from distance
and duration
columns. The cancellation
column has the same issues with the exclusions
and extras
columns in customer_orders
. Hence, we would apply the same logic we applied to those columns to the cancellation
column.
DROP TABLE IF EXISTS runner_orders_temp;
CREATE TEMP TABLE runner_orders_temp AS
SELECT
order_id,
runner_id,
CASE
WHEN pickup_time = ‘null’ THEN NULL
ELSE pickup_time
END AS pickup_time,
CASE
WHEN distance = ‘null’ THEN NULL
WHEN distance LIKE ‘%km’ THEN TRIM(‘km’ from distance)
ELSE distance
END AS distance,
CASE
WHEN duration = ‘null’ THEN NULL
WHEN duration LIKE ‘%mins’ THEN TRIM(‘mins’ from duration)
WHEN duration LIKE ‘%minute’ THEN TRIM(‘minute’ from duration)
WHEN duration LIKE ‘%minutes’ THEN TRIM(‘minutes’ from duration)
ELSE duration
END AS duration,
CASE
WHEN cancellation IS NULL or cancellation = ‘null’ THEN ‘’
ELSE cancellation
END AS cancellation
FROM runner_orders;
After effecting the above changes, we can now alter the data type as shown below.
ALTER TABLE runner_orders_temp
ALTER COLUMN pickup_time TYPE timestamp USING pickup_time::timestamp without time zone,
ALTER COLUMN distance TYPE numeric USING distance::numeric,
ALTER COLUMN duration TYPE integer USING duration::integer;
Let’s view our the current state of new table:
SELECT * FROM runner_orders_temp
We are now ready to analyze our data and get useful insights.
Now that we have cleaned data, we can proceed with our analysis
Q1. How many pizzas were ordered?
SELECT COUNT(pizza_id) AS ordered_pizza
FROM customer_orders_temp;
Q2. How many unique customer orders were made?
- Use COUNT function with DISTINCT to get the number of unique orders
SELECT COUNT(DISTINCT(order_id)) AS unique_customer_orders
FROM customer_orders_temp;
Q3. How many successful orders were delivered by each runner?
- If we observe
runner_orders_temp
table, a successful order has a non-null entry for both distance and duration. So we can use either column to identify a successful order. To get the successful orders for each runner we will filter our results by applying a where distance is not null clause and grouping the orders by runner_id and counting for each runner.
SELECT runner_id, COUNT(runner_id) AS order_count
FROM runner_orders_temp
WHERE DISTANCE IS NOT NULL
GROUP BY runner_id
ORDER BY order_count DESC;
- Runner 1 has delivered 4 orders.
- Runner 2 has delivered 3 orders.
- Runner 3 has delivered 1 order.
Q4. How many of each type of pizza was delivered?
- While the
runner_orders_temp
gives information as regards delivered orders, it does not tell us the content of each order. Those details can be found incustomer_orders_temp
table. Hence, we’ll join these tables. - Apply where distance is not null clause to get only delivered orders, group by pizza name then take a count for each pizza type.
SELECT pizza_name, COUNT(pizza_id)
FROM runner_orders_temp
JOIN customer_orders_temp
USING (order_id)
JOIN pizza_names
USING (pizza_id)
WHERE DISTANCE IS NOT NULL
GROUP BY pizza_name;
- 9 Meatlovers pizzas and 3 Vegetarian pizzas were delivered
Q5. How many Vegetarian and Meatlovers were ordered by each customer?
- We need to get names of pizza so we would join
customer_orders_temp
andpizza_names
tables. - Group by customer_id and pizza_name then count customer ids
SELECT customer_id, pizza_name, COUNT(customer_id) AS pizza_count
FROM customer_orders_temp
JOIN pizza_names
USING (pizza_id)
GROUP BY customer_id, pizza_name
ORDER BY customer_id;
- Customer 101 ordered 2 Meatlovers pizzas and 1 Vegetarian pizza.
- Customer 102 ordered 2 Meatlovers pizzas and 1 Vegetarian pizzas.
- Customer 103 ordered 3 Meatlovers pizzas and 1 Vegetarian pizza.
- Customer 104 ordered 3 Meatlovers pizzas.
- Customer 105 ordered 1 Vegetarian pizza.
Q6. What was the maximum number of pizzas delivered in a single order?
- We need to group by order_id, count and rank in descending order from highest count to the lowest. Then we write a second query to select the order_id with rank of 1.
With ranking AS (
SELECT order_id, COUNT(order_id) AS pizza_count,
RANK() OVER(ORDER BY COUNT(order_id) DESC)
FROM customer_orders_temp
JOIN runner_orders_temp
USING (order_id)
WHERE DISTANCE IS NOT NULL
GROUP BY order_id
)
SELECT order_id, pizza_count FROM ranking
WHERE rank = 1;
- Maximum number of pizzas delivered in a single order was 3.
Yet again we see another question that requires us to join
runner_orders_temp
to thecustomer_orders_temp
. So let’s create a view calleddelivered_orders
that joins these tables and applies WHERE DISTANCE IS NOT NULL filter so we don’t have to do these every time we encounter a question that requires us to work with successfully delivered orders.
Creating a view
CREATE VIEW delivered_orders AS
SELECT * FROM customer_orders_temp
JOIN runner_orders_temp
USING (order_id)
WHERE distance IS NOT NULL;
Q7. For each customer, how many delivered pizzas had at least 1 change, and how many had no changes?
- Every pizza type has its standard recipe as shown in the
pizza_recipes
table. When a customer requests for an exclusion or an extra, he/she is asking for changes to made to the standard recipe of that pizza.
So to answer this question we would include CASE statements to count:
(i) pizzas that had at least one exclusion or extra (i.e. at least one change)
(ii) pizzas that customers didn’t request for extras or exclusions (i.e. no changes) and
SELECT customer_id,
COUNT(
CASE
WHEN exclusions <> ‘’ OR extras <> ‘’ THEN 1
END
) AS changed,
COUNT(
CASE
WHEN exclusions = ‘’ AND extras = ‘’ THEN 1
END
) AS unchanged
FROM delivered_orders
GROUP BY customer_id
ORDER BY customer_id;
- Only 40% of the customers (i.e. Customer 101 and 102) took their pizzas with the standard set of toppings. The others were open to trying out other toppings.
Q8. How many pizzas were delivered that had both exclusions and extras?
- We would be counting delivered pizzas where neither the exclusions column nor the extras column is empty for that order i.e., the customer asked for both
SELECT COUNT(*) AS pizza_having_exclusions_n_extras
FROM delivered_orders
WHERE exclusions <> ‘’
AND extras <> ‘’;
- Just one pizza was delivered with both exclusions and extras
Q9. What was the total volume of pizzas ordered for each hour of the day?
- We can use EXTRACT to get the hour of the day an order was made. This will allow us to group pizzas by hour of the day and count.
SELECT EXTRACT(HOUR FROM order_time) AS hour_of_day,
count(pizza_id) AS pizza_count
FROM customer_orders_temp
GROUP BY hour_of_day
ORDER BY hour_of_day;
- From the above we can infer that the 13th(1pm), 18th(6pm), 21th(9pm) and 23rd(11pm) hours are the busiest of the day
- 11th(11am) and 19th(7pm) hours are the least busy
Q10. What was the volume of orders for each day of the week?
- TO_CHAR gives us the day of week for the specified date e.g. Friday, Saturday etc. We can then group by day of the week and take a count of pizza ids for each day.
SELECT TO_CHAR(order_time, ‘day’) AS day_of_week,
COUNT(pizza_id) AS pizza_count
FROM customer_orders_temp
GROUP BY day_of_week
ORDER BY day_of_week;
- With 5 orders each, Saturdays and Wednesdays are the busiest days of the week while Friday is the least busy
Q1. How many runners signed up for each 1 week period? (i.e. week starts 2021-01-01
)
SELECT registration_date,
EXTRACT(WEEK FROM registration_date) AS week_of_year
FROM runners;
- As seen in the calendar above, 1st January, 2021 is a Friday and a continuation of the last (or 53rd) week of 2020. It seems to explain why the EXTRACT function returns 53 when applied to the 1st January, 2021. Our first full week in 2021 starts 3 days after the 1st (i.e. 4th January).
- Hence, we will need to offset our dates by 3 days so that the EXTRACT function sees 1st January as it would see 4th January. This ensures EXTRACT reads 1st January as the first week in 2021 and not a continuation of the last week of the previous year because our question says week starts 2021–01–01.
SELECT registration_date,
EXTRACT(WEEK FROM registration_date + 3) AS week_of_year
FROM runners;
- Now we have that sorted we can now provide a solution for our question
SELECT EXTRACT(WEEK FROM registration_date + 3) AS week_of_year,
COUNT(runner_id)
FROM runners
GROUP BY week_of_year
ORDER BY week_of_year;
- Week 1 has the most runners (2) signed up
Q2. What was the average time in minutes it took for each runner to arrive at the Pizza Runner HQ to pick up the order?
- Recall our view,
delivered_orders
, contains both therunner_orders_temp
andcustomer_orders_temp
tables. So we can simply subtract theorder_time
from thepickup_time
and find the average. - We then use DATE_TRUNC function to extract the minute component of the time. An offset of 30 seconds is required to round up our average to the next minute when the seconds component of the average time is 30 seconds or more (e.g. 15 minutes 30 seconds will be rounded up to 16 minutes)
WITH order_time_diff AS (
SELECT DISTINCT order_id, (pickup_time — order_time) AS time_diff
FROM delivered_orders
)
SELECT DATE_TRUNC(‘minute’, AVG(time_diff) + ’30 second’ ) — round to nearest minute
FROM order_time_diff;
- It takes each runner 16 minutes on the average to pick up the order
Q3. Is there any relationship between the number of pizzas and how long the order takes to prepare?
- First step is to get for each order, the number of pizzas and time it takes to prepare the pizzas
- Then group orders that have similar pizza count together and find the average of preparation time for each group
WITH orders_group AS (
SELECT order_id, count(order_id) AS pizza_count,
(pickup_time - order_time) AS time_diff
FROM delivered_orders
GROUP BY order_id, pickup_time, order_time
ORDER BY order_id
)
SELECT pizza_count, AVG(time_diff)
FROM orders_group
GROUP BY pizza_count;
- From the above, the more the pizzas contained in an order, the longer it takes for that order to be ready.
Q4. What was the average distance traveled for each customer?
- Group by customer and calculate the average distance for each customer
SELECT customer_id,
ROUND(AVG(distance), 2) as avg_distance
FROM delivered_orders
GROUP BY customer_id
ORDER BY avg_distance DESC;
- Customer 105 stays farthest (25km) while Customer 104 stays closest (10km).
Q5. What was the difference between the longest and shortest delivery times for all orders?
- Use MAX and MIN functions to get the longest and shortest delivery times respectively.
SELECT MAX(duration) — MIN(duration) AS delivery_time_diff
FROM runner_orders_temp;
- The difference between the longest and shortest delivery times is 30 minutes
Q6. What was the average speed for each runner for each delivery and do you notice any trend for these values?
- Speed (in km/hr) is calculated by dividing the distance covered in kilometers by the time spent in hours. Since our duration is in minutes, we have to divide it by 60 to get the equivalent in hours
- Because every pizza in a single order will be delivered at the same time, we will use DISTINCT with the
order_id
so we would have only one entry for eachorder_id
. Hence, using AVG is not necessary
SELECT DISTINCT order_id, runner_id,
round(distance / (duration::numeric/60), 2) AS average_speed
FROM delivered_orders
ORDER BY runner_id, average_speed
- Of concern is Runner 2’s speed. There is a large variance between the lowest(35.1km/hr) and highest speeds (93.6km/hr). This should be investigated.
Q7. What is the successful delivery percentage for each runner?
- Group by runner and calculate the percentage of successful deliveries for each runner
SELECT runner_id,
round(count(distance)::numeric/ count(runner_id) * 100) AS delivery_percentage
FROM runner_orders_temp
GROUP BY runner_id;
- Runner 1 has highest percentage of successful deliveries (100%) while Runner 3 has the least (50%). But it’s important to note that it’s beyond the control of the runner as either the customer or the restaurant can cancel orders.
We have come to the end of the first part of this case study. Thank you so much for reading. If you found this post useful, please consider giving an upvote / follow.
I hope to write the concluding part of this case study soon but in the meantime you may want to check out my solutions to case study #3 .