NOTE: When solving sql questions always make sure that all the required components listed below are taken care of:
SELECT {columns},
AGG_FUNC {column(s)}
FROM {table(s)}
JOIN_TYPE {other_table(s)}
ON {column_names}
WHERE {conditions}
GROUP BY {columns}
HAVING {condition on aggregates}
ORDER BY {column(s)}
LIMIT {num rows} OFFSET {num}
Keep In Mind
$\rightarrow$ When using RANK() always think if the ordering is ASC or DESC.
Check
- Check if values could be NULL
- For questions with largest/smallest, check what to do if two or more have the same values
- Check the final column names and any column name typos
- Check that the final answer does need rounding or not (if numerical), or ordering (if not)
- Whenever there is a GROUP BY think about NULL
- Whenever doing UNION check if a simple OR in the where clause would work if disjoint sets
- Ask if the foreign key can be NULL
Caveat
Let’s say we have two tables Sellers
and Orders
as given below and the goal is to find all sellers
who did not make any sells in the month of Apr.
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| seller_id | int |
| seller_name | int |
+-------------+---------+
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| seller_id | int |
| sale_month | Enum |
+-------------+---------+
One query that you would think will get the job done is:
SELECT s.seller_id
FROM Sellers AS s
LEFT JOIN Orders AS o
ON s.seller_id = o.seller_id
WHERE o.sale_month = 'Apr'
GROUP BY s.seller_id
HAVING COUNT(o.seller_id) = 0
However, the above query will give wrong result. This is because in SQL, WHERE
is evaluated after the LEFT JOIN
. We can understand this
with the following example:
+-------------+-------------+
| seller_id | seller_name |
+-------------+-------------+
| 1 | A |
| 2 | B |
+-------------+-------------+
+-------------+-------------+
| seller_id | sale_month |
+-------------+-------------+
| 1 | Apr |
| 1 | May |
+-------------+-------------+
$\rightarrow$ From the above example, we can see that seller_id=2
did not make any sale in the month of Apr, which is the output that we want.
Now let’s analyze the query we have written. The output after the LEFT JOIN will be:
+-------------+---------------+-------------+--------------+
| s.seller_id | s.seller_name | o.seller_id | o.sale_month |
+-------------+---------------+-------------+--------------+
| 1 | A | 1 | Apr |
| 1 | A | 1 | May |
| 2 | B | NULL | NULL |
+-------------+---------------+-------------+--------------+
And after the filtering with the WHERE o.sale_month = 'Apr'
we will get:
+-------------+---------------+-------------+--------------+
| s.seller_id | s.seller_name | o.seller_id | o.sale_month |
+-------------+---------------+-------------+--------------+
| 1 | A | 1 | Apr |
+-------------+---------------+-------------+--------------+
Notice that the row with seller_id=2
has disappeared altogether. So we have to careful with the usage of ON
and WHERE
when LEFT JOIN or RIGHT JOIN is involved.
A correct query is:
SELECT s.seller_id
FROM Sellers AS s
LEFT JOIN Orders AS o
ON s.seller_id = o.seller_id
AND o.sale_month = 'Apr'
GROUP BY s.seller_id
HAVING COUNT(o.seller_id) = 0
By using this, the filtering of the month condition happens in the joining process and not after. The output after the LEFT JOIN will now look like:
+-------------+---------------+-------------+--------------+
| s.seller_id | s.seller_name | o.seller_id | o.sale_month |
+-------------+---------------+-------------+--------------+
| 1 | A | 1 | Apr |
| 2 | B | NULL | NULL |
+-------------+---------------+-------------+--------------+
The above table, after the groupby and having clause will give the right answer, which is seller_id=2
.
Activity
Let’s say we have a table called Logins
that contains the following columns:
Logins |
---|
user_id |
login_date |
time_spent |
- What is the total time that each user spent for the month of Feb, 2024?
SELECT user_id, SUM(time_spent) AS total_time_spent FROM Logins WHERE DATE_FORMAT(login_date, '%Y-%m') = '2024-02' GROUP BY 1;
- Which user(s) have logged in the most number of times for a given day?
WITH LoginsPerDay AS ( SELECT user_id, login_date, COUNT(1) AS num_logins FROM Logins GROUP BY 1, 2; ) SELECT DISTINCT(user_id) AS user_id FROM LoginsPerDay WHERE num_logins = (SELECT MAX(num_logins) FROM LoginsPerDay)
- Which users have logged in for at least 5 consecutive days?
To answer this, we need to do a self-join where for any given row in the first table, we check
if there are rows in the second table that are within 5 days (including itself). Now there can be
multiple logins on some day; hence we need to dedup using DISTINCT.
SELECT DISTINCT l1.user_id, FROM Logins l1 JOIN Logins l2 ON l1.user_id = l2.user_id AND l2.login_date BETWEEN l1.login_date AND DATE_ADD(l1.login_date, INTERVAL 4 DAY) GROUP BY l1.user_id, l1.login_date HAVING(COUNT(DISTINCT l2.login_date)) = 5
- Which users are active? An active user is a user that has made a second login within 3 days of any other of their login.
The strategy is the same as that used for the above question. The only difference is we don’t have to worry
about distinct; since even a second login on the same day counts.
SELECT DISTINCT l1.user_id, FROM Logins l1 JOIN Logins l2 ON l1.user_id = l2.user_id AND l2.login_date BETWEEN l1.login_date AND DATE_ADD(l1.login_date, INTERVAL 3 DAY) GROUP BY l1.user_id, l1.login_date HAVING(*) > 1;
Friends and Relations
Let us now consider a table called Friends
that contains the following columns:
Friends |
---|
user1_id |
user2_id |
Each row is ordered such that user1_id
< user2_id
.
- Calculate number of friends each user has?
Since a given row
(1, 2)
implies that bothuser1_id
anduser2_id
have increased their friend count by 1; we need to somehow capture the information present in both these columns into a single column. We can do this using UNION.WITH Connections AS ( SELECT user1_id AS person FROM Friends UNION ALL SELECT user2_id FROM Friends ) SELECT person, COUNT(1) AS num_friends FROM Connections GROUP BY 1
- Calculate number of mutual friends for each friend pair?
A mutual friend of a pair $A \leftrightarrow B$ is someone such that $A \leftrightarrow C$ and $C \leftrightarrow B$. Thus to find the mutual friends we need to do self-join twice. Since the given table
only contains a directed edge $A \rightarrow B$, to create undirected edges we simply replicate the table
swapping the user_id as follows:
WITH Connections AS ( SELECT user1_id, user2_id FROM Friends UNION SELECT user2_id, user1_id FROM Friends )
Now we can apply our mutual friend criteria. Note that the first table in the join is the Friends table and not the Connections table, to avoid double counting and maintain the ordering user1_id
< user2_id
.
SELECT c1.user1_id, c1.user2_id, COUNT(1) AS num_mutual_friends
FROM Friends c1, Connections c2, Connections c3
WHERE c1.user1_id = c2.user1_id
AND c2.user2_id = c3.user1_id
AND c3.user2_id = c1.user2_id
GROUP BY 1, 2;
Now let’s say we have another table called Likes
that contains pages that are liked by users.
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| user_id | int |
| page_id | int |
+-------------+---------+
We would like to recommend to any given user, pages that their friends liked which they haven’t liked yet; along with the count of the friends that liked that particular page.
To do this we will join the Connections
table with the Likes
table; while making sure that the page in the Likes
table row being considered is something that we haven’t
liked yet.
NOTE: We can use both NOT IN
and NOT EXISTS
; however for some reason NOT EXISTS
is more optimal.
SELECT f.user1_id AS user_id, l.page_id, COUNT(l.user_id) AS friends_likes
FROM (
SELECT user1_id, user2_id FROM Friendship
UNION
SELECT user2_id, user1_id FROM Friendship
) AS f
JOIN Likes l
ON f.user2_id = l.user_id
AND NOT EXISTS (SELECT page_id FROM Likes WHERE user_id = f.user1_id AND page_id = l.page_id)
GROUP BY f.user1_id, l.page_id
A very similar concept to that of Friends
is that of Follower
. Let’s say we have a table as given below, where every row implies that
a person whose id is follower_id
is following a person whose id is user_id
. Unlike the Friends
table, where any entry implies a
undirected relation, in this case every row is a directed relation.
+-------------+------+
| Column Name | Type |
+-------------+------+
| user_id | int |
| follower_id | int |
+-------------+------+
We want to find all pairs of users (user1_id, user2_id)
that have the most number of common followers. Notice that C
is a common follower for a pair of users $(A, B)$ if there exists two rows in the table $(user\_id=A, follower\_id=C)$ and $(user\_id=B, follower\_id=C)$.
WITH AggTable AS (
SELECT r1.user_id AS user1_id, r2.user_id AS user2_id, COUNT(1) AS common
FROM Relations r1, Relations r2
WHERE r1.user_id < r2.user_id
AND r1.follower_id = r2.follower_id
GROUP BY 1, 2
)
SELECT user1_id, user2_id
FROM AggTable
WHERE common = (SELECT MAX(common) FROM AggTable)
Pivoting
Let’s consider a table Visits
that contains entries for a particular month.
Visits |
---|
tenant_id |
pet_category |
visit_date |
amount |
Each entry (row) corresponds to a visit made by some tenant to supermarket to buy food for one of their pets along with the amount spent.
Compute the total amount that each tenant spends on each pet_category, where pet_category is an Enum with values cat, dog, fish
. The output should look like:
tenant_id | cat | dog | fish |
---|---|---|---|
1 | 20 | 40 | 0 |
2 | 34 | 0 | 25 |
We can get the total amount spent by each user per each pet category by simply doing a groupby. However, this will lead to a long-format table, which we will have to pivot to make the wide-format as required by the problem.
WITH ReqTable AS (
SELECT tenant_id, pet_category, SUM(amount) AS total_amount
FROM
GROUP BY 1, 2;
)
SELECT
tenant_id,
SUM(IF(pet_category = 'cat', total_amount, 0)) AS 'cat',
SUM(IF(pet_category = 'dog', total_amount, 0)) AS 'dog',
SUM(IF(pet_category = 'fish', total_amount, 0)) AS 'fish'
FROM ReqTable
GROUP BY 1;
Other scenarions
We are given two tables Trips
and Users
; which could be a log from a ride-share service.
+-------------+----------+
| Column Name | Type |
+-------------+----------+
| id | int |
| client_id | int |
| driver_id | int |
| city_id | int |
| status | enum |
| request_at | date |
+-------------+----------+
+-------------+----------+
| Column Name | Type |
+-------------+----------+
| users_id | int |
| banned | enum |
| role | enum |
+-------------+----------+
The cancellation rate is computed by dividing the number of canceled (by client or driver) requests with unbanned users by the total number of requests with unbanned users on that day. Find the cancellation rate of unbanned users (both client and driver must not be banned) each day between “2013-10-01” and “2013-10-03”. Round Cancellation Rate to two decimal points.
Since we only care about unbanned
users we first check that neither client_id
nor driver_id
are banned.
SELECT
request_at AS Day,
ROUND(AVG(IF(status = 'completed', 0, 1)), 2) AS `Cancellation Rate`
FROM Trips t
WHERE client_id NOT IN (SELECT users_id FROM Users WHERE banned = 'Yes')
AND driver_id NOT IN (SELECT users_id FROM Users WHERE banned = 'Yes')
AND request_at BETWEEN '2013-10-01' AND '2013-10-03'
GROUP BY 1;
Given two tables Customers
and Orders
, get the customer_id and customer_name of customers who bought products “A”, “B” but did not buy the product “C”.
+---------------------+---------+
| Column Name | Type |
+---------------------+---------+
| customer_id | int |
| customer_name | varchar |
+---------------------+---------+
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| order_id | int |
| customer_id | int |
| product_name | varchar |
+---------------+---------+
We first get the count of each product A
, B
and C
for each customer_id
. Then we filter those that have a
non-zero values for products A
and B
, but have zero value for C
.
WITH AggTable AS (
SELECT
customer_id,
SUM(IF(product_name = 'A', 1, 0)) AS product_a_count,
SUM(IF(product_name = 'B', 1, 0)) AS product_b_count,
SUM(IF(product_name = 'C', 1, 0)) AS product_c_count
FROM Orders
GROUP BY customer_id
HAVING product_a_count > 0 AND product_b_count > 0 AND product_c_count = 0
)
SELECT customer_id, customer_name
FROM AggTable
JOIN Customers USING(customer_id)
Write a solution to report the number of grand slam tournaments won by each player. Do not include the players who did not win any tournament.
We have two tables named Players
and Championships
as follows:
+----------------+---------+
| Column Name | Type |
+----------------+---------+
| player_id | int |
| player_name | varchar |
+----------------+---------+
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| year | int |
| Wimbledon | int |
| Fr_open | int |
| US_open | int |
| Au_open | int |
+---------------+---------+
We first CROSS JOIN between the two tables to get all possible combinations of (year, player_id)
. We then check
the number of grand slams won by that player_id for that particular year and then sum that count across all years.
WITH AggTable AS (
SELECT
p.player_id,
SUM(IF(Wimbledon = player_id, 1, 0) +
IF(Fr_open = player_id, 1, 0) +
IF(US_open = player_id, 1, 0) +
IF(Au_open = player_id, 1, 0)) AS grand_slams_count
FROM Championships c, Players p
GROUP BY p.player_id
HAVING grand_slams_count > 0
)
SELECT player_id, player_name, grand_slams_count
FROM AggTable
JOIN Players USING(player_id)
Let us consider a call log table Calls
as follows:
+--------------+----------+
| Column Name | Type |
+--------------+----------+
| caller_id | int |
| recipient_id | int |
| call_time | datetime |
+--------------+----------+
- Find the id of the users whose first and last calls on any day were with the same person.
It is important to note that just like the Friends
table, any given call log row $A \rightarrow B$ implies
a call happened between A and B. Hence first we duplicate the table flipping the caller_id
and recipient_id
.
We then compute the first call and the last call for every user for every day, and then check if the recipients
match for at least any day.
WITH Connections AS (
SELECT caller_id AS user_id, recipient_id, call_time
FROM Calls
UNION
SELECT recipient_id AS user_id, caller_id AS recipient_id, call_time
FROM Calls
),
FirstCall AS (
SELECT user_id, CAST(call_time AS DATE) AS day, recipient_id
FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY user_id, CAST(call_time AS DATE) ORDER BY call_time DESC) AS first_call
FROM Connections
) AS k
WHERE k.first_call = 1
), LastCall AS (
SELECT user_id, CAST(call_time AS DATE) AS day, recipient_id
FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY user_id, CAST(call_time AS DATE) ORDER BY call_time) AS last_call
FROM Connections
) AS k
WHERE k.last_call = 1
)
SELECT DISTINCT(f.user_id)
FROM FirstCall AS f
JOIN LastCall AS l
ON f.user_id = l.user_id
AND f.day = l.day
AND f.recipient_id = l.recipient_id;