NOTE: When solving sql questions always make sure that all the required components listed below are taken care of:

SELECT {columns},
    AGG_FUNC {column(s)}
FROM {table(s)}
JOIN_TYPE {other_table(s)}
ON {column_names}
WHERE {conditions}
GROUP BY {columns}
HAVING {condition on aggregates}
ORDER BY {column(s)}
LIMIT {num rows} OFFSET {num}

Keep In Mind

$\rightarrow$ When using RANK() always think if the ordering is ASC or DESC.

Check

  • Check if values could be NULL
  • For questions with largest/smallest, check what to do if two or more have the same values
  • Check the final column names and any column name typos
  • Check that the final answer does need rounding or not (if numerical), or ordering (if not)
  • Whenever there is a GROUP BY think about NULL
  • Whenever doing UNION check if a simple OR in the where clause would work if disjoint sets
  • Ask if the foreign key can be NULL

Caveat

Let’s say we have two tables Sellers and Orders as given below and the goal is to find all sellers who did not make any sells in the month of Apr.

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| seller_id   | int     |
| seller_name | int     |
+-------------+---------+

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| seller_id   | int     |
| sale_month  | Enum    |
+-------------+---------+

One query that you would think will get the job done is:

SELECT s.seller_id
FROM Sellers AS s
LEFT JOIN Orders AS o
    ON s.seller_id = o.seller_id
    WHERE o.sale_month = 'Apr'
GROUP BY s.seller_id
HAVING COUNT(o.seller_id) = 0

However, the above query will give wrong result. This is because in SQL, WHERE is evaluated after the LEFT JOIN. We can understand this with the following example:

+-------------+-------------+
| seller_id   | seller_name |
+-------------+-------------+
| 1           | A           |
| 2           | B           |
+-------------+-------------+

+-------------+-------------+
| seller_id   | sale_month  |
+-------------+-------------+
| 1           | Apr         |
| 1           | May         |
+-------------+-------------+

$\rightarrow$ From the above example, we can see that seller_id=2 did not make any sale in the month of Apr, which is the output that we want.

Now let’s analyze the query we have written. The output after the LEFT JOIN will be:

+-------------+---------------+-------------+--------------+
| s.seller_id | s.seller_name | o.seller_id | o.sale_month |
+-------------+---------------+-------------+--------------+
|      1      |       A       |      1      |     Apr      |
|      1      |       A       |      1      |     May      |
|      2      |       B       |    NULL     |     NULL     |
+-------------+---------------+-------------+--------------+

And after the filtering with the WHERE o.sale_month = 'Apr' we will get:

+-------------+---------------+-------------+--------------+
| s.seller_id | s.seller_name | o.seller_id | o.sale_month |
+-------------+---------------+-------------+--------------+
|      1      |       A       |      1      |     Apr      |
+-------------+---------------+-------------+--------------+

Notice that the row with seller_id=2 has disappeared altogether. So we have to careful with the usage of ON and WHERE when LEFT JOIN or RIGHT JOIN is involved.

A correct query is:

SELECT s.seller_id
FROM Sellers AS s
LEFT JOIN Orders AS o
    ON s.seller_id = o.seller_id
    AND o.sale_month = 'Apr'
GROUP BY s.seller_id
HAVING COUNT(o.seller_id) = 0

By using this, the filtering of the month condition happens in the joining process and not after. The output after the LEFT JOIN will now look like:

+-------------+---------------+-------------+--------------+
| s.seller_id | s.seller_name | o.seller_id | o.sale_month |
+-------------+---------------+-------------+--------------+
|      1      |       A       |      1      |     Apr      |
|      2      |       B       |    NULL     |     NULL     |
+-------------+---------------+-------------+--------------+

The above table, after the groupby and having clause will give the right answer, which is seller_id=2.


Activity

Let’s say we have a table called Logins that contains the following columns:

Logins
user_id
login_date
time_spent
  • What is the total time that each user spent for the month of Feb, 2024?
    SELECT user_id, SUM(time_spent) AS total_time_spent
    FROM Logins
    WHERE DATE_FORMAT(login_date, '%Y-%m') = '2024-02'
    GROUP BY 1;
    
  • Which user(s) have logged in the most number of times for a given day?
    WITH LoginsPerDay AS (
    SELECT user_id, login_date, COUNT(1) AS num_logins
    FROM Logins 
    GROUP BY 1, 2;
    )
    SELECT DISTINCT(user_id) AS user_id
    FROM LoginsPerDay
    WHERE num_logins = (SELECT MAX(num_logins) FROM LoginsPerDay)
    
  • Which users have logged in for at least 5 consecutive days? To answer this, we need to do a self-join where for any given row in the first table, we check if there are rows in the second table that are within 5 days (including itself). Now there can be multiple logins on some day; hence we need to dedup using DISTINCT.
    SELECT DISTINCT l1.user_id,
    FROM Logins l1
    JOIN Logins l2
    ON l1.user_id = l2.user_id
      AND l2.login_date BETWEEN l1.login_date AND DATE_ADD(l1.login_date, INTERVAL 4 DAY)
    GROUP BY l1.user_id, l1.login_date
    HAVING(COUNT(DISTINCT l2.login_date)) = 5
    
  • Which users are active? An active user is a user that has made a second login within 3 days of any other of their login. The strategy is the same as that used for the above question. The only difference is we don’t have to worry about distinct; since even a second login on the same day counts.
    SELECT DISTINCT l1.user_id,
    FROM Logins l1
    JOIN Logins l2
    ON l1.user_id = l2.user_id
      AND l2.login_date BETWEEN l1.login_date AND DATE_ADD(l1.login_date, INTERVAL 3 DAY)
    GROUP BY l1.user_id, l1.login_date
    HAVING(*) > 1;
    

Friends and Relations

Let us now consider a table called Friends that contains the following columns:

Friends
user1_id
user2_id

Each row is ordered such that user1_id < user2_id.

  • Calculate number of friends each user has? Since a given row (1, 2) implies that both user1_id and user2_id have increased their friend count by 1; we need to somehow capture the information present in both these columns into a single column. We can do this using UNION.
    WITH Connections AS (
      SELECT user1_id AS person FROM Friends
      UNION ALL
      SELECT user2_id FROM Friends
    )
    SELECT person, COUNT(1) AS num_friends
    FROM Connections
    GROUP BY 1
    
  • Calculate number of mutual friends for each friend pair? A mutual friend of a pair $A \leftrightarrow B$ is someone such that $A \leftrightarrow C$ and $C \leftrightarrow B$. Thus to find the mutual friends we need to do self-join twice. Since the given table only contains a directed edge $A \rightarrow B$, to create undirected edges we simply replicate the table swapping the user_id as follows:
      WITH Connections AS (
          SELECT user1_id, user2_id FROM Friends
          UNION
          SELECT user2_id, user1_id FROM Friends
      )
    

Now we can apply our mutual friend criteria. Note that the first table in the join is the Friends table and not the Connections table, to avoid double counting and maintain the ordering user1_id < user2_id.

SELECT c1.user1_id, c1.user2_id, COUNT(1) AS num_mutual_friends
FROM Friends c1, Connections c2, Connections c3
    WHERE c1.user1_id = c2.user1_id
    AND c2.user2_id = c3.user1_id
    AND c3.user2_id = c1.user2_id
GROUP BY 1, 2;

Now let’s say we have another table called Likes that contains pages that are liked by users.

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| user_id     | int     |
| page_id     | int     |
+-------------+---------+

We would like to recommend to any given user, pages that their friends liked which they haven’t liked yet; along with the count of the friends that liked that particular page. To do this we will join the Connections table with the Likes table; while making sure that the page in the Likes table row being considered is something that we haven’t liked yet.

NOTE: We can use both NOT IN and NOT EXISTS; however for some reason NOT EXISTS is more optimal.

SELECT f.user1_id AS user_id, l.page_id, COUNT(l.user_id) AS friends_likes
FROM (
    SELECT user1_id, user2_id FROM Friendship
    UNION
    SELECT user2_id, user1_id FROM Friendship
) AS f
JOIN Likes l
ON f.user2_id = l.user_id
AND NOT EXISTS (SELECT page_id FROM Likes WHERE user_id = f.user1_id AND page_id = l.page_id)
GROUP BY f.user1_id, l.page_id

A very similar concept to that of Friends is that of Follower. Let’s say we have a table as given below, where every row implies that a person whose id is follower_id is following a person whose id is user_id. Unlike the Friends table, where any entry implies a undirected relation, in this case every row is a directed relation.

+-------------+------+
| Column Name | Type |
+-------------+------+
| user_id     | int  |
| follower_id | int  |
+-------------+------+

We want to find all pairs of users (user1_id, user2_id) that have the most number of common followers. Notice that C is a common follower for a pair of users $(A, B)$ if there exists two rows in the table $(user\_id=A, follower\_id=C)$ and $(user\_id=B, follower\_id=C)$.

WITH AggTable AS (
    SELECT r1.user_id AS user1_id, r2.user_id AS user2_id, COUNT(1) AS common
    FROM Relations r1, Relations r2
        WHERE r1.user_id < r2.user_id 
        AND r1.follower_id = r2.follower_id
    GROUP BY 1, 2
)

SELECT user1_id, user2_id
FROM AggTable
WHERE common = (SELECT MAX(common) FROM AggTable)

Pivoting

Let’s consider a table Visits that contains entries for a particular month.

Visits
tenant_id
pet_category
visit_date
amount

Each entry (row) corresponds to a visit made by some tenant to supermarket to buy food for one of their pets along with the amount spent. Compute the total amount that each tenant spends on each pet_category, where pet_category is an Enum with values cat, dog, fish. The output should look like:

tenant_id cat dog fish
1 20 40 0
2 34 0 25

We can get the total amount spent by each user per each pet category by simply doing a groupby. However, this will lead to a long-format table, which we will have to pivot to make the wide-format as required by the problem.

WITH ReqTable AS (
    SELECT tenant_id, pet_category, SUM(amount) AS total_amount
    FROM 
    GROUP BY 1, 2;
)
SELECT
    tenant_id,
    SUM(IF(pet_category = 'cat', total_amount, 0)) AS 'cat',
    SUM(IF(pet_category = 'dog', total_amount, 0)) AS 'dog',
    SUM(IF(pet_category = 'fish', total_amount, 0)) AS 'fish'
FROM ReqTable
GROUP BY 1;

Other scenarions

We are given two tables Trips and Users; which could be a log from a ride-share service.

+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| id          | int      |
| client_id   | int      |
| driver_id   | int      |
| city_id     | int      |
| status      | enum     |
| request_at  | date     |     
+-------------+----------+

+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| users_id    | int      |
| banned      | enum     |
| role        | enum     |
+-------------+----------+

The cancellation rate is computed by dividing the number of canceled (by client or driver) requests with unbanned users by the total number of requests with unbanned users on that day. Find the cancellation rate of unbanned users (both client and driver must not be banned) each day between “2013-10-01” and “2013-10-03”. Round Cancellation Rate to two decimal points.

Since we only care about unbanned users we first check that neither client_id nor driver_id are banned.

SELECT 
    request_at AS Day,
    ROUND(AVG(IF(status = 'completed', 0, 1)), 2) AS `Cancellation Rate`
FROM Trips t
    WHERE client_id NOT IN (SELECT users_id FROM Users WHERE banned = 'Yes')
    AND driver_id NOT IN (SELECT users_id FROM Users WHERE banned = 'Yes')
    AND request_at BETWEEN '2013-10-01' AND '2013-10-03'
GROUP BY 1;

Given two tables Customers and Orders, get the customer_id and customer_name of customers who bought products “A”, “B” but did not buy the product “C”.

+---------------------+---------+
| Column Name         | Type    |
+---------------------+---------+
| customer_id         | int     |
| customer_name       | varchar |
+---------------------+---------+

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| order_id      | int     |
| customer_id   | int     |
| product_name  | varchar |
+---------------+---------+

We first get the count of each product A, B and C for each customer_id. Then we filter those that have a non-zero values for products A and B, but have zero value for C.

WITH AggTable AS (
    SELECT
        customer_id,
        SUM(IF(product_name = 'A', 1, 0)) AS product_a_count,
        SUM(IF(product_name = 'B', 1, 0)) AS product_b_count,
        SUM(IF(product_name = 'C', 1, 0)) AS product_c_count
    FROM Orders
    GROUP BY customer_id
    HAVING product_a_count > 0 AND product_b_count > 0 AND product_c_count = 0
)
SELECT customer_id, customer_name
FROM AggTable
JOIN Customers USING(customer_id)

Write a solution to report the number of grand slam tournaments won by each player. Do not include the players who did not win any tournament. We have two tables named Players and Championships as follows:

+----------------+---------+
| Column Name    | Type    |
+----------------+---------+
| player_id      | int     |
| player_name    | varchar |
+----------------+---------+

+---------------+---------+
| Column Name   | Type    |
+---------------+---------+
| year          | int     |
| Wimbledon     | int     |
| Fr_open       | int     |
| US_open       | int     |
| Au_open       | int     |
+---------------+---------+

We first CROSS JOIN between the two tables to get all possible combinations of (year, player_id). We then check the number of grand slams won by that player_id for that particular year and then sum that count across all years.

WITH AggTable AS (
    SELECT 
        p.player_id,
        SUM(IF(Wimbledon = player_id, 1, 0) + 
            IF(Fr_open = player_id, 1, 0) + 
            IF(US_open = player_id, 1, 0) + 
            IF(Au_open = player_id, 1, 0)) AS grand_slams_count
    FROM Championships c, Players p
    GROUP BY p.player_id
    HAVING grand_slams_count > 0
)
SELECT player_id, player_name, grand_slams_count
FROM AggTable
JOIN Players USING(player_id)

Let us consider a call log table Calls as follows:

+--------------+----------+
| Column Name  | Type     |
+--------------+----------+
| caller_id    | int      |
| recipient_id | int      |
| call_time    | datetime |
+--------------+----------+
  • Find the id of the users whose first and last calls on any day were with the same person.

It is important to note that just like the Friends table, any given call log row $A \rightarrow B$ implies a call happened between A and B. Hence first we duplicate the table flipping the caller_id and recipient_id. We then compute the first call and the last call for every user for every day, and then check if the recipients match for at least any day.

WITH Connections AS (
    SELECT caller_id AS user_id, recipient_id, call_time
    FROM Calls
    UNION
    SELECT recipient_id AS user_id, caller_id AS recipient_id, call_time
    FROM Calls
),
FirstCall AS (
    SELECT user_id, CAST(call_time AS DATE) AS day, recipient_id
    FROM (
        SELECT *,
            ROW_NUMBER() OVER(PARTITION BY user_id, CAST(call_time AS DATE) ORDER BY call_time DESC) AS first_call
        FROM Connections
    ) AS k
    WHERE k.first_call = 1
), LastCall AS (
    SELECT user_id, CAST(call_time AS DATE) AS day, recipient_id
    FROM (
        SELECT *,
            ROW_NUMBER() OVER(PARTITION BY user_id, CAST(call_time AS DATE) ORDER BY call_time) AS last_call
        FROM Connections
    ) AS k
    WHERE k.last_call = 1
)

SELECT DISTINCT(f.user_id)
FROM FirstCall AS f
JOIN LastCall AS l
    ON f.user_id = l.user_id
    AND f.day = l.day
    AND f.recipient_id = l.recipient_id;