Mastering SQL Queries for Data Science Interviews

Posted on Nov 13, 2024 | Estimated Reading Time: 15 minutes

Introduction

Structured Query Language (SQL) is a fundamental tool for data scientists. It allows you to interact with relational databases, extract insights, and prepare data for analysis. Mastering SQL is essential not only for day-to-day tasks but also for acing data science interviews. This guide covers advanced SQL queries and techniques that are commonly asked in technical interviews.

1. Understanding SQL Joins

SQL joins are used to combine rows from two or more tables based on related columns.

Inner Join

Use Case: Retrieve records that have matching values in both tables.


SELECT *
FROM TableA A
INNER JOIN TableB B ON A.key = B.key;

Left Join

Use Case: Retrieve all records from the left table and matched records from the right table.


SELECT *
FROM TableA A
LEFT JOIN TableB B ON A.key = B.key;

Right Join

Use Case: Retrieve all records from the right table and matched records from the left table.


SELECT *
FROM TableA A
RIGHT JOIN TableB B ON A.key = B.key;

Full Outer Join

Use Case: Retrieve all records when there is a match in either left or right table.


SELECT *
FROM TableA A
FULL OUTER JOIN TableB B ON A.key = B.key;

Why It's Important: Joins are fundamental in combining data from multiple tables, a common task in data analysis.

2. Subqueries and Common Table Expressions (CTEs)

Subqueries and CTEs are used to break down complex queries and improve readability.

Subqueries

Use Case: Nest a query within another SQL query.


SELECT employee_id, name
FROM Employees
WHERE department_id IN (
    SELECT department_id
    FROM Departments
    WHERE location = 'New York'
);

Common Table Expressions (CTEs)

Use Case: Define a temporary result set that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement.


WITH DepartmentEmployees AS (
    SELECT department_id, COUNT(*) AS employee_count
    FROM Employees
    GROUP BY department_id
)
SELECT D.department_name, E.employee_count
FROM Departments D
JOIN DepartmentEmployees E ON D.department_id = E.department_id;

Why It's Important: Subqueries and CTEs help in simplifying complex queries and making them more maintainable.

3. Window Functions

Window functions perform calculations across a set of table rows related to the current row.

ROW_NUMBER()

Use Case: Assign a unique sequential integer to rows within a partition.


SELECT employee_id, salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS salary_rank
FROM Employees;

LAG() and LEAD()

Use Case: Access data from a previous or next row in the same result set.


SELECT employee_id, salary,
       LAG(salary, 1) OVER (ORDER BY salary) AS previous_salary,
       LEAD(salary, 1) OVER (ORDER BY salary) AS next_salary
FROM Employees;

Aggregate Window Functions

Use Case: Calculate aggregates like SUM, AVG over partitions.


SELECT department_id, employee_id, salary,
       SUM(salary) OVER (PARTITION BY department_id) AS department_total_salary
FROM Employees;

Why It's Important: Window functions are powerful for performing advanced analytics directly in SQL.

4. Advanced Filtering and Sorting

Enhance your queries with advanced WHERE clauses and ORDER BY techniques.

Using CASE Statements

Use Case: Implement conditional logic in SQL queries.


SELECT employee_id, salary,
       CASE
           WHEN salary > 100000 THEN 'High'
           WHEN salary BETWEEN 50000 AND 100000 THEN 'Medium'
           ELSE 'Low'
       END AS salary_level
FROM Employees;

Filtering with HAVING

Use Case: Filter groups after aggregation.


SELECT department_id, COUNT(*) AS employee_count
FROM Employees
GROUP BY department_id
HAVING COUNT(*) > 5;

Advanced ORDER BY

Use Case: Sort results based on expressions or multiple columns.


SELECT employee_id, department_id, salary
FROM Employees
ORDER BY department_id ASC, salary DESC;

Why It's Important: Advanced filtering and sorting refine your data retrieval to meet specific analysis needs.

5. Set Operations

Combine results from multiple queries using set operators.

UNION and UNION ALL

Use Case: Combine results from two or more SELECT statements.


-- UNION removes duplicates
SELECT employee_id FROM Employees_US
UNION
SELECT employee_id FROM Employees_UK;

-- UNION ALL includes duplicates
SELECT employee_id FROM Employees_US
UNION ALL
SELECT employee_id FROM Employees_UK;

INTERSECT

Use Case: Return records common to both queries.


SELECT employee_id FROM Employees_US
INTERSECT
SELECT employee_id FROM Employees_UK;

EXCEPT (or MINUS)

Use Case: Return records from the first query that are not in the second query.


SELECT employee_id FROM Employees_US
EXCEPT
SELECT employee_id FROM Employees_UK;

Why It's Important: Set operations are useful for comparing datasets and performing data audits.

6. Handling NULLs

Properly manage NULL values in your queries.

IS NULL and IS NOT NULL

Use Case: Check for NULL values in conditions.


SELECT * FROM Employees
WHERE manager_id IS NULL;

COALESCE Function

Use Case: Return the first non-NULL value in a list.


SELECT employee_id, COALESCE(phone, 'No Phone') AS contact_number
FROM Employees;

NULLIF Function

Use Case: Return NULL if two expressions are equal.


SELECT employee_id, NULLIF(bonus, 0) AS adjusted_bonus
FROM Employees;

Why It's Important: Correct handling of NULLs prevents logical errors in queries and calculations.

7. Date and Time Functions

Manipulate date and time data effectively.

Extracting Date Parts

Use Case: Extract specific parts like year, month, or day from a date.


SELECT order_id, order_date,
       EXTRACT(YEAR FROM order_date) AS order_year,
       EXTRACT(MONTH FROM order_date) AS order_month
FROM Orders;

Date Arithmetic

Use Case: Perform calculations with dates.


SELECT order_id, order_date,
       order_date + INTERVAL '7 days' AS delivery_date
FROM Orders;

Datediff Function

Use Case: Calculate the difference between two dates.


SELECT order_id, order_date, shipped_date,
       DATEDIFF(day, order_date, shipped_date) AS days_to_ship
FROM Orders;

Why It's Important: Date functions are essential for time-based analyses like cohort studies and trend analysis.

8. String Functions

Manipulate string data for cleaning and analysis.

Concatenation

Use Case: Combine two or more strings.


SELECT first_name || ' ' || last_name AS full_name
FROM Employees;

Substring

Use Case: Extract a substring from a string.


SELECT SUBSTRING(phone_number, 1, 3) AS area_code
FROM Employees;

Upper and Lower Case

Use Case: Convert string to upper or lower case.


SELECT UPPER(first_name) AS first_name_caps
FROM Employees;

Why It's Important: String manipulation is crucial for data cleaning and preparing data for analysis.

9. Temporary Tables and Indexes

Optimize query performance using temporary tables and indexes.

Creating Temporary Tables

Use Case: Store intermediate results for complex queries.


CREATE TEMPORARY TABLE TempSales AS
SELECT * FROM Sales WHERE sale_date > '2024-01-01';

Creating Indexes

Use Case: Improve query performance on large tables.


CREATE INDEX idx_employee_last_name ON Employees(last_name);

Why It's Important: Understanding how to optimize queries is key for working with large datasets efficiently.

10. Stored Procedures and Functions

Encapsulate SQL code for reuse and modularity.

Creating Stored Procedures

Use Case: Execute a set of SQL statements as a single callable unit.


CREATE PROCEDURE GetEmployeesByDepartment(IN dept_id INT)
BEGIN
    SELECT * FROM Employees WHERE department_id = dept_id;
END;

Creating User-Defined Functions

Use Case: Return a single value or table from a custom function.


CREATE FUNCTION GetEmployeeCountByDepartment(dept_id INT)
RETURNS INT
BEGIN
    DECLARE emp_count INT;
    SELECT COUNT(*) INTO emp_count FROM Employees WHERE department_id = dept_id;
    RETURN emp_count;
END;

Why It's Important: Stored procedures and functions enhance code reuse and maintainability in complex SQL applications.

Sample Interview Questions

Question 1: How would you find the second highest salary from the Employees table?

Answer: You can use the DENSE_RANK() window function to rank salaries and then select the salary with a rank of 2.


SELECT salary
FROM (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM Employees
) ranked_salaries
WHERE salary_rank = 2;

Question 2: Explain the difference between WHERE and HAVING clauses.

Answer: The WHERE clause filters rows before grouping and aggregation occur, whereas the HAVING clause filters groups after aggregation has taken place.

Question 3: How can you optimize a slow query in SQL?

Answer: Several ways to optimize a slow query include:

Creating indexes on columns used in JOINs and WHERE clauses.
Refactoring subqueries into JOINs or vice versa, depending on the scenario.
Limiting the dataset using WHERE clauses to retrieve only necessary data.
Analyzing the query execution plan to identify bottlenecks.

Conclusion

Mastering advanced SQL queries is essential for any data scientist. The techniques covered in this guide will help you tackle complex data manipulation tasks and impress in technical interviews. Practice these queries on real datasets to deepen your understanding and enhance your data analysis skills.

Additional Resources

Books:
- SQL for Data Scientists: A Beginner's Guide for Building Datasets for Analysis by Renee M. P. Teate
- Learning SQL by Alan Beaulieu
Online Tutorials:
- SQL Tutorial
- Mode Analytics SQL Tutorial
Practice Platforms:
- HackerRank SQL Challenges
- LeetCode Database Problems

Author's Note

Thank you for reading! If you have any questions or comments, feel free to reach out. Good luck with your data science journey and interview preparations.

← Back to Blogs