Mastering SQL Queries for Data Science Interviews
Posted on Nov 13, 2024 | Estimated Reading Time: 15 minutes
Introduction
Structured Query Language (SQL) is a fundamental tool for data scientists. It allows you to interact with relational databases, extract insights, and prepare data for analysis. Mastering SQL is essential not only for day-to-day tasks but also for acing data science interviews. This guide covers advanced SQL queries and techniques that are commonly asked in technical interviews.
1. Understanding SQL Joins
SQL joins are used to combine rows from two or more tables based on related columns.
Inner Join
Use Case: Retrieve records that have matching values in both tables.
SELECT *
FROM TableA A
INNER JOIN TableB B ON A.key = B.key;
Left Join
Use Case: Retrieve all records from the left table and matched records from the right table.
SELECT *
FROM TableA A
LEFT JOIN TableB B ON A.key = B.key;
Right Join
Use Case: Retrieve all records from the right table and matched records from the left table.
SELECT *
FROM TableA A
RIGHT JOIN TableB B ON A.key = B.key;
Full Outer Join
Use Case: Retrieve all records when there is a match in either left or right table.
SELECT *
FROM TableA A
FULL OUTER JOIN TableB B ON A.key = B.key;
Why It's Important: Joins are fundamental in combining data from multiple tables, a common task in data analysis.
2. Subqueries and Common Table Expressions (CTEs)
Subqueries and CTEs are used to break down complex queries and improve readability.
Subqueries
Use Case: Nest a query within another SQL query.
SELECT employee_id, name
FROM Employees
WHERE department_id IN (
SELECT department_id
FROM Departments
WHERE location = 'New York'
);
Common Table Expressions (CTEs)
Use Case: Define a temporary result set that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement.
WITH DepartmentEmployees AS (
SELECT department_id, COUNT(*) AS employee_count
FROM Employees
GROUP BY department_id
)
SELECT D.department_name, E.employee_count
FROM Departments D
JOIN DepartmentEmployees E ON D.department_id = E.department_id;
Why It's Important: Subqueries and CTEs help in simplifying complex queries and making them more maintainable.
3. Window Functions
Window functions perform calculations across a set of table rows related to the current row.
ROW_NUMBER()
Use Case: Assign a unique sequential integer to rows within a partition.
SELECT employee_id, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS salary_rank
FROM Employees;
LAG() and LEAD()
Use Case: Access data from a previous or next row in the same result set.
SELECT employee_id, salary,
LAG(salary, 1) OVER (ORDER BY salary) AS previous_salary,
LEAD(salary, 1) OVER (ORDER BY salary) AS next_salary
FROM Employees;
Aggregate Window Functions
Use Case: Calculate aggregates like SUM, AVG over partitions.
SELECT department_id, employee_id, salary,
SUM(salary) OVER (PARTITION BY department_id) AS department_total_salary
FROM Employees;
Why It's Important: Window functions are powerful for performing advanced analytics directly in SQL.
4. Advanced Filtering and Sorting
Enhance your queries with advanced WHERE clauses and ORDER BY techniques.
Using CASE Statements
Use Case: Implement conditional logic in SQL queries.
SELECT employee_id, salary,
CASE
WHEN salary > 100000 THEN 'High'
WHEN salary BETWEEN 50000 AND 100000 THEN 'Medium'
ELSE 'Low'
END AS salary_level
FROM Employees;
Filtering with HAVING
Use Case: Filter groups after aggregation.
SELECT department_id, COUNT(*) AS employee_count
FROM Employees
GROUP BY department_id
HAVING COUNT(*) > 5;
Advanced ORDER BY
Use Case: Sort results based on expressions or multiple columns.
SELECT employee_id, department_id, salary
FROM Employees
ORDER BY department_id ASC, salary DESC;
Why It's Important: Advanced filtering and sorting refine your data retrieval to meet specific analysis needs.
5. Set Operations
Combine results from multiple queries using set operators.
UNION and UNION ALL
Use Case: Combine results from two or more SELECT statements.
-- UNION removes duplicates
SELECT employee_id FROM Employees_US
UNION
SELECT employee_id FROM Employees_UK;
-- UNION ALL includes duplicates
SELECT employee_id FROM Employees_US
UNION ALL
SELECT employee_id FROM Employees_UK;
INTERSECT
Use Case: Return records common to both queries.
SELECT employee_id FROM Employees_US
INTERSECT
SELECT employee_id FROM Employees_UK;
EXCEPT (or MINUS)
Use Case: Return records from the first query that are not in the second query.
SELECT employee_id FROM Employees_US
EXCEPT
SELECT employee_id FROM Employees_UK;
Why It's Important: Set operations are useful for comparing datasets and performing data audits.
6. Handling NULLs
Properly manage NULL values in your queries.
IS NULL and IS NOT NULL
Use Case: Check for NULL values in conditions.
SELECT * FROM Employees
WHERE manager_id IS NULL;
COALESCE Function
Use Case: Return the first non-NULL value in a list.
SELECT employee_id, COALESCE(phone, 'No Phone') AS contact_number
FROM Employees;
NULLIF Function
Use Case: Return NULL if two expressions are equal.
SELECT employee_id, NULLIF(bonus, 0) AS adjusted_bonus
FROM Employees;
Why It's Important: Correct handling of NULLs prevents logical errors in queries and calculations.
7. Date and Time Functions
Manipulate date and time data effectively.
Extracting Date Parts
Use Case: Extract specific parts like year, month, or day from a date.
SELECT order_id, order_date,
EXTRACT(YEAR FROM order_date) AS order_year,
EXTRACT(MONTH FROM order_date) AS order_month
FROM Orders;
Date Arithmetic
Use Case: Perform calculations with dates.
SELECT order_id, order_date,
order_date + INTERVAL '7 days' AS delivery_date
FROM Orders;
Datediff Function
Use Case: Calculate the difference between two dates.
SELECT order_id, order_date, shipped_date,
DATEDIFF(day, order_date, shipped_date) AS days_to_ship
FROM Orders;
Why It's Important: Date functions are essential for time-based analyses like cohort studies and trend analysis.
8. String Functions
Manipulate string data for cleaning and analysis.
Concatenation
Use Case: Combine two or more strings.
SELECT first_name || ' ' || last_name AS full_name
FROM Employees;
Substring
Use Case: Extract a substring from a string.
SELECT SUBSTRING(phone_number, 1, 3) AS area_code
FROM Employees;
Upper and Lower Case
Use Case: Convert string to upper or lower case.
SELECT UPPER(first_name) AS first_name_caps
FROM Employees;
Why It's Important: String manipulation is crucial for data cleaning and preparing data for analysis.
9. Temporary Tables and Indexes
Optimize query performance using temporary tables and indexes.
Creating Temporary Tables
Use Case: Store intermediate results for complex queries.
CREATE TEMPORARY TABLE TempSales AS
SELECT * FROM Sales WHERE sale_date > '2024-01-01';
Creating Indexes
Use Case: Improve query performance on large tables.
CREATE INDEX idx_employee_last_name ON Employees(last_name);
Why It's Important: Understanding how to optimize queries is key for working with large datasets efficiently.
10. Stored Procedures and Functions
Encapsulate SQL code for reuse and modularity.
Creating Stored Procedures
Use Case: Execute a set of SQL statements as a single callable unit.
CREATE PROCEDURE GetEmployeesByDepartment(IN dept_id INT)
BEGIN
SELECT * FROM Employees WHERE department_id = dept_id;
END;
Creating User-Defined Functions
Use Case: Return a single value or table from a custom function.
CREATE FUNCTION GetEmployeeCountByDepartment(dept_id INT)
RETURNS INT
BEGIN
DECLARE emp_count INT;
SELECT COUNT(*) INTO emp_count FROM Employees WHERE department_id = dept_id;
RETURN emp_count;
END;
Why It's Important: Stored procedures and functions enhance code reuse and maintainability in complex SQL applications.
Sample Interview Questions
Question 1: How would you find the second highest salary from the Employees table?
Answer: You can use the DENSE_RANK() window function to rank salaries and then select the salary with a rank of 2.
SELECT salary
FROM (
SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
FROM Employees
) ranked_salaries
WHERE salary_rank = 2;
Question 2: Explain the difference between WHERE and HAVING clauses.
Answer: The WHERE clause filters rows before grouping and aggregation occur, whereas the HAVING clause filters groups after aggregation has taken place.
Question 3: How can you optimize a slow query in SQL?
Answer: Several ways to optimize a slow query include:
- Creating indexes on columns used in JOINs and WHERE clauses.
- Refactoring subqueries into JOINs or vice versa, depending on the scenario.
- Limiting the dataset using WHERE clauses to retrieve only necessary data.
- Analyzing the query execution plan to identify bottlenecks.
Conclusion
Mastering advanced SQL queries is essential for any data scientist. The techniques covered in this guide will help you tackle complex data manipulation tasks and impress in technical interviews. Practice these queries on real datasets to deepen your understanding and enhance your data analysis skills.
Additional Resources
- Books:
- SQL for Data Scientists: A Beginner's Guide for Building Datasets for Analysis by Renee M. P. Teate
- Learning SQL by Alan Beaulieu
- Online Tutorials:
- Practice Platforms:
Author's Note
Thank you for reading! If you have any questions or comments, feel free to reach out. Good luck with your data science journey and interview preparations.