Would You Pass the Google SQL Interview? Test Yourself with These 7 Concepts
Joel Hereth, Big Tech Interviews
August 4, 2022
How do you prepare for your Google SQL interview? You start by appreciating SQL as a query language, and SQL Server as a relational database management system.
SQL is a database query language that’s been around since 1979 for good reason. Any data scientist or data analyst will tell you that a massive part of their job comes down to gathering and preprocessing raw data. That’s why it’s a necessary part of any data science interview. As an employee of a company that has collected and analyzed 50,000+ interview questions, I can tell you that the most popular SQL questions asked in the Google interviews come and go, but certain concepts always crop up.
The Google SQL interview is a big deal. Google relies on SQL for pretty much all their data crunching and munching, so no matter if you’re interviewing for a position as a data scientist, data analyst, business analyst, or data engineer, you should brush up on these SQL concepts. Remember, for the Google SQL interview, it’s never enough to just spit out SQL queries. You can check out Google SQL interview questions on Glassdoor, but know that this is more than just a technical interview - they will test you on your understanding of the fundamentals behind them, too.
This article will cover the seven most common SQL concepts that show up in Google SQL interviews. There are no guarantees, of course, but if you’re comfortable with these seven concepts, then you can feel confident in your SQL skills.
Most common beginner SQL interview questions
Q1: What is a primary key, a foreign key, and a unique key?
Let’s start with this classic Google SQL interview concept. This is testing you on how well you know the basic DNA of SQL tables.
A primary key uniquely identifies each record in the table. It cannot be null, and you can only have one primary key per table (though it can span multiple columns). In a table of US citizens, a Social Security number would be a great primary key since it can’t be null and it will be unique for everyone.
Now, what about a unique key? It also enforces uniqueness on the column, but it can be null. Ironically, you can have multiple unique key columns per table. Not so unique after all. If you have a table for all your college students, a student ID would be a good unique key since it can’t be duplicated, but if you have a new student, they might not have an ID yet.
The main difference between the two is the primary key identifies each record in the table, and the unique key prevents duplicate entries in a column.
Now, let’s throw in a foreign key. You can kind of think of it as an ambassador. It serves as the link between data in two tables. It can have null values, and it can be duplicated. Your foreign key is the primary key in another table. Say you have two tables: one is a table of your cats, their age, and their vet ID. The second table is the medication you’ve bought for each cat, also with the vet ID.
In the cats table, the vet ID is the primary key. In the medications table, the vet ID acts as the foreign key to the cats table, even though it has duplicate values. Your foreign key stops invalid data from being put into the medications table because it has to match the primary key values from the cats table.
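To make the cats-and-medications example concrete, here’s a minimal sketch using Python’s built-in sqlite3 module. The table and column names (cats, medications, vet_id) are hypothetical, chosen to match the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs with this pragma
conn.execute("""
    CREATE TABLE cats (
        vet_id INTEGER PRIMARY KEY,  -- uniquely identifies each cat
        name   TEXT,
        age    INTEGER
    )""")
conn.execute("""
    CREATE TABLE medications (
        med_name TEXT,
        vet_id   INTEGER REFERENCES cats(vet_id)  -- foreign key, may repeat
    )""")
conn.execute("INSERT INTO cats VALUES (1, 'Mochi', 3)")
conn.execute("INSERT INTO medications VALUES ('flea drops', 1)")

# A medication row pointing at a vet_id that isn't in cats is rejected:
try:
    conn.execute("INSERT INTO medications VALUES ('hairball gel', 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

The failed insert is exactly the foreign key doing its job: medications can only reference cats that actually exist.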
Q2: What's an outer join?
Another SQL join question! You can expect a lot of join-based SQL query interview questions at your Google SQL interview. An outer join, also known as a full outer join, is when you join two tables based on a common column and return all of the records that have values in either the left or right table.
You can memorize that definition, but it’ll be good for you to have a deeper understanding of the concept, too.
For example, let’s say I have four cats. Certain cats have a favorite cat treat, while others don’t. I have some cat treats at home, but I might need to buy more. Because I’m a SQL fan, I have created two tables: one with cats and their favorite treats, another with the treats I have and how many of each treat.
By doing an outer join, I can create a table based on the common treat name column. This shows me whether I have all the treats I need to make my cats happy or not.
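Here’s a sketch of that outer join with sqlite3. Older SQLite versions (before 3.39) don’t support FULL OUTER JOIN directly, so this uses the classic emulation: a LEFT JOIN in each direction, combined with UNION. The table names (cat_favorites, pantry) are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cat_favorites (cat TEXT, treat TEXT);
    CREATE TABLE pantry (treat TEXT, qty INTEGER);
    INSERT INTO cat_favorites VALUES ('Mochi', 'tuna bites'), ('Ziggy', 'salmon puffs');
    INSERT INTO pantry VALUES ('tuna bites', 4), ('catnip chews', 2);
""")

# Emulate a FULL OUTER JOIN on the common treat column: keep every row
# from either side, matched where possible, NULL where not.
rows = conn.execute("""
    SELECT f.cat, f.treat, p.qty
    FROM cat_favorites f LEFT JOIN pantry p ON f.treat = p.treat
    UNION
    SELECT f.cat, p.treat, p.qty
    FROM pantry p LEFT JOIN cat_favorites f ON f.treat = p.treat
""").fetchall()
for row in rows:
    print(row)
```

The result shows treats a cat wants but I don’t stock (qty is NULL) and treats I stock that no cat wants (cat is NULL), which is exactly the shopping-list question the outer join answers.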
Grouping and Aggregating Data
Grouping and aggregating data are crucial foundational skills for SQL beginners. Grouping lets you reshape your tables and build aggregations on top of the raw data they contain.
So when would you group data? There are two primary reasons:
Transform the raw data and perform quick statistical analyses on the data.
Create and launch data pipelines with custom ETLs with aggregated data for specific needs. This summary table might be used in a dashboard or for machine learning algorithms.
Some examples for fictional use cases:
Grouping orders by day of the week to understand the volume of orders for each DOW.
Grouping return data to get the sum of total returns per store and month.
WHERE vs HAVING
Understanding the difference between the WHERE and HAVING clauses is a common issue beginners face when starting to code in SQL. The main difference is that WHERE filters rows before any GROUP BY is applied, while HAVING filters groups after the grouping.
SQL Syntax
What does SQL syntax mean? Essentially, it means a set arrangement of words and commands. If you use improper syntax, the database engine can’t parse your query and will return an error message.
These mistakes are avoidable: remember, during your interview and when you’re on the job, to check your SQL syntax before running a query.
Trust me on this one, you don’t want to be the candidate running a query repeatedly until they get an answer. Instead, focus on avoiding these common SQL syntax mistakes:
Forgetting Brackets, Parentheses, or Quotes
Specifying an Invalid Statement Order
Most common intermediate SQL interview questions
Q1: How do self joins work?
SQL runs on joins of all varieties, so this comes up a lot as a Google SQL query interview question. You have your inner joins, your left joins, your right joins, your full joins, and of course, your self joins. All joins are used to combine rows from two or more tables, based on a related column between them.
The self join, as the name implies, is when you join a table to itself. This is helpful to compare multiple rows on the same table.
SQL doesn’t have its own dedicated operator for this type of query, so to do a self join, you’ll need to use an INNER JOIN operator, one of the OUTER JOIN operators, or a CROSS JOIN.
The main difference is that when you tell SQL to join table A and table B, tables A and B are the same table.
If you have an employee table where every employee can have a manager, and you want to list all employees and the name of their manager:
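Here’s a sketch of that query using sqlite3. The employees table, its columns (id, name, manager_id), and the names in it are hypothetical; the key move is that the same table appears twice under two aliases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES (1, 'Ada', NULL), (2, 'Grace', 1), (3, 'Edsger', 1);
""")

# Self join: employees is joined to itself under the aliases e and m.
# A LEFT JOIN keeps employees (like Ada) who have no manager.
rows = conn.execute("""
    SELECT e.name AS employee, m.name AS manager
    FROM employees e
    LEFT JOIN employees m ON e.manager_id = m.id
    ORDER BY e.id
""").fetchall()
print(rows)
```

Using an INNER JOIN instead would silently drop employees without a manager, which is a classic follow-up question in itself.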
Q2: When would you not want to use a window function?
A set of rows in SQL is called a window or window frame; the frame is defined relative to the current row. Window functions perform a calculation across a set of table rows that are somehow related to the current row. Unlike a GROUP BY, you don’t lose the individual rows. Window functions are great when you need to rank within a group or calculate a running total, and they’re simple and quick to write. But they do have one big downside: they can be slow when you have to read and write a lot of data. For example, if you have 50M rows in a table and your query returns 1M new rows, the window function will run on those 1M new rows and will likely require scanning the existing 50M rows.
Sometimes it's best to rethink your query design in these scenarios. Maybe you don't truly need to scan those 50M rows to implement the window function. Perhaps you just need an aggregate, or you can transform the data in incremental batches; in those cases, a GROUP BY might be better. Bert Wagner has a great example comparing the performance of window functions to GROUP BYs.
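The contrast is easy to see side by side. This sketch (hypothetical sales table; window functions need SQLite 3.25+) computes a per-region running total with a window function, then the same totals with a GROUP BY, which collapses the rows.

```python
import sqlite3  # window functions require SQLite 3.25+ under the hood

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (day INTEGER, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, 'us', 10), (2, 'us', 20), (1, 'eu', 5), (2, 'eu', 15);
""")

# Window function: running total per region, every input row preserved.
windowed = conn.execute("""
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales ORDER BY region, day
""").fetchall()

# GROUP BY: one collapsed row per region, cheaper if a total is all you need.
grouped = conn.execute("""
    SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region
""").fetchall()

print(windowed)
print(grouped)
```

If the dashboard only needs the regional totals, the GROUP BY version does strictly less work; the window version earns its cost only when the per-row running totals matter.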
That said, the Google SQL interviewer is not looking for a straight answer from you. They want a thoughtful reply that proves you know a lot about what goes on under the hood. Group_bys and window functions can offer similar performance and results. Ultimately, it’s down to testing to see which is more efficient.
(The other, slightly cheekier answer is that if you’re stuck on an older engine, such as SQL Server before 2012, MySQL before 8.0, or SQLite before 3.25, window functions aren't supported.)
CASE WHEN
CASE WHEN is as simple as an if-then-else statement. When a condition is met, it returns the corresponding output; if no conditions are met, it returns the value in the ELSE clause.
The CASE WHEN statement is extremely versatile and can be used to categorize and aggregate data.
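Here’s a minimal categorization sketch with sqlite3. The donations table and the size buckets are invented; the point is that the first matching WHEN wins and ELSE is the fallback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE donations (amount INTEGER);
    INSERT INTO donations VALUES (5), (40), (500);
""")

# CASE WHEN evaluates conditions top to bottom; the first match wins,
# and ELSE supplies the value when nothing matches.
rows = conn.execute("""
    SELECT amount,
           CASE WHEN amount >= 100 THEN 'large'
                WHEN amount >= 25  THEN 'medium'
                ELSE 'small'
           END AS size
    FROM donations ORDER BY amount
""").fetchall()
print(rows)
```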
Union vs Union All
When you want to concatenate multiple result sets, you’ll use Union or Union All. The main distinguishing feature is that Union All keeps duplicates, while Union doesn’t.
Your Google SQL interviewer won’t be happy with just that answer, though. It’s worth thinking about individual cases where each might be useful. Union All typically performs better than Union, since Union involves an expensive distinct SORT operation.
If you know your records are already unique, use the faster Union All rather than Union. But if you need to remove duplicates, then go for Union: choosing Union All and then Distinct is slower than using Union.
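The duplicate-handling difference is one line of SQL to demonstrate. Tables a and b here are hypothetical, overlapping on the value 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2);
    INSERT INTO b VALUES (2), (3);
""")

# UNION deduplicates (paying for a distinct sort); UNION ALL just concatenates.
union_rows     = conn.execute("SELECT x FROM a UNION SELECT x FROM b ORDER BY x").fetchall()
union_all_rows = conn.execute("SELECT x FROM a UNION ALL SELECT x FROM b ORDER BY x").fetchall()
print(union_rows)      # 2 appears once
print(union_all_rows)  # 2 appears twice
```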
EXCEPT vs NOT IN
EXCEPT and NOT IN operate almost identically. They’re both used to compare the rows between two tables.
But there are three key differences between EXCEPT and NOT IN.
EXCEPT filters the DISTINCT values from the left-hand table that do not appear in the right-hand table. It's essentially the same as doing a NOT EXISTS with a DISTINCT clause.
EXCEPT also needs the two tables (or the subset of columns selected from each) to have the same number of columns on the left- and right-hand sides of the query.
NOT IN does not filter for DISTINCT values and returns all values from the left-hand table that do not appear in the right-hand table.
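The distinct-vs-not difference shows up as soon as the left-hand table contains duplicates. A sketch with invented tables left_t and right_t:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_t (x INTEGER);
    CREATE TABLE right_t (x INTEGER);
    INSERT INTO left_t VALUES (1), (1), (2), (3);
    INSERT INTO right_t VALUES (3);
""")

# EXCEPT returns DISTINCT left-hand values absent from the right-hand side.
except_rows = conn.execute(
    "SELECT x FROM left_t EXCEPT SELECT x FROM right_t ORDER BY x").fetchall()

# NOT IN keeps every qualifying row, duplicates included.
not_in_rows = conn.execute(
    "SELECT x FROM left_t WHERE x NOT IN (SELECT x FROM right_t) ORDER BY x").fetchall()

print(except_rows)  # 1 appears once
print(not_in_rows)  # 1 appears twice
```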
Most common advanced SQL interview questions
Q1: When is it appropriate to use mean vs. median?
This common SQL interview question actually tests knowledge of fundamental statistics rather than SQL. Both the mean and the median provide insight into the “normal” or center value of a dataset. But there’s a time and a place for each.
A mean is the arithmetic average of a group of values. To compute the mean, add up all the values and divide by the total number of entries.
A median is the center of a sorted group of values. If there is an even number of values, take the mean of the middle two numbers.
Let’s say in my family of 11, The Cantaloupes, we all eat one watermelon a year except for my sister Melonie who eats 50 watermelons a year. The mean number of watermelons a person in my family eats per year works out to (1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 50) / 11 ≈ 5.45 watermelons. But the median would be 1. That’s quite a difference.
To nail the Google SQL interview, you should definitely know how to find the mean and the median. Luckily, SQL has some built-in features to handle that. You can find the mean with:
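For the mean, every SQL dialect has AVG(). A sketch with sqlite3, using an invented watermelons table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE watermelons (person TEXT, eaten INTEGER);
    INSERT INTO watermelons VALUES
        ('Melonie', 50), ('Mel', 1), ('Watson', 1);
""")

# AVG() computes the arithmetic mean directly.
mean = conn.execute("SELECT AVG(eaten) FROM watermelons").fetchone()[0]
print(round(mean, 2))
```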
There’s no equally nifty function in SQL to find the median, so you’ll need to use the PERCENTILE_CONT function. Here’s a great example from MyTecBits where they walk through how to find the median using Microsoft’s sample database WideWorldImporters.
As I mentioned above, you also need to know how and when to use each one. When you have a nice, symmetric dataset with no outliers or tails, then you should use the mean. However, if there are outliers, or you have an asymmetric dataset, it’s better to use the median, which is less sensitive to skewing due to outliers.
In my example above, Melonie skews our mean family watermelon-eating value because she’s such a fiend for watermelon. A normal Cantaloupe family member eats one watermelon a year. Since the data has such an outlier, we should use the median, not the mean, to find the center value.
Q2: What's your ETL and OLAP experience?
When an interviewer asks you one of the more advanced SQL interview questions, what they’re really looking for is your familiarity with the subject. They want to know that you know what all these acronyms mean. Here’s your cheat sheet.
ETL, or extract-transform-load, can be done in multiple languages, but SQL is very good at it. ETL involves retrieving data from database A, doing something useful to it, then storing it into database B. SQL is the standard language to access databases, so it’s geared to do ETL processes.
OLAP, or an online analytical processing system, mainly focuses on extracting information from large databases for sophisticated analysis. Your OLAP tool will make some pretty reports and graphs based on your ETL’d data.
Bonus: At your Google SQL interview, you might be asked about a third acronym, OLTP. OLTP stands for online transactional processing. It’s designed to handle large quantities of transactions and is commonly used as the step before ETL, where you source your data from.
When an interviewer asks what your experience with these concepts is, you should first demonstrate you know what they are. Then you can chat about how you’ve used them in your professional experience. If you haven’t yet, I recommend finding a fun portfolio project like this one so you can speak knowledgeably about your ETL experience.
Date and Time Extraction
In intermediate to advanced SQL interviews, it’s fairly common to be asked to aggregate the raw data by day, week, or month. Most of the time, this data isn’t in a separate column (e.g. month_col) but rather stored as a raw timestamp.
Every SQL environment, such as Postgres or MySQL, has its own variations of extract functions. In this case, I’ve listed the MySQL version:
EXTRACT(part_of_date FROM date_time_column_name)
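As a runnable sketch: SQLite (used here via Python) doesn’t implement EXTRACT, so strftime plays the same role; in MySQL or Postgres you’d write EXTRACT(MONTH FROM created_at). The orders table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, created_at TEXT);
    INSERT INTO orders VALUES (1, '2022-08-04 09:30:00'), (2, '2022-07-15 18:00:00');
""")

# Pull the month out of the raw timestamp, then aggregate by it.
rows = conn.execute("""
    SELECT strftime('%m', created_at) AS month, COUNT(*) AS n
    FROM orders GROUP BY month ORDER BY month
""").fetchall()
print(rows)
```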
ROW_NUMBER() vs RANK() vs DENSE_RANK()
Ranking records is another commonly used concept while working with real data sets. Companies use it in different scenarios such as:
Ranking top selling brands by number of units sold
Ranking top product categories by number of orders or revenue generated
Retrieving movie names with the highest number of views in each genre
ROW_NUMBER(), RANK(), and DENSE_RANK() are all used to assign sequential integers to records within each partition of the result set.
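The three functions only differ when there are ties, so here’s a sketch with two brands tied on units sold (hypothetical brands table; ranking window functions need SQLite 3.25+):

```python
import sqlite3  # ranking window functions require SQLite 3.25+

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE brands (name TEXT, units INTEGER);
    INSERT INTO brands VALUES ('A', 100), ('B', 100), ('C', 80);
""")

rows = conn.execute("""
    SELECT name,
           ROW_NUMBER() OVER (ORDER BY units DESC) AS row_num,   -- always 1, 2, 3, ...
           RANK()       OVER (ORDER BY units DESC) AS rnk,       -- ties share a rank, gaps follow
           DENSE_RANK() OVER (ORDER BY units DESC) AS dense_rnk  -- ties share a rank, no gaps
    FROM brands
    ORDER BY row_num
""").fetchall()
for row in rows:
    print(row)
```

A and B both get rank 1; RANK then skips to 3 for C, while DENSE_RANK gives C a 2. ROW_NUMBER ignores the tie and breaks it arbitrarily.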
Calculating Delta Values
Comparing values from different periods is another common application. For example, if you needed to write a query to find the customer(s) who made two different orders within a two day window, you’d leverage functions such as LEAD() or LAG() to help compare the values from different periods.
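A sketch of that exact two-day-window problem with LAG() (hypothetical orders table; LAG/LEAD need SQLite 3.25+). LAG pulls each customer’s previous order date so the gap between consecutive orders can be computed per row:

```python
import sqlite3  # LAG()/LEAD() require SQLite 3.25+

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_date TEXT);
    INSERT INTO orders VALUES
        ('ann', '2022-08-01'), ('ann', '2022-08-02'),
        ('bob', '2022-08-01'), ('bob', '2022-08-10');
""")

# LAG() fetches the previous order date within each customer's partition;
# julianday() turns the date strings into numbers so we can subtract them.
rows = conn.execute("""
    SELECT customer FROM (
        SELECT customer,
               julianday(order_date) -
               julianday(LAG(order_date) OVER (
                   PARTITION BY customer ORDER BY order_date)) AS gap_days
        FROM orders
    )
    WHERE gap_days <= 2
""").fetchall()
print(rows)
```

Ann’s orders are one day apart, so she qualifies; Bob’s nine-day gap filters him out (and each customer’s first order has a NULL gap, which the WHERE clause also drops).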
Nail your Google SQL interview with these seven concepts
Your Google SQL interview rounds will test you on these seven concepts explicitly and implicitly. Here’s a recap of the seven concepts you need to understand to crush your Google SQL interview:
Primary, foreign, and unique keys
Joins of all kinds
Grouping and aggregating
Window functions
CASE WHEN and set operators
Means and medians
ETL and OLAP
Remember, you might get straight-up asked about primary keys in your recruiter interview to make sure you can memorize basic definitions. But your interview with a stakeholder from the sales team will probably reference ETL and OLAP. If you get hired, you’ll be making the decisions that affect their bottom line. Google interview questions test understanding, not memorization.
It’s not enough to just know which SQL queries are likely to be given in a Google interview; you need to be knowledgeable about how the subject fits into data science and its impact on your soon-to-be colleagues. First, get comfortable with the background and functionality of these phrases and queries. Then you can move on to the next step: practice solving the problems.
Your non-profit company has assigned you the task of calculating some simple statistics on donations. You have an array of integers, representing various amounts of donations your company has been given. In particular, you're interested in the median value for donations.
The median is the middle number of a sorted list of numbers. If the list is of even length, the 2 middle values are averaged.
Write a function that takes an array of integers as an argument and returns the median of those integers.
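One possible solution sketch, following the definition above (middle of the sorted list, or the average of the two middle values for an even-length list):

```python
def median(values):
    """Median of a list of integers: the middle of the sorted list,
    or the average of the two middle values for even-length lists."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([10, 1, 5]))     # 5
print(median([10, 1, 5, 8]))  # (5 + 8) / 2 = 6.5
```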
For this challenge, you need to create a UNION statement. There are two tables, ussales and eusales; the parent company tracks each sale at its respective location in each table. You have been tasked with combining that data for future analysis. You must also filter the sale price so the query only returns rows with a sale greater than 50.00. Order by location (US before EU), then by id.
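One way to sketch a solution with sqlite3. The ussales/eusales column names (id, sale) are assumptions, since the challenge doesn’t spell out the schema; UNION ALL is used so that a legitimate duplicate sale in both regions wouldn’t be silently dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ussales (id INTEGER, sale REAL);
    CREATE TABLE eusales (id INTEGER, sale REAL);
    INSERT INTO ussales VALUES (1, 75.0), (2, 30.0);
    INSERT INTO eusales VALUES (1, 120.0), (2, 45.0);
""")

# Tag each table with a location label, filter sales above 50.00, and
# sort US before EU ('US' > 'EU' alphabetically, so DESC works), then by id.
rows = conn.execute("""
    SELECT 'US' AS location, id, sale FROM ussales WHERE sale > 50.00
    UNION ALL
    SELECT 'EU' AS location, id, sale FROM eusales WHERE sale > 50.00
    ORDER BY location DESC, id
""").fetchall()
print(rows)
```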