View all articles

Apr 22, 2020

SQL

SQL Subqueries vs Temporary Tables vs CTEs

9 min read

John Pauler

Partner & CRO

Currently Reading

SQL Subqueries vs Temporary Tables vs CTEs

Hi everyone,

I hope you’re doing well and continuing to push your learning forward even in these unusual times.

Recently it seems like more and more students have reached out asking about the difference between using subqueries, temporary tables, and common table expressions (CTEs). The questions are often like these...

`“Is it wrong if I did this with a CTE instead of the temporary table that was used in the video?”`

`“Why did you choose to do this with a temporary table instead of a subquery?”`

These are great questions! The short of it is, using subqueries, temporary tables, and CTEs are three related methods that can be used to tackle the same problem in slightly different ways. At their core, they enable you to do very similar things, but each method has differences and advantages in certain situations.

I realize that depending on where you are in your journey with Maven, you may have more or less experience with subqueries, temporary tables, and CTEs…

Beginner MySQL for Data Analysis does not get into these concepts.
Advanced MySQL for Business Intelligence skips CTEs, touches briefly on subqueries, and really focuses heavily on temporary tables.
Our new Beginner MySQL for Database Administration course (currently exclusive to the Maven platform) focuses more on creation and maintenance of data structures, so it really stays away from these concepts entirely

With our SQL audience growing, and having varying levels of previous exposure to these concepts, I should start off with some basic definitions and examples (skip over this if you have this down already).

What is a subquery?

When you use a subquery, you’re writing one complete SELECT statement, which outputs some number of columns and rows, and then you write another SELECT statement, which SELECTs data from the output of your initial SELECT statement, just like you would with a regular table. In practice, you might hear Analysts talking about the subquery as the “inner query”. Subqueries look like this…

SELECT
	first_pageview_seen.website_session_id,
	website_pageviews.pageview_url AS landing_page
FROM
(SELECT
	website_session_id,
	MIN(website_pageview_id) AS first_pv_id
FROM website_pageviews
WHERE created_at BETWEEN '2014-01-01' AND '2014-04-01'
GROUP BY 1) AS first_pageview_seen *-- Subquery gets an alias --*
LEFT JOIN website_pageviews
	ON website_pageviews.website_pageview_id = first_pageview_seen.first_pv_id

What is a common table expression?

When you create a CTE, it’s pretty similar to what you do with a subquery, with some slight differences. Just like with a subquery, you will write a complete SELECT statement to define columns and rows, but this time you will open it up using a WITH( ), and then later in the same query, you can SELECT data from your CTE, just like you would from a regular table. Here is an example…

What is a temporary table?

Similar to subqueries and CTEs, temporary tables are used to define an entity made up of columns and rows, which you can write additional SELECT statements on. Temporary tables are almost exactly like regular tables, except that they only persist during your current session. They are non-permanent. When you want to leverage temporary tables, you’ll use a CREATE TEMPORARY TABLE statement, like this…

When should we use each of these three methods?

The most important thing I can tell you on this is that when faced with a multi-step problem that requires one of these methods, there is no wrong choice. You should go with whatever you feel most comfortable with. Yes, you might hear some know it all tell you your query would have been more efficient if you did it this way, or that way. And sure, that person may be technically right. But that person is probably also just showing off or being a pain in the butt. The reality is, unless you’re dealing with very large data sets, you will not notice the performance difference. So, while I will talk about the differences, and which of these methods I recommend applying to various situations, I want you to know there is no wrong answer, and you should feel good about picking whichever method you like the best.

In the examples pictured earlier, you can see the three methods are pretty similar. None of them is particularly complicated to write or to read. They all get the job done well. Again, there is no wrong answer here. Especially when the analyses are fairly simple like the ones shown above, I really do not think there is much difference, and want you to just pick the method you feel most comfortable with.

As our data problems get more challenging and require more steps, advantages do start to emerge with some of the methods. Personally, I almost never use subqueries at all, and the one recommendation I will make is that you might want to steer away from using subqueries for the more complicated problems. Here’s why…

When the problem is simple, like this two step analysis, then a subquery works fine:

Now, let’s say those first two steps were just the first of three, and we still need to add another layer to our analysis.

Here’s what a three-step analysis looks like with a subquery...

It’s starting to get a little more complicated to follow, and this is just three steps. We have one subquery nested inside of another subquery. You have to do a little mental processing to find the “core” of the query. Think about how you would naturally read this. Starting at the top, you see your final SELECT statement. But then when you get lower down and realize it is SELECTing from a subquery, you need to stop, and figure out what the subquery's dataset is like. So you figure that out, then you go back to the top again to think about the final SELECT, and you have to look down to the bottom to see if there are any JOINs, WHERE conditions, or GROUP BY clauses. You are jumping all over the place just to figure out what is going on. What about if there is yet another step? Could we wrap yet another subquery around this one? We could, but I really do not recommend it. Take a look at the query below...just kidding. I’m not going to put you through that. But picture adding more and more steps, and continuing to wrap more and more queries around this already complicated-looking mess. Gross! Too many nested subqueries!

Let’s look at the same three-step analysis using the CTE method...

While there is still a fair amount of code, it is also a lot easier to follow. Instead of having to read from the inside out like with subqueries, you end up reading top to bottom. Each step of the analysis is very visually distinct, so it is pretty easy to understand what each block of code is doing. To me, this is a major improvement over the subquery method.

How about if you suspect one of your steps might have an error and you need to do some QA work on it. Maybe you just want to do a SELECT * from one of the intermediate sets of columns and rows you have created and see what the data looks like. With the CTE, you can comment out code below that CTE and run a SELECT *. This is pretty easy. With subqueries, you need to “unwrap” the code in question and remove the further steps from above and below. This is much more of a pain.

So, subqueries vs CTEs… the CTE method is easier to read in general, and a lot easier to QA.

How about doing this same thing with temporary tables?

Well, take a look at this group of queries below. It looks fairly similar to the CTE method. We have discrete blocks of code, which you read top to bottom and can process linearly. To QA one of your intermediate steps, you could run a SELECT * FROM statement from any of the tables you are creating. This is pretty similar to the CTE method too.

So am I saying CTEs and temporary tables are very similar? Absolutely!

Notice I have been focusing mainly on readability and ease of QA so far. I have not even mentioned anything about performance when making these comparisons. For a lot of the things you are going to be doing, performance really will not be an issue. If on the other hand, you are going to be working with very large data sets, then we should indeed talk about performance. If we are going to be focused on performance, one of the biggest levers we have to pull in optimizing performance is…

When we have very large numbers of records, having an index created to speed up future JOINs and WHERE conditions can lead to major query performance enhancements. Temporary tables support indexes. Sadly, subqueries and CTEs do not. So, if the datasets start to get really large on you, and you feel like your queries containing subqueries and CTEs are taking a very long time to run, you might want to consider switching over to temporary tables.

Are there any advantages to CTEs? Of course! For one, if you are doing something pretty simple, and you end up needing to change your code slightly, with the CTE, you just edit your code in-line and it works fine. With a temporary table, you run into issues trying to CREATE it a second time (because it already exists) so you need to add another line of code to DROP the previous table before creating your updated version. Is this a huge pain? No, of course not. But it can make CTEs slightly easier to work with for lightweight tasks that you might need to edit as you go. There are some other differences too, but this is getting long winded and I think we have covered the most important considerations for you.

I tend to use CTEs for the simpler stuff, and then switch over to temporary tables when the number of steps needed to solve the problem gets high, or when dealing with really large data sets.

Again, when the problems are simple, any of these three methods work great. When they get harder, I’d recommend staying away from subqueries.

Just to hammer it home...there really is no wrong answer here. You do you. Pick the method you feel most comfortable with for a given situation, and use this as a guide if you are feeling uncertain.

Keep learning!

-John

NEXT WEEK: FREE ACCESS TO OUR DATA SCIENCE PATH!

Looking to learn data science skills?

Join us Monday, July 28 (9 am ET) through Sunday, August 3 (11:59 pm PT) for free access to our completed Python for Data Science Learning Path!

EXPLORE THE PATH

Share this article with your friends

John Pauler

Partner & CRO

John brings over 15 years of business intelligence experience to the Maven team, having worked with companies ranging from Fortune 500 to early-stage startups. As a MySQL expert, he has played leadership roles across analytics, marketing, SaaS and product teams.

View profile

Data Analyst Skills to Tackle After You Learn SQL

Cross-Training to Learn Analytics Skills and Increase Your Value