As a way to mitigate stress on databases, I sometimes give SQL optimization classes to BI and Data Science folks. Here is one of those presentations, dug up from 2013. The content was aimed at solving specific problems, such as out-of-control ad hoc queries doing suboptimal repartitioning and killing database servers; this class and others were part of the solution to that problem. There are some generally useful tips here nonetheless.
Topics to be covered
- When should I use Derived Tables or Subselects?
- When should intermediate tables be used?
- When will functions slow down my query?
- How can formatting SQL help me?
- Tips on using dates.
- When should ANALYZE/OPTIMIZE be used?
- Tips on GROUP BY vs DISTINCT, DISTINCT ON, LIKE, and Indexes.
- An example optimizing a query using derived tables and indexes.
What is a Derived Table or Subselect?
A derived table is a way to encapsulate a table within a FROM clause.
Example of a derived table, where the derived table is aliased this_table (note the derived table must select page_category_id because the outer query references it):
SELECT COUNT(DISTINCT fwr.uvn) AS cnt
     , COUNT(DISTINCT CASE WHEN this_table.page_category_id = 7 THEN fwr.uvn END) AS signup_visitors
FROM tablename_20130901 fwr
INNER JOIN
    (
    SELECT uvn
         , page_category_id
    FROM tablename_20130901
    WHERE page_category_id IN (5,7)
    ) this_table
    ON this_table.uvn = fwr.uvn;
A subselect is a way to encapsulate a query within a SELECT, FROM or WHERE clause.
Example of a subselect used in the WHERE clause:
SELECT COUNT(DISTINCT fwr.uvn) AS cnt
     , COUNT(DISTINCT CASE WHEN page_category_id = 7 THEN fwr.uvn END) AS signup_visitors
FROM tablename_20130901 fwr
WHERE fwr.uvn IN
    (
    SELECT uvn
    FROM tablename_20130901
    WHERE page_category_id IN (7)
    );
Benefits of Derived Tables and Subselects
- Easy to write
- Derived Tables can be used multiple times within the same query
Drawbacks of Derived Tables and Subselects
- The result set will not use an index
- May consume too much memory
- May not be analyzed in an EXPLAIN plan
- Bad performance when the Derived Table or Subselect is an expensive operation
- May force repartitioning (on MPP databases like Redshift)
Alternatives to using a Derived Table or Subselect
- Use an INNER JOIN
- Persist your data into an intermediate (aka temporary) table then use the temporary table in your query
Common Table Expressions are another alternative, but CTEs suffer from the same problems. If you begin your query with WITH then more power to you, but don't expect it to be performant, and definitely do not put it in production. When performance is not an issue, all of these methods are fine. When you're pushing things to production, or when performance is a concern, optimize your code.
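As a sketch of the first alternative, here is an IN (...) subselect rewritten as a plain INNER JOIN, run against an in-memory SQLite database (the page_views table, the uvn column, and page_category_id 7 meaning "signup page" are all invented for illustration):

```python
import sqlite3

# Hypothetical stand-in data: uvn is a unique visitor number,
# page_category_id 7 is assumed to be the signup page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (uvn INTEGER, page_category_id INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [(1, 7), (1, 5), (2, 5), (3, 7), (4, 2)])

# Subselect version: filter visitors through an IN (...) subquery.
subselect_count = conn.execute("""
    SELECT COUNT(DISTINCT uvn) FROM page_views
    WHERE uvn IN (SELECT uvn FROM page_views WHERE page_category_id = 7)
""").fetchone()[0]

# Join version: the same filter expressed as an INNER JOIN.
# COUNT(DISTINCT ...) guards against the row multiplication a join can cause.
join_count = conn.execute("""
    SELECT COUNT(DISTINCT pv.uvn)
    FROM page_views pv
    INNER JOIN page_views signups
        ON signups.uvn = pv.uvn AND signups.page_category_id = 7
""").fetchone()[0]

print(subselect_count, join_count)  # visitors 1 and 3 in both cases
```

Both forms return the same answer; the join form is often easier for the optimizer to plan well, especially when the join column is indexed.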
When should an intermediate table be used?
- To convert one expensive long query to many smaller fast queries
- When the intermediate result needs to be used multiple times in subsequent queries
- To take advantage of caching
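A minimal sketch of the pattern in SQLite (table and column names invented): one expensive aggregation is persisted into a temporary table once, and the small result is then reused by several cheap follow-up queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 10, 5.0), (2, 10, 7.5), (3, 20, 3.0), (4, 20, 1.0), (5, 30, 9.0)])

# Run the expensive aggregation exactly once...
conn.execute("""
    CREATE TEMPORARY TABLE customer_totals AS
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
""")

# ...then reuse the small intermediate table in multiple fast queries.
big_spenders = conn.execute(
    "SELECT COUNT(*) FROM customer_totals WHERE total > 5").fetchone()[0]
grand_total = conn.execute(
    "SELECT SUM(total) FROM customer_totals").fetchone()[0]
print(big_spenders, grand_total)
```

On a real warehouse the win is larger: the base table may hold billions of rows, while the intermediate table is small enough to scan, cache, or even index cheaply.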
When will functions slow down my query?
Using a function on a column in a JOIN or WHERE condition can prevent an index from being used.
SELECT COUNT(*) FROM orders WHERE DATE(orders.order_date) = '2013-09-01';
Assume an index exists on the column order_date. The database has this operation to process: DATE(orders.order_date) = '2013-09-01'. The first thing the database will do is evaluate the DATE() function, then compare the result to the date '2013-09-01'. The problem is that once the DATE() operation runs, the existing index cannot be used. Here is a solution that still uses the index:
SELECT COUNT(*) FROM orders WHERE orders.order_date >= '2013-09-01' AND orders.order_date < '2013-09-02';
Here the order_date is not modified. The database engine is able to use the index. It takes a little longer to write but will run faster.
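The effect is easy to observe with SQLite's EXPLAIN QUERY PLAN (the exact plan text differs per database, but MySQL and Postgres behave the same way for this pattern):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.execute("CREATE INDEX idx_order_date ON orders (order_date)")

def plan(sql):
    # Each EXPLAIN QUERY PLAN row is (id, parent, notused, detail);
    # join the detail strings into one plan description.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Wrapping the indexed column in a function forces a full table scan.
scan_plan = plan(
    "SELECT COUNT(*) FROM orders WHERE DATE(order_date) = '2013-09-01'")

# Comparing the bare column lets the optimizer use the index.
range_plan = plan(
    "SELECT COUNT(*) FROM orders "
    "WHERE order_date >= '2013-09-01' AND order_date < '2013-09-02'")

print(scan_plan)   # a SCAN: every row is examined
print(range_plan)  # a SEARCH using idx_order_date
```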
How can formatting SQL help?
If someone else will be reading your code then please format it. More important than how you format your code is that you format it at all. Just make it readable.
- You may recognize hard to find bugs
- Your code will be easier to read and explain to others
- Your code will be easier to maintain in the future
- You will be rewarded with good karma when someone else must read your code!
- Capitalize reserved words (SELECT, FROM, WHERE, GROUP BY, CASE, INNER JOIN, SUM, MAX, etc)
- Use indentation
- Align your code vertically
- Use descriptive aliases, not single characters (You’re joining tables a, b, & c together, what does that mean?!)
select count(*) as count, sum(oll.price_amount) as sum from table_log_extended oll join orders o on o.order_id = oll.order_id join table ol on ol.order_id = oll.order_id and ol.line_id = oll.line_id join table_log_extended oll2 on oll.orig_table_log_id = oll2.table_log_id where oll.cust_monetary_transaction_id > 0 and oll.price_amount <> 0 and o.order_date >= '2011-01-01' and oll.table_log_refund_type_ > 0 and oll.table_log_trans_type_ = 13 and oll.created_date >= ? and oll.created_date < ? and ol.table_category_ = '4' and ol.table_subcategory_ = '7' and oll.is_test = 0;
SELECT COUNT(*) AS count
     , SUM(oll.price_amount) AS sum
FROM table_log_extended oll
INNER JOIN table_log_extended oll2
    ON oll.orig_table_log_id = oll2.table_log_id
INNER JOIN orders o
    ON o.order_id = oll.order_id
    AND o.order_date >= '2011-01-01'
INNER JOIN table ol
    ON ol.order_id = oll.order_id
    AND ol.line_id = oll.line_id
    AND ol.table_category_ = '4'
    AND ol.table_subcategory_ = '7'
WHERE oll.created_date >= '20130901'
    AND oll.created_date < '20140901'
    AND oll.cust_monetary_transaction_id > 0
    AND oll.is_test = 0
    AND oll.table_log_refund_type_ > 0
    AND oll.table_log_trans_type_ = 13
    AND oll.price_amount <> 0
;
Now that this code is formatted, it is much easier to recognize that there are two joins to the same table (table_log_extended).
Tips on using dates
Using a BETWEEN clause may have unintended consequences:
For a timestamp column, the following does not include rows from 2013-09-30 that have a time component, because a value like '2013-09-30 10:30:00' sorts after the bare date '2013-09-30':
SELECT COUNT(*) FROM orders WHERE order_date BETWEEN '2013-09-01' AND '2013-09-30';
The following does include all of 2013-09-30:
SELECT COUNT(*) FROM orders WHERE order_date >= '2013-09-01' AND order_date < '2013-10-01';
Say you are doing ad hoc analysis and constantly updating dates in your files. You may want to use date functions as a way to minimize your manual work. In this example, instead of replacing both 2013-09-01 and 2013-10-01 we can use date functions and only need to replace one date value.
SELECT COUNT(*) FROM orders WHERE order_date >= '2013-09-01' AND order_date < '2013-10-01';
SELECT COUNT(*) FROM orders WHERE order_date >= '2013-09-01' AND order_date < DATE_ADD('2013-09-01', INTERVAL 1 MONTH);
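The BETWEEN pitfall can be reproduced in SQLite, where timestamps stored as text compare lexically (MySQL's DATETIME comparisons against date strings behave the same way for this example; the table is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, '2013-09-01 08:00:00'),
                  (2, '2013-09-30 10:30:00'),   # falls on the last day of the range
                  (3, '2013-10-01 00:00:00')])

# BETWEEN compares against the bare date '2013-09-30', which sorts before
# any timestamp on that day, so order 2 is silently dropped.
between_count = conn.execute("""
    SELECT COUNT(*) FROM orders
    WHERE order_date BETWEEN '2013-09-01' AND '2013-09-30'
""").fetchone()[0]

# The half-open range keeps every timestamp on 2013-09-30.
range_count = conn.execute("""
    SELECT COUNT(*) FROM orders
    WHERE order_date >= '2013-09-01' AND order_date < '2013-10-01'
""").fetchone()[0]

print(between_count, range_count)  # 1 vs 2: BETWEEN missed order 2
```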
When should I use OPTIMIZE and ANALYZE (MySQL)?
- ANALYZE TABLE examines key distribution and updates the metadata the optimizer uses
- Can speed up queries
- OPTIMIZE TABLE reorganizes the physical data of a table, similar to defragmenting a hard drive
- Both should be run after substantial changes to a table (UPDATE/DELETE/INSERT)
Do not run OPTIMIZE or ANALYZE on
- New tables
- Small tables
- Tables that are not modified often
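SQLite has a rough analogue of MySQL's ANALYZE TABLE, which makes the effect easy to see: its ANALYZE command records key-distribution statistics that the optimizer reads on later queries (table and index names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status INTEGER)")
# 300 rows spread evenly over 3 status values.
conn.executemany("INSERT INTO orders (status) VALUES (?)",
                 [(i % 3,) for i in range(300)])
conn.execute("CREATE INDEX idx_status ON orders (status)")

# Gather key-distribution statistics for the optimizer.
conn.execute("ANALYZE")

# The statistics land in the sqlite_stat1 system table; a stat such as
# "300 100" means ~300 rows with ~100 rows per distinct status value.
stats = conn.execute(
    "SELECT idx, stat FROM sqlite_stat1 WHERE tbl = 'orders'").fetchall()
print(stats)
```

On MySQL the equivalent commands are ANALYZE TABLE orders; and OPTIMIZE TABLE orders; which additionally rewrites the physical table.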
Tips on GROUP BY vs DISTINCT, DISTINCT ON, LIKE, and Indexes
Use GROUP BY instead of DISTINCT to improve performance. A GROUP BY is explicit about its grouping columns, whereas DISTINCT is easy to use incorrectly: people sometimes assume it applies to a single column when it actually applies to the entire select list. DISTINCT is fine when used correctly, but an easy way to avoid mistakes is to use GROUP BY. DISTINCT ON is another option, though it too can be misused. Also, on MPP databases like Redshift, DISTINCT operations may be slower than GROUP BY operations.
SELECT name, address, COUNT(*) FROM contacts GROUP BY name, address;
The DISTINCT below is redundant: the GROUP BY already returns one row per name and address pair.
SELECT DISTINCT name, address, COUNT(*) FROM contacts GROUP BY name, address;
A LIKE pattern with a wildcard on both sides of the string (e.g. LIKE '%smith%') will not use an index.
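This is visible in SQLite's query plans (names invented): a double-wildcard pattern can only be checked by examining every row, while a prefix match rewritten as a range predicate stays on the index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT)")
conn.execute("CREATE INDEX idx_name ON contacts (name)")

def plan(sql):
    # Join the detail column of each EXPLAIN QUERY PLAN row.
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# A wildcard on both sides forces a scan of every row.
double_plan = plan("SELECT name FROM contacts WHERE name LIKE '%smith%'")

# A prefix match is equivalent to a range predicate, which uses the index.
# 'smiti' is 'smith' with the last character incremented, so the range
# covers exactly the strings starting with 'smith'.
prefix_plan = plan("SELECT name FROM contacts WHERE name >= 'smith' AND name < 'smiti'")

print(double_plan)   # SCAN: the whole table (or index) is walked
print(prefix_plan)   # SEARCH: only the matching index range is read
```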
Indexes have a cost: they require space and slow down writes to the table, so create them deliberately.
An example optimizing a query using derived tables and indexes
Here is an example of optimizing a long-running, complicated query. In this example, a Data Scientist created a query and would run it almost every day. The query took about 2 hours to execute, and longer if the database was already under stress. Presumably the query ran very fast when it was initially created and then slowed down as the volume of data grew. TL;DR: one massive slow query was broken down into smaller, faster queries.
Steps to improve performance
This was a general ETL optimization: a few large, slow, complicated operations were split into many small, fast, simple operations.
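The split can be sketched as a staged pipeline, each stage materializing one small intermediate result (all table and column names are invented; the real query and schema are not shown in the original slides):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (uvn INTEGER, page_category_id INTEGER, order_id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(1, 7, 100), (1, 5, None), (2, 5, None), (3, 7, 101), (3, 2, None)])

# Stage 1: apply the expensive filter once, into a small temp table,
# instead of repeating the predicate inside one giant query.
conn.execute("""
    CREATE TEMPORARY TABLE signup_visits AS
    SELECT uvn, order_id FROM events WHERE page_category_id = 7
""")

# Stage 2: aggregate the small intermediate result.
conn.execute("""
    CREATE TEMPORARY TABLE visitor_counts AS
    SELECT uvn, COUNT(order_id) AS orders
    FROM signup_visits
    GROUP BY uvn
""")

# Final step: a cheap query over the staged data; on a real database
# each temp table could also be indexed before the next stage reads it.
converted = conn.execute(
    "SELECT COUNT(*) FROM visitor_counts WHERE orders > 0").fetchone()[0]
print(converted)
```

Each stage is fast, easy to EXPLAIN in isolation, and restartable, which is exactly what turned the two-hour query into a sequence of small, predictable steps.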