Found an old powerpoint that I used to give to used to give to new Data Engineering, Data Science and Business Intelligence employees about how to optimize SQL code. This was at Chegg where I was officially an ETL Engineer and unofficially a DBA. The goal was minimize stress on the DB’s by explaining optimization techniques.
Relational databases generally tolerate abuse and are self correcting. An operation in Microsoft SQL Server eventually ends gracefully. Cloud database on the other hand generally enter a downward spiral when they get abused. Processes in Hadoop & Asterdata do not always end gracefully. Runaway processes on Aster were also like playing Russian roulette. In theory, things work. In practice, you may lose data by aggressively killing processes.
All that said, it made sense to hold classes every few months about how to optimize code as a way to minimize potential performance problems, with MySQL & Asterdata.
- Avoid derived tables
- Be weary of BETWEEN operations
- Use indexes
- Use an EXPLAIN PLAN if one is available
- Split operations up into smaller pieces (use indexes too!)
- Be aware of the advantages of facts vs partitioned dimensions.
A lot of the discussion on partitions was handled through speaking & not powerpoint. There’s a reason for this. In my experience, databases like Hadoop & Asterdata do not have a discrete set of solutions. You’ve got to experiment and unit test. Solutions vary from one case to the other. I could see Snowflake being simple to optimize but I’ve never used it…just read about it. Anyways. This powerpoint was interesting to dig up.
Here’s the presentation, https://imgur.com/a/9p3JvHw
 Looks like this is a repost…but…these tips are still useful. Next tips post I’ll do on bash, then python3. Time permitting…