The purpose of this post is to show the approach I took to migrate a large datamart from SQL Server to Hadoop. My blog has been neglected over the past year because of this project. The year before, I built a walking bike and got it into museums. This past year I refactored a ton of legacy code. Moving 15k lines of relational database code to Hadoop/Hive was interesting. The end result was actually nice because the data pipeline became easier to manage.
How time-consuming was this project?
After Lynda.com was purchased by LinkedIn, it was decided to move the tech stack from Lynda to LinkedIn. LinkedIn does have Teradata, which is much more comparable to SQL Server, but my peers were working on Hadoop/Hive, therefore I migrated the code to Hadoop/Hive. I once did a year-long Hadoop migration at Chegg.com where I moved the clickstream data pipeline from AsterData to Amazon EMR, but this Lynda project was drastically different. While the Chegg.com project was complicated by mimicking AsterData MPP processes in MapReduce, this Lynda project was complicated by having too much business logic in the datamart and clumsy legacy code.
Requirements:
- Migrate from:
  - Informatica
  - Informatica Cloud
  - Powershell
  - SSIS
  - SQL Agent
  - SQL Server Stored Procedures
- Migrate to:
  - Hadoop/Hive with Appworx as a scheduler
  - Gobblin with Azkaban as a scheduler
- Apply coding standards.
- Simplify business logic (in an ideal world, business logic should not exist in a DW but it is what it is).
- Refactor 15,000+ lines of code from SQL Server stored procedures to Hive.
- Create a dimensional model where one previously did not exist.
- Recreate downstream aggregate tables.
- Migrate data ingestion processes to Gobblin (https://github.com/apache/incubator-gobblin).
One of the major problems with this project was that the original datamart was a Frankenstein of technology. It used Informatica to kick off SQL Server Stored Procedures. It used semaphore files to do dependency tracking. (A semaphore file is a file used to track completion of a process: another process checks whether the file exists, and if it does, that process runs and then deletes the semaphore file. Pretty clumsy, but it is what it is.) There was Powershell code with SQL embedded in it. Tables and key value pairs were renamed with no identifiable pattern. Imagine user_id being renamed to account_id, then to source_account_id, then to lynda_user_id. Pretty weird stuff. To the credit of the original designers, it did serve a valuable purpose and it did produce accurate data.
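To make the semaphore-file pattern concrete, here is a minimal sketch of what that dependency check amounts to. This is not the original Powershell/Informatica code; the flag path and timings are hypothetical.

```python
import os
import time

# Hypothetical flag path; the real jobs used Powershell/Informatica, not Python.
SEMAPHORE = "/data/flags/upstream_load.done"

def wait_for_semaphore(path, poll_seconds=60, timeout_seconds=4 * 3600):
    """Block until the upstream job drops its completion flag."""
    waited = 0
    while not os.path.exists(path):
        if waited >= timeout_seconds:
            raise TimeoutError("Upstream flag never appeared: " + path)
        time.sleep(poll_seconds)
        waited += poll_seconds

def run_job():
    wait_for_semaphore(SEMAPHORE)
    # ... this job's own load would run here ...
    os.remove(SEMAPHORE)  # consume the flag so the next cycle starts clean

if __name__ == "__main__":
    run_job()
```

The weakness is obvious from the sketch: the dependency lives in a file on disk rather than in a scheduler, so there is no visibility into what is waiting on what.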
I will describe the solutions to this Hadoop migration project in a series of posts. In addition to making the process simpler, coding standards were universally applied, redundancy was removed, and atomic, restartable processes were added, among other things. Here is what was built in the end. It’s much less confusing.
The start of the project was extremely confusing, and I began by asking a few questions.
- How do you deconstruct legacy code?
- How do you estimate the cost to replace it?
- How do you organize your time?
My answer to deconstructing legacy code was to do the following:
- Define each server that exists
- Define each application/database that exists
- Define each process within each application/database
- Define relationships between each process
- Define how those relationships are tracked
Basically I had to understand the scope of the project.
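One lightweight way to capture that scope is to write the process inventory down as a dependency graph and let a script tell you the run order. The sketch below is purely illustrative; the process names are made up and not from the actual datamart.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical inventory: each process mapped to the processes it depends on.
processes = {
    "stage_users": set(),
    "stage_orders": set(),
    "dim_user": {"stage_users"},
    "fact_orders": {"stage_orders", "dim_user"},
    "agg_daily_revenue": {"fact_orders"},
}

# A valid end-to-end run order falls out of a topological sort;
# a cycle raises an error, which flags a broken dependency chain.
print(list(TopologicalSorter(processes).static_order()))
# e.g. ['stage_users', 'stage_orders', 'dim_user', 'fact_orders', 'agg_daily_revenue']
```

Even a toy graph like this is useful, because it forces every implicit relationship (semaphore files, hard-coded job sequences) to be written down explicitly.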
My answer to estimating the cost to refactor legacy code was to be systematic: break the code up into pieces that could be defined, calculate the complexity of the code, then create an estimate.
- Defined all relationships between all ETL jobs
  - To do this I created a python script to parse SQL Server ETL files. I had already written the script to show a colleague that the ETLs were complicated, and minor modifications made it extensible. The script would reformat each ETL file, parse JOIN conditions, then output various stats depending on how it was configured. It answered the following questions: How many ETLs existed? How many lines of code existed? For each ETL, how many lines of blanks, comments, and code existed? What tables were used? How many of each type of join was done? What tables were joined together? What columns were used to join each table? For each ETL, how many JOINs were done and what was their complexity? (A sketch of this kind of parser appears after this list.)
- Estimated the complexity of each ETL
- Categorized the ETLs for different purposes
For example, some ETLs had complicated business logic while others existed to transform data for dimensional modeling purposes.
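Here is a rough sketch of what that kind of parser can look like. This is not the original script; the regex, the folder name, and the complexity weighting are illustrative assumptions, but it shows how counting lines and JOINs per ETL file yields a crude complexity score to estimate from.

```python
import re
from pathlib import Path

# Illustrative regex; it ignores joins split across lines and /* */ comment blocks.
JOIN_RE = re.compile(r"\b(LEFT|RIGHT|FULL|CROSS|INNER)?\s*(OUTER\s+)?JOIN\b", re.IGNORECASE)

def profile_etl(path):
    """Return rough size and join stats for one SQL ETL file."""
    blank = comment = code = 0
    joins = []
    for line in Path(path).read_text(errors="ignore").splitlines():
        stripped = line.strip()
        if not stripped:
            blank += 1
        elif stripped.startswith("--"):
            comment += 1
        else:
            code += 1
            joins += [(m.group(1) or "INNER").upper() for m in JOIN_RE.finditer(stripped)]
    # Made-up weighting: outer joins count double toward complexity.
    complexity = sum(2 if j in ("LEFT", "RIGHT", "FULL") else 1 for j in joins)
    return {"file": path.name, "blank": blank, "comment": comment,
            "code": code, "joins": len(joins), "complexity": complexity}

if __name__ == "__main__":
    stats = [profile_etl(p) for p in Path("etl_sql").glob("*.sql")]  # hypothetical folder
    for s in sorted(stats, key=lambda s: s["complexity"], reverse=True):
        print(s)
```

Sorting the output by the complexity score is enough to categorize the ETLs and decide which ones deserve a bigger estimate.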
If I were to do these estimates again then I would reconsider the following:
- Any code that is not standard SQL in your source or target database is a new cost.
- Migrating from one data migration application to another application is a new cost.
- Complicated business logic should be moved out of the data warehouse.
- Cosmetic transformations should be moved out of the data warehouse.
My answer to organizing my time was to look at old Kimball books and break the project up into phases. Here’s a bit about that process, and I’ll go into detail about what the phases mean in the next post.