- Add historical data to the new data mart
- Reverse engineer the old data mart (~27 tables)
- Develop ETLs to migrate historical data
- Fix abnormal data
- Get sign off from stakeholders for data that cannot be fixed
- Test ETLs
- Execute ETLs to generate historical data in the new data mart
Old ER Model vs New ER Model
The first diagram shows the old ER model; the second shows the target tables in the new data mart. The new structure is much easier to comprehend.
Problems with the old data mart
Adding historical data to the new data mart was taxing because the data structures changed dramatically between the two designs. In the original design, duplicate data was pulled in from multiple sources, and multiple scripts wrote to the same table with no indication of which one created a given row. That made reverse engineering difficult: to build a star schema out of these odd legacy structures, I had to get deep into the weeds of the old data mart.
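To make the duplicate problem concrete, profiling queries along these lines were the kind of thing the reverse engineering required. The table and column names below are hypothetical stand-ins, not the actual schema:

```sql
-- T-SQL (SQL Server): surface rows that were loaded more than once.
-- All table and column names are hypothetical stand-ins.
SELECT
    account_id,
    transaction_date,
    amount,
    COUNT(*) AS copies
FROM old_mart.dbo.fact_transactions
GROUP BY account_id, transaction_date, amount
HAVING COUNT(*) > 1   -- the same logical row, written by more than one script
ORDER BY copies DESC;
```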
Refactoring the data
Deriving a star schema from the old data mart involved creating new ETL processes to populate dimensions and facts with historical data. For some dimensions and facts, I sourced OLTP tables outside of the old data mart. In places, the old data mart had been processing data incorrectly, and part of my migration was to fix these legacy problems; in some cases that meant defining new business logic. At every step of the way, I worked with stakeholders to confirm my results. Sometimes it was more art than science, and in those situations I leaned on domain experts to guide my solutions.
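As a minimal sketch of what one of those corrective loads looked like, assuming hypothetical object names and invented business rules (the real ones came out of the stakeholder reviews):

```sql
-- T-SQL sketch: populate a dimension from an OLTP source while applying
-- corrected business logic. All object names are hypothetical.
INSERT INTO staging.dbo.dim_customer (customer_id, customer_name, region, status)
SELECT
    c.customer_id,
    LTRIM(RTRIM(c.customer_name)),     -- fix: legacy job kept stray whitespace
    COALESCE(c.region, 'UNKNOWN'),     -- fix: legacy job dropped NULL regions
    CASE WHEN c.closed_date IS NULL
         THEN 'ACTIVE' ELSE 'CLOSED'
    END                                -- new business rule agreed with stakeholders
FROM oltp_db.dbo.customers AS c;
```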
Generally speaking, the migration scripts had these steps (a SQL sketch follows the list):
- ETL jobs to deconstruct data in Microsoft SQL Server
These operations were needed to access OLTP tables that existed in SQL Server but not in Hadoop. The staging target tables were created in this step.
- Copy historical data from SQL Server to Hadoop
- ETL jobs to deconstruct data in Hive to create dims and facts
These operations were needed to access tables that existed in Hadoop but not in SQL Server. Since Hadoop was going to be the source of truth in the new design, I sourced as much data as possible from Hadoop and relied on tables from SQL Server only where necessary.
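Here is a minimal sketch of those steps, with hypothetical object names on both platforms; the actual scripts were far larger:

```sql
-- Step 1 (T-SQL on SQL Server): stage OLTP history that only exists in SQL Server.
SELECT o.order_id, o.customer_id, o.order_date, o.order_total
INTO staging.dbo.stg_orders
FROM oltp_db.dbo.orders AS o
WHERE o.order_date >= '2008-01-01';   -- roughly a decade of history

-- Step 2: copy the staging tables into Hadoop. A bulk-transfer tool such as
-- Sqoop is one common option; the tool named here is an assumption.

-- Step 3 (HiveQL on Hadoop): build the fact from Hive-resident history,
-- joining in the staged SQL Server data only where necessary.
INSERT OVERWRITE TABLE new_mart.fact_orders
SELECT s.order_id,
       d.customer_key,
       s.order_date,
       s.order_total
FROM staging.stg_orders s
JOIN new_mart.dim_customer d
  ON d.customer_id = s.customer_id;
```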
The end result was a set of dimensions and facts in HDFS that included a decade of historical data.
Using a car as an analogy
To put this in perspective, consider a legacy data pipeline that resulted in a car…with a trailer that had a boat on it. I deconstructed the car, trailer, and boat into their major components. Each vehicle had individual components: a frame, wheels, an engine, a transmission, and so on. Where a part was missing, I added it. If the engine lacked a carburetor, I created a new one. If I extracted square tires instead of round tires, I fixed the tires. I then tested each component individually for quality control: it had to operate the same way in the new environment as it did in the old one. The end result was a car, trailer, and boat that operated the same, with only minor, agreed-upon inconsistencies. Because all of this data was financial, accuracy was critical. That's the best analogy I can give for this 'processing' stage of ETL development. As taxing as the process may sound, you define incremental steps, iterate on development tasks, test, then move on to the next step.
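In practice, the quality-control part of that analogy amounted to reconciliation queries run against both environments; something along these lines, again with hypothetical names:

```sql
-- HiveQL sketch: aggregate the rebuilt fact, then run the equivalent query
-- against the old data mart and compare. All names are hypothetical.
SELECT year(order_date)  AS order_year,
       COUNT(*)          AS row_count,
       SUM(order_total)  AS total_amount
FROM new_mart.fact_orders
GROUP BY year(order_date)
ORDER BY order_year;
-- Any variance between old and new should be one of the minor,
-- agreed-upon inconsistencies signed off by stakeholders.
```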