Phase 5: Development, migrating a SQL Server datamart to Hadoop
The purpose of this post is to review:
- Scheduler Setup & Configuration (Appworx)
- Data Staging, Design & Development
Appworx was chosen as the ETL scheduler because the tool was already in place in the corporate environment. This was good and bad, for reasons discussed below. The original datamart was orchestrated using a Frankenstein of technologies, so a move to Appworx was ideal.
Why using Appworx was good:
- Appworx simplified dependency tracking.
- Appworx simplified integration. Previously, multiple software solutions were needed to connect the various technologies; Appworx is flexible enough to manage all of these integrations on its own.
Why moving to Appworx was bad:
- Appworx is limited to high watermarks. This is more of a personal preference: I prefer to validate processes against discrete time periods rather than a single high watermark. A watermark spans many periods, which can make reprocessing data unnecessarily costly (see the sketch after this list).
- Appworx has a clumsy GUI, which makes administration painful once you have 25 or more processes on the screen.
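To illustrate the watermark point, here is a minimal HiveQL sketch contrasting the two incremental-load styles. The table and variable names (`stg_sales`, `src_sales`, `last_watermark`, `run_date`) are hypothetical, not taken from the actual project:

```sql
-- High-watermark style: pull everything newer than the last load.
-- Reprocessing one bad day means resetting the watermark and reloading
-- every row after it, even the days that were already fine.
INSERT INTO TABLE stg_sales
SELECT *
FROM   src_sales
WHERE  updated_at > '${hiveconf:last_watermark}';

-- Discrete-period style: each run owns exactly one period, so a single
-- day can be validated and re-run in isolation.
-- (Assumes stg_sales is partitioned by load_date.)
INSERT OVERWRITE TABLE stg_sales PARTITION (load_date = '${hiveconf:run_date}')
SELECT *
FROM   src_sales
WHERE  to_date(updated_at) = '${hiveconf:run_date}';
```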
- What is Data Staging?
For this project, the goal of data staging was to migrate historical data from the old system to the new system.
- Why was Data Staging needed?
In the old system, the data was denormalized. The new system used a star schema, built from entities staged in 3rd normal form. As such, the data from the old system had to be refactored to meet the standards of the new datamart. On that note, data should always be in 3rd normal form at some point in the ETL.
- Why was Data Staging difficult?
The old system held roughly a decade of data, and at no point was that data in 3rd normal form. That was strange. A typical business will materialize the data in 3rd normal form and then reuse it. This was one of those situations where the source data was improperly defined and the old pipeline took liberties to massage it. For this project, I had to reconstruct the 3rd normal form entities from denormalized data. To make matters worse, much of that decade-old data also needed to be cleaned.
- How was Data Staging achieved?
Lots of manual SQL refactoring scripts. The purpose of these scripts was to extract data with an identity: what is a customer, what is a sale, what is a subscription, what is a coupon, and so on. All of those entities had to be defined from a confusing set of tables in which many key/value pairs were confusingly named and reused. On the surface this may seem trivial. Consider a user's first name: if three tables each hold a different first name, which one do you choose? To solve this, new business logic had to be defined and vetted with stakeholders. In effect, incorrect data in the old datamart had to be corrected in the new datamart.
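As an illustration of the kind of precedence rule that had to be vetted with stakeholders, here is a HiveQL sketch that picks a first name from three hypothetical source tables in an agreed order (all table and column names are illustrative, not from the project):

```sql
-- Prefer the account profile, fall back to the billing record,
-- then to the legacy CRM.
INSERT OVERWRITE TABLE stg_customer
SELECT c.customer_id,
       COALESCE(p.first_name, b.first_name, l.first_name) AS first_name
FROM   stg_customer_ids c
LEFT JOIN src_profile    p ON p.customer_id = c.customer_id
LEFT JOIN src_billing    b ON b.customer_id = c.customer_id
LEFT JOIN src_legacy_crm l ON l.customer_id = c.customer_id;
```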
A primary goal of the new datamart was to create a transactional fact table. From there, periodic fact tables could be derived to meet future requirements. The grain of the datamart was a specific transaction, such as someone creating or cancelling a subscription.
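To make that relationship concrete, here is a hypothetical HiveQL sketch of how a periodic (monthly) fact could be rolled up from a transactional fact at this grain. The table and column names are illustrative, and it assumes `transaction_date` is stored as a `YYYY-MM-DD` string:

```sql
-- Derive a monthly snapshot from the transactional fact by aggregation,
-- rather than going back to the source systems.
INSERT OVERWRITE TABLE fact_subscriptions_monthly
SELECT substr(t.transaction_date, 1, 7)                         AS month_id,
       SUM(CASE WHEN t.event_type = 'create' THEN 1 ELSE 0 END) AS subscriptions_created,
       SUM(CASE WHEN t.event_type = 'cancel' THEN 1 ELSE 0 END) AS subscriptions_cancelled
FROM   fact_subscription_transactions t
GROUP BY substr(t.transaction_date, 1, 7);
```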
- How did I handle changing subscription start/end dates?
Handling the start_date and end_date values for a subscription was interesting. How do you handle a subscription whose start_date is in the future? Is it better to set the grain to when an event is created or to when an event is executed? What happens if a user later updates that date? It was a strange edge case.
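One possible way to model it (a sketch only, not necessarily how it was resolved on this project) is to keep the grain at the moment an event is created and carry the effective dates as ordinary attributes, so a user changing a future start_date simply generates another event row:

```sql
-- Hypothetical sketch: grain = the moment the event was recorded, with
-- the (possibly future, possibly later-updated) effective dates carried
-- as attributes. An updated start_date arrives as a new event row
-- rather than an update to an old one.
CREATE TABLE fact_subscription_events (
  subscription_id BIGINT,
  event_type      STRING,    -- e.g. 'create', 'cancel', 'update_dates'
  recorded_at     TIMESTAMP, -- when the event was captured (the grain)
  start_date      DATE,      -- effective start; may be in the future
  end_date        DATE       -- effective end; may be NULL
)
STORED AS ORC;
```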
Development was done in phases, starting with the dimensions and moving on to the facts. Part of the goal was to give each ETL a specific, mutually exclusive responsibility. This led to a lot of ETL files, but it made the datamart easier to manage:
- Dimensions (15+ scripts)
- Facts (25+ scripts)
- No Type 2 SCDs in the initial design (the project's complexity made them prohibitive)
The design process produced 40+ ETL scripts to populate the fact table and dimensions. It was a very intensive phase that required 80+ hour weeks; the sheer size of the project made that unavoidable, as there was a lot of legacy code to refactor from SQL Server stored procedures to Hive.
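For context, here is a minimal sketch (hypothetical table and column names) of roughly what skipping Type 2 SCDs implies in Hive: each run simply overwrites the dimension with the latest attribute values, Type 1 style, so no effective-date ranges have to be maintained.

```sql
-- Type 1 dimension load: overwrite with current values each run,
-- losing attribute history but avoiding SCD bookkeeping.
INSERT OVERWRITE TABLE dim_customer
SELECT customer_id,
       first_name,
       last_name,
       current_plan
FROM   stg_customer;
```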
Here is the end result. What's noteworthy is that every ETL was written in the same language, unlike the old datamart. In the original datamart, the ETLs were scattered across a variety of systems (Informatica, SSIS, SQL, stored procedures, and shell scripts), making them *very* difficult to comprehend. In the new datamart, the ETLs are isolated and easier to manage.