The purpose of this post is to review:
- Scheduler Setup & Configuration (Appworx)
- Data Staging, Design & Development
- Appworx was chosen as the ETL scheduler because the tool was already in place in the corporate environment. This was both good and bad, for reasons discussed below. The original datamart was orchestrated using a Frankenstein of technologies, so a move to Appworx was ideal.
Why using Appworx was good:
- Appworx simplified dependency tracking.
- Appworx simplified integration. Previously, multiple software solutions were used to integrate with various technologies. Appworx is flexible enough that it alone could manage these integrations.
Why moving to Appworx was bad:
- Appworx is limited to using high watermarks. This is more of a personal preference: I prefer to validate processes based on discrete time periods, not an agnostic high watermark. A high watermark is effectively a superset of many discrete periods, which can make reprocessing data unnecessarily costly (see the sketch after this list).
- Appworx has a clumsy GUI, which makes administration painful. Administration becomes difficult once you have 25 or more processes on the screen.
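To illustrate the watermark point, here is a minimal SQL sketch. The `src_sales`, `stg_sales`, and `etl_control` tables are hypothetical, not from the actual project:

```sql
-- High-watermark approach: pull everything newer than the last processed
-- timestamp. Correcting one old record means rolling the watermark back
-- and reprocessing everything after it.
INSERT INTO stg_sales
SELECT *
FROM   src_sales
WHERE  updated_at > (SELECT last_watermark FROM etl_control WHERE job_name = 'sales');

-- Period-based approach: each run targets one discrete, validated period.
-- A bad day can be re-run in isolation without touching its neighbors.
INSERT INTO stg_sales
SELECT *
FROM   src_sales
WHERE  updated_at >= DATE '2015-03-01'
  AND  updated_at <  DATE '2015-03-02';
```

The period-based version also makes validation natural: row counts and totals can be checked per period, rather than against an ever-moving watermark.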
- What is Data Staging?
For this project, the goal of data staging was to migrate historical data from the old system to the new system.
- Why was Data Staging needed?
In the old system, data was denormalized. The new system was a datamart built on a star schema, with the staged data held in 3rd normal form. As such, the data from the old system had to be refactored to meet the standards of the new datamart. On that note, data should always be in 3rd normal form at some point in the ETL.
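As a rough illustration of the two shapes (table and column names are hypothetical, not taken from the project):

```sql
-- Old system: one wide, denormalized table mixing several entities per row.
CREATE TABLE legacy_events (
  event_ts      TIMESTAMP,
  customer_name STRING,        -- repeated on every row
  plan_name     STRING,
  coupon_code   STRING,
  amount        DECIMAL(10,2)
);

-- New system: a star schema, with each entity factored into its own table
-- and the fact table holding only keys and measures.
CREATE TABLE dim_customer (customer_key BIGINT, first_name STRING, last_name STRING);
CREATE TABLE dim_plan     (plan_key BIGINT, plan_name STRING);
CREATE TABLE fact_sales (
  event_ts     TIMESTAMP,
  customer_key BIGINT,         -- foreign key into dim_customer
  plan_key     BIGINT,         -- foreign key into dim_plan
  amount       DECIMAL(10,2)
);
```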
- Why was Data Staging difficult?
The old system had about a decade of data, and at no point was that data in 3rd normal form. That was strange: a typical business will materialize the data in 3rd normal form and then reuse it. This was one of those situations where the source data was improperly defined and the old data pipeline took liberties to massage it. For this project, I had to redefine the 3rd normal form entities from denormalized data. To make matters worse, much of the data was ~10 years old and needed to be cleaned.
- How was Data Staging achieved?
Lots of manual SQL refactoring scripts. The purpose of these scripts was to extract data with an identity: what is a customer, what is a sale, what is a subscription, what is a coupon, and so on. All of those entities had to be defined and extracted from a large, confusing denormalized table in which many of the key-value pairs were confusingly named and reused. On the surface, this goal may seem trivial. Consider a user's first name: if the first name is extracted from multiple sources and reconstructed to create a ‘Full Name’, then what is the authority on that user's ‘First Name’? New rules had to be defined to reverse engineer the old Frankenstein ETL process.
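A rough sketch of what one of those refactoring scripts looked like, with hypothetical table and key names. The COALESCE encodes a rule about which source is authoritative when the same attribute appears under multiple keys:

```sql
-- Pivot the denormalized key-value table into a discrete customer entity.
INSERT INTO stg_customer
SELECT
  kv.entity_id AS customer_id,
  -- Rule: the CRM value wins; fall back to the billing value if missing.
  COALESCE(
    MAX(CASE WHEN kv.attr_key = 'crm_first_name'     THEN kv.attr_val END),
    MAX(CASE WHEN kv.attr_key = 'billing_first_name' THEN kv.attr_val END)
  ) AS first_name,
  COALESCE(
    MAX(CASE WHEN kv.attr_key = 'crm_last_name'      THEN kv.attr_val END),
    MAX(CASE WHEN kv.attr_key = 'billing_last_name'  THEN kv.attr_val END)
  ) AS last_name
FROM legacy_key_value kv
GROUP BY kv.entity_id;
```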
A primary goal of the new datamart was to create a transactional fact table. From there, I could create periodic fact tables as needed (see the sketch below). The grain of the datamart was a specific transaction, such as someone creating a subscription or cancelling a subscription.
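For illustration only (hypothetical names again), a monthly periodic fact table can be rolled up from the transactional one:

```sql
-- Each row of fact_subscription_events is one transaction at the grain of
-- the datamart; the rollup summarizes those events per month.
INSERT INTO fact_subscriptions_monthly
SELECT
  TRUNC(event_ts, 'MM')                                  AS month_start,
  SUM(CASE WHEN event_type = 'create' THEN 1 ELSE 0 END) AS subscriptions_created,
  SUM(CASE WHEN event_type = 'cancel' THEN 1 ELSE 0 END) AS subscriptions_cancelled
FROM fact_subscription_events
GROUP BY TRUNC(event_ts, 'MM');
```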
- How did I handle subscription start/end dates?
Handling start_date and end_date values for subscriptions was interesting. I chose to store the start_date and end_date values in the fact table only; a subsequent evolution of the datamart would include those values in the dimension. This was done for simplicity: the initial design deliberately avoided slowly changing dimensions. I presume the next iteration of the design would be to turn the subscription table (subscription_id, start_date, end_date) into an SCD, along the lines of the sketch below.
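A hypothetical sketch of that next iteration as a Type 2 slowly changing dimension (none of this was built; the names are illustrative):

```sql
-- Each change to a subscription closes the current row and opens a new one.
CREATE TABLE dim_subscription (
  subscription_key BIGINT,   -- surrogate key, one per version of the row
  subscription_id  BIGINT,   -- natural key from the source system
  status           STRING,
  effective_from   DATE,
  effective_to     DATE,     -- open rows carry a far-future date
  is_current       BOOLEAN
);

-- Point-in-time lookups then reduce to a simple range predicate:
SELECT d.status
FROM   dim_subscription d
WHERE  d.subscription_id = 42
  AND  DATE '2014-06-01' >= d.effective_from
  AND  DATE '2014-06-01' <  d.effective_to;
```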
Development was done in phases, such that an ETL existed for every event and every dimension table. No SCDs were used in the initial design; the complexity of the project made using SCDs prohibitive. The design process resulted in 40+ ETL scripts to populate the fact table and dimensions. This was a very intensive process that required many 80+ hour weeks. The effort was exacerbated by the need to reference legacy code and convert it into Hive.
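As a rough sketch of the shape each of those scripts took (all names are hypothetical), one Hive script per event inserted that event's rows into the transactional fact table:

```sql
-- Load cancellation events at the grain of a single transaction, resolving
-- surrogate keys against the dimensions (initial, non-SCD design: one row
-- per customer and per subscription in each dimension).
INSERT INTO fact_subscription_events
SELECT
  e.event_ts,
  'cancel' AS event_type,
  c.customer_key,
  s.subscription_key,
  e.subscription_id
FROM stg_cancellation_events e
JOIN dim_customer     c ON c.customer_id     = e.customer_id
JOIN dim_subscription s ON s.subscription_id = e.subscription_id;
```

A one-script-per-event layout like this keeps each unit small and independently testable, which matters when the reference is decade-old legacy code.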