Phase 4: Hadoop config + Physical database design + Analytics
In this phase my focus was to prepare for development. To complete this stage, my goal was to create an environment where I could do my development comfortably. Comfortable meant avoiding having to make design decisions during the development phase. If this phase goes well, then development goes well. (I’m also going to add that these blog posts are a stream of consciousness. If I were to write a manual for this project, it would look very different from these blog posts.)
As such here was the focus:
- Configuration
- Ingestion platform
How would data be moved into the development environment?
Azkaban moved the data in 4-hour intervals. One source system was OLTP metadata about users and products. This data was safely migrated once per day and was guaranteed to be complete. Another source system was Zuora billing data. Zuora billing data was unpredictable: it could arrive early or late depending on the cost of processing the previous day’s data. As such, Zuora data would be migrated on 4-hour intervals, and a downstream process would be created to verify that 100% of the previous day’s billing data had been migrated (see the completeness-check sketch after this list).
- Hadoop platform
How would the physical data be persisted in the Hadoop cluster?
Data would be loaded into large daily partitions. In hindsight I wish I had set this up differently, or done some migration during development to optimize the partitions. I chose to use large partitions to hold small amounts of data. The benefit was that I could easily add and drop daily partitions. The cost was that I took up a lot of dead space. This is a tradeoff: do I want an easy way to manipulate a subset of a discrete set of data, or do I want the ability to quickly process that data? I chose the path of easily manipulating the data.
- Physical Design
What structures will be used to hold the dimensional data mart?
A single large transaction fact table would be used. If I wanted to slim down the fact table, I could build aggregate fact tables in the future. By building a large, bulky, non-performant transaction fact table, I could consolidate all events at an atomic level with no data loss. I also chose to loosely define the type definitions for natural keys in a way that would make development easy. For example, instead of dealing with a mix of INTs & BIGINTs in natural keys, I chose to use BIGINTs for everything. There were 10k+ lines of code to rewrite, and I quickly learned that sifting through 10k lines of code to find the one intermediate table with an incompatible type definition was a luxury I did not have time for. Using BIGINTs everywhere was taking the easy way out, but it is what it is. When you’re 60 hours into work and it’s only Thursday, tradeoffs like this make more sense. (A rough sketch of this fact table appears after this list.)
- Analytical Layer
How would the data be ingested into a BI platform?
The BI layer for this data was Teradata. The fact table was consumed into Teradata and massaged in that environment. The BI layer was mostly hands-off from my perspective, so there is not much for me to dive into. I managed this process by working with the engineers behind the Teradata layer to verify that my fact table was being delivered correctly. A lot of QC was done at this phase to verify that the data was accurate. Few design decisions were made; moving the fact/dim tables from Hadoop to Teradata was enough.
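Since the Zuora feed could land early or late, the completeness check mentioned under the ingestion platform is worth sketching. This is a minimal sketch, not the production job: the run_query helper and the zuora_staging / warehouse table names are placeholders I’m assuming for illustration.

```python
from datetime import datetime, timedelta, timezone

def check_previous_day_complete(run_query):
    """Verify that 100% of yesterday's Zuora billing rows made it into Hadoop.

    `run_query` is a placeholder for whatever executes SQL in your environment
    and returns a single integer count.
    """
    yesterday = (datetime.now(timezone.utc) - timedelta(days=1)).date().isoformat()

    # What the source system says it produced for that day (hypothetical table name).
    source_count = run_query(
        f"SELECT COUNT(*) FROM zuora_staging.billing_events WHERE txn_date = '{yesterday}'"
    )
    # What actually landed in the warehouse's daily partition (also hypothetical).
    loaded_count = run_query(
        f"SELECT COUNT(*) FROM warehouse.fact_billing_event WHERE ds = '{yesterday}'"
    )

    if source_count != loaded_count:
        raise RuntimeError(
            f"Zuora load incomplete for {yesterday}: "
            f"source={source_count}, loaded={loaded_count}"
        )
    return True
```

Whether a failure triggers a reload or just an alert is up to the environment; the point is simply that the 4-hour Zuora loads get a daily completeness gate.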
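And for the physical design above, here is roughly the shape I mean by a “large, bulky” transaction fact table. None of this is the project’s actual DDL (the table name, columns, and ORC storage choice are assumptions for illustration), but it makes the two decisions concrete: BIGINTs for every natural key, and partitions at the (day, billing source, event type) grain.

```python
# Hypothetical Hive DDL, written as a string the way an orchestration step might
# submit it. Only the shape reflects the post: one wide transaction fact table,
# BIGINT natural keys everywhere, partitioned at (day, billing source, event type).
FACT_DDL = """
CREATE TABLE IF NOT EXISTS warehouse.fact_billing_event (
    account_id       BIGINT,        -- natural key, forced to BIGINT like everything else
    subscription_id  BIGINT,
    product_id       BIGINT,
    invoice_id       BIGINT,
    txn_ts_utc       TIMESTAMP,     -- event time, already converted to UTC
    amount           DECIMAL(18,2),
    currency         STRING
)
PARTITIONED BY (
    ds             STRING,          -- UTC day, e.g. '2017-07-15'
    billing_source STRING,          -- e.g. 'ZUORA'
    event_type     STRING           -- e.g. 'SUBSCRIPTION_CREATED'
)
STORED AS ORC
"""
```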
A word about partitioning billing data in Hadoop:
The grain of this fact table was:
- Billing Source
A billing source was defined as a billing system. Zuora is a billing system. 4 distinctly different billing systems were used.
- Event Type
An Event Type is a type of billing event. SUBSCRIPTION_CREATED is a type of billing event that defines when a subscription record is created. ~30 distinctly different event types were used.
- Day
A Day is defined as data for a day in UTC.
A single day would have no more than ~300 partitions. Given ~7 years of historical data & 365 days/year, that’s roughly 766k partitions. That is WAY too many partitions for Hadoop to manage efficiently. Work would be needed downstream in the data pipeline to avoid putting so many partitions in Hadoop.
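To make that math concrete, here is a small sketch of how the partition count explodes at this grain. The billing source and event type values are stand-ins (only ZUORA and SUBSCRIPTION_CREATED are names that appear in this post); the point is the multiplication, not the specific values.

```python
from datetime import date
from itertools import product

# Stand-in dimension values; the real project had 4 billing sources and ~30 event types.
BILLING_SOURCES = ["ZUORA", "SOURCE_B", "SOURCE_C", "SOURCE_D"]
EVENT_TYPES = ["SUBSCRIPTION_CREATED"] + [f"EVENT_{i:02d}" for i in range(29)]

def partition_specs_for_day(ds: date):
    """One Hive partition per (day, billing source, event type) at this grain."""
    for source, event in product(BILLING_SOURCES, EVENT_TYPES):
        yield (f"PARTITION (ds='{ds.isoformat()}', "
               f"billing_source='{source}', event_type='{event}')")

specs = list(partition_specs_for_day(date(2017, 7, 15)))
print(f"{len(specs)} partition specs for a single day")   # 120 with these stand-in values

# Scale that out across the history being loaded:
total = 300 * 365 * 7    # ~300 partitions/day in the real data, ~7 years of history
print(f"~{total:,} partitions over 7 years")   # ~766,500: far too many to manage efficiently
```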
A word about time zone conversions:
Time zone conversions were a big enough challenge that they warrant their own paragraph. Not all billing systems are created equally. Some save data in PST, others in UTC. When data from disparate billing systems is merged, the data must be written such that all of it uses the same time zone. Normally this would be trivial, but the billing systems in this project involved subscription-based transactions. A subscription billing system will have nightly reconciliation processes. During nightly reconciliation, new billing events are created by the billing system and those events are truncated to midnight. Major PITA. I chose to migrate all data to UTC because I wanted the transaction fact table to be a single source of truth with as little data manipulation as possible. Also…in my opinion, all data should be in UTC. Defining a billing system in PST, even if you’re in California, is a rookie move. It makes life harder down the road.
Consider this problem. We have 2 billing systems, one in UTC and one in PST, and a fact table in which we want to write the data in UTC.
| | original_txn_date | new_txn_date | fact_table_txn_date |
| --- | --- | --- | --- |
| PST billing system | 2017-07-15 2:15 AM PST | 2017-07-15 | 2017-07-15T08:00:00 |
| UTC billing system | 2017-07-15 5:37 AM UTC | 2017-07-15 | 2017-07-15T00:00:00 |
What’s happening is that a user creates an event with a transaction that has datetime precision. The billing system creates a new event during a nightly billing job, then truncates the date to midnight in its own time zone. The ETL picks up this truncated date and converts it to UTC. This ‘feature’ becomes an issue when you try to do account reconciliation from the datamart back to the source data. Because my datamart was meant for marketing needs, I chose to stick with this situation and live with it. No lives would be lost by accepting this unfortunate data loss. Had this datamart been delivered to an accounting department, I would have done something to retain that lost precision.
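Here is a minimal sketch of that truncate-then-convert behavior, assuming a fixed UTC-8 offset for “PST” (which is how the 08:00:00 falls out). A tz-database zone like America/Los_Angeles would give 07:00 for a July date because of daylight saving; the fixed offset matches the table above.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 to match the post's 'PST' label; a tz-database zone
# (America/Los_Angeles) would be UTC-7 for a July date because of DST.
PST = timezone(timedelta(hours=-8), "PST")
UTC = timezone.utc

def to_fact_table_ts(original: datetime) -> str:
    """Mimic the pipeline above: the source system truncates the event to
    midnight in its own time zone, then the ETL converts it to UTC."""
    truncated = original.replace(hour=0, minute=0, second=0, microsecond=0)
    return truncated.astimezone(UTC).isoformat()

pst_event = datetime(2017, 7, 15, 2, 15, tzinfo=PST)   # 2017-07-15 2:15 AM PST
utc_event = datetime(2017, 7, 15, 5, 37, tzinfo=UTC)   # 2017-07-15 5:37 AM UTC

print(to_fact_table_ts(pst_event))   # 2017-07-15T08:00:00+00:00 -- the 2:15 AM is gone
print(to_fact_table_ts(utc_event))   # 2017-07-15T00:00:00+00:00 -- truncated at UTC midnight
```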
The devil is in the details. This timezone conversion problem looked innocent at first. As I dug deeper, I had to manually review every date & datetime piece of metadata, and the problem became challenging. Verification meant looking at the data to determine whether it was in ISO 8601, a plain string, had a timezone offset, or something else. Things like this lead to scope creep, but it is what it is.
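This is the kind of throwaway triage I mean, sketched with made-up sample values: classify each raw date/datetime string before trusting it. The parsing rules here are simplified; real source data will have more edge cases.

```python
from datetime import datetime

def classify_datetime(value: str) -> str:
    """Rough triage of a raw date/datetime string from a source system."""
    for fmt, label in [
        ("%Y-%m-%dT%H:%M:%S%z", "ISO 8601 with timezone offset"),
        ("%Y-%m-%dT%H:%M:%S",   "ISO 8601, no offset (time zone unknown)"),
        ("%Y-%m-%d",            "date only (precision already lost)"),
        ("%m/%d/%Y %I:%M %p",   "locale-style string (needs manual mapping)"),
    ]:
        try:
            datetime.strptime(value, fmt)
            return label
        except ValueError:
            continue
    return "something else -- inspect by hand"

# Made-up sample values of the kinds described above.
for sample in ["2017-07-15T08:00:00+00:00", "2017-07-15T02:15:00",
               "2017-07-15", "07/15/2017 02:15 AM", "20170715T0815Z"]:
    print(f"{sample!r:32} -> {classify_datetime(sample)}")
```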
Conclusions:
- The primary purpose of this phase was to validate my approach. Given the information I had at the time, my approach was sound.
- Using 128MB partitions is something I regret, because it led to a lot of dead space. But it did make development easy at a time when deadlines were fast approaching. Consolidating data from multiple partitions would have helped.
- When working with source systems in multiple time zones, exercise caution. Particularly if some of those source systems are losing precision. It’s not always easy to notice when precision is lost. To recognize whether or not datetime precision is an issue, you may need to get your hands dirty with the source data.
(sorry for the not-so-great formatting…CSS & HTML are not my thing…)