Phase 3 of 7! Here we go!
This is the start of the high-level technical discussions. This is phase 3 of 7 for this project (see diagram below).
This phase is broken up into 3 sections:
- Technical Architecture Design
The goal is to define the architecture of the new system. Which servers store the data? Which servers consume it? Which servers manage it?
- Dimensional Modeling
The goal is to define the fact table and dimension tables.
- Analytical Application Specification
The goal is to define the reporting layer for the new datamart.
Technical Architecture Design:
One of the goals of this project was to migrate to a database that was more commonly used at the new company, in this case LinkedIn/Microsoft. As such, we moved from SQL Server to Hadoop. It is worth noting that other databases are used at LinkedIn, such as Teradata and Oracle, but the data team uses Hadoop, and this new datamart would sit on Hadoop. SQL Server, Teradata & Oracle are more reasonable solutions for hosting a star schema, but the direction of my department was to move as much as possible to Hadoop. Not easy, but where there is a will, there is a way.
I made a conscious decision to use hive instead of pig. Most of my peers were in the pig camp. The hive query language is close to SQL Server's T-SQL, which made the decision easy for me. In fact, a month into this project I had several python scripts that could port code for me. After spending way too much time fixing things by hand, I put a python script together to parse a SQL Server file and output a hive hql file, complete with HDFS locations, formats, CTAS operations, type definitions and formatting. Pretty silly to have to go that far, BUT migrating 10k+ lines of code was a ridiculous chore. I was constantly refactoring, and having the ability to easily move from one database to another was _amazingly_ valuable. I had operations joining 6+ tables, and Hadoop chokes on that type of operation. If I found a bug in hive, I could port my code to SQL Server, debug it, then port it back to hive. Pretty awesome…pretty lame having to resort to that…but pretty awesome too.
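To give a flavor of the translation, here is the kind of hql the script would emit for a simple SQL Server `SELECT ... INTO` (the table, columns, and HDFS path below are made up for illustration, not the actual datamart code):

```sql
-- Hypothetical example only: table, columns, and HDFS path are illustrative.
-- SQL Server source (roughly):
--   SELECT account_id, SUM(amount) AS total_amount
--   INTO dbo.billing_summary
--   FROM dbo.billing
--   GROUP BY account_id;

-- Generated hive hql: type definitions, storage format, and HDFS location...
CREATE EXTERNAL TABLE billing_summary (
  account_id    BIGINT,
  total_amount  DECIMAL(18,2)
)
STORED AS ORC
LOCATION '/data/datamart/billing_summary';

-- ...with the SELECT ... INTO rewritten as an INSERT (or a CTAS for one-off tables).
INSERT OVERWRITE TABLE billing_summary
SELECT account_id,
       SUM(amount) AS total_amount
FROM   billing
GROUP  BY account_id;
```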
The old data ingestion tools for this project were added haphazardly. As a result, a Frankenstein system was built using a combination of Informatica, PowerShell and SSIS. This was done for time-to-market purposes. The powers that be wanted results and engineers cut corners to meet deadlines. It happens, nothing wrong with that, but I had to clean it up. There was no real desire to keep any of the old tech, so all data ingestion would be refactored. The new tech stack was built using tools common at LinkedIn. Azkaban is an open-source workflow scheduler developed at LinkedIn. Azkaban was already supported, which was good. It did not include APIs to communicate with some third-party billing systems, which was bad. Not having those APIs meant scope creep. We needed to pull an engineer off another project to solve this API problem. Ultimately, the project got what it needed, all source data delivered to HDFS, but the cost to get there was higher than expected.
Scheduling and Dependency Management:
The old system used a combination of Informatica and SSIS to manage processes. It also used ‘semaphore’ files to communicate from one system to another. When an upstream process finished, it created a file. A downstream process would wait for that file, kick off when it was found, then delete the file when finished. It may sound strange, but it worked. This clumsy system was a consequence of not having the right tools available for the right job. The solution was to embrace the tech stack at LinkedIn. Appworx is a GUI tool (run through xterm) used to manage processes. Thus, Appworx was used to stand up the new datamart. (More on this later. This datamart got a little crazy with Appworx dependencies.)
Dimensional Modeling:
Another goal of this project was to create a new star schema. The old system was more or less a denormalized flat table with every attribute and measure imaginable in one record (yuck!). Putting that data into a dimensional model was done by analyzing the target data and carefully dissecting how it was created. Ultimately you want to create as few dimensions as possible, but enough such that everything is in third normal form. Anything less is sloppy. With those requirements in mind, and given enough time to reverse engineer the code, the output was simple. My documentation and use of Lucidchart for diagrams was extensive. It allowed me to look at attributes from a high level, see the related ETLs and source systems, drill down to specific behavior, and then debate design decisions with peers, the team, and myself. This process was also quite time consuming.
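To make the end state concrete, here is a rough sketch of the target shape, with hypothetical table and column names rather than the actual model: a slim event fact table surrounded by a handful of dimensions.

```sql
-- Hypothetical sketch of the target star schema; names are illustrative only.
CREATE TABLE dim_account (
  account_id    BIGINT,      -- natural key from the source system
  account_name  STRING,
  segment       STRING
)
STORED AS ORC;

CREATE TABLE dim_product (
  product_id    BIGINT,      -- natural key
  product_name  STRING
)
STORED AS ORC;

CREATE TABLE fact_billing_event (
  event_time    TIMESTAMP,
  account_id    BIGINT,      -- joins to dim_account
  product_id    BIGINT,      -- joins to dim_product
  amount        DECIMAL(18,2)
)
STORED AS ORC;
```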
Another feature worth noting is that every dimension would be Type 1. Creating Type 2 SCDs in SQL Server, Teradata or Oracle is trivial. Doing it in hive requires some overhead. Although you can use MR functions (some were available) to create surrogate keys, using anything other than natural keys was overkill for my needs. I was already reverse engineering 10k+ lines of code, and it was taking much longer than expected, so minimizing any cost was beneficial. Type 2 SCDs were one of those things that were nice to have but didn’t matter for this datamart. I was creating an event fact table, most attributes were immutable, thus natural keys were just fine. If an aggregate fact table were created on top of my event fact table, then surely surrogate keys would be introduced. But that’s another project for another time.
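A Type 1 refresh in hive can be as simple as rebuilding the dimension in place. A minimal sketch, assuming a hypothetical staging table that holds the current source snapshot:

```sql
-- Type 1 refresh: the latest attribute values simply overwrite the old ones.
-- No surrogate keys, no effective-date ranges; the natural key is the key.
-- staging_accounts is a hypothetical landing table with the current snapshot.
INSERT OVERWRITE TABLE dim_account
SELECT account_id,        -- natural key, reused as-is by the fact table
       account_name,
       segment
FROM   staging_accounts;
```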
Analytical Application Specification:
The reporting layer ultimately drives what it is you do. In my case, my original reporting layer was QlikView, and my new reporting layer would be a proprietary internal tool at LinkedIn.
Side story: An interesting feature of this project is that QlikView has its own columnar database. This allows QlikView to offer low response times to users (good). It also increases the cost to ingest data from a normalized data set into QlikView (bad). The old datamart was created to support QlikView, and design decisions were made specifically to embrace QlikView (very bad). In doing so, many standard design principles were disregarded. First & second normal form target tables were created in the old database (huh?). One goal of this project would be to put everything into a pretty event table star schema with convenient third normal form tables. This would serve as the foundation for reporting data such that aggregate tables could be built on top.
For this project, I asked the following questions and the answers determined the path forward:
- How will data be consumed? Azkaban
- How will dependencies be managed? Appworx
- How will transformations be applied to the data? Hadoop/hive
- How will data be stored? HDFS
The technology decisions were largely made for me because my technical options were limited. If the new company was predominantly using X technology, then so would I.
=> The event fact table determined the grain of my fact.
=> The star schema determined how the event fact table would be managed.
=> Attributes in dimensions were immutable.
=> Azkaban determined how data would flow between servers.
=> Hadoop determined where the datamart would be kept.
=> Appworx determined how it would be glued together.
In retrospect, I can say phase 3 of this project organized itself. At the time, it wasn’t so simple. It was a lot of asking myself if something would work, validating with tests, then moving to the next step. I wasted a few days testing Presto, which is an example of testing that did not work out in my favor. I had to quickly educate myself on new technologies in order to validate my approach to solving these architecture problems.