A very typical task within Data Engineering departments is to source new data. Here is a checklist of common questions you should ask. If you're given a project to consume data, asking these questions up front will help you avoid pitfalls.
- How many files are involved?
- What are the file names?
For each file, what is the name, current size, and number of rows? How much does the file grow each day? How many rows are added each day? These answers speak to your ability to scale.
- How frequently are files created?
- At what time are the files created?
- Where can I find definitions for each data element for each file?
- Is there a standard time zone used such as GMT, PST or PDT?
- Will each row in the file contain a created_date and updated_date or equivalent?
If the answer is no, ask whoever is in charge to add this data. Typically you are looking for a timestamp of when each row was created and another for when it was last modified, so new and changed rows can be identified.
- Can data be updated historically?
If data is going to be updated historically then you want to be aware of it. If you're consuming financial transactions, odds are they never change: you consume them once and move on. If you're consuming user data, there will be times when that data changes, and on a periodic basis you will want to consume the data that has changed. This is very intuitive stuff, but by asking this question you are making an explicit agreement with the source.
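A minimal sketch of what that agreement buys you, assuming the source exposes an updated_date column and you persist a watermark between runs (table and column names here are hypothetical):

```python
# Watermark-based incremental pull: only fetch rows created or changed since
# the last successful run. Uses an in-memory SQLite table as a stand-in source.
import sqlite3

def pull_changed_rows(conn, last_watermark):
    """Return rows whose updated_date is newer than the stored watermark."""
    cur = conn.execute(
        "SELECT id, payload, updated_date FROM source_table "
        "WHERE updated_date > ? ORDER BY updated_date",
        (last_watermark,),
    )
    return cur.fetchall()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE source_table (id INTEGER, payload TEXT, updated_date TEXT)")
    conn.execute("INSERT INTO source_table VALUES (1, 'a', '2024-02-01T09:15:00Z')")
    conn.execute("INSERT INTO source_table VALUES (2, 'b', '2024-03-01T09:15:00Z')")

    watermark = "2024-02-15T00:00:00Z"  # persisted from the previous run
    changed = pull_changed_rows(conn, watermark)
    if changed:
        # Advance the watermark only after the target load succeeds.
        watermark = max(row[2] for row in changed)
    print(f"{len(changed)} changed rows; new watermark {watermark}")
```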
- Can the data contain special characters, double quotes, single quotes, carriage returns, or tabs?
This speaks to how you import data. Be sure the target database is capable of consuming data in the format you expect. Take Informatica, for example. Informatica typically expects a carriage return to signal the end of a row. If there are carriage returns within the values of elements, Informatica will terminate the row early and throw an exception. FWIW I despise Informatica, but I see its value in the marketplace. My advice is to build your own tools so you are not left at the mercy of third-party solutions like Informatica.
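A small sketch of why this matters when you build your own tooling: Python's csv module handles embedded newlines and quotes correctly as long as those fields are quoted, whereas a naive splitlines()/split(',') parser would cut the first row in half.

```python
# Quoted fields may contain newlines, doubled quotes, and the delimiter itself.
import csv
import io

raw = 'id,comment\n1,"line one\nline two"\n2,"she said ""hi"""\n'

rows = list(csv.reader(io.StringIO(raw)))
print(rows)
# [['id', 'comment'], ['1', 'line one\nline two'], ['2', 'she said "hi"']]
```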
- Can the data be compressed?
- How are columns separated?
I personally prefer ^Z for flat text files but whatever works.
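For illustration, assuming ^Z means the ASCII SUB control character (0x1A), parsing such a file is trivial precisely because that byte almost never appears inside real values:

```python
# Split columns on a control-character separator that real data won't contain.
SEP = "\x1a"  # Ctrl-Z / ASCII SUB

raw_lines = [
    SEP.join(["user_id", "email", "city"]),
    SEP.join(["42", "a@b.com", "Reno"]),
]

for line in raw_lines:
    print(line.split(SEP))
```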
- Can the files be time stamped?
If you use ISO format, massage the timestamp so it is filename-friendly. For a file time stamped at 2007-04-05T14:30, the filename could look like TableName_200704051430.tgz, or for epoch time, TableName_1447449702.tgz.
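A minimal sketch of generating both styles; the table name and .tgz extension simply follow the examples above:

```python
# Build filename-friendly timestamps in both ISO-compact and epoch form.
from datetime import datetime, timezone

now = datetime(2007, 4, 5, 14, 30, tzinfo=timezone.utc)

iso_name = f"TableName_{now.strftime('%Y%m%d%H%M')}.tgz"  # TableName_200704051430.tgz
epoch_name = f"TableName_{int(now.timestamp())}.tgz"      # TableName_1175783400.tgz

print(iso_name, epoch_name)
```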
- How are NULL values handled?
Be sure that your approach to handling NULL values matches the source's. Is an empty string a NULL? Is the literal string “NULL” the same as a NULL?
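A minimal sketch of making that convention explicit in code, assuming the source uses the literal string "NULL" and you agree to treat empty strings as NULL as well; adjust the token set to whatever the source actually commits to:

```python
# Map the source's NULL representations to Python None in one place.
NULL_TOKENS = {"", "NULL", "null", "\\N"}

def normalize_null(value: str):
    """Return None for any agreed NULL token, otherwise the original value."""
    return None if value.strip() in NULL_TOKENS else value

print(normalize_null("NULL"), normalize_null(""), normalize_null("0"))
# None None 0
```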
- How do we know if a file contains partial data?
Create a mechanism that lets you validate that source data has been delivered completely. One approach is to count rows at the source and at the target and compare.
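A minimal sketch of that row-count check, assuming the provider ships an expected row count alongside each delivery (the manifest idea and gzipped, headerless file format are assumptions):

```python
# Compare the delivered row count against what the source says it sent.
import gzip

def count_rows(path: str) -> int:
    """Count data rows in a gzipped, headerless extract."""
    with gzip.open(path, "rt") as fh:
        return sum(1 for _ in fh)

def verify(data_path: str, expected_rows: int) -> None:
    actual = count_rows(data_path)
    if actual != expected_rows:
        raise ValueError(f"partial delivery: expected {expected_rows} rows, got {actual}")
```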
- For a failure where source data is not available for several days, how is recovery handled?
Ensure that the source provider has the ability to redeliver missed data, and in the format you expect. For years I sourced data from Efficient Frontier and they were notorious for data delivery problems. They hardly ever recognized failures themselves, but that wasn't the frustrating part. After being made aware of a failure they would deliver a new set of data, and the new set would be in a different format.
- What is the SLA for data being delivered?
Have your source commit to a time when data is expected.
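A small sketch of holding the source to that commitment, where the path, deadline, and alerting mechanism are all assumptions standing in for whatever you negotiate:

```python
# Alert if the agreed file has not arrived by the committed delivery time.
import os
from datetime import datetime, time

SLA_DEADLINE = time(6, 0)  # e.g. files due by 06:00
EXPECTED_FILE = "/data/incoming/TableName_today.tgz"  # hypothetical path

def sla_breached(now: datetime) -> bool:
    return now.time() > SLA_DEADLINE and not os.path.exists(EXPECTED_FILE)

if sla_breached(datetime.now()):
    print("SLA breach: expected file has not arrived")  # swap in real alerting
```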
- What data retention policies are in place for source data?
This speaks to your ability to recover old data. Having to consume data from Efficient Frontier taught me this lesson as well. Say the source has a data retention policy of 14 days, after which the data is deleted. If after 14 days you find a problem with the data delivered, and the data has been deleted at the source, then you're out of luck.