within a product life cycle management and software product development company.
Process Description and Responsibilities:
- Developing, testing and implementing SQL queries on AWS Redshift.
- Extending the existing ETL solution (Jenkins -> shell scripts -> Redshift SQL queries) by adding new flows and/or modifying existing ones.
- Creating data flows for new sources of data.
- Developing shell automation scripts that interact with various AWS services, Jenkins, and Slack (used for alerting specific user groups when processes fail).
- Developing and maintaining the alerting system (Slack, email) for critical processes; a minimal example of such an alert follows this list.
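A minimal sketch of a failure alert, assuming a Slack incoming webhook whose URL lives in a SLACK_WEBHOOK_URL environment variable (the variable name, message format, and script arguments are illustrative, not the project's actual setup):

    #!/bin/bash
    # Post a failure alert to Slack via an incoming webhook.
    # SLACK_WEBHOOK_URL is a placeholder for the real webhook endpoint.
    set -euo pipefail

    PROCESS_NAME="$1"   # e.g. "copy-to-staging"
    ERROR_DETAIL="$2"   # short description of the failure

    curl -s -X POST \
         -H 'Content-type: application/json' \
         --data "{\"text\": \"ETL ALERT - ${PROCESS_NAME} failed: ${ERROR_DETAIL}\"}" \
         "$SLACK_WEBHOOK_URL"

A wrapper around each critical step can call a helper like this whenever the step exits with a non-zero status.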
For the last two years, the project I was involved in was dedicated to a client from the airline industry. The data collected from the booking platforms feeds the data warehouse daily, and since many marketing campaigns are based on the daily data feeds, the ETL is one of the critical processes. The technologies used for the process:
- Jenkins
- acting as an automation server that runs specific jobs at specific times, triggering the execution of sh scripts.
- AWS S3
- service used for raw data storage. This is where the data is extracted from.
- AWS Redshift
- data warehouse; data is loaded into Redshift and then transformed according to the business logic.
- Shell scripts
- all the logic behind the ETL process is built on shell scripts and SQL queries; the SQL part is embedded in the sh scripts.
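As an illustration of the Jenkins -> shell -> Redshift chain, a minimal wrapper script of the kind Jenkins would trigger (the cluster endpoint, database, and user below are placeholders):

    #!/bin/bash
    # Run a Redshift SQL file through psql; Jenkins calls this script and
    # fails the job if psql exits with a non-zero status.
    # All connection details below are placeholders.
    set -euo pipefail

    SQL_FILE="$1"

    export PGPASSWORD="$REDSHIFT_PASSWORD"
    psql -h example-cluster.abc123.eu-west-1.redshift.amazonaws.com \
         -p 5439 \
         -d analytics \
         -U etl_user \
         -v ON_ERROR_STOP=1 \
         -f "$SQL_FILE"

ON_ERROR_STOP makes psql abort on the first failing statement, so Jenkins sees the failure instead of silently continuing.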
Data structures:
There are two different schemas for the ETL process:
- a staging one, which holds the data exactly as it appears in the raw files.
- a production one, which holds the same data, but transformed and with different data types. Both schemas contain the same number of tables.
The production tables have two additional columns used to keep track of when data was loaded into the table.
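A hypothetical illustration of a staging/production table pair (table and column names are invented for the example, and the actual tracking-column names are not reproduced here):

    # REDSHIFT_CONN is a placeholder connection string,
    # e.g. postgresql://etl_user:***@example-cluster:5439/analytics
    psql "$REDSHIFT_CONN" <<'SQL'
    -- staging: raw data exactly as it appears in the files
    CREATE TABLE staging.bookings (
        booking_id   VARCHAR(64),
        booking_date VARCHAR(32),   -- kept as plain text at this stage
        amount       VARCHAR(32)
    );

    -- production: same data with proper types, plus two tracking columns
    -- (load_date/update_date are hypothetical names)
    CREATE TABLE prod.bookings (
        booking_id   VARCHAR(64) NOT NULL,
        booking_date DATE,
        amount       DECIMAL(12,2),
        load_date    TIMESTAMP DEFAULT GETDATE(),
        update_date  TIMESTAMP
    );
    SQL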
Process flow:
Data is sent by the client to an internal SFTP server. From there, an AWS Data Pipeline job copies the daily files (.gz format) to the S3 bucket. The process is scheduled to run every day at a specific time, triggered by a Jenkins job that first checks whether all the needed files are available in the S3 location. If one or more files are missing, the process stops and an alert email is sent. If all files are in the S3 bucket, the process can start.
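A sketch of that pre-flight check, assuming the AWS CLI is available on the Jenkins agent (bucket, prefix, file names, and the alert address are placeholders):

    #!/bin/bash
    # Verify that every expected daily file exists in S3 before the ETL
    # starts; otherwise send an alert email and abort with a non-zero
    # exit code so the Jenkins job is marked as failed.
    set -euo pipefail

    TODAY=$(date +%Y%m%d)
    BUCKET="s3://example-raw-data/daily/${TODAY}"

    for f in bookings.gz passengers.gz segments.gz; do
        if ! aws s3 ls "${BUCKET}/${f}" > /dev/null; then
            echo "Missing file: ${BUCKET}/${f}" \
                | mail -s "ETL aborted: missing input file" etl-team@example.com
            exit 1
        fi
    done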
Data is extracted from the S3 location and placed first in the staging tables.
Staging tables are truncated every day right before the COPY to staging runs, so staging always holds only the current day's data and can be inspected in its raw format if needed.
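The daily load into staging could look like the sketch below (table, bucket, IAM role, and delimiter are placeholders; REDSHIFT_CONN is a placeholder connection string as above):

    # Truncate staging, then COPY the gzipped daily file from S3.
    TODAY=$(date +%Y%m%d)
    psql "$REDSHIFT_CONN" -v ON_ERROR_STOP=1 <<SQL
    TRUNCATE TABLE staging.bookings;

    COPY staging.bookings
    FROM 's3://example-raw-data/daily/${TODAY}/bookings.gz'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-copy'
    GZIP
    DELIMITER '|';
    SQL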
From staging, data is loaded into the production tables. For this step, shell scripts run SQL queries that handle the UPSERT logic for each production table. This avoids duplicates and keeps only the most recent 'version' of the data in the production tables.
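One common way to implement UPSERT on Redshift is a delete-then-insert merge wrapped in a transaction; a sketch using the same placeholder tables as above:

    psql "$REDSHIFT_CONN" -v ON_ERROR_STOP=1 <<'SQL'
    BEGIN;

    -- remove production rows that have a newer version in staging
    DELETE FROM prod.bookings
    USING staging.bookings s
    WHERE prod.bookings.booking_id = s.booking_id;

    -- insert the latest version, casting raw text to production types
    INSERT INTO prod.bookings (booking_id, booking_date, amount, update_date)
    SELECT booking_id,
           booking_date::DATE,
           amount::DECIMAL(12,2),
           GETDATE()
    FROM staging.bookings;

    COMMIT;
    SQL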
Once data is available in the production tables, other sh scripts handle specific transformations so that the data is ready for downstream jobs to pick up, such as various marketing campaigns and client reports.
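As a purely hypothetical example of such a transformation, a script might rebuild a derived summary table that downstream campaign jobs read:

    psql "$REDSHIFT_CONN" -v ON_ERROR_STOP=1 <<'SQL'
    BEGIN;
    TRUNCATE TABLE prod.daily_booking_summary;

    -- aggregate bookings per day for downstream consumers
    INSERT INTO prod.daily_booking_summary
    SELECT booking_date,
           COUNT(*)    AS bookings,
           SUM(amount) AS revenue
    FROM prod.bookings
    GROUP BY booking_date;
    COMMIT;
    SQL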
Failures and action items:
- data is available but the files hold invalid data -> data checks are executed to identify the exact column/value causing the issue (see the sketch after this list).
- data is available but the files are corrupted -> the data provider is notified.
- data is not available as per the agreed schedule -> the data provider is notified.
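A sketch of the kind of data check used for the invalid-data case, assuming the placeholder staging table from earlier (the expected date format is illustrative):

    # Find staging rows whose raw text would fail the cast to DATE,
    # e.g. values not shaped like YYYY-MM-DD.
    psql "$REDSHIFT_CONN" <<'SQL'
    SELECT booking_id, booking_date
    FROM staging.bookings
    WHERE booking_date !~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'
    LIMIT 100;
    SQL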
Responsibilities:
- Daily monitoring of the process, making sure the data goes from S3 to the production tables without errors.
- In case of any failure, acting as soon as possible.
- If a new file needs to be integrated into the existing process, making sure the flow will not be affected and that the changes follow the current solution.
- Testing any new file or data source separately from the ETL (if possible) before integrating it, running quality checks and reporting anything that could impact the ETL.
- Developing and updating the SQL logic if there is a clear business requirement.
- When an existing flow needs to be replaced with another one, making sure the new feed follows the standards already in place in terms of S3 availability, raw file formats, column types, and distinct values for columns holding different codifications. If possible, creating a backup of the data coming from the old feed, as this helps when comparing against the 'new' data (see the comparison sketch below).
- Documenting any new implementations to the existing ETL process.
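A sketch of that old-vs-new feed comparison using EXCEPT (the backup schema and table names are placeholders):

    psql "$REDSHIFT_CONN" <<'SQL'
    -- rows present in the old feed but missing from the new one
    SELECT * FROM backup.bookings_old_feed
    EXCEPT
    SELECT * FROM staging.bookings;

    -- rows the new feed added
    SELECT * FROM staging.bookings
    EXCEPT
    SELECT * FROM backup.bookings_old_feed;
    SQL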