Project 002: Extract, Transform, Load (ETL) Pipeline


This is the culmination of an ETL pipeline project I worked on alongside four other colleagues from Northcoders. During this project I was primarily responsible for the main Lambda deployment script, written in Python using boto3. My main objective was to create a script which deployed a Lambda function to AWS as quickly and efficiently as possible. In fact, largely as a result of the script's efficiency, we were the only team of four to complete the project!

I'm sure it isn't perfect, but I tried my best. Please feel free to view the script here:

https://github.com/TomCoder442/de-pyoneers-project/blob/3018ae76daee49660d545dcfc5c9fbc635ad6ab2/deployment/deployment.py


Note: I'm the second speaker.

Speaker 1 (00:00)

Hello and welcome to our presentation. We are the Pioneers. My name is Hassan, and I'll introduce the rest of our team: we have Dawood, Tom, Edmund and Joe. We have created an application that extracts, transforms and loads data from a prepared source into a warehouse hosted in AWS. Using Python, AWS and database modelling, we had to manipulate raw data, transforming it into data that we could then import into our destination database, where we could monitor it at different intervals. All of this was written in Python and runs on AWS. During our project we used different work practices and methodologies, and we put agile practices in place using the different tools at our disposal. I'll now pass it on to Joe to demonstrate the different components of our project.


Speaker 2: Joseph Rigby (00:45)

Thanks, Hassan. So we decided to write the deployment script in Python using Boto3, and used Moto for TDD to structure the script. We wrapped all of the basic functions into a Python class, which included ten modular methods, one for each basic stage: for example, one method for bucket creation, another for S3 policy creation, et cetera. At the end of the class we created a master method which called all of the previous methods, along with sleeps and progress messages. Using this technique we were able to deploy each Lambda function very easily: instantiating the class with only four simple arguments and then invoking the master method. So just two lines of code to deploy each Lambda function to the cloud. The result was three separate Lambda functions, extract, transform and load, running on a regular schedule inside AWS. Now I'll pass you on to Dawood, who will explain the GitHub Actions and extract.
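
To give a rough idea of the deployment-class pattern Joe describes here, below is a minimal sketch. The class name, region, bucket, role and file names are illustrative placeholders, not the project's actual code (which is linked above).

import io
import time
import zipfile
from pathlib import Path

import boto3


class LambdaDeployer:
    # Illustrative class: wraps each basic deployment stage in a small
    # method, plus a master method that runs them all in order.

    def __init__(self, function_name, source_file, code_bucket, role_arn):
        self.function_name = function_name
        self.source_file = source_file
        self.code_bucket = code_bucket
        self.role_arn = role_arn
        self.handler = f"{Path(source_file).stem}.lambda_handler"
        self.s3 = boto3.client("s3")
        self.lambda_client = boto3.client("lambda")

    def create_bucket(self):
        # One modular method per stage, e.g. creating the code bucket.
        self.s3.create_bucket(
            Bucket=self.code_bucket,
            CreateBucketConfiguration={"LocationConstraint": "eu-west-2"},
        )

    def upload_code(self):
        # Zip the handler module in memory and push it to the code bucket.
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, "w") as archive:
            archive.write(self.source_file)
        self.s3.put_object(Bucket=self.code_bucket, Key="lambda.zip", Body=buffer.getvalue())

    def create_function(self):
        # Create the Lambda function pointing at the uploaded code.
        self.lambda_client.create_function(
            FunctionName=self.function_name,
            Runtime="python3.11",
            Role=self.role_arn,
            Handler=self.handler,
            Code={"S3Bucket": self.code_bucket, "S3Key": "lambda.zip"},
        )

    def deploy(self):
        # Master method: run every stage with a progress message and a short pause.
        for step in (self.create_bucket, self.upload_code, self.create_function):
            print(f"Running {step.__name__}...")
            step()
            time.sleep(2)


# Two lines to deploy a function, as described above (all values are placeholders):
# deployer = LambdaDeployer("extract-lambda", "extract.py", "my-code-bucket",
#                           "arn:aws:iam::123456789012:role/my-lambda-role")
# deployer.deploy()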


Speaker 3 (01:40)

Thank you, Joe. Our project was deployed using GitHub Actions for CI/CD. The deployment Python script mentioned by Joe in the previous section was run within a YAML workflow, as was the Makefile that handles all necessary requirements, dependencies and tests. We also used the YAML file to set up our AWS configuration and store the credentials in GitHub Secrets. The extract function is the first Lambda function, triggered by EventBridge, which pulls all data from the source database and stores it in an S3 bucket. Subsequent triggers of this Lambda function only pull newly updated data. The function uses SQLAlchemy to interact with the source database, then manipulates the pulled data and converts it to CSV using Pandas. Finally, the Boto3 library is used to connect and send the data to the relevant S3 buckets created by the deployment script mentioned in the previous section. Next, Tom is going to explain how these CSV files are processed in the second Lambda function of the pipeline.
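
A simplified sketch of what such an extract Lambda might look like is below. The table name, bucket name and connection string are placeholders, and the incremental "only newly updated data" logic is omitted for brevity.

import boto3
import pandas as pd
from sqlalchemy import create_engine

INGESTION_BUCKET = "ingestion-bucket"  # placeholder: bucket created by the deployment script
SOURCE_DB_URL = "postgresql://user:password@host:5432/source_db"  # placeholder credentials


def lambda_handler(event, context):
    # Pull data from the source database with SQLAlchemy...
    engine = create_engine(SOURCE_DB_URL)
    df = pd.read_sql("SELECT * FROM sales_order;", engine)

    # ...convert it to CSV with Pandas...
    csv_body = df.to_csv(index=False).encode("utf-8")

    # ...and send it to the ingestion S3 bucket with Boto3.
    s3 = boto3.client("s3")
    s3.put_object(Bucket=INGESTION_BUCKET, Key="sales_order.csv", Body=csv_body)
    return {"rows_extracted": len(df)}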


Speaker 4 (02:34)

Thank you, Dawood. So, following the completion of the extract process, we had all the data from the source database in the AWS ingestion S3 bucket in CSV format. It was now time to transform this data. The project requirements were that the data be transformed to a very specific schema and then converted into Parquet file format. We used the popular open-source data analysis and manipulation tool Pandas to do the data transformation. We had never used Pandas before, but found it quite intuitive to pick up and flexible to use. We used the interactive development environment Jupyter to build up our transform code. Jupyter was also new to us all, something we hadn't used prior to this project, and we found it invaluable, especially given the iterative nature of creating the transformation code. Using predominantly these two tools, Pandas and Jupyter, we now had the data transformed into Pandas DataFrames in the required schema. We then converted the data into Parquet file format and loaded these files into the processed S3 bucket. From there, we loaded these files into the destination warehouse. Edmund will now speak to the final load part of the project.
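
The sketch below illustrates the general shape of this transform step. The column names and the example fact table are invented for illustration and are not the project's real schema; Parquet output assumes a pyarrow (or fastparquet) install.

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")


def transform_sales(ingestion_bucket="ingestion-bucket", processed_bucket="processed-bucket"):
    # Read the raw CSV produced by the extract Lambda into a DataFrame.
    response = s3.get_object(Bucket=ingestion_bucket, Key="sales_order.csv")
    raw = pd.read_csv(response["Body"])

    # Reshape the data towards the required schema (illustrative transformations only).
    fact = raw.rename(columns={"staff_id": "sales_staff_id"})
    fact["created_date"] = pd.to_datetime(fact["created_at"]).dt.date

    # Convert to Parquet and load the file into the processed bucket.
    buffer = io.BytesIO()
    fact.to_parquet(buffer, index=False)
    s3.put_object(Bucket=processed_bucket, Key="fact_sales_order.parquet", Body=buffer.getvalue())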


Speaker 5 (03:40)

Thank you, Tom. The load function takes the transformed Parquet files, converts them into Pandas DataFrames, and uploads the data into the destination PostgreSQL database. After the initial setup, we only attempt to load transformed data which has been newly created or updated in the source database, and this ensures that there are no conflicts or duplicate entries. In the end, we managed to get a fully functioning pipeline, fully deployable with code. The following dashboard slides show how the transformed data can now be presented in an easily digestible format to someone not familiar with coding practices. The first slide demonstrates overdue payments, which could be helpful for a finance department, for example, and the next three slides give an at-a-glance overview of payments, sales and purchases, offering some insight into the composition of those categories. We really enjoyed working on this project together. We hope you enjoyed our presentation, and thank you for listening.
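
As an illustration of this load step, the sketch below reads a Parquet file back into a DataFrame and appends it to a warehouse table. The warehouse URL, bucket and table names are placeholders, and the new-or-updated-rows filtering described above is omitted for brevity.

import io

import boto3
import pandas as pd
from sqlalchemy import create_engine

WAREHOUSE_URL = "postgresql://user:password@warehouse-host:5432/warehouse"  # placeholder


def load_fact_sales(processed_bucket="processed-bucket"):
    # Fetch the transformed Parquet file and read it back into a DataFrame.
    s3 = boto3.client("s3")
    response = s3.get_object(Bucket=processed_bucket, Key="fact_sales_order.parquet")
    df = pd.read_parquet(io.BytesIO(response["Body"].read()))

    # Append the rows to the destination warehouse table.
    engine = create_engine(WAREHOUSE_URL)
    df.to_sql("fact_sales_order", engine, if_exists="append", index=False)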

 Get in touch at Jobs@josephrigby.com