Workload: Data Engineer: 100%, DevOps: 20%
Technology: Data engineering: Python, PySpark, AWS Glue. DevOps: Terraform. Others: GitHub.
I joined the project as a data engineer. The size of the data team varied between two and five people. I was responsible for implementing ETL jobs in PySpark and running them on AWS Glue. I also contributed to deploying AWS services such as S3, Firehose, and Lambda using Terraform and a CI/CD pipeline.
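For illustration, the skeleton below shows the general shape of such a Glue PySpark job. It is a minimal sketch, not the project's actual code: the bucket names, paths, and the deduplication step are invented.

```python
# Minimal sketch of a Glue PySpark job; all S3 paths and column
# names are hypothetical examples, not the project's real values.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Glue passes the job name (and any custom parameters) via sys.argv.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw events from S3, apply a simple cleaning step, and write
# the result back to the data lake as Parquet.
raw = spark.read.json("s3://example-raw-bucket/events/")
cleaned = raw.dropDuplicates().na.drop(subset=["event_id"])
cleaned.write.mode("overwrite").parquet("s3://example-lake-bucket/events/")

job.commit()
```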
The project's goal was to replace a legacy Machine Learning (ML) pipeline with a new one. The new pipeline uses inexpensive, scalable services to move data from different sources into the data lake. The ETL jobs I implemented clean the data before it is used to train the ML models.
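As an example of what "cleaning" means here, the snippet below sketches a typical pre-training step. The column names are invented and the actual rules differed per data source.

```python
# Sketch of a pre-training cleaning step; "record_id", "label", and
# "event_ts" are hypothetical column names used for illustration.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def clean_for_training(df: DataFrame) -> DataFrame:
    """Deduplicate records, drop rows without a label, parse timestamps."""
    return (
        df.dropDuplicates(["record_id"])
        .filter(F.col("label").isNotNull())
        .withColumn("event_ts", F.to_timestamp("event_ts"))
    )
```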
The two main challenges I faced were:
- AWS Glue's feature set was still basic at the time. For example, I had to implement the ETL jobs in Python 2.7.
- The new ETL jobs were meant to replace old ETL jobs running on another system. The legacy code was written in Jython, so it was hard, first, to understand its logic and, second, to build a test harness that ran the old and the new implementations on the same input and verified that both produced identical results (a sketch follows this list).
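To illustrate the second point, here is a minimal sketch of such a parity check in PySpark. The S3 paths are hypothetical, and it assumes both jobs write their output with the same schema.

```python
# Parity check between the legacy and the new ETL outputs;
# the bucket paths below are made-up placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parity-check").getOrCreate()

old_out = spark.read.parquet("s3://example-bucket/legacy-output/")
new_out = spark.read.parquet("s3://example-bucket/new-output/")

# Rows present in one output but not the other; both differences
# must be empty for the implementations to count as identical.
only_in_old = old_out.subtract(new_out)
only_in_new = new_out.subtract(old_out)

assert only_in_old.count() == 0 and only_in_new.count() == 0, (
    "legacy and new ETL outputs diverge"
)
```

Comparing the two set differences catches missing rows, extra rows, and altered values in a single pass, which is why it is a convenient shape for this kind of migration test.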