Workload: Lead: 100%; Data Engineer: 100%; DevOps: 20%; Data Scientist: 20%; Reporting and Dashboarding: 40%.
Technology: Data Engineering: Python, PySpark, Databricks; Data Science: Python, statistics; DevOps: Terraform; Reporting and Dashboarding: Power BI; Others: GitLab, Jira, Confluence.
I joined the project as a data engineer directly after the discovery phase, and shortly after that I became the tech lead of the data team, which varied in size between two and five people. I was responsible for defining the architecture of the data pipeline and the implementation strategy. I can divide the work on that project into three phases:
Phase 1: The research
I spent time conducting research and running experiments to understand the properties of the data that would flow through the pipeline: its source, type, frequency, and size. Based on that research, I decided which Azure managed services we would use, at what scale, and what the associated costs would be in the short, mid, and long term.
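To give a flavour of what those experiments looked like, here is a minimal PySpark profiling sketch. The storage path and the event_time column are illustrative assumptions, not the project's real names:

# Minimal profiling sketch for one candidate source (path and column
# names are assumptions for illustration; real sources differed per endpoint).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("source-profiling").getOrCreate()

# Load a sample extract from the source under investigation.
df = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/source_a/sample/")

# Size and volume: row counts feed directly into service sizing and cost estimates.
print("rows:", df.count())

# Type: the inferred schema shows which columns will need explicit casting later.
df.printSchema()

# Frequency: bucketing an event timestamp per hour shows how often data arrives.
# (event_time is assumed here; it is cast in case the source delivers it as a string.)
(df.groupBy(F.window(F.col("event_time").cast("timestamp"), "1 hour"))
   .count()
   .orderBy("window")
   .show(truncate=False))

Numbers like these made it much easier to argue for a specific service tier instead of guessing.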
Phase 2: The implementation
I ensured that we deployed all the managed services through Terraform, following the concept of Infrastructure as Code (IaC). I developed what I called the "Test Strategy for Data Quality": a set of testing methods that verify the quality of the data along the pipeline. We followed Test-Driven Development (TDD) when writing the Python code for the ingestion and ETL layers; a sketch of such a test follows below. All our deployments to staging and production went through a well-designed CI/CD pipeline. We carefully designed our data lake to store the data systematically, taking data governance and security into account. We connected tens of endpoints from many data sources to the data pipeline and implemented more than a dozen ETL jobs.
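As a minimal sketch of how such a check looked under TDD, assume a hypothetical rule that customer_id must almost always be present. The column name and threshold are illustrative, not the production rules:

# One data-quality check, TDD-style: the test encodes the quality rule,
# the helper implements it. Names and thresholds are illustrative only.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F


def null_ratio(df: DataFrame, column: str) -> float:
    """Return the fraction of rows in which `column` is null."""
    total = df.count()
    if total == 0:
        return 0.0
    return df.filter(F.col(column).isNull()).count() / total


def test_customer_id_completeness():
    # In the real pipeline this ran against a staged ingestion batch;
    # here a tiny in-memory frame stands in for it.
    spark = SparkSession.builder.appName("dq-tests").getOrCreate()
    df = spark.createDataFrame(
        [("c1", 10), ("c2", 20), ("c3", 30)],
        ["customer_id", "amount"],
    )
    # The quality rule: at most 5% of customer_ids may be missing.
    assert null_ratio(df, "customer_id") <= 0.05

In a setup like this, the CI/CD pipeline can run such tests as a gate, so a batch that violates a rule never reaches the next layer.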
Phase 3: The delivery
Since data engineering work is largely invisible to the other teams, we had to make our work visible by providing reports and dashboards. I personally took on the responsibility of building the Power BI reports and dashboards. I worked with the other teams to ensure that they could track business performance at the company and team levels from day one. I also analyzed part of the data and wrote data science reports to answer some important business questions.
Data projects carry more uncertainty than other types of projects, and for me, providing accurate estimates was the biggest challenge. I overcame that challenge by following my value no. 5: "Do the right thing and do the thing right." In practice, that means keeping the most critical tasks, the ones that move us toward our target, on the table; never hesitating to reorder those tasks as their priorities shift; and taking all the time needed to implement them properly. That helped us develop a solid and healthy foundation for the data pipeline. On top of that foundation, we built the flexible components of the pipeline, which we kept changing to meet the business needs.