Bridging the Gap: Embracing Best Practices in Data Engineering
In the dynamic world of data engineering, we have a unique opportunity to leverage the best practices followed by software engineers and apply them to our data pipelines. I want to shed light on an important principle that both data engineers and software engineers should embrace: the power of purging unnecessary code.
As engineers, we often develop a strong emotional attachment to our creations. However, we must recognize that cluttering our codebase with lines of commented-out code or leaving unused, gated code can hinder our progress and impact our ability to maintain robust and efficient data pipelines. This insight, inspired by the SRE book from Google, highlights the importance of adopting a shared mindset across engineering disciplines.
Some might argue, "What if we need that code later?" or suggest, "Why don't we just comment it out or gate it with a flag?" However, it's crucial to consider the long-term implications of such approaches. Lines of commented-out code can introduce confusion and distractions as our data pipelines evolve, while gated code that is perpetually disabled becomes a potential time bomb. We need to learn from past incidents, like the unfortunate experience of Knight Capital Americas LLC, and avoid the traps they encountered.
To bridge the gap between data engineering and software engineering, let's embrace the following principles:
1. Clean Code: Maintain a streamlined codebase that is easy to understand, navigate, and modify. Remove unnecessary code, and resist the temptation to clutter our pipelines with commented-out code fragments. This ensures clarity and readability for ourselves and our team members.
2. Version Control: Leverage the capabilities of version control systems to track and manage changes effectively. Utilize branching strategies, commit messages, and code reviews to ensure transparency and collaboration; deleted code remains recoverable from history, as the sketch after this list shows.
3. Collaboration and Communication: Foster a culture of open communication and collaboration between data engineers and software engineers. Share best practices, learn from each other's experiences, and seek opportunities to align our approaches.
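To make the principle tangible, here is a minimal Python sketch (the pipeline step, flag, and file names are invented for illustration): dead branches are deleted outright rather than commented out or hidden behind a permanently disabled flag, because version control already preserves them.

```python
# A hypothetical pipeline step cluttered with "just in case" code:
def transform_v1(df):
    # df = legacy_cleanup(df)      # commented out two releases ago
    # if ENABLE_OLD_JOIN:          # flag that has been False for a year
    #     df = old_join(df)
    return df.dropna()


# The same step after purging: only code that actually runs remains.
def transform(df):
    """Drop incomplete records; nothing else."""
    return df.dropna()


# The removed logic is not lost; version control keeps it recoverable, e.g.:
#   git log -p -- pipelines/transform.py
#   git checkout <old-commit> -- pipelines/transform.py
```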
By embracing these shared principles, we create a strong foundation for efficient and scalable data engineering practices. We can learn from the successes of software engineering and apply them to our domain, ultimately driving innovation and delivering high-quality data solutions.
#DataEngineering #SoftwareEngineering #BestPractices #CodePurging #Collaboration #Innovation
In data engineering, governance and auditing are of utmost importance. Yet these critical aspects are often neglected, or deemed too challenging because of their intricacies or a lack of specialized skills.
Through my research, I came across a noteworthy use case featured in the book "Delta Lake: The Definitive Guide" by O'Reilly (early release). This use case revolves around safeguarding individuals' information in compliance with GDPR (General Data Protection Regulation) and CCPA (California Consumer Privacy Act).
In this context, governance is pivotal, providing strategic direction through policies, standards, and guidelines. Of particular significance is the need to carry out delete requests, ensuring full compliance with GDPR and CCPA regulations and safeguarding individuals' privacy rights while maintaining data integrity.
Delta Lake seamlessly incorporates governance and auditing mechanisms into data engineering pipelines, offering robustness, reliability, and scalability.
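As a rough sketch of what such a delete request can look like in practice, here is a PySpark snippet using the delta-spark API; the table path, column name, and retention window are placeholders, and the exact workflow described in the book may differ.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with the Delta Lake extensions.
spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta table holding personal data.
customers = DeltaTable.forPath(spark, "/lake/silver/customers")

# GDPR/CCPA "right to be forgotten": remove all rows for the requesting person.
customers.delete("customer_id = '42-abc'")

# Auditing: every change is recorded in the Delta transaction log.
customers.history().select(
    "version", "timestamp", "operation", "operationParameters"
).show(truncate=False)

# After the retention period, physically remove the deleted files so the data
# can no longer be recovered via time travel (needed for true erasure).
spark.sql("VACUUM delta.`/lake/silver/customers` RETAIN 168 HOURS")
```

Note that VACUUM enforces a default retention threshold (seven days) to protect concurrent readers and time travel, so erasure of the underlying files only completes after that window.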
At KI performance, we remain committed to excellence in data engineering; having recognized the transformative potential of Delta Lake, we have started integrating it into our projects.
If you share our passion for data engineering and recognize the paramount importance of governance and auditing in your pipelines, I encourage you to explore the use case presented in "Delta Lake: The Definitive Guide."
#DataEngineering #GovernanceAndAuditing #DeltaLake #DataProtection #Compliance #GDPR #CCPA #DataIntegrity #Scalability
The Real Value of Data Science: Reducing Uncertainty to Drive Business Decisions
Throughout my journey as a researcher, data scientist, and now as a data engineer, I've come to a profound realization about the role of data science in creating business value. In our pursuit of perfection, many data scientists strive to develop the most accurate models imaginable, tirelessly battling for that last 1% of accuracy. However, despite these remarkable achievements, I've noticed that many of these models fail to deliver tangible business value through their predictions. Countless reasons have been offered to explain this phenomenon, and perhaps all of them hold some truth. Yet, what concerns me is our collective failure to learn from these mistakes, leading us to repeat the cycle of producing models that fall short of generating true value.
Recently, I stumbled upon a simple yet insightful idea while reading the book "How to Measure Anything: Finding the Value of Intangibles in Business," published back in 2010. It introduced a notion that struck a chord within me: "Using numbers alone doesn't make a weighted score a measurement. It must reduce uncertainty, and for that uncertainty reduction to be valuable, it must improve some decision." That fundamentally transformed my perspective on the real value of data science and machine learning.
In essence, I believe that the true power of data science lies in its ability to help us reduce uncertainty, enabling informed decision-making and empowering us to navigate the consequences of those decisions. It's not about obsessing over that elusive 1% of accuracy or the complexity of our models. Instead, it's about leveraging data-driven insights to mitigate uncertainty, thus enabling strategic choices that propel businesses forward.
This idea was further reinforced when I recently attended the Microsoft Build event opening and listened to Kevin Scott, the CTO at Microsoft, presenting "The era of the AI Copilot." He emphasized a pivotal truth: "The model is not your product." In a world increasingly reliant on artificial intelligence and machine learning, this statement holds immense significance. It reminds us that while our models serve as valuable tools, they are merely means to an end, a stepping stone toward driving informed decisions and delivering meaningful business outcomes.
So, how can we harness the full potential of data science to create substantial business value? It starts by shifting our focus from pure model perfection to the reduction of uncertainty. By developing models that aid decision-making, we equip decision-makers with the confidence to act, armed with insights that sharpen their understanding of potential outcomes. Data science becomes a partner in the decision-making process, facilitating better choices and providing a framework to manage the associated risks.
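A toy, entirely made-up calculation illustrates the point: the model below is valuable not because of its headline accuracy, but because its prediction changes a targeting decision and improves the expected outcome.

```python
# Decide whether to send a promotion to customers (all numbers invented).
p_responder = 0.10           # prior probability a customer responds
profit_if_response = 50.0    # profit when a targeted customer responds
cost_of_targeting = 8.0      # cost of contacting one customer

# Without the model we can only target everyone or no one.
ev_target_all = p_responder * profit_if_response - cost_of_targeting   # -3.0
ev_without_model = max(ev_target_all, 0.0)                              # do nothing

# An imperfect model flags 20% of customers, of whom 35% actually respond:
# it has reduced our uncertainty about who is worth contacting.
flagged_share, p_respond_given_flag = 0.20, 0.35
ev_with_model = flagged_share * (
    p_respond_given_flag * profit_if_response - cost_of_targeting
)                                                                        # 1.9

print(f"expected profit per customer without the model: {ev_without_model:.2f}")
print(f"expected profit per customer with the model:    {ev_with_model:.2f}")
# The uncertainty reduction is valuable because it flips the decision from
# "don't run the campaign" to "target only the flagged 20%".
```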
#datascience #machinelearning #ai #business #artificialintelligence #microsoft #msbuild #copilot
When it comes to building a data platform, engineers often find themselves caught up in the excitement of implementing technical solutions. They throw advanced technologies at the challenges, aiming to address business problems head-on. But amidst the rush, they sometimes forget a crucial element that holds the key to unlocking the platform's full potential: the users.
Onboarding users and enabling them to harness the platform's capabilities is essential for driving impactful results for our customers. Yet, this aspect is frequently overlooked or deprioritized. In his book, "Data Management at Scale: Best Practices for Enterprise Architecture," Piethein Strengholt emphasizes the significance of cultural aspects and their challenges, particularly in maintaining a data-driven culture and meeting governance requirements.
Strengholt rightly points out, "The cultural aspects require less rigid governance. It requires you to give users trust, to train people, and to work on awareness. These activities should not be underestimated." Indeed, our users are invaluable resources, possessing a deep understanding of specific data landscapes. By empowering them and enhancing their knowledge, we can significantly boost the efficiency of data utilization and maximize the impact of the platform.
In many of our projects, we have experienced the trials and tribulations that arise when attempting to train and equip people with the necessary skills. It has been a challenging journey that has taught us valuable lessons. We have realized the criticality of integrating the data platform's overarching components, including governance and enablement, right from the project's inception.
And now, with the recent announcement from Microsoft about Microsoft Fabric, our mission has reached another level of excitement. Enabling our customers to use Power BI and leverage a comprehensive suite of tools to bring data together using MS Fabric, enhanced by ChatGPT, will revolutionize how data platforms are experienced. It will make the process enjoyable, empowering users to make data-driven decisions like never before. This development is a true game changer in our industry.
Our diverse platform team has played a vital role in balancing business objectives and technological advancements. By fostering an environment where collaboration and innovation thrive, we have found a way to harmonize the needs of our customers with cutting-edge technology.
#dataplatform #dataengineering #dataculture #microsoft #msbuild #governance #msfabric #innovation
In the world of #dataengineering and #datascience, #security aspects often take a backseat, along with #governance. Many engineers and scientists may argue that it's not their job, leaving it to DevOps and security professionals. However, when it's time to go live and move into #production, neglecting security can become a massive roadblock. Engineers may find themselves needing to make significant changes to their pipelines, from removing hard-coded secrets to securing access and establishing private networks. The situation can worsen if the pipeline, built without considering security, fails to function as expected.
Educating engineers, scientists, and analysts about critical security aspects is crucial for a seamless transition from development to production. By including security topics and requirements from the project's inception, preparing development and production environments with security in mind, and providing proper education, we can prevent security issues and project delays.
The security education journey can start with non-technical knowledge. For example, it's important to guide the team on what information is allowed and not allowed to be shared about the project and its components. As Benjamin Perkins notes in "Microsoft Azure Architect Technologies and Design Complete Study Guide Exams AZ-303 and AZ-304," the first rule of security is similar to the first rule of Fight Club: don't openly discuss your patching strategies, software versions, or corporate firewall settings. The less someone knows about your protection measures, the less likely they are to compromise them. Sharing information about the project is akin to leaving it behind a glass wall; it's only a matter of time before someone shatters that barrier and gains unauthorized access.
On the technical front, cloud vendors provide managed services to enhance #data and #datapipeline security. For instance, #azure offers Key Vault and secrets within Azure DevOps variable groups as secure storage for sensitive data in applications and their CI/CD pipelines. Leveraging private networks and endpoints helps avoid reliance on public endpoints, reducing exposure to potential threats. By managing access to data through service principals and managed identities, insecure connection strings can be avoided.
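As a small illustrative sketch (the vault URL and secret name are placeholders), a pipeline running in Azure can fetch credentials at runtime through Key Vault and a managed identity instead of hard-coding them:

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential resolves to a managed identity when running in Azure
# (and to the developer's own login locally), so no secret lives in the codebase.
credential = DefaultAzureCredential()

client = SecretClient(
    vault_url="https://my-data-platform-kv.vault.azure.net",  # placeholder vault
    credential=credential,
)

# Fetch the secret only at runtime; never log it or write it to config files.
storage_connection = client.get_secret("storage-connection-string").value
```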
Security is not solely the responsibility of DevOps or security professionals; it's a shared responsibility among all team members. Accountability becomes the cornerstone of preventing security breaches and ensuring the protection of valuable data assets.
#security #dataengineering #datascience #azure #microsoft #production #azuredevops #azurecloud #azurearchitect
As #datascientists, we're constantly immersed in the world of complex models, evaluating their #performance and striving for #accuracy. We measure, compare, and select the most promising #models based on various metrics and loss functions. Undoubtedly, these measures are crucial, validating our models against #benchmarks and pushing the boundaries of what's possible. But amidst all the technicalities, it's essential to keep in mind a simple truth: "Essentially, all models are wrong, but some are useful," as George Box astutely observed.
No matter how sophisticated our models become, they remain abstractions of reality. They inevitably omit numerous factors that may be vital to the problem at hand. Thus, when it comes to solving real business problems, the focus should shift from pure accuracy to usefulness. What good is a highly accurate model if it fails to provide actionable insights or make a tangible impact?
Before making the decision to deploy a model into production or invest further in its training, it's crucial to address a few pivotal questions:
1. What does the current level of model accuracy mean for the business? Can we make valuable business decisions based on the current model's performance?
2. If the current model falls short, what level of accuracy is necessary for it to become truly useful? Setting realistic benchmarks helps us avoid chasing unattainable perfection.
3. What are the potential costs and profits associated with improving the model's accuracy by a certain percentage? A thoughtful cost-benefit analysis ensures that our efforts align with business goals; a rough sketch of such a calculation follows this list.
4. Perhaps the most critical question of all: Do we have a strategy in place to handle wrong decisions based on the model? Models are not infallible, and having contingency plans for mitigating potential risks is paramount.
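To ground questions 2 and 3, here is a hypothetical back-of-the-envelope calculation (every figure is invented) that translates an accuracy gain into money before committing to further training:

```python
# Rough cost-benefit check for improving model accuracy (all numbers invented).
decisions_per_year = 100_000
value_per_correct = 2.0      # value gained per correct prediction
cost_per_wrong = 10.0        # cost incurred per wrong prediction

def annual_value(accuracy: float) -> float:
    correct = decisions_per_year * accuracy
    wrong = decisions_per_year * (1 - accuracy)
    return correct * value_per_correct - wrong * cost_per_wrong

current_acc, improved_acc = 0.90, 0.93            # a 3-point accuracy gain
uplift = annual_value(improved_acc) - annual_value(current_acc)   # 36,000
cost_of_improvement = 25_000.0                    # engineering + compute estimate

print(f"annual uplift from the improvement: {uplift:,.0f}")
print(f"worth pursuing: {uplift > cost_of_improvement}")
```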
These considerations become even more fascinating in the era of Large Language Models (LLMs), which have proliferated across the internet. Many of these models have showcased exceptional performance on benchmark problems, such as acing specific exams. However, the challenge lies in their integration into real business success and, equally important, how we navigate any unintended consequences they may bring.
In the end, striking a balance between model accuracy and real-world impact is an art. By recognizing that no model is a perfect reflection of reality, we empower ourselves to prioritize usefulness in solving business problems.
#DataScience #MachineLearning #ml #BusinessImpact #llm #artificialintelligence #business