Data Science Enablement
By: Brady Bastian
At the heart of any data science project is the set of tools, technologies, practices, and systems that give your company the capacity to develop, build, deploy, and maintain data science solutions in the cloud. Data Science Enablement is the difference between an analytics team focused on short-term data analysis and a true data science practice that delivers useful, powerful solutions which drive the business.
Work to build a powerful coalition within your business dedicated to the success of your projects and to delivering value to customers. Data science is a team sport, and it requires buy-in from, and participation with, many different teams and leaders at your company.
Article Contents
Data Science Enablement
Where Businesses Go Wrong
Project Scoping
Project Development
Project Production Deployment
Data Science Enablement
Data Science Enablement is the set of tools, techniques, practices, and rules which determine how your company builds, deploys, and maintains data science solutions. As the original pioneers of production-ready data science solutions quickly discovered, without a true data science ecosystem, along with strong collaboration among partner teams, data science solutions can quickly devolve into a mess of tech debt and disjointed models. The model itself is actually a very small piece of the overall machine learning architecture. To get data science right and keep it producing for your business over the long term, you must get the entire ecosystem correct.
Without a clear set of goals and a solid enablement plan, your data science project will inevitably fail to reach its full potential and may even be counterproductive. Technical debt, in the case of ML and data science, accumulates over time as the real-world behavior of your data features changes and evolves. Monitoring for and resolving this debt requires constant reevaluation of all of your data inputs and your methods of data collection.
Metrics gathering and evaluation is an ongoing and iterative process. There are no tried-and-true best practices governing all data science projects; each must be addressed on a case-by-case basis. Each model requires a complex and intricate ecosystem of tools and techniques to effectively gather, analyze, measure, and evaluate different features and hyperparameters. To this end, before any actual work can occur, you must first assemble a data science enablement team to determine the methods, processes, tools, and practices that your project requires. What can be standardized, however, are the development methods themselves, along with monitoring, bug resolution, and data quality remediation.
Where Businesses Go Wrong
Businesses often have a misguided view of what is actually required to build data science solutions. Often, businesses believe that they just need to hire a data scientist to build analytical models of various business processes. While this can be helpful in the short term for small, focused projects, it will not provide a robust, high-quality solution over time and will generate significant tech debt which must be accounted for. True data science requires understanding the entire MLOps architecture, of which the prediction engine is one small part. Indeed, as seasoned data science teams often note, tech debt grows over time, and it is not always immediately apparent that it is even occurring.
Data science projects are about far more than just building a predictive engine. Behind that predictive engine is a massive effort and data infrastructure solution which cuts across multiple teams, production centers, and technologies. Building data science solutions in a vacuum is a losing battle, because proper data science requires a unified effort across multiple teams and profit centers. The data science enablement team is built from the ground up to solve these problems by systematically analyzing a data science project as one component of a complete ecosystem.
Data science is a very complex process which requires significant investment across a wide variety of teams. A data scientist will produce the model itself, but the success of the project requires input and assistance from many different teams and sponsors. Often, because the project touches a wide variety of systems and processes, a corporate officer must be recruited to provide sponsorship and budgeting. A common misstep by businesses is a failure to properly scope and plan for the entire data science ecosystem. As noted above, the apparent success of a simple data model on a focused project can be highly misleading, because it hides the full set of potential problems which arise at ecosystem scale.
Project Scoping
Data science projects should be scoped properly before any development begins. This includes developing stage gates for the various data engineering processes, such as data ingestion, archiving, warehousing, and analysis. Additionally, many businesses fail to properly account for the rather significant hardware investments required to achieve actionable insights. This includes proper cloud computing resources, data storage costs, compute costs, GPUs, CI/CD pipelines, API development, and many others.
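As an illustration of what a stage gate might look like in practice, here is a minimal sketch that checks an ingested batch before it is promoted to warehousing. The column names, thresholds, and gate criteria are illustrative assumptions, not a standard; real gates should be derived from your own data contracts.

```python
import pandas as pd

# Hypothetical stage gate: data must pass these checks before being
# promoted from the ingestion stage to the warehouse stage. The column
# names and thresholds are illustrative assumptions.
REQUIRED_COLUMNS = {"customer_id", "event_timestamp", "amount"}
MIN_ROW_COUNT = 1_000
MAX_NULL_FRACTION = 0.05

def ingestion_stage_gate(df: pd.DataFrame) -> list[str]:
    """Return a list of gate failures; an empty list means the gate passed."""
    failures = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        failures.append(f"missing required columns: {sorted(missing)}")
    if len(df) < MIN_ROW_COUNT:
        failures.append(f"row count {len(df)} below minimum {MIN_ROW_COUNT}")
    null_fraction = df.isna().mean().max() if len(df) else 1.0
    if null_fraction > MAX_NULL_FRACTION:
        failures.append(f"null fraction {null_fraction:.2%} exceeds {MAX_NULL_FRACTION:.0%}")
    return failures

# A pipeline orchestrator would call the gate and halt promotion on failure:
# failures = ingestion_stage_gate(batch)
# if failures:
#     raise RuntimeError(f"Stage gate failed: {failures}")
```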
Due to the fragility of ML models, each and every component of the architecture must be properly accounted for, measured, and tracked. Each process requires a significant level of investment, monitoring, and standardization. While engineers can be assigned to each component of the architecture, a high degree of collaboration and consolidation is required to achieve success across all of your initiatives. Without effective management practices, the ecosystem itself may fail, resulting in wasted resources and poor outcomes.
A Data Science Enablement team will analyze the entire ecosystem, gather the current capabilities of your organization, and develop a plan of action to bring your organization up to the standards required to build projects effectively and produce results that last. Proper scoping of the entire ecosystem is a difficult task for organizations which do not already have an established practice. Businesses are often quite surprised to learn the scale of the requirements needed to successfully build and deploy a data science platform.
Some of the ecosystem requirements include feature development and storage, feature skew detection and normalization, training and retraining criteria, feature sharing practices, security, and serving infrastructure. Each of these processes requires a solid plan to account for a wide variety of circumstances to ensure that projects succeed and stay within project guidelines.
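As a concrete example of one of these requirements, the following is a minimal sketch of feature skew detection for a single numeric feature, comparing a training snapshot against a serving window with a two-sample Kolmogorov-Smirnov test. The use of a KS test and the significance threshold are illustrative choices, not a universal standard.

```python
import numpy as np
from scipy import stats

def detect_feature_skew(train_values: np.ndarray,
                        serving_values: np.ndarray,
                        alpha: float = 0.01) -> dict:
    """Flag training/serving skew for one numeric feature.

    A two-sample Kolmogorov-Smirnov test: a small p-value suggests the
    serving distribution has drifted from the training distribution.
    The alpha threshold is an illustrative choice.
    """
    statistic, p_value = stats.ks_2samp(train_values, serving_values)
    return {
        "ks_statistic": float(statistic),
        "p_value": float(p_value),
        "skew_detected": p_value < alpha,
    }

# Example: a serving window whose mean has shifted away from training.
rng = np.random.default_rng(seed=0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)
serving = rng.normal(loc=0.3, scale=1.0, size=5_000)  # shifted distribution
print(detect_feature_skew(train, serving))  # skew_detected: True
```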
Project Development
Data Science Enablement teams assist with project development as well. This includes facilitating data science aspirations by providing supporting datasets, debugging, solving data quality issues, and vetting and provisioning hardware acquisitions. Data Science Enablement for project planning helps ensure that resources are properly shared between scientists and across teams, are properly secured, and comply with industry regulations.
DSE will improve the operational efficiency of your data science workloads and facilitate new development overall through context mapping, data development standardization, data cataloguing, and data product development. DSE will also help other teams and stakeholders take advantage of your models and analyses while ensuring a high-quality solution and standardized methods of input.
DSE ensures the big-picture success of your data science projects by looking at the entire ML infrastructure, from data onboarding and quality through feature engineering, model development, and deployment. All of these components are necessary for a sustainable, production-ready data science practice. In addition to deployment, DSE will ensure standardization of REST API development so that other teams can readily access the models you develop and run online predictions.
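To make the serving side concrete, here is a minimal sketch of an online prediction endpoint written with FastAPI. The endpoint path, payload schema, model file, and version string are all illustrative assumptions; a production service would add authentication, validation against the feature store, and request logging for monitoring.

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical serialized model with a scikit-learn style predict() method.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class PredictionRequest(BaseModel):
    features: list[float]  # ordered to match the training feature set

class PredictionResponse(BaseModel):
    prediction: float
    model_version: str

@app.post("/v1/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest) -> PredictionResponse:
    # Wrap the single feature vector in a list to form a one-row batch.
    prediction = model.predict([request.features])[0]
    return PredictionResponse(prediction=float(prediction), model_version="1.0.0")
```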
DSE will improve the value of the models produced by ensuring high standards of quality, a robust architecture for data profiling, and standardized data quality rules. This can include model monitoring, feature skew detection, experiment categorization, and model version control.
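As one sketch of what standardized version control and experiment categorization might capture, the record below ties a model version to its training data snapshot, experiment category, and evaluation metrics. The fields and values are illustrative assumptions; mature teams typically rely on a dedicated registry such as MLflow.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ModelVersionRecord:
    """One standardized registry entry per deployed model version."""
    model_name: str
    version: str                 # e.g. semantic versioning
    training_data_snapshot: str  # pointer to (or hash of) the training set
    experiment_tag: str          # category for experiment tracking
    metrics: dict[str, float]    # standardized evaluation metrics
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

registry: list[ModelVersionRecord] = []

# Hypothetical entry; names, paths, and metrics are illustrative.
registry.append(ModelVersionRecord(
    model_name="churn-classifier",
    version="2.1.0",
    training_data_snapshot="s3://example-bucket/snapshots/2024-01-15",
    experiment_tag="hyperparameter-sweep",
    metrics={"auc": 0.91, "precision_at_10": 0.64},
))
```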
Project Production Deployment
DSE is not only important during scoping and model development; it is also essential once your model is in production. DSE can assist with CI/CD development, pipeline modernization, and monitoring. Without proper standardization of resource management and pipeline deployment practices, your data science projects can produce disjointed, inaccurate, and unsound results. This can sink any data science practice: instead of producing value, your projects will suffer and become more of a burden than they are worth.
Production monitoring will ensure that any problems with data quality are quickly identified and addressed, and that any issues with concept drift or schema drift are captured in a standardized way that can be resolved without impacting other models in production. For feature engineering purposes, DSE can standardize feature selection, testing, and the retraining of production models as a process separate from the data science workflow. DSE can respond to changes in data distributions, typing, or other profiling issues without requiring constant data science supervision.
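The sketch below illustrates one way schema drift could be captured in that standardized shape: each incoming batch is compared against the schema the model was trained on, and any mismatch is reported in a uniform structure that downstream tooling can act on without data scientist intervention. The expected schema here is an illustrative assumption.

```python
import pandas as pd

# Hypothetical registered schema the model was trained against.
EXPECTED_SCHEMA = {
    "customer_id": "int64",
    "amount": "float64",
    "signup_channel": "object",
}

def check_schema_drift(batch: pd.DataFrame) -> dict:
    """Compare a batch's schema to the registered one; report drift uniformly."""
    observed = {col: str(dtype) for col, dtype in batch.dtypes.items()}
    missing = sorted(set(EXPECTED_SCHEMA) - set(observed))
    unexpected = sorted(set(observed) - set(EXPECTED_SCHEMA))
    type_changes = {
        col: {"expected": EXPECTED_SCHEMA[col], "observed": observed[col]}
        for col in EXPECTED_SCHEMA
        if col in observed and observed[col] != EXPECTED_SCHEMA[col]
    }
    return {
        "missing_columns": missing,
        "unexpected_columns": unexpected,
        "type_changes": type_changes,
        "drift_detected": bool(missing or unexpected or type_changes),
    }
```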
Overall, this is a small sample of the benefits of a true Data Science Enablement team. If your organization wants to take the next step in the evolution of its data science practice, I encourage you to set up your own team and standardize your practices and methods.