Completing the Machine Learning Loop

(Source: Image by Shutterstock and Daniel Jeffries)

Capability vs. Ability

The Data Science Process

Functional diagram of a machine learning system. Model development is typically an offline process that produces a trained model or inference pipeline, which is then incorporated into a production analytics system. Over time, data from the production system (typically a data lake) is pulled back into the model development process to improve the analytic's quality and/or performance. (Image by author)

Software Development: Two Life Cycles Diverge in the Woods

The Software Development Life Cycle (SDLC) is a useful construct to show the journey software must continually undergo. (Image by author)
  1. Version control — managing code in versions, tracking history, and rolling back to a previous version when needed
  2. CI/CD — automated testing on code changes, removing manual overhead
  3. Agile software development — short release cycles, incorporating feedback, and emphasizing team collaboration
  4. Continuous monitoring — ensuring visibility into the performance and health of the application, with alerts for undesired conditions
  5. Infrastructure as code — automating dependable deployments
Testing and monitoring required in traditional software systems (Image adapted from: The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction)
  1. We start with the code repository, adding unit tests to ensure that the addition of a new feature doesn’t break functionality.
  2. Code is then integration tested. That’s where different software pieces are tested together to ensure the full system functions as expected.
  3. The system is then deployed and monitored, collecting information on user experience, resource utilization, errors, alerts and notifications that inform future development.
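The first two steps can be sketched with Python's built-in `unittest` — a minimal illustration, where `add_item` and `total` are hypothetical application functions, not from any real system:

```python
import unittest

# Hypothetical application code under test
def add_item(cart, item, price):
    """Add an item and its price to a shopping-cart dict."""
    cart[item] = cart.get(item, 0) + price
    return cart

def total(cart):
    """Sum all item prices in the cart."""
    return sum(cart.values())

class UnitTests(unittest.TestCase):
    # Step 1: a unit test exercises one function in isolation
    def test_add_item(self):
        self.assertEqual(add_item({}, "book", 10), {"book": 10})

class IntegrationTests(unittest.TestCase):
    # Step 2: an integration test exercises the pieces together
    def test_add_then_total(self):
        cart = add_item({}, "book", 10)
        cart = add_item(cart, "pen", 2)
        self.assertEqual(total(cart), 12)

if __name__ == "__main__":
    # exit=False so the script can continue after the test run
    unittest.main(argv=["ci"], exit=False)
```

In a CI/CD pipeline, a suite like this runs automatically on every code change, so a new feature that breaks existing behavior is caught before deployment.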

The Two Loops

The Two Loops: A model of what machine learning software development encompasses. The Code Loop is crucial to develop the ML software for model stability and efficiency, while the Data Loop is essential to improving model quality and maintaining the model’s relevance. Creating ML models requires the Code Loop and Data Loop to interact at various stages, such as model training and monitoring. (Image by author)

Data: The new source code

  • Data is bigger in size and quantity
  • Data grows and changes frequently
  • Data has many types and forms (video, audio, text, tabular, etc.)
  • Data has privacy challenges, beyond managing secrets (GDPR, CCPA, HIPAA, etc.)
  • Data can exist in various stages (raw, processed, filtered, feature-form, model artifacts, metadata, etc.)
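One practical consequence of these differences: unlike source code, a large dataset's version can't be eyeballed in a diff. A common workaround — sketched here in plain Python, not tied to any particular tool — is to fingerprint data with a content hash so that silent changes become detectable:

```python
import hashlib

def fingerprint(path, chunk_size=1 << 20):
    """Return a SHA-256 digest of a file's contents, read in 1 MB
    chunks so arbitrarily large datasets fit in constant memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()
```

Recording the fingerprint alongside each training run makes it possible to tell, later, exactly which bytes a model was trained on — the data-side analogue of pinning a git commit hash.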

Data Bugs

Testing and monitoring required in machine learning systems (Image adapted from: The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction)

“Keeping the algorithm fixed and varying the training data was really, in my view, the most efficient way to build the system.” — Andrew Ng
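Data bugs rarely raise exceptions: a schema that silently gains a column, or a feature that drifts outside the range the model was trained on, degrades quality without crashing anything. A minimal sketch of the kind of data tests the ML Test Score rubric calls for — the column names and ranges here are hypothetical:

```python
def check_schema(rows, expected_columns):
    """Fail fast if incoming rows don't match the training-time schema."""
    expected = set(expected_columns)
    for row in rows:
        if set(row) != expected:
            raise ValueError(f"schema drift: got columns {sorted(row)}")
    return True

def check_range(values, lo, hi):
    """Flag feature values outside the range seen during training.
    Returns (ok, out_of_range_values)."""
    bad = [v for v in values if not lo <= v <= hi]
    return len(bad) == 0, bad
```

Run against every new batch of training or serving data, checks like these turn silent data bugs into loud, actionable failures — the data-side counterpart of unit tests on code.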

MLOps: DevOps for Machine Learning

MLOps Principles and Key Considerations (updated/modified by author from Summary of MLOps Principles and Best Practices)

Practical MLOps: Pachyderm for data and code+data

Pachyderm architectural and user overview (Image by Pachyderm)
  • To handle changing data, Pachyderm treats data as git-like commits to versioned data repositories. Commits are immutable: committed data objects are not mere metadata references to where the data is stored but actual versioned objects/files, so nothing is ever lost. If our training data is overwritten with a new version in a new commit, the old version is always recoverable from the previous commit.
  • As for types and formats, data is treated as opaque and general. Any type of file — audio, video, CSV, etc. — can be placed into storage and read by pipelines. This keeps the system general rather than limited to certain kinds of well-structured data, the way a database is.
  • As for the size and quantity of data, Pachyderm uses a distributed object-storage backend that can scale to essentially any size. In practice, it is built on object storage services such as Amazon S3 or Google Cloud Storage.
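The git-like, immutable-commit idea can be illustrated with a toy sketch — this is a conceptual model in plain Python, not Pachyderm's actual API:

```python
class VersionedRepo:
    """Toy illustration of immutable, git-like data commits:
    each commit snapshots the full file set, so every old
    version stays recoverable by commit id."""

    def __init__(self):
        self.commits = []  # list of {filename: bytes} snapshots

    def commit(self, files):
        # Start from the previous snapshot, apply new/changed files.
        snapshot = dict(self.commits[-1]) if self.commits else {}
        snapshot.update(files)
        self.commits.append(snapshot)
        return len(self.commits) - 1  # the new commit id

    def get(self, filename, commit_id=-1):
        # Read a file as of any commit -- nothing is ever lost.
        return self.commits[commit_id][filename]
```

Overwriting `train.csv` in a new commit leaves the old bytes readable at the earlier commit id — exactly the property that lets a training run be reproduced against the data it originally saw.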
From blog post Pachyderm and the power of GitHub Actions: MLOps meets DevOps (Image by author)

There and Back Again
