The Machine Learning ecosystem

Last updated on 2024-11-14

Overview

Questions

  • FIXME

Objectives

  • FIXME

FIXME

Episode Introduction


Sharing what you have developed/learned


Training Data

  • Why?
    • Training data is time-consuming to produce, and others may be able to reuse or build on your data.
    • Being able to interrogate the data used to train a model can give insight into the limitations of any models based on this data.
    • Shared datasets make it easier to establish what performance is reasonable to expect for a particular model or task, because results can be compared directly.
  • Possible approach?
    • Share in an existing data repository
    • Include a clear license to indicate terms of use
    • Include documentation about how the dataset was constructed. Datasheets for Datasets offers a useful template for approaching this documentation.

Models

  • Why?
    • A lot of work is likely to have gone into creating the model, and others may be able to benefit from it.
    • Training some machine learning models has a large environmental impact. Sharing models can help avoid this environmental cost being incurred multiple times.
  • Possible approach?
    • Depending on how you are using your model’s predictions, the model itself might be contained inside an ‘application’. This application could be shared directly, or you might decide to share the model weights.
    • It is important to document your model. The original intended use, limitations and a link to the training data will all help enable people to evaluate how they could use your model. Model Cards for Model Reporting provides guidance on what this documentation should include.
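
As an illustration of the documentation point above, the following is a minimal sketch of how the three items named there (intended use, limitations and a link to the training data) could be kept alongside a model and rendered as a plain-text summary. The ModelCard class, its field names, the model name and the URL are all hypothetical and chosen for this example; this is not the template from Model Cards for Model Reporting, which covers considerably more.

```python
from dataclasses import dataclass


@dataclass
class ModelCard:
    """Hypothetical minimal record; not the full Model Cards template."""

    name: str
    intended_use: str
    limitations: list[str]
    training_data_url: str

    def to_text(self) -> str:
        """Render the record as a short plain-text summary."""
        lines = [
            f"Model: {self.name}",
            "",
            "Intended use",
            f"  {self.intended_use}",
            "",
            "Limitations",
        ]
        lines += [f"  - {item}" for item in self.limitations]
        lines += ["", "Training data", f"  {self.training_data_url}"]
        return "\n".join(lines)


card = ModelCard(
    name="example-page-classifier",  # hypothetical model name
    intended_use="Sorting digitised pages into 'text' and 'illustration'.",
    limitations=["Trained on a single collection; may not transfer to others."],
    training_data_url="https://example.org/training-data",  # placeholder URL
)
print(card.to_text())
```

Keeping this information in a structured form next to the model weights is one possible design; a hand-written README covering the same points would serve the same purpose.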

Processes, successes and failures

Beyond sharing the more tangible outcomes of a machine learning project, documenting the broader project will help other GLAM institutions apply machine learning. This documentation could include:

  • The problem you were trying to solve
  • Alternatives to machine learning considered
  • How you created your training data
  • The metrics which were important to you
  • The models you considered
  • The experiments you ran and the results of those experiments
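
The last two items in the list above (the models considered and the experiments run) lend themselves to a simple running log. The sketch below appends one record per experiment to a JSON Lines file and reads the records back, using only the Python standard library. The function names, the record fields and the file name are assumptions made for this example, and the metric values are made-up placeholders rather than real results. Dedicated experiment-tracking tools offer much more than this.

```python
import json
import tempfile
from pathlib import Path


def log_experiment(log_path, model, metric, value, notes=""):
    """Append one experiment result to a JSON Lines file and return the record."""
    record = {"model": model, "metric": metric, "value": value, "notes": notes}
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


def read_experiments(log_path):
    """Read every logged experiment back as a list of dictionaries."""
    with open(log_path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Illustrative use in a temporary directory; the values are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "experiments.jsonl"  # hypothetical file name
    log_experiment(log_path, "baseline", "f1", 0.5, notes="no augmentation")
    log_experiment(log_path, "larger-model", "f1", 0.25, notes="overfit; abandoned")
    records = read_experiments(log_path)

for record in records:
    print(record["model"], record["metric"], record["value"], record["notes"])
```

Recording unsuccessful runs in the same log as successful ones keeps the 'failed' experiments visible when the project is written up.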

There are various ways in which this work can be documented. Academic papers are one avenue for sharing the results of experiments but should not be considered the sole medium for sharing meaningful work. The format of many academic journals is likely to preclude sharing ‘failed’ projects, and it may be challenging to publish more ‘modest’ uses of machine learning because they are deemed to lack ‘novelty’.

Beyond academic papers, there are a growing number of tools for managing machine learning projects, which include data versioning, experiment tracking and other features for documenting work. Public version control repositories like GitHub or GitLab offer venues for sharing code, and you may explore using other tools like Jupyter notebooks to help make your models more accessible to others.

Key Points

  • FIXME