Shell: Evaluating the performance of machine learning models used in the energy sector
Background & Description

This project leverages deep learning to perform computer-vision tasks, specifically semantic segmentation, in a specialised application domain. The project had about 15 deep-learning (DL) models in active deployment. The DL models are applied in a cascaded fashion to generate predictions, which then feed into a series of downstream tasks to produce the final output, which in turn is the input to a manual interpretation task. AI assurance through model performance evaluation is therefore critical to ensure robust and explainable AI outcomes. Three types of model evaluation tests — regression, integration and statistical — were designed and implemented into the DL inference pipeline.
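As a minimal sketch of what a regression test in such a suite might look like, the example below compares a segmentation model's mean intersection-over-union (IoU) on a fixed test set against a stored baseline. All names here (iou, run_regression_test, the 0.02 tolerance) are illustrative assumptions, not details taken from the project itself.

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection-over-union of two binary segmentation masks."""
    inter = np.logical_and(pred, truth).sum()
    union = np.logical_or(pred, truth).sum()
    return inter / union if union else 1.0

def run_regression_test(model, test_set, baseline_miou: float, tol: float = 0.02) -> bool:
    """Fail if mean IoU drops more than `tol` below the recorded baseline."""
    scores = [iou(model(x), y) for x, y in test_set]
    return float(np.mean(scores)) >= baseline_miou - tol

# Toy example with a stand-in "model" that thresholds its input:
rng = np.random.default_rng(0)
truth = rng.random((4, 8, 8)) > 0.5
test_set = [(t.astype(float), t) for t in truth]   # inputs that match the truth
model = lambda x: x > 0.5                          # perfect on this toy data
assert run_regression_test(model, test_set, baseline_miou=1.0)
```

Run against a frozen test set before each deployment, a check like this gives the "first line of defence" described below: a hard pass/fail signal on whether a retrained model has regressed.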
How this technique applies to the AI White Paper Regulatory Principles

More information on the AI White Paper Regulatory Principles.

Safety, Security & Robustness

The regression and integration tests form the backbone of model interpretability against a set of test data. During model development they provide a baseline for judging whether model performance is improving or degrading, conditional on the model training data and parameters. During the model deployment phase these tests also provide an early indication of concept drift. The statistical tests are designed to predict model performance given the statistics of the test data, providing a mechanism to detect data drift as models are deployed. They also indicate how robust DL model performance is to statistical variations in the test data.

Appropriate Transparency and Explainability

The output of this AI assurance technique is communicated to AI developers and product owners so that they can monitor potential deviation from expected DL model performance and, if performance does deviate, operationalise appropriate mitigation measures. It is also communicated to frontline users and business stakeholders so that they maintain a high degree of trust in the outcomes of the DL models.

Accountability and Governance

AI developers are responsible for designing and running the model evaluation tests to strengthen performance testing. Product owners are responsible for leveraging these tests as a first line of defence before new model deployments. The project team works together to adapt the tests to tackle data and concept drift during deployment.

Why we took this approach

In this project, the predictions of the DL models ultimately generate the inputs for a manual interpretation task. This task is complicated, time-consuming and effort-intensive, so it is crucial that the starting point (in this case, the DL model predictions) be of high quality in terms of accuracy, detection coverage and very low noise.
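The statistical tests described under Safety, Security & Robustness above could, for instance, take the form of a two-sample Kolmogorov–Smirnov test comparing a summary statistic of incoming inference data against a reference sample from the training data. This is a hypothetical sketch: the function names, the choice of per-image mean intensity as the statistic, and the 0.01 significance level are all assumptions, not details of the project.

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted(reference: np.ndarray, incoming: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag data drift when a two-sample KS test rejects the hypothesis
    that `reference` and `incoming` come from the same distribution."""
    return ks_2samp(reference, incoming).pvalue < alpha

rng = np.random.default_rng(1)
reference = rng.normal(0.5, 0.1, size=500)   # e.g. per-image mean intensity at training time
shifted   = rng.normal(0.8, 0.1, size=500)   # incoming data after a distribution shift

assert not drifted(reference, reference)     # identical samples: no drift flagged
assert drifted(reference, shifted)           # shifted samples: drift flagged
```

A check like this runs on the inputs alone, so it can raise a data-drift alarm during deployment even before any labelled ground truth is available to measure model accuracy directly.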
Furthermore, the outcome of the manual interpretation feeds into a high-impact decision-making process. The quality and robustness of the DL models' predictions are thus of paramount importance. The most important check of an ML model's prediction performance is human-in-the-loop quality control; to automate performance testing into a first line of defence, however, the model evaluation test suite technique was adopted. Data version control and implicit ML experiment pipelines were adopted mainly to ensure that the models could be reproduced end to end (data, code and model performance) within an acceptable margin of error.

Benefits to the organisation
Limitations of the approach
Further AI Assurance Information