Fairness and Accuracy of Apollo 4 Live Facial Recognition Algorithm

Why test facial recognition (FR) systems?
Facial recognition technology must be implemented in a responsible, transparent and ethical way, and doing so requires an understanding of its accuracy and demographic equitability.
The National Biometric Function, in collaboration with the Office of the Police Chief Scientific Adviser, commissioned the National Physical Laboratory (NPL) to assess the performance, accuracy and equitability of Corsight Live Facial Recognition, version Apollo 4. NPL is the UK’s National Metrology Institute, an independent and impartial organisation, and it conducted the testing independently.
The test methodology and dataset used in the NPL evaluation were specifically designed to help identify any impact the facial recognition technology may have on the protected characteristics of ethnicity, age and gender. The same methodology and dataset were used previously, in 2023, to evaluate the NEC NeoFace algorithm [Operational Testing of Facial Recognition Technology].
What do the results tell us?
The NPL report gives an impartial, scientifically underpinned and evidence-based analysis of the performance of the Corsight FR algorithm for the use of live facial recognition in UK policing.
Recognition accuracy for Live Facial Recognition is assessed via two criteria:
- The True Positive Identification Rate (TPIR): Would a person in the video stream who has a reference image on the watchlist be correctly identified?
- The False Positive Identification Rate (FPIR): Would a person in the video stream who does not have a reference image on the watchlist be incorrectly identified?
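As an illustration of how these two rates can be computed, the sketch below counts outcomes over a set of passages in front of the camera. It is a minimal sketch and not the NPL test procedure: the `Passage` record layout, the rule that an alert is raised when the best score is at or above the threshold, and all example scores are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    """One person passing the camera (hypothetical record layout)."""
    on_watchlist: bool     # person has a reference image on the watchlist
    best_score: float      # highest match score against any watchlist image
    best_is_correct: bool  # top-scoring watchlist image belongs to this person


def identification_rates(passages, threshold):
    """Return (TPIR, FPIR) at one face-match threshold.

    Assumption: an alert is raised when best_score >= threshold.
    """
    mated = [p for p in passages if p.on_watchlist]
    non_mated = [p for p in passages if not p.on_watchlist]

    # True positive: a watchlisted person raises an alert for the right identity.
    true_pos = sum(
        1 for p in mated if p.best_score >= threshold and p.best_is_correct
    )
    # False positive: a person not on the watchlist raises any alert.
    false_pos = sum(1 for p in non_mated if p.best_score >= threshold)

    tpir = true_pos / len(mated) if mated else float("nan")
    fpir = false_pos / len(non_mated) if non_mated else float("nan")
    return tpir, fpir


# Made-up example data, not taken from the NPL report.
passages = [
    Passage(True, 70, True),
    Passage(True, 58, True),
    Passage(True, 52, False),
    Passage(True, 40, True),
    Passage(False, 51, False),
    Passage(False, 45, False),
    Passage(False, 30, False),
    Passage(False, 20, False),
]

for threshold in (63, 55, 50):
    tpir, fpir = identification_rates(passages, threshold)
    print(f"threshold {threshold}: TPIR={tpir:.2f}, FPIR={fpir:.2f}")
```

With this made-up data, lowering the threshold from 63 to 50 raises the TPIR from 0.25 to 0.50 but also raises the FPIR from 0.00 to 0.25, which shows why both rates need to be reported together at each threshold.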
NPL tested the algorithm over a range of face-match thresholds and with watchlist sizes up to 180,000. Operational equitability was considered at three face-match thresholds (63, 55 and 50) on two watchlist sizes: i) 18,000 reference images and ii) 1,800 reference images. At thresholds 55 and 63, observed demographic differences in TPIR and FPIR were not statistically significant.
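To give a sense of what "not statistically significant" means when two demographic groups are compared, the sketch below applies a two-proportion z-test to the identification rates of two groups. This summary does not state which statistical test NPL used, so the choice of test, the function name `two_proportion_z_test`, the 0.05 significance level and all counts are assumptions for illustration only.

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two rates.

    Returns (z, p_value). Illustrative only; not necessarily NPL's method.
    """
    rate_a = hits_a / n_a
    rate_b = hits_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all hits or all misses: no observable difference.
        return 0.0, 1.0
    z = (rate_a - rate_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: correct identifications out of 200 passages per group.
z, p = two_proportion_z_test(178, 200, 172, 200)
significant = p < 0.05  # conventional level, assumed here
print(f"z={z:.2f}, p={p:.3f}, significant={significant}")
```

In this hypothetical case the observed rates differ (89% against 86%), yet the p-value is well above 0.05, so the difference could plausibly be due to chance in a sample of this size.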
This assessment will add to law enforcement’s understanding of how their FR systems perform and will provide information to help configure FR technology for effective and fair deployment in operational use cases.