One of the key challenges in developing a chatbot is correctly interpreting the intent of users' natural language messages. Models for message categorization are typically trained and optimized on heterogeneous data without a focus on demographic equity, which can result in substantial disparities in accuracy and accessibility for underrepresented population segments. At Nivi, we have developed an Equitable Annotation Framework to explicitly center equity throughout the incremental development and assessment of Natural Language Processing (NLP) models. In this post, we examine the Equitable Annotation Framework and early promising results from applying it to improve gender equity.
Background: Chat intent classification at Nivi
The askNivi chatbot is primarily a multiple-choice interface. However, every day it receives hundreds of spontaneous natural language messages from users seeking agency in their own health. We have trained classifiers to recognize the intents behind those messages, and we have a growing set of customized responses and protocols triggered by each detected user intent.
To date, these models have been trained using supervised machine learning on small annotated data sets. Retraining, fueled by more annotated data, leads to improved accuracy and a better user experience, especially when done equitably.
The Equitable Annotation Framework
Nivi has prototyped a tool and workflow, the Equitable Annotation Framework (EAF), to enable incremental accuracy improvements on Natural Language Processing tasks (initially, chat intent) while continuously optimizing for equity. We use the EAF to implement a human-in-the-loop evaluation and retraining process, in which the chat intent decisions of NLP classifiers and the bot's responses are evaluated in context, with user demographic associations preserved. Human judgements in this workflow serve two purposes: equitable evaluation and equitable system training.
With each iteration of the workflow, the accuracy of the NLP system (measured against human judgements) is evaluated both on the whole sample and segmented by demographic (in this case, gender). The newly annotated data becomes gold-standard data used to retrain the NLP classifiers. With each iteration, the data is sample-weighted to intentionally increase the proportion drawn from demographic classes that previously showed lower accuracy; the retraining framework thereby counteracts existing inequity, biasing development to improve most for those who have experienced the worst outcomes.
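To make the segmented evaluation step concrete, here is a minimal sketch of computing precision and recall for one intent, both overall and per demographic group. This is an illustrative example, not Nivi's actual pipeline; the `segmented_metrics` function, the intent labels, and the toy data are all hypothetical.

```python
from collections import defaultdict

def segmented_metrics(records, positive_intent):
    """Precision and recall for one intent, overall ("all") and per
    demographic segment. Each record is (gender, predicted, gold)."""
    groups = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gender, predicted, gold in records:
        for key in ("all", gender):          # tally overall and per segment
            g = groups[key]
            if predicted == positive_intent and gold == positive_intent:
                g["tp"] += 1
            elif predicted == positive_intent:
                g["fp"] += 1
            elif gold == positive_intent:
                g["fn"] += 1
    metrics = {}
    for key, g in groups.items():
        prec = g["tp"] / (g["tp"] + g["fp"]) if g["tp"] + g["fp"] else 0.0
        rec = g["tp"] / (g["tp"] + g["fn"]) if g["tp"] + g["fn"] else 0.0
        metrics[key] = {"precision": prec, "recall": rec}
    return metrics

# Toy audited sample: (gender, model prediction, human judgement)
audited = [
    ("woman", "ask_contraception", "ask_contraception"),
    ("woman", "ask_contraception", "ask_contraception"),
    ("woman", "greeting", "ask_contraception"),
    ("man", "ask_contraception", "greeting"),
    ("man", "greeting", "ask_contraception"),
    ("man", "ask_contraception", "ask_contraception"),
]
report = segmented_metrics(audited, "ask_contraception")
```

Reporting the same metric segmented by demographic is what surfaces disparities that a single aggregate number would hide, which is the core of the evaluation half of the workflow.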
Baseline results showed significantly lower accuracy on men’s messages than on women’s, likely because men are the minority in the operating context and therefore also in the training data (79% women, 21% men). Our initial experimental protocol was to retrain in a way that mitigates this disparity while improving the model overall, using incremental batches of roughly 500 audited messages with an intentionally skewed distribution (40% women, 60% men).
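Assembling such an intentionally skewed batch can be sketched as simple stratified sampling. Again, this is a hypothetical illustration of the 40/60 split described above; the `skewed_batch` function, the message fields, and the toy pool are assumptions, not the production implementation.

```python
import random

def skewed_batch(messages, batch_size, target_share):
    """Sample a retraining batch whose gender composition matches
    target_share, e.g. {"woman": 0.4, "man": 0.6}, to counteract
    the imbalance in the original training data."""
    by_gender = {}
    for msg in messages:
        by_gender.setdefault(msg["gender"], []).append(msg)
    batch = []
    for gender, share in target_share.items():
        pool = by_gender.get(gender, [])
        k = min(round(batch_size * share), len(pool))  # cap at pool size
        batch.extend(random.sample(pool, k))
    random.shuffle(batch)
    return batch

# Toy pool mirroring the 79% / 21% imbalance described above
pool = ([{"id": i, "gender": "woman"} for i in range(79)]
        + [{"id": 100 + i, "gender": "man"} for i in range(21)])
batch = skewed_batch(pool, 30, {"woman": 0.4, "man": 0.6})
```

Over-representing the previously disadvantaged group in each batch is what lets retraining actively counteract the bias in the original data rather than merely reproduce it.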
Retraining under this protocol turned out to be surprisingly sensitive to gender composition: after a single increment, accuracy on men's messages outperformed accuracy on women's messages on both precision and recall (two complementary measures of accuracy). The protocol was certainly effective at improving outcomes for the minority demographic, but it made incremental progress impossible to observe.
To better observe incremental outcomes, we restarted with batches of roughly 125 examples. Under this modified protocol, the gradual effect of gender-aware retraining is clearly visible: both precision and recall improve steadily from the first round for the minority demographic, while the majority demographic takes an initial hit in accuracy but returns to baseline by the third round of training.
Equitable Annotation is an effective approach
From these early results, we conclude that the Equitable Annotation Framework is an effective tool for mitigating biased outcomes in NLP. Nivi is committed to using this and other best practices for ensuring equity within our product now and in the future.