iMedia Associates

Harnessing media and communications for change

What can we learn from emerging evidence of GenAI use in Social and Behaviour Change chatbots?

Image created with DALL·E using the prompt "round table with global participants on GenAI, SBC and evidence".


In March 2024, the MERL Tech Initiative, supported by iMedia Associates and the Gates Foundation, set out to define a research agenda on the use of Generative AI within Social and Behaviour Change (SBC) programming in development work. Over 50 organisations were represented at our workshop during the ICT4D Conference in Accra, and it rapidly became clear that it was way too soon to develop a nuanced research agenda when many practitioners were still grappling with foundational questions like "what even is GenAI?" and "how could it be used in SBC?". At that time, there weren't enough anecdotes, let alone case studies or data, to start drawing any meaningful conclusions as to how this new technology was influencing our field – with the exception of frontrunners such as Viamo and Data Science Nigeria.

One year on (which is a long time, in AI-years!), the SBC & AI Working Group sought to revisit this state of play and brought together 6 organisations all using GenAI in various ways to deploy and evaluate SBC interventions. 

As with all things AI, it's important to be specific about exactly which use of GenAI is under examination. In this case, our speakers used GenAI as part of a chatbot or messaging-based system, either voice- or text-driven, which enabled users in low- and middle-income countries (LMICs) to ask questions across the fields of Sexual and Reproductive Health, Agriculture, Education, and General Knowledge. All bar one were live and used by thousands of users. Some also used Large Language Models (LLMs) to make back-end decisions or analyse data to refine or evaluate the intervention, and many interventions paired GenAI Q&A capabilities with pre-written content such as articles or quizzes. It should be noted that this is just one of many ways in which GenAI could support SBC programming, as explored in our initial piece.

Our panellists included: 

  • Gayatri Jayal, Director of Consumer Innovations at Dimagi, sharing evidence on MathBot, a Maths coaching chatbot.
  • Jay Patel, Director of Technology at Jacaranda Health, discussing learnings from UlizaMama, an SMS service which uses GenAI to answer questions from pre- and post-natal Kenyan women.
  • Jona Repishti, Assistant Director, Global Gender at Digital Green, sharing evidence generated by Farmer.Chat, a service which answers farmers' questions in 5 countries.
  • Soma Mitra-Behura, Lead Data Scientist at Girl Effect, whose longstanding chatbot Big Sis answers users’ sexual and reproductive health and mental health questions using GenAI.
  • Lukas Borkowski, VP of Strategic Partnerships at Viamo, sharing what they learned when they deployed a voice-based GenAI Q&A service in Zambia.
  • Sidd Goyal, Co-founder and CEO at Nivi, whose chatbot, Ask Nivi, allows users in Kenya, Nigeria and India to ask questions on contraception, sexual health, and vaccines.

Mindful of the great research that has already been done in the space by the teams at Stanford Centre for Digital Health and others, we wanted to move beyond ‘here’s what our tool does’ to drill into the data generated by these tools (more information on each of the partners and links to try out the services can be found here). Panellists were asked to come prepared with a 5-minute lightning talk on 1 data point generated by their work (avoiding relatively shallow metrics such as reach), and explain why the data felt meaningful. Here’s what we learned. 

Surprise! We’ve already got too much data

Ask and ye shall receive… despite the '1 data point' prompt, we collated 14 different metrics during the course of our discussion, all of which have been compiled here. These ranged from reach-esque indicators (total users, messages received, total questions asked…), to juicier indicators of meaningful engagement (user demographics, questions per user, question relevance, adjacent content engagement…), behavioural impact (health service directory access, self-reported impact), and User Experience indicators (latency, language, modality and user satisfaction…).

When it comes to GenAI-related data, our problem is no longer that we don't have enough; it's figuring out how to make sense of the data and put it to good use. This raises fundamental questions relating to budgets, timelines and team skills, which funders and programme designers will need to take into account, as well as deeper issues relating to informed consent.

A second vital need is ensuring that organisations doing similar work reach a semblance of alignment in the nature of the data collected. In this ad-hoc exercise alone, we can see potentially exciting overlap between data points across partners in comparable fields, particularly when it comes to demographic patterns (see below); but a lot more work would need to be done to ensure that this data could be meaningfully compared and benchmarks established. This would involve, at the very least, defining a shared set of definitions for common GenAI Q&A tool metrics.

Doing so would deepen and expand our collective learning, ultimately to the benefit of the entire sector. Again, funders have a crucial role to play here in supporting consistency and transparency across grantees doing similar work, and ideally encouraging grantees to share data widely. Similarly, renewed efforts to create some sort of centralised evidence base feel crucial at this early stage. This is not unique to GenAI – we never solved this problem for data generated by SMS, mobile web or apps either: data from digital interventions is hidden away in internal databases, scattered across many different public resources, or simply too complex to make sense of. Maybe GenAI itself can be put to good use here, as highlighted by Sidd, co-founder and CEO of Nivi, who uses GenAI to support rapid insight generation, cutting weeks or months from their previous iteration cycles.

Based on the indicators cited during this event, those most useful for understanding the overall impact of GenAI in SBC solutions would probably include:

  • Gender demographics of question-askers and gendered qualitative analysis of questions
  • Questions per user and questions per day/month
  • Qualitative analysis of questions asked
  • Proportion of questions successfully answered by model
  • Language choices (if relevant)
  • Impact (self-reported behaviour change, proxy or referral indicators)
  • User satisfaction ratings
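
To make the 'shared definitions' idea above concrete, here is a minimal sketch of what a common record format for GenAI Q&A tool metrics might look like. All field names and the comparison helper are hypothetical illustrations, not an agreed sector standard; the point is simply that cross-tool benchmarking only works once every field carries the same definition for everyone.

```python
from dataclasses import dataclass

# Hypothetical shared record for a GenAI Q&A tool -- field names are
# illustrative, not an agreed standard across organisations.
@dataclass
class QAToolMetrics:
    tool_name: str
    questions_per_user: float          # mean questions asked per active user
    answered_rate: float               # share of questions answered successfully (0-1)
    self_reported_impact_rate: float   # share of surveyed users reporting change (0-1)
    satisfaction: float                # mean user satisfaction, normalised to 0-1

def comparable(a: QAToolMetrics, b: QAToolMetrics) -> dict:
    """Naive side-by-side gap, only meaningful if both tools collected
    each metric under the same definition."""
    return {
        "answered_rate_gap": round(a.answered_rate - b.answered_rate, 3),
        "impact_gap": round(a.self_reported_impact_rate - b.self_reported_impact_rate, 3),
    }

tool_a = QAToolMetrics("ToolA", 4.2, 0.87, 0.40, 0.9)
tool_b = QAToolMetrics("ToolB", 2.1, 0.91, 0.13, 0.8)
print(comparable(tool_a, tool_b))  # → {'answered_rate_gap': -0.04, 'impact_gap': 0.27}
```

Even this toy comparison shows why alignment matters: the gaps are only interpretable if, say, 'answered successfully' means the same thing in both datasets.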

Ultimately, this effort towards consolidation and transparency would help us establish benchmarks and avoid pitfalls, which would accelerate both product development and eventual impact. 

4 Insights from using GenAI in SBC chatbots

The insights gathered were wide-ranging, but we did see some parallels emerge – some of which have been unpacked in more detail below. 

Adaptive, or personalised content experiences powered by GenAI can increase content consumption and completion.

In Dimagi's MathBot experiment, users taking maths quizzes where questions were dynamically adapted based on the user's response to the previous question were 16% more likely to complete subsequent questions than users in non-adaptive quizzes. This effect was more pronounced the earlier in the experience the dynamic adaptation happened. This has interesting implications for the User Experience design of such tools – suggesting, for example, that short, sharp use of GenAI early in the experience could be just as valuable as wholesale GenAI integration. Similarly, Nivi reported that using GenAI in the backend to analyse user inputs and questions early on helped them route users faster and more accurately towards content better suited to their stage on a behavioural journey.
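
The adaptive-routing idea described above can be sketched in a few lines. In MathBot the adaptation is GenAI-driven; this simple rule-based stand-in (entirely hypothetical, not Dimagi's implementation) just illustrates the control flow of stepping difficulty up or down based on the previous answer.

```python
# Hypothetical three-level question bank, for illustration only.
QUESTION_BANK = {
    1: "2 + 3 = ?",
    2: "12 x 4 = ?",
    3: "(17 x 3) - 29 = ?",
}

def next_difficulty(current: int, answered_correctly: bool) -> int:
    """Step difficulty up after a correct answer, down after a miss,
    clamped to the levels available in the bank."""
    step = 1 if answered_correctly else -1
    return min(max(current + step, min(QUESTION_BANK)), max(QUESTION_BANK))

# A user who gets a level-2 question right is routed to the level-3 question.
level = next_difficulty(2, answered_correctly=True)
print(QUESTION_BANK[level])  # → (17 x 3) - 29 = ?
```

In a real deployment the `next_difficulty` rule would be replaced by a model call that also considers *how* the user answered, but the routing skeleton stays the same.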

Girl Effect also shared how users receiving direct, personalised answers to their questions asked 200% more questions than their counterparts only signposted towards pre-written, static content: a timely and personalised answer is a powerful repeat engagement driver.

Improving your model’s performance in one area can have serious implications on quality for other metrics.

Jacaranda Health's UlizaMama service answers questions from pre- and post-partum women in Kenya. When the team attempted to refine the tone of their first iteration, "improving the human aspects of [the model unexpectedly] came at the expense of safety and reliability". Indeed, in version 2 of their model, answers were overall more likely to be medically accurate than in the previous iteration, but when they weren't, the errors were more serious. Jay Patel, Jacaranda's Director of Technology, emphasised that this was only spotted thanks to the involvement of human agents in both their user-facing product experience and their evaluation process.

He also stressed: "It was critical to get [our model] out in the real world, learn from the help desk agents, learn from mums themselves and those lessons helped us learn much faster than if we tried to iterate for ourselves." By exposing their model to real user data early (having built in rigorous technical and human guardrails), they were able to spot issues, fix them, and learn from them faster – as opposed to spending 6 months designing in a vacuum.

GenAI question-answering (often combined with static content) can increase the likelihood of users taking relevant action, and raises revolutionary questions about the nature of impact.

Both Girl Effect and Digital Green reported that their GenAI Q&A capabilities led to increased self-reported behaviour changes or evidence of intention to act: 40% of Digital Green’s Farmer.Chat users reported taking an action or changing their previous behaviour as a result of their interaction with the bot, and Big Sis GenAI users were 13% more likely to access service uptake content, such as family planning service directories.

Jona Repishti, Associate Director of Global Strategy at Digital Green, noted that by getting users precise, timely answers to their questions, they were forced to realise that they needed to change how they define ‘regular engagement’ and ‘impact’. In a top-down world, where users are drip-fed lots of more or less relevant content that they need to parse in order to find something relevant, expecting a user to engage weekly or even daily makes sense. But in a user-driven, bottom-up world where the user knows they can get reliable, relevant and precise answers anytime they like, platform usage patterns will inevitably change to being more cyclical and ad-hoc. The integration of instantaneous GenAI question answering capabilities also deepens the existing mismatch between broad-brush desired impact and the actual impact a farmer may experience simply based on having their questions answered correctly and usefully. 

Female users may drive the most impactful GenAI uptake.

Viamo's Ask Viamo Anything (AVA) service allows users in Zambia, Nigeria and other countries to ask questions on any topic via voice in a toll-free phone call, and receive a GenAI answer by voice without needing the internet or a smartphone. When exploring the data generated by this service, the team noticed that, contrary to assumptions (and usual digital usage patterns), uptake was driven by female users, who represented 59% of total successful users. Women were also more likely to ask questions related to the SDGs, including on family planning and gender-based violence. Nivi noticed similar trends, with Sidd remarking (to much hilarity) that "men are more likely to submit random utterances, where women are looking for specific guidance on specific methods."

During our post-panel discussion, attendees tried to unpack why this may be, with Lukas, Viamo's VP of Strategic Partnerships, sharing that early research suggested this could be a result of 'freedom fever' (my expression). Put simply, because of the historical constraints on knowledge-seeking that women experience in Zambia and elsewhere (lower education, lower digital literacy, lower agency…), a voice-based portal which leaves no trace (unlike a written chatbot, for example) must feel truly magical.

Jona from Digital Green shared a similar pattern amongst female Farmer.Chat users, who were more likely to ask questions about nutrition (likely because of traditional gender roles), but who also stretched the boundaries of those roles, using the service to explore themes which might traditionally have been outside their purview. Sidd stressed the importance of gendered content to meet this need: if the content or answers have not been adapted for female audience members, any impact of this perhaps unexpected demographic split can rapidly be blunted.

These were just 4 highlights from an hour's discussion, demonstrating just how much is now waiting to be unpacked in the field of SBC chatbots alone (only 1 of the possible uses of GenAI throughout the SBC programme lifecycle). We are certainly a lot closer to our intended aim of developing valuable shared research agendas around AI usage – with the question of gendered engagement surely at the forefront.

We invite you to explore the additional data shared, and get in touch if you are interested in participating in similar efforts to increase transparency and comparability across evidence.

by Isabelle Amazon-Brown
