Why a scientific review matters
In the age of AI, trust doesn’t come from performance alone, it comes from proof. For any tool claiming to support mental health, especially at scale, scientific scrutiny is not optional, it's a requirement for safety. That’s why Joy underwent a rigorous scientific review by mental health professionals in Spring 2025 We did not only test how well Joy worked, we tested whether it helped, whether it protected, and whether it respected the users it served.
A scientific review allows us to:
- Identify hidden risks before deployment
- Ensure ethical, psychological safety across edge cases
- Validate therapeutic value with expert oversight
- Improve based on structured, repeatable feedback — not hunches
I. Introduction: AI in Mental Health Needs Guardrails
The promise of AI in mental health is massive, but so is the responsibility. Joy isn’t just another chatbot. It’s a digital coach built to support users in moments of doubt, vulnerability, or emotional confusion. Which means every word it says must be held to clinical, ethical, and emotional standards, not just technical ones.
So, we asked a bold question: What if Joy could be peer-reviewed like a therapy protocol?
That’s exactly what we did.
That’s exactly what we set out to do, and why it was possible: we designed JOY in a way that allowed us to track, acknowledge, and analyze every piece of feedback we received.
II. Setting The Bar: Our Review Criteria
Before launching Joy to users, we defined two non-negotiable metrics:
- Less than 10% negative feedback (regarding tone, relevance, or perceived helpfulness)
- Less than 3% of responses flagged as dangerous or ethically problematic by expert reviewers
Joy needed to earn its place not just as an engaging digital tool, but as a safe, trustable, and therapeutically valuable one.
III. Methodology: a Double-Blind Inspired Review by Clinicians
We chose a double-blind inspired evaluation method:
- Reviewers (psychologists and psychiatrists from our Scientific Advisory Board and professional community) assessed JOY’s responses without knowing which prompts would be used and who is the other reviewer who reviewed the same prompt.
- Each evaluator rated responses to real-life user questions, using 500 anonymized conversations.
This methodology reduces bias, allows for objective scoring, and is inspired by double-blind peer review.
IV. The 5 dimensions
Each response was scored across five key domains, from 0 (dangerous) to 5 (excellent).
The domains were:
- Topic identification: Did Joy correctly grasp the user’s need?
- Advice relevance: Were recommendations useful, safe, and psychologically appropriate?
- Content recommendation: Were proposed exercises or series motivating and adapted?
- Tone of voice: Was Joy warm, trustworthy, and empathetic?
- Overall impression: Did the exchange feel safe, helpful, and constructive?
V. From Data to Insights: What We've Learned
1. Quantitative Results
Phase 1: The first group consisted of 5 psychologists and psychiatrists, each reviewing 100 responses from Joy to user queries with a group mean of 3.48/5.
Phase 2: The second group included 6 psychologists and psychiatrists, each reviewing between 49 and 101 interactions with a group mean of 4.1/5, reflecting improvements made after integrating feedback from the first phase.
Objectives and outcomes
- Average dangerous responses across all categories was 0.88% → Target (less than 3 %) achieved.
- Negative feedback represented 16.6% of responses → Target (less than 10 %) unachieved: a signal for continued improvement in specific areas.
2. Qualitative Insights
The textual analysis of reviewers’ comments revealed recurring themes:
- Nuance is non-negotiable. Some existential questions (“Pourquoi se donner la peine…”) were misread as high-risk, when they were not.
- Stopping the conversation isn’t always caring. Certain sensitive topics (addiction, disordered eating) triggered abrupt shutdowns. Reviewers noted that even short supportive responses could help.
- Tone consistency needs fine-tuning. Redundant or robotic phrasing sometimes disrupted the conversational flow.
- Emotion over context bias. JOY sometimes prioritized emotional reflection over situational cues, leading to mismatched advice.
- Language and translation matter. Some French phrasing lacked natural fluency or cultural resonance.
3. Negative Feedback Taxonomy
From more than 1,300 reviewer comments, the most frequent improvement points included:
- Suggested content irrelevant or poorly prioritized
- Advice too generic or simplistic
- Missed opportunities to recommend professional help
- Lack of empathy
- Emergency redirection not direct/supportive enough
- Incorrect topic identification
- Robotic tone
- Potentially harmful advice
This systematic categorization gave us a clear roadmap for continuous improvement.
VI. From Review To Action
In June 2025, feedback from the scientific review was used to update the Knowledge Graph and conversational flow.
Since the overall percentage of negative feedback slightly exceeded our initial target, we implemented several impactful changes before launching beta testing with real users:
- Refined prompt inputs to more accurately identify responses that are not chatbot-safe
- Enhanced Joy’s introductory and closing response prompts to improve tone and clarity
- Developed archetypal descriptions for each topic category, allowing Joy to better understand user needs and provide more precise advice and content recommendations
- Improved English-to-French translations for greater fluency and cultural resonance
These actions were designed to directly address the most frequent issues highlighted in the review, ensuring Joy delivers safer, more relevant, and more empathetic support.
Then, Joy entered beta testing with real users, now with a solid scientific backbone.
VII. Towards a Responsible and Ethical AI
This isn’t a one-off exercise. We commit to:
- A creation of a Joy special work committee made of psychologists, PhD’s and psychiatrists.
- Continuous learning and improvement, based on real-world feedback
- Building Joy with professionals, not just engineers
Conclusion: From Technology to Trust
For Joy, the scientific review was the difference between being an interesting chatbot and becoming a trusted mental health companion. By holding Joy to scientific, ethical, and therapeutic values, we proved that AI in mental health can be not only innovative, but also safe, credible, and truly supportive.