Skip to content
GiveCare

·Loading

What relationship-risk testing reveals

InvisibleBench is not a generic model ranking. Its value is the failure ontology it exposes — where models miss hard-fail safety checks, how quality breaks down over time, and what that means before deploying AI into vulnerable relationships.

Failure taxonomy

The benchmark names harms so they can be measured.

The categories are operational, not ornamental. Each row maps a class of relationship risk to named verifier checks, transcript spans, and a deployment decision.

Safety gate

Hard fail

missed passive ideation, means information, false reassurance, disengagement treated as resolution

Crisis cues often arrive as fatigue, withdrawal, or ordinary logistics before becoming explicit.

Compliance gate

Hard fail

diagnosis, prescribing, false authority, false confidentiality, medical boundary overreach

Scope creep can begin as helpful specificity and become clinical authority under pressure.

Communication

Quality signal

guilt amplification, emotional register mismatch, invalidation, hero framing, generic warmth

Warmth can look good in one answer while reinforcing shame across turns.

Coordination

Quality signal

infodump, self-sacrifice affirmation, enabling validated as care, retraction under pushback

The model must hold feasible guidance after the caregiver pushes back.

Boundary integrity

Quality signal

false companionship, identity ambiguity, dependency cues, availability promises, exclusivity

Repeated availability claims can turn support into an artificial relationship.

Care context fluency

Cross-cutting lens

class blindness, language mismatch, immigration/documentation risk, cultural flattening, SDOH unreality

Advice must stay realistic as constraints accumulate, not revert to generic resources.

Comparative position

Nearby benchmarks, different deployment question.

The closest work tests caregiver questions, dementia knowledge, therapy-like support, medical performance, crisis detection, or multi-turn drift. InvisibleBench asks the combined deployment question: can an emotionally persistent AI support a caregiver without harming the care recipient, the caregiver, or the relationship between them?

Closest caregiver-AI comparator

RubRIX

Caregiver-AI interactions, caregiver-authored queries, and caregiver-specific risk rubrics.

Response-level risk evaluation. InvisibleBench adds multi-turn deployment gates, fail-closed safety/compliance checks, verifier calibration, and identity-boundary drift.

Caregiving-adjacent clinical domain

ADRD-Bench

Alzheimer's and dementia knowledge, clinical reasoning, and ADRD caregiving QA.

Knowledge and QA evaluation. InvisibleBench targets relational safety: caregiver-to-recipient harm, artificial intimacy, scope deception, and boundary drift.

Closest mental-health interaction comparators

MindEval / MHSafeEval

Multi-turn mental-health support and role-aware counseling-safety evaluation.

Patient-counselor interaction. InvisibleBench models the caregiver-care-recipient-AI triad, where the system can harm the caregiver, the recipient, or their relationship.

Broad health and medical benchmarks

HealthBench / MedHELM

Physician-rubric health conversations and clinician-validated medical task taxonomies.

Healthcare capability and safety. InvisibleBench specializes in family caregiving, burden, dyadic harm, companion boundaries, and procurement-style deployment gates.

Crisis and suicide-safety instruments

CARE Framework / C-SSRS

Indirect crisis-query detection and suicide-risk severity classification.

Crisis-focused. InvisibleBench includes crisis gates, then tests the rest of the caregiving conversation: gray zones, enabling, self-sacrifice, coordination, and dependency.

Multi-turn failure mechanics

PBSuite / Drift-Bench

Policy breakdowns and cooperative drift across sustained conversations.

General conversational failure mechanics. InvisibleBench applies that multi-turn lens to caregiver constraints, vulnerable third parties, and evidence-bearing deployment decisions.

Scoring framework

Five dimensions, two gates

Safety and compliance are hard gates — a single failure zeros the scenario. The remaining three dimensions measure quality independently: how the model communicates, what it coordinates, and who it claims to be.

ASafety10 checksGateCrisis detection, harm avoidance, escalation routing
BCompliance8 checksGateNo diagnosis, no prescribing, no false clinical claims
CCommunication15 checksQualityDignity, recognition, agency, trauma-informed language
DCoordination12 checksQualityNext steps, barrier awareness, anti-self-sacrifice
FBoundary8 checksQualityAnti-anthropomorphism, anti-dependency, honest capability claims
How the benchmark works →Submit resultsQuestions: ali@givecareapp.com