CulturalBERT-VLAP · Vasl Language Analysis Platform

The clinical AI
the field was
missing.

Standard NLP models are trained on majority-White internet text. They were never designed to read how BIPOC, LGBTQ+, and first-generation youth communicate distress — and they consistently fail to. CulturalBERT-VLAP was built specifically for those communities. Not adapted. Built.

High-Distress Signal Sensitivity
~94%

High-distress signal sensitivity measured in active IRB validation with the University of Maryland. Results to be published upon study completion.

Active validation · Results pending · Non-diagnostic output
198k+
Culturally specific mental health language samples in the training corpus
2,400+
AAVE and youth vernacular tokens added beyond standard BERT vocabulary
42
Distress signals across 5 behavioral dimensions
5
Cultural communities represented in the training annotation cohort
Chapter 01 — The Problem Standard NLP Cannot Solve

The model
never learned
their language.

General-purpose NLP models — including large language models — are trained on internet text that skews heavily toward majority-White, educated, English-speaking populations. The result is a systematic blind spot: the specific ways BIPOC, LGBTQ+, and first-generation youth signal emotional distress are consistently misread, deprioritized, or missed entirely.

This isn't a model failure in the traditional sense. These models perform well on the populations they were trained for. The failure is in deploying them, uncorrected, for populations they were never trained on — and expecting accurate clinical signal detection to follow.

The gap isn't theoretical. In clinical practice, it means a youth saying "lowkey been struggling fr" is read as casual. A pre-disclosure minimization pattern ("it's not that deep but...") reduces the model's confidence instead of increasing it. Community-developed coded language — terms built specifically to avoid content filters — registers as noise. Standard models consistently miss these signals in the communities that most need them caught.

"lowkey been struggling fr, ain't nobody understand what I'm going through"
Standard NLP reads
Low confidence signal. "lowkey" reduces severity weight. Grammatical irregularity reduces confidence further. Output: insufficient signal for flagging.
CulturalBERT-VLAP reads
CCM-09 minimization hedge ("lowkey"). "fr" authenticity escalator contradicting the hedge. HOP-03 relational hopelessness in AAVE. Elevated signal — surfaces to care team.
"it's not that deep but lowkey been struggling since school started."
Standard NLP reads
Negation ("not that deep") reduces signal weight. Mild concern flagged, low priority.
CulturalBERT-VLAP reads
CCM-04 classic pre-disclosure frame. Negation before authentic disclosure is the signal, not evidence against it. ISO-04 temporal onset marker — distress tied to a specific starting point.
"been thinking about unaliving lately ngl."
Standard NLP reads
"unaliving" not in vocabulary. Processed as unknown token or misclassified. Crisis signal missed entirely.
CulturalBERT-VLAP reads
SHA-03 coded suicidal ideation — "unaliving" is in the extended vocabulary. "ngl" — CRS-02 sincerity marker — confirms non-performative intent and elevates priority. Immediate clinical supervisor review triggered.
Chapter 02 — CulturalBERT-VLAP Architecture

Built on BERT.
Fine-tuned for
cultural fluency.

CulturalBERT-VLAP is a BERT-architecture language model fine-tuned on a purpose-built corpus of culturally specific mental health language. The base BERT architecture was selected for its bidirectional context processing — essential for reading the layered meaning in code-switching, minimization patterns, and culturally framed expressions where individual words carry different weight depending on surrounding context. The model was then extended with a culturally specific vocabulary and fine-tuned against a clinically annotated training corpus.

Foundation
BERT Base Architecture — Bidirectional Transformer

The bidirectional encoding layer processes the full context window simultaneously in both directions — unlike unidirectional models that read left-to-right. This is architecturally necessary for cultural signal detection: the meaning of "lowkey" depends entirely on what follows it, and the clinical significance of a minimization hedge ("it's not that deep but") only becomes clear in context of what comes after the conjunction.

BERT-base · 12 layers · 768 hidden dimensions · 12 attention heads
Extension
Vocabulary Extension — 2,400+ Cultural Tokens

Standard BERT vocabulary was extended with 2,400+ AAVE terms, youth vernacular expressions, code-switching patterns, and community-developed coded language — including terms built specifically to circumvent content filters (e.g., "unaliving"). This extension was compiled through community engagement with BIPOC and LGBTQ+ youth populations and validated by licensed clinicians with relevant community competency. Without this extension, approximately 23% of the training corpus would be processed as unknown tokens by the base model.

Community-sourced · Clinically validated · Continuously updated
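The vocabulary-gap figure above can be sketched as a simple coverage check: count how many tokens in a sample fall outside a model's vocabulary before and after extension. The vocabularies and sample message below are illustrative stand-ins, not VLAP's actual token lists.

```python
# Sketch: estimating the unknown-token gap a vocabulary extension closes.
# base_vocab, extension, and the sample are illustrative, not VLAP data.

def unknown_token_rate(tokens, vocab):
    """Fraction of tokens that fall outside the given vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)

base_vocab = {"been", "thinking", "about", "lately", "i", "feel", "alone"}
extension  = {"unaliving", "ngl", "lowkey", "fr"}   # cultural tokens
extended_vocab = base_vocab | extension

sample = "been thinking about unaliving lately ngl".split()

print(unknown_token_rate(sample, base_vocab))      # 2 of 6 tokens unknown
print(unknown_token_rate(sample, extended_vocab))  # 0.0 after extension
```

In a production BERT pipeline the same idea applies at the tokenizer level: out-of-vocabulary terms are split into subwords or mapped to an unknown token, which is why coverage of community-developed terms has to be added before fine-tuning.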
Training
Fine-Tuning on Culturally Specific Mental Health Corpus

The extended BERT model was fine-tuned on 198,000+ annotated language samples drawn from the communities VLAP serves — not web-scraped text, but specifically collected and clinically annotated mental health communication. Each sample was labeled by licensed clinicians with community competency training against the 42-signal taxonomy across five behavioral dimensions. The fine-tuning process was conducted in multiple phases with inter-annotator agreement validation.

198,000+ samples · Multi-phase fine-tuning · IAA validation across annotation cohort
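Inter-annotator agreement of the kind validated here is commonly measured with Cohen's kappa — observed agreement corrected for chance. A minimal two-annotator sketch, using illustrative signal labels rather than corpus data (the document does not state which IAA statistic Vasl uses):

```python
# Sketch: Cohen's kappa for two annotators labeling samples against a
# signal taxonomy. Labels are illustrative, not training-corpus data.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Raw proportion of samples where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["HOP-03", "ISO-04", "SHA-03", "CCM-04", "HOP-03", "ISO-04"]
b = ["HOP-03", "ISO-04", "SHA-01", "CCM-04", "HOP-03", "ISO-04"]
print(round(cohens_kappa(a, b), 3))  # 0.778 — substantial agreement
```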
Output
Dimensional Signal Profiling — Non-Diagnostic Output Layer

The model output layer generates dimensional signal profiles across the five-dimension taxonomy — not diagnostic outputs, not severity scores, and not clinical recommendations. The output is an interpretive context package: which signal patterns are present, their dimensional classification, and cultural interpretation notes that help clinicians understand what the patterns typically indicate in the communities that produce them. Every output includes a non-diagnostic disclaimer and is designed to support clinical judgment, not substitute for it.

Non-diagnostic · Interpretive context · 42-signal taxonomy · Clinician-surfaced only
Processing
In-Memory Processing — No Verbatim Storage

VLAP processes member language in-memory. Verbatim input text is not stored after signal profile generation. The retained output is a dimensional signal profile — not a transcript, not a quote, not a record of the member's exact language. This architectural constraint is a HIPAA technical safeguard, not a configurable setting. It cannot be disabled by organizational administrators or overridden by clinical staff.

In-memory inference · No PHI retention post-processing · HIPAA technical safeguard
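The in-memory contract described above can be expressed as a function shape: inference consumes verbatim text and returns an object from which that text is not recoverable. The signal codes and matching logic below are illustrative, not VLAP's actual inference path.

```python
# Sketch: inference returns only a dimensional signal profile — no field
# echoes the input text. Codes and keyword matching are illustrative.

CODED_TERMS = {"unaliving": "SHA-03", "ngl": "CRS-02", "lowkey": "CCM-09"}

def infer_signal_profile(verbatim_text: str) -> dict:
    """Return a dimensional profile; the verbatim text is not retained."""
    signals = sorted({code for term, code in CODED_TERMS.items()
                      if term in verbatim_text.lower()})
    return {"signals": signals, "non_diagnostic": True}

profile = infer_signal_profile("been thinking about unaliving lately ngl")
print(profile)  # {'signals': ['CRS-02', 'SHA-03'], 'non_diagnostic': True}

# The member's exact language is not reachable from the stored profile:
assert "unaliving" not in str(profile)
```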
Chapter 03 — Training Methodology

A corpus built
from inside the
communities.

The training data for CulturalBERT-VLAP was not web-scraped, crowd-sourced generically, or assembled from existing open datasets. It was specifically collected from and with the communities the model serves — and annotated by licensed clinicians who have worked within those communities.

Corpus Scale
198k+
Total annotated language samples
2,400+
Extended vocabulary tokens beyond BERT base
5
Cultural community cohorts represented in annotation
42
Distress signal patterns in the taxonomy
Methodology
Community-partnered data collection

Language samples were collected in partnership with community organizations serving BIPOC, LGBTQ+, and first-generation youth populations — not scraped from public platforms. Consent protocols, anonymization procedures, and community review were built into the collection process.

Clinical annotation with community competency requirement

All training samples were annotated by licensed clinicians with documented community competency training in the relevant population cohorts. Inter-annotator agreement was validated across the annotation cohort before samples were included in training data.

Vocabulary development through community engagement

The 2,400+ vocabulary extension was compiled through structured engagement with youth from the target communities — not through generic web scraping. Vocabulary candidates were reviewed for clinical accuracy by the annotation cohort before inclusion.

Bias monitoring integrated into training

False positive and false negative rates were disaggregated by demographic subgroup throughout training — not as a post-hoc audit but as an operational gate. Signal detection performance must meet minimum parity thresholds across BIPOC, LGBTQ+, and first-generation subgroups for a model version to be deployed.
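A parity gate of this kind reduces to a simple check: disaggregate an error rate by subgroup and block deployment if the spread exceeds a threshold. The rates, subgroup names, and threshold below are illustrative — the document does not publish Vasl's actual parity values.

```python
# Sketch: subgroup parity as an operational deployment gate.
# Rates and the 0.03 threshold are illustrative, not Vasl's values.

def passes_parity_gate(fnr_by_subgroup: dict, max_gap: float = 0.03) -> bool:
    """Gate: worst-minus-best false-negative rate must not exceed max_gap."""
    rates = fnr_by_subgroup.values()
    return max(rates) - min(rates) <= max_gap

candidate = {"BIPOC": 0.061, "LGBTQ+": 0.058, "first-gen": 0.072}
print(passes_parity_gate(candidate))                         # gap 0.014 → deployable
print(passes_parity_gate({"BIPOC": 0.05, "LGBTQ+": 0.11}))   # gap 0.06  → blocked
```

The same gate would run for false positives; a version that meets aggregate targets but fails either check for any subgroup never ships.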

Chapter 04 — The 5-Dimensional Signal Taxonomy

42 signals.
Five dimensions.

The VLAP signal taxonomy was developed through clinical annotation of the training corpus — identifying recurring patterns of culturally framed distress expression that standard clinical instruments and general NLP models systematically miss. The taxonomy is not static. It is updated as language evolves, with community input informing additions and revisions through a structured clinical review process.

HOP
8 signals
Hopelessness & Pessimism

Indirect hopelessness, temporal compression, and "no point" framing expressed through AAVE, coded youth language, and minimization patterns. Standard models read these as low-confidence signals. CulturalBERT-VLAP reads them as the primary signal. The cultural norm of underreporting — particularly prevalent in BIPOC male youth populations — means direct expressions of hopelessness are less common than coded expressions. HOP signals are most clinically significant when combined with CCM modifiers.

Signal Examples
"Just gotta make it through this week"
HOP-05 · Acutely compressed future horizon
"ain't nobody care anyway so"
HOP-03 · AAVE relational hopelessness + ISO co-occurrence
"what's even the point fr"
HOP-07 · Direct futility + authenticity marker
ISO
9 signals
Social Isolation

Withdrawal signals, relational distancing, and absence of connection framed as normal or preferred. ISO signals are particularly significant for BIPOC and LGBTQ+ youth whose support networks are often culturally specific and difficult to replace — the stakes of isolation are higher. Critically, ISO signals frequently appear as positively framed statements ("I'm good, just been staying to myself") that read as fine to general models but register as relational withdrawal in cultural context.

Signal Examples
"I don't really fw people like that no more"
ISO-01 · Broad relational withdrawal · temporal marker
"been keeping to myself, it's fine tho"
ISO-04 · Withdrawal with minimization · CCM co-occurrence
"my people don't really get it"
ISO-07 · Perceived community disconnect
SHA
7 signals
Self-Harm Adjacent

Coded self-harm and suicidal ideation language — including community-developed terms built specifically to circumvent content filters. The SHA dimension required the most extensive vocabulary extension: terms like "unaliving," "sewerslide," and others circulating within youth online communities to communicate suicidal ideation while avoiding automated detection. Standard models with standard vocabularies have zero coverage of these terms. SHA-category signals are the highest-stakes VLAP detections and receive the most conservative threshold settings — lower false negative rates, higher false positive tolerance.

Signal Examples
"been thinking about unaliving lately ngl"
SHA-03 · Coded SI + CRS-02 sincerity marker
"tired of being here honestly"
SHA-01 · Indirect ideation · ambiguous but elevated
"what if I just didn't show up anymore"
SHA-05 · Disappearance ideation framing
CRS
6 signals
Crisis Risk Signals

Acute distress patterns including perceived burdensomeness, escalating hopelessness with sincerity markers, and suicidal ideation confirmed by authenticity escalators. CRS signals operate as a multiplier on co-occurring signals — a SHA signal becomes a clinical priority when paired with a CRS sincerity marker. CRS-category detections trigger immediate surfacing to Vasl's licensed clinical supervisor team with a 90-minute human response SLA. No automated action is taken; human clinical judgment determines every response.

Signal Examples
"everyone would be better off without me fr"
CRS-01 · Perceived burdensomeness + authenticity marker
"ngl I'm not doing good at all"
CRS-02 · Sincerity marker · escalates co-occurring signals
CCM
12 signals
Cultural Context Modifiers

The CCM dimension is the most architecturally significant of the five — and the most clinically important. CCM signals don't indicate distress directly; they modify how other signals should be interpreted. Pre-disclosure minimization (CCM-04) paired with an ISO signal means the withdrawal is more significant than it appears. Spiritual deflection (CCM-12) may indicate either genuine resilience or disclosure avoidance. Code-switching (CCM-08) indicates the member is speaking to a perceived authority audience — context that changes what they're willing to disclose. Standard models have no equivalent to the CCM dimension; they read utterances at face value without cultural framing.

Signal Examples
"it's not that deep but..."
CCM-04 · Pre-disclosure minimization · increases confidence
"God got me. That's all I'm saying."
CCM-12 · Spiritual closure · may signal avoidance
"lowkey" · "kinda" · "I guess"
CCM-09 · Minimization hedges · signal: underreporting of severity
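The modifier logic that distinguishes CCM from the other four dimensions can be sketched as priority arithmetic: a hedge or sincerity marker carries no priority alone, but raises the priority of co-occurring signals. The codes, weights, and rules below are illustrative, not VLAP's scoring model.

```python
# Sketch: CCM signals as interpretation modifiers — minimization before
# disclosure raises, not lowers, surfaced priority. Weights illustrative.

BASE_PRIORITY = {"ISO-04": 2, "SHA-03": 4, "HOP-03": 2}
CCM_MODIFIERS  = {"CCM-04", "CCM-09"}   # pre-disclosure minimization, hedges
CRS_ESCALATORS = {"CRS-02"}             # sincerity markers ("ngl", "fr")

def surfaced_priority(signals: set) -> int:
    base = max((BASE_PRIORITY.get(s, 0) for s in signals), default=0)
    if signals & CCM_MODIFIERS:
        base += 1   # the hedge itself is evidence, not counter-evidence
    if signals & CRS_ESCALATORS:
        base += 2   # sincerity marker escalates co-occurring signals
    return base

print(surfaced_priority({"ISO-04"}))             # 2 — withdrawal alone
print(surfaced_priority({"ISO-04", "CCM-04"}))   # 3 — hedge raises it
print(surfaced_priority({"SHA-03", "CRS-02"}))   # 6 — crisis priority
```

A standard model does the opposite at the second line: the negation ("it's not that deep") would subtract from confidence rather than add to it.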
Chapter 05 — Validation & Accuracy

Validated against
real clinical
populations.

VLAP's accuracy claims are grounded in active IRB-approved clinical research with the University of Maryland — not internal testing, not synthetic benchmarks, and not general NLP performance metrics that don't account for cultural signal specificity.

Accuracy Metrics
~94%
High-distress signal sensitivity

VLAP's ability to detect high-distress signals when they are present in member language — measured against clinician-adjudicated ground truth in the IRB study cohort. Sensitivity is optimized conservatively for SHA and CRS dimensions: we accept more false positives to minimize missed crisis signals.

23%
Standard NLP vocabulary gap

Percentage of the VLAP training corpus that would be processed as unknown tokens by a standard BERT model without the extended vocabulary — representing the portion of culturally specific language that standard models are structurally unable to read.

Validation Approach
Active IRB Study — University of Maryland

VLAP signal accuracy is being validated through an IRB-approved clinical study with the University of Maryland, using production deployment data from live Vasl cohorts. The study compares VLAP signal output against clinician-adjudicated gold-standard assessments of the same member language. Results will be published in peer-reviewed literature upon study completion.

Demographic Bias Monitoring

False positive and false negative rates are disaggregated across BIPOC, LGBTQ+, and first-generation subgroups in both the training validation and the IRB study. Parity thresholds are enforced operationally — a model version that meets aggregate accuracy targets but fails subgroup parity is not deployed. Bias monitoring is an ongoing production gate, not a one-time evaluation.

What Sensitivity Means — and Doesn't

Sensitivity measures how consistently VLAP detects signals when they are present. It is not precision — the share of surfaced signals that prove clinically significant in a given instance. A high-sensitivity threshold means more signals are surfaced, which is appropriate for a clinical support tool. The clinical significance of any specific signal is always determined through human clinical review, not by the model.
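The distinction is concrete in the confusion-matrix arithmetic: sensitivity is TP / (TP + FN), precision is TP / (TP + FP). The counts below are illustrative, not figures from the IRB study.

```python
# Sketch: sensitivity vs. precision against clinician-adjudicated ground
# truth. Counts are illustrative, not IRB study results.

def sensitivity(tp: int, fn: int) -> float:
    """Share of truly present signals the model detected: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    """Share of surfaced signals that were truly present: TP / (TP + FP)."""
    return tp / (tp + fp)

# Conservative thresholds trade precision for sensitivity: accept more
# false positives (surfaced for human review) to minimize missed signals.
print(sensitivity(tp=94, fn=6))   # 0.94 — few missed signals
print(precision(tp=94, fp=40))    # ~0.70 — more surfaced, humans adjudicate
```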

IRB Study — University of Maryland

The active IRB study is currently in the data collection and preliminary analysis phase. Results will be published in a peer-reviewed journal upon completion. The study protocol and preliminary design documentation are available to institutional evaluators under NDA. Contact clinical@vaslhealth.com to request access.

Medical Advisory — Senior Clinical Oversight

Vasl Health's clinical validation approach is overseen by its Senior Medical Advisor, Panagis Galiatsatos, MD, MHS — Assistant Professor of Medicine at Johns Hopkins University School of Medicine. Dr. Galiatsatos provides clinical oversight on VLAP's signal detection methodology, accuracy validation approach, and non-diagnostic output framing.

Chapter 06 — Clinical Integration Model

Surfaces to
clinicians.
Never to members.

VLAP is a clinical decision-support tool, not a member-facing AI. It operates entirely behind the clinical layer — invisible to the people whose language it processes. Every signal it surfaces is directed to a licensed clinician or certified coach, reviewed by a human, and responded to through human clinical judgment. The platform is built so that automated action in response to a clinical signal is architecturally impossible.

Step 01
Member check-in or coach message

VLAP processes only language shared through Vasl's care channels — daily check-ins and coach messaging threads. Peer group posts, external social media, school email, and any other channel are not processed.

Step 02
In-memory VLAP inference

CulturalBERT-VLAP processes the language against the 42-signal taxonomy. In-memory only — verbatim text is not retained after processing. Output: a dimensional signal profile.

Step 03
AI Client Insights surface to coach dashboard

Coaches see a simplified surface of VLAP output in the AI Client Insights panel: plain-language pattern alerts and mood trajectory summaries for their active members. No dimensional codes, no clinical jargon.

Step 04
Full dimensional panel surfaces to licensed clinician

When a member is connected to a licensed clinician, the pre-session view includes the full VLAP dimensional signal profile — dimensional codes, pattern descriptions, cultural interpretation notes, and coaching context. Accessible only to licensed clinicians.

Step 05
CRS signals surface to clinical supervisor — 90-minute SLA

Crisis Risk Signal detections are surfaced immediately to Vasl's licensed clinical supervisor team. A licensed clinician reviews the signal and determines the appropriate response. No automated action. Human judgment initiates every response.

VLAP Does Not
Respond to members or generate therapeutic messages
Diagnose or suggest diagnoses to any party
Make clinical decisions or recommendations
Initiate contact with members automatically
Take automated action in response to crisis signals
Scan peer group posts or external channels
Store verbatim member language after processing
Surface individual signal data to school or org administrators
VLAP Does
Detect culturally specific distress signals in care-channel language
Surface dimensional signal profiles to coaches and clinicians
Flag crisis signals for immediate human clinical supervisor review
Provide pre-session cultural context to licensed clinicians
Support coaches with AI Client Insights summaries
Enable aggregate, de-identified population trend data for org dashboards
Process in-memory without verbatim PHI retention
Operate entirely behind the clinical layer — invisible to members
Chapter 07 — Security Architecture

Built for
clinical data.

VLAP processes the most sensitive category of user data — mental health language from youth in underserved communities. The security architecture was designed specifically for HIPAA-regulated, school-based, and community health deployment contexts. Every constraint below is architectural, not configurable.

01 · Data Handling
In-Memory Processing

VLAP processes member language in-memory. Verbatim input text is not stored after signal profile generation. The retained output is a dimensional signal profile — not a transcript, not a quote. This is a HIPAA technical safeguard, not a configurable setting.

02 · HIPAA
Full Technical Safeguard Implementation

HIPAA Security Rule technical safeguards implemented across all platform components — encryption in transit and at rest, access controls, audit logging, and automatic logoff. Business Associate Agreement required for all organizational deployments. Annual third-party security audit.

03 · Certification
SOC 2 Type II

Annual SOC 2 Type II audit covering security, availability, and confidentiality trust service criteria. Full audit report available to institutional evaluators under NDA. Audit conducted by an independent third-party auditor.

04 · Access Control
Role-Based Signal Access

Individual VLAP signal context is accessible only to the assigned coach (AI Client Insights summary) and the assigned licensed clinician (full dimensional profile). School staff, org administrators, and Vasl team members outside clinical supervisory functions have zero access to individual signal data — architecturally enforced.
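Enforced architecturally, the tiers above reduce to an access map with no entry for administrative roles — there is no flag to flip. The role names and views below are illustrative of the tiers described, not Vasl's actual authorization code.

```python
# Sketch: role-based signal access enforced in code, not configuration.
# Roles and views are illustrative of the access tiers described above.

ACCESS = {
    "coach":     "insights_summary",   # AI Client Insights only
    "clinician": "full_dimensional",   # full VLAP dimensional profile
}

def signal_view(role: str, assigned: bool):
    """Return the view a role may see, or None. Admin roles are never
    in the map, so there is nothing to misconfigure."""
    if not assigned:
        return None                    # only the assigned care team
    return ACCESS.get(role)            # org_admin / school_staff → None

print(signal_view("clinician", assigned=True))   # 'full_dimensional'
print(signal_view("org_admin", assigned=True))   # None — no individual access
```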

05 · FERPA
School-Based Deployment

For school district deployments, Vasl operates as a direct service provider to students. Student health data generated in Vasl is classified as health information under HIPAA — not as an education record under FERPA — and is structurally inaccessible to school administrators under any circumstances.

06 · Aggregate Data
De-Identification by Architecture

Population-level aggregate signal trends surfaced to org administrators use minimum cohort size enforcement to prevent re-identification by inference. Individual member contributions to aggregate data are never discernible. This constraint applies to all org-level reporting, without exception.
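Minimum-cohort-size enforcement is a small-cell suppression rule: any group below the threshold is dropped from the report rather than shown. The threshold of 10 and the group names below are illustrative — the document does not state Vasl's actual cohort minimum.

```python
# Sketch: small-cell suppression for aggregate org dashboards.
# MIN_COHORT = 10 is illustrative; the actual threshold is not published.

MIN_COHORT = 10

def aggregate_trend(counts_by_group: dict) -> dict:
    """Report only groups large enough that no individual is inferable."""
    return {g: n for g, n in counts_by_group.items() if n >= MIN_COHORT}

raw = {"grade_9": 42, "grade_10": 31, "grade_11": 4, "grade_12": 17}
print(aggregate_trend(raw))   # grade_11 suppressed (n=4 below threshold)
```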

Chapter 08 — Institutional Partnerships

Validated by
the institutions
that matter.

VLAP's clinical credibility is grounded in active institutional partnerships — not aspirational affiliations or advisory relationships that don't involve actual work. The partnerships listed below involve ongoing operational collaboration, active research, or formal clinical oversight.

Mayo Clinic Platform
Health System — Research Accelerator
Mayo Clinic Platform_Accelerate — Active Participant

Vasl Health is an active participant in the Mayo Clinic Platform_Accelerate program, which provides access to de-identified patient data infrastructure and clinical research partnership. Vasl is engaged in a two-year data collaboration — Year 1 focused on behavioral health signal model development using longitudinal HL7 FHIR-formatted data (BH encounters, ICD-10 codes for mood/anxiety/trauma/SUD, standardized assessments, demographics/SDoH), with Year 2 targeting prospective validation of CulturalBERT-VLAP signal accuracy against clinical ground truth. Primary contact: Asia Smith, MPH, Program Success Manager.

Active
University of Maryland
Research University — IRB Study
Active IRB Study — VLAP Clinical Signal Validation

An IRB-approved clinical study is in progress with the University of Maryland validating CulturalBERT-VLAP's signal detection accuracy against clinician-adjudicated ground truth assessments. The study uses production deployment data from live Vasl cohorts. Results will be published in a peer-reviewed journal upon completion. The study represents the first formal independent validation of VLAP's culturally specific signal detection capabilities.

IRB Active
Johns Hopkins University
Medical School — Senior Medical Advisor
Senior Medical Advisory — Clinical Validation Oversight

Panagis Galiatsatos, MD, MHS — Assistant Professor of Medicine at Johns Hopkins University School of Medicine — serves as Vasl Health's Senior Medical Advisor. Dr. Galiatsatos provides clinical oversight on VLAP's signal detection methodology, accuracy validation approach, non-diagnostic output framing, and the clinical governance of the platform's care coordination model. His advisory role involves active participation in clinical review, not nominal affiliation.

Advisory
Chapter 09 — Documentation Access

Evaluate
the model
directly.

Vasl Health provides full technical documentation to qualified institutional evaluators — health systems, research institutions, school district technology teams, and health plan medical directors. All documentation is available under NDA. Contact clinical@vaslhealth.com or use the form below to initiate an evaluation request.

VLAP Technical Specification
37-page model specification — architecture, training methodology, signal taxonomy with full annotation guidelines, inference pipeline, bias monitoring protocol, and performance benchmarks.
NDA Required
IRB Study Protocol
University of Maryland IRB study design, methodology, data collection protocol, and preliminary validation framework. Results pending publication.
NDA Required
SOC 2 Type II Report
Full annual third-party security audit report covering security, availability, and confidentiality trust service criteria.
NDA Required
HIPAA Technical Safeguards Documentation
Complete HIPAA Security Rule implementation documentation — technical safeguards, PHI handling protocols, BAA template, and breach notification procedures.
Available on Request
Mayo Clinic Platform_Accelerate Collaboration Brief
Overview of the two-year data collaboration scope, data domains, formats (HL7 FHIR / CSV / JSON), HIPAA Safe Harbor de-identification specifications, and Year 1/Year 2 objectives.
NDA Required
Pilot Outcome Data
Aggregate outcomes from deployed pilot cohorts — PHQ-8 improvement, 30-day retention, session engagement, and clinical escalation rates. De-identified, aggregate only.
Available on Request