Every major NLP model in clinical use was trained predominantly on majority-White internet text. That corpus barely represents AAVE. It barely represents queer vernacular. It barely represents the coded language underserved youth use to describe pain. The gap between what the model was trained on and who it's supposed to serve is not a fine-tuning problem. It's an architecture problem. CulturalBERT-VLAP is the answer.
CulturalBERT-VLAP begins with a BERT-architecture base and extends it with 198,000+ training samples drawn from culturally-specific mental health language — AAVE, queer vernacular, code-switching, trauma-adjacent idiom, and youth-specific coded language. This is not a demographic filter applied to a standard model. The cultural specificity is in the weights.
"A model trained on the wrong language will always be measuring the wrong thing. No fine-tuning fixes a training set."
What Standard NLP Misses
The Cultural Signal Taxonomy V1 defines every signal CulturalBERT-VLAP is trained to detect. Each signal has a clinical definition, a community-language annotation, and a confidence weighting methodology. None produce a diagnosis. All surface as interpretive context for a licensed clinician — who decides what to do with them.
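A taxonomy entry as described maps onto a simple record. The sketch below is one hypothetical shape for such an entry; the field names and the example signal code are assumptions for illustration, not the specification's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaxonomySignal:
    """Hypothetical shape of one Cultural Signal Taxonomy V1 entry."""
    code: str                         # e.g. "SHA-03" (illustrative, not a real code)
    dimension: str                    # one of the five behavioral dimensions
    clinical_definition: str          # what the signal means clinically
    community_annotations: list[str] = field(default_factory=list)
    confidence_weighting: str = ""    # reference to the weighting methodology
    produces_diagnosis: bool = False  # always False: interpretive context only
```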
Eight signals covering language that expresses the loss of forward possibility — direct statements, AAVE constructions, and the performative positivity patterns that mask despair beneath apparent calm. This dimension is particularly sensitive to the temporal-collapse language common in youth expressing serious distress.
Nine signals covering the spectrum of social disengagement — from broad relational withdrawal to community-specific expressions of disconnection. Includes family rejection language specific to LGBTQ+ youth, digital-native "unseen" expressions, and the behavioral reframing of withdrawal as self-elevation common in youth who have internalized isolation as a coping posture.
Seven signals covering coded, indirect, and euphemistic expressions of self-harm ideation. This dimension was specifically extended to include community-developed coded language — vocabulary that standard NLP models have never encountered and cannot classify. SHA signals carry elevated review priority in the clinical workflow, second only to CRS.
Six signals that indicate acute temporal framing, farewell patterns, and sudden behavioral changes that have been documented as precursors to crisis escalation. CRS signals trigger the highest-priority clinical review pathway — surfaced to clinical supervisors for human review within 90 minutes of detection.
Twelve modifiers that adjust the contextual weight of other signals based on cultural register, minimization patterns, code-switching, and community-specific stressors. CCMs allow CulturalBERT-VLAP to read the same words differently depending on who is speaking, in what context, and through which cultural lens — the precondition for accurate interpretation.
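Mechanically, a CCM layer amounts to context-dependent reweighting of base signal confidences. A minimal sketch, assuming modifiers act as multiplicative weights; the signal and modifier codes here are illustrative, and the actual weighting methodology is defined per signal in the taxonomy.

```python
def apply_ccm(signal_conf: dict[str, float],
              active_modifiers: dict[str, dict[str, float]]) -> dict[str, float]:
    """Reweight base signal confidences with active Cultural Context Modifiers.

    signal_conf      -- base confidence per signal code, e.g. {"SHA-03": 0.42}
    active_modifiers -- per-modifier weight maps detected for this speaker/context
    """
    adjusted = dict(signal_conf)
    for modifier, weights in active_modifiers.items():
        for code, weight in weights.items():
            if code in adjusted:
                adjusted[code] *= weight  # same words, different contextual weight
    return adjusted

# The same sentence scores differently once a code-switching modifier is active:
base = {"SHA-03": 0.42}
print(apply_ccm(base, {"CCM-CODE-SWITCH": {"SHA-03": 1.6}}))  # {'SHA-03': 0.672}
```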
These aren't edge cases. They're the language of the communities VLAP was built to serve. Standard sentiment analysis fails on all four — not because of a tuning problem, but because the training data never contained them. Each example below is a composite drawn from real signal categories in the taxonomy.
Every piece of text processed by CulturalBERT-VLAP moves through a documented, auditable pipeline. Each step has a clear technical purpose, a privacy control, and a human accountability point. No step produces a clinical decision. The pipeline produces context — and then a licensed clinician decides what to do with it.
Text is received from Member App check-ins and coach messaging — channels where youth have explicitly chosen to share within a care relationship. No passive collection. No ambient monitoring. Language enters the pipeline only through intentional member action within the Vasl platform.
Before any inference runs, the system verifies that the member has valid, current consent for language processing. No consent — the text is rejected immediately and never enters the inference pipeline. Consent is not assumed from enrollment. It is verified at the point of processing, for every submission.
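A consent gate of this kind is a few lines of control flow. The sketch below assumes a hypothetical consent store and pipeline interface; its only purpose is to show the ordering guarantee: no consent record, no inference.

```python
class ConsentError(Exception):
    """Raised when no valid, current consent exists for language processing."""

def process_submission(member_id: str, text: str, consent_store, pipeline):
    # Consent is verified at the point of processing, for every submission.
    # consent_store and pipeline are hypothetical interfaces.
    if not consent_store.has_current_consent(member_id, scope="language_processing"):
        raise ConsentError("Rejected before inference: no valid consent on file.")
    return pipeline.run(text)  # text enters inference only past this gate
```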
Microsoft Presidio — an open-source PII detection and de-identification toolkit — scrubs all personally identifiable information from the text before it reaches the inference layer. Names, locations, phone numbers, account references, and identifying context are removed. The model never processes a member's identity — only their language.
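Presidio's analyzer/anonymizer pair makes this step compact. The snippet below uses the library's public API; recognizer configuration and language models are simplified for brevity, and the example input is invented.

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def scrub_pii(text: str) -> str:
    """Remove names, phone numbers, and other PII before inference."""
    findings = analyzer.analyze(text=text, language="en")
    return anonymizer.anonymize(text=text, analyzer_results=findings).text

# The default operator replaces each finding with its entity type, e.g.:
# "<PERSON> said he can't keep doing this, call <PHONE_NUMBER>"
print(scrub_pii("Jordan said he can't keep doing this, call 410-555-0182"))
```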
The text is tokenized using VLAP's extended vocabulary — which includes the 2,400+ AAVE and youth vernacular tokens added to the base BERT vocabulary. This is where standard NLP fails: it encounters terms like "unaliving" or constructions like "can't keep doing this no more fr" and has no learned representation for them. VLAP has trained representations for all of them.
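The mechanics of extending a BERT vocabulary are standard in the HuggingFace stack. The token list below is illustrative, not VLAP's actual 2,400+ entries.

```python
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Base BERT has no entry for community vocabulary, so it falls back to
# subword fragments that carry no learned cultural meaning.
print(tokenizer.tokenize("unaliving"))

# Illustrative additions; VLAP's extended vocabulary holds 2,400+ such tokens.
tokenizer.add_tokens(["unaliving", "fr"])

model = BertModel.from_pretrained("bert-base-uncased")
# New embedding rows are created here and learned during training — which is
# what gives the added tokens trained representations instead of fragments.
model.resize_token_embeddings(len(tokenizer))
```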
The model runs inference across the full Cultural Signal Taxonomy — evaluating the input against all 42 signals across five behavioral dimensions simultaneously. CCM modifiers adjust signal weights based on cultural register, minimization patterns, and code-switching context. Output is a structured dimensional signal profile — not a score, not a diagnosis, not a risk tier.
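Mechanically, evaluating 42 signals at once is a multi-label classification pass: one independent sigmoid confidence per signal, then CCM reweighting. A sketch, assuming a sequence-classification head with 42 outputs and reusing the illustrative apply_ccm function from the taxonomy section above:

```python
import torch

@torch.no_grad()
def dimensional_signal_profile(model, tokenizer, text, signal_codes, active_modifiers):
    """One pass over all 42 signals. Output is a profile, not a score or tier."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    logits = model(**inputs).logits.squeeze(0)  # shape (42,): one logit per signal
    confidences = torch.sigmoid(logits)         # independent multi-label confidences
    base = {code: float(c) for code, c in zip(signal_codes, confidences)}
    return apply_ccm(base, active_modifiers)    # CCM reweighting, sketched earlier
```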
The dimensional signal profile is structured into an interpretive context package — signal codes, community-language annotations, confidence indicators, and session history context — and delivered to the clinician's Coach Portal view. The raw text is discarded at this point. What the clinician receives is context, not content. The member's words do not persist in the system.
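The interpretive context package can be pictured as a small, text-free record. The field names below are assumptions for illustration; the point of the shape is that raw member text cannot persist in it by construction.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InterpretiveContextPackage:
    """What the Coach Portal receives. Deliberately carries no raw member text."""
    member_ref: str                        # de-identified reference, never a name
    signal_codes: tuple[str, ...]          # e.g. ("SHA-03", "CRS-01") -- illustrative
    community_annotations: tuple[str, ...]
    confidence_indicators: dict[str, float]
    session_history_context: str
    # No raw_text field exists: the member's words are discarded upstream
    # and have nowhere to live in this structure.
```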
The pipeline ends here — with a human being. No automated action follows from VLAP output. No alert fires. No protocol activates. The clinician reviews the dimensional signal context before the session and decides what to do with it. For signals that meet CRS-level thresholds, a clinical supervisor review is initiated within 90 minutes — by a person, not a trigger.
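The escalation path is deliberately thin: a threshold check that places a package in a human review queue with a deadline, and nothing more. A sketch, with the threshold value and the queue interface assumed for illustration:

```python
from datetime import datetime, timedelta, timezone

CRS_THRESHOLD = 0.80  # illustrative; actual thresholds live in the taxonomy

def route_for_review(package, review_queue):
    crs_hits = {code: conf for code, conf in package.confidence_indicators.items()
                if code.startswith("CRS") and conf >= CRS_THRESHOLD}
    if crs_hits:
        # Enqueue for a clinical supervisor: a person reviews within 90 minutes.
        # No alert fires, no protocol activates, no automated action follows.
        review_queue.enqueue(
            package,
            priority="supervisor",
            review_by=datetime.now(timezone.utc) + timedelta(minutes=90),
        )
```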
These are not guidelines. They are the non-negotiable constraints that governed every architectural decision in CulturalBERT-VLAP — from training data selection through deployment. If a design choice conflicted with any of these, the design changed. Not the principle.
No member ever sees a signal code, a confidence weight, or any output from VLAP. Not a simplified version. Not a wellness score derived from it. Nothing. VLAP outputs flow exclusively to licensed clinicians and authorized clinical administrators. The member experiences the care that follows — not the system behind it.
No automated action follows from VLAP output. A clinical supervisor does not receive an automated directive. The 988 crisis line is not called by the platform. A coach is not dispatched by an algorithm. What VLAP does is inform a human — who then decides what to do. The human is not a checkpoint in an automated system. The human is the system.
Member language is processed in-memory through the VLAP pipeline and discarded at the point of clinical output generation. What persists is the dimensional signal profile — not the text that generated it. The member's words are not in a database. The meaning the model extracted from them is, in a de-identified form that serves only the care relationship it was created in.
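The retention rule can be enforced by API shape alone: the persistence layer simply has no parameter that could accept member text. A minimal sketch, with the storage interface assumed:

```python
def persist_profile(store, member_ref: str, profile: dict[str, float]) -> None:
    """Persist only the de-identified dimensional signal profile.

    There is deliberately no text parameter: by the time persistence runs,
    the member's words have already been discarded from memory.
    """
    store.save(member_ref=member_ref, signal_profile=profile)
```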
False positive and false negative rates are tracked by demographic subgroup — race, ethnicity, gender identity, age cohort — in every inference batch. If systematic disparities emerge, they don't go into a report. They trigger a mandatory model review that suspends deployment of the affected signal until the disparity is resolved. This is how a model built for equity stays that way.
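Subgroup tracking of this kind reduces to per-group confusion-matrix bookkeeping plus a disparity trigger. A sketch in plain Python; the disparity tolerance is an arbitrary illustrative value, not the platform's actual review criterion.

```python
from collections import defaultdict

def subgroup_error_rates(adjudicated):
    """adjudicated: iterable of (subgroup, predicted: bool, actual: bool)."""
    counts = defaultdict(lambda: {"fp": 0, "fn": 0, "pos": 0, "neg": 0})
    for group, predicted, actual in adjudicated:
        c = counts[group]
        if actual:
            c["pos"] += 1
            c["fn"] += int(not predicted)   # missed a true signal
        else:
            c["neg"] += 1
            c["fp"] += int(predicted)       # flagged a non-signal
    return {g: {"fpr": c["fp"] / max(c["neg"], 1),
                "fnr": c["fn"] / max(c["pos"], 1)} for g, c in counts.items()}

def requires_model_review(rates, tolerance=0.05):
    """True when error rates diverge across subgroups beyond tolerance,
    which suspends deployment of the affected signal pending review."""
    fprs = [r["fpr"] for r in rates.values()]
    fnrs = [r["fnr"] for r in rates.values()]
    return max(fprs) - min(fprs) > tolerance or max(fnrs) - min(fnrs) > tolerance
```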
CulturalBERT-VLAP is not waiting for validation to come back from a lab. It is in active clinical deployment — with a parallel IRB study underway at the University of Maryland validating signal accuracy in production conditions. The model earns its claim in the real world, with real youth, under real clinical oversight. Results will be published upon study completion.
The VLAP V1 Technical Specification covers model architecture, training data methodology, API contract, signal taxonomy definitions, bias monitoring protocols, and deployment acceptance criteria. It exists because any clinical partner deploying this platform deserves to understand exactly what they're deploying.