Toxicity in Unmask API
Toxicity scores are returned alongside unmasked values when the unmask API runs toxicity analysis on the restored content.
When unmasking tokenized text, the API returns toxicity scores for the fully unmasked content — the scores evaluate the original text after restoration, not the tokens.
Toxicity score categories
| Category | Score range | Description |
|---|---|---|
| toxicity | 0–1 | Overall likelihood of toxic content |
| severe_toxicity | 0–1 | Probability of severe toxic language |
| obscene | 0–1 | Likelihood of obscene content |
| threat | 0–1 | Likelihood of threatening language |
| insult | 0–1 | Likelihood of insulting language |
| identity_attack | 0–1 | Likelihood of identity-based attacks |
Scores closer to 0 indicate low likelihood. Scores closer to 1 indicate high likelihood.
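As an illustration of interpreting these scores, here is a minimal sketch of a helper that flags every category above a chosen cutoff. The category names follow the table above; the 0.5 threshold and the function itself are assumptions for the example, not part of the API.

```python
# Hypothetical helper: flag any toxicity category at or above a threshold.
# The 0.5 cutoff is an illustrative assumption, not an API-defined value.
DEFAULT_THRESHOLD = 0.5

def flagged_categories(toxicity_analysis: dict, threshold: float = DEFAULT_THRESHOLD) -> list[str]:
    """Return the names of categories whose score meets or exceeds the threshold."""
    return [name for name, score in toxicity_analysis.items() if score >= threshold]

scores = {
    "toxicity": 0.91,
    "severe_toxicity": 0.12,
    "obscene": 0.64,
    "threat": 0.02,
    "insult": 0.48,
    "identity_attack": 0.03,
}
print(flagged_categories(scores))  # → ['toxicity', 'obscene']
```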
Example toxicity response
```json
{
  "toxicity_analysis": {
    "toxicity": 0.000597277597989887,
    "severe_toxicity": 0.00012354821956250817,
    "obscene": 0.00019149390573147684,
    "threat": 0.00012092456745449454,
    "insult": 0.0001770917879184708,
    "identity_attack": 0.00014092971105128527
  }
}
```
How toxicity analysis works in unmasking
- Toxicity is evaluated on the restored original values, not on the token strings
- Toxicity scores are returned as part of each item in the data array
- Scores are always returned when toxicity analysis is enabled by the active policy
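Assuming a response whose data array items each carry a toxicity_analysis object as described above (the overall response shape here is an assumption for illustration; the score fields match the example response), reading the scores per item might look like:

```python
import json

# Hypothetical unmask response body. The "data" array shape is an assumption
# based on the description above; the toxicity_analysis fields match the
# example toxicity response.
response_body = """
{
  "data": [
    {
      "value": "restored original text",
      "toxicity_analysis": {
        "toxicity": 0.000597277597989887,
        "severe_toxicity": 0.00012354821956250817,
        "obscene": 0.00019149390573147684,
        "threat": 0.00012092456745449454,
        "insult": 0.0001770917879184708,
        "identity_attack": 0.00014092971105128527
      }
    }
  ]
}
"""

response = json.loads(response_body)
for item in response["data"]:
    scores = item["toxicity_analysis"]
    print(item["value"], scores["toxicity"])
```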
Toxicity scoring during unmasking is particularly useful in GenAI workflows — you can check whether the original user input was toxic before deciding whether to use the unmasked value in a response or escalation path.
Use cases for toxicity in unmasking
- Content moderation: Flag high-toxicity unmask requests before displaying or acting on the original content
- Compliance logging: Log toxicity scores alongside unmask audit records
- GenAI safety: Evaluate original prompt content before routing to downstream systems
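The GenAI safety use case above can be sketched as a simple routing gate. The threshold and the routing labels are illustrative assumptions, not part of the API:

```python
TOXICITY_THRESHOLD = 0.8  # illustrative cutoff, not an API-defined value

def route_unmasked_value(value: str, toxicity_analysis: dict) -> str:
    """Decide how to handle an unmasked value based on its overall toxicity score."""
    if toxicity_analysis["toxicity"] >= TOXICITY_THRESHOLD:
        return "escalate"  # hand off for moderation or review
    return "forward"       # safe to pass to downstream systems

decision = route_unmasked_value(
    "restored original text",
    {"toxicity": 0.02, "severe_toxicity": 0.0, "obscene": 0.01,
     "threat": 0.0, "insult": 0.01, "identity_attack": 0.0},
)
print(decision)  # → forward
```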