Toxicity in Unmask API

Toxicity scores are returned alongside unmasked values when the unmask API runs toxicity analysis on the restored content.

When unmasking tokenized text, the API returns toxicity scores for the fully unmasked content — the scores evaluate the original text after restoration, not the tokens.

Toxicity score categories

Category           Score range   Description
toxicity           0–1           Overall likelihood of toxic content
severe_toxicity    0–1           Probability of severe toxic language
obscene            0–1           Likelihood of obscene content
threat             0–1           Likelihood of threatening language
insult             0–1           Likelihood of insulting language
identity_attack    0–1           Likelihood of identity-based attacks

Scores closer to 0 indicate low likelihood. Scores closer to 1 indicate high likelihood.
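The scores above can be turned into a pass/fail decision with a simple cutoff. This is a minimal sketch; the 0.5 threshold is an assumption for illustration, not a value defined by the API, so tune it to your own moderation policy.

```python
# Sketch: interpreting a single toxicity score with a threshold.
# The 0.5 cutoff is illustrative, not part of the API contract.
def is_likely_toxic(score: float, threshold: float = 0.5) -> bool:
    """Return True when a score is close enough to 1 to treat as toxic."""
    return score >= threshold

print(is_likely_toxic(0.0006))  # score near 0 -> False
print(is_likely_toxic(0.92))    # score near 1 -> True
```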

Example toxicity response

{
  "toxicity_analysis": {
    "toxicity": 0.000597277597989887,
    "severe_toxicity": 0.00012354821956250817,
    "obscene": 0.00019149390573147684,
    "threat": 0.00012092456745449454,
    "insult": 0.0001770917879184708,
    "identity_attack": 0.00014092971105128527
  }
}
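A client can read the example response above like any other JSON payload. The sketch below parses that exact payload and picks out the highest-scoring category; only the documented `toxicity_analysis` field names are assumed.

```python
import json

# The payload mirrors the example toxicity response in the docs above.
raw = """
{
  "toxicity_analysis": {
    "toxicity": 0.000597277597989887,
    "severe_toxicity": 0.00012354821956250817,
    "obscene": 0.00019149390573147684,
    "threat": 0.00012092456745449454,
    "insult": 0.0001770917879184708,
    "identity_attack": 0.00014092971105128527
  }
}
"""

response = json.loads(raw)
scores = response["toxicity_analysis"]

# Find the category with the highest score; all values here are
# near 0, so the content is very unlikely to be toxic.
worst_category = max(scores, key=scores.get)
print(worst_category, scores[worst_category])
```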

How toxicity analysis works in unmasking

  • Toxicity is evaluated on the restored original values, not on the token strings
  • Toxicity scores are returned as part of each item in the data array
  • Scores are always returned when toxicity analysis is enabled by the active policy

Toxicity scoring during unmasking is particularly useful in GenAI workflows: you can check whether the original user input was toxic before deciding whether to use the unmasked value in a response or escalation path.

Use cases for toxicity in unmasking

  • Content moderation: Flag high-toxicity unmask requests before displaying or acting on the original content
  • Compliance logging: Log toxicity scores alongside unmask audit records
  • GenAI safety: Evaluate original prompt content before routing to downstream systems