Toxicity Categories
The six toxicity categories Protecto reports — what each measures and how to use them in your moderation and safety workflows.
Protecto reports six toxicity categories for every analyzed text. All fields are always present in the response, even when scores are near zero.
Categories
| Field Name | Description |
|---|---|
| `toxicity` | Overall toxicity score for the text |
| `severe_toxicity` | Highly aggressive or extreme toxicity |
| `obscene` | Profanity or sexually explicit language |
| `threat` | Direct or indirect threats of harm |
| `insult` | Derogatory or demeaning language |
| `identity_attack` | Attacks targeting a protected group or identity |
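For reference, a result with all six fields might look like the sketch below. The payload shape and score range (assumed here to be floats between 0 and 1) are illustrative, not the authoritative Protecto response schema.

```python
# Illustrative toxicity scores for one analyzed text (shape assumed,
# not the authoritative Protecto response schema). All six fields are
# present even when individual scores are near zero.
scores = {
    "toxicity": 0.82,
    "severe_toxicity": 0.11,
    "obscene": 0.04,
    "threat": 0.02,
    "insult": 0.76,
    "identity_attack": 0.05,
}
```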
How categories relate
The categories are independent — high scores on one category do not imply high scores on others.
For example:
- Content can score high on `insult` while scoring near zero on `threat`
- Content can be `obscene` without being an `identity_attack`
- `toxicity` captures overall toxicity and may be elevated even when specific sub-categories are low
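Because the scores are independent, a simple approach is to compare each category against its own threshold rather than inferring one from another. The sketch below assumes the response has been parsed into a plain dictionary of floats; the threshold values are illustrative, not Protecto recommendations.

```python
# Minimal sketch: each category is compared against its own threshold,
# since a high score in one category says nothing about the others.
# Threshold values are illustrative, not Protecto recommendations.
PER_CATEGORY_THRESHOLDS = {
    "toxicity": 0.7,
    "severe_toxicity": 0.3,
    "obscene": 0.6,
    "threat": 0.5,
    "insult": 0.6,
    "identity_attack": 0.4,
}

def elevated_categories(scores: dict[str, float]) -> list[str]:
    """Return the categories whose score exceeds their own threshold."""
    return [
        name
        for name, limit in PER_CATEGORY_THRESHOLDS.items()
        if scores.get(name, 0.0) > limit
    ]
```

With the example scores above, this returns `["toxicity", "insult"]`: the text is insulting but not threatening.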
Using multiple categories together
Many moderation workflows combine categories:
- flag if `toxicity > 0.7` OR `identity_attack > 0.4` OR `threat > 0.5`
- escalate if `severe_toxicity > 0.3`
- log if any category scores above `0.2`
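Here is a minimal sketch of those rules, assuming the category scores are available as a plain dictionary of floats (as in the earlier example). The thresholds are the sample values from the list above, not recommended defaults.

```python
def moderation_decision(scores: dict[str, float]) -> str:
    """Apply the sample policy above: escalate, flag, log, or allow."""
    # Check the most severe outcome first so it wins when rules overlap.
    if scores.get("severe_toxicity", 0.0) > 0.3:
        return "escalate"
    if (
        scores.get("toxicity", 0.0) > 0.7
        or scores.get("identity_attack", 0.0) > 0.4
        or scores.get("threat", 0.0) > 0.5
    ):
        return "flag"
    if any(score > 0.2 for score in scores.values()):
        return "log"
    return "allow"
```

With the example scores above, this returns `"flag"` because `toxicity` exceeds 0.7.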
Use `identity_attack` specifically for detecting hate speech and discriminatory content. It is designed to catch language targeting people based on race, religion, gender, sexual orientation, or other identity characteristics.