Title: Breaking mBad! Supervised Fine-tuning for Cross-lingual Detoxification
Authors: Beniwal, Himanshu; Kim, Youngwoo; Sap, Maarten; Dan, Soham; Hartvigsen, Thomas
Date issued: 2025-05-01
Date added to repository: 2025-08-28
DOI: 10.48550/arXiv.2505.16722
URI: https://d8.irins.org/handle/IITG2025/19874
Type: e-Print
Language: en-US

Abstract: As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity and enables detoxification capabilities to transfer between high- and low-resource languages across different script families. We analyze the effectiveness of cross-lingual detoxification across 504 extensive settings, evaluating toxicity reduction in cross-distribution settings with limited data, and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad