AI vs. hate speech: How do moderation systems fall short of human judgment?

As the United Nations marks the International Day for Countering Hate Speech on June 18, UN Secretary-General Antonio Guterres has warned that social platforms are amplifying the threat.

Hate speech that once circulated in person now travels farther and faster via anonymous online accounts. With artificial intelligence (AI) increasingly tasked with detecting and removing hate speech online, Al Jazeera looks at where these systems fall short compared with human judgment.

According to the UN, hate speech covers any communication – spoken, written or behavioral – that discriminates against or incites violence towards a person or group on the basis of race, ethnicity, religion, gender, sexual orientation or disability.

A 2023 joint survey by Ipsos and UNESCO of 8,000 people in 16 countries found that more than two-thirds of internet users had encountered hate speech online. It also found that 33% of respondents thought LGBTQI people experienced the most hate speech, followed by ethnic and racial minorities (28%) and women (18%).

Meta, which owns Facebook and Instagram, has removed fewer hateful posts since 2023. In the last quarter of 2025, the company removed 1.3 million posts from Instagram and 1.3 million from Facebook, down from 7.4 million on Instagram and 5.8 million on Facebook in Q4 2024.
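
To put those figures in proportion, the year-on-year decline can be worked out directly from the numbers quoted above. The short Python sketch below is illustrative only: the dictionary layout and the `percent_drop` helper are assumptions made for this example, not something published by Meta or Al Jazeera.

```python
# Posts removed for hate speech, in millions, as quoted in the article.
removals = {
    "Instagram": {"Q4 2024": 7.4, "Q4 2025": 1.3},
    "Facebook": {"Q4 2024": 5.8, "Q4 2025": 1.3},
}


def percent_drop(before: float, after: float) -> float:
    """Return the percentage decrease from `before` to `after`."""
    return (before - after) / before * 100


# Year-on-year decline per platform, rounded to one decimal place.
drops = {
    platform: round(percent_drop(q["Q4 2024"], q["Q4 2025"]), 1)
    for platform, q in removals.items()
}

for platform, drop in drops.items():
    print(f"{platform}: {drop}% fewer posts removed than a year earlier")
```

On these numbers, removals fell by roughly 82% on Instagram and 78% on Facebook. The calculation says nothing about why removals fell, only by how much.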

TikTok, on the other hand, said that in Q4 2025 it removed 96.3% of hate speech content before it was reported.

To combat hate speech, social media companies have increasingly turned to AI, using content moderation systems powered by large language models (LLMs). However, a 2025 study by researchers at the University of Pennsylvania found significant inconsistencies across systems and demographic groups.

The study evaluated seven AI moderation systems – including models from OpenAI, Anthropic, DeepSeek, Mistral, and Google – and found major differences in how they scored hate speech. For instance, the Mistral Moderation Endpoint often assigned high scores, while the OpenAI Moderation Endpoint tended to produce much lower ones.

Arkaitz Zubiaga, an associate professor at Queen Mary University of London, noted that AI systems struggle with implicit hate speech, such as positive-sounding messages that contain derogatory undertones. Conversely, AI may falsely flag reclaimed language – slurs repurposed by marginalized communities as terms of endearment – as hateful.

Source: www.aljazeera.com
