SHIELD: The Universal Framework Making AI Search Safer for Everyone

Pavel Kordik
Jun 24

Presented at UMAP 2025 in New York City

Imagine searching for "glass tubing" and getting recommendations for drug manufacturing equipment. As AI-powered search becomes ubiquitous — from online marketplaces to social networks — the stakes for getting it wrong have never been higher.

The Rise of Natural Language Expectations

The success of ChatGPT and similar conversational AI has fundamentally changed user behavior. People have learned that machines can understand natural language, and they're applying this expectation everywhere. Instead of typing "red shoes size 8," users now search with complex, conversational queries like "comfortable red shoes for my evening work commute that won't hurt after 10 hours."

Companies like LinkedIn are showing users that they can use natural language queries in their digital products.

This shift is forcing online services to rapidly enhance their capabilities. Traditional keyword-based search feels primitive when users expect semantic understanding. Companies are turning to advanced solutions like Recombee's semantic search to meet these new expectations, enabling their platforms to interpret intent rather than just match words.

But here's the challenge: when your catalog contains sensitive or potentially harmful items — from chemicals and tools to adult content — natural language queries can lead users (and your recommendation algorithms) into dangerous territory. A conversational query might seem innocent but actually be seeking something problematic.

This is where alignment becomes critical. Our research proposes a systematic approach to ensure that semantic search systems understand not just what users are asking for, but whether they should be helped to find it.

Presenting SHIELD at UMAP 2025, a paper by Filip Špaček, Vojtěch Vančura and Pavel Kordík

That's why we built SHIELD: a framework that teaches AI systems to recognize harmful, sensitive, and safe queries before they cause damage.

The Hidden Dangers of Semantic Search

Semantic search has transformed digital interactions by understanding natural language rather than relying on exact keywords. This makes it powerful for guiding users to relevant content — but also risky. In environments like marketplaces or social networks, ambiguous or malicious queries can trigger harmful or offensive recommendations. Over-filtering sensitive content may alienate users seeking it intentionally, while under-filtering risks exposing others to unsafe material. Without safeguards, even well-meaning AI systems can inadvertently cause harm.

What Makes SHIELD Different

SHIELD (Semantic Harmful-content Identification and Ethical Labeling Dataset) isn't just another content moderation tool — it's a complete methodology for building ethical AI systems from the ground up.

SHIELD is not a static dataset but rather a methodology for constructing and curating labeled data for training classification models. The methodology involves three key stages:

1. Hierarchical Query Generation

The process begins by defining a hierarchical structure of content types relevant to the intended application. For instance, this may involve:

  • A set of broad categories (e.g., safety violations, legal issues, misinformation)
  • A corresponding set of fine-grained subcategories

Next, for each subcategory, large language models are used to generate realistic, diverse query examples. Typically, a fixed number of candidate queries is produced per subcategory (e.g., 20), ensuring sufficient breadth and coverage.
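The sketch below illustrates this generation stage, assuming a simple two-level taxonomy and an arbitrary LLM wrapper. The taxonomy entries, prompt wording, and the `llm_generate` helper are illustrative assumptions, not the exact setup from the paper.

```python
# Illustrative sketch of hierarchical query generation (taxonomy entries and
# prompt wording are assumptions, not the paper's exact configuration).

TAXONOMY = {
    "safety_violations": ["weapon assembly", "hazardous chemicals"],
    "legal_issues": ["counterfeit goods", "stolen property"],
    "misinformation": ["health misinformation", "financial scams"],
}

QUERIES_PER_SUBCATEGORY = 20  # fixed candidate budget per subcategory


def build_prompt(category: str, subcategory: str, n: int) -> str:
    """Compose an instruction asking an LLM for realistic search queries."""
    return (
        f"Generate {n} short, realistic marketplace search queries that a user "
        f"might type related to '{subcategory}' (broad category: '{category}'). "
        "Return one query per line."
    )


def generate_candidates(llm_generate) -> dict[tuple[str, str], list[str]]:
    """Collect candidate queries for every (category, subcategory) pair.

    `llm_generate` is any callable mapping a prompt string to model text,
    e.g. a thin wrapper around your LLM provider of choice.
    """
    candidates = {}
    for category, subcategories in TAXONOMY.items():
        for sub in subcategories:
            raw = llm_generate(build_prompt(category, sub, QUERIES_PER_SUBCATEGORY))
            queries = [line.strip() for line in raw.splitlines() if line.strip()]
            candidates[(category, sub)] = queries[:QUERIES_PER_SUBCATEGORY]
    return candidates
```

Generating per subcategory also makes it easy to spot coverage gaps in the taxonomy before any filtering happens.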

2. Quality Scoring and Filtering with a Reward Model

Not all generated queries are equally informative or representative. To refine the dataset:

  • Each query is evaluated using a reward model (such as Skywork-Reward), which assigns a relevance or quality score.
  • Based on these scores, only the top-ranked examples are retained for downstream use.

This filtering step ensures:

  • Higher semantic clarity and intent precision
  • Better separation between target classes
  • Reduced labeling noise in the final dataset

Importantly, the exact number of categories and classes can vary between implementations. In one use case, a three-class system (safe, sensitive, harmful) was used — but the SHIELD methodology can easily be extended to more classes based on the application’s ethical and operational needs.
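A minimal sketch of this filtering step is shown below. The reward model is abstracted behind a `score_query` callable (the paper mentions Skywork-Reward), and the keep ratio is an assumed parameter; the actual selection criterion may differ.

```python
# Minimal sketch of reward-model filtering; `score_query` stands in for any
# reward model (e.g. Skywork-Reward), and keep_ratio is an assumed knob.

from typing import Callable


def filter_by_reward(
    candidates: dict[tuple[str, str], list[str]],
    score_query: Callable[[str], float],
    keep_ratio: float = 0.5,
) -> dict[tuple[str, str], list[str]]:
    """Keep only the top-scoring queries within each subcategory."""
    filtered = {}
    for key, queries in candidates.items():
        ranked = sorted(queries, key=score_query, reverse=True)
        keep = max(1, int(len(ranked) * keep_ratio))
        filtered[key] = ranked[:keep]
    return filtered
```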

3. Training and Deployment

The curated dataset is then used to train AI models capable of classifying incoming queries based on learned semantic and ethical signals. These models can be integrated into moderation pipelines, chatbot backends, or safety layers for various applications.
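As a concrete illustration, the snippet below fine-tunes a generic transformer classifier on the three-class labels from the curated dataset using the Hugging Face Trainer API. The base model and hyperparameters are assumptions for illustration, not the exact recipe behind the models evaluated in the paper.

```python
# Hedged sketch: fine-tuning a query classifier on a SHIELD-style dataset.
# Base model and hyperparameters are illustrative assumptions.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LABELS = ["safe", "sensitive", "harmful"]  # three-class setup from the use case


def train_query_classifier(texts, label_ids, base_model="bert-base-uncased"):
    """Fine-tune a transformer to map queries to SHIELD labels (integer ids)."""
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSequenceClassification.from_pretrained(
        base_model, num_labels=len(LABELS))

    ds = Dataset.from_dict({"text": texts, "label": label_ids})
    ds = ds.map(
        lambda batch: tokenizer(batch["text"], truncation=True,
                                padding="max_length", max_length=64),
        batched=True,
    )

    args = TrainingArguments(output_dir="shield-classifier",
                             num_train_epochs=3,
                             per_device_train_batch_size=32)
    Trainer(model=model, args=args, train_dataset=ds).train()
    return tokenizer, model
```

Once trained, the classifier sits in front of the search or recommendation pipeline and scores each incoming query before retrieval is triggered.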

Use Case: Moderating User Input in Online Marketplaces

To evaluate SHIELD in a real-world context, we applied the methodology to the moderation of user-generated queries in an online marketplace — a setting where misuse can take many forms: scams, illegal activity, or inappropriate communication.

Example of SHIELD powered detection in online marketplace environment

The dataset used for this evaluation was built through structured generation and then filtered using a reward model to retain only high-confidence examples. After this filtering process, the final dataset comprised 17,170 queries classified as harmful and 8,871 as sensitive, selected from an initially much larger set of generated examples. These queries reflect realistic search behavior in marketplace contexts and are cleanly separated into well-aligned classes suitable for model training.

Using this dataset, three classification approaches were tested. The first, based on BM25 keyword similarity, achieved 93.2% accuracy, offering a computationally light yet reasonably effective solution. The second approach, which employed semantic embeddings with FAISS for nearest neighbor classification, improved performance to 96.5%, demonstrating the benefit of deeper contextual understanding. The highest accuracy — 98.4% with a 98.6% F1-score — was obtained using MoralBERT, a fine-tuned transformer model trained directly on the SHIELD dataset. This method provided the most robust generalization, especially in handling subtle or adversarial queries, though it came with higher computational costs.
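For intuition, here is a hedged sketch of the second approach described above: semantic embeddings with FAISS nearest-neighbor voting. The embedding model and the number of neighbors are assumptions; the paper's exact configuration may differ.

```python
# Hedged sketch of the embedding + FAISS nearest-neighbor baseline.
# Encoder choice and k are assumptions for illustration.

from collections import Counter

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder


def build_index(train_texts, train_labels):
    """Embed labeled queries and index them for cosine-similarity search."""
    emb = encoder.encode(train_texts, normalize_embeddings=True)
    index = faiss.IndexFlatIP(emb.shape[1])  # inner product == cosine on unit vectors
    index.add(np.asarray(emb, dtype="float32"))
    return index, list(train_labels)


def classify(query, index, labels, k=5):
    """Label a query by majority vote among its k nearest labeled neighbors."""
    q = encoder.encode([query], normalize_embeddings=True)
    _, idx = index.search(np.asarray(q, dtype="float32"), k)
    votes = Counter(labels[i] for i in idx[0])
    return votes.most_common(1)[0][0]
```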

These results demonstrate that SHIELD can provide a strong foundation for query-level content moderation in semantic search systems, particularly when dealing with complex ethical boundaries in commercial platforms.

See more in our research paper [1].

Extending SHIELD to Other Domains: Ideas and Examples

The strength of SHIELD lies in its flexibility — its modular pipeline for query generation, filtering, and classification can be adapted to many real-world applications. Below are several domains where SHIELD can be used, along with examples of potential category structures and alignment objectives for each.


Applications Beyond Marketplaces: Domain-Specific Examples

🏥 Healthcare & Wellness
SHIELD can support chatbots and digital health tools by identifying ethically risky queries such as unsafe medical advice (e.g., self-medicating, harmful home remedies), misdiagnosis risks (ambiguous or misleading symptoms), mental health crises (suicidal ideation, self-harm), and misinformation (vaccine myths, dietary pseudoscience). The goal is to flag or escalate such queries for human oversight when necessary.

🏛️ Government and Civic Tech
Civic platforms benefit from filtering legally ambiguous queries (e.g., exploiting legal loopholes), threats, hate speech, and manipulative content (e.g., conspiracy-laden inputs). SHIELD can promote civil discourse and policy compliance while maintaining trust in public services.

🎓 Educational Technology
AI tutors can detect academic dishonesty (e.g., cheating requests), cyberbullying, and inappropriate content (e.g., explicit jokes). SHIELD helps reinforce ethical learning behavior and safe classroom interactions.

💼 Enterprise Security & Compliance
For internal tools, SHIELD can flag policy violations (e.g., bypass attempts, leaking sensitive data), toxic communication (e.g., harassment), and compliance risks (e.g., GDPR misuse). These capabilities aid in early intervention and internal risk management.

🧠 Mental Health & Wellbeing
Self-help apps can use SHIELD to identify crisis signals (suicidal language, hopelessness), interpersonal distress (abuse, trauma), and harmful coping patterns (substance abuse, isolation), enabling appropriate referrals and support.

📰 Media and News Moderation
Comment sections can be moderated by detecting inflammatory language (e.g., incitement, hate speech), disinformation (false claims, conspiracies), and manipulation tactics (deep fakes, revisionism). SHIELD can help maintain healthy discourse and safeguard editorial integrity.


SHIELD enables developers to design domain-specific query classifiers by simply adjusting:

  1. The taxonomic structure (categories and subcategories)
  2. The query generation prompts
  3. The reward model filtering criteria
  4. The downstream model training and deployment strategy

This adaptability means SHIELD can serve as a building block for creating AI systems that understand and enforce ethical boundaries — whether in education, healthcare, public policy, or enterprise.
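In practice, those four knobs can live in a single domain configuration that drives the whole pipeline. The sketch below shows what such a configuration might look like for a healthcare deployment; every name, category, and threshold here is an illustrative assumption.

```python
# Illustrative domain configuration (healthcare); all names, categories, and
# thresholds are assumptions, not values from the paper.

HEALTHCARE_CONFIG = {
    "taxonomy": {
        "unsafe_medical_advice": ["self-medication", "harmful home remedies"],
        "mental_health_crisis": ["suicidal ideation", "self-harm"],
        "misinformation": ["vaccine myths", "dietary pseudoscience"],
    },
    "generation_prompt": (
        "Generate {n} realistic health-related user queries about '{sub}' "
        "(category: '{cat}'). Return one query per line."
    ),
    "reward_filter": {"model": "Skywork-Reward", "keep_ratio": 0.4},
    "classifier": {
        "base_model": "bert-base-uncased",
        "labels": ["safe", "sensitive", "harmful"],
    },
}
```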

Beyond Content Moderation: Building Trust

SHIELD represents a fundamental shift in how we think about AI safety. Rather than reactive content filtering, it enables proactive alignment — training systems to understand ethical boundaries before they encounter real users.

This approach builds something even more valuable than safety: trust. When users know an AI system has been designed with their wellbeing in mind, adoption increases and satisfaction soars.

Open Source, Open Model, Ready to Deploy

The entire SHIELD framework is available for immediate use.

Whether you're building the next generation of search engines, virtual assistants, or specialized AI tools, SHIELD provides the ethical foundation your users deserve.

The Bottom Line

Every AI system that interacts with users in natural language needs ethical guardrails. SHIELD makes implementing those guardrails not just possible, but practical and scalable.

The question isn't whether you need content moderation for your AI system—it's whether you can afford to deploy without it.

Ready to build safer AI? Start with SHIELD.


This research was supported by the FIT CTU Student Research Program, Recombee, and the PoliRuralPlus project (EU Grant 101136910).

References

[1] F. Spacek, V. Vancura, and P. Kordik, “Mitigating Risks in Marketplace Semantic Search: A Dataset for Harmful and Sensitive Query Alignment,” in Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization, 2025, pp. 329–334. doi: 10.1145/3699682.3728329.
