Two weeks ago our team, Gless AI, placed 2nd out of 350+ teams at the international Agentic Legal RAG Challenge 2026 and took home $8,000. We were supposed to fly to Dubai to present in person at the Machines Can See conference during Dubai AI Week, but couldn't make it due to the situation, so here are the takeaways instead.
What the competition was
The task sounds simple: answer 900 legal questions over a corpus of 300 PDFs (court decisions, laws, and regulations from the DIFC, the Dubai International Financial Centre), citing the exact source pages as grounding.
In practice, the scoring formula was the harshest we've seen in any RAG benchmark:
- Scoring was multiplicative — weak grounding (the pages you cite as sources) crushed the entire result no matter how good the answers were.
- Some questions were traps: they referenced cases or laws that don't exist in the corpus, and the correct answer was "nothing found." A confident hallucination cost multiple scoring components at once.
- Latency mattered: slow pipelines were penalized by a separate multiplier.
In other words, the benchmark was specifically built to reward production-grade systems. A "beautifully over-engineered slow pipeline" and a "fast but ungrounded bot" both lost.
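To make the multiplicative effect concrete, here is a minimal sketch of how such a score behaves. The organizers defined the actual formula and components; the function below and its numbers are illustrative assumptions, not the official metric.

```python
def combined_score(answer_quality: float,
                   grounding: float,
                   latency_multiplier: float) -> float:
    """Hypothetical multiplicative score; each component assumed in [0, 1].

    Because the components are multiplied rather than added, one weak
    factor (typically grounding) drags the whole result down, no matter
    how strong the other two are.
    """
    return answer_quality * grounding * latency_multiplier


# Made-up numbers: a near-perfect answer with poor citations still scores
# worse than a merely good answer that is well grounded and reasonably fast.
print(combined_score(0.95, 0.30, 1.00))  # 0.285
print(combined_score(0.80, 0.85, 0.90))  # 0.612
```

Under an additive scheme the first pipeline would look competitive; under multiplication, the missing grounding is fatal.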
Final results
The top five teams finished within six points of each other:
- RAGnarok — 77.9
- Gless AI — 76.7
- CPBD — 76.0
- Cohomology — 72.0
- Dmitry Ulybin — 71.9
Full leaderboard at agentic-challenge.ai/leaderboard.
Why this matters for business
Most RAG systems we see in production are optimized for a single metric — "does it answer or not." That's not enough. In legal, medical, or financial work, every citation has to be verifiable, and a confident wrong answer is worse than "I didn't find it."
That's exactly what this competition tested: you couldn't win with a slow over-engineered pipeline, and you couldn't win with a fast but ungrounded one. So 2nd place out of 350+ teams matters to us more than the prize money — it's concrete proof that we can build RAG that holds up to production load and to clients who check the citations.
Takeaways
A few short lessons that apply to any RAG project, not just legal:
- Grounding isn't a feature, it's the core of the system. If a user can't click through to the source page, RAG in serious domains is useless (there's a small sketch of what this looks like after this list).
- A simple pipeline with the right details beats "smart" agents. In the final phase, our 600 lines of Python beat a SOTA agent that ran on the same task for 1.5 hours.
- Measure answer quality, grounding, and latency together. Optimizing for only one of them is misleading: a pipeline that answers beautifully but is slow or ungrounded won't survive in production.
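To illustrate the first takeaway (and the trap questions from the scoring section), here is a minimal sketch of an "answer with page citations or abstain" step. Our actual pipeline is covered in the LinkedIn write-up; the `Passage` class, the threshold, the document id, and the `generate_answer` stub below are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    page: int
    text: str
    score: float  # retriever similarity, assumed normalized to [0, 1]


# Hypothetical cut-off; in practice it would be tuned on held-out questions.
MIN_RETRIEVAL_SCORE = 0.35


def generate_answer(question: str, passages: list[Passage]) -> str:
    # Stand-in for the LLM call that drafts an answer from the passages.
    return f"Answer to {question!r} based on {len(passages)} passage(s)."


def answer_or_abstain(question: str, passages: list[Passage]) -> dict:
    """Answer with page-level citations, or explicitly say nothing was found."""
    supported = [p for p in passages if p.score >= MIN_RETRIEVAL_SCORE]
    if not supported:
        # Trap questions reference sources that aren't in the corpus;
        # abstaining here is worth more than a confident hallucination.
        return {"answer": "Nothing found in the corpus.", "citations": []}
    return {
        "answer": generate_answer(question, supported),
        "citations": [{"doc": p.doc_id, "page": p.page} for p in supported],
    }


# Made-up document id and score, for illustration only.
hits = [Passage(doc_id="difc-law-example", page=12, text="...", score=0.41)]
print(answer_or_abstain("What does Article 10 cover?", hits))
```

The point is structural: the citation list is built from the same passages the answer was generated from, so every claim the user sees maps to a document and page they can open.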
Technical details
If you want the full technical breakdown — parsing, embeddings, retrieval, reranking, structured output, what we tried and dropped — we wrote a detailed technical write-up on LinkedIn.
If you're building a RAG system where the cost of a wrong answer is your reputation or regulatory exposure, get in touch — we'll help you design a pipeline where citations actually work and users trust the answers.
