How to use Alibaba Tongyi DeepResearch?
When people talk about research agents, the first name that pops up is usually OpenAI’s DeepResearch. Alibaba’s Tongyi Lab just dropped something that changes the game:
Tongyi DeepResearch, the first fully open-source research agent that actually goes toe-to-toe with proprietary systems.
This isn’t another chatbot pretending to “research.” It’s a web agent that can plan, reason, dig into multiple sources, and stitch together answers at a level that starts to look like what a junior analyst or lawyer would do. On benchmarks, it hits 32.9 on Humanity’s Last Exam, 43.4 on BrowseComp, and 75 on xbench-DeepSearch.
Translation: it’s competitive with the best closed-source agents, but you can download, inspect, and run it yourself.
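If you just want to try the open weights, a minimal sketch with Hugging Face Transformers looks like the following. The repository ID, prompt, and generation settings are assumptions for illustration (check Tongyi Lab's model page for the exact name), and a bare model call like this does not include the search and browsing tools the full agent relies on.

```python
# Minimal sketch: load the open weights and ask one question.
# The repo ID below is an assumption -- check Tongyi Lab's Hugging Face page for the exact name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Alibaba-NLP/Tongyi-DeepResearch-30B-A3B"  # hypothetical / assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

messages = [{"role": "user", "content": "Survey recent open-source deep-research agents and compare their approaches."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

output = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```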
Why it Matters
Most so-called AI “research” today is either shallow search or prompt tricks. Tongyi DeepResearch is different because of two things:
- Data synthesis at scale: they don’t just scrape or annotate, they generate massive synthetic datasets to train agents.
- Full pipeline training: continual pre-training → supervised fine-tuning → reinforcement learning, all adapted for agents instead of static LLMs.
This means the model isn’t just predicting the next token. It’s trained to reason, plan actions, and use tools over multiple steps.
The Training Recipe
1. Continual Pre-training (CPT)
They start with what they call Agentic CPT. Instead of just reading the internet, the model trains on synthetic “trajectories” that look like research processes: asking questions, pulling documents, taking actions.
They use a system called AgentFounder that turns raw text, graphs, and tool logs into structured question-answer pairs and action sequences. Think of it like building a memory palace for the model.
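To make that concrete, here is a toy Python sketch of what one synthetic research trajectory might look like as a training record. The field names and the example question are invented for illustration; AgentFounder's actual schema is not part of this post.

```python
# A toy illustration of the kind of record an "agentic" pre-training corpus might hold:
# a question, the intermediate actions (searches, page reads), and the grounded answer.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    thought: str      # the model's reasoning at this step
    action: str       # e.g. search("..."), open(url), extract(...)
    observation: str  # what the tool returned

@dataclass
class Trajectory:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""

traj = Trajectory(
    question="Which year did the lab behind Qwen release its first open-weight model?",
    steps=[
        Step("I need the lab's release history.", 'search("Qwen first open-weight release")',
             "Blog post: Qwen-7B released August 2023 ..."),
        Step("The post states the year directly.", 'extract("release year")', "2023"),
    ],
    answer="2023",
)
```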
2. Post-training Synthetic QA
For fine-tuning, they don’t hire thousands of annotators. They generate high-difficulty QA pairs automatically. The trick is not just asking questions, but making them harder by hiding or blurring information, forcing the agent to reason.
They even formalized “question difficulty” in set-theory terms, basically breaking problems into atomic tweaks (merge entities, hide attributes, etc.) so difficulty can be dialed up systematically.
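A rough sketch of that idea, assuming two of the atomic tweaks the post names (hiding an attribute, merging entities); the real formalization is more rigorous than these string edits.

```python
# Toy sketch of "dialing up" question difficulty with atomic edits.
def hide_attribute(question: str, attribute: str, hint: str) -> str:
    """Replace an explicit entity or attribute with a vaguer description the agent must resolve."""
    return question.replace(attribute, hint)

def merge_entities(question_a: str, question_b: str) -> str:
    """Fuse two questions so answering requires resolving both parts."""
    return f"{question_a.rstrip('?')} and, for the same year, {question_b[0].lower()}{question_b[1:]}"

easy = "When was the Eiffel Tower completed?"
harder = hide_attribute(easy, "the Eiffel Tower", "the wrought-iron tower built for the 1889 World's Fair")
hardest = merge_entities(harder, "Who was the president of France at that time?")
print(hardest)  # one multi-hop question instead of two easy lookups
```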
3. SFT Cold-Start
Before reinforcement learning, the model gets a jump-start with supervised fine-tuning. They use two styles of trajectories:
- ReAct: the classic Thought → Action → Observation loop (a minimal sketch follows this list).
- IterResearch: their new approach that resets context every round, avoiding the clutter of dumping everything into one giant memory. This keeps reasoning clean and prevents the agent from suffocating under its own notes.
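Here is a minimal sketch of such a ReAct loop, assuming an `llm()` callable that emits either an `Action:` line or a `Final:` answer and a `run_tool()` helper that executes tool calls; both are stand-ins, not Tongyi's actual interface.

```python
# A minimal ReAct-style loop: think, act, observe, repeat until a final answer appears.
import re

def react_loop(question: str, llm, run_tool, max_steps: int = 8) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)                      # "Thought: ...\nAction: search(...)" or "Final: ..."
        transcript += reply + "\n"
        final = re.search(r"Final:\s*(.*)", reply)
        if final:
            return final.group(1)                    # the agent decided it is done
        action = re.search(r"Action:\s*(.*)", reply)
        if action:
            observation = run_tool(action.group(1))  # execute the tool call
            transcript += f"Observation: {observation}\n"
    return "No answer within the step budget."
```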
4. Reinforcement Learning (RL)
Finally, they run full on-policy RL with a customized variant of GRPO (Group Relative Policy Optimization). The model interacts with a simulated web environment (no costly API calls) and gets rewarded for producing solid research outputs.
They even filter out bad negative examples to keep training stable; otherwise the model collapses into gibberish after long runs.
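The core of GRPO can be sketched in a few lines: sample a group of rollouts for the same question, score them, and use each rollout's reward relative to the group as its advantage, so no separate value model is needed. The reward values below are made up, and the filtering of degenerate negatives mentioned above is not shown.

```python
# Group-relative advantage at the heart of GRPO: each rollout is scored against its own group.
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. 4 rollouts for one research question, scored by an outcome reward
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # positive for good rollouts, negative for bad
```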
Rollout Modes
Tongyi supports two modes:
- ReAct Mode: The vanilla setup, no prompt hacks, just letting the model run Thought → Action → Observation until it solves the task. Simple, effective, and shows what the base model can do.
- Heavy Mode (IterResearch): For harder tasks, the model reconstructs its workspace every round, keeps only the essentials, and builds a running report. They even extend this to multi-agent synthesis, where several research agents work in parallel and a synthesis agent merges their findings (see the sketch below).
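A rough sketch of that Heavy-mode round structure, assuming hypothetical `research_round()` and `synthesize()` calls that stand in for the model: only the running report survives between rounds, instead of an ever-growing transcript.

```python
# Each round rebuilds the workspace from the running report plus the latest findings.
def iter_research(question: str, research_round, synthesize, rounds: int = 5) -> str:
    report = ""                                       # the running report is the only carried-over memory
    for _ in range(rounds):
        workspace = f"Question: {question}\nReport so far:\n{report}"
        findings, done = research_round(workspace)    # one round of searching / reading, returns (text, bool)
        report = synthesize(report, findings)         # fold new findings into a fresh, compact report
        if done:
            break
    return report
```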
Real Applications Already
This isn’t just lab demos:
- Gaode Mate (Xiao Gao): an AI copilot for maps that plans multi-day road trips with specific spots and constraints like “pet-friendly hotels.”
- Tongyi FaRui (Legal Research Agent): a junior-lawyer-like system that pulls statutes, cases, and cross-references them with proper citations. Not just answers, but grounded evidence.
Why This Approach Works
There’s a lesson buried here: data quality > fancy algorithms. Their RL setup, their simulated environment, their iterative QA engine all point to one thing: you don’t need secret sauce if your synthetic data loop is tight and stable.
It also highlights something a lot of teams miss: you can’t just pre-train on random text and expect an agent to emerge. Research is a sequence of actions, decisions, and memory management. Tongyi’s pipeline builds that explicitly.
Limitations
It’s not perfect:
- 128k context still isn’t enough for truly long-horizon research.
- Methods haven’t been tested at GPT-4-sized scales yet.
- RL efficiency could improve; they’re exploring partial rollouts and off-policy methods.
But as a proof-of-concept, it’s strong. It shows how to train an open-source agent that actually competes with closed-source ones, not just imitates.
Closing Thought
Tongyi DeepResearch isn’t about “being a better chatbot.” It’s about turning LLMs into proper research assistants that can handle complex, multi-step reasoning without breaking. The open-source release means researchers, startups, and even hobbyists can now study and build on these methods.
If DeepResearch was OpenAI’s move to lock down this space, Tongyi’s answer is: you don’t need to wait for permission; build your own researcher.
