New AI Framework "Dive" Improves Tool-Using Language Models by Training Them on Diverse Tasks
- Editorial Team


A new framework called Dive, recently described in a research paper on arXiv, aims to make it much easier for AI systems to use external tools. The study focuses on teaching large language models (LLMs) to use a wider range of digital tools, such as search engines, code execution environments, and domain-specific databases, in a way that is more reliable and generalizes to more situations.
As AI systems become more capable at complex work, researchers are exploring ways to turn them into "agents" that interact with external tools to gather information, analyze data, and complete tasks on their own. But building models that perform these actions reliably across different domains remains very difficult. The Dive framework addresses this problem by emphasizing diversity in training tasks, which helps AI agents learn to use tools more effectively.
The Difficulty of Teaching AI to Use Tools
Modern large language models have demonstrated strong reasoning and conversational abilities. But many real-world tasks require more than generating text. For instance, an AI assistant might need to search the web, run code to process data, retrieve information from specialized databases, or interact with business software systems.
To handle such scenarios, researchers have begun building tool-using AI agents that combine natural language reasoning with calls to external tools. For example, an agent might search for medical information, analyze financial records, or write and execute code to solve a problem.
Despite this progress, many AI agents struggle to generalize when they encounter unfamiliar tools or tasks. Systems that perform well in controlled settings often fail in real-world deployments, where tools and workflows vary widely. The researchers behind Dive argue that a main cause of this limitation is the lack of diversity in these agents' training data.
Most existing training pipelines generate large numbers of synthetic tasks, but these tasks typically cover only a handful of workflows or a fixed set of tools. As a result, models learn to follow rigid patterns rather than developing flexible problem-solving strategies.
Introducing the Dive Framework
The Dive framework proposes a new way to generate training tasks for AI agents. It reverses the usual approach: instead of starting with hypothetical questions and then checking whether the available tools can solve them, Dive starts from the tools themselves.
The system begins by executing real tools and gathering evidence from those interactions. Tasks are then constructed from the data traces left behind. Because every task is derived from real tool outputs, this "evidence-first" method ensures that tasks are both executable and verifiable.
This design addresses a common pitfall of synthetic data generation: when tasks are created without checking whether the tools can actually solve them, many turn out to be invalid or impossible. By grounding every task in real tool interactions, Dive ensures that each one has a valid solution path.
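To illustrate the contrast, here is a minimal, purely illustrative Python sketch of question-first versus evidence-first task generation. The tool names and data shapes are assumptions made for this example, not the paper's actual implementation:

```python
# Hypothetical tool registry for illustration only.
TOOLS = {
    "search": lambda q: f"documents about {q}",
    "calculator": lambda expr: str(eval(expr)),  # eval is fine for this toy demo
}

def question_first(question, required_tool):
    """Conventional pipeline: invent a question, then hope a tool can solve it."""
    if required_tool not in TOOLS:
        return None  # invalid task -- no available tool can produce its answer
    return {"question": question, "answer": TOOLS[required_tool](question)}

def evidence_first(tool_name, tool_input):
    """Dive-style pipeline: execute a real tool first, then derive the task
    from the observed output, so a valid solution path exists by construction."""
    evidence = TOOLS[tool_name](tool_input)
    return {
        "question": f"Use {tool_name} on {tool_input!r} and report the result.",
        "answer": evidence,          # ground truth comes from the real output
        "solution_path": [tool_name],
    }

# A question-first task silently fails when its required tool is missing...
assert question_first("2+2", "spreadsheet") is None
# ...while an evidence-first task is executable and verifiable by design.
assert evidence_first("calculator", "2+2")["answer"] == "4"
```

The key difference is where validation happens: question-first pipelines must filter out unsolvable tasks after the fact, while evidence-first generation never produces them.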
The framework aims to create tasks that meet four key criteria:
Structural diversity: Tasks span different tools and workflows.
Verifiability: Each task has a clear answer that can be checked.
Executability: Each task can be solved with the available tools.
Scalability: The process can automatically generate large datasets.
Together, these criteria allow Dive to produce high-quality training data for AI agents.
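As a rough sketch, the per-task criteria could be checked programmatically. The field names below are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SynthesizedTask:
    question: str
    answer: str                                       # verifiability: checkable ground truth
    tools_used: list = field(default_factory=list)    # records the solution path

def meets_criteria(task: SynthesizedTask, available_tools: set) -> bool:
    """Check one synthesized task against the per-task criteria."""
    verifiable = bool(task.answer)                    # has a checkable answer
    executable = all(t in available_tools for t in task.tools_used)
    uses_tools = len(task.tools_used) > 0             # a tool-use task, not pure Q&A
    return verifiable and executable and uses_tools

task = SynthesizedTask(
    question="What is the average of the retrieved stock prices?",
    answer="101.5",
    tools_used=["finance_db", "calculator"],
)
assert meets_criteria(task, {"finance_db", "calculator", "web_search"})
assert not meets_criteria(task, {"web_search"})       # missing tools -> not executable
```

Scalability is a property of the generation pipeline rather than of any single task, so it is not checked here.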
Creating a Variety of Task Resources
Dive generates varied tasks by drawing on several resource pools that can be combined in different ways:
A tool pool containing hundreds of tools from different domains
A seed concept pool covering a wide range of topics and domain knowledge
An exemplar pool of example task patterns
By sampling independently from each pool, the framework can compose a vast number of distinct task scenarios. This design dramatically scales the supply of training tasks without requiring manual data collection.
For the experiments described in the paper, the researchers assembled a pool of 373 validated tools spanning both general-purpose and specialized domains, including finance, biology, medicine, and academic research.
This variety lets AI agents practice problems that require different combinations of tools and reasoning steps, much closer to how real-world work unfolds.
Task Synthesis Based on Evidence
At the heart of the Dive framework is a loop that alternates between gathering evidence and generating tasks.
During the evidence-collection phase, the system executes tools and performs reasoning steps. The outputs are stored as evidence traces, which might be retrieved documents, calculation results, or processed data.
Once enough evidence has accumulated, the system analyzes the traces and generates tasks that can be answered with the collected information. This ensures that each task is grounded in actual tool outputs rather than invented content.
Over many iterations, the system gradually builds more complex tasks that require step-by-step reasoning and multiple tool interactions, such as retrieving data with one tool, processing it with another, and combining the results into a final answer.
This process makes complex, multi-step workflows that are similar to how people use digital tools to solve problems.
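The alternating loop described above might be sketched as follows. The stubbed `run_tool` and the task format are assumptions for illustration; in Dive, real tools are actually executed:

```python
import random

def run_tool(name: str, arg: str) -> str:
    """Stand-in for real tool execution (Dive runs actual tools here)."""
    return f"{name}({arg}) -> result"

def synthesis_loop(tools: list, max_steps: int = 3, seed: int = 0) -> dict:
    rng = random.Random(seed)
    traces = []
    # Phase 1: alternate tool execution and evidence accumulation.
    for step in range(max_steps):
        tool = rng.choice(tools)
        arg = f"input_{step}"
        traces.append({"tool": tool, "arg": arg, "output": run_tool(tool, arg)})
    # Phase 2: once enough evidence exists, derive a task whose answer
    # is fully determined by the collected traces.
    return {
        "question": f"Combine the outputs of {[t['tool'] for t in traces]} "
                    "to produce a final answer.",
        "answer": " | ".join(t["output"] for t in traces),
        "solution_path": [t["tool"] for t in traces],
    }

task = synthesis_loop(["search", "calculator", "db_query"])
assert len(task["solution_path"]) == 3   # a multi-step workflow grounded in traces
```

Growing `max_steps` over iterations is how this sketch mirrors the paper's progression from simple lookups to multi-tool workflows.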
Using Dive Data to Teach AI Agents
After generating the synthetic dataset, the researchers use it to train AI agents in two main stages:
Supervised Fine-Tuning (SFT): Models learn from examples of how to solve problems.
Reinforcement Learning (RL): Models get better by exploring and getting feedback.
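A highly simplified toy sketch of these two stages, using a scalar stand-in for a model rather than the paper's actual recipe, illustrates how imitation and reward-driven exploration compose:

```python
import random

class ToyAgent:
    """Scalar stand-in for an LLM agent (the paper trains a Qwen3-8B model)."""
    def __init__(self):
        self.skill = 0.0                     # proxy for tool-use competence

    def nll(self, demo: dict) -> float:      # pretend negative log-likelihood
        return max(0.0, 1.0 - self.skill)

    def sft_update(self, loss: float):       # imitation: move toward demonstrations
        self.skill += 0.1 * loss

    def attempt(self, rng: random.Random) -> bool:
        return rng.random() < self.skill     # rollout succeeds with prob = skill

    def rl_update(self, reward: float):      # reinforce verified successes
        self.skill += 0.05 * reward

rng = random.Random(0)
agent = ToyAgent()

# Stage 1: supervised fine-tuning on worked solution demonstrations.
for demo in [{"q": f"task {i}"} for i in range(5)]:
    agent.sft_update(agent.nll(demo))

# Stage 2: reinforcement learning -- reward comes from verifying the answer,
# which is possible because every Dive task has a checkable ground truth.
for _ in range(20):
    reward = 1.0 if agent.attempt(rng) else 0.0
    agent.rl_update(reward)

assert agent.skill > 0.4                     # competence grew through both stages
```

The point of the sketch is the division of labor: SFT bootstraps competence from demonstrations, and RL then refines it using rewards that Dive's verifiable answers make easy to compute.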
The study demonstrates that training on diverse tasks markedly improves the performance of tool-using AI agents. Across multiple benchmarks, models trained on Dive data generalized substantially better.
For instance, the researchers used Dive to build a dataset and trained a Qwen3-8B model on it. The trained model scored 22 points higher on average across nine held-out benchmarks, a 68 percent improvement over the best baseline model.
These findings suggest that diversity in training data may matter more than sheer volume.
Diversity Versus Quantity in AI Training
One of the study's key findings is that scaling diversity works better than scaling quantity: better results come from adding more types of training tasks, not from producing more of the same type.
The researchers ran experiments comparing two scaling strategies:
Adding more tasks while keeping the task types fixed
Adding more tools and task patterns
The results showed that diversity consistently improved generalization across benchmarks. The diversity-focused dataset outperformed the quantity-focused one despite being only a quarter of its size.
This finding suggests that future AI training datasets may be built very differently: rather than focusing on raw volume, researchers may prioritize varied and realistic training environments.
What This Means for the Future of AI Agents
The Dive framework is a step toward AI agents that work reliably across a variety of fields and tools. As AI systems become more common in offices, laboratories, and online platforms, the ability to interact with external tools will be increasingly important.
By letting models learn from a wide range of verifiable tasks, Dive may help bridge the gap between lab experiments and real-world AI applications. Systems trained with such frameworks could support complex workflows in areas like software development, healthcare, and finance.
The research is still in its early stages, but the results suggest that greater task diversity and grounded tool use could be a significant step toward more reliable and flexible AI systems.
As AI advances, frameworks like Dive may become a key part of the training pipelines that let intelligent agents interact seamlessly with the digital world.


