🦸🏻#13: Action! How AI Agents Execute Tasks with UI and API Tools
We explore UI-driven versus API-driven interactions, demystify function calling in LLMs, and compare leading open-source frameworks powering autonomous AI actions.
By now, we've explored nearly all the key building blocks of autonomous agents: Profiling (identity, goals, constraints), Knowledge (base facts), Memory (past contexts), Reasoning and Planning (task breakdown, inference, action plans), and Reflection (evaluating outcomes to improve future performance through feedback loops). All but one – Actions, the practical steps through which autonomous agents execute planned activities, interact with environments or external tools, and produce tangible outcomes. Actions bridge theory and reality, making them essential for agent autonomy. They enable an AI agent to “do something” rather than merely “say something.”
In agentic AI, an action is any operation an agent performs to interact with external systems – going beyond passive text responses to actively fetch data, execute code, invoke APIs, or control interfaces. Tool integration is essential, as it extends an agent’s capabilities beyond its model weights, enabling true autonomy. Agentic AI dynamically applies tools and real-time information from sensors, databases, or web APIs to adapt and solve complex, real-world tasks.
In this article, we examine UI-driven versus API-driven approaches, clarify function calling within LLMs, and compare prominent open-source frameworks like LangGraph, Microsoft AutoGen, CrewAI, Composio, OctoTools, BabyAGI, and MemGPT (Letta). It’s not a casual read, but it’s packed with useful insights if you’re into agents.
What’s in today’s episode?
Essential Components of Action
Tool Learning: UI-Based vs. API-Based Interactions
Function Calling: How LLMs Invoke External Functions
Open-Source Frameworks Enabling Actions (This overview of frameworks is a goldmine for anyone looking to build or experiment with agentic AI.)
Emerging Trends in AI Action Execution
Concluding Thoughts
Resources to dive deeper
Essential Components of Action
Tool Learning: UI-Based vs. API-Based Interactions
One fundamental choice in enabling agent actions is how the agent interacts with external tools or applications. Broadly, these interactions fall into two categories: UI-based interactions and API-based interactions.
UI-Based Tool Use: In this approach, an AI agent operates the software’s user interface (UI) like a human would – clicking buttons, typing into forms, and reading on-screen information. Such a computer-use AI agent essentially simulates a human user’s behavior on the frontend. This method is akin to robotic process automation (RPA) driven by AI. The advantage of UI-based action is that it works even when direct programmatic access is unavailable or prohibited. For example, if an enterprise application has no public API or a website’s terms of service forbid scraping, an agent can still perform tasks by navigating the UI just as an employee would. UI-based agents inherently comply with front-end usage policies and can integrate workflows across multiple disparate applications. However, this approach can be slower and more brittle – changes in the interface or layout can break the agent’s “script,” and setting up a virtual browser or desktop environment for the agent to operate in adds complexity.
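To make this concrete, here is a minimal sketch of UI-based tool use with Playwright. It is an illustration, not a reference implementation: the URL, the CSS selectors, and the `submit_expense_report` helper are hypothetical stand-ins for a legacy internal app that exposes no API.

```python
# A minimal sketch of UI-based tool use: the agent drives a browser the way a
# person would. Playwright is one common choice; the URL and selectors below
# are placeholders for whatever application the agent has to operate.
from playwright.sync_api import sync_playwright

def submit_expense_report(amount: str, description: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("https://internal-app.example.com/expenses")  # hypothetical legacy app
        page.fill("#amount", amount)            # type into form fields, like a human user
        page.fill("#description", description)
        page.click("button#submit")
        confirmation = page.inner_text(".confirmation-banner")  # read the on-screen result
        browser.close()
        return confirmation
```

Note how much of the code is about locating elements on the page – exactly the part that breaks when the interface changes.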
API-Based Tool Use: Here, an agent uses backend APIs or function calls to interact with software systems directly. Instead of clicking through a web page to get stock prices, for instance, the agent might call a REST API that returns the data in JSON. API-based actions are more structured and efficient: they provide the agent with precise data or let it trigger defined operations (e.g. create a calendar event via an API) without having to parse visual interfaces.
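The API-based equivalent of the same idea is far terser: the agent calls an endpoint and receives structured JSON. A minimal sketch, assuming a hypothetical stock-quote endpoint and API key:

```python
# A minimal sketch of API-based tool use: structured request in, structured data out.
# The endpoint and API key are hypothetical placeholders.
import requests

def get_stock_price(ticker: str) -> float:
    resp = requests.get(
        "https://api.example.com/v1/quote",                # hypothetical REST endpoint
        params={"symbol": ticker},
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["price"]                            # precise, machine-readable result
```

There is nothing to screen-scrape and no layout to break; the trade-off is that such an endpoint has to exist and be accessible in the first place.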
In practice, modern AI agent frameworks prioritize API-based tools for their reliability and speed. Even Anthropic’s Computer Use capability, which lets an agent operate a virtual desktop through UI actions, is exposed to the model as an API-defined tool. Tool learning in this context means prompting the AI to understand when and how to use a tool, often through a series of prompts, constraints, and examples. Given descriptions of available tools and usage examples, an LLM-based agent can select the right tool for a query and generate the correct API call format, effectively learning the tool’s interface from instructions. AI practitioners often note that it’s easy to get tool use working in a demo but hard to keep it working consistently. Research like Toolformer shows LLMs can be fine-tuned to insert API calls autonomously, but practical systems typically rely on prompt engineering or function-calling interfaces instead of retraining. For businesses, the choice between UI and API tools matters: API-focused agents excel in efficiency and scalability when robust APIs exist, while UI-based agents are necessary for legacy systems or UI-only platforms.
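One way to picture prompt-based tool learning is a registry of tool descriptions that gets rendered into the prompt, with the model asked to reply in a fixed JSON shape. The sketch below is model-agnostic and reuses the two hypothetical tools from the earlier sketches; `call_llm` stands in for whichever chat-model client you use.

```python
# A model-agnostic sketch of prompt-based tool selection: the agent sees tool
# descriptions, answers with a JSON tool call, and the runtime executes it.
# `call_llm`, `get_stock_price`, and `submit_expense_report` are hypothetical.
import json

TOOLS = {
    "get_stock_price": {
        "description": "Return the latest price for a stock ticker.",
        "args": {"ticker": "string, e.g. 'AAPL'"},
        "fn": get_stock_price,           # API-based tool from the sketch above
    },
    "submit_expense_report": {
        "description": "File an expense report in the internal expenses app.",
        "args": {"amount": "string", "description": "string"},
        "fn": submit_expense_report,     # UI-based tool from the sketch above
    },
}

def run_agent_step(user_query: str) -> str:
    tool_docs = "\n".join(
        f"- {name}: {t['description']} args={t['args']}" for name, t in TOOLS.items()
    )
    prompt = (
        "You can use these tools:\n"
        f"{tool_docs}\n"
        'Reply ONLY with JSON: {"tool": "<name>", "args": {...}}\n'
        f"Task: {user_query}"
    )
    choice = json.loads(call_llm(prompt))        # hypothetical LLM call
    tool = TOOLS[choice["tool"]]
    return tool["fn"](**choice["args"])          # execute the selected tool
```

In production you would also validate the parsed JSON and handle the case where the model picks a tool that doesn’t exist – exactly the consistency problem practitioners complain about.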
Function Calling: How LLMs Invoke External Functions
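As a concrete reference point for this section, here is a minimal sketch of native function calling using the OpenAI Python SDK’s tool-calling interface, the pattern most frameworks build on. The `get_stock_price` function is the hypothetical tool from the earlier sketch, and the model name is only an example.

```python
# A minimal sketch of native function calling: the model returns a structured
# tool call instead of free text, the runtime executes it, and the result is
# fed back so the model can produce a grounded final answer.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "get_stock_price",
        "description": "Return the latest price for a stock ticker.",
        "parameters": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
}]

messages = [{"role": "user", "content": "What is NVDA trading at right now?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = response.choices[0].message

if msg.tool_calls:                                    # the model chose to call a tool
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = get_stock_price(**args)                  # hypothetical tool from the sketch above
    messages.append(msg)                              # keep the assistant's tool call in history
    messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)           # answer grounded in the tool result
```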
Upgrade if you want to be the first to receive the full articles with detailed explanations and curated resources directly in your inbox. Simplify your learning journey →
Or follow us on Hugging Face – you can read this article there tomorrow for free.
Want a 1-month subscription? Invite three friends to subscribe and get a 1-month subscription free! 🤍