(Hi! We're Riza, and we're making LLM tool-use simple and safe.)
Last week Anthropic announced Claude computer use, where they basically said, "what if we give Claude control of a computer?"
The quickest way to experiment with computer use is Anthropic's reference implementation: a Docker container that takes less than ten minutes to set up, even if you've never used Docker before.
I spent several hours playing with the computer use reference implementation last week. The experience felt like an exaggerated version of the emotional journey I've had while trying out so many AI-powered products over the last 18 months. At first there's a giddy sense of, "Wow! This is amazing. I can't believe this works! You could do anything with this!"
But as you use it more, you realize that even when it does complete a task, the final result isn't always what you wanted. It's also slow, unreliable, and expensive -- so much so that it's hard to imagine a scenario where computer use is the right tool for the job today.
That said:

- Watching a computer use a computer in an open-ended way was such an exciting and novel experience that I’d encourage anyone with an interest in AI to spin up the reference implementation and give it a go.
- Computer use will get better. It’s hard to say how much better -- and it’s got a long way to go before this is a practical tool -- but it'll never be worse than it is today.
In this post, I’ll show you how to get up and running with the Claude computer use reference implementation, and share a few observations from the experiments I ran over a few hours and $30 worth of Anthropic credits.
Or if you prefer, you can watch this video of getting started with computer use instead.
Running the computer use reference implementation
The general idea behind Claude computer use is this loop:
- Send a screenshot of the desktop GUI to Claude via the API
- Claude uses vision to determine the state of your desktop
- Claude chooses a tool to use on the computer: mouse or keyboard emulation, string manipulation, or bash commands
- You (programmatically) execute that action on a computer
- Send another screenshot of the desktop to Claude...
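Concretely, here's a minimal sketch of that loop in Python. The take_screenshot and execute_action helpers are hypothetical stand-ins for whatever screenshot and input-emulation code you'd write for your own machine; the model name, tool type, and beta flag are taken from Anthropic's computer use documentation at the time of writing.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

messages = [{"role": "user", "content": "Find last night's Lakers score."}]

while True:
    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=[{
            "type": "computer_20241022",  # computer use tool from the beta docs
            "name": "computer",
            "display_width_px": 1024,
            "display_height_px": 768,
        }],
        betas=["computer-use-2024-10-22"],
        messages=messages,
    )
    messages.append({"role": "assistant", "content": response.content})

    if response.stop_reason != "tool_use":
        break  # Claude didn't request an action, so the task is done

    tool_results = []
    for block in response.content:
        if block.type != "tool_use":
            continue
        execute_action(block.input)  # hypothetical: click, type, or run bash
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": [{"type": "image", "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": take_screenshot(),  # hypothetical: base64 PNG of the desktop
            }}],
        })
    messages.append({"role": "user", "content": tool_results})

The reference implementation handles all of this for you, including the bash and text-editing tool definitions.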
You can make computer use work with nearly any computer, so long as you can orchestrate the loop between the Anthropic API and that machine.
If you want to start experimenting quickly, use Anthropic's reference implementation. The reference implementation is a Docker container with a simple chat-with-Claude webapp sitting on top of an orchestration loop that handles tool execution within the context of the container. Given the complexity of all that's happening here, Anthropic did a great job on this reference implementation -- it's a lot more than just a simple API call!
If you've already gotten this far in this post, you probably owe it to yourself to try out the reference implementation. It'll take less than 10 minutes to get up and running. You'll need:
- An Anthropic API Key which you can find in your Anthropic Console.
- Anthropic credits. $20 should be sufficient for an hour’s worth of play. (Yes, it’s expensive.)
- Docker. If you've never used Docker, don’t be intimidated. Just head over to Docker's website, install the desktop app, and run it.
You do not need to clone Anthropic's reference implementation repo for this to work. The first time you run the Docker command below, Docker will pull down the container image with everything you need.
Open a terminal, and set your Anthropic API key as an environment variable.
export ANTHROPIC_API_KEY=your_anthropic_api_key
Then run this Docker command:
docker run \
-e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
-v $HOME/.anthropic:/home/computeruse/.anthropic \
-p 5900:5900 \
-p 8501:8501 \
-p 6080:6080 \
-p 8080:8080 \
-it ghcr.io/anthropics/anthropic-quickstarts:computer-use-demo-latest
Once the container spins up (it might take a minute), visit http://localhost:8080. You’ll find a Streamlit app on the left, and a VNC connection to the virtual desktop GUI on the right. From here, you can give Claude tasks and watch it use the computer to complete them.
What to do from here? Presumably you can ask Claude to do anything that you yourself might do with a computer.
Here’s what I tried.
Web research
I started by asking Claude to find the score from the previous night’s Lakers game – well past the knowledge cutoff of an LLM.
Claude opened Firefox and searched Google, but it searched for the wrong date (it didn't know what date "last night" was). It landed on YouTube videos and quickly figured out that it wasn’t going to find the right answer there.
At first I expected that it would get stuck there, but I was delighted to watch it realize it had gone down the wrong path, back up, visit NBA.com, and then find the correct score.
Despite the speedbumps, I was impressed to see it self-correct in real-time. This was a constant theme throughout my experiments: though I didn't always get the output I was hoping for, I rarely saw it get stuck in a dark alley that it couldn't get out of.
Having successfully gotten last night's score, I decided to go deeper. The game in question was actually the Lakers' 2024 season opener, and there seemed to be some excitement on the internet over the Lakers breaking an opening night curse.
I asked Claude to compare the results from this game to all opening nights for the Lakers over the last ten years. I was expecting it to use Firefox and search for a dataset, but it didn't. Without doing any searching, it just gave me the results for the last ten years of opening nights. There was no citation, and it didn't show its work. I don’t know if that data was accurate or hallucinated.
This was another repeating theme: lack of transparency around reasoning. Presumably the value add of computer use is that you can build autonomous agents. But I often found I was getting answers without insight into how Claude had arrived at the answer. To some extent it doesn't matter if the historical scores it provided are accurate or not: I can't trust them because I have no way to verify them, and I know that LLMs are prone to hallucinate.
Working with spreadsheets
Let's pretend that the basketball data is accurate... what can I do with it now that I have access to a full computer?
I noticed the LibreOffice Calc icon in the task bar and asked Claude to open a spreadsheet and put the data in there. It was able to click on the application icon, open the app, create a new spreadsheet, and then it started trying to fill out the columns. Then things went off the rails.
Turns out that manipulating spreadsheets via a GUI is a pretty bad use case for computer use, which Anthropic calls out in the docs under computer use limitations:
Spreadsheet interaction: Mouse clicks for spreadsheet interaction are unreliable. Cell selection may not always work as expected. This can be mitigated by prompting the model to use arrow keys.
Spreadsheet use also highlighted another known limitation:
Latency: the current computer use latency for human-AI interactions may be too slow compared to regular human-directed computer actions. We recommend focusing on use cases where speed isn’t critical (e.g., background information gathering, automated software testing) in trusted environments.
Working in a spreadsheet is almost the worst-case scenario if you want to see these limitations in full effect: cells are small and it's hard to precisely click on the right one. Entering text via the GUI is also comically slow: take a screenshot, run the screenshot through a vision model, click on a cell, take a screenshot, run the screenshot through vision, enter text, take a screenshot, etc... And as soon as Claude misclicks, it’s very hard/slow/costly to set things right.
Of course, spreadsheets aren't designed for computers; they're designed for humans! And computers are actually pretty good at manipulating tabular data without a GUI. Asking a multi-modal LLM with vision to create a spreadsheet via the GUI is a Rube Goldberg-esque exercise. If the goal is "get this data into a spreadsheet," surely there's a better way to play to its strengths.
I asked Claude to export the data to a CSV and then open that file in the spreadsheet app. This task it accomplished flawlessly: it ran a bash command to create the CSV, then navigated the menus in LibreOffice Calc to open the file. In the process, it had to deal with a couple of alerts, such as “do you want to save the existing file?” and it impressively navigated those to task completion.
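I didn't capture the exact bash one-liner Claude ran, but the spirit of it, sketched here in Python, is just this (the filename and columns are my placeholders, not Claude's output):

import csv

# Placeholder filename and rows; Claude's actual (unverified) data isn't shown here.
rows = [
    ["season", "opponent", "result"],  # hypothetical column headers
    ["...", "...", "..."],             # one row per opening night
]
with open("lakers_openers.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)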
Of course LLMs have been able to write code to turn tabular data into CSVs for quite some time. When is a GUI a better interface for an LLM?
Product research
I need a new clock for my living room wall, something like what you’d see hanging in a classroom. I asked Claude to search Amazon for wall clocks matching specific parameters: a sweeping second hand, a price between $20 and $50, and a diameter greater than 12 inches.
Once again, it was delightful to watch Claude manipulate the GUI to open Firefox, go to Amazon, and run a search. Its search phrase was fairly naive – more or less just pasting in the literal description I used without refinement. It navigated to several items and reported those options back to me.
So while it did successfully give me five wall clock options, almost all of them fell outside the parameters I'd specified.
Writing and running code
Trying again to play to its strengths, I asked Claude to write and run a Game of Life implementation.
I thought it might open up a text editor like a human would, but instead it just zero-shotted a Game of Life implementation in Python, catted that string out to a .py file, and executed that file from bash using the Python interpreter. Unfortunately it ran the bash command in the background, not in a terminal, so I couldn’t watch it work.
This was funny -- it completed the task I gave it on the first try, but I couldn't watch it. The program timed out, and I instructed it to repeat the task but to do it all in the terminal so I could watch the output.
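For flavor, here's the kind of minimal implementation it produced -- this is my own sketch for illustration, not Claude's actual output:

import os
import random
import time

# A minimal terminal Game of Life -- my own sketch, not Claude's actual code.
WIDTH, HEIGHT = 40, 20
grid = [[random.random() < 0.25 for _ in range(WIDTH)] for _ in range(HEIGHT)]

def neighbors(g, r, c):
    # Count live neighbors, wrapping around the edges (toroidal board).
    return sum(
        g[(r + dr) % HEIGHT][(c + dc) % WIDTH]
        for dr in (-1, 0, 1) for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )

def step(g):
    # A cell lives next turn with exactly 3 neighbors, or 2 if already alive.
    return [
        [neighbors(g, r, c) == 3 or (g[r][c] and neighbors(g, r, c) == 2)
         for c in range(WIDTH)]
        for r in range(HEIGHT)
    ]

while True:
    os.system("clear")  # redraw in place; use "cls" on Windows
    print("\n".join("".join("#" if cell else "." for cell in row) for row in grid))
    grid = step(grid)
    time.sleep(0.1)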
This is a task that I've seen an LLM complete many times over the last 18 months and it delights me every time. It sort of blows me away because while this is code that I could write myself, it would probably take an hour or two and it definitely wouldn't run on the first try.
But again, this simple “write code and run it” task is not a new ability for LLMs. I was just doing it in a Docker container rather than in ChatGPT. I wanted to try something more complex.
Data analysis
I wanted to try something that could piece together a few aspects of having a computer at your disposal, and would also play to an LLM's traditional strengths of writing code and text as opposed to working with the GUI.
I asked it to do a data analysis project: to FTP to a server, download publicly available data about New York City, and do some simple analysis. I wanted to test something that used multiple steps and multiple tools, even though this task is more open-ended than what you'd do in the real world.
And Claude did indeed complete the task!
It first tried to install an FTP client, but failed. Then it switched strategies and used curl instead. It tried a couple of different datasets and ultimately decided on a dataset from NYC Open Data about restaurant inspections.
It downloaded that data, wrote a Python script to analyze the data, and ran into an error because a required library wasn't installed. So it installed Pandas and then was able to run the code and do some analysis.
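Pieced together, what it did looked roughly like this. This is my reconstruction, not Claude's actual commands; the dataset endpoint and column names are assumptions based on NYC Open Data's restaurant inspection dataset, so verify them before relying on this:

import subprocess
import pandas as pd  # the library Claude had to install first (pip install pandas)

# Assumed endpoint for NYC Open Data's restaurant inspection results;
# check data.cityofnewyork.us for the real dataset ID and schema.
URL = "https://data.cityofnewyork.us/resource/43nn-pn8j.csv?$limit=5000"
subprocess.run(["curl", "-sL", "-o", "inspections.csv", URL], check=True)

df = pd.read_csv("inspections.csv")
print(df["boro"].value_counts())   # assumed column: inspections per borough
print(df["grade"].value_counts())  # assumed column: distribution of letter grades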
This, to me, is pretty incredible. There were clear mistakes along the way, but it was able to accomplish an open-ended multi-step task. At each step it encountered resistance, course-corrected, and still accomplished the broader goal. This was delightful to watch, and one of the most compelling demonstrations I've seen of the promise of AI agents.
Playing games with confident incompetence
One reason the data analysis project felt like a good fit for computer use is that it almost entirely avoided the GUI. But navigating the GUI is the major new capability of computer use.
What type of task can we do with computer use that we couldn't do three weeks ago?
My thought: play a game.
I asked Claude to start a new game of Freeciv. It downloaded Freeciv, opened it, and successfully started a new game.
Initially it was able to move units around and build a city, but it quickly got stuck in a dark alley after the second or third turn.
What's worse, though, was its overconfidence. For instance, four of the five statements in one of its status updates were false.
This was disappointing. I had this fantasy of setting Claude to play Freeciv and coming back a few hours later to see that it had accomplished world domination. It couldn't even get out of the parking lot.
I thought maybe Freeciv wasn't the right game. Is there another game that plays more to Claude's strengths? These are large language models, after all, so maybe some sort of turn-based word game?
I told it to go play Wordle. It opened the NYT website, navigated to Wordle, and beat the game in under 20 seconds. It was so fast, in fact, that I strongly suspected it had some sort of knowledge of this particular Wordle (#1223) in its training data. It did not seem to be taking screenshots each turn -- it seemed to just know what it was going to do, and it flew through those actions.
I wanted to try another Wordle, but that requires an NYT login, and Anthropic strongly discourages you from signing into anything within the reference implementation of computer use because things could... go wrong.
I found a free site with a Wordle archive and had Claude play, with far worse results. As with Freeciv, the problem was not just that it failed to solve the puzzle but that it was completely confident it had.
Cost and conclusion
Everything I've described above cost me about $30 in Anthropic credits over roughly two hours of experimentation.
That's roughly $15/hour with very low quality results. This first iteration of computer use is not a substitute for human workers on general tasks.
I don't say any of this to criticize Anthropic. I love their willingness to deploy iteratively. A quick glance at the limitations section of the computer use docs tells you that Anthropic has a sober perspective on computer use's current capabilities... and they shipped it anyway!
I so appreciate the willingness to get new tools out into the hands of folks. And candidly, as you can see from this video, I was giddy that this works at all.
A big challenge with any generative AI is figuring out when it's the right tool for the job. After a few hours of playing with computer use, it's not clear to me when it would be, given its current limitations. But I'd love to hear what you've found in your experiments, and where you think this goes next if we see order-of-magnitude improvements in speed, reliability, and cost.