From ChatGPT to Figure 01

Ludwig Wittgenstein's line, "The limits of my language are the limits of my world," is a profound insight into how language shapes our perceptions and experiences, and it parallels OpenAI's attempt to use large language models (LLMs) to build AI that understands the world through language. Today, AI is interacting with the world as if it really "gets it."

In the year and a bit since OpenAI released ChatGPT, the way AI interacts with the world has evolved rapidly: from conversational chatbots, to plug-in capabilities via APIs, to multimodal input and output, to physical interaction with Figure 01. The range of experiences users can have with AI keeps expanding. Today, we'd like to look back at that journey.

The shocking rise of ChatGPT

ChatGPT reached 1 million users in just five days; I'd never seen anything spread so quickly. The feeling of talking to a real person was exactly what people had been waiting for since AlphaGo. ChatGPT went viral in no time, and before you knew it, it was part of everyday life.

ChatGPT gathered 1 million users in 5 days

GPT plugins: AI uses a service

Skyscanner to find cheap airline tickets, Google Scholar or arXiv to find papers: who does this manually anymore? The plug-in feature in ChatGPT let AI interact with other services through their APIs, and it led many people to expect a "new ecosystem where you run services as plug-ins inside ChatGPT instead of downloading them from the Google Play Store." GPTs have not seen much adoption yet, so whether that ecosystem materializes remains an open question.

GPTs that don't meet expectations
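The core mechanic behind plug-ins is simple: the model emits a structured request naming a tool and its arguments, and the client executes the matching API call. Here is a minimal sketch of that loop in Python; the tool name `search_flights`, its schema, and the hard-coded model output are all illustrative assumptions, not any real plugin's API.

```python
import json

# Hypothetical local stub standing in for a real flight-search service;
# the function name and fields are illustrative, not an actual plugin spec.
def search_flights(origin: str, destination: str) -> dict:
    return {"origin": origin, "destination": destination, "cheapest_usd": 129}

# Registry mapping tool names the model may emit to callable implementations.
TOOLS = {"search_flights": search_flights}

def dispatch(model_output: str) -> dict:
    """Parse a tool call the model emitted as JSON and run the matching function."""
    call = json.loads(model_output)
    fn = TOOLS[call["tool"]]
    return fn(**call["args"])

# The model's side of the exchange, hard-coded here for illustration:
model_output = '{"tool": "search_flights", "args": {"origin": "ICN", "destination": "SFO"}}'
print(dispatch(model_output))
```

In a real deployment the JSON would come back from the model's API response, and the result would be fed into the next model turn so it can answer in natural language.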

Multimodal for a broader understanding of the world

Multimodal features allow ChatGPT to process forms of information beyond text, such as speech and images, further deepening the AI's understanding. It can also respond in voice or with images, not just text, and with the recently announced Sora it can even work with video.

Sora tackles the challenges of 3D consistency, high-definition realism, versatility, and complex interactions, and it will shape AI video technology going forward. For now it is available only to select artists due to ethical and safety concerns.

App agent with direct mouse and keyboard manipulation

What if LLMs could interact with apps directly by scrolling, tapping, or swiping instead of going through an API? That could fundamentally shake up how we operate PCs and mobile devices.

Recently, Google DeepMind published research on an AI agent called SIMA. They trained the agent to output keyboard and mouse actions and evaluated its ability to perform simple tasks in games. This suggests that LLMs could interact with the world far more freely, and it is an area of research that will continue to be explored.

A generalist AI agent for 3D virtual environments
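To make the idea concrete, here is a toy sketch (not SIMA's actual interface, which is not public in this form) of the translation layer such an agent needs: text commands produced by a model are parsed into low-level keyboard and mouse events that an environment could execute. The command vocabulary (`press`, `click`) is an assumption for illustration.

```python
# Toy parser turning an agent's text commands into structured input events.
# The "press <key>" / "click <x> <y>" vocabulary is hypothetical.
def parse_action(line: str) -> dict:
    parts = line.split()
    if parts[0] == "press":
        return {"type": "keyboard", "event": "press", "key": parts[1]}
    if parts[0] == "click":
        return {"type": "mouse", "event": "click", "x": int(parts[1]), "y": int(parts[2])}
    raise ValueError(f"unknown action: {line}")

# A short script the model might emit for "walk forward, then click the door":
script = ["press w", "click 640 360"]
actions = [parse_action(line) for line in script]
print(actions)
```

The interesting part is that the agent's output space is just text, so the same LLM machinery that writes sentences can, with training, drive a game or a desktop.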

What if AI had a physical body?

When I was a kid, I watched the movie "I, Robot," which featured Sonny, a robot that could think and move on its own. To be honest, I never thought that sci-fi fantasy would become reality, but after watching the Figure 01 demo, I realized it might be coming soon.

Figure 01 showed me how an LLM can understand your conversation and the images from its camera, and interact with you through a robot body. I can't wait to see the world where a humanoid robot does your laundry and cooks with your existing appliances, instead of us cramming a dryer, washer, and styler into one machine.

What's next in the evolution?

LLMs' interaction with the world is evolving toward a broader understanding of it and more ways to communicate with it. What's remarkable is that this isn't a costly transformation: we're embedding LLMs into existing sensors, data, controllers, and devices to create a completely different user experience.

LLMs are expected to breathe life into everything from automotive to IoT, neural interfaces, and entirely new devices. In the process, many services and products will likely consolidate or disappear; conversely, current devices may need to evolve in form or function to accommodate LLMs. It's time to think about how to ride this wave of change.