Andy Matuschak

From my notes: first impressions of Humane's Ai Pin

Added 2023-11-14 18:14:20 +0000 UTC

I spent some time this weekend writing notes around the Ai Pin and recent related announcements in AI-centric personal computing. This isn't a proper essay—just some rough notes—but I thought enough of you might be interested that I'd share.

Humane

From Imran and Bethany and Ken Kocienda and friends, a new personal computer which de-emphasizes screens and direct interaction in favor of Ubiquitous computing ideas and AI-centrism. The company’s first product is the “Ai Pin”, announced 2023-11-09.

Ai Pin

The device is a wearable pin with a camera, microphone, speaker, touch pad, and laser projector, along with various other smaller sensors and components. The intent is to enable a mostly screen-free computing experience, driven primarily by voice.

The primary interaction pattern with the device is to press and hold its face plate, and to speak a query aloud. What distinguishes this interaction from Siri and its ilk? The query is evaluated not via GOFAI-style heuristic trees, but rather via GPT-4 (or maybe sometimes 3.5-turbo?), with Retrieval-augmented generation through appropriate contextual information like location and user data (calendar, messages, contacts, recent queries, etc). It also appears to implement something like the Reason+Act pattern, so that for instance information about music or specific locations can be supplied to response generation; the model can output action plans which include things like sending a message, playing a song, and so on.
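To make that guess concrete, here’s a minimal Python sketch of context-augmented prompting combined with a Reason+Act style loop. Everything in it is my own assumption: the tool names (lookup_calendar, send_message), the transcript format, and scripted_model, which is a hard-coded stand-in for the LLM. None of it reflects Humane’s actual implementation.

```python
from typing import Callable


# Hypothetical tools the model is allowed to call.
def lookup_calendar(day: str) -> str:
    events = {"today": "3pm design review at the office"}
    return events.get(day, "nothing scheduled")


def send_message(to: str, body: str) -> str:
    return f"sent to {to}: {body}"


TOOLS: dict[str, Callable[..., str]] = {
    "lookup_calendar": lookup_calendar,
    "send_message": send_message,
}


def build_transcript(query: str, context: dict[str, str]) -> list[str]:
    """Retrieval step: put contextual facts in front of the user's query."""
    transcript = [f"{key}: {value}" for key, value in context.items()]
    transcript.append(f"user: {query}")
    return transcript


def scripted_model(transcript: list[str]) -> dict:
    """Stand-in for the LLM (an assumption for this sketch).

    It reads the transcript so far and returns either the next action
    to take or a final answer.
    """
    observations = [t for t in transcript if t.startswith("observation:")]
    if len(observations) == 0:
        return {"action": "lookup_calendar", "args": {"day": "today"}}
    if len(observations) == 1:
        return {
            "action": "send_message",
            "args": {"to": "Sam", "body": "See you at the 3pm review"},
        }
    return {"final": "I told Sam you will see them at the 3pm review."}


def answer(query: str, context: dict[str, str],
           model: Callable[[list[str]], dict] = scripted_model,
           max_steps: int = 5) -> str:
    """Alternate between model steps and tool calls until a final answer."""
    transcript = build_transcript(query, context)
    for _ in range(max_steps):
        step = model(transcript)
        if "final" in step:
            return step["final"]
        result = TOOLS[step["action"]](**step["args"])
        # Feed the tool result back so the next model step can use it.
        transcript.append(f"observation: {result}")
    return "Sorry, I could not finish that."
```

The point of the loop is that each tool result is appended to the transcript before the model is consulted again, which is what would let a real model look up a meeting first and only then compose the message.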

The device has a camera, but (as of 2023-11-09) its integration into queries appears to be limited to questions about nutrition and food (“How much protein is in this?”). Otherwise, it can take impromptu photos, though of course the composition will be somewhat unpredictable.

The device’s primary output modality is audio, but it also features limited visual output via a monochromatic “laser ink” system, projected onto the user’s hand. We’ve been shown a minimal framework of gestural interaction: you can pinch your hand to actuate, tilt your hand to select a secondary function via a pie-menu-like design, and move your hand in Z to access system menus. I presume that it would be uncomfortable to use this for more than a few seconds at a time, but that’s compatible with the device’s intent to keep users present in their environment.

Queries, actions, photos, reminders, and other key items are made available as a stream of memories via the Humane.Center web site.

“We don’t do apps”

Actions begin with natural-language queries, so there’s no traditional home screen or app switcher. Instead, the device appears to use something closer to the blackboard pattern: various services can supply information and actions to the query, as contextually appropriate (again, presumably through a combination of Retrieval-augmented generation and something like the Reason+Act pattern).

One can imagine retrieving appropriate information based on one’s context, so that (for instance) a transit service could supply location-appropriate timing information, or information about a museum you’re visiting. The blackboard pattern also permits distributed sourcing of information and actions: perhaps your local restaurant could have a small Bluetooth device which broadcasts its menu so that it would be available for appropriate local queries.
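A loose Python sketch of the blackboard idea, as I imagine it applying here: independent sources each look at a shared board holding the query and context, and contribute facts only when they have something relevant. The Blackboard class, both sources, and the place names are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Blackboard:
    """Shared state that every source can read and contribute to."""
    query: str
    location: str
    facts: list[str] = field(default_factory=list)


# A source inspects the board and returns a fact, or None if it has
# nothing relevant to add.
Source = Callable[[Blackboard], Optional[str]]


def transit_source(board: Blackboard) -> Optional[str]:
    # Hypothetical transit service: only speaks up near a station.
    if board.location == "Main St station":
        return "Next train departs in 4 minutes"
    return None


def restaurant_beacon(board: Blackboard) -> Optional[str]:
    # Hypothetical local broadcast from a restaurant's Bluetooth device.
    if board.location == "Corner Cafe":
        return "Menu: lentil soup, grilled cheese"
    return None


def contribute(board: Blackboard, sources: list[Source]) -> Blackboard:
    """Give every source a chance to post to the board."""
    for source in sources:
        fact = source(board)
        if fact is not None:
            board.facts.append(fact)
    return board
```

Whatever ends up in board.facts would then be handed to the model alongside the query. The sources never need to know about each other, which is what would make distributed sourcing possible.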

But “we don’t do apps” clearly isn’t the full story. For example, the music player has an interface on the laser ink display. What defines that interface? And where does the music come from? In this instance, the answer appears to be Tidal, which has partnered with Humane. But I have a Spotify subscription; how might that be made to work? Perhaps Spotify (and other music players) could expose a suitable API which would permit the device to query and play music. Perhaps that API could even transmit declaratively-specified interfaces for the laser ink display.
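Here’s a rough Python sketch of what such a service API might look like, including a declaratively specified interface. The MusicService protocol, the DemoMusicService class, the track titles, and the shape of the interface dictionary are all assumptions of mine; I have no idea what Humane or any music service actually exposes.

```python
from typing import Protocol


class MusicService(Protocol):
    """Hypothetical contract a music provider would implement."""
    name: str

    def search(self, query: str) -> list[str]: ...
    def play(self, track: str) -> str: ...
    def interface(self, track: str) -> dict: ...


class DemoMusicService:
    """Toy provider with an in-memory catalog, for illustration only."""
    name = "demo-music"

    def __init__(self) -> None:
        self._catalog = ["Morning Walk", "Evening Tide"]

    def search(self, query: str) -> list[str]:
        needle = query.lower()
        return [t for t in self._catalog if needle in t.lower()]

    def play(self, track: str) -> str:
        return f"{self.name}: playing {track}"

    def interface(self, track: str) -> dict:
        # Declarative description: the device decides how to render it.
        return {"title": track, "controls": ["previous", "play_pause", "next"]}


def play_first_match(service: MusicService, query: str) -> tuple[str, dict]:
    """Play the first search hit and return its status plus UI description."""
    matches = service.search(query)
    if not matches:
        raise LookupError(f"no track matching {query!r}")
    track = matches[0]
    return service.play(track), service.interface(track)
```

The device would only ever depend on the protocol, so swapping one provider for another would mean supplying a different implementation rather than building a new app.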

Likewise, consider the text messaging service. Many of my friends use WhatsApp or Signal. Perhaps on this device we think of those systems as API endpoints which expose suitable information and actions to the device’s LLM queries, and perhaps declarative interface specifications for service-specific presentations and interactions. I see from Humane’s web site that they have a partnership with Slack; I wonder if that necessitates extensive first-party work on their part, or if it’s more of a permission-granting relationship. That is: is it more like the original iOS YouTube app, or like iOS VOIP apps (which must be Apple-blessed IIRC)?

Finally: suppose I want to use the device to think aloud about my research as I’m walking around. It would be natural to query my research notes, which are stored on my laptop as a big folder of Markdown files. Perhaps this information could be synced into Humane’s cloud so that appropriate content could be supplied to conversation in the future.
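As a sketch of the retrieval half of that idea: the Python below loads a folder of Markdown files and picks the notes most relevant to a spoken query. I’ve swapped in simple word overlap as the relevance score to keep it self-contained; a real system would more likely use embeddings. The function names and the example notes are hypothetical.

```python
import re
from pathlib import Path


def load_notes(folder: str) -> dict[str, str]:
    """Read every Markdown file under the folder into memory."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in Path(folder).rglob("*.md")
    }


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_notes(query: str, notes: dict[str, str], k: int = 2) -> list[str]:
    """Return up to k note names ranked by word overlap with the query."""
    query_words = tokenize(query)
    scored = []
    for name, body in notes.items():
        overlap = len(query_words & tokenize(body))
        if overlap > 0:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically by file name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]
```

The selected notes would then be added to the model’s context before it answers, in the same way as location or calendar data.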

I’m extremely curious to learn more about the architectural design of the platform. I would guess that for now, in the name of shipping, it’s fairly ad-hoc and special-cased—but that they have some more systematic plans in mind.

The power of context and deixis

One naive way to look at the Ai Pin is to ask: how is this different from Siri? One first and important difference, of course, is that it’s powered by GPT-4, which is capable of open-ended natural reasoning, while Siri is more like one of those hideous phone tree systems that tries (and often fails) to use simple heuristics and decision trees to map your request onto some limited action space it supports.

But a much more interesting difference is the way you can use words like “this”. “How much protein is in this?” “What’s the best way to get to the meeting this afternoon?” “When was this park built?” You can point to things, or times, or places, and the device’s sensors and prior activity surface the context necessary to make answers possible via Retrieval-augmented generation.

The device’s responses to queries which include the word “I” or “my” are likewise enriched: “Have I been there before?” “Don’t my notes include something about this?” I imagine that many people will end up giving basically unlimited context about their lives to these models—e.g. via Rewind, every piece of text I’ve ever viewed on a screen.

This same logic is what makes Dot interesting to me.

On the astounding luck of Humane’s timing

Humane was founded in 2018, a year before GPT-2 showed that first glimmer of promise which most technologists (including me) still largely ignored. At the time, when I heard about the company’s ambitions, the main goal seemed to be to reimagine the personal computer without screens, so that people could remain connected and present in their worlds. The patent and fundraising hype emphasized the laser projection display and the wearable pin form factor. Insofar as AI was a focus in job reqs, the emphasis seemed to be on the computer vision which enabled the laser ink gestures. We heard talk of voice-based inquiries, but the aspirations seemed closer to something like “Siri++” than to a weak AGI.

But now, in 2023, what’s interesting about this device—to me—bears little resemblance to any of that. As far as I can tell, this device will succeed or fail because it deploys a Siri which actually works, a Siri which uses a frontier LLM and supplies it with gobs of appropriate context. The laser ink display isn’t nearly as central as the early hype suggested. I haven’t used an Ai Pin, of course, but my impression is that if the laser ink projector weren’t present at all, the device would be worse off, but its ultimate success or failure would not change. Likewise, the computer vision doesn’t seem terribly essential (yet—perhaps deixis will extend more substantially to vision in the future). And the wearable pin form factor doesn’t seem very important either; it’s easy to imagine a very similar device as an earbud.

All this to say: the extraordinary work of many brilliant designers and engineers notwithstanding, I believe this device will owe its success (if it succeeds) to the shocking capabilities of GPT-4. If we were still in 2018’s NLP days, and they could ship something only roughly Siri-level, the Ai Pin would amount to something like an extraordinarily expensive and cumbersome AirPod alternative. And so, my gosh: the timing! They got so, so incredibly lucky! They can’t have predicted in 2018 that this much progress was going to happen by 2023; and if they did make that prediction, it would have been a pretty irresponsible bet.

How long might it have been clear that these kinds of results were possible? Maybe a fine-tuned GPT-2 could have achieved a few of their demos; they could have tried that as soon as November 2019. But probably not. GPT-2 had a 1024 token context window, which would have been too small for the contextual awareness which most of these demos rely upon. Maybe they could have made it work with many-shot GPT-3 prompting as soon as its invite-only availability in June 2020. But I doubt they would have gotten far before InstructGPT at the earliest, in January 2022. That’s awfully recent! And prior to that release, very few technologists had internalized the astonishing acceleration of language transformers’ capabilities. I doubt the Humane folks had.

In summary: what unbelievably fortunate timing for the Humane team! If InstructGPT had taken two years longer, would they have been able to sustain their funding? Would they have released some “Siri++”-like thing instead? Could they have survived that?

Versus AirPods and an Apple Watch

One problem for the Ai Pin is: if what I find compelling is that it offers an intelligent, context-laden, voice-driven AI assistant… then does that justify a significant hardware purchase? Is it a defensible moat?

I often keep an AirPod in all day while walking around. I can access a voice-driven AI assistant via an AirPod. Of course, an AirPod can’t show me visuals—there’s no laser ink display—and only I can hear the audio responses, which you could argue cuts me off from people around me in a way that the Ai Pin does not. But, OK: let’s consider the combination of the AirPods with an Apple Watch. Now I’ve got a visual display with similar I/O limitations: you only want to use it a few seconds at a time, it’s a bit anti-social to consult, and it accepts only limited gestural input. And my watch has a speaker which could emit responses to friends if I’m in a social setting. (The remaining important difference in basic functionality is a camera—but the Ai Pin’s camera doesn’t seem to expand its capacity much. Maybe this difference will become decisive in time; I can certainly imagine that.)

The main problem with the AirPod and Apple Watch combination is that they’re made by Apple, and Apple isn’t presently participating in frontier language model applications. Apple strikes me as exceedingly unlikely to partner with OpenAI, given its privacy stance. I expect that Apple will continue to lose in competition for top-tier ML talent. Maybe they can replicate something like GPT-4 with their internal teams, but given that Anthropic and Google haven’t managed to do so after eight months, I expect this will take some time. Meanwhile, Apple won’t let third-party apps like Dot have enough access to the hardware to create interactions which have as little friction as the Ai Pin’s.

But if that litany weren’t true—if Apple were competitive in the AI race, or if its platforms were more open—it’s hard to imagine that I’d buy an Ai Pin if I already had AirPods and an Apple Watch.

What if I didn’t have those devices already? Well, I’d want Bluetooth headphones of some kind. In Humane’s advertisements, they show people playing music on the Pin’s speakers. I hate when people play personal audio in public places. So I’d want to buy something like the AirPods anyway. As far as the Watch versus the Pin, I suppose it’s mostly a question of form factor. I find the watch form factor to be a less obtrusive placement for a wearable, but I can imagine that others would prefer the pin’s placement. It’s great that the pin’s placement enables vision-based workflows. But I assume that I’d continue to carry my iPhone, even if I had an Ai Pin. And so I can imagine a workflow where (wearing an AirPod), I say “Siri, what’s this?” and then pull my phone out of my pocket to hold it up to some object for a moment. A little more friction, perhaps, but it doesn’t seem like a clear dealbreaker to me. And of course, in this configuration, I can take photos with more intentional composition and a vastly more capable camera.

Taken together, my rough take is that the Ai Pin is compelling as an independent hardware purchase only due to Apple’s cultural problems and platform policies. I don’t think the device hardware itself is really necessary beyond the existing hardware already available to consumers—the unique capabilities are mostly a matter of integrated design execution and software. That’s a scary position to be in!

Does Humane succeed in its stated goal?

Humane aspires to help people remain more connected and present—with each other, in the world, not staring at their screens. Does it succeed? I’ll need to try the device to say, really. My high-level impression is that it offers a less obtrusive avenue to input and output for many standard tasks than a smartphone, but it’s not obvious to me that it performs better in that regard than an AirPod. Or, if it does, that’s in large part only because of Apple’s AI limitations (see previous section). The laser ink display doesn’t strike me as a victory for staying present: it looks uncomfortable and anti-social. Likewise, public audio input and output seems pretty obtrusive to me. For most applications, I’d prefer private audio output and—ideally—a Silent speech interface for input.

I look forward to updating this once I’ve had a chance to try the device!

2023-11-09 marketing impressions

I’m honestly quite shocked by the contrast between Humane’s 10 minute marketing video and Dot’s marketing story. Humane’s is quite technology-centric, thing-centric. The first sentences? “It’s a standalone device and software platform built from the ground up for AI. It comes in three colorways… There’s two pieces, a computer and a battery booster. Now, the battery booster powers a small battery inside the main computer… This is a perpetual power system… Built right in, our own Humane network, connected by T-Mobile… It runs a Qualcomm Snapdragon chip set…” What?!

Tell me stories! Tell me about how this helps me live a better life! Contrast that with Dot’s warm and human story of Mei and her journey. And gosh: the delivery is bafflingly low-affect, low-energy. Imran, friend, are you OK? And why are we looking at this device in your extremely cold and sterile lab, rather than out in the world, the one it’s supposed to help us remain connected with? I’m a fan of Humane’s ideas, and of Imran and Bethany and Ken, but I can’t fathom why they made their headline introductory video this way, or why they thought this was an attractive way to present all their hard work.

The marketing web page likewise begins with technology: laser ink display, gestures, touch-and-hold, etc. It improves from there, and ends up in a better place than the 10 minute video, showing some authentic scenarios in which the Ai Mic and other related features would be useful in life.

Humane’s 1 minute video is much better by contrast, much more centered on reality and life. There are some very good bits, particularly a “catch me up” moment, a mother taking a photo of a child, and a bilingual interpreter. It does come off as a bit of a grab bag of features, rather than a unified vision, and I don’t always understand the curatorial direction. For instance, it strikes me as odd that “nutritional facts” is the first thing emphasized in the film debuting this device. High wind alerts?

One of the demonstrations is “what are some fun things to do nearby?” I’m not sure that the Ai Pin’s form factor really shines in this query. This task lends itself to laterality and visuals—it’s best answered by quickly scanning a big list, with imagery and information density, and pointing at items of interest to dig into. That’s not something that the voice and laser ink combo do all that well. I think the people in the video would be better off with a traditional smartphone display for this task.