From my notes: first impressions of Humane's Ai Pin
Added 2023-11-14 18:14:20 +0000 UTC
I spent some time this weekend writing notes around the Ai Pin and recent related announcements in AI-centric personal computing. This isn't a proper essay, just some rough notes, but I thought enough of you might be interested that I'd share.
Humane
From Imran and Bethany and Ken Kocienda and friends, a new personal computer which de-emphasizes screens and direct interaction in favor of ubiquitous computing ideas and AI-centrism. The company's first product is the "Ai Pin", announced 2023-11-09.
Ai Pin
The device is a wearable pin with a camera, microphone, speaker, touch pad, and laser projector, along with various other smaller sensors and components. The intent is to enable a mostly screen-free computing experience, driven primarily by voice.
The primary interaction pattern with the device is to press and hold its face plate, and to speak a query aloud. What distinguishes this interaction from Siri and its ilk? The query is evaluated not via GOFAI-style heuristic trees, but rather via GPT-4 (or maybe sometimes 3.5-turbo?), with retrieval-augmented generation through appropriate contextual information like location and user data (calendar, messages, contacts, recent queries, etc). It also appears to implement something like the Reason+Act pattern, so that for instance information about music or specific locations can be supplied to response generation; the model can output action plans which include things like sending a message, playing a song, and so on.
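To make that guess concrete, here is a minimal sketch of one retrieve-then-act step. It is not Humane's implementation, and it is a single step rather than a full Reason+Act loop. Every name in it (`Context`, `build_prompt`, `stub_model`, the `ACTION`/`ANSWER` reply format, the two tools) is a hypothetical stand-in, and the model call is stubbed out so the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical user context gathered by the device."""
    location: str
    calendar: list[str]


def build_prompt(query: str, ctx: Context) -> str:
    """Retrieval step: put relevant context in front of the user's query."""
    lines = [f"Location: {ctx.location}"]
    lines += [f"Calendar: {event}" for event in ctx.calendar]
    lines.append(f"User: {query}")
    return "\n".join(lines)


# Hypothetical actions the model is allowed to request.
def play_song(title: str) -> str:
    return f"Playing {title}"


def send_message(text: str) -> str:
    return f"Sent: {text}"


TOOLS = {"play_song": play_song, "send_message": send_message}


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call. Replies 'ACTION tool: argument' or 'ANSWER text'."""
    user_line = prompt.splitlines()[-1]
    if "play" in user_line.lower():
        return "ACTION play_song: Example Song"
    return "ANSWER I'm not sure."


def run(query: str, ctx: Context, model=stub_model) -> str:
    """Build the prompt, ask the model, and carry out any action it requests."""
    reply = model(build_prompt(query, ctx))
    if reply.startswith("ACTION "):
        name, _, argument = reply[len("ACTION "):].partition(": ")
        return TOOLS[name](argument)
    return reply[len("ANSWER "):]
```

A real system would feed the result of each action back to the model and loop; this sketch stops after one step to keep the shape visible.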
The device has a camera, but (as of 2023-11-09) its integration into queries appears to be limited to questions about nutrition and food ("How much protein is in this?"). Otherwise, it can take impromptu photos, though of course the composition will be somewhat unpredictable.
The device's primary output modality is audio, but it also features limited visual output via a monochromatic "laser ink" system, projected onto the user's hand. We've been shown a minimal framework of gestural interaction: you can pinch your hand to actuate, tilt your hand to select a secondary function via a pie-menu-like design, and move your hand in Z to access system menus. I presume that it would be uncomfortable to use this for more than a few seconds at a time, but that's compatible with the device's intent to keep users present in their environment.
Queries, actions, photos, reminders, and other key items are made available as a stream of memories via the Humane.Center web site.
"We don't do apps"
Actions begin with natural-language queries, so there's no traditional home screen or app switcher. Instead, the device appears to use something closer to the blackboard pattern: various services can supply information and actions to the query, as contextually appropriate (again, presumably through a combination of retrieval-augmented generation and something like the Reason+Act pattern).
One can imagine retrieving appropriate information based on one's context, so that (for instance) a transit service could supply location-appropriate timing information, or a museum could supply information about itself while you're visiting. The blackboard pattern also permits distributed sourcing of information and actions: perhaps your local restaurant could have a small Bluetooth device which broadcasts its menu so that it would be available for appropriate local queries.
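As a rough illustration of the blackboard idea, here is a small sketch in which independent sources look at a shared query and context and post whatever they think is relevant. The `Blackboard` class, the two sources, and the strings they post are all made up for illustration; nothing here reflects how Humane's platform actually works.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    """Shared workspace: the query, its context, and what sources contribute."""
    query: str
    location: str
    entries: list[tuple[str, str]] = field(default_factory=list)


def transit_source(board: Blackboard) -> None:
    """Hypothetical transit service: contributes only to transit-like queries."""
    text = board.query.lower()
    if "train" in text or "bus" in text:
        # Placeholder text, not real schedule data.
        board.entries.append(("transit", f"Departures near {board.location}"))


def museum_source(board: Blackboard) -> None:
    """Hypothetical venue service: contributes only when the user is on site."""
    if board.location == "museum":
        board.entries.append(("museum", "Information about the current exhibit"))


def collect(query: str, location: str, sources) -> list[tuple[str, str]]:
    """Let every source inspect the board and post what it considers relevant."""
    board = Blackboard(query, location)
    for source in sources:
        source(board)
    return board.entries
```

The point of the pattern is that `collect` knows nothing about any particular service: adding a restaurant's menu broadcaster would mean adding one more source to the list, and the collected entries could then be handed to the language model as context.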
But "we don't do apps" clearly isn't the full story. For example, the music player has an interface on the laser ink display. What defines that interface? And where does the music come from? In this instance, the answer appears to be Tidal, which has partnered with Humane. But I have a Spotify subscription; how might that be made to work? Perhaps Spotify (and other music players) could expose a suitable API which would permit the device to query and play music. Perhaps that API could even transmit declaratively-specified interfaces for the laser ink display.
Likewise, consider the text messaging service. Many of my friends use WhatsApp or Signal. Perhaps on this device we think of those systems as API endpoints which expose suitable information and actions to the device's LLM queries, and perhaps declarative interface specifications for service-specific presentations and interactions. I see from Humane's web site that they have a partnership with Slack; I wonder if that necessitates extensive first-party work on their part, or if it's more of a permission-granting relationship. That is: is it more like the original iOS YouTube app, or like iOS VoIP apps (which must be Apple-blessed IIRC)?
Finally: suppose I want to use the device to think aloud about my research as I'm walking around. It would be natural to query my research notes, which are stored on my laptop as a big folder of Markdown files. Perhaps this information could be synced into Humane's cloud so that appropriate content could be supplied to conversation in the future.
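For a sense of what "supplying appropriate content" might involve, here is a deliberately naive sketch: load a folder of Markdown notes and rank them by how many words they share with a spoken query. The function names are hypothetical, and plain word overlap is a stand-in I've swapped in for whatever retrieval a real system would use (most likely embeddings).

```python
import re
from pathlib import Path


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def load_notes(folder: str) -> dict[str, str]:
    """Read every Markdown file under `folder` into a name -> text mapping."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).rglob("*.md"))
    }


def rank_notes(notes: dict[str, str], query: str, k: int = 3) -> list[str]:
    """Return up to k note names, ordered by word overlap with the query."""
    query_words = tokenize(query)
    scored = []
    for name, text in notes.items():
        overlap = len(query_words & tokenize(text))
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]
```

Usage would look like `rank_notes(load_notes("path/to/notes"), "what did I write about spaced repetition?")`, with the top few notes then included as context for the model.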
I'm extremely curious to learn more about the architectural design of the platform. I would guess that for now, in the name of shipping, it's fairly ad-hoc and special-cased, but that they have some more systematic plans in mind.
The power of context and deixis
One naive way to look at the Ai Pin is to ask: how is this different from Siri? One first and important difference, of course, is that it's powered by GPT-4, which is capable of open-ended natural reasoning, while Siri is more like one of those hideous phone tree systems that tries (and often fails) to use simple heuristics and decision trees to map your request onto some limited action space it supports.
But a much more interesting difference is the way you can use words like "this". "How much protein is in this?" "What's the best way to get to the meeting this afternoon?" "When was this park built?" You can point to things, or times, or places, and the device's sensors and prior activity surface the context necessary to make answers possible via retrieval-augmented generation.
The device's responses to queries which include the word "I" or "my" are likewise enriched: "Have I been there before?" "Don't my notes include something about this?" I imagine that many people will end up giving basically unlimited context about their lives to these models: e.g. via Rewind, every piece of text I've ever viewed on a screen.
This same logic is what makes Dot interesting to me.
On the astounding luck of Humane's timing
Humane was founded in 2018, a year before GPT-2 showed that first glimmer of promise which most technologists (including me) still largely ignored. At the time, when I heard about the company's ambitions, the main goal seemed to be to reimagine the personal computer without screens, so that people could remain connected and present in their worlds. The emphasis in the patents and fundraising hype was on the laser projection display and the wearable pin form factor. Insofar as AI featured in job reqs, the emphasis seemed to be on the computer vision which enabled the laser ink gestures. We heard talk of voice-based inquiries, but the aspirations seemed closer to something like "Siri++" than to a weak AGI.
But now, in 2023, what's interesting about this device (to me) bears little resemblance to any of that. As far as I can tell, this device will succeed or fail because it deploys a Siri which actually works, a Siri which uses a frontier LLM and supplies it with gobs of appropriate context. The laser ink display isn't nearly as central as the early hype suggested. I haven't used an Ai Pin, of course, but my impression is that if the laser ink projector weren't present at all, the device would be worse off, but its ultimate success or failure would not change. Likewise, the computer vision doesn't seem terribly essential (yet; perhaps deixis will extend more substantially to vision in the future). And the wearable pin form factor doesn't seem very important either; it's easy to imagine a very similar device as an earbud.
All this to say: the extraordinary work of many brilliant designers and engineers notwithstanding, I believe this device will owe its success (if it succeeds) to the shocking capabilities of GPT-4. If we were still in 2018's NLP days, and they could ship something only roughly Siri-level, the Ai Pin would amount to something like an extraordinarily expensive and cumbersome AirPod alternative. And so, my gosh: the timing! They got so, so incredibly lucky! They can't have predicted in 2018 that this much progress was going to happen by 2023; and if they did make that prediction, it would have been a pretty irresponsible bet.
How long might it have been clear that these kinds of results were possible? Maybe a fine-tuned GPT-2 could have achieved a few of their demos; they could have tried that as soon as November 2019. But probably not. GPT-2 had a 1024 token context window, which would have been too small for the contextual awareness which most of these demos rely upon. Maybe they could have made it work with many-shot GPT-3 prompting as soon as its invite-only availability in June 2020. But I doubt they would have gotten far before InstructGPT at the earliest, in January 2022. That's awfully recent! And prior to that release, very few technologists had internalized the astonishing acceleration of language transformers' capabilities. I doubt the Humane folks had.
In summary: what unbelievably fortunate timing for the Humane team! If InstructGPT had taken two years longer, would they have been able to sustain their funding? Would they have released some "Siri++"-like thing instead? Could they have survived that?
Versus AirPods and an Apple Watch
One problem for the Ai Pin is: if what I find compelling is that it offers an intelligent, context-laden, voice-driven AI assistant… then does that justify a significant hardware purchase? Is it a defensible moat?
I often keep an AirPod in all day while walking around. I can access a voice-driven AI assistant via an AirPod. Of course, an AirPod can't show me visuals (there's no laser ink display) and only I can hear the audio responses, which you could argue cuts me off from people around me in a way that the Ai Pin does not. But, OK: let's consider the combination of the AirPods with an Apple Watch. Now I've got a visual display with similar I/O limitations: you only want to use it a few seconds at a time, it's a bit anti-social to consult, and it accepts only limited gestural input. And my watch has a speaker which could emit responses to friends if I'm in a social setting. (The remaining important difference in basic functionality is a camera, but the Ai Pin's camera doesn't seem to expand its capacity much. Maybe this difference will become decisive in time; I can certainly imagine that.)
The main problem with the AirPod and Apple Watch combination is that they're made by Apple, and Apple isn't presently participating in frontier language model applications. Apple strikes me as exceedingly unlikely to partner with OpenAI, given its privacy stance. I expect that Apple will continue to lose in competition for top-tier ML talent. Maybe they can replicate something like GPT-4 with their internal teams, but given that Anthropic and Google haven't managed to do so after eight months, I expect this will take some time. Meanwhile, Apple won't let third-party apps like Dot have enough access to the hardware to create interactions which have as little friction as the Ai Pin's.
But if that litany weren't true (if Apple were competitive in the AI race, or if its platforms were more open) it's hard to imagine that I'd buy an Ai Pin if I already had AirPods and an Apple Watch.
What if I didn't have those devices already? Well, I'd want Bluetooth headphones of some kind. In Humane's advertisements, they show people playing music on the Pin's speakers. I hate when people play personal audio in public places. So I'd want to buy something like the AirPods anyway. As far as the Watch versus the Pin, I suppose it's mostly a question of form factor. I find the watch form factor to be a less obtrusive placement for a wearable, but I can imagine that others would prefer the pin's placement. It's great that the pin's placement enables vision-based workflows. But I assume that I'd continue to carry my iPhone, even if I had an Ai Pin. And so I can imagine a workflow where (wearing an AirPod), I say "Siri, what's this?" and then pull my phone out of my pocket to hold it up to some object for a moment. A little more friction, perhaps, but it doesn't seem like a clear dealbreaker to me. And of course, in this configuration, I can take photos with more intentional composition and a vastly more capable camera.
Taken together, my rough take is that the Ai Pin is compelling as an independent hardware purchase only due to Apple's cultural problems and platform policies. I don't think the device hardware itself is really necessary beyond the existing hardware already available to consumers: the unique capabilities are mostly a matter of integrated design execution and software. That's a scary position to be in!
Does Humane succeed in its stated goal?
Humane aspires to help people remain more connected and present: with each other, in the world, not staring at their screens. Does it succeed? I'll need to try the device to say, really. My high-level impression is that it offers a less obtrusive avenue to input and output for many standard tasks than a smartphone, but it's not obvious to me that it performs better in that regard than an AirPod. Or, if it does, that's in large part only because of Apple's AI limitations (see previous section). The laser ink display doesn't strike me as a victory for staying present: it looks uncomfortable and anti-social. Likewise, public audio input and output seems pretty obtrusive to me. For most applications, I'd prefer private audio output and, ideally, a silent speech interface for input.
I look forward to updating this once I've had a chance to try the device!
2023-11-09 marketing impressions
I'm honestly quite shocked by the contrast between Humane's 10-minute marketing video and Dot's marketing story. Humane's is quite technology-centric, thing-centric. The first sentences? "It's a standalone device and software platform built from the ground up for AI. It comes in three colorways… There's two pieces, a computer and a battery booster. Now, the battery booster powers a small battery inside the main computer… This is a perpetual power system… Built right in, our own Humane network, connected by T-Mobile… It runs a Qualcomm Snapdragon chip set…" What?!
Tell me stories! Tell me about how this helps me live a better life! Contrast this with Dot's warm and human story of Mei and her journey. And gosh: the delivery is bafflingly low-affect, low-energy. Imran, friend, are you OK? And why are we looking at this device in your extremely cold and sterile lab, rather than out in the world, the one it's supposed to help us remain connected with? I'm a fan of Humane's ideas, and of Imran and Bethany and Ken, but I can't fathom why they made their headline introductory video this way, or why they thought this was an attractive way to present all their hard work.
The marketing web page likewise begins with technology: laser ink display, gestures, touch-and-hold, etc. It improves from there, and ends up in a better place than the 10-minute video, showing some authentic scenarios in which the Ai Mic and other related features would be useful in life.
Humane's 1-minute video is much better by contrast, much more centered on reality and life. There are some very good bits, particularly a "catch me up" moment, a mother taking a photo of a child, and a bilingual interpreter. It does come off as a bit of a grab bag of features, rather than a unified vision, and I don't always understand the curatorial direction. For instance, it strikes me as odd that "nutritional facts" is the first thing emphasized in the film debuting this device. High wind alerts?
One of the demonstrations is "what are some fun things to do nearby?" I'm not sure that the Ai Pin's form factor really shines in this query. This task lends itself to laterality and visuals: it's best answered by quickly scanning a big list, with imagery and information density, and pointing at items of interest to dig into. That's not something that the voice and laser ink combo does all that well. I think the people in the video would be better off with a traditional smartphone display for this task.
Comments
Interestingly, Spotify has shown itself very willing to open up control in this way! They've done integrations with smart home speakers, controlled by voice assistants. You're right, though, that this seems tricky for many businesses' models.
Andy Matuschak
2023-11-17 17:53:03 +0000 UTC
Really enjoyed your notes on this topic. I was also not moved by the demo and thought along the same lines: that many of its capabilities could be achieved through better LLM integration with our current smartphones. I have a ChatGPT shortcut on my iPhone's lock screen, and that does the job just fine whenever I need to interact with an LLM. I could also see other apps, or maybe a super app based on LLMs, having the same location on my lock screen. The question I have, however, is how willing companies like Spotify are to essentially completely open their API and get rid of their user interfaces while being used by apps like this. So much of the stickiness and control is handled by the UI, and completely handing over this aspect of their business seems unlikely.
Ahmad Rachid El-Bobou
2023-11-17 13:33:01 +0000 UTC
As a fellow computer enthusiast I am also less excited about this than I want to be. Like I should be ordering one (and maybe I still will) but none of the demos made me desire it, and I agree that the interaction model is pretty anti-social, and the marketing was low vibes. Makes me sad because I desperately want new computers...
Andrew Sutherland
2023-11-16 03:03:54 +0000 UTC