Background: researched this space for a graduate degree.
There are a few issues that are unanswered by this video (which isn't intended to be a technical deep dive, but I don't see any related links in the video description):
1. How do these glasses handle multiple simultaneous speakers? Based on the display I saw, it shows the speakers' words sequentially, which starts to fall apart in real-world environments, especially group conversations. This is a big problem, and wider adoption is contingent on handling this elegantly.
2. These appear to be the classic "smart glasses" display style that's pervasive in consumer head-worn displays today, where content is projected at a fixed depth in front of the wearer. Because the captions aren't anchored at the same focal distance as the speaker, the wearer's eyes will swap between the captions and the speaker's faces, which is a tiring activity, and can make the wearer feel like they're not part of the conversation or being rude.
3. As mentioned by another commenter, this is a useful idea for people who lose their hearing later in life. That said, this is less (although certainly still) useful for people who have congenital hearing loss and primarily communicate via ASL.
All in all, it's exciting to see growing interest in this space, as it's easily extendable to people learning a new language or navigating a foreign country. I think offloading the speech-to-text to a tethered mobile device is a good choice (though it would be nice to do low-latency wireless transmission).
1. For multiple simultaneous speakers of comparable volume, it’s only as good as the underlying speech-to-text engines we’ve implemented/integrated, which are currently not very good at this. It’s an active area of research and engineering for us, and we believe we’ll make strides; but, as you rightly point out, solving the crosstalk problem is very difficult. For the more general so-called "cocktail party problem", we can do a good job of filtering out more distant/lower-volume voices and other environmental noise. Choosing the right microphone can improve things further, for example by pairing a noise-canceling Bluetooth lapel mic.
2. We allow one to project the subtitles at varying depths, within the capabilities of the glasses. We're seeing an effective focal depth range of about 0.5m to 3m at a fixed apparent size. If one also allows the apparent size to change, to simulate perspective scaling, the range is wider.
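A tiny sketch of the perspective-scaling idea mentioned in point 2: if captions should feel anchored at a given depth rather than painted at a fixed angular size, the rendered size can shrink in proportion to distance, as a physical sign of constant size would. This is just the standard pinhole relation, not XRAI's actual rendering math; `subtitle_scale` and the reference depth are illustrative assumptions.

```python
def subtitle_scale(depth_m, ref_depth_m=1.0):
    """Scale factor for caption rendering that simulates perspective.

    At a fixed apparent (angular) size, text looks identical at any
    projection depth. To make captions feel anchored at `depth_m`,
    shrink the angular size in proportion to distance, like a real
    object of constant physical size. (Hypothetical helper, not part
    of any real rendering API.)
    """
    return ref_depth_m / depth_m

# Text "anchored" at 3 m renders at a third of its 1 m angular size;
# text at 0.5 m renders at twice that size.
```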
Seems like the multiple microphone beamforming source separation algorithms are getting pretty good these days, maybe just adding a lot more mics would help?
Could you have an AI model that extracts some characteristics of the speaker's voice for each individual word, then translates that to color and font?
If the model was not confident about a word, it could be shown slightly blurred; if it was loud, it could be bold. Perhaps (although there are some stereotype issues) you could use different fonts for different pitches; whispers could be grey, quiet speech could be transparent.
Maybe there's a language model that can pick up overlapping words if you don't have the constraint of needing to sort out who said what: just show all the possible words that could have been said by anyone, stacked together in a "not sure" color, and maybe the wearer would eventually learn to figure it out without much effort?
You could also try to stay consistent so the same speaker gets the same colors if possible, and avoid reusing recently used colors for new speakers, to best make use of the limited bits of data in font and color.
Maybe just by showing all the words from every speaker all together like that, the wearer would be able to figure it out even if it made mistakes in the speaker identification?
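For what it's worth, the word-styling ideas above are easy to prototype once you have per-word features from the recognizer. A minimal sketch, with all thresholds invented for illustration (`style_word` and its inputs are hypothetical, not any real captioning API):

```python
def style_word(text, confidence, loudness_db, pitch_hz):
    """Map per-word ASR/acoustic features to display styling,
    roughly as proposed above: low confidence -> blur, loud -> bold,
    very quiet -> grey, pitch -> font choice. Every threshold here
    is a made-up illustration value; a real UI would tune these
    with users.
    """
    style = {"text": text}
    style["blur"] = confidence < 0.6       # recognizer unsure -> blurred
    style["bold"] = loudness_db > -10.0    # loud speech -> bold
    if loudness_db < -40.0:
        style["color"] = "grey"            # whisper-quiet -> grey
    else:
        style["color"] = "white"
    # Crude pitch split; real designs would avoid stereotyped mappings.
    style["font"] = "serif" if pitch_hz < 150.0 else "sans-serif"
    return style
```

A normal loud word, e.g. `style_word("hello", 0.95, -5.0, 120.0)`, comes back bold and unblurred, while an uncertain whispered word gets the blurred grey treatment.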
Regarding additional mics: yes, that enables more advanced spatial processing, especially when the mics are arranged in a known and calibrated geometry. One could imagine glasses with several mics placed at optimal locations on the frames. Such multichannel audio could then be processed into multichannel spatial audio streams.
As for the rest, thank you for the wonderful ideas! Everything you propose is technically possible. The difficulties arise first in assessing the increased benefit to users versus the increased complexity of the user experience, and second in prioritizing the work against other features. Over time we do hope to add additional selectable “skins”, which is to say different UI designs, so users can choose from a wide range: everything from simple to complex layouts, from accessible to exotic color palettes, from professional to playful themes, etc. I could definitely see more advanced visual representations of transcription uncertainty showing up in such optional skins.
Wow! Thanks for the response. This is exciting work, and I’m pleased to see it being iterated on.
Re: #2. I’m assuming the varying depth is manually-controlled? Or is it automated by some method? If it’s manual, can the adjustment be made while transcription is active? In other words, can I change the focal distance to match the speaker without interrupting the speaker?
All in all, cool stuff! Best of luck with the work.
Indeed, as Dan said, one can change the depth on the fly. In fact, one goal of the development team is to make as much functionality as possible changeable on the fly. For example, you can currently change subtitle depth, pinning, and size, spoken language, subtitle language, microphones, and audio settings, all on the fly.
I’d love for every setting and feature to support on-the-fly changes. That said, some things are currently fixed for a session, such as audio recording, and some third-party software we use is less dynamic and less forgiving of changes on the fly. For better or worse, in today's software world, the sage advice of The IT Crowd (“Have you tried turning it off and on again?”) still seems to hold with pragmatic force. And it still holds with XRAI ... sometimes ;-)
I imagine it’d substantially increase the compute load, but I’d be curious whether you could use multiple microphones and beamforming to separate out the streams of speech and feed them to the speech-to-text algorithm independently.
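For reference, the simplest version of this is a delay-and-sum beamformer: delay each microphone channel so that sound from a chosen direction lines up in time, then average, so the steered direction adds coherently while other directions partially cancel. A toy NumPy sketch, assuming integer-sample steering delays (real systems use fractional delays and adaptive weights such as MVDR):

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Minimal delay-and-sum beamformer.

    channels: list of 1-D numpy arrays, one per microphone.
    delays:   per-channel integer sample delays that align a source
              arriving from the steered direction.

    Note: np.roll wraps circularly at the array edges, which is fine
    for a sketch but would need proper padding in real use.
    """
    n = min(len(c) for c in channels)
    out = np.zeros(n)
    for ch, d in zip(channels, delays):
        out += np.roll(ch[:n], -d)  # advance this channel by d samples
    return out / len(channels)
```

With two channels carrying the same signal offset by 5 samples, steering delays of `[0, 5]` realign them and the output matches the clean signal; mismatched delays smear it instead.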
Single-mic source separation is possible in an unsupervised manner today that could probably work better than beamforming both compute-wise and with regard to implementation difficulty (you'd just need a lot of recordings to represent the space of sounds you want to separate).
Do you have any references for this or a link to a commercial service? I'm currently in the process of trying to extract some background voices in a video (an interview where the faint background conversation is in English and the loud overdub is in Bulgarian). I tried Melodyne, but it seems to separate only on pitch, not volume, and the pitches are too similar (mono, three voices, all female); words are made of lots of short phonemes, each of which creates a "note", which makes editing impossible. I looked into iZotope RX as well, and it does not seem capable of doing this either. There are services that can automatically add subtitles using speech-to-text translators, but they are expensive, and I'd prefer to have the background voices themselves rather than the Bulgarian interpretation of them translated back into English. It seems possible to do: in Peter Jackson's The Beatles: Get Back they were able to separate voices from loud foreground musical sounds and other ambient talking, but that technology was custom and doesn't seem to be publicly available yet.
I haven't seen it offered as a commercial service or free model yet, but there is open source code for MixIT that lets you train using the open source / canned FUSS dataset.
Maybe a combination? Even simple beamforming/stereo would be helpful to help display the speaker's location. For example, the "speaker 1" tag could appear on the left, center, or right of the display to give a spatial clue where they are located.
You could only display the speaker's location if you had some way to associate audio streams with individuals. So you'd have to train an audio-visual association model.
The thing you could do is train a localizer on the separated audio. Phase is estimated by the source separation process, so you can actually train an ML model provided you have some ground truth (e.g. estimated human locations from camera detections)
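As a toy illustration of the stereo-cue idea upthread (not a trained localizer): with just two mics you can estimate the inter-channel time delay by cross-correlation and bucket it into left/center/right for the on-screen speaker tag. The ~14 cm spacing (roughly glasses-temple width) and the 0.3 threshold are assumptions made up for this sketch:

```python
import numpy as np

def speaker_side(left, right, sr, mic_spacing_m=0.14):
    """Crude left/center/right cue from a two-mic pair.

    Estimates the inter-channel delay via cross-correlation. With
    numpy's convention, the peak falls at a negative lag when the
    left mic hears the signal first (source on the left), positive
    when the right mic leads. Illustrative thresholds only; a real
    localizer would use a calibrated array and ML as described above.
    """
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)  # in samples
    delay_s = lag / sr
    max_delay = mic_spacing_m / 343.0  # path difference / speed of sound
    if delay_s < -0.3 * max_delay:
        return "left"
    if delay_s > 0.3 * max_delay:
        return "right"
    return "center"
```

Feeding it a noise burst that reaches one mic a few samples before the other returns the corresponding side, and identical channels return "center".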
Do you think a ChatGPT-style real-time processing engine will soon be able to parse the crosstalk, detect words with probability scores, and then reorder/reassemble them into a reasonably accurate conversation displayed to the glasses wearer?
This would be such a cool use case for the latest ChatGPT tech when it gets faster in the near future.
> For multiple simultaneous speakers of comparable volume
If you had more microphones placed at multiple spots on the glasses, say up to five: one in the center, two on the frames/end pieces, and perhaps two more on the arms/temples. Then you'd be able to catch conversation coming at the wearer from behind, from the sides, or directly in front.
This is a classic curbcut in the sense that it will help those with heading as much (if not more than) the D/deaf community. Still very excited for it - agree with all your questions and concerns.
As usual, this marketing seems most directed at normate ideas about what disabled people want / need, but the tech seems very cool and there does seem to be potential. Without looking into the product deeply it seems like there are D/deaf people on the team, which gives me hope.
I do wish that we would just embrace the idea that using machines to make information available in many mediums is something all people can use and appreciate.
To make the comment easier to read for others, it looks like the commenter may have made a slight typo and meant to write "hearing" instead of heading, to convey:
> "This is a classic curbcut in the sense that it will help those with [edit: hearing] as much (if not more than)..."
Is multiple simultaneous speakers at the same volume/distance actually an important problem to solve? I already can't have a conversation if that's happening and my hearing is fine.
Let’s say you’re in a meeting with colleagues who are having a side discussion about something you could contribute to. In the current setup, that conversation would not be displayed, whereas a hearing person could context-switch on a dime.
It’s a decently common occurrence. There’s a significant amount of literature pointing to the fact that people who are DHH avoid group conversations because they can’t follow along (see references). This goes hand-in-hand with speech localization, which is another unsolved problem (and likely is unsolvable without true AR).
I was about to make the same comment: my brain works the same way. A one-on-one conversation with low ambient noise, no problem. A bunch of loud conversations all around me? Forget it, I can't understand anything. I don't see why an aid for hearing-impaired people needs to handle this at all.
This is an extremely common situation at many restaurants and is a reason I don't really enjoy going out to eat with friends without prior investigation.
Is low latency wireless transmission feasible now?
The last time I worked on anything here, there were a number of problems with transmitting highly compressed, low resolution video data. The consumer devices could not handle just sending the packets. In my project we were annotating real time video and only sending back the annotations, but even that would cause devices to overheat and the applications to fail in really interesting ways.
I have a different use case, involving wearing them in the house: listen to what my girlfriend says, use some ML to analyze whether I need to know it and if so, put it on the display.
She has a different "speech mode" than I: she speaks while I'm reading or washing the dishes or whatever, and sometimes it's to herself, sometimes to Siri, and sometimes it's something she wants me to know.
That would still require him to listen to his girlfriend talk. With all the technological innovation available today, nobody should have to listen to their girlfriends talk.
Do you have any idea how hard it is to train partners to use your name once they fall out of practice? It's like there's a negative reinforcement stimulus applied every time they do use your name.
I noticed that my wife stopped using my name when it became more ambiguous as to whom she was talking, because our kids are now older and conversations (really instructions and queries) are at the "adult content" level rather than the "child content" level.
There’s a significant amount of evidence pointing to the fact that “anchored” captions (a la speech bubbles, like a comic book) is a preferred way to render captions, assuming the bubbles are at the same focal distance as the speaker. This is solvable with “true” AR, but getting that compute into the lightweight form factor of the original video is an unsolved problem, and is a ways off.
> 3. As mentioned by another commenter, this is a useful idea for people who lose their hearing later in life. That said, this is less (although certainly still) useful for people who have congenital hearing loss and primarily communicate via ASL.
Someone primarily communicates with ASL, and then there's me, who doesn't know ASL. I can speak to them, and they can read what I've spoken. That works pretty well. They communicate with me via text-to-speech, or (I guess in the near future) ASL-to-speech, however that will work.
ASL is heavily inspired by English, so it's usually very easy for people to become basically conversant in it rather quickly. For signs you don't know, there's literal "finger spelling" that's part of the language, so conversational learning is greatly aided by this.
Aside from that, you could do this anyway. When I first started living with a deaf person, I just wrote things down on paper, and they wrote back. The glasses have some benefits over paper, but significant drawbacks as well, and many deaf people I've met have been trained to read lips.
I would disagree! It takes time and resources to learn ASL or even fingerspelling. This would allow people without the time/resources to communicate effortlessly, which I would say is a marked improvement. That said, you could argue it would be a one-way conversation, which is a fair point.
I'm a hearing person and I've spent a summer interning in a 50/50 mixed Deaf and hearing research group.
My take is that this is a huge UI improvement for AI speech to text, which a lot of Deaf people are already using to listen to conversations. It seems particularity great because it allows this technology to provide situational awareness while, for example, walking.
It's important to remember, though, that for conversations where you're trying to include a Deaf person who isn't good at speaking or chooses not to speak, speech-to-text is a fundamentally unequal communication modality. They will be able to "receive" but they won't be able to "transmit", which makes for extremely lopsided conversations. There is no substitute for taking the time to learn a sign language or to have conversations via writing (writing is no substitute for sign language, as it requires a lot of patience from both parties).
We get this wrong even for hearing people in a lot of situations. Conference calls in particular, the people in the room have a different experience from those on the call. Side conversations in a single room can be disruptive, but over a phone the secondary conversation and the primary can turn into an unintelligible mash.
Hard of hearing people have the same problem in person. We aren't really at a place yet where someone wearing hearing aids still has 3d hearing. So like on a conference call, they can't figure out what's going on when four people are talking at once.
I know a partially deaf kid who prefers socializing on Discord, because of this. Everyone is equal and all conversations have to be 2 people at a time.
Translating sign languages to spoken languages is really, really hard. Sign languages have a lot of features that don't really show up in spoken languages.
Let's take an example: the sign for "send email" in ASL (http://www.lifeprint.com/asl101/pages-signs/e/email.htm). If I point at you at the end of the sign, it could mean "I will send you an email." If I point at myself, it could mean "You should send me an email" or "Did you send me an email?" depending on my facial expression. If I point off into space, it could mean "I'm sending an email." If I start by pointing at you and then end the sign by pointing off into space, it could mean "You should send an email." So your translator AI needs not only to understand the facial expressions and movements of the signer, but also the spatial relationships of everyone in the conversation. And that is just one aspect of the difficulty; there are many other features of sign languages that are just as hard to translate.
Perhaps this is the sort of thing that future AI systems could do. But it is quite complex.
Not an expert in either technology but I would imagine a lot of the difficulties of speech to text (bad recording conditions, variety of accents & pronunciation differences, etc) also have analogues in computer recognition of hand signs (camera alignment is bad, cut off, lighting is bad, someone's hand signs are lazily performed or slightly different than textbook ASL, etc).
Speech to text was "solved" a long time ago but I've seen it take many years to become as usable as it has recently. And it still regularly is frustrating to use for me!
My experience of speech to text has been terrible. As soon as someone has a bit of an accent, it becomes completely incoherent, even if all participants understand the accent perfectly. Even for good accents, using any kind of technical terms or proper names throws it off. And even when none of these are a problem, it still has at least a few percent error, even for English.
Microsoft Research had an early implementation of this kind of visual sign language translation with Kinect and ASL recognition around 2013. I expect that with the death of Kinect in the market it stayed lab-bound.
Sign language (speaking with ASL in mind, unfamiliar with others) uses a different grammatical structure from American English. To boot, a signer can adopt different body languages to imbue their message with different tonality. Current ML models (as far as I’m aware) struggle with capturing that tonality in speech-to-text, and there’s _far_ less data for sign language recognition, regrettably.
Sign language is a complex language, so something has to learn yet another language. Perhaps made all the more complex as it's spatial and visual instead of verbal or written.
Not every deaf person uses sign language and there are many different sign languages in the world. American Sign Language(ASL) is but one of these languages.
Specifically, sign languages are not visual representations of existing languages (e.g. ASL and English) but completely different languages altogether.
I was waiting for someone to say this. I have taught Humanities in a program which included a significant number of deaf students. Communication was enabled via translators. The difference in language culture, plus the fact that everything was mediated through a third party, made the whole experience profoundly unsatisfactory. Nuance was completely lost. Perhaps if I had received training in sign language, things would have been different.
Exactly. Though Visual "representations"(for lack of a better word) of English do exist in dialects like SEE(Signed Exact English), which is obviously not the same as ASL.
I'm sure other signed languages have rough equivalents in their regional areas as well. I know Mexico has a few different signed languages, though I only have passing familiarity with 1 of their signed languages and it's definitely not a representation of Spanish.
I'm sure that's something someone could do! And it sounds like a very fun project. It probably hasn't already been done because there already exist very good speech-to-text models and not many (any?) sign-language-to-{text,audio} models
Hi there: some feedback from a sign language interpreter (my wife), as I showed this to her.
> 4) You don't need glasses to test it, just an Android 12+ phone.
This is exactly the point she raised. Some deaf people will use any dictation software on the phone and look at the phone when needed. These glasses instead cover part of the field of view. Note that vision is more important for, and more heavily used by, deaf people than for the rest of us. She couldn't see the improvement of wearing bulky glasses over lowering your eyes to the phone.
Personally I think the endeavor is admirable and I wish you the best of luck. Also, as other comments say, I think this product might be more desirable for the hard-of-hearing and late-in-life hearing loss sectors than for people born deaf.
The advantage of using AR glasses is you can still look at the person, see their facial expressions, the reactions etc without always having to look down at your phone. The glasses aren't very bulky or heavy. We're just providing an option for those that want it. It's a magical experience :)
What are the privacy/security issues with this? Does this mean every conversation a person wearing these has (or that occurs within earshot) is being collected and harvested by someone? Will the AR be used to insert ads into people's conversations or plaster images of ads all over the place? Will certain words or phrases be automatically censored?
This is cool tech, that could be used to help people, but it comes with lots of potential for new forms of evil that were not possible without it. Considering that I can't remember the last time I bought a product using a new technology that wasn't also designed to work against my interests, I'm immediately skeptical of any device that can't be used offline and especially one that requires being connected to cell phone apps.
We've taken a privacy-first approach here. All data is only ever stored on the device, owned by the user, inaccessible to us. It's only ever transcribing when a user asks it to and only stored if the user asks it to be. We don't censor anything. We are soon to release purely on-device transcription, but the quality of this is still not as good as the cloud providers offer. The app itself is what powers the glasses, they are just output devices.
I hadn't thought that you would be, but I don't know anyone who hasn't had at least one service they used change their terms for the worse over time, especially once it becomes successful or ownership/management changes. Better to safeguard against such problems before they are problems than come to depend on a device like this only to find yourself stuck when the rug gets pulled out from under you.
> We are soon to release purely on-device transcription
This is really great! It's so important that people don't have to worry that things said between them and a lover, or a doctor, or a client, or a therapist are going to end up exposed to anyone else.
If you really want this device to be able to help people without opening them up to exploitation, please try to develop it to be as open as possible. Making it easy for developers to use it with whatever other hardware they like (a PC vs a cell phone), or even allowing them to extend it by adding new or customized functionality would be ideal and could help people in ways you hadn't even considered.
This is a cool product and a great use of the technology, and I'd love to be able to recommend it without reservation.
We are soon releasing versions for iOS, Mac and Windows. We are also hoping to release an extensibility model to allow others to build skins / modules. The glasses will only work properly on Android though.
It's not that easy to pick up every speech utterance over a wide range and separate it by speaker. The further away the sound is, the less intelligible it is. An artificial non-directional microphone is unlikely to pick things up with the same clarity or at the same distance as your own ears. There shouldn't be any more privacy concern with a microphone ear than with a biological ear. If there is a concern, the best way to manage it is not to talk about confidential topics with other people in the room.
There won't be any ads, because people would just use their phones instead which don't have ads. Specifically Google Live Transcribe, Otter and the like. Those require a data connection to the network, but there are versions that don't need the network at all. E.g. Chrome's Live Caption option. Eventually as technology becomes more power efficient and miniaturized it won't need to be paired to a phone.
The advantage of glasses is that people find it very distracting seeing a phone scrolling away, my GP stares at the phone instead of me because he is fascinated by it. Sometimes you can't be holding a phone up if something is being worked on. The glasses would also allow for a bit more directionality. It's a promising tool depending on how well it is implemented.
The likelihood that what you say "in person" is recorded and stored by someone easily subpoenaed increases quite a bit. While mobile apps might record and upload without the owner of the phone knowing, the odds seem low; this changes the situation significantly.
> I don't see how this is inherently worse than mobile phones,
The problem isn't that it's worse than a cell phone. The problem is that it isn't any better. All of the privacy and security problems that exist with cell phones now exist with this product. You're locked into using a cell phone, which, for most people, is a hardware platform that Google controls and exploits at your expense.
You can't secure your cell phone and keep it private, and so nothing on your cell phone should be assumed to be secure or private.
If I'm in my doctor's office, or working with a client, or doing anything where I don't want Google and who knows who else listening, I can turn off my phone, put it away, or leave it at home, and be reasonably sure that I'm not being eavesdropped on (there's always some risk of three-letter agencies listening, but you can't do anything about that). That's not an option if I need my phone to transcribe every word being said to me. And even in situations where I currently leave my phone on, it isn't normally sending every word it hears to the cloud to be transcribed and sent back to me either.
I like that this can work with a cell phone app, but it'd be much better if it could be used entirely offline, and ideally on different hardware as well, including a PC.
Imagine being able to run something like this entirely offline, connected to something the size of an iPod, or even a Nano running Linux. Imagine being able to build your own tools to interface with it: add your own substitutions/annotations to the real-time text, add dictionaries, or add translation features.
This could be really cool even life-changing tech for people, or it could just be one more technology that's convenient but ultimately used to exploit people.
The key question is always a painfully simple one: how much are people willing to pay as a privacy-first premium?
The answers to this determine everything. Treating privacy-first as the moral and ethical default we should expect everyone to start from is a wonderful idea, rooted in compassion, kindness, and a foundational respect for human rights. It has also been an abject failure to date.
We should not expect the future to be different unless we are willing to be realistic about the economics at work. Otherwise the market gap will remain in the realm of the wonderfully hypothetical forever.
> The key question is always a painfully simple one: how much are people willing to pay as a privacy-first premium?
The fact is that no amount of money a customer can pay will ever be worth more than taking that customer's money and then also selling every scrap of data that your product can get its hands on, on top of it.
We need to stop accepting "But I can make more money by screwing you" as a valid excuse for how things are. It's true, but still not okay. If the economics are always going to favor exploiting people, then I guess if we don't want to be exploited we need some very powerful regulations, plus all the oversight and enforcement that requires, to change the situation. I'd prefer that to blind faith and optimism. Until then, there are a number of things we should be insisting on in products as consumers to help protect ourselves. "Works offline" is a really good one in a lot of situations.
The worry is that if this takes off, eventually somebody (maybe future shareholders) is going to say "why are we leaving all this money on the table"
This product is already targeting some specific demographics. I'm sure a lot of people would be willing to fork over money for access to those eyes just as I'm sure a lot of people will want access to what's being said or looked at. It'd be very nice if we didn't have to worry about that sort of thing at all, but here we are.
The implications are approximately the same as for a tape recorder (https://www.amazon.com/tape-recorder/s?k=tape+recorder), just with enough compute power to make storage and search more convenient. Except that the glasses/app don't save anything by default, nor do they need to use the internet.
I suppose it would be a nice feature if they saved all of your conversations for later? The transcriptions are too imperfect for use in any legal matter.
Bad choice of example, because a tape recorder has no network connection and requires some effort to capture into a digital format that can easily be shared on a network. Unlike tape recorders, recent digital devices and apps often harvest your info behind the scenes.
Reading the comments here, I think people are missing how many people are losing hearing while aging and how alienating it is. Even if it only works with one speaker at a time, it could mean a massive quality of life improvement for a large and growing share of the population.
Sure it won’t solve the issues faced by the deaf community but that’s only a tiny portion of the people handicapped by difficulty hearing.
All these "aids" always have to be borne by the Deaf people, not by the hearing. At least these have some chance of actually working, I suppose, unlike those ridiculous gloves one sometimes sees celebrated.
I hope (relatively) immature solutions like this will not be used as an excuse to remove accessible infrastructure from the world (e.g. captions at the cinema, live subtitles at the theatre, text displays on public transport).
There is a huge population group that I am hoping will demand, and drive, much more refined accessibility: everything from the size of text to lighting levels to subtitles.
Many countries have aging populations, so the proportion of people who have a personal reason to care about issues like sight degeneration is going to increase significantly over time.
The number one thing these glasses/software need to solve is making the words match the speech in a one-to-one conversation in a quiet environment, e.g. a doctor's visit. I think they are very close.
We just got the Nreal/XRAI setup a few days ago for my deaf-from-birth wife (I'm the hearing husband). She grew up lipreading but integrated more with signing and the Deaf community as an adult. She has a cochlear implant but cannot understand language from sound alone, and really doesn't enjoy hearing that much unless we are watching a movie or the like, where the sound is 100% linked to the visual.
Initial reactions to the setup:
1. Impressed, hopeful, excited
2. A bit complicated technically. More stuff to deal with; not an everyday thing.
3. Phone battery usage is high. Maybe 3-4 hours.
4. In the right situation they will be really powerful.
5. Need more control over the interface, e.g. show/hide the 'listening' icon, which can be distracting, and move the subtitle position (maybe you already can).
6. The processing delay can make you more of an observer of the conversation. Response time is delayed enough to interrupt the flow of a conversation (like satellite TV interviews).
The number one barrier to using them is having everything ready for the moment they are needed. You need to plan ahead. Takes a few mins to set up.
All the other high end ideas can be set aside while the core function is dialed in.
We really appreciate the effort and hope to contribute.
Thank you for this feedback! Btw, we'll support better adjustment of the subtitle position very soon. We did just add many additional font size options as well. If you haven't already, please consider joining our Discord server to provide feedback at any time: https://discord.gg/7HjyDJ3JAz
The current generation of glasses (such as these Nreal Airs) use a birdbath technology which requires a tint. The next generation of waveguide glasses won't require this but they are currently not as good visual quality, especially for reading text. They are also a lot more money. Everything is a trade off right now. The next couple of years will be transformational.
I am pretty deaf yet somehow have wound up interviewing people for a living (go figure).
I currently use Google Meet and recently switched to Otter.ai for recording/transcription. Unlike the transcription tools for journalists that I used in the past, Otter.ai generates the text live on the laptop screen while we are talking and even corrects itself to make sense as the speaker reveals contextual clues.
It is a huge help, and I had wished there was something for real-life conversations like this is for on-screen conversations.
Good news for deaf people. You only have to watch newscasters speaking through a pasted-on permanent grin - as if every word is "ee" - to know that lip reading is garbage.
Very cool, especially for people who lose their hearing later in life. For other deaf people it's important to remember that written english is not a form of their native sign language¹, so this would be like (because it is) reading captions in a second language. Still potentially useful but with more limitations. Not that there's necessarily a technological way around those limitations either.
¹ Afaik this applies to all other sign languages outside english too. Signed Exact English exists and probably other-language equivalents too but I've never met a native speaker.
It is true, and something more people should understand, that ASL is not a signed version of English, but most ASL speakers are pretty much bilingual. They are taught to read English, and most places also encourage learning to speak English, generally with speech-language pathologists, though some in the community are understandably reluctant.
Why isn't it more than 70%? I'm assuming this is in an English-majority country, so everyone who can should want to be able to read English.
Because learning written English isn't easy when the sounds it represents are not part of your sensory experience. It's more like learning Egyptian hieroglyphics. Plus many parents either can't or won't invest the effort/money to make it happen. Lots of parents never even learn ASL to communicate with their own kids. Met a mother in a beginner ASL class finally deciding to learn enough to communicate with her 20 year old daughter (who was born deaf and is illiterate, FTR). It broke my heart when I met the daughter and found out she basically had no exposure to other signers too. Deaf all her life and she didn't have a name-sign.
There's "cognitive impairment" and then there's "nobody bothered to communicate enough for a kid to build communication skills."
Right, which is why I'm pointing out the second language nature of it since a lot of people are bilingual and I think have an intuitive grasp of the difference in ease between captions in your native language and captions in your second.
The problem with ASL compared to lip reading is that it's a form of self-segregation, limiting the deaf person to primarily communicating only with other people who know ASL. If these glasses are effective, it could help bridge that gap.
I suspect (and hope) you had no ill intent but this comment is really ignorant. There is a terrible history of Deaf people being discriminated against and forced to “lip read” rather than communicate through ASL. And by “forced” I mean they were more or less mentally and physically tortured into compliance.
Your comment not only perpetuates this totally false narrative that there’s a “problem” with ASL but it makes it sound like Deaf people have chosen only to socialize among themselves when the reality is that we have built a world that makes communication difficult for Deaf people. It doesn’t have to be that way: https://icyseas.org/2014/01/12/marthas-vineyard-deaf-people-...
Unfortunately, there are very few Deaf people, so there is no world in which everyone learns ASL just in case they meet a Deaf person. It just doesn't scale, and so what you end up with is isolating them even further. The best approach is to bridge the gap: similar glasses can be used by hearing folks to translate sign language, and it theoretically generalizes to all foreign languages.
Yeah that's what I'm replying to. Most of the country is not like that so ASL is not going to be a shared language, and expecting them to move is not feasible either.
I feel like you're making my comment out to be a lot more hostile than it was intended.
I don't see any "problem" with deaf people. I see ASL as, effectively, a foreign language; and it makes sense to me that in general, you're able to live more effectively when you can speak the same language as the people around you.
When I traveled overseas to a spanish-speaking country, I learned spanish so that I could communicate with the people there. It would be unreasonable for me to show up as the cliche american tourist and expect the spanish-speaking people there to learn english so that I could communicate with them.
I think you're misunderstanding the parent poster: you don't have to be intentionally hostile to perpetuate harmful and ignorant falsehoods. In fact, the parent poster states they don't believe you're intentionally perpetuating ignorance. But it's reasonable for someone to firmly rebut harmful ignorance, even if that level of firmness doesn't match your own perception of your comment.
I did read the links. And my response incorporated what I'd understood from those. I was actually kinda curious how the parent would respond to the analogy with foreign travel.
Rebuttals aren't just one-and-done. You have to be able to actually defend your position, and that requires a lot more than just one statement.
Wait are you implying you think the main way people learn things is by spewing bullshit on the internet until someone contests it? Read a book ya dingus. wikipedia. christ.
I’m not sure how you can read hostility into my comment when I plainly stated I didn’t think you were commenting with ill intent. Ignorance doesn’t require intent though.
ASL isn’t a “foreign” language. It’s an American language. In fact, it’s more “American” than English. People who communicate via ASL aren’t foreigners on holiday. They’re our friends and family and neighbors. You are absolutely correct in your assumption that being able to communicate with those around you is important!
Imagine if everyone around you just refused to engage with you verbally and would only communicate via text messages. If you’re speaking they completely ignore you until you write it down. If they’re speaking while you’re around it’s always in whispers so that you can really only get bits and pieces. How would you feel? Annoyed? Excluded? Like you’re not able to fully understand conversations?
Your travel analogy doesn’t hold water because (in addition to Deaf people not being foreigners or guests!) you seem unwilling or unable to understand that (1) hearing is a critical component of effective verbal communication and (2) Deaf people can’t hear. Again, I’ll chalk it up to ignorance rather than ill intent but your analogy as a whole is pretty gross to paint Deaf people as entitled, unreasonable, and demanding. Your analogy is akin to suggesting it’s unreasonable for me, a person who can walk, to rollerskate everywhere I go and it would be unreasonable for me to expect curb cuts just so I could rollerskate everywhere so therefore it’s unreasonable for people who use a wheelchair to expect curb cuts. If someone who uses a wheelchair wants to use the sidewalk they should just try fucking walking right? I don’t know why no one has thought of that!
That's fair. It's easy to read hostility where there isn't any.
> ASL isn’t a “foreign” language. It’s an American language.
Elsewhere in this thread you mentioned that "ASL is a complete and distinct natural language in its own right," and that's the sense I was trying to convey by calling it a "foreign" language. It's as hard to acquire as a foreign language, and carries the same benefits. And it fit the analogy.
I appreciate both of your analogies, but I think they aren't quite fair. Not going to the massive effort of learning a new language isn't really the same as 'refusing to engage'. And the analogies seem to understate the effort involved by everyone who is not deaf to learn ASL. It's just not as easy as installing curb cuts (I could probably do one of those in an afternoon...become conversational in ASL? not so much).
I guess, to me, it comes down to the level of effort being asked of me. I'm happy to be accommodating, up to a point. Let's engage in whatever medium is most suited. But asking me to put in the massive time and intellectual commitment of learning a new language is past that point. And I think that's the case for a lot of other people as well; maybe most.
> If someone who uses a wheelchair wants to use the sidewalk they should just try fucking walking right? I don’t know why no one has thought of that!
I had a relative who was deaf. An engineer. He learned to lip read, and that gave him more freedom to work with others and be effective in communities that did not know ASL.
Do you realize that “learn to lip read” isn’t feasible for a lot of people? That’s great that lip reading worked for the one Deaf person you know. But Deaf people, like the population at large, are all different people. It’s cruel and ridiculous to suggest all Deaf people should just learn to lip read.
I tried to explain the cruel history behind your callous remark.
I tried to give you an opportunity to empathize with how a Deaf person might feel.
I tried to offer some perspective as to why the views you’ve expressed are narrow-minded.
I tried to educate you.
But you’re just here to argue. You don’t want to put in the effort so it’s the Deaf community’s fault they can’t effectively communicate with you and people like you? Spanish is easy enough to learn for your trip but ASL is too difficult to learn? You know one person who lip reads so everyone must be able to?
You’re not even willing to consider for a moment that your gut-opinion or the second-hand experience of the ONE person you know might not be universal. I don’t have anything else to say to you except that I hope you take a few minutes to read about ASL and Deaf culture and history and take a few minutes to reflect.
>It’s cruel and ridiculous to suggest all Deaf people should just learn to lip read.
Yet you seem to think everyone is obligated to learn sign language...
>Spanish is easy enough to learn for your trip but ASL is too difficult to learn?
Spanish is NOT "easy enough to learn"; perhaps some basic phrases for tourists, but certainly not fluency. It takes years to master a new language for most people, if ever. Spanish is relatively easier for English speakers than many other languages, such as Japanese, but it's not something you're going to become conversational in quickly unless you're gifted at language learning.
The idea that people are somehow "cruel" and "narrow-minded" for not learning a particular language that isn't the dominant language in their region is the most ridiculous thing I've read all day, and that includes the comments from North Korea supporters here.
There’s nothing inherent to the world we have built that’s stopping deaf people using the Internet in their own languages. Yet there’s no ASL Wikipedia. Why not?
There is in fact a test ASL Wikipedia on the Wikimedia Incubator, written in Sutton SignWriting. One of the issues is that Sutton SignWriting is not yet fully supported by Unicode.
This is stupid and meaningless pedantry. Like when you ask your teacher if you can go to the bathroom and they say “I don’t know, can you?”
A system of writing can be used for any language. (Have you ever wondered why English, a Germanic language, or Vietnamese, an Austroasiatic language, use a writing system based on the Latin alphabet?) Any language that doesn’t use a system of writing can use a system of writing. It’s not like it’s some sort of existential impossibility.
You miss the point, and your aggressive tone is not conducive to civil discussion. As you say, any language can use a system of writing, so if one doesn't, it's not because its users can't write, but because they don't want to write. Writing is literal stone-age technology, after all.
Why would you compare an entire language to a single technique like that? There is no "the problem" with ASL any more than there is "the problem" with english or any other language. Yes communicating across the barrier can be a challenge but that's just the nature of having more than one language in use.
> that's just the nature of having more than one language in use
Well, yeah. I guess my point is that ASL is pretty much a foreign language.
I'd compare it to the situation with the english language worldwide - since english is the lingua franca, so to speak, many countries around the world teach it as a second language. If you don't learn english, then (generally speaking) you're at a disadvantage because you can only communicate with a subset of your population.
I'm not saying there's a problem with deaf people, any more than there's a problem with anyone else who simply doesn't happen to know the languages of some of the people around them.
Only being able to see ASL is different from speaking a different language – it's still English. Not being able to hear is more like not being able to read.
It's the same language you already know, but you're missing out on one of the primary ways people use it.
Oh you should definitely look it up then because this is completely incorrect. Sign languages are fully distinct languages with their own histories and influences.
For example the sign languages spoken in the US and in the UK have different ancestries and are not mutually comprehensible, despite both countries using english as spoken languages.
ASL is not based on french either. It's related to LSF, the sign language used in france, but that also isn't based on french. The modern sign languages emerged among deaf populations and have completely different grammar and morphology from the spoken languages of the cultures surrounding their origins.
Spoken languages are linear. Sign languages are not. The grammar is very different.
Sign languages have multiple articulators: two hands, face, eye gaze direction, shoulders, trunk. These can all work together to show multiple things at the same time. Spoken languages can really do only one thing at a time (with a few minor suprasegmentals such as tone).
You can construct a signed version of a spoken language, which may be useful for things like quoting book titles and other cases where you need to represent the exact words of a spoken language in signed form, but it's not common to use that for everyday communication, because the hands move a lot slower than the small muscles of the mouth and throat.
(Linguistics is a fascination of mine. Sign language linguistics are especially interesting.)
Fingerspelling is just a path for borrowing individual words from English. It is not a part of native ASL vocabulary or grammar; that is to say, ASL does not consist of fingerspelled English sentences.
Here's a fun example. ASL allows, maybe even requires, negation after the statement. An interpreter friend of mine was interpreting Wayne's World in a mixed crowd. The whole "<statement>... NOT!" joke gets laughs from the hearing audience and the Deaf audience doesn't understand why.
I think that could be interpreted. Statement + NO is the standard word order in ASL, but there would usually be a suprasegmental element. That is, the negation is also (or, sometimes, only) shown with a headshake which spreads over the entire length of the statement. Leaving out the suprasegmental, making it a flat statement, and then pausing before the NO might perhaps work. Maybe.
However, ASL also makes much, much heavier use of rhetorical questions than English does. You might even introduce yourself with "MY NAME WHAT? [NAME]" (i.e., "What is my name? [Name]"). So perhaps it would just look like you're doing that.
(Disclaimer: I don't know ASL. I know some Irish Sign Language, which is related, but dropped out before completing my interpreter training. I have a bit of a fascination with sign language linguistics, but I'm no expert.)
I have only basic ASL and am by no means an authority. But I think between native ASL speakers its use would be very rare, mostly just to clarify an exact spelling for something that was going to be written down in english.
Native ASL speakers who are completely illiterate in english certainly exist, and I'm not sure at all if they know or use finger spelling.
I admittedly don’t have much experience with deaf people; I had one acquaintance in high school who was deaf. Hanging out with him made me very aware of how isolating it can be to only be able to participate in conversations where you are actively trying to pay attention.
If these can let people hang out and participate without having to actively track each speaker in a group setting it will go a long way.
This has the potential to be the killer app for mixed reality once instant / realtime translation is possible. Imagine being able to understand every language in the world - and if two users of this product meet, being able to converse without learning each other's language.
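To make the idea concrete, here is a minimal sketch of what such a caption-translation pipeline could look like: recognize speech, translate it, then wrap it into display-sized caption lines. Every function here is a hypothetical stand-in (no real product's API is being shown); the recognizer and translator are stubbed with canned values purely to show how the stages compose.

```python
# Hypothetical sketch of a realtime translation caption pipeline.
# recognize() and translate() are stand-ins for a streaming speech
# recognizer and a machine translation model; caption() shows the
# kind of line-wrapping a small glasses display would need.

def recognize(audio_chunk: bytes) -> str:
    """Stub speech-to-text stage (a real one would stream partial hypotheses)."""
    return "hola, como estas"  # pretend the recognizer heard Spanish

def translate(text: str, target: str = "en") -> str:
    """Stub machine translation stage."""
    lookup = {"hola, como estas": "hello, how are you"}
    return lookup.get(text, text)

def caption(text: str, max_chars: int = 32) -> list[str]:
    """Wrap translated text into caption lines the display can fit."""
    words, lines, line = text.split(), [], ""
    for w in words:
        candidate = (line + " " + w).strip()
        if len(candidate) > max_chars and line:
            lines.append(line)
            line = w
        else:
            line = candidate
    if line:
        lines.append(line)
    return lines

print(caption(translate(recognize(b"..."))))
```

The interesting engineering lives in the parts stubbed out here: streaming partial results so captions appear before a sentence finishes, and revising earlier words as context arrives.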
Anyone know the status of these Google AR glasses?
The form factor of Google's AR glasses looks much, much closer to a normal pair of glasses than the glasses in the top video (which look like heavy sunglasses with a wire connecting to your phone).
In the description it says: "This device has not been authorized as required by the rules of the Federal Communications Commission. This device is not, and may not be, offered for sale or lease, or sold or leased, until authorization is obtained."
It's powered by an app apparently. Could be interesting to connect it to GPT-4 (e.g., someone asks you a question and then you could just tell them the answer from GPT.)
Or, if it had OCR capabilities, you could just hold a sheet of paper in front of yourself and say "what's this?" and it would explain the text to you.
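A rough sketch of how that "what's this?" idea could be wired up: OCR the camera frame, then fold the text and the wearer's question into a chat-style prompt. Both `run_ocr` and the document text are hypothetical stand-ins; only the message-list shape (system/user roles) follows the common chat-completion convention.

```python
# Sketch of the "what's this?" idea: combine OCR'd text from the
# camera with the wearer's question into a single chat-style prompt.
# run_ocr is a hypothetical stand-in for the phone's OCR capability.

def build_prompt(ocr_text: str, question: str) -> list[dict]:
    """Build a chat-style message list from OCR output and a question."""
    return [
        {"role": "system",
         "content": "You explain documents to a deaf or hard-of-hearing "
                    "user in short, caption-friendly sentences."},
        {"role": "user",
         "content": f"Document text:\n{ocr_text}\n\nQuestion: {question}"},
    ]

def run_ocr(image: bytes) -> str:
    # Stand-in: a real app might use the phone's built-in OCR or Tesseract.
    return "Rental agreement. Rent due on the 1st of each month."

messages = build_prompt(run_ocr(b"..."), "what's this?")
print(messages[1]["content"])
```

The system prompt doing the "short, caption-friendly sentences" constraint matters here, since a glasses display can only show a line or two at a time.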
So I dug around a little bit and figured I'd just ask.
As just some guy on the internet, can I buy one of these and write a hello world to have text show up in front of my eyes of my own choosing? Does it have an API or will it?
This is the sort of thing I expected Google Glass to be able to do. But Google apparently lacks sufficient imagination, so they just unceremoniously canned it. Really hope this takes off.
I was so excited for the possibilities of AR when I first heard about Google Glass. I imagined navigating foreign cities with signs auto-translated to English, turn by turn directions, translated subtitles, etc
Bought a pair of these glasses, Nreal Air, a few months ago. I find them useful for laptop coding without straining my neck. It's awesome to see more creative use cases for them!
The video gives the general idea, but it only shows a graphical simulation of what the wearer sees. I'm curious to see an actual photo or video of the text appearing on the lens of the glasses, in reality, with no special effects.
I don't understand this page
https://www.nreal.ai/compatibility-list/ What do the glasses need to be compatible with? I don't recognize or own any of these products listed.
I have an iPhone. How does it work with a device as part of the deal?
Unfortunately Apple users will have to wait for the Apple glasses, which are being released at some point after 2025/2026. The specific things the glasses use are DisplayPort Alt Mode over USB-C and Qualcomm chips in the phone.
1. They don't have the same kind of budget as Google to get the scale down as effectively.
2. If these are the ones I have heard about before, all the speech-to-text is done on the device, not in the cloud, for privacy reasons. That means it needs a bit more bulk for the gear to work.
What apps on the phone do you use for live conversation captioning?
I'm thinking a tablet that does this would be ideal for in-home use by some elderly relatives. AR glasses won't be suitable for some time for the elderly, as it seems eyesight starts going before hearing.
Compared to contemporary AR glasses ($2k+), these Nreal Airs are fairly cost-effective at $379. We plan to support lots more glasses as they come out, especially ones with waveguide technology that doesn't require the shaded lenses.
Would love a version for translation. I don't need the glasses, just read the translation to me and I'll wear an ear bud. And if you could make it work at a loud party, that'd be just perfect.
Speech to text barely works in the best conditions. Combine it with bad audio, automatic translation, and text to speech, and you'll be lucky if you understand 10% of what's being said.
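A back-of-envelope way to see why the chain degrades so fast: if each stage independently preserves only a fraction of the meaning, the end-to-end fraction is roughly the product of the stages. The per-stage numbers below are made up for illustration, not measurements of any real system.

```python
# Back-of-envelope: why chaining stages hurts. The end-to-end
# fraction of meaning that survives is roughly the product of the
# per-stage fractions. These numbers are illustrative, not measured.

stages = {
    "noisy-audio capture": 0.70,
    "speech-to-text": 0.80,
    "machine translation": 0.85,
    "text-to-speech intelligibility": 0.90,
}

end_to_end = 1.0
for name, frac in stages.items():
    end_to_end *= frac
    print(f"after {name}: {end_to_end:.2f}")
```

Even with these fairly optimistic per-stage numbers, less than half of the meaning survives the full chain, and party-level background noise drags the first factor down much further.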