ChatGPT has just landed. It can generate text that seems fluid, plausible, and (surprisingly) not total nonsense. It’s got a lot of people wondering what’s left for humans — and for the field of Natural Language Processing. Here to help us is computational linguist Daan van Esch.
Patreon supporters
Huge thanks to all our great patrons! Your support means a lot to us. Special thanks to:
- Iztin
- Termy
- Elías
- Matt
- Whitney
- Helen
- Jack
- PharaohKatt
- Lord Mortis
- gramaryen
- Larry
- Kristofer
- Andy B
- James S
- Nigel
- Meredith
- Kate
- Nasrin
- Ayesha
- Moe
- Steele
- Manú
- Rodger
- Rhian
- Colleen
- Ignacio
- Sonic Snejhog
- Kevin
- Jeff
- Andy from Logophilius
- Stan
- Kathy
- Rach
- Cheyenne
- Felicity S
- Amir
- Canny Archer
- O Tim
- Alyssa
- Chris W
- Felicity G
Become a Patreon supporter yourself and get access to bonus episodes and more!
Become a Patron!
Show notes
Building Machine Translation Systems for the Next Thousand Languages
https://research.google/pubs/pub51503/
van Esch et al., 2022. Writing System and Speaker Metadata for 2,800+ Language Varieties
https://aclanthology.org/2022.lrec-1.538/
All Daan van Esch’s ACL Anthology papers
https://aclanthology.org/people/d/daan-van-esch/
Elpis — Accelerating transcription
https://elpis.net.au
Skirgård et al., 2022. Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss.
https://osf.io/preprints/socarxiv/grh45/
Glottobank
https://glottobank.org/
The World Atlas of Language Structures (WALS)
https://wals.info
Sign Language Processing
https://sign-language-processing.github.io
Intelligent Automatic Sign Language Translation | European Commission
https://cordis.europa.eu/project/id/101016982
The sound of swearing: Are there universal patterns in profanity?
https://link.springer.com/article/10.3758/s13423-022-02202-0
Swear words: we studied speakers of languages from Hindi to Hungarian to find out why obscenities sound the way they do
https://theconversation.com/swear-words-we-studied-speakers-of-languages-from-hindi-to-hungarian-to-find-out-why-obscenities-sound-the-way-they-do-192473
Sh*t Towns Of Australia: The Great Aussie Road Trip | QBD Books
https://www.qbd.com.au/sht-towns-of-australia-the-great-aussie-road-trip/rick-furphy-geoff-rissole/9781988547763/
ChatGPT (OpenAI account required)
https://chat.openai.com/chat
Transcript
[Transcript provided by SpeechDocs Podcast Transcription]
Daniel: Daan, do you see what I put up with here?
Hedvig: Sorry, we’re just getting warmed up.
Ben: We’ve got a renowned artificial intelligence expert here and we’re making this person listen to our pointless ramblings. Is that what’s happening?
Daan: I wouldn’t say renowned, but…
[laughter]
Hedvig: Yeah, I was going to say… I don’t know if Daan would describe himself that way.
Ben: I notice he did not push back on “pointless ramblings”.
[Because Language theme]
Daniel: Hello, and welcome to Because Language, a show about linguistics, the science of language. I’m Daniel Midgley. Let’s meet the team. These intros were written by ChatGPT, which is a large language model: if you give it a prompt, it’ll generate text for you. And here was the prompt. “My name is Daniel and I have to introduce my cohost, Hedvig, a linguist, and Ben, not a linguist, on our podcast. I have to introduce our guest, Daan van Esch, who is an AI expert. Can you write an introduction for me?”
So, I’ll start with Hedvig because that’s what the computer came up with. “Hedvig is a linguist and language enthusiast with a passion for all things related to words and how they’re used. She’ll be bringing her expertise and insights to the show. So, be sure to listen closely for her gems of wisdom. It’s Hedvig Skirgård.”
Ben: Solid.
Hedvig: Yeah.
Ben: Generic, but solid.
Hedvig: Sort of a bit waffling, sort of samey things several times. And dare I say, a bit American in that, like, “She has a passion,” blah, blah, blah. For a European, that’s a bunch of, like, empty stuff.
Daan: For words, no less.
Daniel: [laughs] For words. I’m thinking of the one thing I can think of for language and it’s words. I came up with words.
Hedvig: I mean, I actually don’t really like words as much as other parts of language.
Daniel: Which we’re going to find out about today.
Hedvig: Yeah, fine.
Ben: Way to pull the “I’m not like other linguists” card.
Daniel: [laughs]
Hedvig: Yeah, I’m a bigly linguist.
Daniel: I’m not really into words.
Daan: You even believe words exist, Hedvig? I mean, that’s an existential question.
Hedvig: I’m not even sure I do, honestly.
Ben: We are one introduction out of three in and we are like, “Do words exist?”
Hedvig: Yeah.
Ben: Do me, do me!
Daniel: Okay. “Ben may not be a linguist, but he’s a smart guy with a lot to contribute to the conversation.”
Hedvig: This is true.
Daniel: It’s true.
Ben: Hang on, is that like full stop? Is that the introduction?
Daniel: No, we’re just getting started. Get this, “He’s a polyglot and language learner.” There you go, Ben, you’re a polyglot. “With experience studying and speaking multiple languages. He’ll be sharing his tips and tricks for learning languages, as well as his own personal language learning journey. It’s Ben Ainslie.” Come on, Ben, let’s hear about your personal language learning journey.
Ben: Yeah, I’m learnen gooden and…
[laughter]
Ben: It is one of, not the, but certainly one of the great shames of my life that I’ve been on a linguistics podcast for 11 years and I don’t even know how to speak my language particularly well.
Daniel: Oh, that’s a controversial view.
Hedvig: I hate when people say they’re not good speakers of their own native language. You are, but you are not famous for being a polyglot. I think it’s really funny that the AI was like, “Oh, someone who’s next to a linguist and is described as a non-linguist must be a polyglot.”
Daniel: You must be a polyglot.
Ben: The only reasonable explanation.
Daniel: Well, at least they didn’t make that mistake of saying, “Oh, you’re a linguist. You must be a polyglot.”
Ben: True. Yeah, true.
Daniel: We got the reverse.
Daan: I always find it an interesting question, what makes you a polyglot? I speak five languages, but I’m not a polyglot. At what point does anybody get called a polyglot?
Ben: Just a few, many, some. Where do these lines get drawn?
Hedvig: It’s also when you start going on YouTube, shooting hidden cam videos with you going into Cantonese grocery stores and filming how surprised the ladies are.
Ben: Oh, I find that shit so, so odious.
Hedvig: Some of it I find cute. Some of it is bit cringey.
Ben: I’m a hard pass on it, and I’ll tell you why, because it’s so rooted in whiteness. It’s like, “Ooh, watch this. White guy speaks Cantonese.” Imagine that in reverse of, like, random East Asian looking person just speaking English in a shop. Zero views. No one will give any fucks because so many people from those communities have done that. When a white person does it, we’re like, “Oh, my God. What magnificent…”
Daniel: It was like Baby Princess Charlotte who speaks two languages. “Oh, that’s amazing.” “Oh, here’s a non-white child who speaks two languages. Ooh, that’s a problem.”
Ben: It’s not even that. What I love is, if you watch enough of those videos… because I fully fell down that rabbit hole, I must admit. If you watch enough of those videos, a lot of the people who interact with this dude when he switches to Cantonese are like, “Uh, so do you want the hot sauce or not?”
[laughter]
Ben: They’re very, very unimpressed. They’re just like, “Okay, yeah, cool. You do the thing, but I’ve got a job to do, so if you could hurry the fuck up, I’d appreciate it.”
[chuckles]
Daniel: We also have a special guest with us today, Daan van Esch. “Daan is an AI expert, and he’ll be joining us to discuss the role of artificial intelligence in language and linguistics. It’s sure to be a fascinating conversation, so stay tuned.” Sorry, that was still ChatGPT. Sorry about that. Did you realize?
Daan: That was fairly accurate.
Daniel: Yeah, I think so. Stay tuned.
Hedvig: It was pretty good.
Ben: Can I just say, I have been hanging out for this episode?
Daniel: Oh, yeah.
Ben: I will be honest, I would not say that about most episodes. I’m happy to do Talk the Talk, and I’m always interested and curious, but this one in particular, I’m like, “When are we going to do the AI? Where’s the AI episode?” So, I’m pumped. [crosstalk]
Daan: I’m pumped.
Daniel: [laughs]
Hedvig: I don’t know, because I feel a bit responsible because I’m the one who brought Daan in here. Daan and I know each other from Australia, from CoEDL, the ARC Centre of Excellence for the Dynamics of Language. And Daan works at Google. I’m worried that we’re going to be like, “Daan now has the answer to all of the AI questions that have ever popped into our heads.” Because I’ve done the same to my friend who works in speech recognition, and I’m like, “My Google Nest doesn’t understand when I say this and that,” and he’s like, “I don’t know what to tell you. I work at Apple.”
Ben: [laughs] Like, it’s your friend’s specific fault. “If you could just please go and back into this, that would be really helpful.”
Hedvig: Right. So that Daan doesn’t give too wide a range, Daan, do you want to be more specific about what you’ve been doing?
Daan: Yeah, sure. I mean, I have a passion for words, obviously.
[laughter]
Daan: I want to start out by saying.
Daniel: At least one of us does.
Daan: Exactly. Yeah. It’s quite something to wake up at 08:00 AM and discuss anything other than words for me. I just basically wake up at 08:00 AM to discuss words.
[laughter]
Daniel: Passion.
Daan: AI and NLP, Natural Language Processing, are pretty massive fields. Especially these days, if you look at them, it’s like linguistics: you have people who study phonology, the study of sounds. You’ve got people who study grammar, syntax. You’ve got people who study how people learn these languages. The same way that you don’t really invite a generic linguist and then just ask them any questions that come into your mind about linguistics, it’s the same with AI and NLP to some extent. Mostly what I work on is the really applied side of things. So, for example, autocorrect on your phone. If the keyboard makes all these strange autocorrections that you’re not too keen on, yes, that would be my fault. Actually, those autocorrect mechanisms are quite accurate and they really do help prevent a lot of typos and make it faster to type on your phone.
Ben: Definitely.
Daan: That’s one thing that I’ve worked a lot on.
Daniel: I tried to say hello to my friend Sarah and it said, “Hi, Satan.”
Daan: Yeah, those are the unfortunate ones that… there are always the errors that systems make that are especially salient. 99 out of 100 times, it’s entirely unremarkable. It just makes the correct change and you don’t ever think about it. But then, that one time, you have a story that lasts a lifetime.
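A minimal sketch of the kind of correction mechanism Daan is describing, in the spirit of the classic noisy-channel spell checker: generate every candidate within one edit of the typed word and pick the most frequent one. The word list and frequencies below are invented for illustration; production keyboards are far more sophisticated than this.

```python
# A toy autocorrect: generate every string within one edit of the typed word
# and pick the candidate with the highest frequency in a (made-up) unigram
# table. This is an illustrative sketch, not how any production keyboard works.

WORD_FREQ = {"hello": 50_000, "hell": 12_000, "help": 30_000, "sarah": 8_000, "satan": 900}

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one insertion, deletion, substitution, or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def autocorrect(typed):
    typed = typed.lower()
    if typed in WORD_FREQ:                      # already a known word: leave it
        return typed
    candidates = edits1(typed) & WORD_FREQ.keys()
    if not candidates:
        return typed                            # nothing plausible: don't touch it
    return max(candidates, key=WORD_FREQ.get)   # most frequent known candidate wins

print(autocorrect("helo"))    # -> hello
print(autocorrect("saran"))   # -> sarah (satan is also one edit away, just rarer here)
```

With these toy frequencies, “saran” corrects to “sarah” rather than “satan” only because of the frequency table, which is exactly the kind of close call behind the salient errors Daan mentions.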
Ben: I’ve always imagined, Daan, that what you do is part of the deeply unenviable pantheon of jobs of many different varieties where, if you’re absolutely fucking crushing it, it means you’re invisible when you’re absolutely annihilating your job. I feel like epidemiologists during COVID are the same thing. When the epidemiologist is doing everything perfectly, everyone’s like, “Why do we even fucking need masks and shit?” I feel your job is the same, because I look at my phone sometimes as I’m typing and I’m like, “Good God.” I genuinely believe drunken monkeys could do a better job of the input that I’m providing. You must see in your work some horrific input states. And yet, most people’s text messages are legible, which I think is a true hero job, well done.
Hedvig: You can really tell, or at least I can really tell whenever I’m sitting down and trying to write by pen and paper.
Ben: [laughs]
Hedvig: My brain does the same scrambling it does when it’s typing on a keyboard, and I’m like, “Wait, those letters should swap places.”
Daan: I’ve worked in linguistics and language technology for more than 10 years and I still can’t spell the word ‘language’ correctly. I think it’s just burned into my brain. The vowels are the wrong way around.
Hedvig: No, actually, that’s how you know. That’s the badge of honor. That’s how you know you’re actually a linguist, is if you can’t spell language.
Daan: But, yeah, I mean, that’s totally true. If you actually turned off autocorrect on your phone, or even worse, if you actually went into your phone settings and changed them to another language… and this is honestly the experience for a lot of people that first start using a smartphone; the experience is really rather dreadful if you don’t have that autocorrect and spell correction that you’re so used to. I think, Ben, you were saying you only speak English. Just for the sake of argument, go into your phone settings after the show and change the settings to German and then have a good time texting all day long.
Ben: It would be incredibly hard.
Daan: Yeah. That’s everyday life. At least over the last years, that’s improved quite a bit but that’s been everyday life for a lot of people using their phones. So, yeah, things like autocorrect on phone, spellchecking, but also speech recognition, Google Translate. Again, Google Translate, it’s one of those things. Obviously, it makes mistakes sometimes, but these days between, let’s say, Dutch and English, or Dutch and French, which are the languages that we use the most regularly, it’s reasonably decent. But you still need to have that human knowledge to look over it and say, “Hey, what’s going on here? Has it done it correctly? Has it taken any ambiguous cases and translated them incorrectly?” My grandfather always used to make me send letters, like physical letters to hotels in Germany and France to make reservations for his wine-buying trips. He was an old-fashioned man.
I used to sit there and brush up my German and be like, “How does this work again? I need to insert some case here, but, hmm, do I want [unintelligible [00:11:41]?” I have no idea. I mean, you can probably tell us. You’ve been, I assume, spending a lot of time brushing up on your German.
Hedvig: I have, but I can’t tell you. I think it’s [unintelligible [00:11:54] because it’s neutral. I use Google Translate a lot because I know a fair amount of German but in order to get that sentence entirely correct, I sometimes use the tool of Google Translate definitely.
Ben: Should we use that as a handy segue into our news items? Because we’ve got a news item about Google Translate.
Hedvig: Yeah, why don’t we do that?
Daniel: Before we do, let’s talk about our last episode. It was a bonus episode with Chase of the US Naval History Podcast. He was really fun to chat with. The cover image for his podcast is the most asplodey I’ve ever seen. That’s how you know you’ve got it. We talked about nautical expressions like wheelhouse, all the cat expressions, like enough room to swing a cat or letting the cat out of the bag. Ben, we even did log.
Ben: Yes. There we go.
Daniel: We found out that it was an actual log. You told us that it was a log but some doubts arose and we quelled them because it is an actual log.
Ben: Can I ask, because I was sadly absent, did we talk about “cold enough to freeze the balls off a brass monkey”?
Daniel: We did.
Ben: Yes. That’s one of my favs.
Daniel: You can listen to that bonus episode by becoming a patron. Or, secretly you can listen for free by just going to Chase’s podcast because he’s got the episode up on his podcast as well. It’s the US Naval History Podcast link on our website, becauselanguage.com. By the way, if you are a patron, we are working on our yearly mailout. Everyone gets some goodies in your actual mailbox. So, please make sure that your address is correct with Patreon. That’s patreon.com/becauselangpod. A huge thanks to all of our patrons. All right, should we get going?
Ben: Let’s do this. I know it’s not first on the run sheet, but let’s do the Google Translate one and then we’ll do the Grambank one second.
Daniel: All right. This is new work by Daan van Esch and a team working at Google Translate. It’s called Building Machine Translation Systems for the Next Thousand Languages. Daan, tell us what this is.
Daan: Like we were just saying, a lot of the time these days, folks are buying smartphones… it’s often smartphones. We often think of computers as the first device that people access the internet with, but actually, for folks that haven’t been using the internet much yet, it’s really smartphones that are their gateway to coming online. What you’ve seen in lots of extremely multilingual places around the world over the last few years is folks coming online in all these different places, speaking all sorts of languages that technology historically hasn’t supported. So, we’re all used to computers. We’re sitting here, recording on laptops, desktop computers, but those are actually not very common in most places that are coming online for the first time in the last few years.
Those folks in those places often speak languages that historically haven’t been supported. One thing that we’ve been trying to do at Google has been to extend support in technology for those languages. We support nearly a thousand languages in the keyboard for Android. We’ve got spellchecking and next-word prediction and so on. We worked on that a few years ago. And then, the next step beyond that: oftentimes people will say, “Well, great, now I can chat with my friends in whatever chat app I might use. I can write messages. It doesn’t autocorrect me away to another language.” But then, the next thing that they find is that there actually isn’t much content on the internet in most languages. And not just on the internet: there are often also no newspapers or books or whatever, like physical publications, in the real world. So, one big barrier that people find once they have gotten to that stage of, “Oh yeah, now I can text my friends and I can type in my language,” is that then they find, “Hey, actually there isn’t much content at all that I can read in my language.”
So, we decided to look at that and see if we could extend Google Translate, which historically has supported a hundred or so languages, to basically 10x that. That’s always the nice, ambitious goal that you hear from tech people, the number 10x. Whenever you hear the word 10x, you know it’s tech, really. It does actually make sense, because if you look at it, I think a lot of people would know that there are 7000 or so languages in the world. A lot of people don’t really realize that out of those 7000, there are actually more than 1000 with quite a large number of speakers. There are also quite a few languages, and I’m sure you’ve covered them here on the podcast, with maybe a few dozen, a few hundred speakers left. But there are really more than a thousand languages with hundreds of thousands of speakers. I mean, I studied at Leiden University, which definitely does a lot of linguistics research on lots of languages of the world.
There were lots of languages, even with millions of speakers, that we tackled in this project that I had never even heard of. I think that’s something, for us as linguists, that we’re not really necessarily aware of: that diversity. I guess, Hedvig, you’ve just gone through, what was it, 2400 languages in your most recent research paper? That’s also quite a large scope. Normally, linguistics tackles like one, two, three languages in a paper, and then you’ve got all these large sets.
Yeah, anyway, we were able to build machine translation systems in a thousand-plus languages. The interesting thing is, there are two questions. One is, “Okay, how do you train these systems?” You need to have some training data that you can use to actually teach these models how they might translate between different languages. The other question is, “Okay, now you’ve built them, are they any good?” How do you check that they’re actually good enough? They’re never perfect. How do you actually check that they are good enough to put them into Google Translate?
Earlier this year, we were able to add 24 of them to Google Translate. It’s not a thousand, obviously, but it brought the number up from about 100 to about 130. So that’s a pretty sizable jump. Yeah, the question of how you actually train these models is really rather interesting, because historically, to train a machine translation system, you would need what’s called a parallel corpus. Basically, just imagine an Excel spreadsheet: two columns, and you’ve got the text in column A and the translations in column B. And such a parallel database doesn’t necessarily exist for most languages.
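A minimal sketch of the “Excel spreadsheet” shape Daan describes: a parallel corpus is just aligned sentence pairs, one column per language. The file name and example sentences below are invented for illustration.

```python
import csv

# A parallel corpus really is just aligned sentence pairs: column A is the
# source language, column B the target. These rows are invented examples.
pairs = [
    ("Goedemorgen, hoe gaat het?", "Good morning, how are you?"),
    ("Ik heb een kamer gereserveerd.", "I have reserved a room."),
    ("De trein vertrekt om acht uur.", "The train leaves at eight o'clock."),
]

# Write it out as a two-column TSV, the spreadsheet shape described above.
with open("nl-en.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["source_nl", "target_en"])
    writer.writerows(pairs)

# An MT training pipeline would read these rows back as (source, target) examples.
with open("nl-en.tsv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")
    next(reader)                      # skip the header row
    for source, target in reader:
        print(f"{source}  ->  {target}")
```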
Hedvig: This is where sources like Wikipedia can be really useful, but also really tricky. Like we’ve covered before on the show, there was this enthusiastic guy who just wrote lots of Wikipedia articles, and was it Scots Gaelic?
Daniel: It was Scots.
Daan: It wasn’t Scots Gaelic. It was just Scots.
Hedvig: It was just Scots. Then, later, a lot of people said, “No, this guy is not a native speaker. This isn’t very good material.” Wikipedia is great in that way, if you can trust what’s in there. What other input exists out there as parallel corpora besides Wikipedia?
Daan: Yeah, in terms of parallel corpora, you’ve got things like the European Parliament and the United Nations, which translate quite a bit of their material, but obviously not into a thousand languages. The big innovation in this recent paper was that we actually showed you don’t need the parallel corpora; you can just have monolingual corpora, databases of text in just one language, and you can get those for quite a few languages. In addition, for the languages where there are parallel corpora, you obviously also take those and include them in the mix.
Interestingly, when you train these models, you can actually get them to understand the underlying similarities in the structure between languages, even without seeing any parallel data. So, they’re actually quite capable of learning, say, how you would inflect a verb in the present tense. The model just looks at a whole bunch of text in one language without a translation and then figures out, “Oh, okay, there’s some grammatical pattern going on here.” It doesn’t know that it’s the present tense or whatever, but it just says, “Here’s some grammatical pattern.” In another language, it says, “Huh, this thing seems to function the same way.” And since you do have parallel corpora for a few languages, you can get it to figure out, “Oh, okay, so those things are actually all the same.” Now it’s figured out how the present tense works in this language, even though it doesn’t know that it’s the present tense.
Daniel: Okay. I’ve got a whole ton of data in a language which I just made up called Danielish, and there it is. It’s going into the database, and it’s got patterns. I can see how you could work out some things about tense and suffixes and things. How does it know that this thing is apple in Danielish? How does it get the meanings of words?
Daan: Yeah, that’s actually the most interesting part of it. A lot of it… what’s the joke in linguistics again? A word gets its meaning by context. I’m sure, Hedvig, you know the actual reference.
Hedvig: Is there a joke? I don’t know.
Ben: It’s like you know a word by its neighbors.
Daan: You know a word by the company it keeps. Something like that.
Hedvig: Oh, yeah, a word by the company it keeps. That one, I do know.
Daan: Basically, as they say, you know a word by the company it keeps. It doesn’t necessarily know that apple and orange and all these kinds of fruit are individually like, this is an apple, this is an orange, and so on. But it does figure out, “Oh, okay, well, there’s some class of objects in this language, and they all happen to occur together with this verb,” and eventually it figures out, “Oh, this verb must mean ‘to eat.’” Maybe the word apple co-occurs with green, so then that wouldn’t be an orange. It tries to figure this out. This is actually one of the most interesting problems with this entire approach, because one thing that we found is it often doesn’t realize the difference between tigers and crocodiles, for example. They don’t occur very often, and it’s very difficult to tell them apart from context. If you don’t have photos or images, they’re both just clearly scary animals.
Hedvig: They bite, they chase you.
Daniel: They eat people.
Daan: They bite, they chase you. But what is a tiger? What is a crocodile? For those sorts of situations, the approach doesn’t really work too well. What you really need there is you need to get maybe some dictionary, which is like a parallel corpus, except it’s a list of words. You just basically go look at some dictionary and you’ll have a definition, obviously, of tiger and maybe it’ll give the translation into English. That’s the next area, I think, to see if you can fold that in. But even without that, it actually works surprisingly well. It’s amazing.
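A minimal sketch of the “company it keeps” idea, using raw co-occurrence counts rather than a neural model: in a toy corpus where tigers and crocodiles show up in the same kinds of sentences, their context vectors come out essentially identical, which is why context alone struggles to tell them apart. The corpus is invented and the method only illustrates the general principle, not Google’s actual models.

```python
from collections import Counter
from math import sqrt

# Toy monolingual corpus (invented). Distributionally, "tiger" and "crocodile"
# keep almost exactly the same company, which is why a model trained only on
# context can struggle to tell them apart.
corpus = [
    "the tiger chased the hunter",
    "the crocodile chased the fisherman",
    "the tiger bit the hunter",
    "the crocodile bit the fisherman",
    "the child ate a green apple",
    "the child ate a sweet orange",
]

def context_vector(word, window=2):
    """Count the words that appear within `window` tokens of `word`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != word)
    return counts

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine(context_vector("tiger"), context_vector("crocodile")))  # high: same contexts
print(cosine(context_vector("tiger"), context_vector("apple")))      # low: different contexts
```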
Hedvig: Because it works like that based on making a bunch of assumptions, does that mean that the data you feed it is sensitive towards bias?
Daan: Well, it does matter quite a bit if the data is on the same topic. If you’re looking at data that has wildly different topics that say in one language it’s talking about like nuclear fusion and in another language, it’s talking about like rock art, obviously both very deep topics with a lot of nuance to them, but it wouldn’t be able to figure out, “Hey, ochre, this color is actually this word in that language,” because you’ve never seen anything parallel.
Hedvig: No, but I meant more like, if it’s doing that thing of, “Oh, there’s a word here that always occurs after the verb. I think that verb maybe means ‘eat’, and I think that thing maybe is food,” as we’ve seen with a lot of AI-generated content, you can get into some ethical trouble, where it’s like, “Oh, I see this class of names is often associated with this crime,” and then it starts making those associations. Is that an issue you’ve thought about for this paper?
Daan: We’ve done pretty extensive evaluations for this paper and the output that it produces because honestly, we were surprised that it even worked. I don’t think that specific thing is something that we saw at all in these evaluations here. It’s definitely something that you need to pay careful attention to for AI models, more generally speaking. I don’t think it came up at all in this context. Possibly, you would see it if you could get rid of the class of errors that was predominant right now, which was more like, “Hey, this is a tiger, that’s a crocodile, and those things are wrong,” like you can’t translate tiger into crocodile.
The other thing that’s really interesting is it would identify culturally equivalent things, so to speak. Let’s say you have a sentence in English, Australian English: “I’m having avocado toast for breakfast.” It would produce “I’m having dan bing” in Chinese, which is a sort of egg pancake, because that is what you have for breakfast. And so, is that bias? Well, it is biased in a sense, because it’s saying, “Hey, this is culturally appropriate in this language, so I’m going to translate it to that. This is the pattern I would expect.” But it is wrong. If you’re factually asserting, “Hey, I’m having avocado toast for breakfast,” the translation should not be, “Oh, I’m having some egg pancake for breakfast,” even though it’s culturally the same thing. You can see how it does that.
Hedvig: I think some translators, especially translators of fiction and poems, sometimes do that thing deliberately. They choose it actively because they’re like, “For our audience, why is someone having avocado and toast? That doesn’t make any sense. I’m going to adjust this a little bit.” Wow, that’s so cool. A thousand languages. Does that mean that… you said that about 24 have been added to Google Translate, so we’re up to about 130. Does that mean we’re going to get to 1130 languages on Google Translate?
Daan: I think the overall sum, if you eventually got all of them to a state where they would be ready to launch, would be slightly over a thousand, because we already have some of those in that model. Yeah, it is really a matter of, like I said, two problems. One is: how do you actually get the data to train that model? The other question is: okay, how do you now verify that it’s working well enough, that it’s something you actually want to put out there and say, “Hey, this thing here is something that we really want to vouch for,” and say, “Hey, this is good”? That’s actually been an interesting problem, because how do you check that those translations are relatively decent and meet the usual quality bar? You sort of need a parallel corpus. That’s historically how you do it: you take some parallel corpus and you don’t show it to the machine when it’s being trained. Later on, you go back and say, “Okay, show me how you did on this parallel corpus.”
Given that you don’t have one, it’s tricky. You have to work like we did for those 24. Actually, we checked quite a few more than 24; those 24 were the initial wave where we felt really confident that they were decent enough. You have to work with linguists, folks at universities, language activists, and really ask them, “Hey, here’s the beta version. Would you give it a try? Send us your feedback, and we’ll iterate from there.” That obviously is super valuable, super important, but it’s also not something you can super easily do across a thousand languages. I mean, my inbox would explode if I had to send thousands of emails and keep track of all that. That’s actually something we’re thinking about right now. How do you do that in a way where you can actually work on lots of different languages?
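A minimal sketch of the held-out evaluation Daan describes, using the sacrebleu library to score system outputs against reference translations; the sentences are invented. For most of the new languages no such held-out parallel set exists, which is exactly why the team leaned on human reviewers instead.

```python
# A minimal sketch of the "held-out parallel corpus" evaluation described above,
# using the sacrebleu library. The sentences are invented; in practice you would
# score thousands of held-out sentence pairs the model never saw in training.
from sacrebleu.metrics import BLEU, CHRF

system_outputs = [
    "The train leaves at eight o'clock.",
    "I would like to book a room for two nights.",
]
references = [
    "The train departs at eight o'clock.",
    "I would like to reserve a room for two nights.",
]

bleu = BLEU()
chrf = CHRF()

# Both metrics take a list of hypotheses and a list of reference *sets*.
print(bleu.corpus_score(system_outputs, [references]))
print(chrf.corpus_score(system_outputs, [references]))
```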
Daniel: We haven’t really solved the problem of scaling. It’s still hard.
Daan: Yeah, I think that’s always the case. I think a lot of people will say, “Oh, AI, ML, machine learning, natural language processing,” but it’s not mutually exclusive with linguistics. There are a lot of people who work at universities, who work in language organizations, who have that unique expertise. No tech company has people in-house who know about a thousand languages, who have all that expertise and knowledge. I think if you look at it, eventually what happens is you’ll see that we’ll have to figure out ways: how do linguists, language activists, large tech companies, how does everybody work together?
Daniel: Well, I’m amazed that it works. I’m also surprised that you’re amazed that it works. I don’t know which one is more surprising, but I feel I understand that little bit better. So, thank you. Let’s move on to Hedvig’s new project with Grambank. Tell us, with a cast of thousands, what is this thing?
Hedvig: I actually didn’t notice until we started recording this morning and I looked at the run sheet that this was on here, because I thought…[crosstalk]
Daniel: Do you want to tell us about it or…? [crosstalk]
Hedvig: No, I can tell you about it. I can tell you more about it later as well.
Daniel: So, here’s my current understanding. I’ve used WALS to look at different features across different languages.
Hedvig: And what is WALS, Daniel?
Daniel: WALS is the World Atlas of Language Structures. I can see which languages have gender on their pronouns or whether they have clusivity on their pronouns or whatever, stuff like that. Is this like that?
Hedvig: It is like that, yeah.
Daniel: But bigger.
Hedvig: But bigger. I’ve been working on this project since 2013, so next year I’m coming up on 10 years.
Daniel: Cool.
Ben: Impressive.
Hedvig: Is it? [chuckles] I don’t know. Yeah, maybe. No, honestly, it is really fun. It’s really amazing. What Grambank is, is a set of 195 questions that are almost all binary, that ask things like, “Do you have prepositions? Do you have conjugation classes on your verbs?” Blah, blah, blah. We answer that question by reading grammars or other literature published on languages. In some cases, when we can, we also reach out to linguistic experts, either speakers or people who have described language, and we ask them follow-up questions. Then, we collate all this into a big Grambank database. We’ve got 2430 languages in the first release, and that release was derived last year. So, we’re actually going to have more for 2.1, which is going to be really fun. In order to launch our big, beautiful database, we’ve written a paper that we’ve submitted to a journal. The reason why Daniel knows about this is because we decided to publish what’s called a preprint which is, in academic publishing, when you release a draft of your paper before it’s been approved by a journal.
Ben: What we might call a teaser trailer in other forms and places.
Daniel: Ooh. [chuckles] Yes, indeed.
Hedvig: Right. This preprint exists and anyone can read it. You can go to this link that’s going to be in our show notes on SocArXiv, and you can read it, and we try to showcase all the different things you can do with Grambank. Primarily, we’re focusing on different ways of viewing global language diversity and what’s out there, and how grammar can tell us about history and areal effects. We also looked into what will happen… we know that a lot of languages in the world are endangered, and when they are no longer spoken and the people who know about them pass away, we will lose a lot of knowledge. And because it’s a lot of languages, we also had a look at what will happen to grammatical diversity, because we’re not losing languages randomly across the world. We’re actually losing more in certain areas, which means that our diversity will actually go down at a different rate than just the number of languages. So, we showcase that as well.
Daniel: We can use it to figure out things about language relatedness, about areal features, about how the diversity of Earth’s languages is going to change in the next little while.
Hedvig: Yeah, that’s the kind of thing we focused on for the release paper. There are many more things you can do with it. Once the paper has been accepted to a journal, the database is going to be public and then anyone can do whatever they want with it. Whatever ideas people have, probably a lot of the similar things people have done with WALS, hey.
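A minimal sketch of the kind of query the public release would allow, assuming a Grambank-style values table in the CLDF layout (columns like Language_ID, Parameter_ID, Value); the file name and parameter IDs below are placeholders, not real Grambank codes.

```python
import pandas as pd

# A sketch of querying a Grambank-style CLDF table. The file name and column
# names (Language_ID, Parameter_ID, Value) follow CLDF conventions but are
# assumptions here; GB999 and GB998 are placeholder parameter IDs.
values = pd.read_csv("values.csv", dtype=str)

FEATURE_A = "GB999"   # e.g. "Are there logophoric pronouns?" (placeholder ID)
FEATURE_B = "GB998"   # some other binary feature (placeholder ID)

a = values[(values["Parameter_ID"] == FEATURE_A) & values["Value"].isin(["0", "1"])]
b = values[(values["Parameter_ID"] == FEATURE_B) & values["Value"].isin(["0", "1"])]

print(f"{len(a)} languages coded for {FEATURE_A}, "
      f"{(a['Value'] == '1').mean():.1%} of them have it")

# Cross-tabulate the two features, the kind of thing people have long done with WALS.
merged = a.merge(b, on="Language_ID", suffixes=("_a", "_b"))
print(pd.crosstab(merged["Value_a"], merged["Value_b"]))
```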
Daniel: Yeah. Hedvig, you’ve told me about some of the questions that you did, like do you have conjugations on your verbs? Were there any of the categories that you found especially cool or amusing? [crosstalk]
Hedvig: Oh. Well, I’ve worked on this database for a really long time now, so some of the ones I thought were fascinating nine years ago, now I’m really blasé about. So, sorry. The one that I’ve always really liked is what’s called logophoric pronouns.
Daniel: Logophoric pronouns.
Hedvig: Logophoric pronouns. This is a thing that happens in lots of languages. It’s well known in West African languages, but it happens in lots of other places. It’s when you say something like, “Daniel sipped his sugar-free, coffee-free drink,” and some languages can use a different word for ‘his’ there to denote whether it’s his own drink or someone else’s. If you, Ben, are in the same room, then in English, technically, if you say ‘his’, it could actually be ambiguous.
Daniel: Daniel sipped his drink.
Hedvig: A lot of languages use context, use knowledge from before and everything. If it’s like, “Ben was holding his drink. Daniel drank his drink,” then maybe you’d assume it was Ben’s drink, blah, blah, blah. In some languages, you can use a different word there either for the possessive or also for the pronoun itself. Swedish has it for just the possessive, but some languages also have it for like, “He said that he’s going to come by next week.”
Daniel: Oh, he himself.
Hedvig: Yeah.
Daniel: Okay, cool.
Hedvig: It’s like the same person is the first one. Those are called logophoric pronouns. I just think they’re neat.
Daniel: That is cool. We need that in English. Well, no, we don’t.
Hedvig: Well, we talked about it before. We talked about whether ambiguity is good or bad. I just think this is a neat thing because you can really see the use case for it in certain circumstances, but you can also see how pragmatics and context could help you tell them apart. Only some languages choose to codify this into their grammar because as linguists often say, “Grammar is not about what you can do, it’s about what you have to do.”
Daniel: Let me ask a question to both of you, because these resources are amazing and they use lots of languages, but I keep thinking about signed languages.
Hedvig: Mm-hmm. We have two in Grambank.
Daniel: You got two?
Hedvig: Yeah.
Daniel: Okay, cool. Which two?
Hedvig: Finnish and Japanese sign. I’m pretty sure, yeah.
Daniel: Are they special things that need completely new everything, or can we use existing techniques to include them? This is for Daan and for Hedvig as well.
Hedvig: Maybe I’ll answer first for a grammatical survey, and then Daan can answer for translation. For a questionnaire like Grambank, we definitely were struggling a bit. So, we had the help of Hannah Lutzenberger of Radboud University to test it out: “Okay, we have this questionnaire of 195 questions. Have a go and see how it would go if we filled it in for sign languages.” We found that there are a lot of things like… what does it mean to…? There are certain concepts that you can translate. So, there are actually some things that sign language research sometimes talks about as suffixes and stuff like that. But there are a lot of things sign languages do that we don’t have a way of expressing in the same way. So, for example, a signer could take a verb, make a sign with their hands, and then direct it at someone or in different directions, and that means different things for what’s happening in the sentence. Is that conjugation? I guess sort of.
But you have to sit and think about all these things, so it’s not so obvious. There are people like Ulrike Zeshan at, I think she’s at Lancaster University who work on databases specifically targeted at things that have to do with sign. I think, yeah, there’s some overlap, but probably they need their own treatment.
Daniel: How about you, Daan, what do you think?
Daan: Yeah, for natural language processing, I think there’s actually been increased interest over the last few years in tackling this topic. There have been some folks, I think at the University of Zurich, that recently put out a paper saying, “Hey, everybody should look at this more, and here are some interesting ways that you could do it.” The main challenge, in addition to what Hedvig said about phenomena that don’t really occur in spoken languages in a way you can easily map to, has been that historically, Natural Language Processing has really focused on text-based tasks. So you’ve got everything in text. It’s all encoded in this system called Unicode, so it’s easily parsed by a computer and it’s all pretty straightforward to handle. Sign content, of course, has all sorts of extra signals that you don’t have in text. If you think about text, it’s a very lossy compression, so to speak, of the speech signal. You don’t hear somebody’s accent when you read text, necessarily. I mean, sometimes you can imagine it in your mind, but it’s not there.
Whereas if you have sign language, it’s obviously got a video signal, not just audio. It’s a very rich signal, which varies a lot from one signer to the next. I mean, it’s all the same sign language if they’re signing in the same language, but people’s hands have different shapes, and so all of those things are rather outside the area that Natural Language Processing has historically been tackling. What’s been happening recently, and this is maybe a great topic for a follow-up episode, is that some folks did some work to translate visual recordings of sign language into SignWriting, which is this writing system for sign language that is actually encoded in Unicode. That then makes it so that you can actually do NLP again using all the standard techniques, because now it’s in Unicode, and you can go that way and tackle it.
I forget exactly who the authors were, but I think they were at the University of Zurich. I can send you the details, but it’s quite interesting, because if you can translate it into SignWriting in Unicode, then you can apply all the usual techniques. There’s also a project funded by the European Union, actually, which is doing quite a bit of work on enabling sign language translation. Today, Google Translate doesn’t have any sign languages, just because it hasn’t been quite as easy to do with the existing techniques. Obviously, you do want to produce something that works well enough that you want to include it in the product, rather than saying, “Hey, here’s some prototype that we’re testing.” Of course, we’ve done some work on prototypes and so on, but it does seem to be converging on an approach that, you might imagine, over the next few years will actually make it so that you can fold this in and get reasonable accuracy and everything you need to actually produce a product.
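A minimal sketch of why getting sign content into SignWriting matters for NLP: the Sutton SignWriting block sits at U+1D800..U+1DAAF in Unicode (range per the published code charts), so once content is written that way, it is ordinary text to a computer.

```python
import unicodedata

# Once sign language content is written in SignWriting, it is "just text":
# the Sutton SignWriting block lives at U+1D800..U+1DAAF in Unicode, so the
# usual string machinery applies. (Block range per the Unicode code charts.)
SIGNWRITING_START, SIGNWRITING_END = 0x1D800, 0x1DAAF

def is_signwriting(char):
    return SIGNWRITING_START <= ord(char) <= SIGNWRITING_END

sample = "\U0001D800\U0001D801 hello"   # two SignWriting symbols plus Latin text
for ch in sample:
    if ch.isspace():
        continue
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {name!r:40} signwriting={is_signwriting(ch)}")
```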
Daniel: That sounds amazing. The thing is happening where my head is spinning over the sheer breadth of languages that we’re looking at. I think it’s time to bring it back down for some swearing. This story was suggested by Ben Not the Host One, PharaohKatt, and Jack. This is work from Ryan McKay, who’s a psychologist at Royal Holloway, University of London, and a team, published in Psychonomic Bulletin & Review. Let me ask, and this could be true for any language that you speak: think about all the swear words you can think of in your language.
Hedvig: Got it.
Daniel: Which sounds do you think are very likely to appear? Do you have any guesses?
Ben: What’s the linguistic word for the phonemes that make really hard, interrupted sounds, like ‘kah’ and ‘gah’ and that sort of thing?
Daniel: They are plosives.
Hedvig: Plosives.
Ben: Plosives. I’m going to guess the swearing is just motherfucking full of plosives. Just plosives out the wazoo.
Daniel: Okay. Daan and Hedvig for the non-English view.
Daan: Well, I think both of us are going to give a Germanic-focused view still. I would be curious to ask somebody who speaks… I mean, if I think about Dutch, I mean, it’s true for the loanwords that we’ve borrowed from English, because that’s a very popular way to swear in Dutch, but there are also Dutch words that don’t necessarily follow that convention. Again, Dutch is very close to English and if you don’t mind me saying so, Hedvig, the same is true for Swedish-
Hedvig: Oh, yeah, very.
Daan: -at least on a linguistic scale of diversity. So, I would be very curious, actually to ask somebody that spoke a completely unrelated language and that maybe [unintelligible 00:40:36] to keep it to one European language that’s isolated, unrelated to any other language. I’d be curious what they think.
Hedvig: I would guess that it would be like a lot of fricatives and approximants. Like ‘jah’, ‘zah’, I don’t know why, but most of the words I could think of had that in it.
Ben: Were you thinking about swearing in French? Like that scene of The Merovingian in the second Matrix movie where he’s like, “I’ve sampled all of the languages and French is the best.”
[chuckles]
Daniel: Let’s talk about approximants for a second. Hedvig, since you mentioned them. We have four that we use in English. They are ‘ul’, ‘er’, ‘wa’, and ‘ya’. The reason why they’re called approximants is because you don’t contact bits of your mouth, like with ‘kah’ and ‘puh’. Instead, they brush through, like ‘wul’, ‘ur’. Now let me go to a different question. This is a not specific to English question, because I’m going to give you two words that have just been made up, and I want you to tell me which one you think might be most likely to be the swear word. Obviously, neither one is because they’re just made up, but yemic, chemic. Between yemic and chemic, which one is the swear?
Ben: Chemic.
Hedvig: Chemic.
Daan: Yeah.
Ben: Definitely chemic.
Daan: Agreed.
Daniel: Yeah. Okay.
Ben: [laughs]
Daniel: Here’s what McKay and the team found. They did not find that plosives were that common in swear words.
Ben: Ah, boo.
Daniel: Maybe in English a bit, but not in other languages, like Hebrew, Hindi, Hungarian, Korean (there’s your isolate), and Russian. What they did find was that swear words are not very likely to have those approximants like ‘ul’ and ‘er’ and ‘wa’ and ‘ya’.
[laughter]
Ben: I like the idea that when you’re angry, you just can’t have it. You can’t nearly make a sound, you’ve got to really make a sound.
Daniel: You’ve got to make a sound. In fact, if you want to make a minced oath in English, if you don’t want to say fuck, like fucking, you can add an approximant, friggin.
Hedvig: Fudge.
Daniel: Er, or fudge and suddenly you’ve got a non-swear.
Ben: Ah. It mellows it out.
Daniel: It seems to.
Ben: It’s got a moderating effect.
Daniel: That was an interesting study that looked at a few different languages. Not a thousand or two thousand, but just a few. There do seem to be some interesting patterns there.
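A rough sketch of the pattern in the McKay et al. study and of the minced-oath point Daniel makes: swear words tend to lack the approximants /l r w j/, and softened forms often add one back in. This version checks spelling rather than phonemic transcription, so it is only a crude illustration.

```python
# A very rough, orthography-based sketch of the pattern in the study: swear
# words tend to avoid the approximants /l r w j/, and minced oaths often add
# one back in. A real replication would work from phonemic transcriptions,
# not spelling; this is only an illustration.
APPROXIMANT_LETTERS = set("lrwy")

def has_approximant(word):
    return any(letter in APPROXIMANT_LETTERS for letter in word.lower())

pairs = [
    ("fucking", "frigging"),   # swear word vs. its minced oath
    ("damn", "darn"),
]

for swear, minced in pairs:
    print(f"{swear:10} approximant={has_approximant(swear)!s:5}  "
          f"{minced:10} approximant={has_approximant(minced)}")
```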
Ben: There you go. I’m a bit bitter about the plosives. I was sure I was onto a winner with that one.
Daniel: I know. In English, you’re right, but not cross-lingually. I thought, “Why is this a story?” Because this one’s all over the place.
Ben: Because swearing is heaps of fun, obviously.
Daniel: That’s in my notes.
Ben: [chuckles] It’s because of rules. It’s like, “Why is pizza good?” “Because it’s pizza, bro.”
Daniel: But also.
Hedvig: Swearing and things like that are fun in linguistic research because it’s something that is less… it’s not really in dictionaries and people don’t really write it. So, you learn it by context and talking. So, you can get these fun patterns that you don’t get with other words. There’s a lot of snow here in Europe right now, and in Swedish, there’s a verb for taking snow and putting it in someone else’s face.
Ben: Snow shower?
Hedvig: Yeah, sure. But we have a monomorphemic word, [unintelligible [00:43:41]. But that word doesn’t really occur in text much. So, it’s subject to a lot more fun regional variation than other words, because your teacher is not going to be like, “And now, we’re going to learn how to spell this word.” They’re just going to shout at you, “Don’t do that.” I think it’s similar for swear words: you get fun patterns because you never get that kind of normative teaching. You get all the taboo, and people tell you not to swear, but you don’t get taught how to swear. You just learn it by observing.
Ben: No one sits you down and be like, “No, no, no, it’s motherfuck-er.” [crosstalk]
Hedvig: Yeah, exactly. Which I think makes this taboo language fun.
Daan: I mean, it’s pretty much the only place, I think, in the English language where infixation occurs as a grammatical phenomenon, right, Hedvig? I can’t really think of any others. You’re the typologist here, but…
Hedvig: In English, I can’t think of any other. What Daan’s referring to is like “abso-fucking-lutely”. But I am worried, because we’re always saying abso-fucking-lutely. I’ve never heard one that isn’t “absolutely” plus “fucking”.
Ben: Okay, hold on. Hold my beer.
Hedvig: Yeah.
Daniel: [laughs]
Hedvig: The Australian is joining the chat.
Ben: Hold on.
Daniel: I am so sick of [crosstalk] fucking.
Ben: Oh, no. It’s got to be a word that you break up, doesn’t it?
Hedvig: Yep. It’s got to be a word.
Daniel: And it’s got to have a heavy bit at the beginning. So, serial doesn’t work. Seri-fucking-al, no.
Ben: Tre-fucking-mendous.
Daniel: Uh, maybe.
Hedvig: Tre-fucking-mendous, I think, is pretty good. But you had to really press to get that one.
Ben: Oh, dear, I had to stretch. I had to limber up the old improv muscles on that one.
Daniel: How about a suburb like Kalamunda? “I am so sick of going to Kala-fucking-munda.”
Ben: Yeah, that works great.
Daniel: Okay.
Ben: That’s clean. I like that.
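A minimal sketch of the rule the hosts are converging on: the expletive slots in immediately before the word’s main stressed syllable, which is why Kala-fucking-munda works and seri-fucking-al does not. The syllable splits and stress positions below are hand-annotated for illustration, not pulled from any dictionary.

```python
# A minimal sketch of the infixation rule discussed above: the expletive slots
# in immediately before the syllable carrying main stress. Syllable splits and
# stress positions below are hand-annotated for illustration.
WORDS = {
    # word: (syllables, index of the main-stressed syllable)
    "absolutely": (["ab", "so", "lute", "ly"], 2),
    "tremendous": (["tre", "men", "dous"], 1),
    "Kalamunda": (["Ka", "la", "mun", "da"], 2),
    "serial": (["se", "ri", "al"], 0),
}

def infix(word, expletive="fucking"):
    syllables, stress = WORDS[word]
    if stress == 0:
        # Nothing before the stressed syllable to attach to, so the infix
        # has nowhere natural to go: "seri-fucking-al" just sounds wrong.
        return f"{word} (no good infixation site)"
    return "-".join(["".join(syllables[:stress]), expletive, "".join(syllables[stress:])])

for w in WORDS:
    print(infix(w))
# abso-fucking-lutely
# tre-fucking-mendous
# Kala-fucking-munda
# serial (no good infixation site)
```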
Daan: Oh, you’ve been reading this book that I got on my recent trip to Australia called Shit Towns of Australia. I don’t know if you’re familiar with this excellent reference publication, but it’s been teaching me a lot of Australian slang that I wasn’t previously aware of.
Hedvig: Oh, that’s very good.
Ben: We got some doozies.
Daniel: I think the other reason why we’re interested in this kind of thing is because, on this show at least, we’re fascinated by the idea that language takes the form it does because of some non-arbitrary motivation. We talk a lot about how language is arbitrary, but we also like to know how language is not arbitrary. It matters that language happens in a situation. Linguistics, for a long time, has been trying to abstract away from situations in the world, trying to get to this notion of language in the mind as, like, some Aristotelian ideal or something. What we’re seeing, especially this year on our show, is that language happens in the world: words, if they exist, take the form they do because of stuff that’s going on in the world and the reasons we have for communicating, our relationships, our attention, our cognition, our physical apparatus. If we can say, “Well, swearing takes the form it does because of the way the sounds are,” then that’s cool, and that helps us to ground language in a situation.
Hedvig: Yeah.
Daniel: That article is open access, so we’ll drop a link on the page for this episode, becauselanguage.com.
[interview begins]
Ben: Can we talk about AI now, please? Please? I have been so patient.
Daniel: Yes, let’s do. We are talking with Daan van Esch about the state of computational linguistics, AI, NLP and all that. We’ve talked a bit about your work with Google Translate, and we’re going to talk more about your work with a transcriber called Elpis. Just to change gears, I want to talk about something that has got a lot of people excited and nervous. It’s called ChatGPT by OpenAI, and humans are using it to generate text that is very fluid, very plausible, but factually, not all there. Like we said, Ben, it’s the tropiest of tropes.
Ben: Yeah. Just to be clear, for our listeners, this was the piece of software that was… I saw a lot of articles about it being used for plagiaristy kind of reasons. But I thought the more interesting application, if I’ve got the right software here, was writing, like fiction writing.
Daniel: Okay.
Ben: There is a massive cottage industry of direct-to-eReader authors out there who are just beavering away, releasing like one or two novels a year, and they’re starting to use this software to essentially augment their efforts. Then, they go back and just do a final pass and go, “Yeah, most of this is pretty usable. I’ll just put a little flair here and there.” And that blew my mind. I was like, “Whoa,” it’s crazy. If we’re talking about that, that’s the kind of software we’re talking about, where you can write like a human being by providing relatively small amounts of input.
Daniel: For ChatGPT, you wouldn’t use it to write a whole novel. It only gives you a few paragraphs back. But what comes out of it is much better than other text generators. Everybody’s been playing with it. I was just wondering, anybody have any favorite examples that you’ve seen of what ChatGPT is doing? I’ve got a couple but do you have any?
Daan: I haven’t actually been playing with it myself. I think you need to have an OpenAI account. My favorite thing with any of these kinds of models… ChatGPT only just came out before this was recorded. My favorite thing is always to see where they go rogue. I don’t know about ChatGPT specifically, but for the previous generation, I recall a great talk by Anna Rogers, who was talking about, “What if you asked it how the Egyptians carried the pyramids across the Golden Gate Bridge?” Because that system is basically designed to just produce fluent, human-like text, it doesn’t actually have that factual knowledge. Coming back to what you were just saying about how language is grounded in the real world, it’s grounded in context, it doesn’t have the knowledge that says, “Oh, the Egyptians never carried the pyramids across the Golden Gate Bridge.” It’ll just say, “Oh, it took them a lot of manpower to carry the pyramids across the Golden Gate Bridge,” because that’s what a grammatical sentence would be.
Similarly, actually, the introductions that ChatGPT wrote here sound like reasonable statements. They are grammatical sentences. It’s very different from what you used to get maybe 5, 10 years ago, when these systems would generate completely ungrammatical sentences. It all sounded very reasonable. Like, Hedvig has a passion for words. Is that actually the case? That’s a very different question. I think you have to take those two topics and think about them separately. Is it grammatical? Is it English? Does it look like what a human might produce? Versus: is it actually grounded in the real world?
Ben: This is the thing that I’ve been dying to explore on the show, because when all of this stuff really blew up, whatever it was, maybe six months ago with LaMDA and GPT-3, and people being like, “AI, it’s a thing. It’s sentient.” And a bunch of very sensible people, like yourself, Daan, were just like [sighs].
[laughter]
Ben: I could see the longest of long-suffering sighs from you guys. One of my favorite quotes, I’ve got it here. It amounts to, “LaMDA is a really good autocomplete.” That’s what it does. But like you said, if you give it something that requires any kind of understanding of anything, the whole thing falls apart.
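A minimal sketch of the “really good autocomplete” point, using the Hugging Face transformers library and the small, older GPT-2 model rather than LaMDA or ChatGPT: given an impossible premise, it simply continues the text fluently.

```python
# "A really good autocomplete": a small language model will fluently continue
# any prompt, true premise or not. A minimal sketch with the Hugging Face
# transformers library and the (much smaller, much older) GPT-2 model;
# nothing here is specific to LaMDA or ChatGPT.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The Egyptians carried the pyramids across the Golden Gate Bridge by"
result = generator(prompt, max_new_tokens=40, do_sample=True, num_return_sequences=1)

# Expect something grammatical and confident-sounding; don't expect it to
# object that the premise is impossible.
print(result[0]["generated_text"])
```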
Daniel: Well, I have just typed in, “How did the Egyptians carry the pyramids across the Golden Gate Bridge?” ChatGPT answers, “I’m sorry, but that is not possible.”
Ben: Oh, someone’s trained it.
Daniel: I can do something else. “The pyramids were built in Egypt thousands of miles away from the Golden Gate Bridge, which is in San Francisco, California.” Shall I try something else?
Hedvig: Yes, you should. Someone has trained it on that bloody thing.
Daan: This one will absolutely be good.
Ben: Let Daan do it. He’ll be so good at breaking these.
Daniel: Okay. Go ahead.
Hedvig: Red squirrels are very good at scuba diving.
Daniel: How do red…
Hedvig: Squirrels scuba dive? There you go.
Daniel: [typing] “How do red squirrels migrate by swimming across the Atlantic?” How do red squirrels migrate by swimming?
Ben: [crosstalk] -what Hedvig said, but I am interested to see where it goes.
Daniel: “Red squirrels do not migrate by swimming across the Atlantic Ocean. Red squirrels are a type of tree squirrel that is native to Europe and Asia, and they are not found in North America. They do not have the physical abilities or adaptations necessary to swim long distances across open water, and they do not migrate across the Atlantic Ocean.”
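For reference, the same prompts Daniel is typing into the web interface can be scripted against OpenAI’s chat completions API with an account and API key; a minimal sketch, where the model name is an assumption for illustration and the client interface may have changed since this episode was recorded.

```python
# The prompts typed into the web interface can also be sent through OpenAI's
# chat completions API (an account and API key are required). The model name
# below is an assumption for illustration; available models and the client
# interface may have changed since this episode was recorded.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user",
         "content": "How do red squirrels migrate by swimming across the Atlantic?"},
    ],
)

print(response.choices[0].message.content)
```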
Ben: See, this is why everyone is falling into the gullibility gap of these pieces of software. Do you guys remember the thing that did the rounds on the internet a little while ago about bears and bins, like trash receptacles? Do you guys remember this?
Daniel: No.
Hedvig: They’re good at opening them.
Ben: Okay, several months ago, there was this very viral thing that did the rounds, which was, this was a park ranger saying, “One of the issues with designing bear proof bins for our parks is there is considerable overlap between the smartest bears in the park and the dumbest tourists in the park.”
Daniel: Whoa.
[laughter]
Ben: I feel like, in these situations, like we’ve just done, it’s so easy for us to be like, “Daddy? God, are you there?” When we see these things that do a thing like a human, it’s so easy to fall into this trap of, “Well, ChatGPT is clearly sentient, and we need to give it rights and a special little island.” [crosstalk]
Hedvig: No, no. It doesn’t get rights. I’m going to be the first against the wall.
Ben: [crosstalk] -going to be first. When they come for [crosstalk] Hedvig first.
Daniel: I did this. I said, “Why did Harold Holt disappear?” Now, if you don’t know who Harold Holt is, I’ll read the first paragraph. “Harold Holt was the 17th Prime Minister of Australia, serving from 1966 to 1967. In December 1967, he disappeared while swimming near Portsea, Victoria, and was presumed drowned. Despite an extensive search, his body was never found.” Can I just emphasize that a prime minister of a major country died?
Ben: Daniel, I feel like you should know. They all die.
Hedvig: Australians have the awful comedic sense of naming a swimming pool after this man.
Ben: So good. So good.
Daniel: That’s so Australian. I said, “Rewrite the last question, but in the style of a Shakespearean sonnet.”
Ben: Okay.
Hedvig: Okay.
Daniel: “Oh, why did Harold Holt disappear? A mystery that plagues the minds of men. He was the Prime Minister of fair Australia but vanished while swimming near Portsea.” All right, that didn’t rhyme, but let’s go on. “Some say ’twas suicide, some shark or beast. Others claim a government dark conspiracy. But most believe he drowned in ocean’s feast, his body lost to the deep, dark, briny sea. Though theories abound, we’ll never know the truth of what happened on that fateful day. When Harold Holt disappeared, a blow to Australia and the world in some small way. His legacy endures, though he is gone, a leader lost but not forgotten.”
Ben: Okay. Well, all we can say here is that clearly Shakespeare was doing something less hard than fucking… I don’t know, the Wheel of Time author or something. Like we can’t get the AI to come up with stuff fresh, but it can do a Shakespearean sonnet. Yet another piece of evidence that Shakespeare was a fucking putz.
Daniel: Or an AI.
Hedvig: That is Ben speaking. I know sometimes people have a hard time telling voices apart. That’s Ben speaking.
Ben: [laughs] I love that you were like, “AI gets no rights. Whoa, whoa. Oh, hang on. We’re swinging at Shakespeare. I really need [unintelligible 00:55:43].”
Hedvig: I am married to a British man, and so is my sister.
Daniel: The same one?
Hedvig: We…
[laughter]
Hedvig: Fuck off. [laughs]
Daan: This is where you need those logophoric pronouns, right?
Hedvig: Yeah.
[laughter]
Daniel: Yeah, exactly.
Ben: So, look, we have strayed, and I do want to get back to where Daan’s real working life and area of expertise comes in. What I would ask is: where do you, in your world, Daan, see things like GPT fitting in? Are you seeing stuff like this and wondering to yourself, “Oh, well, that’s going to be a really handy interface for the stuff that I build”? So that you use, say, ChatGPT to plug into what it is you do: a person can just make a natural language request, then your software does all the work it needs to do, and then ChatGPT reconfigures that to be useful.
Daan: If you look at the types of things that I’ve been working on, it’s things like spellchecking and autocorrection on your keyboard, transcription of audio recordings into text, translation from one language into another. You can imagine, perhaps especially for translation, that there might be some benefit to something that can generate such incredibly fluent sentences. At the same time, you always have that concern of, “Hey, is this thing actually translating things properly, or is it just generating something that sounds reasonable and looks fine but isn’t actually a translation?” Especially in a situation where you don’t have people who know both languages using it, that’s tricky. Like Hedvig said, when you use it to translate between English and German and you know German, that’s great; it’s a great aid. As of right now, I can’t really tell you, “Oh, this is how it will revolutionize, let’s say, autocomplete on your phone,” because, if you look at your phone, that thing has a pretty small processing chip, like a CPU. It’s got a battery that doesn’t have too much power. I mean, I’m lucky if my phone lasts more than a day. If you look at ChatGPT, nobody knows. They haven’t published any details, as far as I know, but that’s absolutely running on a much, much, much bigger compute cluster than your phone.
Daniel: Do you have a sense of why ChatGPT, for example, is just so much more fluent than older approaches? Is it just a matter of much more data, or are they doing something else?
Daan: It’s very hard to speculate on what exactly it is that they did. They haven’t published too many details, as far as I know. But overall, there’s obviously been a lot of work as people have been building these language models; people have been looking at, “Okay, well, what are common failure scenarios?”, like the Egyptians carrying the pyramids across the Golden Gate Bridge, and people publish datasets with those examples. What you then possibly do is take those datasets, show them to the model, and say, “Hey, this is wrong,” and eventually, the more you put in, the better it gets.
It’s also interesting that there are specific things that people have found out work well, and then everybody uses those examples. You were just using Shakespearean sonnets. That’s something a few people found out, “Hey, this works,” and they posted it on Twitter or Mastodon or wherever, and then everybody starts generating Shakespearean sonnets. But it might actually be terrible at other things, and it’s not funny to post, “Ha, ha, here’s this thing I tried, like [crosstalk] sonnets. I don’t know. It doesn’t work.” Nobody does that. There are always these sweet spots where people say, “Oh, this thing actually works,” and then everybody starts doing that. But yeah, I really don’t know. I mean, I’m going to be very curious to read the paper on how they did this, once they publish that preprint that they presumably will eventually put out.
Daniel: I want to ask a question that listener, Bianca, has asked, and this was a while ago. She says, “I’m Bianca, a language and idea enthusiast, currently in high school. I’m thinking about studying linguistics in college and pursuing a career in NLP computational linguistics. How true or untrue is it that this field is already taken care of? Would really appreciate an episode with someone involved in NLP. Thanks, and love your show. Listening all the way from Costa Rica.”
Ben: Wow. What a cool question. What a cool high schooler. I wish I had more of them in my class. Wow, what a wicked… Bianca, I want you to know that you are almost certainly the coolest person in your school. Even if it doesn’t feel like it, you definitely are.
Daniel: When I was starting my PhD a long, long time ago, the field looked a lot different in NLP. I mean, neural nets were these things that we used back in the 90s, and then we had better techniques like decision trees and support vector machines. Now, of course, neural nets are back with a vengeance. Automatic speech recognition was terrible. Text generation was terrible. OCR wasn’t even a solved problem. Chat bots were awful. Machine translation sometimes had hand-built rules. Now we’ve got Siri, we’ve got Google Translate, we’ve got large language models. A lot of stuff is done. What’s left?
Daan: I wouldn’t really say that it’s done, though. I think there are products that work well enough now that you can use them in lots of situations, but there’s a ton of work to be done, both in terms of improving the quality of the experiences that are already there, and also for the lots of languages that are not supported at all. We are just starting to break ground on getting translation to work between a thousand languages. That would still leave thousands more languages, each of which has its own unique context and grammatical patterns. There’s lots and lots of work to be done. I mean, I’m expecting to work on this pretty much for the rest of my career. There are enormous amounts of work that will need to be done. You’ve got all these shiny enhancements and nice changes that are made, but there are very practical problems that need to be solved.
One question that a lot of people are actually thinking about now is: eventually, you run out of straightforward data that you can use. How can you use something like Hedvig’s project, Grambank, to actually teach Google Translate certain grammatical structures? Maybe it can infer them from the text data that it sees. But if you had some structured database that said, “Hey, here’s some information on this language. It doesn’t have pronouns before verbs,” like, it has pro-drop, great. Now we don’t need to infer that from a text corpus. We can focus the energy of that natural language processing model on other things. There are quite a few people looking at that sort of thing: how do you fuse so-called traditional linguistic analysis with these NLP approaches? Nobody knows the answer. If you’re in high school right now, certainly going into university, there are lots of courses where you can learn about all of this. There will be plenty of work to do a PhD on, even a few years from now, I’m sure.
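To give a concrete flavour of what feeding a structured typological database into an NLP system could even look like, here is a toy Python sketch. It is not anything Daan described building; the CSV file, the language code, and the feature names are all made-up placeholders.

```python
# A toy sketch (not anything from the episode) of turning typological feature
# values from a Grambank-style database into a conditioning vector that an
# NLP model could take as extra input. File name and feature codes are made up.
import csv

def load_typology(path: str, lang_code: str) -> dict[str, int]:
    """Read binary feature values for one language from a CSV with
    columns: language, feature, value."""
    features = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["language"] == lang_code and row["value"] in {"0", "1"}:
                features[row["feature"]] = int(row["value"])
    return features

def to_vector(features: dict[str, int], feature_order: list[str]) -> list[float]:
    """Known features become 0.0 or 1.0; unknown ones become 0.5, so a model
    can distinguish 'we don't know' from 'the language lacks this'."""
    return [float(features.get(name, 0.5)) for name in feature_order]

# Hypothetical example: condition a model on two made-up feature codes.
feature_order = ["F001_pro_drop", "F002_verb_initial_order"]
vector = to_vector(load_typology("typology_values.csv", "xyz"), feature_order)
print(vector)  # e.g. [1.0, 0.5]; this could be appended to a model's inputs
```

How a translation or speech model should actually consume a vector like that is exactly the open question Daan mentions; the sketch only shows that the structured knowledge is easy to represent, not that we know how to use it well.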
Ben: I would have to imagine, correct me if I’m wrong, not just… I don’t want to say “just”, that’s the wrong turn of phrase, but this isn’t a singularly academic pursuit either. There are plenty of people in the world who are making their way building tools. Digital tools, absolutely, but they’re just sitting down and making stuff to solve these problems. I think on our show, we often really look at the academic side of things because, well, we’re white and we’re highly educated and all those sorts of things, so that’s what we do. But just as viably, you could sit on a bunch of forums, download open-source software and code, and build things, and you’re going to make things that will be useful, that people will use.
Daan: Absolutely. I think, actually, that’s one of the cool things about the field over the last few years. A lot of algorithms, a lot of software has been open-sourced, and that actually makes it possible for people to train all these ML models, NLP models, on their own language or on the language of a community that they work with, which is really quite good, because at the end of the day, that’s where a lot of the expertise is. Folks who are language activists, who run a language community education center or something, they have all that knowledge. So, for them to be able to train these models themselves and then apply them in whatever setting is most appropriate for them, I think that’s the goal of Elpis.
Daniel: Yeah. I really wanted to ask you about your work on Elpis. This is a transcriber for low-resource languages?
Daan: Oh, yeah. This is another project that we’ve been working on. Actually, this is how I know Hedvig. Like we were just saying, there are thousands of languages in the world, and I think the most ambitious academic studies these days cover maybe 2,400. Actually, Hedvig, is there a bigger academic database out there? WALS, maybe? Or do you… [crosstalk] right now?
Hedvig: WALS is slightly larger in number of languages, but we beat them in datapoints because we have more datapoints per language.
Daniel: Nice.
Daan: See.
Ben: Not that it’s a competition, of course.
Daan: No. Who could think such a thing?
Hedvig: Also, one of the WALS editors is a co-creator of Grambank, so we’re all on very good terms. We’re in the same building; we’re all good.
Daan: That’s good. I think you’ve got academic studies looking at a couple of thousand languages, mostly. Maybe tech is starting to get to that 1,000-language frontier. Of course, that means tech is still unavailable in thousands more languages, including languages that linguists and academia are already working on. Actually, what happened with Elpis, which is this project to build speech recognition systems, so, systems that can transcribe speech into text, was that folks at the University of Queensland in Australia reached out to Google in Sydney and said, “Hey, we’ve got all these databases in this archive called PARADISEC with recordings of all these languages.” Maybe, Hedvig, you can say more about that as well.
They said, “Look, we’ve got all these recordings of all these languages and we don’t have a way to transcribe them quickly. We’ve basically got PhD students who are sitting there spending years of their lives manually transcribing all these recordings so they can find those bits of language that they would then want to study further for their grammatical description, for their dictionary,” whatever it might be.
In addition, it makes it quite difficult for a community where people say, “Hey, I would love to hear all the recordings where people talk about this creek, because it’s a site of historical significance. Can we please get all the recordings within PARADISEC where that topic is discussed, so we can listen to them?” That was very difficult, because you’d have hundreds of hours of recordings and you couldn’t easily find the right ones. Basically, the University of Queensland, together with a few other Australian universities, reached out and said, “Is there a way that we could use AI to help with this? Is there a way that we could build such a speech recognition system?” And the answer is yes. You can actually get pretty good results these days with relatively little training data. Historically, to train a speech recognition system for English, you would need a few thousand hours of recordings with transcriptions, and that’s just not available for most of these languages.
Now, with new algorithms that have come up, you can make it work with maybe 10 hours, which is starting to become feasible. It’s been something that we’ve been advising on; I want to give credit to Ben Foley and the folks at the University of Queensland who’ve been driving all of this, but it’s really something that’s starting to pay off. You can start to get pretty good results in languages with very little training data.
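To make the “maybe 10 hours” point a bit more tangible: the usual trick is to start from a model that has already been pretrained on untranscribed audio in many languages and fine-tune it on a small set of transcribed clips. The Python below is a rough sketch of that general recipe using the Hugging Face transformers and datasets libraries; it is not Elpis’s actual pipeline, and the model name, file names, and hyperparameters are illustrative assumptions.

```python
# Rough sketch: fine-tune a multilingually pretrained speech model on a small
# amount of transcribed audio. Not Elpis; file names and settings are made up.
import json
import torch
from datasets import load_dataset, Audio
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2Processor, Wav2Vec2ForCTC)

# 1. A CSV with columns "audio" (file path) and "transcription" -- ~10 hours.
ds = load_dataset("csv", data_files="transcribed_clips.csv")["train"]
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# 2. Build a character vocabulary from the transcriptions themselves,
#    since this language has no off-the-shelf tokenizer. "|" marks word breaks.
chars = sorted(set("".join(ds["transcription"]).replace(" ", "|")))
vocab = {c: i for i, c in enumerate(chars)}
vocab["[UNK]"], vocab["[PAD]"] = len(vocab), len(vocab) + 1
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1,
                                             sampling_rate=16_000,
                                             padding_value=0.0,
                                             do_normalize=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor,
                              tokenizer=tokenizer)

# 3. Start from a model pretrained on raw audio in many languages; only the
#    small CTC output layer is new, so a few hours of data can be enough.
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-xls-r-300m",
                                       vocab_size=len(vocab),
                                       pad_token_id=vocab["[PAD]"])
model.freeze_feature_encoder()

def prepare(example):
    example["input_values"] = processor(example["audio"]["array"],
                                        sampling_rate=16_000).input_values[0]
    example["labels"] = processor(text=example["transcription"]).input_ids
    return example

ds = ds.map(prepare, remove_columns=ds.column_names)

# 4. A deliberately tiny training loop (batch size 1) just to show the shape
#    of the process; a real run would batch, pad, and evaluate properly.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model.train()
for epoch in range(30):
    for ex in ds:
        loss = model(input_values=torch.tensor([ex["input_values"]]),
                     labels=torch.tensor([ex["labels"]])).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

In practice you would also hold out some clips for evaluation and track a character error rate, but the core idea is just this: lots of unlabelled multilingual pretraining, then a little labelled data on top.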
Hedvig: Because when Bianca asks how solved it is: some of these questions are solved for English and for a couple of other European languages and big languages. Like Daan was saying, there are around 7,000 languages in the world, and we are struggling to provide these everyday tools, like spellchecking and autocomplete and things like that, for most of these languages. That’s where NLP is expanding right now, with things like what Daan is doing. The great thing about this thing, Elpis… I’m bad at remembering what the abbreviation of Elpis stands for. Is it “endangered languages” something?
Daan: I believe it’s one of those… what’s it called, a backronym. It’s like the Endangered Language Pipeline Inferencing System, but it’s actually the Greek Goddess of hope.
Hedvig: Oh, nice. Oh, I love that.
Daan: Because it gives hope.
Hedvig: Yeah. For the 7,000 languages in the world, we have grammars and descriptions for about 4,000, and we have collections of texts for fewer than that. What’s known as the transcription bottleneck in linguistics is that you send a linguist to the field to talk with native speakers and learn about their language and work with them, and then they come back with hours and hours of recordings and they have to transcribe it all. A tool like Elpis comes in to give a first draft of what it thinks the sounds are. As far as I know, Elpis doesn’t do any translation. It just listens to the audio you give it and puts down what sounds it thinks it’s hearing. And then you can speed up the documentation process, and you also get text that you can use for tools like spellcheckers and advance those areas as well.
So, when we say it’s solved: maybe it’s solved for certain specific languages, for certain specific tasks, but we want to bring those tools to more people, so that they don’t have to just switch to English to be able to participate online all the time.
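As a small illustration of the “machine first draft” workflow Hedvig describes, here is a sketch that uses the open-source Whisper model as a stand-in recogniser. Elpis’s internals are different, and the file names are placeholders; the point is just the shape of the loop: the model drafts a transcript, the linguist corrects it, and the corrections can become training data for the next model.

```python
# Sketch of the first-draft transcription workflow, with openai-whisper
# standing in for the actual recogniser. File names are placeholders.
import whisper

model = whisper.load_model("medium")  # a multilingual checkpoint

# Produce a time-aligned draft transcript of a fieldwork recording.
result = model.transcribe("fieldwork_session_03.wav")
with open("fieldwork_session_03.draft.txt", "w", encoding="utf-8") as f:
    for seg in result["segments"]:
        f.write(f'{seg["start"]:8.2f}\t{seg["end"]:8.2f}\t{seg["text"].strip()}\n')

# A linguist then corrects this draft instead of transcribing from scratch,
# and the corrected (audio, text) pairs can train a better model next time.
```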
Daan: I wouldn’t even say that it’s solved for English. If you look at, say, translation between English and major languages like Chinese, it still makes mistakes that no human would make. It’s definitely getting pretty good, but I wouldn’t say it’s solved even there. There are lots of open questions, even for major languages like English, Chinese, French. Then, if you look at lots of other languages, literally thousands of other languages, there’s just nothing. It’s not the case that folks don’t want to use technology in those languages. I mean, sometimes that is the case. There are situations where people say, “Hey, this wouldn’t be necessary for our community, or for the way that we use our language. We would prefer that it stays within the community.” But a lot of the time, people have a smartphone, people are trying to text each other, and autocorrect is changing the words that they type in their language into English or some other language. That’s obviously going to require quite a bit of work to straighten out across all the world’s languages.
I would definitely come and join the field. There’s a lot of work to be done, especially if you have an interest in both NLP and linguistics. Because if you just have an interest in NLP, okay, great, you can take some existing dataset and work on it and maybe squeeze out some accuracy improvements. If you’ve got that interest in linguistics, that’s obviously also super interesting; I love reading all this linguistics research. But it’s when you bring these things together that you can make a difference in everyday lives, using linguistics and NLP. A grammar is obviously a crowning achievement of years and years of research on a language.
At the end of the day, publishing a grammar doesn’t make it easier for people to use their language on their phone, for example. What we do is we take that knowledge that linguistics gathers and then combine it with knowledge from the field of NLP, and then you build something in between.
Daniel: If I can synthesize where we are today, it sounds like there are some cool tools coming online that we need to be appropriately skeptical about, because sometimes they’re designed for fluidity but not necessarily for facts. Also, there’s just a lot of stuff still to do, not just for a big language like English, but for languages around the world and their descriptions. And that’s what you both are doing.
Hedvig: Daan is doing that more than me. But specifically for Bianca, writing from Costa Rica: there are indigenous languages of Costa Rica, and some of those communities might be interested in being able to use spellcheck on their phone and things like that. So, if you feel like you want to do something in NLP that’s connected to your immediate environment, maybe that could be a faraway goal as well.
Ben: Massive points to a wonderful question. Well done, Bianca. Okay, can I ask a question?
Daniel: Yes, please.
Ben: Because it looks like Daan’s a little bit… like, for the better part of the 11 years that I’ve been doing this show, I’ve had one abiding metric in my head.
Daniel: [chuckles] I know what he’s going to say.
Ben: I bet you do because I’ve said it like 400 times. Okay, Daan, when do we get the Star Trek computer?
Hedvig: The Universal Translator.
Ben: How far away are we?
Daan: Will you believe that I work in tech and I’ve never watched Star Trek?
[gasps]
Daan: [laughs]
Daniel: You’re going to have to describe it, Ben.
Hedvig: Argh.
Daan: [laughs]
Ben: Okay. No, I’m not one of these people. I am not going to say anything about that.
Hedvig: I am one of these people.
[laughter]
Hedvig: You know what? Daan, you do you, and whatever cool, interesting things you’re into. But one of the hallmarks of Star Trek, as far back as the original series in the 1960s, is that the ship has a computer that they just refer to as Computer. And please, Trekkies, don’t come at me. I know various ships have had AIs that they named. Let’s just stick with the basics here. The Computer is, for lack of a better phrase, a functionally sentient entity that can answer any question, with any level of contextual difficulty, and provide a helpful, relevant answer. It doesn’t just have to be queries of a purely referential nature, either. My question to you is: how far away are we from autocompletes that are that good, and that actually have their fingers in all of the relevant systems to be able to extract meaningful information for us?
Daan: I think it’s the second part, honestly, that will be the largest challenge at the end of the day, because it gets into what Daniel was also saying earlier: language is really grounded in the real world. It’s really grounded in context. When we learn languages as humans, by and large, we do it through, like, a 4D experience. You can touch things, you go around in the world, you can change things, you can pick something up, feel the texture, and so on. It’s very hard to predict the future, and you always get people who say, “Oh, such and such thing will never happen,” or, on the other side, that such and such thing will be done in two years. I don’t really feel you can say that it will be done in, like, two or three years. Can you say that it will never happen? Well, clearly the human brain can acquire language in a certain way.
In theory, it would be possible for something to do that. It’s just that I don’t think we know how to do it. I think it’s reasonably far off, especially for those reasons of grounding and connection with real-world things. You could take today’s speech recognition systems in a language like English, connect them with something like ChatGPT, and then hook that up with a speech synthesis system, so it would take the answers that it generates and produce synthetic speech, an audio recording of that. You could do all of that. I think, if you showed that to somebody from when those Star Trek shows were made, if I understood you correctly, they would probably be amazed that such a thing could even exist. Would I trust it to steer my rocket ship around in space? Yeah, that would be a different question.
I think it’s also a matter of: are you going for something that would be a nice demo, that would really impress people, or is it something where you would say, “Oh yeah, this is going to be used in all these situations”? Think of self-driving cars. People have been working on them for decades. It’s only been in the last few years that you’ve seen them start getting used in a handful of places, and usually in pretty constrained environments. Does that mean that self-driving cars are never going to happen? I don’t think so. A human being can drive a car, so why shouldn’t we be able to build something that could also drive a car? In principle, sure.
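To spell out the chain Daan describes a little more concretely, here is a minimal sketch that wires a recogniser, a chat model, and a synthesiser together. The specific libraries (openai-whisper, the OpenAI Python client, pyttsx3), the model names, and the audio file name are stand-in assumptions on my part, not anything from the episode.

```python
# Back-of-the-envelope sketch of the chain: speech recognition -> large
# language model -> speech synthesis. Libraries and file names are stand-ins.
import whisper
import pyttsx3
from openai import OpenAI

# 1. Transcribe the spoken question.
asr = whisper.load_model("base")
question = asr.transcribe("question.wav")["text"]

# 2. Generate an answer with a chat-style language model.
client = OpenAI()  # expects OPENAI_API_KEY in the environment
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

# 3. Speak the answer back with an offline speech synthesizer.
tts = pyttsx3.init()
tts.say(reply)
tts.runAndWait()
```

Each link in that chain can fail independently, which is part of why an impressive demo is still a long way from a Star Trek computer you would trust to steer your ship.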
Ben: Not only that, humans drive them quite badly as well. We’re classically not great at the task.
Daan: Yes. At the same time, especially when something shiny like this ChatGPT release happens, you always see these predictions breathlessly saying, “Oh, self-driving cars will be everywhere in the next three years,” and I’m like, it takes a long time for any real change to come to the world. That’s assuming the technology is even ready. Think about how long it took for the internet to reach pretty much all around the world. Even today, there are lots of places you can go where you wouldn’t find internet connectivity. Think about how long it took for electricity, when it was invented, to be everywhere. It’s very hard to predict.
Hedvig: Daan, you’re a very sensible person. What’s it like being with Silicon Valley tech types and being this sensible? Because they are always like, “Next year we’re going to solve blah, blah, blah,” and you’re there like, “Yeah, maybe we will, but we’re making good strides and it’s going to be okay.”
Daan: I think actually what you’ll see is a lot of the folks, at least the folks that I know that work in the Valley or work in tech, pretty much all of us, look at it this way. It’s just that there’s always that sort of gap, which I’m sure you know from your respective areas of expertise as well. There’s that gap between what people in the field know and talk about and think about and what gets reported and gets picked up and goes viral. Like, the same way that people say, “Oh, you’re a linguist. How many languages do you speak?” People are like, “Oh, you’re working on AI, when are we going to have…”
Hedvig: [crosstalk]
Ben: When is Skynet… [crosstalk]
Daan: Star Trek Computer or the singularity or whatever, which I don’t even know what it means necessarily. I think it’s one of those things where people in the field, at least the people that I talk to, everybody is very much aware of, “Hey, these are the limitations.”
Sure, there’s the occasional press release, which gets rather breathless, and the rather exciting demos that people put out. But even those things, people are, by and large, generally speaking, from what I know, yeah, very well aware of the limitations.
Hedvig: Pretty sensible, okay. Because yeah, these recent couple of months, we’ve also seen the hopeful dreams of metaverse and everything, and it’s not happening, mate. Anyway, sorry, we shouldn’t start talking about that.
[chuckles]
Daniel: Star Trek Computer, three years. Cool. Thank you.
Hedvig: Thank you, Daan.
Ben: A long time.
Hedvig: Good times.
Daniel: Thank you, Daan. Our guest for this episode has been Daan van Esch. Daan, how can people find your work and find out what you’re doing?
Daan: Just use your favorite search engine. You’ll type my name into it and some NLP will run and figure out, “Oh, here are some webpages that mention this person,” and it’ll show you some links, you can click them. It’s all going to be very old school.
Daniel: Fascinating.
Hedvig: I love it when we actually have a guest who works at Google and he’s like, “Your favorite search engine. Use that.”
Ben: No, you know what? I’m going to back down on this one because what he’s just described is what all of us actually fucking do.
Hedvig: Oh, yeah.
Ben: No one remembers a specific website.
Daniel: You know you’re going to do it. Just do it.
Ben: No one remembers the Twitter handle. Everyone just googles, “What is Daan van Esch’s Twitter handle,” or whatever. That’s the reality of where we live.
Daan: Under the hood, you’re going to have so much NLP that kicks in that sort of figures out, “Oh, okay. I want to find only the non-spam pages. I want to find only the page that mentioned this person. I want to find the thing that you might be looking for.” That’s all NLP under the hood.
Ben: I love it.
Daniel: We’ll also have some links on our webpage, which no one will use. So, just do what Daan said. Daan, thanks so much for hanging out with us today and explaining your work and talking to us. I feel like my mind has been expanded.
Daan: Thanks for having me.
Daniel: Thanks also to Bianca and everyone who gave us ideas for the show. Thanks to Dustin of Sandman Stories, who still recommends us to everyone. The team at SpeechDocs, we love you. Most of all, thanks to our patrons, who give us so much support and make it possible to keep the show going. Thanks to all of you.
[interview concludes]
Ben: [whispers] Go, Hedvig.
Hedvig: Yeah. If you liked our show, there’s a number of ways you can support us. We always love to hear your ideas and feedback. You can get in touch with us in a number of ways. You can follow us, we are @becauselangpod basically everywhere. You can google ‘becauselangpod’ and you’ll find those places. We are not on Spotify because screw Joe Rogan.
Daniel: Still.
Hedvig: You can also leave a message for us on SpeakPipe, and you can find the SpeakPipe, which is a way you can just press record in your web browser and say, like, “Hello, I really like you. Uh, goodbye,” or whatever it is you want to say.
Ben: Oh, my God. Can you please do your best imitation of Hedvig’s-
Daniel: Please do that, we would love that.
Ben: -saying those exact words, that would be fantastic.
Hedvig: [laughs] Yeah, you can do that if you want.
Daniel: We’ll play them all.
Hedvig: That’s on our website, becauselanguage.com. You can also send us an email: hello@becauselanguage.com. We love it when people tell friends or leave reviews. I’m a big podcast listener, and I know that most of my podcasts I get through either other podcasts or through a friend who tells me about something. You can also leave us a review. I like Podchaser because it’s not locked into Apple and anyone can leave a review, which I think is nice.
Ben: You can also do that thing where you become a patron. You will get bonus episodes. You can hang out with us on Discord, which is lowkey actually the best part of being a patron for Because Language, because our listeners are wicked. They’re really good. You’ll be making it possible for us to make transcripts of our shows so that they are readable and searchable. Not unlike the tool that our guest this week was making for languages of an endangered nature, which is pretty wicked. We’re doing a similar thing, but just way less cool.
You, if you’re about to hear your name, bloody hell, you’re a dead set legend. A shoutout to our top patrons: Iztin, Termy, Elías, Matt, Whitney, Helen, Jack, PharaohKatt, Lord Mortis, gramaryen, Larry, Kristofer, Andy B, James S, Nigel, Meredith, Kate, Nasrin, Ayesha, Moe, Steele, Manú, Rodger, Rhian, Colleen, Ignacio, Sonic Snejhog, Kevin, Jeff, Andy from Logophilius, Stan, Kathy, Rach, Cheyenne, Felicity S, Amir, Canny Archer, O Tim, Alyssa, Chris W, Felicity G.
Big news. You can now donate to us on Ko-fi. Who will be our first donor? Will it be you? Will it be a one-time thing? Will you be doing it all the time? I have paid for my entire seat, but I only need its edge. That’s how anticipatory I am. And now for our newest… nope, there are no new patrons. Thanks to all our amazing patrons. That’s all.
Daniel: Our theme music has been written and performed by Drew Krapljanov, who’s a member of Ryan Beno and of Didion’s Bible. Thanks for listening. We’ll catch you next time. Because Language.
Thank you.
Ben: Pew, pew, pew.
Hedvig: Pew, pew, pew.
Hedvig: There’s a little cat who’s looking at me judgy and has been for the last 20 minutes. Come here. Come here, you [unintelligible 01:24:55] cat.
Daniel: Okay, time to run down our beverages.
Hedvig: Can we also run down Ben?
Daniel: I know. Where is he? Anyway, I am having a delightful cola beverage, sugar free and caffeine free because, really.
Daan: Basically, just water, but with a flavor.
Daniel: Mostly yes.
[laughter]
Hedvig: That is the stupidest thing I’ve ever heard of.
Daniel: What?
Hedvig: Okay, maybe I’m wrong. Maybe I would also like the flavor [unintelligible 01:25:26] that much that I would.
Daniel: I’m sure you’ve heard stupider things.
Hedvig: I don’t know.
Daniel: Okay, Ben’s, here. Hi, Ben.
Ben: Hello.
Daniel: How are you doing? Ben, Hedvig said that my drink was stupid. Make her stop. She’s insulting my drink.
Ben: What’s your drink?
Hedvig: Well, it’s cola… [crosstalk]
Ben: Caffeine free, sugar free. Please, Hedvig, you explain.
Hedvig: No, it’s what you think it is. It’s caffeine-free, sugar-free cola.
Ben: That’s fucking stupid.
Hedvig: Thank you.
Daniel: Argh! Why do I even try with these people?
Ben: It takes all of the things that would be nice about the activity that you’re doing and gets rid of them. It’s like if you were to sit down and have the sex and then be like, “By the way, there can be no orgasm or really physical pleasure of any kind.”
[laughter]
Ben: “Oh, and if you wanted to touch people, no, you can’t do that either.” It’s like, “What’s left?”
Daniel: Daan, do you see what I put up with here?
[BOOP]
Hedvig: Because we’re all being recorded on different tracks. So, if we talk over you and you just barge ahead, he’s just going to delete us.
Daniel: No, I’ll move you, if what you say is any good.
[laughter]
Hedvig: Yeah, if what we say is any good. If it’s not any good, we’re just going to get deleted. Such is life.
Daan: Same applies to me, I assume.
Daniel: No, I don’t think so.
[laughter]
[Transcript provided by SpeechDocs Podcast Transcription]