
Automatic transcription is ready for your live events

February 3rd, 2020 Michael Monette


A lot can limit what audience members take away from your live event. Maybe in some sections it’s tough to hear what’s being said on stage due to audio issues or chatty table neighbors. And for people who are deaf or hard of hearing, your event might be totally inaccessible. Happily, there’s a solution to these challenges: live transcription. In some cases it’s even a legal requirement. Question is, do you enlist human help, or machine?

Machine-driven transcription, or automatic transcription, isn’t a new invention. It’s one of many applications for automatic speech recognition (ASR) technology, which has been around for over half a century. ASR technology has come very far over the years. No, transcriptionist hasn’t gone the way of elevator operator or bowling alley pinsetter as a needless human occupation. But with recent advances in artificial intelligence (AI) and machine learning, automatic transcription technology is ready for prime time.

How automatic transcription works

Automatic transcription services link the sounds that make up human speech to words in a digital dictionary. When these sounds have multiple possible matches – homophones, for example, or due to unclear speech or audio – the auto transcription software examines the overall context and assigns each possible word a probability, selecting the word it deems the most likely fit. Deep learning algorithms drive this analysis, informed by a broad range of inputs that vary between solutions.
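To make the idea of "assigning each possible word a probability" concrete, here is a toy sketch in Python. It picks among homophones using only the preceding word as context. The probability table is invented for illustration, and the `pick_word` helper is hypothetical; real services rely on far larger models and much richer context.

```python
# Toy illustration: choose among homophones using the previous word as context.
# The probabilities are made up for this example, not taken from any real model.
CONTEXT_PROBS = {
    ("over", "there"): 0.90,
    ("over", "their"): 0.08,
    ("over", "they're"): 0.02,
    ("lost", "their"): 0.85,
    ("lost", "there"): 0.10,
    ("lost", "they're"): 0.05,
}


def pick_word(previous_word, candidates):
    """Return the most probable candidate given the previous word, plus all scores."""
    scores = {
        word: CONTEXT_PROBS.get((previous_word, word), 0.0)
        for word in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores


homophones = ["there", "their", "they're"]
print(pick_word("over", homophones)[0])  # there
print(pick_word("lost", homophones)[0])  # their
```

The same three sounds resolve to different words depending on what came before, which is the essence of context-based disambiguation.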

The same basic process is at work when you interact with Siri, Alexa, Cortana, or Google, only in this case the system outputs its conclusions as text.

Most automatic transcription solutions on the market today are built for post-production. Some work by having you upload an audio recording. Services of this sort will run your audio file through automatic transcription software and send you the result. Processing typically takes place in the cloud, but local speech-to-text solutions are also available. Of course, post-production solutions like these aren’t suitable for live events, whether it’s a conference, a court hearing, a legislative assembly, a corporate town hall, or a sermon.

Two ways to transcribe live events

If your goal is to deliver subtitles in real time, you have two options:

  1. Hire one or more human transcriptionists (to work on-site or remotely).
  2. Use an auto transcription service capable of analyzing speech and outputting subtitles quickly enough to keep pace with speakers.

Option 1 is pretty straightforward. Working on-site or from home, human transcriptionists capture what presenters are saying in real time. The tricky part is figuring out how to display the text on a monitor, tablet, or other device. Live transcription is a whole different game from working with pre-recorded audio, so you’ll want someone who has a degree or certification in court reporting or captioning to ensure they can keep up.

Option 2 is a bit more complex from a technical standpoint but does offer significant advantages over human-based transcription. You can find live transcription solutions from big names like Google, Amazon, and IBM.

Live transcription set-up

On the surface, AI-driven live transcription doesn’t look all that different from human-based transcription. Imagine a speaker on stage delivering a keynote address. The microphone they’re speaking into is connected to a laptop or other device that’s running cloud-based automatic transcription software. Everything the speaker says is projected through the conference hall speaker system but also sent as audio to the cloud. In the cloud, speech recognition technology matches the various sounds with words in a digital dictionary. The software then sends the text back to display on a monitor so anyone can follow along. The audio going up and the text coming down amount to very little data, so all of this happens very quickly.
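The round trip described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: `fake_cloud_recognizer` is a hypothetical stand-in for a real cloud speech service, and the chunk size assumes 16 kHz, 16-bit mono audio sent in 100 ms pieces.

```python
from typing import Callable, Iterator

# Assumption: 16 kHz sample rate, 16-bit (2-byte) mono samples, 100 ms per chunk.
CHUNK_BYTES = 16_000 * 2 // 10  # 3,200 bytes


def chunk_audio(audio: bytes, chunk_bytes: int = CHUNK_BYTES) -> Iterator[bytes]:
    """Split captured audio into small pieces that can be sent as they arrive."""
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


def fake_cloud_recognizer(chunk: bytes) -> str:
    """Hypothetical stand-in for a cloud speech service; it only reports chunk size."""
    return f"[{len(chunk)} bytes transcribed]"


def run_live_captions(
    audio: bytes,
    recognize: Callable[[bytes], str],
    display: Callable[[str], None],
) -> None:
    """Send each chunk to the recognizer and pass the returned text to the display."""
    for chunk in chunk_audio(audio):
        display(recognize(chunk))


captured = []  # stands in for the monitor the audience reads
run_live_captions(bytes(8000), fake_cloud_recognizer, captured.append)
print(captured)
```

Under these assumptions the upload is about 32 KB per second of speech, and the text that comes back is smaller still, which is why the exchange can keep pace with a speaker.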

Subtitles and captions: What’s the difference?

It’s important to note that many automatic transcription solutions generate subtitles rather than captions. A lot of people use these terms interchangeably, but they do refer to slightly different things. Subtitles provide a text alternative for speech or dialogue, whether it’s a translation or in the same language. Closed captions convert speech and dialogue into text, and they also describe background music and sound effects (e.g., a phone or a doorbell ringing).

Why does this matter? It could factor in if your live event must adhere to a set of accessibility standards. Examples include the international Web Content Accessibility Guidelines (WCAG) and, in the United States, the Americans with Disabilities Act (ADA) and Section 508 of the Rehabilitation Act. Some standards distinguish between subtitles and closed captions, requiring the latter to give people who are deaf or hard of hearing a fuller experience. This might not be an issue for a conference, sermon, or any other event where there’s a single person delivering a presentation or speaking. But it’s something to investigate if a mandate is driving your interest in live transcription.

Automatic transcription versus human transcription

As with many things, there’s a bit of give and take when deciding between human and AI-driven transcription. Yes, humans are still better at some things. We’ve all dealt with self-checkout machines that insist there’s an item in the bagging area when there’s no item in sight, only to be rescued by a dutiful (and very human) self-checkout attendant. But machines often win out when it comes to core business concerns like cost and convenience.

We’ll compare human and auto transcription on five key criteria:

  1. Accuracy
  2. Cost
  3. Convenience
  4. Consistency
  5. Privacy

1. Accuracy

Research suggests human transcription accuracy is around 95 percent. That’s one mistake in every 20 words transcribed. Speech recognition researchers are aiming for an error rate that’s on par.
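One common way to put a number on transcription accuracy is word error rate (WER): the substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below is a minimal Python version of that calculation. The sample sentences are invented for illustration, and production tools usually normalize punctuation and capitalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two texts, divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the transcript
                curr[j - 1] + 1,     # extra word in the transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# A 20-word reference with a single wrong word in the transcript.
reference = ("the quick brown fox jumps over the lazy dog and then runs "
             "far away from the noisy crowd near town")
hypothesis = reference.replace("crowd", "cloud")

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}, accuracy: {1 - wer:.0%}")  # WER: 0.05, accuracy: 95%
```

One wrong word in 20 gives a WER of 0.05, which is the 95 percent accuracy figure quoted above.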

Both Microsoft and IBM claim to have met a level of accuracy nearing this with their own speech-to-text solutions. But AI-based transcription doesn’t always fare so well outside the ideal conditions of a corporate laboratory. Background noise, poor acoustics, heavy accents and dialects, specialized vocabulary, and subpar recording equipment can all hamper the accuracy of automatic transcription. In truly unfavorable conditions you might end up with “word salad”, puzzling (or drawing laughter from) anyone in the audience who’s following along.

Humans tend to do better at transcription, particularly when multiple speakers are involved. Machines struggle with this, which may or may not be an issue depending on the nature of your event. (But machines are closing the gap in this regard; see, for example, Google’s AI speaker diarization technology, which will make live automatic transcription of panel discussions and other multi-participant formats possible.)

Don’t discount automatic transcription just yet. Thanks to the deep neural networks that power speech recognition technology, machine-driven transcription is improving by the day. You can prime some solutions before an event to more accurately interpret a specific speaker, potentially dealing with difficult accents or dialects more effectively than a human transcriptionist. With others, it’s possible to add words and terms to the solution’s dictionary to aid recognition. This feature is invaluable for events that involve specialized language and jargon, such as a conference for engineers or medical practitioners. Some solutions can even improve transcription accuracy for industry-specific terms when you specify the relevant North American Industry Classification System (NAICS) codes.

AI’s accuracy edge doesn’t end there. Recall that speech recognition solutions analyze context to help resolve word use ambiguity. Machine-driven live transcription software can make corrections on the fly as a speaker finishes a thought (at the same time giving the system more context to work with). Humans certainly aren’t immune to mixing up homophones or similar sounding words; we may even be more likely to do so when the pressure is on to keep up with speakers. The difference is that human transcriptionists don’t have time to fix these mistakes – unless they’re willing to risk falling behind.
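Here is a rough Python sketch of how on-the-fly correction can look from the display side. It assumes the service returns a stream of interim guesses followed by a final result, flagged here with a hypothetical `is_final` value. The `CaptionDisplay` class and the sample phrases are invented for illustration.

```python
class CaptionDisplay:
    """Keeps finished caption lines and one provisional line that can be revised."""

    def __init__(self):
        self.final_lines = []
        self.interim = ""

    def update(self, text, is_final):
        if is_final:
            # The service has settled on this wording: lock it in.
            self.final_lines.append(text)
            self.interim = ""
        else:
            # Overwrite the previous guess instead of appending to it.
            self.interim = text

    def render(self):
        lines = self.final_lines + ([self.interim] if self.interim else [])
        return "\n".join(lines)


# Invented sequence of results for one sentence; the guess improves with context.
results = [
    ("I scream", False),
    ("ice cream is", False),
    ("Ice cream is on the menu.", True),
]

display = CaptionDisplay()
for text, is_final in results:
    display.update(text, is_final)

print(display.render())  # Ice cream is on the menu.
```

Because each interim guess replaces the previous one, the audience sees the early mistake disappear once the system has heard enough to correct it.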

2. Cost

Live events can be expensive affairs. The costs of venue rental, catering, and travel and accommodations for guest speakers can leave little in the budget for much else. This can present problems if you’d like to (or must) provide live subtitles for audience members.

Human transcriptionist pay rates and pay models vary. Some transcription services charge by the minute, others by the hour. Transcriptionists who can keep up with live speakers will command a higher price than those who work with audio files or videos. Travel expenses might factor in if the transcriptionist isn’t local and needs to be on site. Fees can also be tied to on-site time rather than transcription time, in which case you’re paying the transcriptionist even when the show isn’t on (e.g., during lunch and networking breaks). And if a session runs long? That’s right: overtime fees.

Whatever the case, transcription fees can really climb when you’re relying on human help, especially when your event takes place over multiple days or includes sessions that run in parallel. When budgets are tight, organizations sometimes have to limit subtitles to select speeches or sessions. This can put event planners in an uncomfortable position, as guest speakers may wonder why their talk isn’t important enough to ensure it’s accessible to everyone.

An automatic transcription solution can help you avoid issues like these. AI-driven services still charge transcription fees, but these are significantly lower than the average pay rate for a human. You can run the service only when there’s transcribing to do. And with the lower cost of AI-based transcription, it’s less likely you’ll have to pick and choose which sessions will feature subtitling. The potential savings are even more impressive if you hold or produce multiple events a year.
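To see how the two billing models differ, it can help to plug your own quotes into a quick calculation like the Python sketch below. Every number in it is a placeholder chosen for easy arithmetic, not a market rate, and the function names are hypothetical. The point is the structure: human fees often scale with on-site hours and head count, while automatic services typically bill only for the minutes actually transcribed.

```python
def human_cost(on_site_hours, hourly_rate, transcriptionists=1, travel=0.0):
    """Fee when billing is tied to on-site time, breaks included."""
    return on_site_hours * hourly_rate * transcriptionists + travel


def automatic_cost(speech_minutes, per_minute_rate, parallel_sessions=1):
    """Fee when billing covers only the minutes actually transcribed."""
    return speech_minutes * per_minute_rate * parallel_sessions


# Placeholder inputs only: replace them with real quotes before drawing conclusions.
human = human_cost(on_site_hours=8, hourly_rate=100.0, transcriptionists=2, travel=300.0)
auto = automatic_cost(speech_minutes=300, per_minute_rate=0.25, parallel_sessions=2)

print(f"Human: {human:.2f}  Automatic: {auto:.2f}")
```

In this made-up scenario, two parallel tracks need two transcriptionists for the full eight-hour day, while the automatic service is billed for five hours of actual speech per track.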

3. Convenience

It’s not always possible to bring in human help for live captioning or subtitling. Maybe you’ve scheduled a meeting with short notice and you’d really like to send participants away with a transcript for review, or it’s Sunday and the volunteer who usually subtitles your sermons is swaddled in bed with a nasty head cold. Perhaps there are other conferences happening at the same time as yours and no transcriptionist with the right skill set is available. And what happens if the transcriptionist you hired can’t make your event because they’re sick or their flight gets delayed to the next day?

No need to worry about any of this with AI-based transcription. Machines don’t have busy professional lives like people do. At a moment’s notice, you can set up your automatic transcription service and it’ll do its thing. You can test it before the event to gauge accuracy, which is difficult to do with humans (and potentially costly). You can even customize it to recognize industry-specific words.

Automatic transcription services are more flexible as well. Many support multiple languages, eliminating the need to search for a transcriptionist with the right knowledge.

4. Consistency

Transcription ability varies widely between people (a matter of experience, most often). Performance can vary in the same individual, too – for example, if the person you hired slept poorly the night before your event.

This variability is cause for concern. Will the person you hired (or their replacement) be up for the task? Will they be at their best on event day? Are they familiar enough with the subject matter? No such trouble with automatic transcription services. Of course, environmental factors like background noise and the quality of the AV equipment you’re using will affect the software’s ability to transcribe speech. But with these things controlled, you’ll get consistent transcription from one event to the next.

5. Privacy

Transcripts are great for anyone who missed the big meeting and a convenient reference for anyone who was there. But what if that meeting included discussions about unpatented technology or other company secrets? No business wants outsiders privy to such things, but it can’t be avoided if you’re bringing in an external transcriptionist to caption or subtitle the event. Non-disclosure agreements are a thing, though you can never be too careful; leaks happen all the time.

Opting for an automatic transcription service will reduce such privacy risks. It won’t necessarily eliminate them, since many services send audio to the cloud for processing. Even so, with no outside person in the room, the risk of a leak is generally lower, which makes AI-driven transcription the way to go for subtitling private events.

Get the best of automatic transcription today

Automatic transcription is a feasible alternative for live subtitling of conferences, meetings, and other events – under the right conditions. Epiphan LiveScrypt makes it easier than ever to get those conditions just right. Powered by Google’s advanced speech recognition technology, LiveScrypt features professional audio inputs (XLR, TRS) so you can capture crystal-clear audio conducive to accurate AI transcription. Our automatic transcription solution also includes HDMI and SDI inputs, a built-in screen for configuration, and a QR code system for easy streaming. These simplify setup for auto transcription and make for fewer points of failure.


Learn more about how Epiphan LiveScrypt can help you maximize the benefits of today’s automatic transcription technology. And if you have any questions, just ask our product specialists.
