Video production

11 machine learning applications for video production

March 15, 2018 Jordan Sheldrick


Machine learning is the latest digital transformation, helping streamline and personalize the user experience for all kinds of technology. This type of computing algorithm essentially “learns” what yields a positive result and what doesn’t, and continuously improves itself based on this collected data.

An everyday example of machine learning can be seen in voice recognition systems, such as Apple’s Siri, which gradually improves at understanding and imitating human interaction. Another example can be seen in how video platforms like YouTube or Netflix use your watching history to curate a list of personalized video recommendations. This is machine learning at work.

What does machine learning have to do with video?

Machine learning is an exciting new technology that can be applied almost everywhere there’s an opportunity to improve or automate a process – including live video. There are many opportunities for machine learning applications in live video scenarios of all sizes, including large multi-camera events, smaller single-camera livestreams, and even lectures at educational institutions.

Many different kinds of software programs can adopt machine learning, such as video production apps, video animation tools, and encoding software within live production systems like Pearl-2 and Pearl Mini.

Here are 11 machine learning applications for video production to help streamline and automate the process for technicians, presenters, and viewers alike. While these applications are merely conceptual at this point, we’re excited to explore the possibilities of this technology and how it may soon apply to professional live video gear.

1. Simplified virtual studios

The first in our list of machine learning applications for video production is virtual studios. A virtual studio combines real people and objects with digital, computer-generated environments to emulate a production studio. Virtual studios let you create impressive and state-of-the-art studio productions at a much smaller cost compared to physically building the set. However, configuring a virtual studio requires technical expertise and significant time investment.

Machine learning can streamline the virtual set process by automatically adding digital elements (or removing physical elements) based on detected visuals, such as shapes, depth of field, and static or dynamic images.
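
The classical baseline here is chroma keying: deciding, pixel by pixel, whether a point belongs to the physical set or should be replaced by the virtual environment. A machine learning system could learn the set’s colors and depth instead of relying on a fixed key color. Here is a minimal Python sketch of the classical decision (the key color and tolerance are illustrative values, not from any particular product):

```python
def chroma_mask(pixel, key=(0, 177, 64), tol=60.0):
    """Return True if an RGB pixel is close enough to the key color
    (a typical green-screen green) to be swapped for the virtual set."""
    distance = sum((a - b) ** 2 for a, b in zip(pixel, key)) ** 0.5
    return distance < tol
```

A learned model would replace the fixed `key` and `tol` with values inferred from the visuals it has sensed on the physical set.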

Machine learning possibilities:

  • Sense and learn visual information on the physical set
  • Automatically remove or include virtual elements based on learned visuals
  • Streamline the virtual set creation process


2. Comment integration and aggregation

Live events and video broadcasts are often streamed across several platforms simultaneously, such as Facebook, Vimeo, YouTube, and other content delivery platforms. Viewers watch on the platform of their choosing, and many have a strong preference for one platform over another.

However, this behavior poses a problem: the viewer comments and discussion become divided between platforms. Keeping all relevant comments in one place dramatically boosts engagement levels while making it more convenient for moderators to respond to viewer comments.

Machine learning can be applied to aggregate social media comments into the stream, allow presenters to respond to comments live, and automatically route replies to the appropriate social platform. Machine learning can also simplify the addition of dynamic content to a livestream, such as a Twitter hashtag conversation or a news feed discussion. The code can detect and learn relevant keywords across specific digital media channels and dynamically include the content into the livestream.
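
To make the aggregation step concrete, here is a minimal Python sketch. It assumes each platform’s API has already been polled into a list of `(timestamp, platform, author, text)` tuples; that tuple shape and the keyword list are illustrative, not any real platform API:

```python
def aggregate_comments(feeds, keywords):
    """Merge per-platform comment feeds into one chronological stream.

    Each feed is a list of (timestamp, platform, author, text) tuples;
    the result pairs every comment with a flag marking whether it
    mentions one of the learned keywords (e.g. the event hashtag).
    """
    merged = sorted(c for feed in feeds for c in feed)
    lowered = [k.lower() for k in keywords]
    return [(c, any(k in c[3].lower() for k in lowered)) for c in merged]
```

A machine learning layer would go further by learning which keywords are relevant to the event, rather than taking a hand-picked list.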

Machine learning possibilities:

  • Aggregate viewer comments into the livestream across all video platforms
  • Learn keywords relevant to the live events (e.g. hashtags or the name of the event)
  • Monitor online discussions across specified channels (e.g. Twitter, Facebook, YouTube, news and media sites) and dynamically include content into the livestream

3. Indexing using transcription, visual cues, and OCR

Indexing allows the viewer to quickly locate a desired spot in the presentation or lecture without the need to dig around manually. This is invaluable for longer presentations and lectures where the viewer may want to revisit key moments or essential learning topics in the video.

Machine learning can index a live video using a few different methods:

  • Audio transcription: Audio can be manually transcribed to create indexable text data, but this process costs significant time and human effort.
  • Visual/audio cues: Alternatively, a recorded lecture or live event can be indexed based on visual or audio cues, such as audience applause, a slide change, or a new speaker on the stage.
  • Optical Character Recognition: Optical Character Recognition, or OCR, is a technology that lets you convert a variety of documents, such as scanned paper documents, PDF files, or digital images, into searchable text data. This data can then be indexed, allowing readers to easily locate specific information within a document or media file.

Machine learning can help automate each of these video indexing methods, helping to save tremendous costs by reducing the need for manual transcription. Human operators can instead use their time to verify transcribed/converted text, therefore helping the software learn new words and correct any grammar issues.
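
As an illustration of the transcription-based method, here is a minimal Python sketch that turns transcribed segments into an index of topic timestamps. The segment format and topic list are assumptions for the example, not the output of any particular speech-to-text engine:

```python
def build_index(segments, topics):
    """Build a topic index from transcribed speech.

    `segments` is a list of (start_seconds, text) pairs as produced by
    a speech-to-text engine; the result maps each topic to the
    timestamps where it is mentioned, ready to render as chapter marks.
    """
    index = {topic: [] for topic in topics}
    for start, text in segments:
        lowered = text.lower()
        for topic in topics:
            if topic.lower() in lowered:
                index[topic].append(start)
    return index
```

OCR-based indexing would work the same way, with `segments` built from on-screen text instead of transcribed audio.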

Machine learning possibilities:

  • Convert audio into text and index key points in the VOD based on transcribed text
  • Convert overlays, lower thirds, and other on-screen text into searchable data with OCR and automatically index key points in the video
  • Learn specific visual and audio cues (e.g. applause, detection of a presenter’s face) and automatically create an index entry when cues are detected in the video


4. Intelligent live switching

To create a truly engaging live production or lecture, one must take advantage of live switching to swap between multiple video sources or custom layouts. Doing so helps emphasize the essential parts of the presentation while also retaining the viewer’s attention. However, this switching is typically done manually. Frequent switching may also prove tricky for smaller livestreams with minimal staff on hand, such as vloggers and lecturers. These smaller groups may therefore miss out on a valuable opportunity to create a dynamic live video experience for viewers.

Machine learning can be applied to current encoding technology to help automate the process based on visual or verbal cues, such as presenter movement, gestures, or audience applause. Is a speaker telling a personal anecdote? Switch to the camera view. Is the speaker explaining a concept in the presentation slides? Switch to the slide view. Machine learning can create an engaging switched live production with minimal effort required by the presenter or AV techs.
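
Today this decision logic could be sketched as hand-written rules; a machine learning system would instead learn the cue-to-layout mapping from examples. A minimal Python sketch of the rule-based version (the cue and layout names are hypothetical):

```python
def choose_layout(cues, current="camera"):
    """Pick the next layout from the set of detected cues.

    Rules run from most to least specific; with no confident cue, the
    current layout is kept to avoid distracting flicker.
    """
    if "slide_change" in cues:
        return "slides"      # new slide: show the presentation
    if "presenter_speaking" in cues or "presenter_gesturing" in cues:
        return "camera"      # the speaker is the focus
    if "applause" in cues:
        return "wide_shot"   # capture the room's reaction
    return current
```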

Machine learning possibilities:

  • Learn visual and audio cues for each video source or layout
  • Switch to each video source or layout based on learned cues
  • Help create a fully switched live production or lecture with minimal overhead cost

5. Dynamic image calibration

Live streams and recordings require optimally calibrated picture settings (such as white balance and exposure) to achieve a clearly visible presentation for viewers. Picture calibration can be a tedious process, particularly when environmental factors (such as lighting) are subject to change, or when users lack the expertise to make the necessary adjustments.

Machine learning can streamline the calibration process by detecting current picture settings and making modifications to improve picture quality.
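
A classical starting point such a system could learn to improve on is gray-world white balance, which assumes the scene averages out to neutral gray and scales each color channel accordingly. A minimal Python sketch:

```python
def gray_world_gains(mean_r, mean_g, mean_b):
    """Gray-world white balance: assume the scene averages to neutral
    gray and compute per-channel gains that pull each channel's mean
    toward that gray level."""
    gray = (mean_r + mean_g + mean_b) / 3.0
    return gray / mean_r, gray / mean_g, gray / mean_b
```

A learned model could go beyond this fixed assumption, e.g. by recognizing skin tones or known set elements and calibrating around them.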

Machine learning possibilities:

  • Detect current picture settings
  • Learn optimal picture settings to achieve the best possible shot
  • Make suggestions to improve current picture (or even configure settings automatically!)

6. Automated audio optimization

Halfway down our list of machine learning applications for video production is automated audio optimization. High-quality audio is essential when live streaming and recording a live presentation or lecture. Without clear audio, viewers are unable to fully experience the presentation. For the average non-technical presenter or lecturer, however, audio problems such as inaudible or distorted volume or a malfunctioning microphone can be difficult to resolve quickly. These issues often require the assistance of AV technicians to diagnose and resolve before the presentation can proceed. Not ideal!

Machine learning can be used to keep a watchful eye on audio and automatically make adjustments to ensure maximum audio quality. Technicians could be notified only if critical audio issues are detected.
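
The monitoring half of this idea needs no learning at all; here is a minimal Python sketch that classifies a buffer of audio samples as clipping, too quiet, or fine (the thresholds are illustrative):

```python
import math

def audio_status(samples, quiet_db=-30.0, clip_level=0.99):
    """Classify a buffer of normalized audio samples (-1.0 to 1.0)."""
    if any(abs(s) >= clip_level for s in samples):
        return "clipping"   # distortion: the gain should come down
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
    return "too_quiet" if level_db < quiet_db else "ok"
```

The machine learning part would sit on top: learning what "normal" sounds like for a given room and microphone, adjusting gain automatically, and paging a technician only for issues it cannot fix.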

Machine learning possibilities:

  • Streamline the audio diagnostic process
  • Indicate to technicians when there is an audio issue to address
  • Ensure high-quality audio is available at all times


7. Smarter presenter tracking

A lecture or live event usually includes one or more presenters who are often the focus of the audience’s attention. Presenters will enter and leave the frame and move around the stage as they present their material. In many cases, tracking the speaker with the camera as they move helps create a more engaging presentation overall. However, this tracking traditionally requires a human camera operator, which can be costly or unfeasible for smaller live productions or lectures.

Machine learning can be applied in this scenario to learn presenter faces and automatically track presenter movements without the need for manual camera operation. As the presenter moves around the stage, the camera automatically repositions itself in real time to ensure the presenter remains clearly visible within the frame.
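
Once a face detector supplies the presenter’s position, the camera motion itself can be a simple smoothing step, so small detection jitter never jerks the shot. A minimal Python sketch (pan is normalized 0.0 = far left to 1.0 = far right; the gain value is illustrative):

```python
def smooth_pan(current_pan, face_x, frame_width, gain=0.2):
    """Move the camera pan a fraction of the way toward the detected
    face center each frame, for jitter-free presenter tracking."""
    target = face_x / frame_width
    return current_pan + gain * (target - current_pan)
```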

Machine learning possibilities:

  • Learn the presenter’s face and track presenter movements
  • Remember the presenter’s face as they move in and out of the frame
  • Distinguish presenter from other people who move in and out of the frame

8. Abridged videos

Presentations and lectures often contain some degree of downtime, such as changing speakers, delays in setting up presentation material, trivial technical errors, etc. A recorded presentation gives technicians the opportunity to remove any such downtime and create a polished and professional final product for viewers.

Machine learning can help automate this process by identifying and removing gaps in the recorded content in post-production.
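
The gap-finding step can be sketched directly. Here is a minimal Python example that scans a per-second loudness series for stretches of near-silence longer than a configured minimum (the thresholds match the "greater than 10 seconds of silence" parameter mentioned below, but are otherwise illustrative):

```python
def find_gaps(loudness_db, threshold_db=-40.0, min_gap_seconds=10):
    """Return (start, end) second ranges of near-silence longer than
    `min_gap_seconds`, for removal in post-production."""
    gaps, start = [], None
    for t, level in enumerate(loudness_db):
        if level < threshold_db:
            if start is None:
                start = t
        else:
            if start is not None and t - start >= min_gap_seconds:
                gaps.append((start, t))
            start = None
    if start is not None and len(loudness_db) - start >= min_gap_seconds:
        gaps.append((start, len(loudness_db)))
    return gaps
```

A learned model would add the visual side, e.g. confirming the presenter has actually left the stage before cutting a gap.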

Machine learning possibilities:

  • Learn visual and audio cues based on specified parameters (e.g. more than 10 seconds of silence or the presenter disappearing from the stage)
  • Automatically remove identified gaps from the final product in post-production
  • Save time and effort on routine post-production tasks for video editors in a high-volume video setting

9. Streamlined recording control

When recording a lecture, the presenter needs to manually operate the encoding system to initiate recording at the beginning and end of the presentation. While this task is relatively simple, machine learning presents an opportunity to automate the process and allow lecturers to focus on what they do best: teaching.

Machine learning technology can simplify recording control by automatically detecting the beginning and end of each lecture. For example, machine learning can start recording using environment cues, such as when the room lights are turned on, when audio is detected, when someone enters the stage, etc.
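
The control logic on top of those detected cues could be as small as a debounced start/stop rule. A minimal Python sketch (the cue inputs and timeout are illustrative assumptions):

```python
def update_recording(recording, cues_active, idle_seconds, stop_after=300):
    """Debounced recording control: start the moment any learned cue
    fires (lights on, audio detected, presenter on stage), but stop
    only after the room has been idle for `stop_after` seconds, so
    short pauses do not split the recording."""
    if cues_active:
        return True
    if recording and idle_seconds >= stop_after:
        return False
    return recording
```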

Machine learning possibilities:

  • Learn presenter’s face, environmental lighting, presentation material, and other audiovisual cues
  • Initiate and end recording when learned cues are detected


10. Automated lower thirds

Lower thirds are graphics, animations, or text overlays that are used in live video to engage viewers and convey a message or other contextual information, such as a presenter’s name or title. Created using special video editing software (such as NewBlueFX), lower thirds can be applied in real time or can be manually configured in post-production to appear at key moments during the presentation.

Machine learning in video editing applications can be used to recognize speaker faces and other visual cues and automatically display the appropriate overlays without the need for manual intervention by video editors.
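
Given a face-recognition stage, the overlay logic itself is a scheduling problem. A minimal Python sketch, where the detection events and the registry of presenter titles are hypothetical inputs:

```python
def overlay_schedule(detections, registry, duration=8.0):
    """Turn face-recognition events into timed lower-third overlays.

    `detections` is a list of (time_seconds, person_id) events and
    `registry` maps person ids to overlay text; each person's lower
    third is shown for `duration` seconds, at most once per interval.
    """
    schedule, last_shown = [], {}
    for t, person in detections:
        text = registry.get(person)
        if text and t - last_shown.get(person, float("-inf")) > duration:
            schedule.append((t, t + duration, text))
            last_shown[person] = t
    return schedule
```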

Machine learning possibilities:

  • Learn visual and audio cues, such as each presenter’s face as they enter the frame
  • Automatically display relevant lower thirds information based on learned cues

11. Highlight reels

The last in our list of machine learning applications for video production is highlight reels. A recorded presentation can be repurposed as marketing collateral by editing the original material to contain only the presentation highlights, such as a speaker’s key points or important moments in the event.

Machine learning can be applied to automatically search for and isolate key moments in the recorded video(s) using cues such as keywords in transcribed text or audience applause. The code can help create a highlight reel from these isolated clips for video editors to review. This is particularly helpful for saving video editors time and effort on routine post-production tasks in a high-volume video setting.
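
As a concrete example of the applause-based method, here is a minimal Python sketch that turns peaks in a per-second audience-energy series into clip windows: each window includes the seconds just before the applause (the moment that earned it) and merges overlapping windows. The thresholds are illustrative:

```python
def highlight_clips(energy, threshold=0.8, pre=15, post=5):
    """Turn applause peaks into (start, end) clip windows covering
    `pre` seconds before each peak through `post` seconds after."""
    clips = []
    for t, e in enumerate(energy):
        if e >= threshold:
            start, end = max(0, t - pre), min(len(energy), t + post)
            if clips and start <= clips[-1][1]:
                clips[-1] = (clips[-1][0], end)  # merge overlapping windows
            else:
                clips.append((start, end))
    return clips
```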

Machine learning possibilities:

  • Learn visual and audio cues that correspond to an important moment, such as audience applause or keywords within transcribed text
  • Automatically isolate video clips based on learned cues
  • Use clips together to form a highlight reel

As you can tell, combining machine learning technology with live video solutions presents endless possibilities for automating, streamlining, and personalizing your live streams and recordings. Whether you’re a content creator, AV technician for an educational institution, or live event specialist, machine learning can help improve your live video experience.

Do you have any ideas for machine learning applications for video production? Let us know in the comments!


  1. Hi Jordan

    Thank you for sharing. Reading through your post, a bit sad we Pearl 1 users will not benefit from machine learning features.

    Firstly i would like to tell you in advance how good these machines are. This is the best option by far in the market for broadcasting, multi-camera recording, streaming, live visuals, relaying and much more.

    I deliver all these services with one epiphan pearl 1 box without the need for any extra equipment and crew.
    I have been working with these machines for about 3 years and i love them. My successful event production relies on these units. It is pretty amazing. Client loves the small footprint for a 3 camera setup, live presentation capture (lots of live demos), the graphics and artwork prepared pre-event and the best of all, we aim to deliver the recorded files at the end of the event day for fast publishing. (work for tech industry, computer science world, machine learning and IoT).

    Please follow links to several playlists with the most recent videos produced for clients (we own 4 Epiphan Pearls) – All videos had no post-production done and the video files delivered to the client on event day.

    ADC – Audio Developer Conference 2017

    Oracle Code – 2017

    Microsoft – WinOps Conf 2017

    London’s Calling 2017 and 2018

    The main reason i am contacting is it feels the Epiphan Pearl support could be better here in the UK at least, these boxes have great potential, but the industry doesn’t know anything about them. Work with many people establish in the industry for a few years and they know nothing about the Pearls. They all use either Black Magic stuff which i hate or keep recording conferences/events like old times, camera recording and then ask presentation slide deck to speakers for later post-production (horrible for editing time and file turnaround)

    We feel these boxes deserve a better justice. There are only 2 main vendors Techex and One Video and of course, there is no transparency in pricing.
    But the worst of all is no hiring options. ANYWHERE

    It is a shame Epiphan Pearl is not as well established in the UK as in the US. I am wondering if there is a way for me to be a brand representative here in the UK or Europe and push forward the use of these lovely machines in the industry.
    Talk about these machine with great proud as it is helping me to thrive in the industry at a really fast rate.

    I hope to hear from you.

    • Jordan Sheldrick

      Hi Ricardo. Thanks for your comment! Glad to hear you’re having a positive experience with Pearl. Just to be clear – the machine learning features in this article are simply possibilities for future additions. There are no plans to implement these concepts into Pearl or Pearl-2 at this point in time. I will pass along your request to our Pearl product management. Someone will reach out to you if there is an interest. Thanks again!
