Note: The creation of this content was human-based, with the assistance of artificial intelligence.

Explanation of the success criteria

WCAG 1.2.2 Captions (Prerecorded) is a Level A Success Criterion. It requires that captions be provided for all prerecorded audio content in synchronized media, except when the media is itself an alternative for text and is clearly labeled as such.
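In practice, meeting this criterion on the web usually means pairing the video with a timed caption file, most commonly WebVTT, attached via a track element. Here is a minimal sketch using the DOM API; the file names are hypothetical:

```ts
// Minimal sketch: attaching a captions track to a video element.
// The file names ("talk.mp4", "talk-captions.vtt") are hypothetical.
const video = document.createElement("video");
video.src = "talk.mp4";
video.controls = true;

const track = document.createElement("track");
track.kind = "captions";            // not "subtitles": captions include sound cues
track.src = "talk-captions.vtt";    // a WebVTT file containing the caption cues
track.srclang = "en";
track.label = "English captions";
track.default = true;               // show captions by default

video.appendChild(track);
document.body.appendChild(video);
```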

Who do captions benefit?

Captions benefit a wide range of users by providing text versions of spoken words and relevant sounds in video or audio content. Here’s a breakdown of who they help:

  • Deaf and Hard of Hearing Users: The primary audience for captions. Captions give these users full access to spoken dialogue, sound effects, and music cues.
  • Non-Native Language Speakers: Useful for language learners, captions help people who are fluent in the written form of a language, but struggle with spoken comprehension.
  • People with Cognitive or Learning Disabilities: Captions help users with ADHD, dyslexia, or auditory processing disorders. They make it easier to process spoken content by offering a visual aid.
  • Users in Noisy or Quiet Environments: Captions let people watch videos in loud settings (e.g., public transport) or silent environments (e.g., libraries) without sound.
  • Everyone for Comprehension and Retention: Captions can aid understanding, especially for complex or technical content. They allow users to re-read information and retain it better.

Here is an example of a video with appropriate captions.

Methods to provide captions

There are a variety of ways that captions can be created for pre-recorded video content.

Automated Captions

In the early days of automated captions, they earned the nickname “craptions.” That nickname has mostly disappeared as automated captions have improved, and generating captions automatically has its benefits:

  • Speed: Captions can be generated almost instantly for prerecorded or live content.
  • Cost-Effective: Cheaper than manual captioning, especially for large volumes of content.
  • Scalable: Easily applied across thousands of videos or multiple platforms.
  • Multilingual Support: Many tools support multiple languages, accents, and dialects.
  • Syncing with Media: AI can automatically time captions to match the speech.
  • Improved Access: Offers basic accessibility for deaf or hard-of-hearing users where none existed.

Everything sounds good, right? Well, like most things in life, there are downsides to be aware of.

  • Accuracy Issues: May misinterpret accents, background noise, or technical terms.
  • Lack of Context Awareness: No understanding of meaning, emotion, or intent.
  • Speaker Errors: Often fails to distinguish speakers or attributes speech incorrectly.
  • No Punctuation or Grammar: Raw captions often lack proper punctuation, making them hard to read.
  • Sound Descriptions Missing: Usually doesn’t include non-speech elements like [laughter] or [music].
  • Limited with Overlapping Speech: Struggles when multiple people talk at once.
  • Editing Required: Requires human review and corrections to meet accessibility standards.

Manual Captioning

Manual captioning provides the highest quality captions of any method.

  • High Accuracy: Human captioners understand context, emotion, and nuance, leading to far fewer errors.
  • Correct Grammar & Punctuation: Includes proper sentence structure, which improves readability and comprehension.
  • Descriptive Sound Captions: Can include non-speech elements like [applause], [dramatic music], etc.
  • Context Awareness: Human captioners can interpret slang, jokes, tone, or cultural references.
  • Speaker Identification: Accurately labels speakers and can distinguish between them.
  • Better for Complex Content: Ideal for technical, scientific, or fast-paced dialogue.
  • WCAG & Legal Compliance: More likely to meet accessibility standards (like WCAG or ADA).

While manual captions are best for quality, there are certain downsides.

  • Time-Consuming: Manual captioning takes longer, especially for long or complex videos.
  • More Expensive: Requires professional services or trained staff, which increases cost.
  • Limited Scalability: Slower for large content libraries or high-volume production schedules.
  • Human Error: While rare, mistakes can still happen if attention to detail is lacking.

Best approach to creating captions for pre-recorded video

It depends. Automatic captions are a great starting point for accessibility and convenience. They are also needed when the number of pre-recorded videos is large. For legal compliance or high-quality experiences, they should be reviewed and edited by humans. Manual captions offer superior quality, clarity, and compliance, making them the best option for formal, legal, educational, and accessibility-critical content—despite higher costs and time investment.

A quick personal note on human captioning

Captioning is near and dear to my heart. When I was an organizer of the Chicago Digital Accessibility and Inclusive Design Meetup, we would live-stream the events over YouTube with sponsored live captions. Afterwards, I would edit the video recording and caption it, using the live captions as a guide. It was a large amount of time-consuming work. However, the positive impact is priceless.

Best practices for captions

  • Captions should match the spoken dialogue exactly, including slang, grammar, and mispronunciations when relevant.
  • Text must be timed to appear with the corresponding audio, not before or after.
  • Captions should include all spoken content, as well as meaningful non-speech sounds (e.g., [laughter], [music playing], [door slams]).
  • Identify speakers when not visually obvious (e.g., [John]: Let’s start the meeting).
  • Use sentence case, proper punctuation, and line breaks to make captions easy to read.
  • Keep captions to 1–2 lines per screen (max ~32 characters per line if possible). Adjust timing so viewers can read comfortably.
  • Where you have control, place captions at the bottom center of the screen, unless they cover important visuals (then move them appropriately).
  • Where you have control, use legible, sans-serif fonts, high contrast (e.g., white on black background), and consistent style.
  • Match the language of the audio and translate only when creating subtitles.
  • Follow standards like WCAG, FCC, ADA, or other regional accessibility laws.
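Several of these guidelines are mechanical enough to check in code. Below is a minimal sketch that flags cues breaking the readability rules above; the Cue shape and the reading-speed threshold are assumptions, not part of any standard:

```ts
// Minimal sketch: checking parsed caption cues against the readability
// guidelines above (at most 2 lines per cue, ~32 characters per line).
interface Cue {
  start: number; // seconds
  end: number;   // seconds
  text: string;  // may contain newlines and sound cues like "[applause]"
}

function checkCueReadability(cue: Cue, maxLines = 2, maxChars = 32): string[] {
  const problems: string[] = [];
  const lines = cue.text.split("\n");

  if (lines.length > maxLines) {
    problems.push(`cue has ${lines.length} lines (max ${maxLines})`);
  }
  for (const line of lines) {
    if (line.length > maxChars) {
      problems.push(`line "${line}" is ${line.length} chars (max ~${maxChars})`);
    }
  }

  // Rough reading-speed check: ~20 characters per second as a comfortable
  // ceiling is an assumption here, not a figure from WCAG.
  const duration = cue.end - cue.start;
  if (duration > 0 && cue.text.length / duration > 20) {
    problems.push("cue may be on screen too briefly to read comfortably");
  }
  return problems;
}

// Example cue with a speaker label and a non-speech sound description.
console.log(checkCueReadability({
  start: 12.0,
  end: 14.5,
  text: "[John]: Let's start the meeting.\n[applause]",
}));
```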

Testing via automation

Automated caption testing is highly efficient, quickly detecting whether captions exist—typically by identifying HTML elements like the <track> tag. It operates at a very fast speed and is highly scalable, making it well-suited for large volumes of content.
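As a rough illustration, the presence check many automated tools perform boils down to something like the following sketch. Note that it looks only for track elements, which is exactly why this style of check can miss captions delivered other ways:

```ts
// Minimal sketch of an automated presence check: flag <video> elements
// with no captions or subtitles <track>. This tests existence only;
// it says nothing about whether the captions are accurate or synchronized.
function findUncaptionedVideos(root: Document = document): HTMLVideoElement[] {
  return Array.from(root.querySelectorAll<HTMLVideoElement>("video")).filter(
    (video) =>
      video.querySelector('track[kind="captions"], track[kind="subtitles"]') === null
  );
}

for (const video of findUncaptionedVideos()) {
  // Burned-in (open) captions or player-provided captions would still be
  // flagged here, one source of the false positives noted below.
  console.warn("No caption track found:", video.currentSrc || video.src);
}
```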

Automated caption testing has notable limitations in evaluating quality. It cannot verify the accuracy of caption text, nor does it assess synchronization with audio or the appropriateness of language used. Additionally, it is prone to a high rate of false positives and negatives, often incorrectly flagging captions as missing even when they are present through alternative methods.

Testing via AI

AI can effectively detect the presence of captions, even in the absence of HTML tags, by analyzing audio and visual content directly. In terms of speed, it operates quickly, though not as fast as simpler forms of automation. When properly implemented, AI-based caption testing also offers strong scalability, making it suitable for large-scale evaluations.

AI testing of captions reveals mixed performance across key areas. While some AI models can semi-evaluate caption accuracy, they often struggle with understanding context, limiting their reliability. Caption synchronization can be roughly estimated through audio analysis, but results are still imperfect. In terms of language appropriateness, AI may identify general issues, yet it lacks the nuanced understanding of context required for more accurate assessments. When it comes to identifying false positives and negatives, AI performs moderately well—better than basic automation—but still falls short of delivering flawless results.
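One common building block for this kind of AI-assisted accuracy check is word error rate (WER): transcribe the audio with a speech-to-text model, then compare the transcript against the caption text. Here is a minimal sketch of the comparison step only; where the transcript comes from is assumed, and a low WER suggests, but does not prove, accurate captions:

```ts
// Minimal sketch: word error rate (WER) between an ASR transcript of the
// audio and the caption text, computed as word-level edit distance
// divided by the length of the reference transcript.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/).filter(Boolean);
  const hyp = hypothesis.toLowerCase().split(/\s+/).filter(Boolean);

  // Levenshtein distance over words.
  const d: number[][] = Array.from({ length: ref.length + 1 }, (_, i) =>
    Array.from({ length: hyp.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0
    )
  );
  for (let i = 1; i <= ref.length; i++) {
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      d[i][j] = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost);
    }
  }
  return ref.length === 0 ? 0 : d[ref.length][hyp.length] / ref.length;
}

// Example: one substitution and one deletion against a four-word
// reference gives a WER of 0.5, a strong signal to send the captions
// for human review.
console.log(wordErrorRate(
  "let's start the meeting", // ASR transcript of the audio (assumed input)
  "lets start meeting"       // caption text being evaluated
));
```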

Manual testing

Manual testing offers the most reliable method for evaluating captions, as it involves watching the video to accurately detect the presence of captions. It allows for thorough assessment of caption accuracy, including grammar, spelling, and contextual relevance. Humans can effectively verify caption synchronization, ensuring that the timing aligns with the audio. Language appropriateness is also carefully reviewed, including tone and correctness. When performed thoroughly, manual testing results in low rates of false positives and negatives, making it the most comprehensive approach.

Manual caption testing, while accurate, is slow and resource-intensive, requiring significant human effort to review each video. As a result, it has poor scalability, making it impractical for large video libraries or high-volume content environments.

Which approach is best?

No single approach guarantees accurate, complete captions for prerecorded video content. However, using the strengths of each approach in combination can have a positive effect.

Automated accessibility testing offers speed and efficiency but is limited to detecting the basic presence of captions, without assessing their quality or accuracy. AI-based accessibility testing helps bridge this gap by providing contextual insights, though its reliability varies depending on the model and how it’s implemented. Manual accessibility testing remains the most accurate and context-aware method, making it ideal for final reviews. However, when dealing with large volumes of pre-recorded video, manual accessibility testing alone becomes time-consuming and subjective, and is best used in combination with automated or AI-driven approaches for greater efficiency and consistency.

Related Resources