Transcription

Introduction

This guide explains the overall process for executing speech transcription with Limecraft Flow.

The outline of the process is as follows:

  1. Upload audiovisual media clips into the Limecraft Flow platform such that they are available for enrichment by processes such as transcription or subtitling;

  2. Start the automated transcription workflow on one or more clips in the platform;

  3. Follow up on the status of the transcription workflow until completion before requesting the results;

  4. Retrieve the transcript now attached to the transcribed media clips;

  5. Optionally customize the transcription process with the use of custom dictionaries and text alignment.

In addition to these steps, we also describe the statuses a clip can have regarding transcription, to aid automated workflows and to guide the Limecraft Flow UI in properly displaying externally modified transcripts. To close off this chapter, we also list the API call to update a transcript from a third-party system.

Upload audiovisual media clips into the Limecraft Flow platform

Before transcription workflows can be run, audiovisual material needs to be uploaded to the Limecraft platform. Both audio-and-video clips and audio-only clips can be uploaded and processed for speech transcription.

The various ways of creating clips are described in their dedicated documentation section.

Start the automated transcription workflow

Once a clip has been ingested successfully, it can be used for further enrichment, including transcription.

Starting the speech transcription workflow is done using this call:

POST /production/{prId}/mo/{moId}/service/transcript

Start a transcript generation process, using a specific engine depending on the body of the request.

Details
Description

If no engine is specified, the production-defined one will be used.

Parameters

Path Parameters

  Name  Type  Description
  prId  Long  ID of the production.
  moId  Long  ID of the media object.

Body Parameters

  Name               Type               Description
  TranscriptRequest  TranscriptRequest  Transcript request object.

TranscriptRequest

Field name (type, format): description

  align (Boolean): Run transcription in alignment mode, in which the alignInput will become the transcript.
  alignInput (String): Text to use for transcription alignment.
  dictionaryId (Long, int64): Id of the dictionary to use during transcription.
  language (String): Language code to use for transcription. The code has to be supported by the speechEngine.
  numberOfSpeakers (Long, int64): How many speakers are expected. Usage depends on the speechEngine.
  redo (Boolean): Run again, even if the workflow already ran in this context.
  redoSingleTask (Boolean)
  skipActiveWorkflowTest (Boolean)
  speechEngine (String): Which speech engine should be used for transcription. Enum: VOLDEMORT, KALDI, VOLDEMORT2, VOLDEMORT3, VOLDEMORT4, VOLDEMORT5
  subtitle (Boolean): After the transcript is generated, also create a Subtitle annotation from it.
  subtitlePresetId (String): When subtitle is true, create subtitles using this subtitle preset.
  subtitlingConfiguration (subtitlingConfiguration)
  transcriptConfiguration (transcriptConfiguration)
  waitForWorkflow (Boolean)

Return Type

MediaObjectWorkflowReport

Field name (type, format): description

  adminOnly (Boolean)
  audioAnalyzerCompleted (Date, date-time)
  created (Date, date-time): The time when this resource was created.
  createdBy (String): The request or process that created this resource.
  createdByShareId (Long, int64)
  createdBySharedUserId (Long, int64)
  creatorId (Long, int64): The id of the user who created this resource.
  duration (Double, double)
  errorReports (List of TaskReport)
  extra (Object)
  funnel (String)
  id (Long, int64): The id of this resource.
  label (String): User-friendly label of the workflow.
  lastUpdated (Date, date-time): The time when this resource was last updated.
  mediaAnalyzerCompleted (Date, date-time)
  mediaObjectId (Long, int64)
  modifiedBy (String): The request or process responsible for the last update of this resource.
  objectType (String): The data model type or class name of this resource.
  productionId (Long, int64)
  publishedFiles (List of object): Files generated by the workflow, which can be downloaded.
  removeFromQuota (Boolean)
  requiredRights (List of ProductionPermission)
  size (Long, int64)
  startupParameters (Object)
  status (String): Enum: Inited, Started, Completed, Error, Cancelled, Paused, CompletedPending, ErrorPending, WaitForCallback, Scheduled
  successFul (Boolean)
  target (String)
  taskReports (List of TaskReport)
  transcoder1Completed (Date, date-time)
  transcoder2Completed (Date, date-time)
  variables (Object)
  version (Long, int64): The version of this resource, used for Optimistic Locking.
  workflowCompleted (Date, date-time): When did the workflow complete?
  workflowFailed (Date, date-time): When did the workflow fail?
  workflowId (String): The id of the workflow. This can be used to retrieve the workflow status.
  workflowStarted (Date, date-time): When was the workflow started?
  workflowTask (String)
  workflowType (String): Enum: INGEST, SPEECH, IPPAMEXPORT, IPPAMSYNC, MOIEXPORT, REMOTE_SPEECH, VOLDEMORT_SPEECH, TRANSCODE, AUDIOANALYZE, EXPORT_VWFLOW, FEATURE_EXTRACTION, BLACK_FRAME, STON_APPROVE, AAF_EXPORT, FCP_EXPORT, VOLDEMORT_SPEECH_2, KALDI_SPEECH, SUBTITLING, MIGRATE, INDEX, BACKUP, VOLDEMORT_SPEECH_3, VOLDEMORT_SPEECH_4, VOLDEMORT_SPEECH_5, TRANSLATION, INDEX_SWITCH, SIMPLEINGEST, CLONE, UPDATE_CATEGORY, WEBHOOK, SETKEEPER_ATTACH, PDF_EXPORT, SHOT_DETECTION, EXPORT, REMOTE_HELLO_WORLD, CUSTOM, UNKNOWN, CHANGE_AUDIO_LAYOUT, WORKSPACE_BOOTSTRAP, MEDIA_TRANSFER_COMPLETE, MEDIA_TRANSFER_FAILED, DELIVERY_REQUEST_SUBMISSION_CLIP_PROBED, DELIVERY_REQUEST_SUBMISSION, ADVANCED_SUBTITLE, TRANSCRIPTION_SUMMARIZE

Content Type
  • application/json

Responses
Table 1. HTTP response codes

  Code  Datatype                   Description
  200   MediaObjectWorkflowReport  The request was successful.
  400   BadRequestError            The language is not supported.
  403   ForbiddenError             The user needs START_TRANSCRIPTION_WORKFLOW rights.
  404   NotFoundError              The production or media object was not found.
  409   AlreadyExistsError         The workflow was already in progress.

The body of this request is a TranscriptRequest JSON object. Its language parameter should be the language code for the language the clip audio is in.

For example:

{
    "language": "en",
    "redo": true
}

The redo parameter can be used as a safeguard against duplicate transcription workflows. If redo is false (the default), starting the workflow fails if a transcription workflow was started before.

To learn which language codes are supported, refer to Speech Engines and supported features.
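
As an illustration, the call above is easy to script. Below is a minimal sketch in Python using the requests library; the base URL and the bearer-token Authorization header are assumptions to be adapted to your deployment and authentication setup:

import requests

BASE_URL = "https://example.limecraft.com/api"  # assumed base URL
TOKEN = "YOUR_ACCESS_TOKEN"                     # assumed authentication scheme

def start_transcription(pr_id, mo_id, language="en", redo=False):
    """Start the speech transcription workflow on a single clip."""
    resp = requests.post(
        f"{BASE_URL}/production/{pr_id}/mo/{mo_id}/service/transcript",
        json={"language": language, "redo": redo},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()  # 400/403/404/409 are the documented error cases
    return resp.json()       # a MediaObjectWorkflowReport

report = start_transcription(123, 456, language="en", redo=True)
print(report["workflowId"], report["status"])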

Another way to create a transcript automatically is through Translation, which is discussed on its own page.

Follow up on the status of the transcription workflow

The execution of the automated transcription process is modeled like any other Limecraft Flow platform workflow, such as the other enrichment and media processing workflows in our system. As such, the workflow API can be used to track its progress until completion or failure.

The call mentioned above will return a MediaObjectWorkflowReport. Its workflowId field gives you a reference to the workflow that was started. Once this workflow completes, the TranscriptAnnotations will have been created. To learn how to wait for a workflow to complete, see this section.
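
A minimal polling loop could look as follows. The exact call for fetching a workflow's status is covered in the workflow documentation; get_workflow_status below is a hypothetical stand-in that returns the MediaObjectWorkflowReport for a given workflowId, and the terminal values are taken from the status enum listed above:

import time

# Statuses treated as terminal here (from the documented status enum).
TERMINAL_STATUSES = {"Completed", "Error", "Cancelled"}

def wait_for_workflow(workflow_id, poll_seconds=10, timeout_seconds=3600):
    """Poll until the workflow reaches a terminal status.

    get_workflow_status() is a hypothetical helper wrapping the workflow
    status call described in the workflow API documentation.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = get_workflow_status(workflow_id)["status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"workflow {workflow_id} did not finish within {timeout_seconds}s")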

Retrieve the transcript

Speech transcription workflows that complete successfully deliver speech transcription results and attach them to the respective clip as multiple TranscriptAnnotation objects.

Retrieving TranscriptAnnotations is done using the query call to List all the annotations of a MediaObject with the appropriate parameters:

GET /production/{prId}/mo/{moId}/an/query?offset=0&rows=1000&sort=start ASC&fq=language:"en"&fq=funnel:TranscriptAnnotation
Query parameters:

  fq=funnel:TranscriptAnnotation
      We only want to retrieve TranscriptAnnotations, so we add a filter query on funnel.

  fq=language:"en"
      Another filter query is set on the language field, to only retrieve the TranscriptAnnotations in this particular language.

  offset=0, rows=1000
      Keep in mind that the annotation endpoint uses paging to deliver the full data set. Use proper paging parameters to ensure that the complete set of transcript elements is returned (in one or more API calls):
      • The offset in the entire data set of the first result returned in this call: offset=0;
      • The number of results requested in this call, as the rows parameter: rows=1000. We suggest using no more than 1000 results per call; requests for higher numbers might be capped by the API.

  sort=start ASC
      Sorting the results by increasing start time returns the transcript annotations chronologically.

The result of this call will be a sorted list of TranscriptAnnotations.
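
Putting the paging rules together, a sketch of retrieving a complete transcript could look like this. The base URL and token are the same assumptions as in the earlier sketch, and the response is assumed to be a JSON array of annotations; adapt the parsing if your deployment wraps results in an envelope:

import requests

BASE_URL = "https://example.limecraft.com/api"  # assumed base URL
TOKEN = "YOUR_ACCESS_TOKEN"                     # assumed authentication scheme

def fetch_transcript(pr_id, mo_id, language="en", rows=1000):
    """Page through all TranscriptAnnotations of a clip, in chronological order."""
    annotations, offset = [], 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/production/{pr_id}/mo/{mo_id}/an/query",
            params={
                "offset": offset,
                "rows": rows,
                "sort": "start ASC",
                # a list value makes requests send repeated fq parameters
                "fq": [f'language:"{language}"', "funnel:TranscriptAnnotation"],
            },
            headers={"Authorization": f"Bearer {TOKEN}"},
        )
        resp.raise_for_status()
        page = resp.json()  # assumed: a JSON array of TranscriptAnnotations
        annotations.extend(page)
        if len(page) < rows:  # a short page means the data set is exhausted
            return annotations
        offset += rows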

Customize the transcription process

Use a different speech transcription engine

The Limecraft Flow platform supports multiple transcription engines. The default transcription engine is Speechmatics, unless your production workspace is configured otherwise.

Users with access to an enterprise plan can also use one of the other speech transcription engines we support: Google Speech, Vocapia and Kaldi. The speechEngine parameter is used to choose the engine:

  speechEngine  Description
  voldemort2    Vocapia
  voldemort3    Speechmatics. This is usually the default engine, unless your production workspace is configured differently.
  voldemort4    Google Speech

For example, the following request starts a transcription with Vocapia:

{
    "speechEngine": "voldemort2",
    "language": "en",
    "redo": true,
    "redoSingleTask": true
}

It is important to note that not all speech engines support the same languages and feature set! Refer to Speech Engines and supported features to learn more.

Custom Dictionaries

Apart from specifying the language and speech engine to be used for transcription, our platform also supports the use of custom dictionaries to help return more accurate speech transcription results.

Documentation on how to create and maintain dictionaries is available in the relevant document.

To use a custom dictionary when running the transcription process, specify the dictionaryId parameter as part of the JSON request body to the transcription call, as follows:

{
    "redo": true,
    "dictionaryId": 27,
    "language": "fr",
    "align": false,
    "subtitle": false
}

Custom dictionaries can currently only be used with the Speechmatics ASR backend. Refer to Speech Engines and supported features to learn more.

Alignment of existing transcripts

Our platform also provides functionality for ‘alignment’ of pre-existing transcripts. In this case, non-timed input text is given per-word timings and speaker assignments, and the result is returned in the same transcription format as regular audio transcription calls.

Alignment can be initiated by sending the following JSON body to the transcription call, with align set to true and the input text placed in the alignInput field:

{
    "force": true,
    "language": "en",
    "align": true,
    "alignInput": "Look at this clock. When the bell rings, we can see it as well as hear it."
}

The text in alignInput should conform to certain requirements for optimal results:

  • UTF-8 encoded plain text (no markup, no timecodes, …)

  • Text should be in the same language as the audio of the clip

  • Only spoken text (no time codes, no speakers)

  • One sentence on each line (with punctuation marks).

If you have speaker info available, you can put it in the alignInput like this:

SPEAKER: ILSA
But what about us?

SPEAKER: RICK
We'll always have Paris. We didn't have, we, we lost it until you came to Casablanca. We got it back last night.

SPEAKER: ILSA
When I said I would never leave you.

Transcript alignment is currently available with the Speechmatics and Vocapia ASR backends. Refer to Speech Engines and supported features to learn more.
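
If the source text is available in a structured form, the alignInput string with SPEAKER markers can be assembled mechanically. A minimal sketch; build_align_input is an illustrative helper, not part of the API:

def build_align_input(dialogue):
    """Format (speaker, sentences) pairs into alignInput text with SPEAKER markers.

    dialogue: a list of (speaker_name, [sentence, ...]) tuples. Each sentence
    goes on its own line, as recommended above.
    """
    blocks = []
    for speaker, sentences in dialogue:
        blocks.append(f"SPEAKER: {speaker}\n" + "\n".join(sentences))
    return "\n\n".join(blocks)

align_input = build_align_input([
    ("ILSA", ["But what about us?"]),
    ("RICK", ["We'll always have Paris.", "We got it back last night."]),
])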

Transcription status of a clip

The MediaObjectAnnotation of the clip has a field transcriptionStatuses which contains the transcription status for each language for that clip.

Note that transcriptionStatuses is used to populate the language selector in the transcriber application of Flow-UI. If the status for a language isn’t set, that language won’t be shown in Flow-UI!

Example:

{
  "transcriptionStatuses": {
    "en": "AUTOMATIC_COMPLETED",
    "fr": "EDITING"
  }
}

The keys of the transcriptionStatuses map are the language codes; the values are any of the following:

  Status               Description
  NOT_STARTED          The transcription for this version has not started. Same as if the key did not exist.
  AUTOMATIC_STARTED    The automatic transcription process is busy.
  AUTOMATIC_FAILED     The automatic transcription was started but has failed.
  AUTOMATIC_COMPLETED  The automatic transcription has completed successfully.
  EDITING              Editing has started. This could come after AUTOMATIC_COMPLETED.
  COMPLETED            The transcript editing for this version has completed. Editing won’t be possible in Flow-UI unless the status is changed to EDITING.
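
These semantics make it straightforward to gate automated workflows on a clip’s status. A small sketch, assuming the clip’s MediaObjectAnnotation has already been fetched as a JSON object:

def needs_transcription(media_object_annotation, language):
    """Decide whether to (re)start automatic transcription for a language."""
    statuses = media_object_annotation.get("transcriptionStatuses", {})
    # A missing key is equivalent to NOT_STARTED (see the table above).
    status = statuses.get(language, "NOT_STARTED")
    # Don't restart while the automatic process is busy or a person is editing.
    return status in {"NOT_STARTED", "AUTOMATIC_FAILED"}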

Edit the transcript manually

POST /production/{prId}/mo/{moId}/an

This call creates an annotation and ties it to the specified media object.

Details
Description
Parameters

Path Parameters

  Name  Type  Description
  prId  Long  ID of the production.
  moId  Long  ID of the media object.

Body Parameters

  Name  Type    Description
  body  Object  Annotation object.

Return Type

Annotation

Field name (type, format): description

  annotationProductionId (Long, int64)
  clipMetadata (ClipMetadata)
  created (Date, date-time): The time when this resource was created.
  createdBy (String): The request or process that created this resource.
  createdByShareId (Long, int64)
  createdBySharedUserId (Long, int64)
  creatorId (Long, int64): The id of the user who created this resource.
  crossProduction (Boolean)
  customFields (CustomFields)
  deleted (Date, date-time)
  description (String): Textual contents of the Annotation.
  end (Long, int64): The frame range described by the annotation runs up to end, but not including it. Should be less than or equal to the amount of frames the MediaObject has.
  funnel (String): Describes how the Annotation should be interpreted by the client application. Can be thought of as a subtype.
  id (Long, int64): The id of this resource.
  includeTranslatedTo (Boolean)
  includesFrom (Set of string)
  keyframeFrames (Long, int64)
  label (String)
  language (String)
  lastUpdated (Date, date-time): The time when this resource was last updated.
  mediaObject (MediaObject)
  mediaObjectId (Long, int64)
  modifiedBy (String): The request or process responsible for the last update of this resource.
  objectType (String): The data model type or class name of this resource.
  origin (String)
  productionId (Long, int64)
  rating (Double, double)
  relatedToId (Long, int64)
  securityClasses (Set of string)
  source (String)
  spatial (String): Link the Annotation to a specific part of the video or image frame. A Media Fragments Spatial Dimension description string is expected.
  start (Long, int64): First frame of the Annotation. 0 is the first frame of the clip. The start frame is included in the frame range the annotation describes.
  systemFields (CustomFields)
  tags (Set of string)
  translatedFromId (Long, int64)
  translatedToIds (Set of long, int64)
  version (Long, int64): The version of this resource, used for Optimistic Locking.

Content Type
  • application/json

Responses
Table 2. HTTP response codes

  Code  Datatype         Description
  201   Annotation       The request was successful.
  403   ForbiddenError   The user needs LIBRARY_UPDATE_METADATA, LOG_EDIT, SUBTITLE_EDIT, or TRANSCRIBER_EDIT rights, depending on the type of the annotation.
  404   NotFoundError    The production or MediaObject was not found.
  422   ValidationError  The annotation does not validate against the annotation restrictions, for example annotation.start > annotation.end.

Body example

{
  "start": 0,
  "end": 6,
  "funnel": "TranscriptAnnotation",
  "language": "en",
  "source": "PostMan",
  "label": "PostMan Generated",
  "type": "TRANSCRIBER",
  "speaker": "F1",
  "objectType": "TranscriptAnnotation",
  "structuredDescription": {
    "confidence": 0.9935,
    "parts": [
      {
        "start": 0,
        "duration": 2,
        "word": "Hi, ",
        "confidence": 1,
        "speaker": "F1",
        "type": "LEX"
      },
      {
        "start": 2,
        "duration": 3,
        "word": "I'm ",
        "confidence": 1,
        "speaker": "F1",
        "type": "LEX"
      },
      {
        "start": 5,
        "duration": 3,
        "word": "Amy",
        "confidence": 0.95,
        "speaker": "F1",
        "type": "LEX"
      }
    ],
    "language": "en",
    "gender": "F"
  }
}
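
For completeness, here is a sketch of posting such a body from a script, with the same assumed base URL and authentication as in the earlier sketches; pr_id, mo_id, and annotation.json are placeholders:

import json
import requests

BASE_URL = "https://example.limecraft.com/api"  # assumed base URL
TOKEN = "YOUR_ACCESS_TOKEN"                     # assumed authentication scheme
pr_id, mo_id = 123, 456                         # placeholder ids

# annotation.json contains a TranscriptAnnotation body such as the example above.
with open("annotation.json") as f:
    annotation = json.load(f)

resp = requests.post(
    f"{BASE_URL}/production/{pr_id}/mo/{mo_id}/an",
    json=annotation,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()   # expect 201 Created on success
created = resp.json()     # the created Annotation, including its id
print(created["id"], created["version"])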