Transcription

Introduction

This guide explains the overall process for executing speech transcription with Limecraft Flow.

The outline of the process is as follows:

  1. Upload audiovisual media clips into the Limecraft Flow platform such that they are available for enrichment by processes such as transcription or subtitling;

  2. Start the automated transcription workflow on one or more clips in the platform;

  3. Follow up on the status of the transcription workflow until completion before requesting the results;

  4. Retrieve the transcript now attached to the transcribed media clips;

  5. Optionally customize the transcription process with the use of custom dictionaries and text alignment.

In addition to these steps, we also describe the statuses a clip can have regarding transcription, to aid automated workflows and to guide the Limecraft Flow UI in properly displaying externally modified transcripts. To close off this chapter, we also list the API call to update a transcript from a third-party system.

Upload audiovisual media clips into the Limecraft Flow platform

Before transcription workflows can be run, audiovisual material needs to be uploaded to the Limecraft platform. Both audio-and-video clips and audio-only clips can be uploaded and processed for speech transcription.

The various ways of creating clips are described in their dedicated documentation section.

Start the automated transcription workflow

Once a clip has been ingested successfully, it can be used for further enrichment, including transcription.

Starting the speech transcription workflow is done using this call:

POST /production/{prId}/mo/{moId}/service/transcript

Start a transcript generation process, using a specific engine depending on the body of the request.

Details
Description

If no engine is specified, the production-defined one will be used.

Parameters

Path Parameters

  Name  Type  Description
  prId  Long  ID of the production.
  moId  Long  ID of the media object.

Body Parameters

  Name               Type               Description
  TranscriptRequest  TranscriptRequest  Transcript request object.

TranscriptRequest

Field name (type, format): description

  align (Boolean): Run transcription in alignment mode, in which the alignInput will become the transcript.
  alignInput (String): Text to use for transcription alignment.
  dictionaryId (Long, int64): Id of the dictionary to use during transcription.
  language (String): Language code to use for transcription. The code has to be supported by the speechEngine.
  numberOfSpeakers (Long, int64): How many speakers are expected. Usage depends on the speechEngine.
  redo (Boolean): Run again, even if the workflow already ran in this context.
  redoSingleTask (Boolean)
  skipActiveWorkflowTest (Boolean)
  speechEngine (String): Which speech engine should be used for transcription. Enum: VOLDEMORT, KALDI, VOLDEMORT2, VOLDEMORT3, VOLDEMORT4, VOLDEMORT5
  subtitle (Boolean): After the transcript is generated, also create a Subtitle annotation from it.
  subtitlePresetId (String): When subtitle is true, create subtitles using this subtitle preset.
  subtitlingConfiguration (subtitlingConfiguration)
  transcriptConfiguration (transcriptConfiguration)
  waitForWorkflow (Boolean)

Return Type

MediaObjectWorkflowReport

Field name (type, format): description

  adminOnly (Boolean)
  audioAnalyzerCompleted (Date, date-time)
  created (Date, date-time): The time when this resource was created.
  createdBy (String): The request or process that created this resource.
  createdByShareId (Long, int64)
  createdBySharedUserId (Long, int64)
  creatorId (Long, int64): The id of the user who created this resource.
  duration (Double, double)
  errorReports (List of TaskReport)
  extra (Object)
  funnel (String)
  id (Long, int64): The id of this resource.
  label (String): User-friendly label of the workflow.
  lastUpdated (Date, date-time): The time when this resource was last updated.
  mediaAnalyzerCompleted (Date, date-time)
  mediaObjectId (Long, int64)
  modifiedBy (String): The request or process responsible for the last update of this resource.
  objectType (String): The data model type or class name of this resource.
  productionId (Long, int64)
  publishedFiles (List of object): Files generated by the workflow, which can be downloaded.
  removeFromQuota (Boolean)
  requiredRights (List of ProductionPermission)
  size (Long, int64)
  startupParameters (Object)
  status (String): Enum: Inited, Started, Completed, Error, Cancelled, Paused, CompletedPending, ErrorPending, WaitForCallback, Scheduled
  successFul (Boolean)
  target (String)
  taskReports (List of TaskReport)
  transcoder1Completed (Date, date-time)
  transcoder2Completed (Date, date-time)
  variables (Object)
  version (Long, int64): The version of this resource, used for Optimistic Locking.
  workflowCompleted (Date, date-time): When did the workflow complete?
  workflowFailed (Date, date-time): When did the workflow fail?
  workflowId (String): The id of the workflow. This can be used to retrieve the workflow status.
  workflowStarted (Date, date-time): When was the workflow started?
  workflowTask (String)
  workflowType (String): Enum: INGEST, SPEECH, IPPAMEXPORT, IPPAMSYNC, MOIEXPORT, REMOTE_SPEECH, VOLDEMORT_SPEECH, TRANSCODE, AUDIOANALYZE, EXPORT_VWFLOW, FEATURE_EXTRACTION, BLACK_FRAME, STON_APPROVE, AAF_EXPORT, FCP_EXPORT, VOLDEMORT_SPEECH_2, KALDI_SPEECH, SUBTITLING, MIGRATE, INDEX, BACKUP, VOLDEMORT_SPEECH_3, VOLDEMORT_SPEECH_4, VOLDEMORT_SPEECH_5, TRANSLATION, INDEX_SWITCH, SIMPLEINGEST, CLONE, UPDATE_CATEGORY, WEBHOOK, SETKEEPER_ATTACH, PDF_EXPORT, SHOT_DETECTION, EXPORT, REMOTE_HELLO_WORLD, CUSTOM, UNKNOWN, CHANGE_AUDIO_LAYOUT, WORKSPACE_BOOTSTRAP, MEDIA_TRANSFER_COMPLETE, MEDIA_TRANSFER_FAILED, DELIVERY_REQUEST_SUBMISSION_CLIP_PROBED, DELIVERY_REQUEST_SUBMISSION, ADVANCED_SUBTITLE, TRANSCRIPTION_SUMMARIZE

Content Type
  • application/json

Responses
Table 1. HTTP response codes

  Code  Datatype                   Description
  200   MediaObjectWorkflowReport  The request was successful.
  400   BadRequestError            The language is not supported.
  403   ForbiddenError             The user needs START_TRANSCRIPTION_WORKFLOW rights.
  404   NotFoundError              The production or media object was not found.
  409   AlreadyExistsError         The workflow was already in progress.

The body of this request is a TranscriptRequest JSON object. Its language parameter should be the language code for the language the clip audio is in.

For example:

{
    "language": "en",
    "redo": true
}

The redo parameter can be used as a safeguard against duplicate transcription workflows. If redo is false (the default), starting the workflow fails if a transcription workflow was started before.

To learn which language codes are supported, refer to Speech Engines and supported features.
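
As an illustration, the call above is easy to script. Below is a minimal sketch in Python using the requests library; the base URL and the bearer-token Authorization header are assumptions to be adapted to your deployment and authentication setup:

import requests

BASE_URL = "https://example.limecraft.com/api"  # assumed base URL
TOKEN = "YOUR_ACCESS_TOKEN"                     # assumed authentication scheme

def start_transcription(pr_id, mo_id, language="en", redo=False):
    """Start the speech transcription workflow on a single clip."""
    resp = requests.post(
        f"{BASE_URL}/production/{pr_id}/mo/{mo_id}/service/transcript",
        json={"language": language, "redo": redo},
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()  # 400/403/404/409 are the documented error cases
    return resp.json()       # a MediaObjectWorkflowReport

report = start_transcription(123, 456, language="en", redo=True)
print(report["workflowId"], report["status"])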

Another way to create a transcript automatically is through Translation, which is discussed on its own page.

Follow up on the status of the transcription workflow

The execution of the automated transcription process is modeled like any other Limecraft Flow platform workflow, such as the other enrichment and media processing workflows in our system. As such, the workflow API can be used to track its progress until completion or failure.

The call mentioned above will return a MediaObjectWorkflowReport. Its workflowId field gives you a reference to the workflow that was started. Once this workflow completes, the TranscriptAnnotations will have been created. To learn how to wait for a workflow to complete, see this section.
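
A minimal polling loop could look as follows. The exact call for fetching a workflow's status is covered in the workflow documentation; get_workflow_status below is a hypothetical stand-in that returns the MediaObjectWorkflowReport for a given workflowId, and the terminal values are taken from the status enum listed above:

import time

# Statuses treated as terminal here (from the documented status enum).
TERMINAL_STATUSES = {"Completed", "Error", "Cancelled"}

def wait_for_workflow(workflow_id, poll_seconds=10, timeout_seconds=3600):
    """Poll until the workflow reaches a terminal status.

    get_workflow_status() is a hypothetical helper wrapping the workflow
    status call described in the workflow API documentation.
    """
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        status = get_workflow_status(workflow_id)["status"]
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"workflow {workflow_id} did not finish within {timeout_seconds}s")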

Retrieve the transcript

Speech transcription workflows that complete successfully deliver speech transcription results and attach them to the respective clip as multiple TranscriptAnnotation objects.

Retrieving TranscriptAnnotations is done using the query call to List all the annotations of a MediaObject with the appropriate parameters:

GET /production/{prId}/mo/{moId}/an/query?offset=0&rows=1000&sort=start ASC&fq=language:"en"&fq=funnel:TranscriptAnnotation
Query parameters:

  fq=funnel:TranscriptAnnotation
      We only want to retrieve TranscriptAnnotations, so we add a filter query on funnel.

  fq=language:"en"
      Another filter query is set on the language field, to only retrieve the TranscriptAnnotations in this particular language.

  offset=0, rows=1000
      Keep in mind that the annotation endpoint uses paging to deliver the full data set. Use proper paging parameters to ensure that the complete set of transcript elements is returned (in one or more API calls):
      • The offset in the entire data set of the first result returned in this call: offset=0;
      • The number of results requested in this call, as the rows parameter: rows=1000. We suggest using no more than 1000 results per call; requests for higher numbers might be capped by the API.

  sort=start ASC
      Sorting the results by increasing start time returns the transcript annotations chronologically.

The result of this call will be a sorted list of TranscriptAnnotations.
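
Putting the paging rules together, a sketch of retrieving a complete transcript could look like this. The base URL and token are the same assumptions as in the earlier sketch, and the response is assumed to be a JSON array of annotations; adapt the parsing if your deployment wraps results in an envelope:

import requests

BASE_URL = "https://example.limecraft.com/api"  # assumed base URL
TOKEN = "YOUR_ACCESS_TOKEN"                     # assumed authentication scheme

def fetch_transcript(pr_id, mo_id, language="en", rows=1000):
    """Page through all TranscriptAnnotations of a clip, in chronological order."""
    annotations, offset = [], 0
    while True:
        resp = requests.get(
            f"{BASE_URL}/production/{pr_id}/mo/{mo_id}/an/query",
            params={
                "offset": offset,
                "rows": rows,
                "sort": "start ASC",
                # a list value makes requests send repeated fq parameters
                "fq": [f'language:"{language}"', "funnel:TranscriptAnnotation"],
            },
            headers={"Authorization": f"Bearer {TOKEN}"},
        )
        resp.raise_for_status()
        page = resp.json()  # assumed: a JSON array of TranscriptAnnotations
        annotations.extend(page)
        if len(page) < rows:  # a short page means the data set is exhausted
            return annotations
        offset += rows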

Customize the transcription process

Use a different speech transcription engine

The Limecraft Flow platform supports multiple transcription engines. The default transcription engine is Speechmatics, unless your production workspace is configured otherwise.

Users with access to an enterprise plan can also use one of the other speech transcription engines we support: Google Speech, Vocapia and Kaldi. The speechEngine parameter is used to choose the engine:

  speechEngine  Description
  voldemort2    Vocapia
  voldemort3    Speechmatics. This is usually the default engine, unless your production workspace is configured differently.
  voldemort4    Google Speech

For example, the following request starts a transcription with Vocapia:

{
    "speechEngine": "voldemort2",
    "language": "en",
    "redo": true,
    "redoSingleTask": true
}

It is important to note that not all speech engines support the same languages and feature set! Refer to Speech Engines and supported features to learn more.

Custom Dictionaries

Apart from specifying the language and speech engine to be used for transcription, our platform also supports the use of custom dictionaries to help return more accurate speech transcription results.

Documentation on how to create and maintain dictionaries is available in the relevant document.

To use a custom dictionary when running the transcription process, specify the dictionaryId parameter as part of the JSON request body to the transcription call, as follows:

{
    "redo": true,
    "dictionaryId": 27,
    "language": "fr",
    "align": false,
    "subtitle": false
}

Custom dictionaries can currently only be used with the Speechmatics ASR backend. Refer to Speech Engines and supported features to learn more.

Alignment of existing transcripts

Our platform also provides functionality for ‘alignment’ of pre-existing transcripts. In this case, non-timed input text is given per-word timings and speaker assignments, and the result is returned in the same transcription format as regular audio transcription calls.

Alignment can be initiated by sending the following JSON body to the transcription call, with align set to true and the input text placed in the alignInput field:

{
    "force": true,
    "language": "en",
    "align": true,
    "alignInput": "Look at this clock. When the bell rings, we can see it as well as hear it."
}

The text in alignInput should conform to certain requirements for optimal results:

  • UTF-8 encoded plain text (no markup, no timecodes, …)

  • Text should be in the same language as the audio of the clip

  • Only spoken text (no time codes, no speakers)

  • One sentence on each line (with punctuation marks).

If you have speaker info available, you can put it in the alignInput like this:

SPEAKER: ILSA
But what about us?

SPEAKER: RICK
We'll always have Paris. We didn't have, we, we lost it until you came to Casablanca. We got it back last night.

SPEAKER: ILSA
When I said I would never leave you.

Transcript alignment is currently available with the Speechmatics and Vocapia ASR backends. Refer to Speech Engines and supported features to learn more.
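
If the source text is available in a structured form, the alignInput string with SPEAKER markers can be assembled mechanically. A minimal sketch; build_align_input is an illustrative helper, not part of the API:

def build_align_input(dialogue):
    """Format (speaker, sentences) pairs into alignInput text with SPEAKER markers.

    dialogue: a list of (speaker_name, [sentence, ...]) tuples. Each sentence
    goes on its own line, as recommended above.
    """
    blocks = []
    for speaker, sentences in dialogue:
        blocks.append(f"SPEAKER: {speaker}\n" + "\n".join(sentences))
    return "\n\n".join(blocks)

align_input = build_align_input([
    ("ILSA", ["But what about us?"]),
    ("RICK", ["We'll always have Paris.", "We got it back last night."]),
])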

Transcription status of a clip

The MediaObjectAnnotation of the clip has a field transcriptionStatuses which contains the transcription status for each language for that clip.

Note that transcriptionStatuses is used to populate the language selector in the transcriber application of Flow-UI. If the status for a language isn’t set, that language won’t be shown in Flow-UI!

Example:

{
  "transcriptionStatuses": {
    "en": "AUTOMATIC_COMPLETED",
    "fr": "EDITING"
  }
}

The keys of the transcriptionStatuses map are the language codes; the values are any of the following:

  Status               Description
  NOT_STARTED          The transcription for this version has not started. Same as if the key did not exist.
  AUTOMATIC_STARTED    The automatic transcription process is busy.
  AUTOMATIC_FAILED     The automatic transcription was started but has failed.
  AUTOMATIC_COMPLETED  The automatic transcription has completed successfully.
  EDITING              Editing has started. This could come after AUTOMATIC_COMPLETED.
  COMPLETED            The transcript editing for this version has completed. Editing won’t be possible in Flow-UI unless the status is changed to EDITING.
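
These semantics make it straightforward to gate automated workflows on a clip’s status. A small sketch, assuming the clip’s MediaObjectAnnotation has already been fetched as a JSON object:

def needs_transcription(media_object_annotation, language):
    """Decide whether to (re)start automatic transcription for a language."""
    statuses = media_object_annotation.get("transcriptionStatuses", {})
    # A missing key is equivalent to NOT_STARTED (see the table above).
    status = statuses.get(language, "NOT_STARTED")
    # Don't restart while the automatic process is busy or a person is editing.
    return status in {"NOT_STARTED", "AUTOMATIC_FAILED"}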

Edit the transcript manually

POST /production/{prId}/mo/{moId}/an

This call creates an annotation and ties it to the specified media object.

Details
Description
Parameters

Path Parameters

  Name  Type  Description
  prId  Long  ID of the production.
  moId  Long  ID of the media object.

Body Parameters

  Name  Type    Description
  body  Object  Annotation object.

Return Type

Annotation

Field name (type, format): description

  annotationProductionId (Long, int64)
  clipMetadata (ClipMetadata)
  created (Date, date-time): The time when this resource was created.
  createdBy (String): The request or process that created this resource.
  createdByShareId (Long, int64)
  createdBySharedUserId (Long, int64)
  creatorId (Long, int64): The id of the user who created this resource.
  crossProduction (Boolean)
  customFields (CustomFields)
  deleted (Date, date-time)
  description (String): Textual contents of the Annotation.
  end (Long, int64): The frame range described by the annotation runs up to end, but not including it. Should be less than or equal to the amount of frames the MediaObject has.
  funnel (String): Describes how the Annotation should be interpreted by the client application. Can be thought of as a subtype.
  id (Long, int64): The id of this resource.
  includeTranslatedTo (Boolean)
  includesFrom (Set of string)
  keyframeFrames (Long, int64)
  label (String)
  language (String)
  lastUpdated (Date, date-time): The time when this resource was last updated.
  mediaObject (MediaObject)
  mediaObjectId (Long, int64)
  modifiedBy (String): The request or process responsible for the last update of this resource.
  objectType (String): The data model type or class name of this resource.
  origin (String)
  productionId (Long, int64)
  rating (Double, double)
  relatedToId (Long, int64)
  securityClasses (Set of string)
  source (String)
  spatial (String): Link the Annotation to a specific part of the video or image frame. A Media Fragments Spatial Dimension description string is expected.
  start (Long, int64): First frame of the Annotation. 0 is the first frame of the clip. The start frame is included in the frame range the annotation describes.
  systemFields (CustomFields)
  tags (Set of string)
  translatedFromId (Long, int64)
  translatedToIds (Set of long, int64)
  version (Long, int64): The version of this resource, used for Optimistic Locking.

Content Type
  • application/json

Responses
Table 2. HTTP response codes

  Code  Datatype         Description
  201   Annotation       The request was successful.
  403   ForbiddenError   The user needs LIBRARY_UPDATE_METADATA, LOG_EDIT, SUBTITLE_EDIT, or TRANSCRIBER_EDIT rights, depending on the type of the annotation.
  404   NotFoundError    The production or MediaObject was not found.
  422   ValidationError  The annotation does not validate against the annotation restrictions, for example annotation.start > annotation.end.

Body example

{
  "start": 0,
  "end": 6,
  "funnel": "TranscriptAnnotation",
  "language": "en",
  "source": "PostMan",
  "label": "PostMan Generated",
  "type": "TRANSCRIBER",
  "speaker": "F1",
  "objectType": "TranscriptAnnotation",
  "structuredDescription": {
    "confidence": 0.9935,
    "parts": [
      {
        "start": 0,
        "duration": 2,
        "word": "Hi, ",
        "confidence": 1,
        "speaker": "F1",
        "type": "LEX"
      },
      {
        "start": 2,
        "duration": 3,
        "word": "I'm ",
        "confidence": 1,
        "speaker": "F1",
        "type": "LEX"
      },
      {
        "start": 5,
        "duration": 3,
        "word": "Amy",
        "confidence": 0.95,
        "speaker": "F1",
        "type": "LEX"
      }
    ],
    "language": "en",
    "gender": "F"
  }
}
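
For completeness, here is a sketch of posting such a body from a script, with the same assumed base URL and authentication as in the earlier sketches; pr_id, mo_id, and annotation.json are placeholders:

import json
import requests

BASE_URL = "https://example.limecraft.com/api"  # assumed base URL
TOKEN = "YOUR_ACCESS_TOKEN"                     # assumed authentication scheme
pr_id, mo_id = 123, 456                         # placeholder ids

# annotation.json contains a TranscriptAnnotation body such as the example above.
with open("annotation.json") as f:
    annotation = json.load(f)

resp = requests.post(
    f"{BASE_URL}/production/{pr_id}/mo/{mo_id}/an",
    json=annotation,
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()   # expect 201 Created on success
created = resp.json()     # the created Annotation, including its id
print(created["id"], created["version"])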