TranscriptAnnotation

A TranscriptAnnotation is used by the transcription engine and the Limecraft Flow UI Transcriber application. It represents a single paragraph of text spoken by a single speaker, in which each word is timecoded to the exact frames where it is spoken.

objectType, funnel, type

The TranscriptAnnotation is an extension of StructuredAnnotation, so all properties of a StructuredAnnotation are also available on the TranscriptAnnotation.

The objectType will be "TranscriptAnnotation", the type of the StructuredAnnotation will be "TRANSCRIBER", and the funnel will be "TranscriptAnnotation".

Example

The example below illustrates a TranscriptAnnotation representing two sentences of text.

{
    "id": 727548,
    "objectType": "TranscriptAnnotation",
    "funnel": "TranscriptAnnotation",
    "start": 2566,
    "end": 2737,
    "speaker": "S1",
    "includesFrom": [],
    "label": "Automatic Transcription Fri Oct 29 2021 08:00:52 GMT+0000 (Coordinated Universal Time)",
    "language": "en",
    "mediaObjectId": 470672,
    "source": "automatic_transcription",
    "type": "TRANSCRIBER",
    "version": 1,
    "structuredDescription": {
        "confidence": 0.9612068965517244,
        "language": "en",
        "parts": [
            {
                "start": 2566,
                "duration": 8,
                "word": "This ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2566",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2574,
                "duration": 17,
                "word": "blade ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2574",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2591,
                "duration": 6,
                "word": "has ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2591",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2597,
                "duration": 2,
                "word": "a ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2597",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2599,
                "duration": 9,
                "word": "dark ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2599",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2608,
                "duration": 14,
                "word": "past. ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2608",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2675,
                "duration": 5,
                "word": "It ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2675",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2680,
                "duration": 3,
                "word": "has ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2680",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2683,
                "duration": 14,
                "word": "shed ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2683",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2697,
                "duration": 12,
                "word": "much ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2697",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2714,
                "duration": 11,
                "word": "innocent ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2714",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            },
            {
                "start": 2728,
                "duration": 9,
                "word": "blood. ",
                "confidence": 0,
                "speaker": "S1",
                "type": "LEX",
                "id": "727548-2728",
                "isStartInterpolated": false,
                "isEndInterpolated": false
            }
        ]
    }
}

Timecoded words in parts

The structuredDescription.parts array is the core of a TranscriptAnnotation. It contains an object for each word of the text:

{
    "start": 2566,
    "duration": 8,
    "word": "This ",
    "speaker": "S1",
    "confidence": 0,
    "type": "LEX",
    "id": "727548-2566",
    "isInterpolation": false,
    "isStartInterpolated": false,
    "isEndInterpolated": false
}

Field	Description
start	Start frame of the part. Should be ≥ the `start` of the `TranscriptAnnotation`.
duration	Duration, in frames, of the part. So, `start+duration` gives the 'end' of the part. The `duration` and `start` should be chosen so that `duration+start` is ≤ the `end` of the `TranscriptAnnotation`.
word	A string containing the word being spoken. Spaces should be included in the `word` string, they are not added automatically between words. Note that it can also contain multiple words in a single string if there is no need to timecode individual words. So, `"This is some text "` would also be a valid value. On the other hand, it could also contain only part of a word, if there is a need to timecode individual parts of a word.
confidence	A value between 0 and 1 indicating how confident we are about the text in `word` being the actual text being spoken out loud. This might be filled in by the automatic transcription.
speaker	A string indicating the name of the speaker.
type	What type of word this is. This could be filled in by the automatic transcription.
id	An optional client side identifier for the part.
isInterpolation	Deprecated. Use `isStartInterpolated` / `isEndInterpolated` instead.
isStartInterpolated	`true` if the start boundary is not known for sure, but interpolated between other boundaries. `false` if the `start` value is set by either the automatic transcription engine, or by the user, or assumed to be correct for another reason.
isEndInterpolated	Like `isStartInterpolated`, but for the end of the part (so, for `start+duration`).

The parts in the array are not necessarily sorted chronologically! It is the client’s responsibility to sort them by start.
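As a minimal sketch of what this means for a client, the Python snippet below sorts the parts by start and joins their word strings to recover the paragraph text. The function name transcript_text is hypothetical and only used for illustration; the sketch relies on the rule above that spaces are already part of each word.

```python
def transcript_text(parts):
    """Return the text of one annotation from its structuredDescription.parts.

    `transcript_text` is a hypothetical helper, not part of the Limecraft Flow API.
    """
    # The parts array is not guaranteed to be chronological, so sort it first.
    ordered = sorted(parts, key=lambda part: part["start"])
    # Spaces are included in each `word`, so join without a separator.
    return "".join(part["word"] for part in ordered)


# Parts taken from the example above, deliberately out of order.
parts = [
    {"start": 2574, "duration": 17, "word": "blade "},
    {"start": 2566, "duration": 8, "word": "This "},
    {"start": 2591, "duration": 6, "word": "has "},
]
text = transcript_text(parts)
print(repr(text))
```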

The start and duration of the parts should be chosen so that the parts fall fully within the start and end range of the TranscriptAnnotation. Parts must not overlap each other. Gaps between parts are allowed, though, and indicate silence.
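The constraints above can be checked on the client side before saving an annotation. The following Python sketch is one possible way to do that, under the assumption that the annotation is available as a plain dictionary shaped like the JSON example; check_parts is a hypothetical helper name, not an API function.

```python
def check_parts(annotation):
    """Return a list of problems found in the parts of a TranscriptAnnotation dict.

    `check_parts` is a hypothetical helper written for illustration only.
    """
    problems = []
    parts = sorted(
        annotation["structuredDescription"]["parts"],
        key=lambda part: part["start"],
    )
    previous_end = annotation["start"]
    for part in parts:
        part_end = part["start"] + part["duration"]
        if part["start"] < annotation["start"] or part_end > annotation["end"]:
            problems.append(f"part at {part['start']} is outside the annotation range")
        elif part["start"] < previous_end:
            problems.append(f"part at {part['start']} overlaps the previous part")
        # A gap (part["start"] > previous_end) is allowed and indicates silence.
        previous_end = max(previous_end, part_end)
    return problems


valid = {
    "start": 2566,
    "end": 2737,
    "structuredDescription": {"parts": [
        {"start": 2566, "duration": 8, "word": "This "},
        {"start": 2574, "duration": 17, "word": "blade "},
        {"start": 2675, "duration": 5, "word": "It "},  # preceded by a gap
    ]},
}
overlapping = {
    "start": 2566,
    "end": 2737,
    "structuredDescription": {"parts": [
        {"start": 2566, "duration": 8, "word": "This "},
        {"start": 2570, "duration": 17, "word": "blade "},
    ]},
}
print(check_parts(valid))
print(check_parts(overlapping))
```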

Part Timing Interpolation: isStartInterpolated, isEndInterpolated

In the Limecraft Flow UI Transcriber application, it is possible to set the text between a start and end timecode.

Interpolation in the Transcriber application

This creates a part for each word, but some of the timecodes have to be interpolated. The start of the first part ("Demo") is known, and the intended end (so, start+duration) of the last part is also known. All other 'boundaries' (start and end of parts) have to be interpolated in between. This is usually done by assuming the same length for each word. When a boundary is interpolated like this, isStartInterpolated or isEndInterpolated is set to true on the part. This information is important because it marks the boundaries that are not known for sure.

[
    {
        "start": 3675,
        "duration": 16,
        "type": "LEX",
        "word": "Demo",
        "isStartInterpolated": false,
        "isEndInterpolated": true
    },
    {
        "start": 3691,
        "duration": 16,
        "type": "LEX",
        "word": " of",
        "isStartInterpolated": true,
        "isEndInterpolated": true
    },
    {
        "start": 3707,
        "duration": 17,
        "type": "LEX",
        "word": " interpolation.",
        "isStartInterpolated": true,
        "isEndInterpolated": false
    },
    {
        "start": 4952,
        "duration": 8,
        "word": "Hey, ",
        "confidence": 0.98,
        "speaker": "S2",
        "type": "LEX",
        "id": "727549-4952",
        "isStartInterpolated": false,
        "isEndInterpolated": false
    },
    {
        "start": 4966,
        "duration": 9,
        "word": "she's ",
        "confidence": 0.53,
        "speaker": "S2",
        "type": "LEX",
        "id": "727549-4966",
        "isStartInterpolated": false,
        "isEndInterpolated": false
    }
]

If edit operations are done afterward, interpolation will always happen between 'trusted' boundaries which have is[Start|End]Interpolated set to false.
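To make the equal-length assumption concrete, the Python sketch below spreads a list of words over a known frame range and flags every boundary it had to compute. It is an illustration only, not the Transcriber's actual implementation: the function name interpolate_parts and the use of integer (floor) division for the boundaries are assumptions, chosen because they reproduce the first three parts of the example above (end frame 3724 is 3707 + 17).

```python
def interpolate_parts(words, start, end):
    """Spread `words` evenly over the frame range from `start` up to `end`.

    Hypothetical helper: boundaries are placed with floor division, which is
    an assumption and not necessarily what the Transcriber application does.
    """
    total = end - start
    count = len(words)
    parts = []
    for index, word in enumerate(words):
        part_start = start + (index * total) // count
        part_end = start + ((index + 1) * total) // count
        parts.append({
            "start": part_start,
            "duration": part_end - part_start,
            "type": "LEX",
            "word": word,
            # Only the outer boundaries are known; all inner ones are interpolated.
            "isStartInterpolated": index != 0,
            "isEndInterpolated": index != count - 1,
        })
    return parts


demo = interpolate_parts(["Demo", " of", " interpolation."], 3675, 3724)
for part in demo:
    print(part["start"], part["duration"], repr(part["word"]))
```

With these inputs the sketch yields starts 3675, 3691 and 3707 with durations 16, 16 and 17, the same values as the first three parts of the example.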

Language

All transcript annotations of a MediaObject that have the same value for language are assumed to represent a single transcript. In the Limecraft Flow UI, these will be shown as part of the same transcript.

Multiple transcripts can exist, as long as they have distinct values for the language field.

language is a property of the Annotation. It might also appear on the structuredDescription, which is set by the automatic transcription engine. When talking about the language of an annotation, we are always talking about the property on the Annotation (not the one in structuredDescription).

See Annotation for information on the possible values for the language.
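The grouping rule described in this section can be sketched in Python as follows. The snippet assumes the annotations have already been fetched as plain dictionaries; group_transcripts is a hypothetical helper name, and ordering the paragraphs of a transcript by start is an assumption made for the example. Note that it reads language from the Annotation itself, not from structuredDescription.

```python
from collections import defaultdict


def group_transcripts(annotations):
    """Group TranscriptAnnotation dicts into transcripts per (mediaObjectId, language).

    Hypothetical helper for illustration; not part of the Limecraft Flow API.
    """
    transcripts = defaultdict(list)
    for annotation in annotations:
        if annotation.get("objectType") != "TranscriptAnnotation":
            continue  # ignore other annotation types
        # Use the Annotation-level language, not structuredDescription.language.
        key = (annotation["mediaObjectId"], annotation.get("language"))
        transcripts[key].append(annotation)
    for paragraphs in transcripts.values():
        # Assumption: paragraphs of one transcript are read in order of start frame.
        paragraphs.sort(key=lambda annotation: annotation["start"])
    return dict(transcripts)


annotations = [
    {"id": 2, "objectType": "TranscriptAnnotation", "mediaObjectId": 470672, "language": "en", "start": 2800},
    {"id": 1, "objectType": "TranscriptAnnotation", "mediaObjectId": 470672, "language": "en", "start": 2566},
    {"id": 3, "objectType": "TranscriptAnnotation", "mediaObjectId": 470672, "language": "nl", "start": 2566},
]
grouped = group_transcripts(annotations)
print(sorted(grouped.keys()))
```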

Speaker

The speaker string indicates who is talking. It should uniquely identify one of the participants in the transcribed conversations.

Property Reference

Field Name	Required	Type	Description	Format
annotationProductionId	✘	Long		int64
clipMetadata	✘	ClipMetadata
created	✘	Date	The time when this resource was created	date-time
createdBy	✘	String	The request or process that created this resource
createdByShareId	✘	Long		int64
createdBySharedUserId	✘	Long		int64
creatorId	✘	Long	The id of the user who created this resource	int64
crossProduction	✘	Boolean
customFields	✘	CustomFields
deleted	✘	Date		date-time
description	✘	String	Textual contents of the Annotation
end	✘	Long	The frame range described by the annotation runs up to, but does not include, end. Should be less than or equal to the number of frames the MediaObject has.	int64
funnel	✘	String	Describes how the Annotation should be interpreted by the client application. Can be thought of as a subtype.
genericType	✘	String
id	✔	Long	The id of this resource	int64
includeTranslatedTo	✘	Boolean
includesFrom	✘	Set of string
keyframeFrames	✘	Long		int64
label	✘	String
language	✘	String
lastUpdated	✘	Date	The time when this resource was last updated	date-time
mediaObject	✘	MediaObject
mediaObjectId	✘	Long		int64
modifiedBy	✔	String	The request or process responsible for the last update of this resource
objectType	✘	String	The data model type or class name of this resource
origin	✘	String
productionId	✘	Long		int64
rating	✘	Double		double
relatedToId	✘	Long		int64
securityClasses	✘	Set of string		Enum:
source	✘	String
spatial	✘	String	Link the Annotation to a specific part of the video or image frame. A Media Fragments Spatial Dimension description string is expected.
speaker	✘	String
start	✘	Long	First frame of the Annotation. 0 is the first frame of the clip. The start frame is included in the frame range the annotation describes.	int64
structuredDescription	✘	Object
systemFields	✘	CustomFields
tags	✘	Set of string
translatedFromId	✘	Long		int64
translatedToIds	✘	Set of long		int64
type	✘	String		Enum: MINIME, EBU_CORE, TRANSCRIBER, TECHNICAL_DETAILS, GENERIC, SUBTITLE, SHOT,
version	✔	Long	The version of this resource, used for Optimistic Locking	int64
