com.innovaphone.transcriptions

This API facilitates audio transcription by submitting audio data to the OpenAI Whisper backend. The following documentation details the messaging sequence and data structures involved in requesting and receiving transcriptions.

Messages

TranscriptionRequest
TranscriptionRequestResult
TransferTranscribedAudio

TranscriptionRequest

To initiate a transcription, the client sends a TranscriptionRequest message.

{
    mt:                         "TranscriptionRequest",
    lang:                       "de",
    task:                       "transcribe",
    prompt:                     "Please write words exactly as spoken. Mixed languages possible.",
    response_format:            "verbose_json",
    timestamp_granularities[]:  "word",
    src:                        "src0"
}

Parameters

string mt	message type [mandatory]
string lang	the language of the audio. If not provided or set to "no-entry", Whisper will automatically (not always correctly) detect the language [optional]. Supported values: "ca", "cs", "de", "en", "es", "eu", "fr", "it", "nl", "pl", "pt", "ru", "si", "tr", "no-entry"
string task	the task to perform. Defaults to "transcribe". Other option: "translate" (transcribes and outputs translated text in the target language) [optional]
string response_format	the format of the returned transcription. Defaults to "text". Supported formats: json, text, srt, verbose_json, vtt, diarized_json [optional]. For more information, see the Whisper API documentation
string timestamp_granularities[]	the timestamp detail level for timestamped formats (ignored for pure text responses) [optional]. For more information, see the Whisper API documentation
string src	client-defined unique identifier for this request. Used to coordinate messages when multiple requests run in parallel [mandatory]

Notes

When the response format is not "text", parsing and interpreting the returned data is the responsibility of the client.
response_format must be set to "verbose_json" to use timestamp granularities.
timestamp_granularities[] is only relevant for timestamped output formats. Supported values: word, segment (one or both may be used).
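
The constraints in these notes can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: buildTranscriptionRequest and its options object are hypothetical names, and only a single granularity value is handled because the encoding of several values is not described here.

```javascript
// Hypothetical helper (not part of the API): builds a TranscriptionRequest
// message and enforces the constraints listed in the notes above.
const GRANULARITIES = ["word", "segment"];

function buildTranscriptionRequest(src, options = {}) {
    if (!src) throw new Error("src is mandatory");
    const msg = { mt: "TranscriptionRequest", src: src };
    if (options.lang) msg.lang = options.lang;
    if (options.task) msg.task = options.task;
    if (options.prompt) msg.prompt = options.prompt;
    if (options.responseFormat) msg.response_format = options.responseFormat;
    if (options.granularity) {
        if (!GRANULARITIES.includes(options.granularity)) {
            throw new Error("unsupported granularity: " + options.granularity);
        }
        if (msg.response_format !== "verbose_json") {
            throw new Error("timestamp granularities require response_format verbose_json");
        }
        // The key contains brackets, so it must be written as a quoted property name.
        msg["timestamp_granularities[]"] = options.granularity;
    }
    return msg;
}

const request = buildTranscriptionRequest("src0", {
    lang: "de",
    responseFormat: "verbose_json",
    granularity: "word"
});
console.log(JSON.stringify(request));
```

Rejecting an invalid combination locally avoids a round trip to the service for a request that cannot produce timestamps.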

Response

{
    mt:      "TranscriptionRequestResult",
    url:     "http://localhost/transcriptions/?SessionId=3435407065703356068&TranscriptionId=4089590383170956585",
    src:     "src0"
}

Parameters

string mt	message type "TranscriptionRequestResult"
string url	the upload endpoint. Includes both service-assigned SessionId and TranscriptionId. SessionId identifies the current session. TranscriptionId refers to the specific audio stream.
string src	same source identifier as in the request.
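
If the client needs the two identifiers before the first TransferTranscribedAudio message arrives, they can be read from the query string of the URL. This is a sketch assuming an environment that provides the standard URL class (browsers and Node.js); parseUploadUrl is a hypothetical helper name.

```javascript
// Hypothetical helper: extracts SessionId and TranscriptionId from the upload URL.
// The values are kept as strings because identifiers of this size exceed
// Number.MAX_SAFE_INTEGER and would lose precision as JavaScript numbers.
function parseUploadUrl(url) {
    const params = new URL(url).searchParams;
    return {
        sessionId: params.get("SessionId"),
        transcriptionId: params.get("TranscriptionId")
    };
}

const ids = parseUploadUrl(
    "http://localhost/transcriptions/?SessionId=3435407065703356068&TranscriptionId=4089590383170956585"
);
console.log(ids.sessionId, ids.transcriptionId);
```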

Response

{
    mt:                     "TransferTranscribedAudio",
    SessionID:              3435407065703356068,
    TranscriptionID:        4089590383170956585,
    DataComplete :          true,
    TranscribedAudioData :  "Happy transcriptions\n",
    DataLength :            21,
    src:                    "src0"
}

Parameters

string mt	message type "TransferTranscribedAudio"
ulong64 SessionID	matches the SessionId contained in the url of the TranscriptionRequestResult
ulong64 TranscriptionID	matches the TranscriptionId contained in the url of the TranscriptionRequestResult
boolean DataComplete	indicates whether the transcription is complete. False means additional chunks will follow; true marks the final chunk
string TranscribedAudioData	the transcribed text (format depends on response_format)
ulong64 DataLength	the raw length in bytes of TranscribedAudioData, regardless of whether the content is plain text or a structured format such as JSON
string src	same source identifier as in the request
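
Because a transcription may arrive in several chunks, the client has to join them until DataComplete is true. The sketch below is one possible way to do this; createCollector is a hypothetical name, chunks are assumed to arrive in order, and DataLength is assumed to count UTF-8 bytes (the encoding is not specified here).

```javascript
// Hypothetical collector: joins the chunks of one transcription.
function createCollector() {
    const chunks = [];
    const encoder = new TextEncoder(); // encodes strings as UTF-8
    return function onChunk(msg) {
        // Assumption: DataLength is the UTF-8 byte length of TranscribedAudioData.
        const byteLength = encoder.encode(msg.TranscribedAudioData).length;
        chunks.push(msg.TranscribedAudioData);
        return {
            complete: msg.DataComplete === true,
            lengthOk: byteLength === msg.DataLength,
            text: chunks.join("")
        };
    };
}

const collect = createCollector();
collect({ TranscribedAudioData: "Happy ", DataLength: 6, DataComplete: false });
const last = collect({ TranscribedAudioData: "transcriptions\n", DataLength: 15, DataComplete: true });
if (last.complete) {
    console.log(JSON.stringify(last.text));
}
```

With a structured response_format such as verbose_json, the joined text would be parsed only after the final chunk has been received.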

Response

If an error occurs, the service sends a final TransferTranscribedAudio message indicating failure.

{
    mt:                     "TransferTranscribedAudio",
    SessionID:              3435407065703356068,
    TranscriptionID:        4089590383170956585,
    DataComplete :          true,
    TranscribedAudioData :  "",
    DataLength :            0,
    error:                  "HTTP_AUTHENTICATION_FAILED",
    src:                    "src0"
}

Parameters

string mt	message type "TransferTranscribedAudio" (the same type as for successful results)
ulong64 SessionID	matches the SessionId contained in the url of the TranscriptionRequestResult
ulong64 TranscriptionID	matches the TranscriptionId contained in the url of the TranscriptionRequestResult, identifying the failed transcription
boolean DataComplete	always true for errors
string TranscribedAudioData	empty
ulong64 DataLength	always 0 for errors
string error	human-readable reason for the failure, e.g. "HTTP_AUTHENTICATION_FAILED"
string src	same source identifier as in the request
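
Since success and failure share the same message type, a client can tell them apart by looking at the error field. A minimal sketch, assuming that a non-empty error string is the failure indicator; interpretTransfer is a hypothetical helper name.

```javascript
// Hypothetical helper: classifies a TransferTranscribedAudio message.
// Assumption: a non-empty "error" string marks a failed transcription.
function interpretTransfer(msg) {
    if (typeof msg.error === "string" && msg.error.length > 0) {
        return { status: "failed", reason: msg.error };
    }
    return {
        status: msg.DataComplete ? "complete" : "partial",
        text: msg.TranscribedAudioData
    };
}

const failure = interpretTransfer({
    mt: "TransferTranscribedAudio",
    DataComplete: true,
    TranscribedAudioData: "",
    DataLength: 0,
    error: "HTTP_AUTHENTICATION_FAILED",
    src: "src0"
});
console.log(failure.status, failure.reason);
```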

Typical Flow

Client to Service: TranscriptionRequest
Service to Client: TranscriptionRequestResult (contains SessionId & TranscriptionId)
Client: Uploads audio to the provided URL
Service to Client: TransferTranscribedAudio (one or more messages)
- DataComplete = false means streaming in progress
- DataComplete = true means final message
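
When several transcriptions run in parallel, the src value is what ties each incoming message to its request. The flow above could be tracked roughly as follows. This is a sketch independent of the transport: createTracker, begin and handle are hypothetical names, and generating src values from a counter is only one possible scheme.

```javascript
// Hypothetical tracker: correlates incoming messages with pending requests via src.
function createTracker() {
    const pending = new Map();
    let counter = 0;
    return {
        // Registers a new request and returns the src to put into TranscriptionRequest.
        begin(onDone) {
            const src = "src" + counter++;
            pending.set(src, { text: "", onDone: onDone });
            return src;
        },
        // Feeds a received message; returns false if its src is unknown.
        handle(msg) {
            const entry = pending.get(msg.src);
            if (!entry) return false;
            if (msg.mt === "TransferTranscribedAudio") {
                entry.text += msg.TranscribedAudioData;
                if (msg.DataComplete) {
                    // The final message ends the flow, with or without an error.
                    pending.delete(msg.src);
                    entry.onDone(msg.error || null, entry.text);
                }
            }
            return true;
        }
    };
}

const tracker = createTracker();
const src = tracker.begin((err, text) => {
    if (err) console.error("Transcription failed:", err);
    else console.log("Transcription:", text);
});
tracker.handle({ mt: "TransferTranscribedAudio", src: src, TranscribedAudioData: "Happy ", DataComplete: false });
tracker.handle({ mt: "TransferTranscribedAudio", src: src, TranscribedAudioData: "transcriptions\n", DataComplete: true });
```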

Example

You can consume the API com.innovaphone.transcriptions and send requests to it. Inside the callback, recv.msg.mt identifies the message type, and the transcribed text is available in recv.msg.TranscribedAudioData.


    // Consume the transcriptions API
    var transcriptionsApi = start.consumeApi("com.innovaphone.transcriptions");

    // Sends a TranscriptionRequest; file holds the audio data to upload
    function requestTranscription(file) {
        transcriptionsApi.sendSrc({ mt: "TranscriptionRequest", lang: "de", src: "src0" }, transcriptionsApi.providers[0], (recv) => onRequest(recv, file));
    }

    // Callback for all messages that answer the TranscriptionRequest
    function onRequest(recv, file) {
        const msg = recv.msg;

        if (msg.mt === "TranscriptionRequestResult") {
            // Send the audio file via HTTP POST to the given URL
            const postUrl = msg.url;
            const httpReq = new XMLHttpRequest();
            httpReq.open("POST", postUrl, true);
            httpReq.onload = () => {
                if (httpReq.status === 200) {
                    console.log("Audio uploaded successfully.");
                } else {
                    console.error("Upload failed:", httpReq.statusText);
                }
            };
            httpReq.onerror = () => console.error("Network error during upload.");
            httpReq.send(file);
        }
        else if (msg.mt === "TransferTranscribedAudio") {
            // A failed transcription carries an error field and no text
            if (msg.error) {
                console.error("Transcription failed:", msg.error);
                return;
            }
            console.log("Transcribed text:", msg.TranscribedAudioData);
            if (msg.DataComplete) {
                console.log("Transcription complete.");
            }
        }
    }

    // file must be provided by the caller and hold the audio data to transcribe
    requestTranscription(file);