Swift: Speech-To-Text With SpeechAnalyzer



This content originally appeared on Level Up Coding – Medium and was authored by Itsuki

Real Time Or From File.

Full Code On GitHub!

If you have ever worked with speech to text, you might know that we had the legacy SFSpeechRecognizer, a Swift wrapper around some Objective-C code, mainly aimed at short-form dictation.

We got something better at this year’s (2025) WWDC, the brand new SpeechAnalyzer!

Faster, more flexible, and usable for transcribing long-form, distant audio such as lectures and conversational speech!

And most importantly! It is actual Swift! Using Swift Concurrency!

Apple is already using this new API in many of the system apps such as Notes, Voice Memos, Journal, and more!

Above is what APPLE WANTS US TO KNOW!

Of course, these are true, but there are also the not-so-good parts that Apple doesn’t mention, not in the Bring advanced speech-to-text to your app with SpeechAnalyzer session, nor in the documentation!

  1. We still rely on the user choosing a Locale.
  2. Non-Latin (or maybe even non-English?) language support is pretty poor. For example, こんにちは (Hello) is sometimes (not always) transcribed as Konichiwa, and sometimes it is indeed こんにちは…
  3. Critical bugs that seriously affect the capability of the framework/API. I will be sharing more details on what they are as we work through!

In this article, let’s check out how we can use this SpeechAnalyzer for transcribing speech to text, both for real-time audio input and for audio from files!

We will also be checking out how we can improve the user experience by preheating the analyzer, setting the priority of analysis work, caching analyzer resources, and more!

Wait!

Doesn’t Apple provide a full sample code for it?

  1. Apple’s engineers are always way too smart for me to understand!
  2. It is probably (or definitely) written with a Beta version that is too old…
  3. There are a lot of points to pay attention to that we (or at least I) won’t notice until we build it ourselves!

So!

Feel free to grab the full code from my GitHub and let’s start!

SpeechAnalyzer Overview

This SpeechAnalyzer class is the interaction point for performing speech analysis.

It is responsible for managing an analysis session to which we can add different SpeechModules, such as SpeechTranscriber, each performing a specific analysis task. We will be getting a little more into those modules next!

The audio input to be analyzed is fed directly into the SpeechAnalyzer class, whereas we get the output from the individual SpeechModules, because different modules perform different tasks and therefore yield different results.

And one of the best parts about this SpeechAnalyzer (in my opinion) is that the entire analysis process is asynchronous, and the input, output, and session control are decoupled!

That is!

We can feed the audio to the analyzer as it becomes available, while displaying or further processing the result from the module independently somewhere else!

Both by using AsyncSequence!

We will see how this actually works out in just a couple seconds!

NOTE: The analyzer can only analyze one input sequence at a time.
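To give a rough preview of that shape (a conceptual sketch only; the transcriber, the analyzer, and the input stream are all set up properly later in this article):

// Output side: consume the transcriber results independently, as they arrive.
Task {
    for try await result in transcriber.results {
        print(result.text)
    }
}

// Input side: feed audio to the analyzer as it becomes available.
let (inputStream, inputContinuation) = AsyncStream<AnalyzerInput>.makeStream()
try await analyzer.start(inputSequence: inputStream)
inputContinuation.yield(AnalyzerInput(buffer: someAudioBuffer)) // someAudioBuffer: whatever AVAudioPCMBuffer you have in hand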

Modules

Time to take a look at those different modules available!

SpeechTranscriber

A speech-to-text transcription module that’s appropriate for normal conversation and general purposes.

This is what we will be using in most cases.

DictationTranscriber

A fallback for when SpeechTranscriber is not available, compatible with older devices.

(I don’t care!)

SpeechDetector

№1 bug for today!

(Screenshots from Apple’s documentation: the SpeechDetector page, and the list of types conforming to SpeechModule.)

Notice the problem?

Apple says it is a module, without having it conform to the SpeechModule protocol!

That is!

If we try to do what the documentation suggests for enabling voice-activated transcription, adding a SpeechDetector to the analyzer…

https://developer.apple.com/documentation/speech/speechdetector
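Here is roughly what that attempt looks like (a sketch based on my reading of the documentation; the SpeechDetector initializer and its options below are paraphrased, so treat their exact signatures as assumptions):

// What the documentation suggests for voice-activated transcription:
// add a SpeechDetector alongside the transcriber when creating the analyzer.
let locale = Locale(identifier: "en-US")
let detector = SpeechDetector(detectionOptions: .init(sensitivityLevel: .medium), reportResults: false)
let transcriber = SpeechTranscriber(locale: locale, preset: .timeIndexedProgressiveTranscription)

// This is where things fall apart: `modules` expects values conforming to SpeechModule,
// and SpeechDetector does not conform to it.
let analyzer = SpeechAnalyzer(modules: [detector, transcriber])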

Here is what we will get: a compile error on the analyzer creation!

Of course we will get this error!

Because SpeechDetector does not conform to the protocol the modules parameter needs!

General Steps

Here are the general steps for transcribing speech to text, whether from audio files or streams.

To set up:

  1. Check SpeechTranscriber module availability with the isAvailable property. You could try falling back to DictationTranscriber, but I don’t care about people who use devices that do not support SpeechTranscriber, so I will just be disabling the feature in this article.
  2. Check whether the requested Locale is supported, either with supportedLocales, or with supportedLocale(equivalentTo:) to obtain a near-equivalent if there is no exact match: a supported (and by preference already-installed) locale that shares the same Locale.LanguageCode value but has a different Locale.Region value.
  3. Create and configure the SpeechTranscriber module, then create the SpeechAnalyzer with that module.
  4. Set up the output handler for obtaining the analysis results by for awaiting on the results sequence.
  5. Download the assets required by the modules (Locale), if needed, with AssetInventory.

When we are ready to transcribe some audio:

  1. Create an input sequence (AsyncStream) we can use to provide the spoken audio when analyzing from a stream, or an AVAudioFile when analyzing from a file.
  2. Start the analysis.
  3. Finalize the analysis when desired. Finalize, not finish. Those are not the same, as we will get into in a little more detail later. Basically, finalize finishes the current input, whereas finish finishes the entire analysis session, after which we cannot reuse the analyzer for any further analysis!

⭐⭐⭐ Transcribe From File ⭐⭐⭐

I really want to say let’s start simple here with transcribing audio from a file, but the fact is that I already ran into a couple of problems that make me not want to categorize it as simple!

Anyway, let’s use it to check out the steps above, one by one!

Starting with setting up the SpeechAnalyzer!

Check Module Availability

We can check whether the SpeechTranscriber module is available or not by simply using the isAvailable property, a Boolean value indicating whether this module is available given the device’s hardware and capabilities.

guard SpeechTranscriber.isAvailable else {
    throw _Error.notAvailable
}

Check Locale

If you want to check whether there is an exact match, you can use the supportedLocales property, which includes all the locales that the transcriber can transcribe into, including locales that may not be installed but are downloadable.

However, let’s use the supportedLocale(equivalentTo:) function here to obtain a near-equivalent when possible: a supported (and by preference already-installed) locale that shares the same Locale.LanguageCode value but has a different Locale.Region value, if there is no exact equivalent. This may result in an unexpected transcription, such as color versus colour, but I personally think it is a little better than just claiming we don’t support that language at all!

guard let locale = await SpeechTranscriber.supportedLocale(equivalentTo: locale) else {
    throw _Error.localeNotSupported
}

(Optional) Reserve Locale Asset

We can add the locale above to the app’s current asset reservations by using the reserve(locale:) function on AssetInventory.

try await AssetInventory.reserve(locale: locale)

This is optional because the AssetInventory class does this automatically if needed, and it is only necessary for modules with locale-specific assets, that is, modules conforming to LocaleDependentSpeechModule.

Create SpeechTranscriber

To create a SpeechTranscriber, we can either use init(locale:preset:) to create a general-purpose transcriber according to a Preset, or init(locale:transcriptionOptions:reportingOptions:attributeOptions:) to customize all the options.

We can also modify the values of a preset’s properties and configure a transcriber with the modified values or even create our own presets by extending SpeechTranscriber.Preset.

The configurations of the built-in presets are summarized in a table on the Preset documentation page.

https://developer.apple.com/documentation/speech/speechtranscriber/preset

Here, we will be using the timeIndexedProgressiveTranscription as the base plus some additional properties/options.

private let preset: SpeechTranscriber.Preset = .timeIndexedProgressiveTranscription
// ...
transcriber = SpeechTranscriber(
    locale: locale,
    transcriptionOptions: self.preset.transcriptionOptions,
    reportingOptions: self.preset.reportingOptions.union([.alternativeTranscriptions]),
    attributeOptions: self.preset.attributeOptions.union([.transcriptionConfidence])
)

Create Analyzer

In its simplest form, all we have to do is pass in the module we have created, using init(modules:options:), to create a SpeechAnalyzer.

analyzer = SpeechAnalyzer(modules: [transcriber])

We can also use the setModules(_:) function to add or remove modules after analyzer creation, by providing a list of newModules to include in the analyzer. These modules replace the previous modules, but we may preserve previous modules by including them in the list.
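For example, a minimal sketch (someOtherModule here is purely hypothetical, just to illustrate the replace-the-whole-list behavior):

// Swap in a new module list mid-stream while keeping the existing transcriber.
// Leaving `transcriber` out of the list would remove it from the analyzer.
try await analyzer.setModules([transcriber, someOtherModule])

// Later, go back to the transcriber alone.
try await analyzer.setModules([transcriber])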

Two important notes here!
  1. Modules can be added to or removed from the analyzer mid-stream. A newly-added module will immediately begin analysis on new audio input, but it will not have access to already-analyzed audio.
  2. Modules cannot be reused from a different analyzer.

Now, to delay or prevent unloading an analyzer’s resources and set the priority of analysis work, we can further specify the options parameter when constructing the analyzer, by selecting a SpeechAnalyzer.Options.ModelRetention and a TaskPriority.

analyzer = SpeechAnalyzer(modules: [transcriber], options: .init(priority: .userInitiated, modelRetention: .processLifetime))

(Optional) Preheat Analyzer

To proactively load system resources and “preheat” the analyzer, we can call prepareToAnalyze(in:) after setting the modules. This may improve how quickly the modules return their first results.

self.bestAvailableAudioFormat = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])

try await analyzer.prepareToAnalyze(in: self.bestAvailableAudioFormat, withProgressReadyHandler: nil)

Set Up Handler

As I have mentioned above, the output of the module is provided as an AsyncSequence, so to consume it, let’s set up a simple Task with a for await loop.

transcriptionResultsTask = Task {
    do {
        for try await result in transcriber.results {
            print(result.text, result.isFinal)
            //....
        }
    } catch(let error) {
        if error is CancellationError {
            print("task cancelled")
            return
        }
        self.error = error
    }
}

When the analysis session becomes finished, a CancellationError is thrown from all waiting methods and result streams.

Download Locale Assets If needed

First of all, we can check if the locale is already installed or not by using the installedLocales property.

If it is not included, we can then call assetInstallationRequest(supporting:) on AssetInventory to obtain an instance of AssetInstallationRequest and call its downloadAndInstall() method to download the asset.

let installed = (await SpeechTranscriber.installedLocales).contains(locale)

// Before using the SpeechAnalyzer class, we must install the assets required by the modules (Locale) we plan to use.
// These assets are machine-learning models downloaded from Apple’s servers and managed by the system.
if !installed {
    // If the current status is .installed, this returns nil, indicating that nothing further needs to be done.
    // It throws an error if the assets are not supported or no reservations are available.
    // If some of the assets require locales that aren’t reserved, it automatically reserves those locales;
    // if that would exceed maximumReservedLocales, it throws an error.
    if let installationRequest = try await AssetInventory.assetInstallationRequest(supporting: [transcriber]) {
        try await installationRequest.downloadAndInstall()
    }
}

Transcribe

Here is the part where I ran into some trouble! Let me first share the code with you and then point out the problems!

// for transcribing a file
func transcribeFile(_ fileURL: URL) async throws {
    let _ = fileURL.startAccessingSecurityScopedResource()
    let audioFile = try AVAudioFile(forReading: fileURL)
    let cmTime = try await analyzer.analyzeSequence(from: audioFile)
    try await self.analyzer.finalize(through: cmTime)
    fileURL.stopAccessingSecurityScopedResource()
}

Yes!

Five lines!

But problems occur!

Anyway, let’s take a look at what we have here!

We call startAccessingSecurityScopedResource to access the URL (if the URL is not security-scoped, we will get false as the return value, but we will still be able to create an AVAudioFile, so we move on regardless). We then create an AVAudioFile for reading, call analyzeSequence(from:) to analyze an input sequence created from the audio file (it returns when the file has been read), call finalize(through:) to wait for the analysis to complete, and call stopAccessingSecurityScopedResource once we finish.

By the way, if we try to create an AVAudioFile from a security-scoped URL without calling startAccessingSecurityScopedResource first, we will get this com.apple.coreaudio.avfaudio error -54.

Simple! So where are the problems?

To perform the analysis, there is also an autonomous version, start(inputAudioFile:finishAfterFile:), which starts analysis of an input sequence created from an audio file and returns immediately.
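For reference, that call looks roughly like this (a sketch; we will get to why the finishAfterFile flag is the catch in a second):

// Autonomous version: kicks off analysis of the file and returns immediately,
// instead of returning once the file has been read.
try await analyzer.start(inputAudioFile: audioFile, finishAfterFile: false)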

However!

When using this method with finishAfterFile set to false, the analyzer will not analyze the last couple of buffers of the file correctly, nor finalize the output results!

That is, within that for try await result output handler above, I never observed a single result with the isFinal property being true, at least not for a fairly short audio file!

So why don’t we just call finalize(through:) the same way as we did above? Unfortunately, since we don’t know the CMTime of the last audio sample of the file, passing nil to this function and calling it will not solve our problem!

Then, why don’t we set finishAfterFile to true?

If we have set it to true, it will INDEED analyze and finalize the file correctly!

However, this will make the analysis finish after the audio file has been fully processed, equivalent to calling finalizeAndFinishThroughEndOfInput().

What does that mean?

At the return of the finish(after:) method, or of any of the other methods that finish the analysis session:

  1. The modules’ (SpeechTranscriber, etc.) result streams will have ended, and the modules will not accept further input from the input sequence.
  2. The analyzer will not be able to resume analysis with a different input sequence and will not accept module changes; most methods will do nothing.

That is, we cannot reuse our SpeechTranscriber or the SpeechAnalyzer for any further transcribing tasks!

I want to be able to reuse those resources instead of creating new ones on every single analysis task (input)!
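For example, because transcribeFile(_:) above only finalizes, something as simple as this (with fileURLs being whatever list of files you have) keeps reusing the same analyzer and transcriber:

// Reuse the same analyzer/transcriber across multiple files:
// each call analyzes one file and finalizes its own results, without finishing the session.
for fileURL in fileURLs {
    try await transcribeFile(fileURL)
}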

Clean Up

Two things we want to do when tearing down the analyzer.

First of all, call one of the finish methods to end the analysis session. We might also want to remove an asset locale reservation with release(reservedLocale:), so that the system can remove those assets at a later time to free up some space on the user’s device.

func finishAnalysisSession() async {
    // To end an analysis session, we must use one of the analyzer’s finish methods or parameters, or deallocate the analyzer.
    await self.analyzer.cancelAndFinishNow()

    // Removes an asset locale reservation.
    // The system will remove the assets at a later time.
    await AssetInventory.release(reservedLocale: self.locale)
}

⭐⭐⭐ Real Time Transcription ⭐⭐⭐

Thought it would be a lot harder?

Nope! Not at all!

In this article, I assume that you know how to capture audio input with AVAudioEngine and already have those AVAudioPCMBuffers from the installTap block in hand, so that we can focus only on the SpeechAnalyzer part!

If you need a catch-up, please check out my previous article SwiftUI: AVAudioEngine With Swift Concurrency! It has all we need here!
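Just as a one-glance reminder, the tap side looks roughly like this (a sketch; transcriptionManager and the buffer size are placeholders of mine, and streamAudioToTranscriber(_:) is defined a little further down):

import AVFoundation

let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode

// Hand every captured buffer over to the SpeechAnalyzer pipeline.
inputNode.installTap(onBus: 0, bufferSize: 4096, format: inputNode.outputFormat(forBus: 0)) { buffer, _ in
    transcriptionManager.streamAudioToTranscriber(buffer)
}

audioEngine.prepare()
try audioEngine.start()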

In addition to those basic steps we had above, here are the additional/different ones we will need when performing real-time analysis!

For set up:

  1. Before starting the analysis, create an AsyncStream of AnalyzerInput. We will be using this to feed the audio buffers to the analyzer for analysis whenever we obtain a new one from the input node’s installTap block.
  2. Call start(inputSequence:), passing in the stream above, to start the analysis.

func startRealTimeTranscription() async throws {
    (inputStream, inputContinuation) = AsyncStream<AnalyzerInput>.makeStream()
    try await analyzer.start(inputSequence: inputStream!)
}

When we get a new AVAudioPCMBuffer from the installTap block:

  1. Convert the AVAudioFormat of the buffer to one supported by the analyzer’s modules, obtained with the bestAvailableAudioFormat(compatibleWith:considering:) method.
  2. Create an AnalyzerInput using the converted AVAudioPCMBuffer.
  3. Use AsyncStream.Continuation to yield(_:) the new value to the analyzer.

func streamAudioToTranscriber(_ buffer: AVAudioPCMBuffer) {
    let format: AVAudioFormat = self.bestAvailableAudioFormat ?? buffer.format

    // fall back to the original buffer if conversion fails
    var convertedBuffer: AVAudioPCMBuffer = buffer

    do {
        convertedBuffer = try self.convertBuffer(buffer, to: format)
    } catch(let error) {
        print("error converting buffer: \(error)")
    }

    let input: AnalyzerInput = AnalyzerInput(buffer: convertedBuffer)
    self.inputContinuation?.yield(input)
}

// https://developer.apple.com/documentation/speech/bringing-advanced-speech-to-text-capabilities-to-your-app
func convertBuffer(_ buffer: AVAudioPCMBuffer, to format: AVAudioFormat) throws -> AVAudioPCMBuffer {
    let inputFormat = buffer.format

    guard inputFormat != format else {
        return buffer
    }

    if audioConverter == nil || audioConverter?.outputFormat != format {
        audioConverter = AVAudioConverter(from: inputFormat, to: format)
        audioConverter?.primeMethod = .none // Sacrifice quality of first samples in order to avoid any timestamp drift from source
    }

    guard let audioConverter = audioConverter else {
        throw _Error.audioConverterCreationFailed
    }

    let sampleRateRatio = audioConverter.outputFormat.sampleRate / audioConverter.inputFormat.sampleRate
    let scaledInputFrameLength = Double(buffer.frameLength) * sampleRateRatio
    let frameCapacity = AVAudioFrameCount(scaledInputFrameLength.rounded(.up))
    guard let conversionBuffer = AVAudioPCMBuffer(pcmFormat: audioConverter.outputFormat, frameCapacity: frameCapacity) else {
        throw _Error.failedToConvertBuffer("Failed to create AVAudioPCMBuffer.")
    }

    var nsError: NSError?
    var bufferProcessed = false

    let status = audioConverter.convert(to: conversionBuffer, error: &nsError) { packetCount, inputStatusPointer in
        defer { bufferProcessed = true }
        // This closure can be called multiple times, but it only offers a single buffer.
        inputStatusPointer.pointee = bufferProcessed ? .noDataNow : .haveData
        return bufferProcessed ? nil : buffer
    }

    guard status != .error else {
        throw _Error.failedToConvertBuffer(nsError?.localizedDescription)
    }

    return conversionBuffer
}

To finalize (finish for the specific input, but not the session):

  1. Call AsyncStream.Continuation.finish() on the input stream.
  2. Call finalize(through:) with the CMTime set to nil, to finalize up to and including the last audio the analyzer has taken from the input sequence.

func finalizePreviousTranscribing() async throws {
    self.inputContinuation?.finish()
    self.inputStream = nil
    self.inputContinuation = nil
    // When nil, finalizes up to and including the last audio the analyzer has taken from the input sequence.
    try await self.analyzer.finalize(through: nil)
}

Error Handling

In addition to the CancellationError handled above within the output handler, there are also a couple of Speech framework specific errors we can react to.

For example, to return some custom messages.

let nsError: NSError = error as NSError
let code = nsError.code
let domain = nsError.domain

// SFSpeechError.Code: https://developer.apple.com/documentation/speech/sfspeecherror/code
if domain == SFSpeechError.errorDomain {
    switch code {

    // Audio input errors
    case SFSpeechError.Code.audioDisordered.rawValue:
        return "The audio input time-code overlaps or precedes prior audio input."

    case SFSpeechError.Code.audioReadFailed.rawValue:
        return "Failed to read the audio file."

    // Audio format errors
    case SFSpeechError.Code.incompatibleAudioFormats.rawValue:
        return "The selected modules do not have an audio format in common."

    case SFSpeechError.Code.unexpectedAudioFormat.rawValue:
        return "The audio input is in an unexpected format."

    // Asset errors
    case SFSpeechError.Code.assetLocaleNotAllocated.rawValue:
        return "The asset locale has not been allocated."

    case SFSpeechError.Code.cannotAllocateUnsupportedLocale.rawValue:
        return "The requested asset locale is not supported by the Speech framework."

    case SFSpeechError.Code.noModel.rawValue:
        return "The selected locale/options do not have an appropriate model available or downloadable."

    case SFSpeechError.Code.timeout.rawValue:
        return "The operation timed out."

    case SFSpeechError.Code.tooManyAssetLocalesAllocated.rawValue:
        return "The application has allocated too many locales."

    // Custom language model errors
    case SFSpeechError.Code.malformedSupplementalModel.rawValue:
        return "The custom language model file was malformed."

    case SFSpeechError.Code.missingParameter.rawValue:
        return "A required parameter is missing/nil."

    case SFSpeechError.Code.undefinedTemplateClassName.rawValue:
        return "The custom language model templates were malformed."

    // Other errors
    case SFSpeechError.Code.insufficientResources.rawValue:
        return "There are not sufficient resources available on-device to process the incoming transcription request."

    case SFSpeechError.Code.internalServiceError.rawValue:
        return "An internal error occurred."

    case SFSpeechError.Code.moduleOutputFailed.rawValue:
        return "The module’s result task failed."

    default:
        break
    }
}

Thank you for reading!

That’s it for this article!

It is getting pretty long, so please allow me to leave the full code out here!

You can always just grab it from my GitHub!

Happy speech-to-texting!

