Swift/iOS wrapper for TFLite libdeepspeech #3061
It looks like building a fat binary for x86_64 and arm64 is not useful because you still need two separate frameworks for device and simulator, as they use different SDKs. I've edited the first comment to reflect this. |
Some initial tests on device (iPhone Xs), averaged across 3 runs, with the 0.7.4 models: no scorer, cold cache: RTF 0.60x |
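(For reference, RTF here is the real-time factor: wall-clock processing time divided by audio duration, so 0.60x means a clip is transcribed in 60% of its own length. A minimal sketch of how one might compute it; `transcribe` is a hypothetical placeholder for the actual call, not an API from this repo:)

```swift
import Foundation

// Sketch: real-time factor = processing time / audio duration.
// `transcribe` is a placeholder closure, not an API from this repo.
func measureRTF(audioDuration: TimeInterval, transcribe: () -> String) -> Double {
    let start = Date()
    _ = transcribe()                              // run the model over the clip
    let elapsed = Date().timeIntervalSince(start)
    return elapsed / audioDuration                // < 1.0 means faster than real time
}
```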
@lissyx could you give me a quick overview of what it would take to test changes to macOS workers in an isolated environment? Like, say, adding a macos-heavy-b worker type and spawning tasks with it to test. |
My probably incomplete idea:
|
You'd have to use the script that is on the worker #1
We don't have a nicer way to spin up new workers, mostly because it's not something we needed to do often and because it'd again require much more tooling. Given the current status of our macOS workers ... Doing that in parallel with running the existing infra is likely to be complicated because of ... resources (CPU / RAM). I thought disk would be an issue but that should be fine. |
I assume I can find the appropriate versions and config changes by inspecting the currently running VMs, right? In that case, I just need the IP of worker 1 to get started.
Yeah. I thought about making a VM copy of a worker to side-step these provisioning issues but I guess that's also prone to causing problems.
I would probably run that worker on my personal machine while I test it, since it's not meant for general availability. |
Indeed, you can fetch the json config. IPs on matrix.
Be aware that the existing workers, if you copy them to your system, are meant for VMware Fusion Pro. |
Ah, yes, that was also one of the complications. OK, thanks. |
(Hopefully) finished wrapping the C API, now moving on to CI work. Wrapper is here if anyone wants to take a quick look and provide any suggestions: https://github.com/mozilla/DeepSpeech/blob/ios-build/native_client/swift/deepspeech_ios/DeepSpeech.swift |
In particular I'd be very interested in any feedback from Swift developers on how the error handling looks and how the buffer handling around … |
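For readers skimming the thread, the general shape being asked about is: C functions that report failure through an error code get wrapped into Swift `throws`, and audio input is exposed as an `UnsafeBufferPointer<Int16>` that callers obtain via `withUnsafeBufferPointer`. The following is a rough sketch of that pattern only; the real wrapper in DeepSpeech.swift differs in names and details, and `ds_feed` below is a stand-in, not the actual C symbol:

```swift
import Foundation

// Hypothetical error type; the real wrapper maps libdeepspeech error codes its own way.
enum SpeechError: Error {
    case native(code: Int32)
}

// Stand-in for a C function imported via the framework's module map (not the real symbol).
func ds_feed(_ ctx: OpaquePointer?, _ buf: UnsafePointer<Int16>?, _ len: UInt32) -> Int32 { 0 }

final class SketchStream {
    private var ctx: OpaquePointer?
    init(ctx: OpaquePointer?) { self.ctx = ctx }

    // Error handling: translate a non-zero C error code into a Swift throw.
    private func check(_ code: Int32) throws {
        guard code == 0 else { throw SpeechError.native(code: code) }
    }

    // Buffer handling: the pointer is only valid for the duration of the call,
    // so it is consumed immediately and never stored.
    func feedAudioContent(buffer: UnsafeBufferPointer<Int16>) throws {
        precondition(ctx != nil, "calling method on invalidated stream")
        try check(ds_feed(ctx, buffer.baseAddress, UInt32(buffer.count)))
    }
}

// Caller side: withUnsafeBufferPointer pins the array's storage for the call.
func feed(_ samples: [Int16], into stream: SketchStream) throws {
    try samples.withUnsafeBufferPointer { ptr in
        try stream.feedAudioContent(buffer: ptr)
    }
}
```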
PR for adding macos-heavy-b and macos-light-b worker type and instances: taskcluster/community-tc-config#308 |
Looks like adding Xcode to the worker brings the free space to under 10GB which stops the taskcluster client from picking up any jobs... Resizing the partition does not seem to work. My next step will be to create a worker from scratch with a larger disk image. |
@dabinat tagging you because you mentioned interest in these bindings in the CoreML issue, in case you have anything to mention regarding the design of the bindings here. |
@reuben super awesome that you are working on this! This is actually perfect timing as we are looking for an offline speech recognition for iOS right now. I know it's not finished yet, but could you provide a small guide on how I could try it out, is the .so library already available somewhere? Maybe then I could also help with writing an example app, if you wish. |
On TaskCluster, if you browse to the iOS artifacts section you should get it. |
Thank you! I'll try it out! |
I tried it out with the current 0.7.4 models from the release page and one of the audio files from there.
And then comes this error, at this line of DeepSpeech.swift:
178 public func feedAudioContent(buffer: UnsafeBufferPointer<Int16>) {
180     precondition(streamCtx != nil, "calling method on invalidated Stream")
181
182     DS_FeedAudioContent(streamCtx, buffer.baseAddress, UInt32(buffer.count)) <<<<< Thread 5: EXC_BAD_ACCESS (code=1, address=0x177000075)
183 }
Just leaving this here, but I don't want to bother too much before this is even considered finished :D Update: Ah sorry, I saw the version of DeepSpeech and I guess the models are not compatible. |
The models should be compatible. I don't know what's going on there, can't reproduce it locally... |
Does the log have any more details for this error? |
Also, maybe double check the signing options in Xcode? At some point when writing the bindings I ran into some runtime exceptions due to incorrect signing options. |
You're right. Since it crashes when communication with the library happens, it's probably just included incorrectly. So, to go through what I tried:
The error in the post before came after trying some things in the "Frameworks, Libraries and Embedded Content" section.
What do you have there? Maybe some things have to be switched to embed & sign? |
After setting that, here is the full log I got for that error. |
I tried a simple test app where I loaded a pre-converted file of a few seconds into memory and called DeepSpeechModel.speechToText. There are no crashes or anything, but the resulting text string is empty. It seemed from the header like all I had to do was initialize a DeepSpeechModel with the .tflite file and then call speechToText with the buffer. Did I miss a step? Do I need to setup a streaming context even if I’m not streaming? |
That's correct, you don't need to setup a streaming context. |
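So the batch path described above boils down to something like the sketch below. The exact signatures are assumptions (check DeepSpeech.swift for the real ones); only the names DeepSpeechModel and speechToText come from this thread, and the module name is taken from the repo path:

```swift
import deepspeech_ios // assumed module name; adjust if the framework is named differently

// Sketch of the non-streaming path: load samples, call speechToText, no stream needed.
// Assumes speechToText accepts an UnsafeBufferPointer<Int16>; adjust to the real signature.
func transcribeOnce(model: DeepSpeechModel, samples: [Int16]) -> String {
    samples.withUnsafeBufferPointer { buf in
        model.speechToText(buffer: buf) // assumed signature
    }
}
```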
@reuben Thanks, that solved the Android issue! It also allowed me to compare memory usage between iOS and Android. On the D-Day speech, Android climbs to ~200 MB usage almost immediately and stays there, whereas iOS climbs more slowly to a similar number (or crashes on longer speeches). It seems like you were right to assume there was no memory leak, though it's weird to me that memory use is >10x the size of the input file.
However, on longer files, the crash on iOS is still present. Memory usage on Android balloons to >500 MB on the Reagan RNC speech and runs for a long while (I'm virtualizing Android on a beefy PC), so I'll update if it finishes.
Additionally (this might be related to the -bitexact issue), on iOS on the shorter speeches there are long periods of sparse transcription (one word for a minute of speech) in between chunks of near-perfect transcription. The audio quality doesn't noticeably decrease in those areas when I listen.
Finally, in the process of updating react-native-transcription to the latest DeepSpeech version, I noticed that Android DeepSpeech 0.9.3 hadn't been published to Maven yet, only 0.9.2. |
@reuben Another interesting bug I discovered while trying to find clues: the iOS example happily transcribed non-16 kHz wav files while the Android example refused them. Also weird: Android and iOS give different transcriptions for the same noisy input despite using the exact same pretrained model and audio files. |
@zaptrem @CatalinVoss What's the installation process for testing on iOS? There isn't a CocoaPod yet, is there? |
It's not surprising and not a bug; @reuben already explained it: the Android example's wav reading is hardcoded to reduce the dependencies required by CI, so it only handles 16 kHz and it's not robust to broken wav files. |
@reuben Trying again in case I sent it too early in the morning last week. @CatalinVoss Are you able to reproduce this with your binary? |
Can you get a backtrace from inside deepspeech by typing …? |
@CatalinVoss Here you go. I had to use a text file since I exceeded pastebin's size limit: deepSpeechCrashST.txt |
Ah. The relevant part is at the bottom
This looks like a crash in the beam search decoder. I'm not actually using the DeepSpeech decoder -- have my own -- so unfortunately I can't help :( it appears to be unrelated to the bug I saw |
@CatalinVoss Thanks for pointing that out! I'm curious; what about your use case required building an alternate decoder? Is it open source? @reuben What do you make of this error? Or is there a different person who worked on the beam decoder I should ping? Is this something I should move to its own, non-platform-specific bug report? |
Working on child literacy. Closed source I'm afraid |
@CatalinVoss Cool stuff! I can see why it's necessary in that case; detecting specific mispronunciations is more than hotword-boosting can do. Coincidentally, I stumbled across this Google feature that might interest you (https://www.theverge.com/2019/11/14/20964401/google-search-pronunciation-guide-feedback-machine-learning-ai) about 10 minutes ago in a discussion with someone. I ran the test again on an iPad Pro late 2018 (before it was an iPhone 11 Pro Max) and got a slightly different result. Does this still indicate decoder issues?
frame #4739: 0x000000010bb6bcf8 deepspeech_ios`PathTrie::iterate_to_vec(std::__1::vector<PathTrie*, std::__1::allocator<PathTrie*> >&) + 64
frame #4740: 0x000000010bb6bcf8 deepspeech_ios`PathTrie::iterate_to_vec(std::__1::vector<PathTrie*, std::__1::allocator<PathTrie*> >&) + 64
frame #4741: 0x000000010ba99114 deepspeech_ios`DecoderState::next(double const*, int, int) + 404
frame #4742: 0x000000010b903470 deepspeech_ios`StreamingState::processBatch(std::__1::vector<float, std::__1::allocator<float> > const&, unsigned int) + 296
frame #4743: 0x000000010b903300 deepspeech_ios`StreamingState::pushMfccBuffer(std::__1::vector<float, std::__1::allocator<float> > const&) + 236
frame #4744: 0x000000010b902e88 deepspeech_ios`StreamingState::feedAudioContent(short const*, unsigned int) + 396
* frame #4745: 0x0000000101a80808 ReLearn`closure #1 in Transcription.render(samples=Swift.UnsafeRawBufferPointer @ 0x000000017032dc20, stream=(streamCtx = 0x000000010bfd2670)) at Transcription.swift:85:24
frame #4746: 0x0000000101a80830 ReLearn`thunk for @callee_guaranteed ***@***.*** UnsafeRawBufferPointer) -> ***@***.*** @owned Error) at <compiler-generated>:0
frame #4747: 0x0000000101a823d0 ReLearn`partial apply for thunk for @callee_guaranteed ***@***.*** UnsafeRawBufferPointer) -> ***@***.*** @owned Error) at <compiler-generated>:0
frame #4748: 0x00000001942f5174 libswiftFoundation.dylib`Foundation.Data.withUnsafeBytes<A>((Swift.UnsafeRawBufferPointer) throws -> A) throws -> A + 392
frame #4749: 0x0000000101a7ff68 ReLearn`Transcription.render(audioContext=0x0000000282e94930, stream=(streamCtx = 0x000000010bfd2670), self=0x0000000282f28cc0) at Transcription.swift:83:26
frame #4750: 0x0000000101a80eb8 ReLearn`closure #1 in Transcription.recognizeFile(audioContext=0x0000000282e94930, self=0x0000000282f28cc0, stream=(streamCtx = 0x000000010bfd2670), audioPath=Swift.String @ 0x000000017032e458, start=2021-01-13 02:00:54 EST) at Transcription.swift:107:18
frame #4751: 0x0000000101a7e23c ReLearn`closure #1 in static AudioContext.load(asset=0x000000028278c8a0, assetTrack=0x00000002822ea6d0, audioURL=<unavailable; try printing with "vo" or "po">, completionHandler=0x0000000101a8248c ReLearn`partial apply forwarder for closure #1 (Swift.Optional<react_native_transcription.AudioContext>) -> () in react_native_transcription.Transcription.(recognizeFile in _79B1BC2F893AB0135086535C16DBA135)(audioPath: Swift.String) -> () at <compiler-generated>) at AudioContext.swift:58:17
frame #4752: 0x0000000101a5a490 ReLearn`thunk for @escaping @callee_guaranteed () -> () at <compiler-generated>:0
frame #4753: 0x000000010bd83bcc libdispatch.dylib`_dispatch_call_block_and_release + 32
frame #4754: 0x000000010bd856c0 libdispatch.dylib`_dispatch_client_callout + 20
frame #4755: 0x000000010bd8d354 libdispatch.dylib`_dispatch_lane_serial_drain + 736
frame #4756: 0x000000010bd8e0c0 libdispatch.dylib`_dispatch_lane_invoke + 448
frame #4757: 0x000000010bd9a644 libdispatch.dylib`_dispatch_workloop_worker_thread + 1520
frame #4758: 0x00000001db297804 libsystem_pthread.dylib`_pthread_wqthread + 276
|
Oh cool, I had not seen that! Yes, this still looks like decoder. :/ |
@reuben It's a bit late, but I just proposed those quick changes here: #3510 I also think there's some more debug info I can provide about the decoder issue, I'll test it tomorrow. |
@reuben Okay, poking at Xcode's debugging options/info is like trying to reverse engineer an alien spaceship at my current knowledge level, so I haven't learned much from a few hours trying that. However, I did get some interesting results from messing around with the scorer/models. Disabling the external scorer allowed the transcription process to run for much longer before eventually crashing (similar to the length it took my Android build to complete the same transcription task) with the same error. Additionally, using the Chinese model and scorer lasted the same extended amount of time as disabling the external scorer. Also, sometimes across all tests the "reading samples" message varied exactly once near the beginning of the job like so:
With the no-scorer/Chinese configurations there is a drop/spike in thread activity right before the crash. Is the transcription job being run on the AVAudioSession Notify Thread instead of another thread when using the English model? The alternatives have constant utilization on Thread 4 with a spike on the AV Notify Thread before the crash, whereas English has constant utilization on the AV Notify Thread. Also, the English model resulted in much higher CPU usage during transcription than Chinese/no scorer. English runtime: 1 minute.
Unlikely theory: Maybe it's an ARM-specific issue? I've been testing Android with an x86 emulator and have no way to verify this one way or another (my ARM emulator isn't working). |
@CatalinVoss Any idea why it would still be failing when the external scorer is disabled? |
@reuben Does any of this help? Is there potentially an FFMPEG setting that is causing this, since it doesn't happen with non-converted WAV files? It's been over a month since your last response, so is there any more experimentation/testing I can do to help this along? |
@reuben I tried converting from MKV files encoded with Vorbis instead of the .ogg I was using before. Didn't work. I tried converting with Audacity manually. Didn't work. I wonder if it has nothing to do with the conversion and it's actually just the length of the recording. Do you know of any long conversations recorded natively in PCM 16 that we can test with? I'm led to this conclusion not just from my testing, but because @CatalinVoss suggested it's a decoder issue (implying to me the audio file had already been successfully read and inference had completed). EDIT: Recording the WAV natively with Audacity results in no crash. Exporting the exact same recording as OGG and converting with FFMPEG crashes. |
@CatalinVoss (and @reuben if you're still here) I continued testing and I think the issue might(????) have something to do with a bug in render() caused by however FFMPEG/SoX converts 48 kHz to 16 kHz. I rewrote the render function on iOS to be much simpler and (I would think) more robust, since it uses higher-level, file-format-independent Apple APIs. The crash disappeared (hooray!) but I'm now getting gibberish short transcriptions (whereas the same files on Android give me acceptable results). Any idea where I went wrong here?
private func newRender(url: URL, stream: DeepSpeechStream) {
    let file = try! AVAudioFile(forReading: url)
    guard let format = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: file.fileFormat.sampleRate, channels: 1, interleaved: false) else { fatalError("Couldn't read the audio file") }
    print("reading file")
    var done = false;
    while(!done){
        let buf = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: 8096)!
        do {
            try file.read(into: buf)
        } catch {
            print("end of file");
            done = true
            return;
        }
        print("read 8096 frames")
        let floatBufPointer = UnsafeBufferPointer(start: buf.int16ChannelData?[0], count: Int(buf.frameLength))
        stream.feedAudioContent(buffer: floatBufPointer);
    }
} |
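For anyone reading along: one thing the snippet above never does is resample, while the released English model expects 16 kHz input. Below is a hedged sketch (not the code used in this thread) of reading a file with AVAudioFile and converting it to 16 kHz mono Int16 via AVAudioConverter before feeding the recognizer; the only wrapper API assumed is the stream.feedAudioContent(buffer:) and DeepSpeechStream shown above.

```swift
import AVFoundation

// Sketch (not the thread's code): decode any AVAudioFile-readable input and
// convert it to 16 kHz mono Int16 before feeding the recognizer stream.
func feedFile(url: URL, stream: DeepSpeechStream) throws {
    let file = try AVAudioFile(forReading: url)
    let srcFormat = file.processingFormat

    guard let dstFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                        sampleRate: 16000,
                                        channels: 1,
                                        interleaved: false),
          let converter = AVAudioConverter(from: srcFormat, to: dstFormat) else {
        fatalError("could not build converter")
    }

    let srcBuf = AVAudioPCMBuffer(pcmFormat: srcFormat, frameCapacity: 8192)!
    let dstBuf = AVAudioPCMBuffer(pcmFormat: dstFormat, frameCapacity: 8192)!

    var reachedEnd = false
    while !reachedEnd {
        // Pull source frames on demand; report .endOfStream once the file is drained.
        let inputBlock: AVAudioConverterInputBlock = { _, outStatus in
            do {
                try file.read(into: srcBuf)
            } catch {
                srcBuf.frameLength = 0
            }
            if srcBuf.frameLength == 0 {
                outStatus.pointee = .endOfStream
                return nil
            }
            outStatus.pointee = .haveData
            return srcBuf
        }

        var error: NSError?
        let status = converter.convert(to: dstBuf, error: &error, withInputFrom: inputBlock)
        if status == .error { throw error ?? NSError(domain: "AVAudioConverter", code: -1, userInfo: nil) }
        if status == .endOfStream { reachedEnd = true }

        // Feed whatever was converted this pass, as 16 kHz Int16 samples.
        if dstBuf.frameLength > 0, let ch = dstBuf.int16ChannelData {
            let samples = UnsafeBufferPointer(start: ch[0], count: Int(dstBuf.frameLength))
            stream.feedAudioContent(buffer: samples)
        }
        dstBuf.frameLength = 0
    }
}
```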
Sorry for the delay. I never used the audio file recognition path that goes through … As a hacky workaround, are you just calling …? Another way to isolate the issue may be to see if you encounter the issue with long-running mic detection (which, again, I don't see), but I'm not doing hours and hours of it either. As for your … |
@CatalinVoss Thanks for getting back! I could split it into chunks, but it would likely mess with transcriptions mid-sentence and mid-paragraph, as well as reset the time part of the tokens (I'm using intermediateDecodeWithMetadata as finishStreamWithMetadata also has an unknown crash... one problem at a time). It's called floatBufPointer because originally I was passing floats instead of 16-bit ints and I forgot to change it. As you can see here:
guard let format = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: file.fileFormat.sampleRate, channels: 1, interleaved: false) else { fatalError("Couldn't read the audio file") }
...
let floatBufPointer = UnsafeBufferPointer(start: buf.int16ChannelData?[0], count: Int(buf.frameLength))
I am passing the int16ChannelData. type(of:) tells me this is a pointer to an Int16 array; I can also wrap it in a Swift Array() and type(of:) reports [Int16]. Weirdly enough, when I pass the Swift Array() I get a different (but still gibberish) result. Would it be helpful if I opened a pull request with these changes so they're easy for others to look at? |
@reuben Any idea what I’m doing wrong in the proposed Swift code above? |
@reuben Can you respond with an estimate on when you can look at this? It's been over a month and a half. I've given it my best shot but have run out of ideas for now. |
@reuben @lissyx I gave up on fixing this and just took the accuracy hit from splitting the audio file into 5-minute segments. I've published the app using this, ReLearn, to the Android/iOS App Stores. It uses DeepSpeech to transcribe long video recordings of lectures for free on-device in the background (it also transcribes audio in-person recordings on Android, but uses Apple's solution for that on iOS for now). |
Fantastic job |
@zaptrem I've been wanting to build something very similar to this but just ran up against Apple's one minute live transcription limit with SFSpeechRecognizer. How are you getting around this in ReLearn? Any chance you'd like to collaborate on some live speech / mind mapping type applications? |
WIP: Build works fine on latest master with some small modifications (DS branch ios-build):
Build for simulator x86_64:
$ bazel build --verbose_failures --config=ios_x86_64 --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --config=monolithic -c opt //native_client:libdeepspeech.so --define=runtime=tflite --copt=-DTFLITE_WITH_RUY_GEMV
$ cp -f bazel-bin/native_client/libdeepspeech.so ../native_client/swift/libdeepspeech.so
Build for arm64:
$ bazel build --verbose_failures --config=ios_arm64 --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --config=monolithic -c opt //native_client:libdeepspeech.so --define=runtime=tflite --copt=-DTFLITE_WITH_RUY_GEMV
$ cp -f bazel-bin/native_client/libdeepspeech.so ../native_client/swift/libdeepspeech.so
Scope for 0.8:
Future scope: