Swift/iOS wrapper for TFLite libdeepspeech #3061

Open
9 of 11 tasks
reuben opened this issue Jun 14, 2020 · 130 comments

@reuben
Contributor

reuben commented Jun 14, 2020

WIP: Build works fine on latest master with some small modifications (DS branch ios-build):

Build for simulator x86_64:

$ bazel build --verbose_failures --config=ios_x86_64 --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --config=monolithic -c opt //native_client:libdeepspeech.so --define=runtime=tflite --copt=-DTFLITE_WITH_RUY_GEMV
$ cp -f bazel-bin/native_client/libdeepspeech.so ../native_client/swift/libdeepspeech.so

Build for arm64:

$ bazel build --verbose_failures --config=ios_arm64 --workspace_status_command="bash native_client/bazel_workspace_status_cmd.sh" --config=monolithic -c opt //native_client:libdeepspeech.so --define=runtime=tflite --copt=-DTFLITE_WITH_RUY_GEMV
$ cp -f bazel-bin/native_client/libdeepspeech.so ../native_client/swift/libdeepspeech.so

Scope for 0.8:

  • Does it actually run? Need to write a test app
  • How to embed and load model in device/simulator
  • Install full Xcode (not just command line tools) in CI workers
    • Create worker VMs locally to test changes
    • Get it working on new worker
    • Upgrade normal workers
    • Land code and CI changes
  • Publish native_client package for iOS
  • Publish deepspeech_ios.framework

Future scope:

  • Publish Swift docs on RTD (no Doxygen support, https://github.com/AnarchyTools/anarchy_sphinx only works with older Sphinx, maybe SourceKitten can be used)
  • How to test things in CI (can we embed a command line client binary in a test app and call it via simctl somehow to get stdout/stderr?)
@reuben reuben self-assigned this Jun 14, 2020
@reuben
Contributor Author

reuben commented Jun 15, 2020

It looks like building a fat binary for x86_64 and arm64 is not useful because you still need two separate frameworks for device and simulator, as they use different SDKs.

I've edited the first comment to reflect this.

@reuben
Contributor Author

reuben commented Jun 30, 2020

Some initial tests on device (iPhone Xs), averaged across 3 runs, with the 0.7.4 models:

no scorer, cold cache: RTF 0.60x
no scorer, warm cache: RTF 0.48x
with scorer, cold cache: RTF 0.55x
with scorer, warm cache: RTF 0.24x
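
For readers unfamiliar with the metric: the figures above are easiest to read under the common definition of real-time factor as processing time divided by audio duration, so anything below 1.0x is faster than real time. A minimal sketch under that assumption (the function name and the sample numbers are illustrative, not taken from the benchmark):

```swift
import Foundation

/// Real-time factor: processing time divided by audio duration.
/// Values below 1.0 mean transcription ran faster than real time.
func realTimeFactor(processingSeconds: Double, audioSeconds: Double) -> Double {
    precondition(audioSeconds > 0, "audio duration must be positive")
    return processingSeconds / audioSeconds
}

// Illustrative numbers only: a 10 s clip transcribed in 2.5 s.
let rtf = realTimeFactor(processingSeconds: 2.5, audioSeconds: 10.0)
print(String(format: "RTF %.2fx", rtf))
```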

@reuben
Contributor Author

reuben commented Jun 30, 2020

@lissyx could you give me a quick overview of what it would take to test changes to macOS workers in an isolated environment? Like, say, adding a macos-heavy-b worker type and spawning tasks with it to test.

@reuben
Contributor Author

reuben commented Jun 30, 2020

My probably incomplete idea:

  1. Make PR against https://github.com/mozilla/community-tc-config/blob/master/config/projects/deepspeech.yml adding new worker instance with -b type.
  2. Wait for it to be landed and deployed.
  3. Make a copy of one of the existing worker images, change the worker type, make other modifications, start VM.
  4. Spawn tasks against new worker type.

@lissyx
Collaborator

lissyx commented Jun 30, 2020

> My probably incomplete idea:
>
> 1. Make PR against https://github.com/mozilla/community-tc-config/blob/master/config/projects/deepspeech.yml adding new worker instance with -b type.
> 2. Wait for it to be landed and deployed.
> 3. Make a copy of one of the existing worker images, change the worker type, make other modifications, start VM.
> 4. Spawn tasks against new worker type.

You'd have to use the script that is on the worker #1 prov.sh, and you'd have to update the base image prior to that because I have not done it:

  • generic-worker version
  • taskclusterProxy
  • generic-worker.json config update

We don't have a nicer way to spin new workers mostly because it's not something we needed to do often and because it'd require again much more tooling. Given the current status of our macOS workers ...

Doing that in parallel of running existing infra is likely to be complicated because of ... resources (CPU / RAM). I thought disk would be an issue but that should be fine.

@reuben
Contributor Author

reuben commented Jun 30, 2020

> You'd have to use the script that is on the worker #1 prov.sh, and you'd have to update the base image prior to that because I have not done it:
>
> * `generic-worker` version
> * `taskclusterProxy`
> * `generic-worker.json` config update

I assume I can find the appropriate versions and config changes by inspecting the currently running VMs, right? In that case, I just need the IP of worker 1 to get started.

> We don't have a nicer way to spin new workers mostly because it's not something we needed to do often and because it'd require again much more tooling. Given the current status of our macOS workers ...

Yeah. I thought about making a VM copy of a worker to side-step these provisioning issues but I guess that's also prone to causing problems.

> Doing that in parallel of running existing infra is likely to be complicated because of ... resources (CPU / RAM). I thought disk would be an issue but that should be fine.

I would probably run that worker on my personal machine while I test it, since it's not meant for general availability.

@lissyx
Collaborator

lissyx commented Jun 30, 2020

> I assume I can find the appropriate versions and config changes by inspecting the currently running VMs, right? In that case, I just need the IP of worker 1 to get started.

Indeed, you can fetch the json config. IPs on matrix.

> I would probably run that worker on my personal machine while I test it, since it's not meant for general availability.

Be aware that the existing workers, if you copy them to your system, are meant for VMware Fusion Pro.

@reuben
Contributor Author

reuben commented Jun 30, 2020

> Be aware that the existing workers, if you copy them to your system, are meant for VMware Fusion Pro.

Ah, yes, that was also one of the complications. OK, thanks.

@reuben
Contributor Author

reuben commented Jul 8, 2020

(Hopefully) finished wrapping the C API, now moving on to CI work. Wrapper is here if anyone wants to take a quick look and provide any suggestions: https://github.com/mozilla/DeepSpeech/blob/ios-build/native_client/swift/deepspeech_ios/DeepSpeech.swift

@reuben
Contributor Author

reuben commented Jul 8, 2020

In particular I'd be very interested in any feedback from Swift developers on how the error handling looks and how the buffer handling around Model.speechToText and Stream.feedAudioContent looks.
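
To make the buffer-handling question concrete, here is a minimal sketch of one common pattern: a convenience overload that accepts a Swift array and hands a pointer to C only inside the `withUnsafeBufferPointer` closure, where the pointer is valid. `feedRaw` is a hypothetical stand-in for a C entry point such as `DS_FeedAudioContent`; it is not the real binding, and this is not necessarily how the wrapper linked above does it.

```swift
// Hypothetical stand-in for a C function such as DS_FeedAudioContent.
// It only sums the samples so that the example is self-contained.
func feedRaw(_ samples: UnsafePointer<Int16>?, _ count: UInt32) -> Int {
    guard let samples = samples else { return 0 }
    var sum = 0
    for i in 0..<Int(count) {
        sum += Int(samples[i])
    }
    return sum
}

/// Convenience overload: callers pass a plain array, and the pointer
/// never escapes the closure in which it is guaranteed to be valid.
/// The count forwarded to C is a number of samples, not a number of bytes.
func feed(samples: [Int16]) -> Int {
    return samples.withUnsafeBufferPointer { buffer in
        feedRaw(buffer.baseAddress, UInt32(buffer.count))
    }
}

let total = feed(samples: [1, -1, 5])
```

Keeping the unsafe pointer inside the closure means callers cannot accidentally hold on to it after the array has been freed or mutated.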

@reuben
Contributor Author

reuben commented Jul 8, 2020

PR for adding macos-heavy-b and macos-light-b worker type and instances: taskcluster/community-tc-config#308

@reuben
Contributor Author

reuben commented Jul 11, 2020

Looks like adding Xcode to the worker brings the free space to under 10GB which stops the taskcluster client from picking up any jobs... Resizing the partition does not seem to work. My next step will be to create a worker from scratch with a larger disk image.

@reuben
Contributor Author

reuben commented Jul 13, 2020

@dabinat tagging you because you mentioned interest in these bindings in the CoreML issue, in case you have anything to mention regarding the design of the bindings here.

@erksch
Contributor

erksch commented Jul 17, 2020

@reuben super awesome that you are working on this! This is actually perfect timing as we are looking for an offline speech recognition for iOS right now. I know it's not finished yet, but could you provide a small guide on how I could try it out, is the .so library already available somewhere? Maybe then I could also help with writing an example app, if you wish.

@lissyx
Collaborator

lissyx commented Jul 17, 2020

> @reuben super awesome that you are working on this! This is actually perfect timing as we are looking for an offline speech recognition for iOS right now. I know it's not finished yet, but could you provide a small guide on how I could try it out, is the .so library already available somewhere? Maybe then I could also help with writing an example app, if you wish.

on taskcluster, if you browse to the iOS artifacts sections you should get it

@lissyx
Collaborator

lissyx commented Jul 17, 2020

@erksch
Contributor

erksch commented Jul 17, 2020

Thank you! I'll try it out!

@erksch
Contributor

erksch commented Jul 17, 2020

I tried it out with the current 0.7.4 models from the release page and one of the audio files from there.

TensorFlow: v2.2.0-17-g0854bb5188
DeepSpeech: v0.9.0-alpha.2-34-gdd20d35c
2020-07-17 09:48:31.981587-0700 deepspeech_ios_test[9411:91040] Initialized TensorFlow Lite runtime.
/private/var/containers/Bundle/Application/F5F8492E-D9B8-4BC4-AF46-29CEB23FC3A6/deepspeech_ios_test.app/4507-16021-0012.wav
read 8192 samples
(lldb)

And then comes this error

Thread 5: EXC_BAD_ACCESS (code=1, address=0x177000075)

in this line of DeepSpeech.swift

    public func feedAudioContent(buffer: UnsafeBufferPointer<Int16>) {
        precondition(streamCtx != nil, "calling method on invalidated Stream")

        DS_FeedAudioContent(streamCtx, buffer.baseAddress, UInt32(buffer.count)) // <<<<< Thread 5: EXC_BAD_ACCESS (code=1, address=0x177000075)
    }

Just leaving this here, but I don't want to bother too much before this is even called finished :D

Update: Ah, sorry, I saw the DeepSpeech version and I guess the models are not compatible.

@reuben
Contributor Author

reuben commented Jul 19, 2020

The models should be compatible. I don't know what's going on there, can't reproduce it locally...

@reuben
Contributor Author

reuben commented Jul 19, 2020

Does the log have any more details for this error?

@reuben
Contributor Author

reuben commented Jul 19, 2020

Also, maybe double check the signing options in Xcode? At some point when writing the bindings I ran into some runtime exceptions due to incorrect signing options.

@erksch
Contributor

erksch commented Jul 19, 2020

You're right. Since it crashes when communicating with the library, the library is probably just included incorrectly.

So to go through what I tried:

  • Cloning the repo and checking out the ios-build branch
  • In Xcode, set signing for the targets deepspeech_ios and deepspeech_ios_test to my team and adjust the bundle identifier to one of mine
  • Trying to run deepspeech_ios_test on my device
    -> Build failed with
clang: error: no such file or directory: '[...]/DeepSpeech/native_client/swift/libdeepspeech.so'
Command Ld failed with a nonzero exit code
  • So downloading the ARM library from the link that you provided and moving it to the given destination
  • Trying to run again
    -> Build succeeded but runtime error
dyld: Library not loaded: @rpath/deepspeech_ios.framework/deepspeech_ios
  Referenced from: /private/var/containers/Bundle/Application/E9B900F3-5F4C-466D-BB03-E97F5588A768/deepspeech_ios_test.app/deepspeech_ios_test
  Reason: image not found
(lldb) bt 
* thread #1, stop reason = signal SIGABRT

The error in the post before is after trying some things in the "Frameworks, Libraries and Embedded Content Section".
After doing a fresh start and just adding the library like explained above I get the following configs:

  • deepspeech_ios_test target

(Screenshot, 2020-07-19 at 21:10:15)

  • deepspeech_ios target

(Screenshot, 2020-07-19 at 21:11:03)

What do you have there? Maybe some things have to be switched to embed & sign?

@erksch
Contributor

erksch commented Jul 19, 2020

After setting deepspeech_ios.framework to Embed & Sign in the deepspeech_ios_test target (which is just a random tryout), the code at least passes until the DS_FeedAudioContent, and the error occurs that I mentioned in the first post.

Here is the full log I got for that error.

* thread #2, queue = 'com.apple.avfoundation.avasset.completionsQueue', stop reason = EXC_BAD_ACCESS (code=2, address=0x130800076)
    frame #0: 0x0000000103891494 libdeepspeech.so`___lldb_unnamed_symbol5$$libdeepspeech.so + 360
  * frame #1: 0x000000010385df68 deepspeech_ios`DeepSpeechStream.feedAudioContent(buffer=(_position = 0x000000016502d200, count = 8192), self=(streamCtx = 0x0000000162f32f40)) at DeepSpeech.swift:181:9
    frame #2: 0x00000001028b20b0 deepspeech_ios_test`closure #1 in render(samples=Swift.UnsafeRawBufferPointer @ 0x000000016d5e0100, stream=(streamCtx = 0x0000000162f32f40)) at AppDelegate.swift:126:20
    frame #3: 0x00000001028b2138 deepspeech_ios_test`thunk for @callee_guaranteed (@unowned UnsafeRawBufferPointer) -> (@error @owned Error) at <compiler-generated>:0
    frame #4: 0x00000001028b2198 deepspeech_ios_test`partial apply for thunk for @callee_guaranteed (@unowned UnsafeRawBufferPointer) -> (@error @owned Error) at <compiler-generated>:0
    frame #5: 0x00000001b810c348 libswiftFoundation.dylib`Foundation.Data.withUnsafeBytes<A>((Swift.UnsafeRawBufferPointer) throws -> A) throws -> A + 504
    frame #6: 0x00000001028b097c deepspeech_ios_test`render(audioContext=0x000000016760bd10, stream=(streamCtx = 0x0000000162f32f40)) at AppDelegate.swift:124:22
    frame #7: 0x00000001028b30c8 deepspeech_ios_test`closure #1 in test(audioContext=0x000000016760bd10, stream=(streamCtx = 0x0000000162f32f40), audioPath="/private/var/containers/Bundle/Application/D6D001A2-07F7-4BD3-80E9-9DBECCA975E8/deepspeech_ios_test.app/4507-16021-0012.wav", start=2020-07-19 21:19:15 CEST, completion=0x00000001028b83c8 deepspeech_ios_test`partial apply forwarder for closure #1 () -> () in closure #1 () -> () in deepspeech_ios_test.AppDelegate.application(_: __C.UIApplication, didFinishLaunchingWithOptions: Swift.Optional<Swift.Dictionary<__C.UIApplicationLaunchOptionsKey, Any>>) -> Swift.Bool at <compiler-generated>) at AppDelegate.swift:174:9
    frame #8: 0x00000001028acfc0 deepspeech_ios_test`closure #1 in static AudioContext.load(asset=0x000000016365fd40, assetTrack=0x00000001637212f0, audioURL=Foundation.URL @ 0x0000000163674090, completionHandler=0x00000001028b375c deepspeech_ios_test`partial apply forwarder for closure #1 (Swift.Optional<deepspeech_ios_test.AudioContext>) -> () in deepspeech_ios_test.test(model: deepspeech_ios.DeepSpeechModel, audioPath: Swift.String, completion: () -> ()) -> () at <compiler-generated>) at AppDelegate.swift:59:17
    frame #9: 0x00000001028ad9b0 deepspeech_ios_test`thunk for @escaping @callee_guaranteed () -> () at <compiler-generated>:0
    frame #10: 0x0000000102c0fefc libclang_rt.asan_ios_dynamic.dylib`__wrap_dispatch_async_block_invoke + 196
    frame #11: 0x0000000104ec605c libdispatch.dylib`_dispatch_call_block_and_release + 32
    frame #12: 0x0000000104ec74d8 libdispatch.dylib`_dispatch_client_callout + 20
    frame #13: 0x0000000104ecec20 libdispatch.dylib`_dispatch_lane_serial_drain + 720
    frame #14: 0x0000000104ecf834 libdispatch.dylib`_dispatch_lane_invoke + 440
    frame #15: 0x0000000104edb270 libdispatch.dylib`_dispatch_workloop_worker_thread + 1344
    frame #16: 0x00000001816a7718 libsystem_pthread.dylib`_pthread_wqthread + 276

@dabinat
Collaborator

dabinat commented Jul 20, 2020

I tried a simple test app where I loaded a pre-converted file of a few seconds into memory and called DeepSpeechModel.speechToText. There are no crashes or anything, but the resulting text string is empty.

It seemed from the header like all I had to do was initialize a DeepSpeechModel with the .tflite file and then call speechToText with the buffer. Did I miss a step? Do I need to setup a streaming context even if I’m not streaming?

@reuben
Contributor Author

reuben commented Jul 20, 2020

> It seemed from the header like all I had to do was initialize a DeepSpeechModel with the .tflite file and then call speechToText with the buffer. Did I miss a step? Do I need to setup a streaming context even if I’m not streaming?

That's correct, you don't need to setup a streaming context.

@zaptrem
Contributor

zaptrem commented Dec 26, 2020

@reuben Thanks, that solved the Android issue! It also allowed me to compare memory usage between iOS and Android. On the D-Day speech, Android climbs to ~200mb usage almost immediately and stays there, whereas iOS climbs more slowly to a similar number (or crashes on longer speeches). It seems like you were right to assume there was no memory leak, though it's weird to me that memory use is >10x the size of the input file.

However, on longer files, the crash on iOS is still present. Memory usage on Android balloons to >500MB on the Reagan RNC speech and it runs for a long while (I'm virtualizing Android on a beefy PC), so I'll update if it finishes.

Additionally, (this might be related to the -bitexact issue), on iOS on the shorter speeches there are long periods of sparse (one word for a minute of speech) transcription in between chunks of near-perfect transcription. The audio quality doesn't noticeably decrease in those areas when I listen.

Finally, in the process of updating react-native-transcription to the latest DeepSpeech version I noticed that Android DeepSpeech 0.9.3 hadn't been published to Maven yet, only 0.9.2.

@zaptrem
Contributor

zaptrem commented Dec 27, 2020

@reuben Another interesting bug I discovered while trying to find clues: the iOS example happily transcribed non-16khz wav files while Android example refused them. Also weird is Android and iOS give different transcriptions for the same noisy input despite using the exact same pretrained model and audio files.

@fender

fender commented Dec 27, 2020

@zaptrem @CatalinVoss What's the installation process for testing on iOS? There isn't a cocoapod yet is there?

@zaptrem
Contributor

zaptrem commented Dec 27, 2020

@fender You've gotta compile the framework yourself using these instructions. You can use the Swift example in the repo.

@lissyx
Collaborator

lissyx commented Dec 27, 2020

> @reuben Another interesting bug I discovered while trying to find clues: the iOS example happily transcribed non-16khz wav files while Android example refused them. Also weird is Android and iOS give different transcriptions for the same noisy input despite using the exact same pretrained model and audio files.

It's not surprising and not a bug, @reuben already explained it: the Android example's WAV reading is hardcoded to reduce the dependencies required by CI, so it only handles 16 kHz and it's not robust to broken WAV files.
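
To illustrate what a hardcoded reader of this kind typically relies on, here is a minimal sketch that reads the format fields from a canonical 44-byte PCM WAV header and checks them before any audio is fed to a model. It is not the Android example's code; the type and function names are hypothetical, and the target format (mono, 16 kHz, 16-bit) is an assumption for illustration.

```swift
import Foundation

struct WavFormat: Equatable {
    let channels: Int
    let sampleRate: Int
    let bitsPerSample: Int
}

/// Reads the format fields from a canonical 44-byte PCM WAV header.
/// Returns nil when the data is too short or is not RIFF/WAVE.
/// Assumption: the "fmt " chunk directly follows the RIFF header, which
/// is common but not guaranteed, so files with extra chunks are rejected.
func parseCanonicalWavHeader(_ bytes: [UInt8]) -> WavFormat? {
    guard bytes.count >= 44,
          Array(bytes[0..<4]) == Array("RIFF".utf8),
          Array(bytes[8..<12]) == Array("WAVE".utf8),
          Array(bytes[12..<16]) == Array("fmt ".utf8) else { return nil }

    // WAV stores integers in little-endian byte order.
    func le16(_ offset: Int) -> Int {
        return Int(bytes[offset]) | (Int(bytes[offset + 1]) << 8)
    }
    func le32(_ offset: Int) -> Int {
        return le16(offset) | (le16(offset + 2) << 16)
    }
    return WavFormat(channels: le16(22), sampleRate: le32(24), bitsPerSample: le16(34))
}

/// Assumed target format for this sketch: mono, 16 kHz, 16-bit PCM.
func isModelReady(_ format: WavFormat) -> Bool {
    return format == WavFormat(channels: 1, sampleRate: 16_000, bitsPerSample: 16)
}
```

A reader that skips a check like this will happily pass 44.1 kHz or 48 kHz samples through, which would be consistent with one platform transcribing a file that another refuses.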

@zaptrem
Contributor

zaptrem commented Jan 12, 2021

@reuben Trying again in case I sent it too early in the morning last week.

@CatalinVoss Are you able to reproduce this with your binary?

@mozilla mozilla deleted a comment from zaptrem Jan 12, 2021
@mozilla mozilla deleted a comment from zaptrem Jan 12, 2021
@CatalinVoss
Collaborator

Can you get a backtrace from inside deepspeech by typing bt on the lldb console? I do see one sporadic issue that crashes within tensorflow in my application, but it's fairly rare.

@zaptrem
Contributor

zaptrem commented Jan 13, 2021

@CatalinVoss Here you go. I had to use a text file since I exceeded pastebin's size limit: deepSpeechCrashST.txt

@CatalinVoss
Collaborator

Ah. The relevant part is at the bottom

deepspeech_ios`PathTrie::iterate_to_vec(std::__1::vector<PathTrie*, std::__1::allocator<PathTrie*> >&) + 64
    frame #4742: 0x000000010fee1114 deepspeech_ios`DecoderState::next(double const*, int, int) + 404
    frame #4743: 0x000000010fd4b470 deepspeech_ios`StreamingState::processBatch(std::__1::vector<float, std::__1::allocator<float> > const&, unsigned int) + 296
    frame #4744: 0x000000010fd4b300 deepspeech_ios`StreamingState::pushMfccBuffer(std::__1::vector<float, std::__1::allocator<float> > const&) + 236
    frame #4745: 0x000000010fd4ae88 deepspeech_ios`StreamingState::feedAudioContent(short const*, unsigned int) + 396
  * frame #4746: 0x0000000105f1c808 ReLearn`closure #1 in Transcription.render(samples=Swift.UnsafeRawBufferPointer @ 0x000000016b295c30, stream=(streamCtx = 0x000000011031e420)) at Transcription.swift:85:24
    frame #4747: 0x0000000105f1c830 ReLearn`thunk for @callee_guaranteed (@unowned UnsafeRawBufferPointer) -> (@error @owned Error) at <compiler-generated>:0
    frame #4748: 0x0000000105f1e3d0 ReLearn`partial apply for thunk for @callee_guaranteed (@unowned UnsafeRawBufferPointer) -> (@error @owned Error) at <compiler-generated>:0
    frame #4749: 0x00000001f031ca8c libswiftFoundation.dylib`Foundation.Data.withUnsafeBytes<A>((Swift.UnsafeRawBufferPointer) throws -> A) throws -> A + 504
    frame #4750: 0x0000000105f1bf68 ReLearn`Transcription.render(audioContext=0x0000000280435b30, stream=(streamCtx = 0x000000011031e420), self=0x00000002805e2130) at Transcription.swift:83:26
    frame #4751: 0x0000000105f1ceb8 ReLearn`closure #1 in Transcription.recognizeFile(audioContext=0x0000000280435b30, self=0x00000002805e2130, stream=(streamCtx = 0x000000011031e420), audioPath=Swift.String @ 0x000000016b2964a8, start=2021-01-13 01:27:08 EST) at Transcription.swift:107:18
    frame #4752: 0x0000000105f1a23c ReLearn`closure #1 in static AudioContext.load(asset=0x0000000280c2a500, assetTrack=0x00000002808f4420, audioURL=<unavailable; try printing with "vo" or "po">, completionHandler=0x0000000105f1e48c ReLearn`partial apply forwarder for closure #1 (Swift.Optional<react_native_transcription.AudioContext>) -> () in react_native_transcription.Transcription.(recognizeFile in _79B1BC2F893AB0135086535C16DBA135)(audioPath: Swift.String) -> () at <compiler-generated>) at AudioContext.swift:58:17
    frame #4753: 0x0000000105ef6490 ReLearn`thunk for @escaping @callee_guaranteed () -> () at <compiler-generated>:0
    frame #4754: 0x00000001101c9d10 libdispatch.dylib`_dispatch_call_block_and_release + 32
    frame #4755: 0x00000001101cb18c libdispatch.dylib`_dispatch_client_callout + 20
    frame #4756: 0x00000001101d2968 libdispatch.dylib`_dispatch_lane_serial_drain + 724
    frame #4757: 0x00000001101d3580 libdispatch.dylib`_dispatch_lane_invoke + 440
    frame #4758: 0x00000001101df0f0 libdispatch.dylib`_dispatch_workloop_worker_thread + 1344
    frame #4759: 0x00000001b96f3714 libsystem_pthread.dylib`_pthread_wqthread + 276
(lldb) 

This looks like a crash in the beam search decoder. I'm not actually using the DeepSpeech decoder -- have my own -- so unfortunately I can't help :( it appears to be unrelated to the bug I saw

@zaptrem
Contributor

zaptrem commented Jan 13, 2021

@CatalinVoss Thanks for pointing that out! I'm curious; what about your use case required building an alternate decoder? Is it open source?

@reuben What do you make of this error? Or is there a different person who worked on the beam decoder I should ping? Is this something I should move to its own, non-platform-specific bug report?

@CatalinVoss
Collaborator

> @CatalinVoss Thanks for pointing that out! I'm curious; what about your use case required building an alternate decoder? Is it open source?

Working on child literacy. Closed source I'm afraid

@zaptrem
Contributor

zaptrem commented Jan 13, 2021

@CatalinVoss Cool stuff! I can see why it's necessary in that case, detecting specific mispronunciations is more than hotword-boosting can do. Coincidentally, I stumbled across this Google feature that might interest you about 10 minutes ago in a discussion with someone.

I ran the test again on an iPad Pro late 2018 (before it was an iPhone 11 Pro Max) and got a slightly different result. Does this still indicate Decoder issues?

frame #4739: 0x000000010bb6bcf8 deepspeech_ios`PathTrie::iterate_to_vec(std::__1::vector<PathTrie*, std::__1::allocator<PathTrie*> >&) + 64
    frame #4740: 0x000000010bb6bcf8 deepspeech_ios`PathTrie::iterate_to_vec(std::__1::vector<PathTrie*, std::__1::allocator<PathTrie*> >&) + 64
    frame #4741: 0x000000010ba99114 deepspeech_ios`DecoderState::next(double const*, int, int) + 404
    frame #4742: 0x000000010b903470 deepspeech_ios`StreamingState::processBatch(std::__1::vector<float, std::__1::allocator<float> > const&, unsigned int) + 296
    frame #4743: 0x000000010b903300 deepspeech_ios`StreamingState::pushMfccBuffer(std::__1::vector<float, std::__1::allocator<float> > const&) + 236
    frame #4744: 0x000000010b902e88 deepspeech_ios`StreamingState::feedAudioContent(short const*, unsigned int) + 396
  * frame #4745: 0x0000000101a80808 ReLearn`closure #1 in Transcription.render(samples=Swift.UnsafeRawBufferPointer @ 0x000000017032dc20, stream=(streamCtx = 0x000000010bfd2670)) at Transcription.swift:85:24
    frame #4746: 0x0000000101a80830 ReLearn`thunk for @callee_guaranteed (@unowned UnsafeRawBufferPointer) -> (@error @owned Error) at <compiler-generated>:0
    frame #4747: 0x0000000101a823d0 ReLearn`partial apply for thunk for @callee_guaranteed (@unowned UnsafeRawBufferPointer) -> (@error @owned Error) at <compiler-generated>:0
    frame #4748: 0x00000001942f5174 libswiftFoundation.dylib`Foundation.Data.withUnsafeBytes<A>((Swift.UnsafeRawBufferPointer) throws -> A) throws -> A + 392
    frame #4749: 0x0000000101a7ff68 ReLearn`Transcription.render(audioContext=0x0000000282e94930, stream=(streamCtx = 0x000000010bfd2670), self=0x0000000282f28cc0) at Transcription.swift:83:26
    frame #4750: 0x0000000101a80eb8 ReLearn`closure #1 in Transcription.recognizeFile(audioContext=0x0000000282e94930, self=0x0000000282f28cc0, stream=(streamCtx = 0x000000010bfd2670), audioPath=Swift.String @ 0x000000017032e458, start=2021-01-13 02:00:54 EST) at Transcription.swift:107:18
    frame #4751: 0x0000000101a7e23c ReLearn`closure #1 in static AudioContext.load(asset=0x000000028278c8a0, assetTrack=0x00000002822ea6d0, audioURL=<unavailable; try printing with "vo" or "po">, completionHandler=0x0000000101a8248c ReLearn`partial apply forwarder for closure #1 (Swift.Optional<react_native_transcription.AudioContext>) -> () in react_native_transcription.Transcription.(recognizeFile in _79B1BC2F893AB0135086535C16DBA135)(audioPath: Swift.String) -> () at <compiler-generated>) at AudioContext.swift:58:17
    frame #4752: 0x0000000101a5a490 ReLearn`thunk for @escaping @callee_guaranteed () -> () at <compiler-generated>:0
    frame #4753: 0x000000010bd83bcc libdispatch.dylib`_dispatch_call_block_and_release + 32
    frame #4754: 0x000000010bd856c0 libdispatch.dylib`_dispatch_client_callout + 20
    frame #4755: 0x000000010bd8d354 libdispatch.dylib`_dispatch_lane_serial_drain + 736
    frame #4756: 0x000000010bd8e0c0 libdispatch.dylib`_dispatch_lane_invoke + 448
    frame #4757: 0x000000010bd9a644 libdispatch.dylib`_dispatch_workloop_worker_thread + 1520
    frame #4758: 0x00000001db297804 libsystem_pthread.dylib`_pthread_wqthread + 276
(lldb)

@CatalinVoss
Collaborator

CatalinVoss commented Jan 13, 2021 via email

@zaptrem
Contributor

zaptrem commented Jan 22, 2021

Yes, I'm already doing that because the Metadata and Token arrays were (erroneously?) marked as private(set) instead of public private(set).

PRs are welcome.

@reuben It's a bit late, but I just proposed those quick changes here: #3510

I also think there's some more debug info I can provide about the decoder issue, I'll test it tomorrow.
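
For anyone following the access-control point above, a minimal sketch of the difference between `private(set)` and `public private(set)` in a framework. The type and property names are hypothetical, not the wrapper's actual declarations.

```swift
public struct TokenList {
    // With no explicit getter level the getter defaults to `internal`,
    // so code that imports the framework cannot even read this property.
    private(set) var hiddenTokens: [String]

    // Readable by code that imports the framework; writable only
    // from inside the type itself.
    public private(set) var tokens: [String]

    public init(tokens: [String]) {
        self.hiddenTokens = tokens
        self.tokens = tokens
    }

    public mutating func append(_ token: String) {
        tokens.append(token)
        hiddenTokens.append(token)
    }
}

var list = TokenList(tokens: ["hello"])
list.append("world")
```

Inside a single module both properties are readable, so the difference only shows up once the type is consumed from another module, such as an app importing the framework.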

@zaptrem
Contributor

zaptrem commented Jan 23, 2021

@reuben Okay, poking at Xcode's debugging options/info is like trying to reverse engineer an alien spaceship at my current knowledge level, so I haven't learned much from a few hours trying that.

However, I did get some interesting results from messing around with the scorer/models. Disabling the external scorer allowed the transcription process to run for much longer before eventually crashing (similar to the length it took my Android build to complete the same transcription task) with the same error. Additionally, using the Chinese model and scorer lasted the same extended amount of time as disabling the external scorer. Also, sometimes across all tests the "reading samples" message varied exactly once near the beginning of the job like so:

...
reading 8192 samples
reading 7425 (or some other nearby number) samples
reading 8192 samples
...

With no scorer or the Chinese scorer, there is a drop/spike in thread activity right before the crash. This isn't present with the English model.

Is the transcription job being run on the AVAudioSession Notify Thread instead of another thread when using the English model? Alternatives have constant utilization on Thread 4 with a spike on the AV Notify Thread before the crash, whereas English has constant utilization on the AV Notify Thread.

Also, the English model resulted in much higher CPU usage during transcription than Chinese/no scorer.

English runtime: 1 minute
None/Chinese runtime: 5 minutes

Unlikely theory: Maybe it's an ARM-specific issue? I've been testing Android with an x86 emulator and have no way to verify this one way or another (my ARM emulator isn't working).

@zaptrem
Contributor

zaptrem commented Jan 25, 2021

@CatalinVoss Any idea why it would still be failing when the external scorer is disabled?

@zaptrem
Contributor

zaptrem commented Feb 1, 2021

@reuben Does any of this help? Is there potentially an FFMPEG setting that is causing this, since it doesn't happen with non-converted WAV files? It's been over a month since your last response, so is there any more experimentation/testing I can do to help this along?

@zaptrem
Contributor

zaptrem commented Feb 6, 2021

@reuben I tried converting from MKV files encoded with vorbis instead of the .ogg I was using before. Didn't work. I tried converting with Audacity manually. Didn't work.

I wonder if it has nothing to do with the conversion and it's actually just the length of the recording. Do you know of any long conversations recorded in PCM 16 natively we can test with?

I'm led to this conclusion not just from my testing, but because @CatalinVoss suggested it's a decoder issue (implying to me the audio file had already been successfully read/inference had completed).

EDIT: Recording the WAV natively with Audacity results in no crash. Exporting the exact same recording as OGG and converting with FFMPEG crashes.

@zaptrem
Contributor

zaptrem commented Feb 9, 2021

@CatalinVoss (and @reuben if you're still here) I continued testing and I think the issue might(????) have something to do with a bug in render() caused by however FFMPEG/SoX converts 48000 Hz to 16000 Hz.

I rewrote the render function in iOS to be much simpler and (I would think) more robust due to the utilization of higher-level Apple APIs that are file-format independent. The crash disappeared (hooray!) but I'm now getting gibberish short transcriptions (whereas the same files on Android give me acceptable results). Any idea where I went wrong here?:

private func newRender(url: URL, stream: DeepSpeechStream) {
    let file = try! AVAudioFile(forReading: url)
    guard let format = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: file.fileFormat.sampleRate, channels: 1, interleaved: false) else { fatalError("Couldn't read the audio file") }
    print("reading file")
    var done = false;
    while(!done){
        let buf = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: 8096)!
        do {
            try file.read(into: buf)
        } catch {
            print("end of file");
            done = true
            return;
        }
        print("read 8096 frames")
        let floatBufPointer = UnsafeBufferPointer(start: buf.int16ChannelData?[0], count: Int(buf.frameLength))
        stream.feedAudioContent(buffer: floatBufPointer);
    }
}

@CatalinVoss
Collaborator

Sorry for the delay.

I never used the audio file recognition path that goes through render() so an error there could explain why you're facing issues and I'm not. I'm just capturing mic output.

As a hacky workaround: are you just calling finishStream() on the stream once at the very end? Is there any way you can split your monstrous audio file into chunks, passing a few chunks at a time and "finalizing" the stream a few times in between?

Another way to isolate the issue may be to see if you encounter the issue with long-running mic detection (which, again, I don't see), but I'm not doing hours and hours of it either.

As for your newRender(), it's hard to debug without playing with it. This kinda stuff is tricky to get right, but you only have to do it once. I am not sure why you're calling this guy a floatBufPointer, since the stream takes 16-bit int samples. With what you're doing now, it certainly looks like you need to be 100% sure that your input is single-channel PCM at 16 kHz.
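
One way to rule that out for a WAV file is to read its header before feeding anything to the stream. Below is a rough sketch of such a check, not something taken from the DeepSpeech bindings: `WavFormat`, `parseWavFormat` and `isDeepSpeechReady` are names made up for illustration, and it only handles plain RIFF/WAVE files.

```swift
import Foundation

/// The fields of a WAV "fmt " chunk that matter for this check.
/// (Hypothetical helper type, not part of DeepSpeech.)
struct WavFormat: Equatable {
    let audioFormat: UInt16   // 1 means integer PCM
    let channels: UInt16
    let sampleRate: UInt32
    let bitsPerSample: UInt16

    /// True for 16-bit PCM, mono, 16 kHz: the input this thread assumes the stream wants.
    var isDeepSpeechReady: Bool {
        audioFormat == 1 && channels == 1 && sampleRate == 16_000 && bitsPerSample == 16
    }
}

/// Walks the RIFF chunk list until it finds "fmt ", rather than assuming
/// fixed byte offsets. Returns nil if the data is not a RIFF/WAVE file
/// or has no usable "fmt " chunk.
func parseWavFormat(_ data: Data) -> WavFormat? {
    let bytes = [UInt8](data)
    // Little-endian readers; callers guarantee the indices are in range.
    func u16(_ i: Int) -> UInt16 { UInt16(bytes[i]) | UInt16(bytes[i + 1]) << 8 }
    func u32(_ i: Int) -> UInt32 { UInt32(u16(i)) | UInt32(u16(i + 2)) << 16 }
    func tag(_ i: Int) -> String { String(decoding: bytes[i..<i + 4], as: UTF8.self) }

    guard bytes.count >= 12, tag(0) == "RIFF", tag(8) == "WAVE" else { return nil }

    var offset = 12
    while offset + 8 <= bytes.count {
        let size = Int(u32(offset + 4))
        if tag(offset) == "fmt " {
            let body = offset + 8
            guard size >= 16, body + 16 <= bytes.count else { return nil }
            return WavFormat(audioFormat: u16(body),
                             channels: u16(body + 2),
                             sampleRate: u32(body + 4),
                             bitsPerSample: u16(body + 14))
        }
        // Skip this chunk; chunk bodies are padded to an even length.
        offset += 8 + size + (size % 2)
    }
    return nil
}
```

Calling `parseWavFormat(try Data(contentsOf: url))` on the converted file and on the natively recorded one, then comparing the two results, should show whether they really have the same layout. Also note that newRender() builds its AVAudioFormat from file.fileFormat.sampleRate, so as far as I can tell a 48 kHz source would stay at 48 kHz unless it is resampled first.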

@zaptrem
Contributor

zaptrem commented Feb 9, 2021

@CatalinVoss Thanks for getting back! I could split it into chunks, but it would likely mess with transcriptions mid-sentence and paragraph, as well as reset the time part of tokens (I'm using intermediateDecodeWithMetadata as finishStreamWithMetadata also has an unknown crash... one problem at a time).

It's called floatBufPointer because originally I was passing floats instead of 16bit ints and I forgot to change it. As you can see here:

 guard let format = AVAudioFormat(commonFormat: .pcmFormatInt16, sampleRate: file.fileFormat.sampleRate, channels: 1, interleaved: false) else { fatalError("Couldn't read the audio file") }

...

let floatBufPointer = UnsafeBufferPointer(start: buf.int16ChannelData?[0], count: Int(buf.frameLength))

I am passing the int16ChannelData. type(of:) tells me this is a pointer to an Int16 array. I can also wrap it in a Swift Array(), and then type(of:) reports an array of Int16. Weirdly enough, when I pass the Swift Array() I get a different (but still gibberish) result.

Would it be helpful if I opened a pull request with these changes so they're easy for others to look at?

@zaptrem
Contributor

zaptrem commented Feb 11, 2021

@reuben Any idea what I’m doing wrong in the proposed Swift code above?

@zaptrem
Contributor

zaptrem commented Feb 13, 2021

@reuben Can you respond with an estimate on when you can look at this? It's been over a month and a half. I've given it my best shot but have run out of ideas for now.

@zaptrem
Contributor

zaptrem commented Feb 26, 2021

@reuben @lissyx I gave up on fixing this and just took the accuracy hit from splitting the audio file into 5-minute segments. I've published the app using this, ReLearn, to the Android/iOS App Stores. It uses DeepSpeech to transcribe long video recordings of lectures for free on-device in the background (it also transcribes audio in-person recordings on Android, but uses Apple's solution for that on iOS for now).
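
For anyone taking the same route, the segmentation boils down to something like the following. This is a minimal sketch rather than the code shipped in ReLearn: `Segment` and `planSegments` are illustrative names, and it assumes the audio is already available as 16 kHz mono samples. Each segment carries a start offset in seconds so that token times, which restart from zero with every new stream, can be shifted back onto the original timeline.

```swift
/// One slice of a long recording, expressed in samples, plus the time offset
/// to add to any token timestamps produced while transcribing that slice.
/// (Hypothetical helper type, not part of DeepSpeech.)
struct Segment: Equatable {
    let sampleRange: Range<Int>
    let startSeconds: Double
}

/// Splits `totalSamples` into consecutive segments of at most
/// `segmentSeconds` each. The last segment may be shorter.
func planSegments(totalSamples: Int,
                  sampleRate: Int = 16_000,
                  segmentSeconds: Int = 300) -> [Segment] {
    precondition(sampleRate > 0 && segmentSeconds > 0, "rate and length must be positive")
    let samplesPerSegment = sampleRate * segmentSeconds
    var segments: [Segment] = []
    var start = 0
    while start < totalSamples {
        let end = min(start + samplesPerSegment, totalSamples)
        segments.append(Segment(sampleRange: start..<end,
                                startSeconds: Double(start) / Double(sampleRate)))
        start = end
    }
    return segments
}
```

Each segment then gets its own stream: feed the samples in `sampleRange`, finish the stream, and add `startSeconds` to every token time before merging the results. Cutting at fixed boundaries can still split a word in half, which is where the accuracy hit comes from.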

@martin642

Fantastic job

@lustig-bakkt

> @reuben @lissyx I gave up on fixing this and just took the accuracy hit from splitting the audio file into 5-minute segments. I've published the app using this, ReLearn, to the Android/iOS App Stores. It uses DeepSpeech to transcribe long video recordings of lectures for free on-device in the background (it also transcribes audio in-person recordings on Android, but uses Apple's solution for that on iOS for now).

@zaptrem I've been wanting to build something very similar to this but just ran up against Apple's one minute live transcription limit with SFSpeechRecognizer. How are you getting around this in ReLearn? Any chance you'd like to collaborate on some live speech / mind mapping type applications?
