Android build with Cross Compilation for BLAS and OpenCL back-ends #848

Closed
wants to merge 21 commits into from


lealgo
Contributor

@lealgo lealgo commented May 10, 2019

Provides the files and tweaks to meson that enable cross-compilation to Android, and also to other platforms like ARM Linux. This PR depends on the work done by @borg323 on the protobuf and opencl subprojects, which allows the build to be performed from source instead of using the build machine's shared libraries.

You can test this PR as it is right now, but the file subprojects/protobuf-3.5.1/meson.build must first be replaced with this one: meson.build. The problem with cross-compiling protobuf is explained in more detail here: #848 (comment)


Note: This PR started as a method for compiling lc0 on Termux, but then it evolved into cross-compilation. I kept the previous comments for reference. If you want instructions for Termux, follow the wiki.

@lealgo
Contributor Author

lealgo commented May 11, 2019

I'm starting to look at this PR as a proof of concept. I realize the real solution here is to use static versions of the libraries and ultimately to create a statically-linked executable, possibly with a separate meson file for cross-compiling with NDK. That's the only way if we want to distribute binaries. Some links:

http://dev-smart.com/cross-compiling-with-android/
http://mesonbuild.com/Cross-compilation.html
https://clang.llvm.org/docs/CrossCompilation.html

OpenBLAS for Android

https://github.com/xianyi/OpenBLAS/wiki/How-to-build-OpenBLAS-for-Android

OpenCL for Mali & Adreno

https://arrayfire.com/getting-started-with-opencl-on-android/
https://github.com/ARM-software/ComputeLibrary
https://github.com/ARM-software/armnn/blob/branches/armnn_19_02/BuildGuideAndroidNDK.md
https://developer.qualcomm.com/software/adreno-gpu-sdk

@borg323 borg323 added the wip Work in progress label May 11, 2019
@lealgo
Contributor Author

lealgo commented May 14, 2019

Back to the real problem. I made some good progress today, achieving a successful cross-compilation with meson from my Linux workstation. For now the build was carried out without the back-ends. The final executable was created with its proper architecture.

The cross compilation file, cross-files/arm-linux-gnueabi.txt:

[host_machine]
system = 'linux'
cpu_family = 'arm'
cpu = 'armv7'
endian = 'little'

[binaries]
c = 'arm-linux-gnueabi-gcc'
cpp = 'arm-linux-gnueabi-g++'
ar = 'arm-linux-gnueabi-ar'
strip = 'arm-linux-gnueabi-strip'
ld = 'arm-linux-gnueabi-ld'
ranlib = 'arm-linux-gnueabi-ranlib'
as = 'arm-linux-gnueabi-as'
pkgconfig = 'arm-linux-gnueabi-pkg-config'
exe_wrapper = 'qemu-arm-static' 

The commands:

meson --cross-file cross-files/arm-linux-gnueabi.txt build/ -Dbuild_backends=false
cd build
ninja
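
A cross-file like the one above is regular enough to generate from the toolchain prefix. The following is a minimal Python sketch, not part of this PR; `render_cross_file` and its parameters are hypothetical names chosen for illustration, and the tool list simply mirrors the `[binaries]` section shown above.

```python
def render_cross_file(triple, system, cpu_family, cpu, exe_wrapper=None):
    """Render a Meson cross-file for a GNU-style toolchain prefix (hypothetical helper)."""
    # Meson key -> tool suffix, mirroring the [binaries] section in the comment above
    tools = {
        "c": "gcc", "cpp": "g++", "ar": "ar", "strip": "strip",
        "ld": "ld", "ranlib": "ranlib", "as": "as", "pkgconfig": "pkg-config",
    }
    lines = [
        "[host_machine]",
        f"system = '{system}'",
        f"cpu_family = '{cpu_family}'",
        f"cpu = '{cpu}'",
        "endian = 'little'",  # assumption: little-endian target, as in the original file
        "",
        "[binaries]",
    ]
    lines += [f"{key} = '{triple}-{tool}'" for key, tool in tools.items()]
    if exe_wrapper is not None:
        # Emulator used to run target binaries on the build machine
        lines.append(f"exe_wrapper = '{exe_wrapper}'")
    return "\n".join(lines) + "\n"


text = render_cross_file("arm-linux-gnueabi", "linux", "arm", "armv7",
                         exe_wrapper="qemu-arm-static")
print(text)
```

Writing `text` to cross-files/arm-linux-gnueabi.txt should give a file equivalent to the hand-written one, which can then be passed to `meson --cross-file` as in the commands above.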

@borg323 borg323 added the help wanted Extra attention is needed label May 14, 2019
@gonzalezjo

gonzalezjo commented May 14, 2019

If the Tensorflow backend was brought up to date, would that mean that we could use the NN accelerators on modern Android devices?

@mooskagh
Member

Mobile devices use tensorflow-lite, not tensorflow. I think its API is totally incompatible, but it should still be pretty easy to write a backend for tensorflow-lite.

@lealgo
Contributor Author

lealgo commented May 15, 2019

For now the GNU toolchain is able to build the project without the back-ends, but the resulting executable is for a generic Linux system (different linker, libs, etc.).

I'd prefer to use the NDK toolchain, as it produces binaries for the Android platform, so I'm trying really hard to make the cross-compilation work with it.

@borg323 was helping me fix a problem with exe_wrapper in the meson cross-file. Apparently the protobuf build could not continue without an exe_wrapper. It seemed strange to me. Then I stumbled upon this:

From https://mesonbuild.com/Cross-compilation.html

Mixing host and build targets

Sometimes you need to build a tool which is used to generate source files. These are then compiled for the actual target. For this you would want to build some targets with the system's native compiler. This requires only one extra keyword argument.

native_exe = executable('mygen', 'mygen.c', native : true)

You can then take native_exe and use it as part of a generator rule or anything else you might want.

Now the previous meson error seems clearer:

ERROR: Can not use target protoc as a generator because it is cross-built
and no exe wrapper is defined. You might want to set it to native instead.

Meson has no protoc to call, only its source. Now I just need to tell it to use the local protoc. Right?
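
The reasoning behind that error can be written down as a small decision rule. The Python sketch below is a simplified model of it, not Meson's actual implementation; `generator_can_run` and its arguments are hypothetical names used only for illustration.

```python
from typing import Optional


def generator_can_run(is_cross_build: bool, tool_is_native: bool,
                      exe_wrapper: Optional[str]) -> bool:
    """Simplified model: can a tool built in this project be used as a generator?"""
    if not is_cross_build:
        # Host and build machine are the same, so any built tool can be executed
        return True
    if tool_is_native:
        # The tool was compiled for the build machine (native : true)
        return True
    # A cross-built tool can only run through an emulator such as qemu
    return exe_wrapper is not None


# The situation in the error message: cross build, cross-built protoc, no wrapper
print(generator_can_run(True, False, None))
# Either remedy makes the generator usable
print(generator_can_run(True, True, None))
print(generator_can_run(True, False, "qemu-arm-static"))
```

Under this model, building protoc with the native compiler and supplying an exe wrapper are two independent ways out of the same error.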

@lealgo
Contributor Author

lealgo commented May 16, 2019

Cross compilation finally achieved with the NDK toolchain!

After a real struggle and some magic tricks from @borg323 we could make protobuf pass the build. Here's the output of the first cross-compiled binary running on my phone through adb:

HWHWI:/data/local/tmp/new $ ./lc0 benchmark -b random -w /sdcard/DroidFish/lib/11258-48x5-se.pb.gz                                                                                                          
       _
|   _ | |
|_ |_ |_| v0.22.0-dev built May 15 2019
Loading weights file from: /sdcard/DroidFish/lib/11258-48x5-se.pb.gz
Creating backend [random]...
Benchmark time 8ms, 18 nodes, 2250 nps, move e2e3
Benchmark time 13ms, 32 nodes, 2461 nps, move e2e3
Benchmark time 20ms, 56 nodes, 2800 nps, move e2e3
...
Benchmark time 7304ms, 72391 nodes, 9911 nps, move b2b3
Benchmark time 8014ms, 79777 nodes, 9954 nps, move b2b3
Benchmark time 9746ms, 96296 nodes, 9880 nps, move b2b3
bestmove b2b3
Benchmark final time 9.77028s calculating 9888.25 nodes per second.

The random back-end is blazing fast! :D
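
For comparing back-ends it helps to pull the numbers out of the benchmark output. Below is a minimal Python sketch that parses the progress lines; `parse_benchmark` is a hypothetical helper, and the sample lines are copied from the output above. In these lines the nps column matches nodes × 1000 / ms rounded down, which is an observation about this output rather than a statement about how lc0 computes it.

```python
import re

# Matches progress lines such as "Benchmark time 8ms, 18 nodes, 2250 nps, move e2e3"
LINE = re.compile(r"Benchmark time (\d+)ms, (\d+) nodes, (\d+) nps")


def parse_benchmark(log_text):
    """Return (elapsed_ms, nodes, nps) tuples, one per progress line."""
    return [tuple(int(g) for g in m.groups()) for m in LINE.finditer(log_text)]


sample = """\
Benchmark time 8ms, 18 nodes, 2250 nps, move e2e3
Benchmark time 13ms, 32 nodes, 2461 nps, move e2e3
Benchmark time 9746ms, 96296 nodes, 9880 nps, move b2b3
"""

rows = parse_benchmark(sample)
for ms, nodes, nps in rows:
    # Recompute nodes per second from the two raw counters, rounded down
    print(ms, nodes, nps, nodes * 1000 // ms)
```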

Here's the meson cross-file:

[host_machine]
system = 'android'
cpu_family = 'arm'
cpu = 'aarch64'
endian = 'little'

[properties]
needs_exe_wrapper = true
cpp_args = []
cpp_link_args = ['-llog', '-latomic']

[binaries]
c = 'aarch64-linux-android28-clang'
cpp = 'aarch64-linux-android28-clang++'
ar = 'aarch64-linux-android-ar'
strip = 'aarch64-linux-android-strip'
ld = 'aarch64-linux-android-ld'
ranlib = 'aarch64-linux-android-ranlib'
as = 'aarch64-linux-android-as'
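
Before running meson with a cross-file like this, it can save time to check that every tool named in `[binaries]` is actually on the PATH (the NDK toolchain's bin directory has to be added first). A minimal Python sketch, assuming the cross-file only uses simple quoted strings or lists as values; `read_binaries` and `missing_tools` are hypothetical helpers, and the sample is a shortened copy of the file above.

```python
import ast
import configparser
import shutil


def read_binaries(cross_file_text):
    """Return the [binaries] section as a dict of plain Python values."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cross_file_text)
    # Values are written as Python-style literals: 'name' or ['name', 'arg']
    return {key: ast.literal_eval(value) for key, value in parser["binaries"].items()}


def missing_tools(binaries, which=shutil.which):
    """List the [binaries] keys whose executable cannot be found."""
    missing = []
    for key, exe in binaries.items():
        program = exe[0] if isinstance(exe, list) else exe  # a list holds command + args
        if which(program) is None:
            missing.append(key)
    return sorted(missing)


CROSS_FILE = """\
[host_machine]
system = 'android'

[binaries]
c = 'aarch64-linux-android28-clang'
cpp = 'aarch64-linux-android28-clang++'
strip = 'aarch64-linux-android-strip'
"""

binaries = read_binaries(CROSS_FILE)
print(missing_tools(binaries))  # result depends on what is on PATH
```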

Now the proper back-ends need to be tackled. Expect a good fight!

@lealgo
Contributor Author

lealgo commented May 17, 2019

Guys,

@borg323 is making this look easy. Now his new branch, opencl_sub, has passed the cross-compilation on my android gnu toolchain. This means that apparently we've got the first back-end working on ARM! Look at this:

$ file lc0
lc0: ELF 32-bit LSB executable, ARM, EABI5 version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-, for GNU/Linux 3.2.0, BuildID[sha1]=28711f228713334199d6267c5726474c38f4a2f8, not stripped

$ qemu-arm-static lc0
       _
|   _ | |
|_ |_ |_| v0.22.0-dev built May 16 2019
uci
id name Lc0 v0.22.0-dev
id author The LCZero Authors.
option name WeightsFile type string default <autodiscover>
option name Backend type combo default opencl var opencl var check var random var roundrobin var multiplexing var demux
...

$ qemu-arm-static lc0 benchmark -w ../../../../weights/dkappe/32930-112x9-se.pb.gz 
       _
|   _ | |
|_ |_ |_| v0.22.0-dev built May 16 2019
Loading weights file from: ../../../../weights/dkappe/32930-112x9-se.pb.gz
Creating backend [opencl]...
OpenCL, maximum batch size set to 16.
Initializing OpenCL.
OpenCL: clGetPlatformIDs
terminate called after throwing an instance of 'cl::Error'
  what():  clGetPlatformIDs
qemu: uncaught target signal 6 (Aborted) - core dumped

Now I just gotta massage the NDK toolchain so that it follows suit...
This is great!

Some notes:

  • libzma linking is not needed and not asked for.
  • lpthread is not recognized, but it's not needed either in the project or in the subproject. So we'll need more conditions in the block "if host_machine.system()..."

@lealgo
Contributor Author

lealgo commented May 17, 2019

Lc0 cross-built for android with NDK, opencl loader statically linked. adb output:

$ file lc0
lc0: ELF shared object, 64-bit LSB arm64, dynamic (/system/bin/linker64), for Android 28, built by NDK r19c (5345600), not stripped

$ ./lc0 benchmark -w /sdcard/DroidFish/11258-48x5-se.pb.gz                                                                                                                        
       _
|   _ | |
|_ |_ |_| v0.22.0-dev built May 17 2019
Loading weights file from: /sdcard/DroidFish/11258-48x5-se.pb.gz
Creating backend [opencl]...
OpenCL, maximum batch size set to 16.
Initializing OpenCL.
OpenCL: clGetPlatformIDs
terminating with uncaught exception of type cl::Error: clGetPlatformIDs
Aborted

Now, on to investigate why opencl is not being detected...
Maybe a mismatch of CL_TARGET_OPENCL_VERSION?

https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_ICD_Installation.html#_android_sdkndk
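
Both failing runs stop at the same call, clGetPlatformIDs, so one way to narrow this down is to probe that call directly on the device. The Python sketch below does this with ctypes. It assumes Python is available on the device (for example under Termux) and that the vendor library can be loaded as libOpenCL.so, which is an assumption: the name and location vary by device. `count_opencl_platforms` is a hypothetical helper.

```python
import ctypes


def count_opencl_platforms(lib_name="libOpenCL.so"):
    """Return the number of OpenCL platforms, or None if the library cannot be loaded."""
    try:
        lib = ctypes.CDLL(lib_name)
    except OSError:
        return None  # no loadable OpenCL library under this name

    # cl_int clGetPlatformIDs(cl_uint num_entries, cl_platform_id *platforms,
    #                         cl_uint *num_platforms)
    lib.clGetPlatformIDs.restype = ctypes.c_int
    lib.clGetPlatformIDs.argtypes = [
        ctypes.c_uint, ctypes.c_void_p, ctypes.POINTER(ctypes.c_uint)]

    num = ctypes.c_uint(0)
    status = lib.clGetPlatformIDs(0, None, ctypes.byref(num))
    # Any non-zero status is treated here as "no usable platform"
    return num.value if status == 0 else 0


if __name__ == "__main__":
    print(count_opencl_platforms())
```

If the probe finds a platform while the statically linked loader does not, that would point at how the loader locates the vendor library rather than at the driver itself.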

@lealgo
Contributor Author

lealgo commented May 17, 2019

dynamically linked opencl on android:

$ ./lc0 benchmark -w /sdcard/DroidFish/11258-48x5-se.pb.gz                                                                                                                        
       _
|   _ | |
|_ |_ |_| v0.22.0-dev built May 17 2019
Loading weights file from: /sdcard/DroidFish/11258-48x5-se.pb.gz
Creating backend [opencl]...
OpenCL, maximum batch size set to 16.
Initializing OpenCL.
Detected 1 OpenCL platforms.
Platform version: OpenCL 2.0 v1.r14p0-00cet0.b33ebadea0b6ef5b967c7f21064b122d
Platform profile: FULL_PROFILE
Platform name:    ARM Platform
Platform vendor:  ARM
Device ID:      0
Device name:    Mali-G71
Device type:    GPU
Device vendor:  ARM
Device driver:  2.0
Device speed:   5 MHZ
Device cores:   8 CU
Device score:   120
Selected platform: ARM Platform
Selected device: Mali-G71
with OpenCL 2.0 capability.
Started OpenCL SGEMM tuner with batch size 16.
Will try 578 valid configurations.
(2/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=16 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 2741.6 us (6.9 GFLOPS)
(3/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=8 NDIMC=8 NWG=16 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 2644.4 us (7.1 GFLOPS)
(5/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 1655.1 us (11.4 GFLOPS)
(17/578) KWG=32 KWI=2 MDIMA=16 MDIMC=16 MWG=32 NDIMB=8 NDIMC=8 NWG=64 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 1415.2 us (13.3 GFLOPS)
(24/578) KWG=32 KWI=2 MDIMA=32 MDIMC=32 MWG=64 NDIMB=8 NDIMC=8 NWG=64 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1 1411.8 us (13.4 GFLOPS)
(69/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 1381.3 us (13.7 GFLOPS)
(77/578) KWG=32 KWI=2 MDIMA=16 MDIMC=16 MWG=64 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 1235.2 us (15.3 GFLOPS)
(90/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=16 NDIMC=16 NWG=64 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=1 1062.2 us (17.8 GFLOPS)
(115/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=1 892.4 us (21.1 GFLOPS)
(198/578) KWG=32 KWI=2 MDIMA=16 MDIMC=16 MWG=32 NDIMB=8 NDIMC=8 NWG=64 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=2 871.3 us (21.7 GFLOPS)
(223/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=2 732.8 us (25.8 GFLOPS)
(281/578) KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4 699.7 us (27.0 GFLOPS)
Wavefront/Warp size: 4

Max workgroup size: 384
Max workgroup dimensions: 384 384 384
Benchmark time 1038ms, 7 nodes, 6 nps, move d2d4
Benchmark time 1406ms, 16 nodes, 11 nps, move g2g3
Benchmark time 1896ms, 35 nodes, 18 nps, move g2g3
Benchmark time 2526ms, 56 nodes, 22 nps, move g2g3
Benchmark time 2685ms, 74 nodes, 27 nps, move g2g3
Benchmark time 3160ms, 98 nodes, 31 nps, move g2g3
Benchmark time 3688ms, 130 nodes, 35 nps, move g2g3
Benchmark time 4275ms, 170 nodes, 39 nps, move g2g3
Benchmark time 4427ms, 203 nodes, 45 nps, move c2c4
Benchmark time 4907ms, 240 nodes, 48 nps, move g1f3
Benchmark time 4916ms, 245 nodes, 49 nps, move c2c4
Benchmark time 5067ms, 269 nodes, 53 nps, move g1f3
Benchmark time 5784ms, 353 nodes, 61 nps, move g1f3
Benchmark time 6565ms, 446 nodes, 67 nps, move c2c4
Benchmark time 7041ms, 480 nodes, 68 nps, move c2c4
Benchmark time 7476ms, 533 nodes, 71 nps, move g1f3
Benchmark time 7521ms, 542 nodes, 72 nps, move c2c4
Benchmark time 7801ms, 575 nodes, 73 nps, move g1f3
Benchmark time 9224ms, 745 nodes, 80 nps, move c2c4
Benchmark time 9521ms, 797 nodes, 83 nps, move g1f3
Benchmark time 10000ms, 814 nodes, 81 nps, move g1f3
bestmove g1f3
Benchmark final time 10.7405s calculating 88.4505 nodes per second.

@lealgo
Contributor Author

lealgo commented May 17, 2019

Now we've got a beautiful 8 MB binary engine that should work with regular Android chess GUIs!

With OpenCL back-end.

This is bliss!!!

@lealgo
Contributor Author

lealgo commented May 19, 2019

Attached meson detection when cross compiling
meson detection.zip

@lealgo
Contributor Author

lealgo commented May 19, 2019

@borg323

Some comments about the most recent commit:

  • The build will fail unless you apply the following subprojects changes. Attached the modified meson.build files, not the diffs.

opencl meson.zip
protobuf meson.zip

  • I tested the changes to work for a regular non-cross compile build.

  • There is at least one change (maybe more) that is not ideal, but I pushed it just to be able to comment on it and seek advice.

@lealgo lealgo changed the title Android build with Termux for BLAS and OpenCL back-ends Android build with Cross Compilation for BLAS and OpenCL back-ends May 19, 2019
@@ -118,11 +118,6 @@ option('pext',
value: false,
description: 'Use the pext instruction')

option('android',
Contributor Author

This is no longer needed as one can detect android on meson cross build with host_machine.system() == 'android'

@@ -15,7 +15,7 @@ option('openblas_include',

 option('opencl_include',
   type: 'array',
-  value: ['/usr/include/'],
+  # value: ['/usr/include/'],
Contributor Author

@lealgo lealgo May 19, 2019

I had to comment out this path, as it seems to confuse the Android NDK: it mixes the headers and compilation fails. The GNU toolchain doesn't have that problem. Any idea how to solve this?

@baltazar388

@lealgo, could you make a short video showing how to install the weights? My phone is a Huawei P smart, but lc0 doesn't work on it. Please help me. Thanks for your time :) Regards

@flither

flither commented May 20, 2019

Not working on my side (Xiaomi Mi 6, Android 9, opencl 2.0, Snapdragon 835).
Engine "thinks" for 2 minutes, then terminates with "engine error" message.
Termux method works fine though.

@ankan-ban
Member

Works perfectly on my OnePlus 6 (Snapdragon 845). With a recent T40 network:

$ ls
lc0                       weights_run1_42372.pb.gz
$ ./lc0 benchmark
       _
|   _ | |
|_ |_ |_| v0.22.0-dev built May 20 2019
Found pb network file: ./weights_run1_42372.pb.gz
Creating backend [blas]...
BLAS, maximum batch size set to 256
BLAS vendor: OpenBlas.
OpenBlas [OpenBLAS 0.3.6 NO_LAPACK NO_LAPACKE NO_AFFINITY ARMV8 MAX_THREADS=8].
OpenBlas found 8 ARMV8 core(s).
OpenBLAS using 1 core(s) for this backend.
BLAS max batch size is 256.
Benchmark time 2313ms, 4 nodes, 1 nps, move e2e4
Benchmark time 5318ms, 8 nodes, 1 nps, move g1f3
Benchmark time 8763ms, 10 nodes, 1 nps, move g1f3
Benchmark time 9010ms, 19 nodes, 2 nps, move g1f3
Benchmark time 10000ms, 21 nodes, 2 nps, move g1f3
bestmove g1f3
Benchmark final time 12.791s calculating 3.04901 nodes per second.
$

@lealgo lealgo mentioned this pull request Jun 25, 2019
@bicho2

bicho2 commented Jul 3, 2019

Hi, I compared the OpenBLAS and OpenCL engines on a Samsung S8+. I set the UCI depth to 2 moves. OpenBLAS needed 10 seconds against 7 for OpenCL. I am quite disappointed by OpenCL: it needs 3 minutes to start and then it is just a bit quicker than BLAS. The engine is stronger than Stockfish at a given depth, but because it is slow, Stockfish beats it by going deeper thanks to its fast evaluation speed. Thank you very much for allowing me to play chess against an AI. It is more human-like: better evaluation of the position and less depth in move calculation.

@lealgo
Contributor Author

lealgo commented Aug 12, 2019

New builds for 0.22 release:

  • aarch64 builds for Android 5.0 and up:

lc0-0.22-eigen-aarch64.zip
lc0-0.22-blas-aarch64.zip
lc0-0.22-opencl-aarch64.zip

with Little Demon 2 embedded:
lc0-0.22-LD2-eigen-aarch64.zip
lc0-0.22-LD2-blas-aarch64.zip
lc0-0.22-LD2-opencl-aarch64.zip

  • armv7a builds for Android 4.1 and up:

lc0-0.22-eigen-armv7a.zip
lc0-0.22-blas-armv7a.zip

with Little Demon 2 embedded:
lc0-0.22-LD2-eigen-armv7a.zip
lc0-0.22-LD2-blas-armv7a.zip

@UA2425

UA2425 commented Nov 22, 2019 via email

@lealgo
Contributor Author

lealgo commented Nov 22, 2019

Hi,

@UA2425 Thanks.
I'm not really active again, but at least I try to keep the builds alive for the main releases. I'm going to merge the current release back into this branch to keep it current, and from what I've seen there are no breaking changes, which is very good. This branch really has very few diffs against main, and most are for the build system. The only roadblock is the protobuf dependency for cross-compiling, which is working thanks to some magic @borg323 did a while ago. But as far as I know there's no clean way to do it.

Best regards,
Leandro

@lealgo
Contributor Author

lealgo commented Dec 2, 2019

Builds for the new release 0.23:

aarch64 for Android 5.0 and up:
lc0-0.23-blas-aarch64.zip
lc0-0.23-eigen-aarch64.zip
lc0-0.23-opencl-aarch64.zip

armv7a for Android 4.1 and up:
lc0-0.23-blas-armv7.zip
lc0-0.23-eigen-armv7.zip
lc0-0.23-opencl-armv7.zip

@borg323 borg323 mentioned this pull request Dec 5, 2019
@lealgo
Contributor Author

lealgo commented Feb 14, 2020

After the change to c++17, the master branch is building fine with the Android NDK r21.

There's an (old) issue, though, when targeting armv7a and trying to build for API levels below 24 (Android 7). The problem is that recent NDKs apparently have trouble targeting 32-bit Android at older API levels. The issue is explained here:

https://android.googlesource.com/platform/bionic/+/master/docs/32-bit-abi.md

Previously I was using an older NDK as a workaround, but such toolchains don't support c++17 so now we're out of luck.

TL;DR: From now on the armv7a builds will require Android 7 and above. aarch64 builds are unaffected and will continue to support Android 5.
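
In NDK terms these minimums correspond to API levels, which is also what appears in the clang wrapper names. A minimal Python sketch of the mapping, assuming each build targets the lowest API level matching its stated Android version (the actual build scripts are not shown in this thread); `ndk_clang` and the table names are hypothetical.

```python
# Android version -> API level (standard platform values)
API_LEVEL = {"4.1": 16, "5.0": 21, "7.0": 24, "9": 28}

# Minimum Android version per architecture after the change described above
MIN_ANDROID = {"armv7a": "7.0", "aarch64": "5.0"}

# Target triple prefix used in the NDK clang wrapper names
TRIPLE = {"armv7a": "armv7a-linux-androideabi", "aarch64": "aarch64-linux-android"}


def ndk_clang(arch, cxx=False):
    """Name of the NDK clang wrapper for the minimum supported API level."""
    api = API_LEVEL[MIN_ANDROID[arch]]
    return f"{TRIPLE[arch]}{api}-clang" + ("++" if cxx else "")


for arch in ("armv7a", "aarch64"):
    print(arch, ndk_clang(arch), ndk_clang(arch, cxx=True))
```

The earlier armv7a builds for Android 4.1 would have corresponded to API level 16 in the same table.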

@lealgo
Contributor Author

lealgo commented Feb 24, 2020

Big news!

First appveyor TEST build successful, by @borg323 :

https://ci.appveyor.com/project/borg323/lc0/builds/30998575

HWHWI:/data/local/tmp $ ./lc0 benchmark -w 11258-32x4-se --threads=8 --max-prefetch=0 --minibatch-size=8                                                                                                    
WARNING: linker: "/data/local/tmp/lc0" unused DT entry: type 0xf arg 0x629
       _
|   _ | |
|_ |_ |_| v0.25.0-dev+git.9be6552 built Feb 23 2020
Loading weights file from: 11258-32x4-se
Creating backend [blas]...
Using Eigen version 3.3.5
BLAS max batch size is 256.
Benchmark time 49ms, 4 nodes, 117 nps, move d2d4
Benchmark time 77ms, 6 nodes, 96 nps, move e2e4
Benchmark time 91ms, 10 nodes, 131 nps, move d2d4
...
Benchmark time 4792ms, 4349 nodes, 910 nps, move d2d4
Benchmark time 7834ms, 7769 nodes, 993 nps, move d2d4
Benchmark time 10000ms, 10354 nodes, 1036 nps, move d2d4
bestmove d2d4
Benchmark final time 10.0874s calculating 1032.77 nodes per second.

Note the small linker warning on the first line of the output; it's produced by the fairly old NDK r17 toolchain. Newer NDKs produce clean binaries.

@lealgo
Contributor Author

lealgo commented Mar 2, 2020

The Android builds are official now, I think I can close this PR. Thank you.

@lealgo lealgo closed this Mar 2, 2020
@lealgo
Contributor Author

lealgo commented Mar 2, 2020

Test builds for comparing OpenBLAS 0.3.9 vs 0.3.8

lc0-blas-aarch64.zip

Labels
help wanted Extra attention is needed wip Work in progress
10 participants