
Replacing AppVeyor with another CI for windows #568

Closed
Neiko2002 opened this issue May 30, 2018 · 32 comments

@Neiko2002
Member

I was wondering if it makes sense to swap AppVeyor for another CI host for Windows builds. Any suitable candidate should be tested with one of the biggest javacpp-presets (e.g. MXNet or TensorFlow). We could also stay with AppVeyor and just build TensorFlow with another CI.

Here is a small list of CI service providers and some information about them:
https://github.com/bytedeco/javacpp-presets/wiki/Continuous-Integration-(CI)

@saudet
Member

saudet commented May 30, 2018

Sure, but build times for TensorFlow are longer than 2 hours. I don't know of any service that supports that, do you?

@Neiko2002
Member Author

Not yet, but I might try Shippable to build the TensorFlow libs using cppbuild.sh, in order to see if it's useful for us. This ticket is just for exchanging experiences with various CI service providers.
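
For reference, such a test run would presumably use the top-level build script of javacpp-presets along these lines (a hedged sketch; the exact platform name and options may differ):

bash cppbuild.sh -platform windows-x86_64 install tensorflow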

@saudet
Member

saudet commented May 31, 2018

It looks like Shippable also has a limit of 2 hours, just like AppVeyor:
http://docs.shippable.com/ci/custom-timeouts/

@vb216 Would you have any suggestions?

@vb216
Member

vb216 commented May 31, 2018 via email

@saudet
Member

saudet commented May 31, 2018

We might be able to get 4 hours if we start paying for it? Maybe, we'd have to ask...

@vb216
Member

vb216 commented May 31, 2018 via email

@saudet
Member

saudet commented May 31, 2018

Sure, that'd be great! Thanks

@Neiko2002
Member Author

Neiko2002 commented May 31, 2018

Compiling the CUDA kernels takes a long time on Windows. Just the tf_core_gpu_kernels.vcxproj project of TensorFlow takes over 2h on my machine, and creating the static libs (tensorflow_static.vcxproj) afterwards needs another hour. There are of course many dependency projects which are included in the aforementioned projects. We could build those separately and store their results beforehand to reduce the compilation time of the bigger projects, as sketched below.
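
As an illustration, individual projects inside the CMake-generated solution could be built and their outputs cached with something like this (a hedged sketch; project and solution names follow the TensorFlow CMake build, but the exact paths depend on cppbuild.sh):

msbuild tensorflow.sln /t:tf_core_gpu_kernels /p:Configuration=Release /m:2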

@vb216 If you ask Shippable, you might also want to ask AppVeyor.

@vb216
Member

vb216 commented May 31, 2018

Ah, I meant asking AppVeyor - they've been pretty kind in the past, increasing from their default 60 min build to about 110 min nowadays. They already replied too, suggesting https://www.appveyor.com/docs/build-environment/#private-build-cloud as an option. We would need to be on their premium service, but they give a 50% discount for open source projects, which seems pretty reasonable. I guess there's also the cost of the cloud instance to add on top of that. It looks like two build engines, so overall time would be quicker as well.

The only other suggestion I had was going back in the Jenkins sort of direction, as it seems like most cloud build providers will have some sort of time limit. But that seems a step backwards from where we are right now.

@saudet
Member

saudet commented May 31, 2018

Yeah, it would be great if we could continue using cloud services like that.

I have no problems with paying a small fee, but any ideas why those places can't provide builds longer than 2 hours even to paying customers? I feel that managing a private build cloud wouldn't give us much over Jenkins...

@vb216
Member

vb216 commented Jun 1, 2018 via email

@saudet
Member

saudet commented Jun 11, 2018

Status update: It sounds like with a Premium plan AppVeyor should be able to provide us with long enough build times.

@Neiko2002 Have you been able to find anything else?

@Neiko2002
Member Author

@saudet The list is quite complete I think, but only for free services. Back then I didn't know you were also considering paid ones; there are many more of those.

@saudet
Member

saudet commented Jun 18, 2018

@Neiko2002 In your opinion, what would be the best ones?

@saudet
Member

saudet commented Aug 8, 2018

@Neiko2002 @vb216 With Python support (pull #596) and CUDA it would take about 7 hours to build on AppVeyor. Also, I still haven't been able to figure out why it insists on reporting an exit code of 259, so partial builds don't appear practical either. In any case, AppVeyor wasn't designed for long builds like that, so is our only option here to do it manually with Jenkins or something?

@wumo If you have any ideas as well, please let us know!

@saudet
Member

saudet commented Aug 8, 2018

Microsoft-hosted build agents look like a potential solution:

  • Can run jobs for up to 6 hours (30 minutes on the free tier).
  • Currently utilizing Microsoft Azure general purpose virtual machine sizes (Standard_DS2_v2 and Standard_DS3_v2)

https://docs.microsoft.com/en-us/vsts/pipelines/agents/hosted?view=vsts

Anyone interested in giving this a try? I'll at least be testing the build on a Standard_DS3_v2 to make sure it finishes in under 6 hours.

(They even have support for Linux and Mac! But Travis CI works well for those platforms, so I'm not thinking about changing anything there...)

@Neiko2002
Member Author

Sounds interesting. They provide multiple CPU cores + SSD-based temp storage. I've just started building TensorFlow again to check #596 and will be trying out small incremental builds to see how long the different parts need to compile.

@saudet
Member

saudet commented Aug 9, 2018

@Neiko2002 Thanks! Oops, I think I've already done the work for more incremental builds, see the update.

@saudet
Member

saudet commented Aug 9, 2018

I've tested the build on a Standard DS3 v2 (4 vcpus, 14 GB memory) instance on Azure with Windows 2012 R2, Visual Studio 2015, and CUDA 9.2. It took about 3 hours for the core (including Python) and additionally just over 1 hour for CUDA, so it's looking very good. We'll need to figure out how to set up VSTS and integrate it with GitHub, but this guide doesn't seem too terse:
https://docs.microsoft.com/en-us/vsts/pipelines/build/ci-build-github
It looks like it costs $40 per month:
https://visualstudio.microsoft.com/team-services/pricing/
But once we figure out how to set all that up, Skymind will be paying for it, so no worries.

@saudet
Member

saudet commented Aug 9, 2018

It looks like we're more likely to get a Standard DS2 v2 (2 vcpus, 7 GB memory) though:
https://stackoverflow.com/questions/51725187/vsts-microsoft-hosted-agent-virtual-machine-size
Let's see how that fares...

@saudet
Member

saudet commented Aug 10, 2018

Hum, it takes a bit more than 6.5 hours to build with CUDA (or about 5 hours without) on a Standard DS2 v2 (2 vcpus, 7 GB memory), not cool...

@saudet
Member

saudet commented Aug 11, 2018

@Neiko2002 Could you try and see if there wouldn't be a way to split the build into even more increments? Ideally splitting the non-CUDA core build into roughly 3 parts of less than 2 hours each on 2 Xeon cores.

@Neiko2002
Member Author

Neiko2002 commented Aug 14, 2018

@saudet I just tried the different modules that are needed to build the tensorflow_static lib. Every project is compiled one after the other, starting from the bottom. tf_c needs 27 minutes, but most of that comes from tf_core_lib, which alone needs 25 minutes. The only strange behavior is tf_tools_transform_graph_lib: it only contains a few files but recompiles the tf_core_kernels project, which is why it takes 1h 32min to build (1h 25min of that from tf_core_kernels).

[Figure: build times of the different TensorFlow modules]

I was setting maxcpucount in the cppbuild.sh to 2, but this option does not work as one would expect:
https://msdn.microsoft.com/en-us/library/bb385193.aspx?f=255&MSPPError=-2147217396

/maxcpucount: the MSBuild.exe tool can build multiple projects at the same time
/MP: compiler (cl.exe) option can build multiple compilation units at the same time

CMake activates MultiProcessorCompilation (/MP) inside the *.vcxproj files. By default it uses all available CPU cores. It might be possible to set the corresponding processor count via the CL_MPCount parameter:
https://github.com/Microsoft/checkedc-clang/wiki/Parallel-builds-of-clang-on-Windows
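
A hedged sketch of what limiting both levels of parallelism could look like (project name is illustrative; CL_MPCount only takes effect when /MP is enabled in the project):

msbuild tensorflow_static.vcxproj /p:Configuration=Release /maxcpucount:1 /p:CL_MPCount=2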

I will check and create a PR if this works. But until then, all the timings in my figure above were measured on a multi-core processor.

@Neiko2002
Member Author

Neiko2002 commented Aug 15, 2018

Compiling tf_core_kernels with just two cores using the method described in #599 resulted in 3h 37min. Reading the build log, it seems the compiler does some unnecessary steps:

Source compilation required: input E:\G\JP\TENSORFLOW\CPPBUILD\WINDOWS-X86_64-GPU\BUILD\TF_CORE_LIB.DIR\RELEASE\MUTEX.OBJ is newer than output E:\G\JP\TENSORFLOW\CPPBUILD\WINDOWS-X86_64-GPU\BUILD\TF_CORE_LIB.DIR\RELEASE\TF_CORE_LIB.LIB.

After the message above, tf_core_lib.lib gets re-created even though no compilation occurs; there is no reason to rebuild an already existing lib. Checking the modification dates afterwards shows identical timestamps:
"mutex.obj" 11:27:08
"tf_core_lib.lib" 11:27:08

Looking even closer with wmic we can see a difference, but does MSBuild use such fine-grained timestamps? Timestamp resolution differs per file system (100 ns for NTFS).
"mutex.obj" 20180815112708.291985+120
"tf_core_lib.lib" 20180815112708.759880+120

Following mutex.obj, we find that it was created a few lines earlier, and this forces more libraries to be recreated afterwards.

E:\G\jp\tensorflow\cppbuild\windows-x86_64-gpu\tensorflow-1.10.0-rc1\tensorflow\core\platform\default\mutex.cc will be compiled as E:\G\JP\TENSORFLOW\CPPBUILD\WINDOWS-X86_64-GPU\BUILD\EXTERNAL\NSYNC\PUBLIC\NSYNC_CV.H was modified at 15/08/2018 11:27:04.
Outputs for E:\G\JP\TENSORFLOW\CPPBUILD\WINDOWS-X86_64-GPU\TENSORFLOW-1.10.0-RC1\TENSORFLOW\CORE\PLATFORM\DEFAULT\MUTEX.CC:
E:\G\JP\TENSORFLOW\CPPBUILD\WINDOWS-X86_64-GPU\BUILD\TF_CORE_LIB.DIR\RELEASE\MUTEX.OBJ

The file NSYNC_CV.H was copied into the BUILD\EXTERNAL\NSYNC\PUBLIC\ directory by nsync_copy_headers_to_destination.vcxproj via its PreBuildEvent:

C:\msys64\mingw64\bin\cmake.exe -E copy_directory E:/G/jp/tensorflow/cppbuild/windows-x86_64-gpu/build/nsync/install/include/ E:/G/jp/tensorflow/cppbuild/windows-x86_64-gpu/build/external/nsync/public/

The problem is that copy_directory always overwrites the content of the target directory and therefore changes the last-modified timestamps:
https://bravenewmethod.com/2017/06/18/update_directory-command-for-cmake/
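
A hedged workaround sketch (not what the generated project currently does): copy each header with copy_if_different instead, so files whose content has not changed keep their timestamps. Assuming the flat nsync include directory from the PreBuildEvent above:

for f in E:/G/jp/tensorflow/cppbuild/windows-x86_64-gpu/build/nsync/install/include/*.h; do
  C:/msys64/mingw64/bin/cmake.exe -E copy_if_different "$f" E:/G/jp/tensorflow/cppbuild/windows-x86_64-gpu/build/external/nsync/public/
done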

The nsync_copy_headers_to_destination project is a dependency of many projects and gets executed multiple times in my build process. One of the dependent projects is tf_core_lib itself, so tf_core_lib.lib is guaranteed to be recreated on every run.

And this is just one example of why the build takes so long.

PS: We could use /p:BuildProjectReferences=false if we can ensure all referenced projects exist (see the sketch below):
https://msdn.microsoft.com/en-us/library/bb629394.aspx?f=255&MSPPError=-2147217396
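
A hedged sketch of such an invocation (project path illustrative, assuming its referenced projects were already built in an earlier step):

msbuild tf_core_kernels.vcxproj /p:Configuration=Release /p:BuildProjectReferences=false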

@Neiko2002
Member Author

Neiko2002 commented Aug 15, 2018

Compiling tf_core_kernels twice in a row and disabling the project references (/p:BuildProjectReferences=false) for the second run reduces the build time on 2 CPU cores from 3h 37min to 3h 17min. To lower it further we would need to divide the project (tf_core_kernels.vcxproj): in the first half only some of the files get compiled, and the remaining files plus linking everything are done in the second half.
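
One possible shape for such a split, as a hedged sketch only (it assumes the SelectedFiles property of the VC++ targets can be used to compile a subset of sources, and that the object files survive between the two CI jobs; the file list is a placeholder):

# job 1: compile only part of the sources, no linking
msbuild tf_core_kernels.vcxproj /t:ClCompile /p:Configuration=Release /p:SelectedFiles="first_half_of_the_cc_files"
# job 2: compile the rest and link, without rebuilding project references
msbuild tf_core_kernels.vcxproj /p:Configuration=Release /p:BuildProjectReferences=false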

@saudet
Member

saudet commented Aug 16, 2018

Sounds good! Does this mean that we can build tensorflow_static in less than 4 hours? If so, that might be enough.

@saudet
Member

saudet commented Aug 16, 2018

It still doesn't seem to build under 4 hours here on AppVeyor with 2 cores though:
https://ci.appveyor.com/project/Bytedeco/javacpp-presets/build/612

@Neiko2002
Member Author

No, just changing the CL_MPCount flag does not help much. But I will prepare a PR with a partial TensorFlow build. The first part builds most of the vcxproj projects (incl. the Python API and GPU kernels) in around 2h. The second part creates the missing tf_core_kernels and tensorflow_static in 3h 20min in my tests (using only two cores with CL_MPCount).

@saudet
Member

saudet commented Aug 21, 2018

Some more uncoolness: it looks like we can't get administrator rights on Microsoft-hosted agents for VSTS: https://github.com/IvanBoyko/vsts-install-MSI/issues/3#issuecomment-342798108
https://mohitgoyal.co/2017/08/18/install-powershell-modules-on-hosted-agent-in-vsts-visual-studio-team-services/
Makes it very hard to get anything done...

@saudet
Member

saudet commented Oct 4, 2018

@vb216 found an interesting thread on TensorFlow's repo: tensorflow/tensorflow#10521

It looks like building without __forceinline for Eigen in conv_ops works around the slow build issue (see the sketch below). I'll be testing that, and since we're building with MKL-DNN, we might not even incur any performance hit.
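
For reference, the workaround discussed in that thread amounts to defining EIGEN_STRONG_INLINE as plain inline for MSVC; a hedged sketch of how it could be passed to the CMake build (placement of the flag in cppbuild.sh may differ):

# define EIGEN_STRONG_INLINE as inline so MSVC does not force-inline Eigen code in conv_ops
cmake -DCMAKE_CXX_FLAGS="/DEIGEN_STRONG_INLINE=inline" <other options> <source dir>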

@saudet
Member

saudet commented Oct 16, 2018

In the end, it looks like the best option available out there is still AppVeyor. With the workaround above, we're able to build on 4-core VMs in about 3:30 hours for everything, and about 1:45 hours without CUDA. They also added support for Linux recently and plan to support Mac as well, so this is looking promising. Still, thanks for your time on this @Neiko2002! Much appreciated :)

@saudet saudet closed this as completed Oct 16, 2018
@Neiko2002
Member Author

With Linux and Mac support they would be the first to cover all major operating systems. This could reduce future workload for the project.
