
Spurious 'Unable to read beyond end of stream' in CI while collecting results #210

Closed
Cyberboss opened this issue Oct 8, 2018 · 71 comments
Labels
bug Something isn't working

Comments

@Cyberboss

https://travis-ci.org/tgstation/tgstation-server/jobs/438449753#L1156

Not sure what causes this; it happens rarely and seemingly at random. Latest coverlet.console.

@p10tyr

p10tyr commented Oct 12, 2018

Could it be that the files are cleaned up before Coverlet finishes doing its thing?

That error usually occurs when a StreamReader opens a file and caches its Length in memory, then starts to stream the data... but the file is either empty or, on rare occasions, in the process of being deleted.

For the cached Length to exceed the actual length of the file is not possible unless the file was modified, as mentioned.

Because of this "randomness" it looks like a race condition somewhere.
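
To illustrate the failure mode (a minimal standalone sketch, not coverlet's code; the file name is made up): BinaryReader throws exactly this exception when the file on disk is shorter than the data the reader tries to consume.

using System.IO;

class TruncationDemo
{
    static void Main()
    {
        // Write a file that is shorter than one Int32 (4 bytes).
        File.WriteAllBytes("hits.tmp", new byte[2]);

        using (var reader = new BinaryReader(File.OpenRead("hits.tmp")))
        {
            // Throws System.IO.EndOfStreamException:
            // "Unable to read beyond the end of the stream."
            int value = reader.ReadInt32();
        }
    }
}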

@prime-gram

prime-gram commented Oct 24, 2018

I'm also facing this, but consistently, when dealing with a huge number of test cases (~17k), i.e. in our nightly builds (where more complete testing is done).

In regular use cases, I don't see this happening.

My use case:
I'm calling coverlet via dotnet test (through Cake's DotNetCoreTest).

Error Line:
[16:34:51] : [Step 2/3] C:\Users\SVC-SS-BUILDER\.nuget\packages\coverlet.msbuild\2.3.1\build\netstandard2.0\coverlet.msbuild.targets(27,5): error : Unable to read beyond the end of the stream.

More Detailed:

[16:34:51] :     [Step 2/3] Results File: D:\tc_ba2\510a05667bb2cb9a\test\DynamicPricing.Application.Nightly.Tests\TestResults\TestResults.trx
[16:34:51] :     [Step 2/3] 
[16:34:51] :     [Step 2/3] Total tests: 17071. Passed: 17071. Failed: 0. Skipped: 0.
[16:34:51] :     [Step 2/3] Test Run Successful.
[16:34:51] :     [Step 2/3] Test execution time: 26.1204 Minutes
[16:34:51]i:     [Step 2/3] TRX (1s)
[16:34:51] :     [Step 2/3] 
[16:34:51] :     [Step 2/3] Calculating coverage result...
[16:34:51] :     [Step 2/3] C:\Users\SVC-SS-BUILDER\.nuget\packages\coverlet.msbuild\2.3.1\build\netstandard2.0\coverlet.msbuild.targets(27,5): error : Unable to read beyond the end of the stream. [D:\tc_ba2\510a05667bb2cb9a\test\DynamicPricing.Application.Nightly.Tests\DynamicPricing.Application.Nightly.Tests.csproj]
[16:34:51]E:     [Step 2/3] An error occurred when executing task 'RunTests'.
[16:34:52]E:     [Step 2/3] Error message is logged
[16:34:51]E:     [Step 2/3] Error: System.AggregateException: One or more errors occurred. ---> Cake.Core.CakeException: .NET Core CLI: Process returned an error (exit code 1).
[16:34:51]E:     [Step 2/3]    at Cake.Core.Tooling.Tool`1.ProcessExitCode(Int32 exitCode)
[16:34:51]E:     [Step 2/3]    at Cake.Core.Tooling.Tool`1.Run(TSettings settings, ProcessArgumentBuilder arguments, ProcessSettings processSettings, Action`1 postAction)
[16:34:51]E:     [Step 2/3]    at Cake.Common.Tools.DotNetCore.Test.DotNetCoreTester.Test(String project, DotNetCoreTestSettings settings)
[16:34:51]E:     [Step 2/3]    at Submission#0.DotNetCoreTest(String project, DotNetCoreTestSettings settings)
[16:34:51]E:     [Step 2/3]    at Submission#0.<<Initialize>>b__0_27(String configuration, String provider, Tuple`2 connectionInfo, Boolean nightly, Boolean coverage)
[16:34:51]E:     [Step 2/3]    at Submission#0.<<Initialize>>b__0_26()

@p10tyr

p10tyr commented Oct 25, 2018

Yeah, it looks like coverlet is trying to open the file... and it's not there any more.
Again, the common thing here is a CI pipeline.
Can you run this locally on your machine using dotnet test with coverlet?
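
For reference, the minimal local repro would be something like this, run from the test project directory:

dotnet test /p:CollectCoverage=true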

@prime-gram

I ran it locally but was unable to reproduce it.

In CI it happened every time, but never on my laptop.

@p10tyr

p10tyr commented Oct 26, 2018

So, as I thought, it is a CI-related issue. It's not a build or coverlet problem; it's that the CI starts to clean up too quickly.
Please tell us more about how your CI is set up, which platform, etc.

The next step of problem elimination is to run it manually on the CI agent.

If it is Azure DevOps you should be able to get console access to your build agent (I am not sure about self-hosted agents, though) and try running the pipeline commands manually. See if it breaks the same way. If it does, that means it's an environment problem, but I seriously doubt that.

To Consider:

I think that when pipelines execute one after the other, a cleanup operation happens in between. Please check for any options like that and try doing the cleanup at the end, or do a clean of the sources as the first step and make sure no other cleanup options are enabled.

Please let us know what you find.

@prime-gram

That is still not clear to me.

19k Total Tests = (17k nightly + 2k regular baseline)

We in fact have 19k tests; the 17k tests I was able to run (until I got the exception) were therefore in the middle of the process, i.e. before any possible CI cleanup (which I'm not sure we do at the end anyway; I believe we clean things up at the beginning of a CI run).

Another hint: my exclusion lists were not working properly (I was passing separate exclusion lists via the command line, which doesn't work well in PowerShell; there's an issue for that), so the extra 17k tests were also being "covered" by coverlet, generating a massive output (my guess, since I'm unable to see the result because of the exception).

Cheers

@p10tyr

p10tyr commented Oct 26, 2018

Yeah... I am not entirely sure. The best thing to do is to try to recreate the problem somewhere outside of CI.
Sorry, that is all I can suggest for this.

@Cyberboss
Author

Cyberboss commented Oct 27, 2018

A potential workaround is to invoke bash and run sync afterwards,

i.e.

$HOME/.dotnet/tools/coverlet bin/$CONFIG/netcoreapp2.1/Tgstation.Server.Api.Tests.dll --target "bash" --targetargs "-c \"dotnet test -c $CONFIG --no-build && sync\"" --format opencover --output "../../TestResults/api.xml" --include "[Tgstation.Server*]*" --exclude "[Tgstation.Server.Api.Tests*]*"

@Cyberboss
Author

Disregard, it still errored out

@prime-gram

I have a "feeling" that this has to do with some asynchronous write of the results: coverlet tries to read the output before the write has finished.

But this is just a guess...

@leppie

leppie commented Oct 30, 2018

Getting this too on AppVeyor lately (last 2 days). Nothing has changed on our side.

@tonerdo
Collaborator

tonerdo commented Oct 31, 2018

> Not sure what causes this; it happens rarely and seemingly at random. Latest coverlet.console.

Out of curiosity, is everyone having this problem using coverlet.console?

@Cyberboss
Author

Cyberboss commented Oct 31, 2018 via email

@leppie

leppie commented Oct 31, 2018

> Not sure what causes this; it happens rarely and seemingly at random. Latest coverlet.console.
>
> Out of curiosity, is everyone having this problem using coverlet.console?

I am using coverlet.msbuild and just calling dotnet test /p:CollectCoverage=true on a per-project basis in AppVeyor.

Still on 2.3.0 here

@prime-gram

" just calling dotnet test /p:CollectCoverage=true on a per project basis"

Same here. Env: Windows + Teamcity + cake build script (which calls that, then).

@tonerdo
Collaborator

tonerdo commented Nov 4, 2018

Quick follow-up question: does this only happen on a CI server? Has anyone experienced it locally?

@Cyberboss
Author

Cyberboss commented Nov 4, 2018 via email

@WiredUK

WiredUK commented Nov 8, 2018

Just wanted to add my support to this issue. We also have intermittent failures on our Azure DevOps pipeline. Here's a snippet from the last failure, using coverlet.msbuild:

2018-11-08T17:03:48.4568642Z Results File: D:\agents\5\_work\_temp\SVC_TFSBuild_NCL-S-BUILD03_2018-11-08_17_03_47.trx
2018-11-08T17:03:48.4569469Z 
2018-11-08T17:03:48.4572151Z Total tests: 14. Passed: 14. Failed: 0. Skipped: 0.
2018-11-08T17:03:48.4572637Z Test Run Successful.
2018-11-08T17:03:48.4576890Z Test execution time: 12.9343 Seconds
2018-11-08T17:03:48.5061451Z 
2018-11-08T17:03:48.5062641Z Calculating coverage result...
2018-11-08T17:03:49.0712122Z D:\.nuget\packages\coverlet.msbuild\2.3.1\build\netstandard2.0\coverlet.msbuild.targets(27,5): error : Unable to read beyond the end of the stream. [D:\agents\5\_work\58\s\test\aaa.bbb.ccc.Unit.Tests\aaa.bbb.ccc.Unit.Tests.csproj]

@tonerdo added the bug label Nov 13, 2018
@WiredUK

WiredUK commented Nov 14, 2018

I've started doing some testing on this one. I have a custom version of coverlet with a bunch of extra logging that will hopefully show exactly where and why this is exploding. It's quite tedious, though, as a CI build needs to run and the hamsters in our build server are very tired. Will update when I get some more info...

@WiredUK

WiredUK commented Nov 14, 2018

And here is a better stack trace:

2018-11-14T14:47:34.2650390Z System.IO.EndOfStreamException: Unable to read beyond the end of the stream.
2018-11-14T14:47:34.2651311Z    at System.IO.BinaryReader.FillBuffer(Int32 numBytes)
2018-11-14T14:47:34.2652232Z    at System.IO.BinaryReader.ReadInt32()
2018-11-14T14:47:34.2652389Z    at Coverlet.Core.Coverage.CalculateCoverage()
2018-11-14T14:47:34.2652489Z    at Coverlet.Core.Coverage.GetCoverageResult()
2018-11-14T14:47:34.2652672Z    at Coverlet.MSbuild.Tasks.CoverageResultTask.Execute()

@p10tyr

p10tyr commented Nov 14, 2018

I know that error from working with files. It basically means the file was deleted while it was being read.

That log is cool... but it would be even better if you knew which file it is trying to read. I suppose Coverlet is trying to instrument the DLL and then all of a sudden it disappears...

This is where the error is propagating

https://github.com/tonerdo/coverlet/blob/7ede9d7cba383fea9709cee4d380c0de27b58b42/src/coverlet.core/Coverage.cs#L178

And by reading

https://docs.microsoft.com/en-us/dotnet/api/system.io.binaryreader.readint32?view=netframework-4.7.2

It seems like there may be a bug here where, in the loops doing the ReadInt32 calls, Coverlet expected more entries than were actually in the file.

Also it is difficult to understand what that code is actually doing...
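
For context, the reading code is roughly of the following shape (a simplified sketch, not the actual Coverage.cs source): a count is read from the start of the hits file, and the loop then trusts that count.

using System.IO;

class HitsFileReader
{
    static int[] ReadHits(string hitsFilePath)
    {
        using (var reader = new BinaryReader(File.OpenRead(hitsFilePath)))
        {
            // The file starts with the number of hit candidates. If the file
            // is empty (writer killed before any flush), this read throws.
            int hitCandidatesCount = reader.ReadInt32();
            var hits = new int[hitCandidatesCount];

            // One Int32 per candidate follows. If the file was truncated
            // mid-write, one of these reads throws EndOfStreamException.
            for (int i = 0; i < hitCandidatesCount; i++)
                hits[i] = reader.ReadInt32();

            return hits;
        }
    }
}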

@WiredUK

WiredUK commented Nov 14, 2018

@ppumkin Yeah, I'm going to dig a little further now that I know which part of the app to instrument. It's difficult to test as it's a sporadic issue, but I can queue up multiple builds at the same time, which helps.

@WiredUK

WiredUK commented Nov 14, 2018

@ppumkin Regarding your edit, the error could also be here:

https://github.com/tonerdo/coverlet/blob/7ede9d7cba383fea9709cee4d380c0de27b58b42/src/coverlet.core/Coverage.cs#L190

Edit:
But it's not; it is throwing when trying to read hitCandidatesCount.

@p10tyr

p10tyr commented Nov 14, 2018

Yes, true. That is why I wrote that it's difficult to understand the code: it has a few levels of nested loops, and it reads a 4-byte integer each time from result.HitsFilePath.
What is in that file? Some kind of temporary file? If it is temporary, maybe that is the problem on the CI builds: lack of permissions to access the file, or it gets cleaned up after the build runs?

What is the file? (the path to it and its contents)
@WiredUK

WiredUK commented Nov 14, 2018

The file that it was trying to read from varies, but for me it is of the form (where xxxx.yyyy.zzzz is a project name):

C:\Users\SVC_TFSBuild\AppData\Local\Temp\xxxx.yyyy.zzzz_d884df17-4b8f-41e4-9ebf-93b044af742d

@WiredUK

WiredUK commented Nov 14, 2018

I'm now wondering if it's something to do with it being in the temp directory...

@Cyberboss
Author

Here's one I just triggered in Travis: https://travis-ci.org/tgstation/tgstation-server/jobs/455035648#L1719

@p10tyr

p10tyr commented Nov 14, 2018

Yes! Temp directory! I bet something on Azure DevOps (or, apparently, other CIs like Travis) is cleaning up the temp directory after the build completes.

@WiredUK

WiredUK commented Nov 14, 2018

Possibly, yeah, but this build server of ours is local, not a hosted build.

@WiredUK

WiredUK commented Nov 15, 2018

That's a potentially good spot there @Cyberboss.

Question for all others: Are your build servers under heavy load or severely underpowered by any chance? Mine absolutely is (currently running on a shared host with a bunch of other VMs) and the disk is being hammered, so there's every possibility that it's taking too long to execute that process.

@Cyberboss
Author

Cyberboss commented Nov 15, 2018 via email

@petli
Collaborator

petli commented Nov 15, 2018

A way to avoid doing a lot of work in ProcessExit might be to use a memory-mapped file to keep track of hit counts, which would be pretty easy to implement after #181. Instead of allocating an int[], a memory-mapped file of size HitCandidates.Count * 4 would be allocated, and each hit would increment a counter at HitIndex * 4. This way there is no data to write to disk at the end of the test run.

It shouldn't be a significant performance hit to use the memory-mapped APIs instead of incrementing an int array directly, but that must be verified. It may not be necessary to use file-backed memory maps at all; they could just be used as a shared memory area between the coverlet process and the test process.

To avoid threads locking each other there would have to be one file per thread (much like there's one array per thread in #181), and Coverage.CalculateCoverage() would have to read all those files and tally them.
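
A minimal sketch of that idea (names are illustrative, not coverlet's actual implementation): the tracker increments counters directly in a shared memory map, so there is nothing to flush in ProcessExit.

using System.IO.MemoryMappedFiles;

class MmapHitTracker
{
    private readonly MemoryMappedViewAccessor _accessor;

    public MmapHitTracker(string mapName, int hitCandidatesCount)
    {
        // One Int32 counter per hit candidate; the map is shared with the
        // host process by name. (Named maps like this are Windows-only.)
        var mmf = MemoryMappedFile.CreateOrOpen(mapName, hitCandidatesCount * 4);
        _accessor = mmf.CreateViewAccessor();
    }

    public void RecordHit(int hitIndex)
    {
        // Read-increment-write is not atomic, which is why one map per
        // thread (as suggested above) would be needed to avoid races.
        long offset = (long)hitIndex * 4;
        _accessor.Write(offset, _accessor.ReadInt32(offset) + 1);
    }
}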

@p10tyr

p10tyr commented Nov 16, 2018

Is this a use case for the new Span<T>, maybe? I have never used it, but from listening to some conference talks it sounds like the kind of thing it could be used for.

Anyway, anything other than file-system writes is going to be better... Sounds like a lot of work, though.

Is there a "temporary" workaround in the meantime, maybe increasing this timeout?

@ElephantsGerald

I can also confirm that this starts to become a problem on a CI server under load.

@Marachkouski

We also have this problem in our build environment.

@codemzs

codemzs commented Jan 11, 2019

@tonerdo Is there a workaround for this? I'm running into this error at the end of all test execution, after I get all the code coverage results. It happens locally on my machine when I run a debug build, but not a release build.

@sharwell
Contributor

I narrowed this down with @codemzs today. The failure occurs when the hit file is truncated due to process termination while writing the file. After that point, subsequent loads of the file for coverage gathering fail because the file is either empty or contains fewer entries than claimed. I'm planning to address two parts of this and send a pull request:

  1. Improve write performance using memory mapped files, allowing the final flush to disk to occur after the process closes.
  2. During reads, handle files that are too short by assuming the rest of the data is zero-filled.

There was some mention above of possible timeouts. I'm not planning to address that part specifically, but if the allowed data-flush duration can be extended, it would likely round things out for good overall reliability.
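
A sketch of point 2 above (simplified, with hypothetical names; not the actual code from the pull request): treat a premature end of file as zero hit counts for the remaining entries instead of throwing.

using System.IO;

class TolerantHitsReader
{
    static int[] ReadHits(string hitsFilePath, int expectedCount)
    {
        var hits = new int[expectedCount]; // missing tail stays zero-filled

        using (var reader = new BinaryReader(File.OpenRead(hitsFilePath)))
        {
            for (int i = 0; i < expectedCount; i++)
            {
                // Stop at the truncation point instead of throwing
                // EndOfStreamException; the rest of the array remains 0.
                if (reader.BaseStream.Length - reader.BaseStream.Position < sizeof(int))
                    break;
                hits[i] = reader.ReadInt32();
            }
        }

        return hits;
    }
}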

@codemzs

codemzs commented Jan 14, 2019

Can we please get #276 merged to address the issue, and get a new NuGet package out? @Cyberboss @tonerdo @petli

@tonerdo
Collaborator

tonerdo commented Jan 17, 2019

@Cyberboss @codemzs New NuGet packages with the proposed fixes have been released.

@ViktorHofer
Contributor

I believe the fastest way to transfer data between the child (the test process) and the host (coverlet) would be to just write the unique id + the hit count to stdout and read it in the host. That would remove the complex MemoryMappedFile handling without any file I/O.

The problem with that solution is that coverlet.msbuild.tasks allows instrumenting an assembly and collecting its coverage without controlling how the process is launched. Therefore, reading from stdout is not possible in the build task.

cc @petli

@petli
Collaborator

petli commented Jan 21, 2019

@ViktorHofer Do you think named pipes would be as fast as stdout? The pipe name could be passed into the instrumented code via ModuleTrackerTemplate, much like the mmap name.
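
A rough sketch of the tracker side of that idea (pipe name and payload format are hypothetical; the host would listen with a matching NamedPipeServerStream):

using System.IO;
using System.IO.Pipes;

class PipeHitReporter
{
    // In the real proposal the pipe name would be injected into the
    // instrumented module via ModuleTrackerTemplate; here it is a parameter.
    public static void SendHits(string pipeName, int[] hits)
    {
        using (var pipe = new NamedPipeClientStream(".", pipeName, PipeDirection.Out))
        {
            pipe.Connect(5000); // wait up to 5s for the coverlet host to listen

            using (var writer = new BinaryWriter(pipe))
            {
                writer.Write(hits.Length);  // count header
                foreach (int hit in hits)
                    writer.Write(hit);      // one Int32 per hit candidate
            }
        }
    }
}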

@sharwell
Contributor

I'm starting to think the cause of this issue lies in the test runner itself, as opposed to the CLR. However, I haven't been able to pinpoint the sequence in the vstest runner that leads to the forced termination of the test process.

@ViktorHofer
Contributor

ViktorHofer commented Jan 24, 2019

Correct, the shutdown is triggered by vstest. This can easily be tested with the following code. It happens with both MSTest and xUnit, which proves that it's not the test framework itself but the test platform (vstest) that is responsible for stopping the execution early.

using Microsoft.VisualStudio.TestTools.UnitTesting;
using System;
using System.Diagnostics;
using System.Threading.Tasks;

namespace dotnettest
{
    [TestClass]
    public class UnitTest1
    {
        [TestMethod]
        public void Test1()
        {
            AppDomain.CurrentDomain.ProcessExit += CurrentDomain_ProcessExit;
        }

        private void CurrentDomain_ProcessExit(object sender, EventArgs e)
        {
            // If the host lets ProcessExit run to completion, abc.txt ends up
            // containing "end"; if the process is killed early, it still says "start".
            System.IO.File.WriteAllText("abc.txt", "start");
            Task.Delay(10000).Wait();
            System.IO.File.WriteAllText("abc.txt", "end");
        }
    }
}

Repro: dotnet test.

Related code paths in vstest:
https://github.com/Microsoft/vstest/blob/7b6248203164f8ea821f6795632bd22e0b69afb0/src/Microsoft.TestPlatform.CoreUtilities/Utilities/JobQueue.cs

VSTest, which is triggered by dotnet test, is responsible. I'll try to gather more information.

As a workaround you could invoke your test runner without dotnet test, e.g. via xunit.console.dll (unsupported).

@MarcoRossignoli
Collaborator

Closing thanks to the new collectors: https://github.com/tonerdo/coverlet#requirements
