
Test results depend on Mac device #109

Open
navartis opened this issue Oct 29, 2019 · 21 comments

@navartis

After updating macOS to Catalina 10.15 and Xcode to 11.1 (11A1027),

I found that tests run on another Mac device fail against reference images saved on my Mac device.

We're using the latest stable version of the library, 6.2.0.

I spent two days trying to find the cause and fix it, implemented a "Highlight Different Pixels" feature, and created request #108, which can help investigate situations like this.

But at this point I have no more ideas for how to fix this issue.

[Screenshots attached: "Diffed Image vs Highlighted Pixels", plus the failed image and the reference image.]

@MaikCL

MaikCL commented Nov 19, 2019

The same thing happens to me. I recorded all my images on Catalina, but when I run the same tests in a Travis environment with macOS 10.14, they fail on the diffs, as if they were shadows with different contrast.

@mime29

mime29 commented Dec 18, 2019

Same issue here. Any idea about the reason?

@navartis
Author

It seems to me the reason is that in Xcode 11.x Apple migrated simulator rendering from the CPU to the GPU.

GPU calculations are faster but less exact, and the result may depend on the GPU hardware vendor and drivers.

I spent a lot of time on this, but the only working workaround for me is passing perPixelTolerance: 6/256 to FBSnapshotVerifyView.

That, and this request: #108.
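
For anyone who wants to try this, here is roughly what the workaround looks like in a test case. This is a minimal sketch assuming the Swift `FBSnapshotVerifyView` API from iOSSnapshotTestCase 6.x; `MyView` is a hypothetical view under test.

```swift
import FBSnapshotTestCase
import UIKit

final class MyViewSnapshotTests: FBSnapshotTestCase {
    override func setUp() {
        super.setUp()
        recordMode = false // flip to true once to (re)record reference images
    }

    func testMyView() {
        let view = MyView() // hypothetical view under test
        view.frame = CGRect(x: 0, y: 0, width: 375, height: 200)
        // 6/256 ≈ 0.023: both literals are inferred as CGFloat here,
        // so this is floating-point division, not integer division.
        FBSnapshotVerifyView(view, perPixelTolerance: 6 / 256)
    }
}
```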

@luiz-brasil

luiz-brasil commented Mar 16, 2020

> It seems to me the reason is that in Xcode 11.x Apple migrated simulator rendering from the CPU to the GPU.
>
> GPU calculations are faster but less exact, and the result may depend on the GPU hardware vendor and drivers.
>
> I spent a lot of time on this, but the only working workaround for me is passing perPixelTolerance: 6/256 to FBSnapshotVerifyView.
>
> That, and this request: #108.

@navartis Why are you using the number 6/256 for perPixelTolerance?

@mime29

mime29 commented Mar 19, 2020

We haven't found a good solution on our side yet, but we noticed one thing: the random diff results seem to come from views containing transparent assets (images whose alpha channel is not set to 1).
Do you observe the same thing on your side?

@navartis
Author

> It seems to me the reason is that in Xcode 11.x Apple migrated simulator rendering from the CPU to the GPU. GPU calculations are faster but less exact, and the result may depend on the GPU hardware vendor and drivers. I spent a lot of time on this, but the only working workaround for me is passing perPixelTolerance: 6/256 to FBSnapshotVerifyView.

> @navartis Why are you using the number 6/256 for perPixelTolerance?

@luiz-brasil Unfortunately, 6/256 is a magic number. I hate magic numbers in code, but in this case I have no better ideas.

@samskiter

samskiter commented Apr 8, 2020

We're also seeing this issue. Interestingly, it seems to occur only when using usesDrawViewHierarchyInRect or when snapshotting during a UI test (not a unit test).

It seems it could be related to the view being rendered in a UIWindow.

I also got the same results between two Mac Pros (i.e. identical machines produce the same result).

Could this be graphics-card related, and is it time to raise a Radar?

@pigeon56

I'm so glad that I stumbled across this thread, because I think I'm seeing the same thing.

About a week ago, I started noticing failures in some of my iOS Snapshot tests. These tests passed locally - both when run via Xcode and Fastlane - but failed when run on CI. I'm still trying to figure out which OS our build machine is using (though I suspect it might be 10.14.x), but it just so happens that I updated my machine to Catalina last week, and now I'm getting these failures.

I was able to capture the diff screenshots after running the tests on CI, and they're gray. I asked a designer to take one and bump up the contrast, and only with a massive bump in contrast were we able to see the slightest traces of some screen elements. (And I really had to squint to see them, even after the adjustment.)

Seems like the workaround at the moment is to bump up the tolerance. Has anyone found a different solution?

@samskiter

Just to confirm - is everyone else seeing that this not only happens between 10.14 and 10.15 (which is normal for any upgrade) but also between machines that are running the same version of macOS Catalina (10.15)?

Finally, has anyone got a simple test case that could be sent to Apple?

@luiz-brasil

@samskiter unfortunately I don't have a specific test case, primarily because these errors occur randomly.

I can confirm that the error occurs between different machines running the same version of macOS Catalina.

@gspiers

gspiers commented May 5, 2020

@samskiter Results seem to be GPU dependent. In the simulator you can get different results by switching between the integrated and discrete GPUs via File -> GPU Selection. I've done a lot of investigation, and the only fix was to increase the tolerance of the snapshots.

One of the Apple engineers who works on the simulator said the way forward is to increase the pixel-matching tolerance (conversation on Twitter). Hope that helps.

@navartis
Author

navartis commented May 5, 2020

😢

@samskiter

samskiter commented May 5, 2020 via email

@JustinDSN

JustinDSN commented May 6, 2020

> @samskiter Results seem to be GPU dependent. In the simulator you can get different results by switching between the integrated and discrete GPUs via File -> GPU Selection. I've done a lot of investigation, and the only fix was to increase the tolerance of the snapshots.
>
> One of the Apple engineers who works on the simulator said the way forward is to increase the pixel-matching tolerance (conversation on Twitter). Hope that helps.

Because of this issue, what values are teams using for perPixelTolerance and overallTolerance?

@pigeon56

pigeon56 commented May 6, 2020

I'm running a little experiment on a few tests. I'll have to follow up tomorrow to let you know if it works for all of my tests, but so far I'm having luck with it.

For almost all screens:

    FBSnapshotVerifyView(UIImageView(image: croppedImage), identifier: identifier, perPixelTolerance: 6/256) // tried this number after seeing it higher up in this thread

This fixed most of my failures, but screens that use transparent elements still fail. For screens with transparent elements (where alpha < 1), this is actually passing for me:

    FBSnapshotVerifyView(UIImageView(image: croppedImage), identifier: identifier, overallTolerance: 0.02)

Again, these successes are based on a very small sample of tests that are currently passing. Nearly all screens are verified with the per-pixel value, as we don't use a whole lot of transparency. I'll update this post tomorrow when I know whether or not it's a success.
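
If it helps, the two strategies above can be wrapped in a single helper so the magic numbers live in one place. This is only a sketch: `verifySnapshot` and the `hasTransparency` flag are hypothetical names, and the tolerance values are the ones reported in this thread, not validated constants.

```swift
import FBSnapshotTestCase
import UIKit

extension FBSnapshotTestCase {
    /// Hypothetical helper wrapping the tolerances reported in this thread.
    /// Screens with alpha-blended content diff more between GPUs, so they get
    /// an overall tolerance; everything else uses the stricter per-pixel one.
    func verifySnapshot(
        _ view: UIView,
        identifier: String = "",
        hasTransparency: Bool = false,
        file: StaticString = #file,
        line: UInt = #line
    ) {
        if hasTransparency {
            FBSnapshotVerifyView(view, identifier: identifier, overallTolerance: 0.02, file: file, line: line)
        } else {
            FBSnapshotVerifyView(view, identifier: identifier, perPixelTolerance: 6 / 256, file: file, line: line)
        }
    }
}
```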

@pigeon56

pigeon56 commented May 6, 2020

> I'm running a little experiment on a few tests. [...] For almost all screens: FBSnapshotVerifyView(..., perPixelTolerance: 6/256). [...] For screens with transparent elements (where alpha < 1), this is actually passing for me: FBSnapshotVerifyView(..., overallTolerance: 0.02). [...]

@JustinDSN Unfortunately, I'm going to have to spend more time on this until I know exactly which values are going to work for us.

The per-pixel value of 6/256 is working for most screens, but the overall tolerance is giving me problems. It seems like screens that incorporate transparency or use the fullscreen modal that was new to iOS 13 must be compared using overall tolerance. I've seen the tests pass locally with an overall tolerance value as low as 0.02, but when I run the tests on CI, even 0.25 isn't enough for one particular screen. (For the others, it's fine.) I'm trying a value of 0.30 on CI right now to see what happens. Even if that passes, it's way too high a threshold to give me any sort of confidence that this is an effective and worthwhile comparison. (Update: it failed at 0.30 too. Just for that particular screen.)

To make things a little more fun, I do have a screen that definitely uses transparency and passes perfectly with a per pixel comparison. It's nearly identical to that one screen that I can't get a passing screenshot comparison on using per pixel or overall tolerance. (Mentioned in the previous paragraph.) Our CI build machine uses an older version of Xcode than I do - which isn't ideal, of course - so maybe that's part of it?

I'm still working with the same small sampling of tests. I'll scale this and update this thread again as soon as I figure out the floor of this overall tolerance threshold and apply it to more tests.

@samskiter

OOI, does going up from 6/256 also fix your issues without having to use an overall tolerance?

@pigeon56

pigeon56 commented May 13, 2020

@samskiter It fixed the problem in some places, but not all. Also, my results are based on images that I took from my machine as the reference images. I don't know if the results would change if a colleague had recently uploaded the reference image.

Despite some successes with the per-pixel screen comparison, I'm just not confident in these tests anymore. I've already spent too much time trying to salvage them, and it's just not sustainable. This is a huge letdown because they were really valuable to our organization. In fact, we caught a bug yesterday morning in a screen where the comparison is still working.

Short answer to your question: no. At the very least, screens where some form of transparency is used are definitely going to give you problems with a per-pixel tolerance measurement.

@samskiter

That's frustrating. We also depend heavily on these!

@tirodkar

Were folks able to find any solution or workaround for this?

@david6p2

david6p2 commented Aug 4, 2021

No solution to this problem yet? Just add perPixelTolerance: 6/256 wherever transparency, blur, etc. are used? That didn't work for me.

NickEntin added a commit to cashapp/AccessibilitySnapshot that referenced this issue Aug 15, 2023
This adds variants of the `SnapshotVerify*(...)` methods that allow for imprecise comparisons, i.e. using the `perPixelTolerance` and `overallTolerance` parameters.

 ## Why is this necessary?

Adding tolerances has been a highly requested feature (see #63) to work around some simulator changes introduced in iOS 13. Historically the simulator has supported CPU-based rendering, giving us very stable image representations of views that we can compare pixel-by-pixel. Unfortunately, with iOS 13, Apple changed the simulator to use exclusively GPU-based rendering, which means that the resulting snapshots may differ slightly across machines (see uber/ios-snapshot-test-case#109).

The negative effects of this were mitigated in iOSSnapshotTestCase by adding two tolerances to snapshot comparisons: a **per-pixel tolerance** that controls how close in color two pixels need to be to count as unchanged and an **overall tolerance** that controls what portion of pixels between two images need to be the same (based on the per-pixel calculation) for the images to be considered unchanged. Setting these tolerances to non-zero values enables engineers to record tests on one machine and run them on another (e.g. record new reference images on their laptop and then run tests on CI) without worrying about the tests failing due to differences in GPU rendering. This is great in theory, but from our testing we've found that even the lowest tolerance values that consistently handle GPU differences between machine types let through a significant number of visual regressions. In other words, there is no magic tolerance threshold that avoids false negatives based on GPU rendering and also avoids false positives based on minor visual regressions.

This is especially true for accessibility snapshots. To start, tolerances seem to be more reliable when applied to relatively small snapshot images, but accessibility snapshots tend to be fairly large since they include both the view and the legend. Additionally, the text in the legend can change meaningfully and reflect only a small number of pixel changes. For example, I ran a test of a full-screen snapshot on an iPhone 12 Pro with two columns of legend. Even an overall tolerance of only `0.0001` (0.01%) was enough to let through a regression where one of the elements lost its `.link` trait (represented by the text "Link." appended to the element's description in the snapshot). But this low a tolerance _wasn't_ enough to handle the GPU rendering differences between a MacBook Pro and a Mac Mini. This is a simplified example since it only uses `overallTolerance`, not `perPixelTolerance`, but we've found many similar situations arise even with the combination.

Some teams have developed infrastructure to allow snapshots to run on the same hardware consistently and have built a developer process around that infrastructure, but many others have accepted tolerances as a necessity today.

 ## Why create separate "imprecise" variants?

The simplest approach to adding tolerances would be adding the `perPixelTolerance` and `overallTolerance` parameters to the existing snapshot methods; however, I feel adding separate methods with an "imprecise" prefix is better in the long run. The naming is motivated by the idea that **it needs to be very obvious when what you're doing might result in unexpected/undesirable behavior**. In other words, when using one of the core snapshot methods, you should have extremely high confidence that a passing test means there are no regressions. When you use an "imprecise" variant, it's up to you to set your confidence levels according to your chosen tolerances. This is similar to the "unsafe" terminology around memory in the Swift API. You should generally feel very confident in the memory safety of your code, but any time you see "unsafe" it's a sign to be extra careful and not gather unwarranted confidence from the compiler.

Longer term, I'm hopeful we can find alternative comparison algorithms that allow for GPU rendering differences without opening the door to regressions. We can integrate these into the core snapshot methods as long as they do not introduce opportunities for regressions, or add additional comparison variants to iterate on different approaches.
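
As a mental model for what those two tolerances do, the comparison can be sketched roughly as follows. This is illustrative only, not the library's actual implementation; it assumes both images are the same size and given as interleaved 8-bit RGBA bytes, and `imagesMatch` is a hypothetical name.

```swift
import CoreGraphics

/// Illustrative two-tolerance comparison:
/// - perPixelTolerance: max per-channel difference (0...1) for a pixel to count as unchanged.
/// - overallTolerance: max fraction (0...1) of changed pixels for the images to count as matching.
func imagesMatch(reference: [UInt8], test: [UInt8],
                 perPixelTolerance: CGFloat, overallTolerance: CGFloat) -> Bool {
    precondition(reference.count == test.count && reference.count % 4 == 0)
    let pixelCount = reference.count / 4
    guard pixelCount > 0 else { return true }
    let channelTolerance = UInt8(perPixelTolerance * 255)
    var mismatched = 0
    for p in 0..<pixelCount {
        for c in 0..<4 {
            let i = p * 4 + c
            let delta = reference[i] > test[i] ? reference[i] - test[i] : test[i] - reference[i]
            if delta > channelTolerance {
                mismatched += 1
                break // this pixel already differs; move on to the next pixel
            }
        }
    }
    return CGFloat(mismatched) / CGFloat(pixelCount) <= overallTolerance
}
```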