Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test 85 failing under Windows in Debug configuration #2570

Closed
claudiofantacci opened this issue Oct 19, 2018 · 7 comments
Closed

Test 85 failing under Windows in Debug configuration #2570

claudiofantacci opened this issue Oct 19, 2018 · 7 comments

Comments

@claudiofantacci
Copy link
Contributor

claudiofantacci commented Oct 19, 2018

Here is the error:

85: Test command: C:\Users\cfantacci\GitHub\pcl\buildd\bin\test_registration.exe "C:/Users/cfantacci/GitHub/pcl/test/bun0.pcd" "C:/Users/cfantacci/GitHub/pcl/test/bun4.pcd" "C:/Users/cfantacci/GitHub/pcl/test/milk_color.pcd"
85: Test timeout computed to be: 9.99988e+06
85: [==========] Running 12 tests from 1 test case.
85: [----------] Global test environment set-up.
85: [----------] 12 tests from PCL
85: [ RUN      ] PCL.findFeatureCorrespondences
85: [       OK ] PCL.findFeatureCorrespondences (6 ms)
85: [ RUN      ] PCL.IterativeClosestPoint
85: [       OK ] PCL.IterativeClosestPoint (196 ms)
85: [ RUN      ] PCL.IterativeClosestPointWithRejectors
85: [       OK ] PCL.IterativeClosestPointWithRejectors (3641 ms)
85: [ RUN      ] PCL.JointIterativeClosestPoint
85: [       OK ] PCL.JointIterativeClosestPoint (24283 ms)
85: [ RUN      ] PCL.IterativeClosestPointNonLinear
85: [       OK ] PCL.IterativeClosestPointNonLinear (9110 ms)
85: [ RUN      ] PCL.IterativeClosestPoint_PointToPlane
85: [pcl::IterativeClosestPoint::computeTransformation] Not enough correspondences found. Relax your threshold parameters.
85: [pcl::IterativeClosestPoint::computeTransformation] Not enough correspondences found. Relax your threshold parameters.
85: [pcl::IterativeClosestPoint::computeTransformation] Not enough correspondences found. Relax your threshold parameters.
85: [pcl::IterativeClosestPoint::computeTransformation] Not enough correspondences found. Relax your threshold parameters.
85: [pcl::IterativeClosestPoint::computeTransformation] Not enough correspondences found. Relax your threshold parameters.
85: [       OK ] PCL.IterativeClosestPoint_PointToPlane (361 ms)
85: [ RUN      ] PCL.GeneralizedIterativeClosestPoint
85: [       OK ] PCL.GeneralizedIterativeClosestPoint (14878 ms)
85: [ RUN      ] PCL.GeneralizedIterativeClosestPoint6D
85: [       OK ] PCL.GeneralizedIterativeClosestPoint6D (9052 ms)
85: [ RUN      ] PCL.NormalDistributionsTransform
85: [       OK ] PCL.NormalDistributionsTransform (101659 ms)
85: [ RUN      ] PCL.SampleConsensusInitialAlignment
85: [       OK ] PCL.SampleConsensusInitialAlignment (47349 ms)
85: [ RUN      ] PCL.SampleConsensusPrerejective
85: [       OK ] PCL.SampleConsensusPrerejective (39103 ms)
85: [ RUN      ] PCL.PyramidFeatureHistogram
6/7 Test  #85: a_registration_test ..............***Failed  339.47 sec

a_registration_test

This error is quite subtle and please let me know whether my understanding of the error, and its possible fix, is correct or not.

The main point of failure is this line here

access.push_back (static_cast<size_t> (floor ((feature[dim_i] - dimension_range_target_[dim_i].first) / hist_levels[level].bin_step[dim_i])));

that pushes back into access variable an "unknown" number when feature is nan. Feature is never checked whether it could be nan or not and this may happen because of this snippet code from PPFEstiamtion feature used for the test (and maybe in user code also)
for (size_t index_i = 0; index_i < indices_->size (); ++index_i)
{
size_t i = (*indices_)[index_i];
for (size_t j = 0 ; j < input_->points.size (); ++j)
{
PointOutT p;
if (i != j)
{
if (//pcl::computePPFPairFeature
pcl::computePairFeatures (input_->points[i].getVector4fMap (),
normals_->points[i].getNormalVector4fMap (),
input_->points[j].getVector4fMap (),
normals_->points[j].getNormalVector4fMap (),
p.f1, p.f2, p.f3, p.f4))
{
// Calculate alpha_m angle
Eigen::Vector3f model_reference_point = input_->points[i].getVector3fMap (),
model_reference_normal = normals_->points[i].getNormalVector3fMap (),
model_point = input_->points[j].getVector3fMap ();
float rotation_angle = acosf (model_reference_normal.dot (Eigen::Vector3f::UnitX ()));
bool parallel_to_x = (model_reference_normal.y() == 0.0f && model_reference_normal.z() == 0.0f);
Eigen::Vector3f rotation_axis = (parallel_to_x)?(Eigen::Vector3f::UnitY ()):(model_reference_normal.cross (Eigen::Vector3f::UnitX ()). normalized());
Eigen::AngleAxisf rotation_mg (rotation_angle, rotation_axis);
Eigen::Affine3f transform_mg (Eigen::Translation3f ( rotation_mg * ((-1) * model_reference_point)) * rotation_mg);
Eigen::Vector3f model_point_transformed = transform_mg * model_point;
float angle = atan2f ( -model_point_transformed(2), model_point_transformed(1));
if (sin (angle) * model_point_transformed(2) < 0.0f)
angle *= (-1);
p.alpha_m = -angle;
}
else
{
PCL_ERROR ("[pcl::%s::computeFeature] Computing pair feature vector between points %u and %u went wrong.\n", getClassName ().c_str (), i, j);
p.f1 = p.f2 = p.f3 = p.f4 = p.alpha_m = std::numeric_limits<float>::quiet_NaN ();
output.is_dense = false;
}
}
// Do not calculate the feature for identity pairs (i, i) as they are not used
// in the following computations
else
{
p.f1 = p.f2 = p.f3 = p.f4 = p.alpha_m = std::numeric_limits<float>::quiet_NaN ();
output.is_dense = false;
}
output.points[index_i*input_->points.size () + j] = p;
}
}
}

variable access is then used to evaluate an index of a vector and since it results in nan its use is undefined (or platform dependent since it will be converted to a size_t).
I made some tests on some platform and apparently (take this with care, I'm not 100% sure) accessing a vector with nan results in accessing element at 0 (that is, the cast to size_t results in 0).
It turns out that in Debug configuration under Windows, the index evaluated using access variable resulted in a very large number, not nan. As a consequence I got the out-of-bound access fualt from which I started the investigation.

Please note that if access is an arbitrary large number (as in Debug mode under Windows), the following line

vector_position += access[i] * dim_accumulator;

that translates access to a an index sometimes go over the maximum allowed representation of the type and, depending on the system, may result in a 0, thus masking the unwanted behaviour.

The fix I propose, assuming that the expected behavior is to have nan features mapped to histogram bin 0, is to change the following lines

for (size_t dim_i = 0; dim_i < nr_dimensions; ++dim_i)
access.push_back (static_cast<size_t> (floor ((feature[dim_i] - dimension_range_target_[dim_i].first) / hist_levels[level].bin_step[dim_i])));

in

  for (size_t dim_i = 0; dim_i < nr_dimensions; ++dim_i)
  {
    if (!pcl_isnan(feature[dim_i]))
      access.push_back (static_cast<size_t>(std::floor((feature[dim_i] - dimension_range_target_[dim_i].first) / hist_levels[level].bin_step[dim_i])));
    else
      access.push_back (static_cast<size_t>(0));
  }

to force the inex to be 0 for nans. By doing this the tests then run properly.

See further comments here #2399

@SergioRAgostinho
Copy link
Member

SergioRAgostinho commented Oct 19, 2018

I'm having trouble connecting all the dots. I need to have a look at the stack trace to be able to guide myself.

Regarding your proposal, it feels to me like we're hiding the problem under the carpet. I'm not fully into the problem like you, but intuitively it feels the problem needs to be fixed somewhere else. Inside the at method I would simply throw if feature[dim_i] is nan.

@claudiofantacci
Copy link
Contributor Author

claudiofantacci commented Oct 19, 2018

I'm having trouble connecting all the dots. I need to have a look at the stack trace to be able to guide myself.

Regarding your proposal, it feels to me like we're hiding the problem under the carpet. I'm not fully into the problem like you, but intuitively it feels the problem needs to be fixed somewhere else. Inside the at method I would simply throw I feature[dim_i] is nan.

Sorry for the confusion, I updated my initial post and hopefully it is a bit clearer. Please go over it again 😄 To be fair, I'm not sure my fix is correct becuse I'm unuware of the expected behaviour.

I agree that you should also have a look at the stack trace.

My intuition is the following:

  1. Assuming the tests are correctly implemented, if I map nan features in 0 the tests pass in both Release and Debug configuration. I tried skipping nan feature alltogether and the tests failed in both Release and Debug mode.
  2. Assuming that nan features should be ingored/skipped, then we need to re-write the tests and/or check the data being used inside the tests, becuase they fail as the tolerance are not respected.

claudiofantacci added a commit to claudiofantacci/pcl that referenced this issue Oct 19, 2018
@SergioRAgostinho SergioRAgostinho added this to the pcl-1.9.0 milestone Oct 21, 2018
@SergioRAgostinho
Copy link
Member

SergioRAgostinho commented Oct 21, 2018

Just want to quickly report my partial findings.

The first occurrence of nans in features happens on line 868

PyramidFeatureHistogram<PPFSignature>::Ptr pyramid_source (new PyramidFeatureHistogram<PPFSignature> ()),
pyramid_target (new PyramidFeatureHistogram<PPFSignature> ());
pyramid_source->setInputCloud (ppf_signature_source);
pyramid_source->setInputDimensionRange (dim_range_input);
pyramid_source->setTargetDimensionRange (dim_range_target);
pyramid_source->compute ();

The first and last points of the input cloud ppf_signature_source and ppf_signature_target are set to NaN. The expressions below are being evaluated at line 298

for (size_t feature_i = 0; feature_i < input_->points.size (); ++feature_i)
{
std::vector<float> feature_vector;
convertFeatureToVector (input_->points[feature_i], feature_vector);
addFeature (feature_vector);
}

(lldb) p input_->points.__begin_[0]
(pcl::PPFSignature) $13 = (f1 = NaN, f2 = NaN, f3 = NaN, f4 = NaN, alpha_m = NaN)
(lldb) p input_->points.__begin_[1]
(pcl::PPFSignature) $14 = (f1 = -0.0932385772, f2 = 0.00952175259, f3 = 0.0544939041, f4 = 0.00722551253, alpha_m = 2.07673001)
(lldb) p input_->points.__begin_[157607]
(pcl::PPFSignature) $15 = (f1 = 0.00231693685, f2 = 0.00789476931, f3 = -0.0350695662, f4 = 0.0119710751, alpha_m = 0.39232108)
(lldb) p input_->points.__begin_[157608]
(pcl::PPFSignature) $16 = (f1 = NaN, f2 = NaN, f3 = NaN, f4 = NaN, alpha_m = NaN)

I was able to verify the same with ppf_signature_target.

My next steps would be to understand if:

  1. these points can be filtered out safely. ppf_signature_source and ppf_signature_target don't have the same number of points so it might work.
  2. understand the reason why it is apparently always the first and last point. I haven't confirmed if there are other points also set to NaN, but the first and last are consistently there.
  3. does it make sense to let pyramid signature output NaNs. Is it trying to ensure it has the same number of points as the input cloud?

@claudiofantacci
Copy link
Contributor Author

@SergioRAgostinho we are synchronized on the stack trace 👍

I also noticed that the first and last element are always NaN, but I fear is something related to the test or on a common use case of the feature (evaluate the feature on a PointCloud without providing indices). If you look at how NaN are added

for (size_t index_i = 0; index_i < indices_->size (); ++index_i)
{
size_t i = (*indices_)[index_i];
for (size_t j = 0 ; j < input_->points.size (); ++j)
{
PointOutT p;
if (i != j)
{
if (//pcl::computePPFPairFeature
pcl::computePairFeatures (input_->points[i].getVector4fMap (),
normals_->points[i].getNormalVector4fMap (),
input_->points[j].getVector4fMap (),
normals_->points[j].getNormalVector4fMap (),
p.f1, p.f2, p.f3, p.f4))
{
// Calculate alpha_m angle
Eigen::Vector3f model_reference_point = input_->points[i].getVector3fMap (),
model_reference_normal = normals_->points[i].getNormalVector3fMap (),
model_point = input_->points[j].getVector3fMap ();
float rotation_angle = acosf (model_reference_normal.dot (Eigen::Vector3f::UnitX ()));
bool parallel_to_x = (model_reference_normal.y() == 0.0f && model_reference_normal.z() == 0.0f);
Eigen::Vector3f rotation_axis = (parallel_to_x)?(Eigen::Vector3f::UnitY ()):(model_reference_normal.cross (Eigen::Vector3f::UnitX ()). normalized());
Eigen::AngleAxisf rotation_mg (rotation_angle, rotation_axis);
Eigen::Affine3f transform_mg (Eigen::Translation3f ( rotation_mg * ((-1) * model_reference_point)) * rotation_mg);
Eigen::Vector3f model_point_transformed = transform_mg * model_point;
float angle = atan2f ( -model_point_transformed(2), model_point_transformed(1));
if (sin (angle) * model_point_transformed(2) < 0.0f)
angle *= (-1);
p.alpha_m = -angle;
}
else
{
PCL_ERROR ("[pcl::%s::computeFeature] Computing pair feature vector between points %u and %u went wrong.\n", getClassName ().c_str (), i, j);
p.f1 = p.f2 = p.f3 = p.f4 = p.alpha_m = std::numeric_limits<float>::quiet_NaN ();
output.is_dense = false;
}
}
// Do not calculate the feature for identity pairs (i, i) as they are not used
// in the following computations
else
{
p.f1 = p.f2 = p.f3 = p.f4 = p.alpha_m = std::numeric_limits<float>::quiet_NaN ();
output.is_dense = false;
}
output.points[index_i*input_->points.size () + j] = p;
}
}
}

it depends on the content of indices that cannot be known a-priori. As a result (by my understanding), there could be multiple NaN in the feature.

Let me reply in order:

  • these points can be filtered out safely. ppf_signature_source and ppf_signature_target don't have the same number of points so it might work.

I tried skipping NaN altogether in L298 of

for (size_t feature_i = 0; feature_i < input_->points.size (); ++feature_i)
{
std::vector<float> feature_vector;
convertFeatureToVector (input_->points[feature_i], feature_vector);
addFeature (feature_vector);
}

but the tests failed because the tolerance on the recognition tests were not respected.

My fear is that dim_accumulator in the following lines

for (int i = static_cast<int> (access.size ()) - 1; i >= 0; --i)
{
vector_position += access[i] * dim_accumulator;
dim_accumulator *= hist_levels[level].bins_per_dimension[i];
}
return hist_levels[level].hist[vector_position];

is accomulated differently and something in the algorithm is altered for it to work properly.

That is the reason why I forced NaN to be pushed in bin 0, because it is actually what happens in the current implementation.

  • understand the reason why it is apparently always the first and last point. I haven't confirmed if there are other points also set to NaN, but the first and last are consistently there.

My initial reasoning suggests that there may be other NaN points.

  • does it make sense to let pyramid signature output NaNs. Is it trying to ensure it has the same number of points as the input cloud?

I should read the paper of the subject feature to understand what is going on to propose a fix and some more detailed reasoning about this 😭

Anyway, I'll try to skip again the NaN points to see whether I did something wrong during my first attempts and let you know. It is important however that we agree that there is an issue in the code.

claudiofantacci added a commit to claudiofantacci/pcl that referenced this issue Oct 26, 2018
@claudiofantacci claudiofantacci mentioned this issue Oct 26, 2018
16 tasks
@SergioRAgostinho SergioRAgostinho modified the milestones: pcl-1.9.0, pcl-1.10.0 Nov 6, 2018
@kunaltyagi kunaltyagi modified the milestones: pcl-1.10.0, pcl-1.11.0 Feb 21, 2020
@kunaltyagi kunaltyagi modified the milestones: pcl-1.11.0, pcl-1.11.1 May 9, 2020
@stale
Copy link

stale bot commented Jun 9, 2020

Marking this as stale due to 30 days of inactivity. It will be closed in 7 days if no further activity occurs.

@stale stale bot added the status: stale label Jun 9, 2020
@kunaltyagi
Copy link
Member

I made some tests on some platform and apparently (take this with care, I'm not 100% sure) accessing a vector with nan results in accessing element at 0 (that is, the cast to size_t results in 0).

Cast of floating NaN and Inf to integer as defined as UB. That means, we have to check the cast location and manually convert it to 0 in order to have consistent performance across platforms.

Though, @claudiofantacci mentioned trying to shoehorn NaNs into bin 0 already

@stale stale bot removed the status: stale label Jun 9, 2020
@kunaltyagi kunaltyagi added kind: todo Type of issue needs: testing Specify why not closed/merged yet labels Jun 16, 2020
@kunaltyagi kunaltyagi removed this from the pcl-1.11.1 milestone Jun 16, 2020
@mvieth
Copy link
Member

mvieth commented Aug 15, 2021

This problem was fixed by pull request #4711

@mvieth mvieth closed this as completed Aug 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants