Fix memory fault during scaling of singular matrix #205

chrhansk · 2024-06-14T21:43:53Z

Should fix #200

jfowkes · 2024-06-15T04:52:55Z

@chrhansk unfortunately this change seems to break the main SSIDS test on all platforms.

mjacobse · 2024-06-15T07:16:55Z

Function hungarian_match is not only called from the scaling API, but also for the matching-based METIS ordering of SSIDS. Function mo_match in match_order.f90 expects unmatched entries to be signaled by negative values:

spral/src/match_order.f90

Line 571 in 077bd63

if (cperm(i) .lt. 0) then

So I would suggest not to change the behaviour of hungarian_match, but instead fix how the unsymmetric scaling code deals with what hungarian_match returns.

Alternatively one could tackle

spral/src/match_order.f90

Lines 10 to 13 in 077bd63

    
   ! FIXME: At some stage replace call to mo_match() with call to
   ! a higher level routine from spral_scaling instead (NB: have to cope with
   ! fact we are currently expecting a full matrix, even if it means 2x more log
   ! operations)

but that might require quite a bit of refactoring?
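To illustrate the convention issue described at the start of this comment, here is a minimal, self-contained sketch. It is not SPRAL code: the arrays `cperm_neg` and `cperm_zero` are hypothetical matching results, and only the `.lt. 0` test mirrors the quoted line from mo_match.

```fortran
program unmatched_convention
   implicit none
   ! Hypothetical matching results for 5 columns; columns 2 and 4 are unmatched.
   integer, parameter :: n = 5
   integer :: cperm_neg(n), cperm_zero(n)
   integer :: i, nunmatched_neg, nunmatched_zero

   cperm_neg  = (/ 3, -1, 1, -1, 2 /)   ! unmatched signaled by negative values
   cperm_zero = (/ 3,  0, 1,  0, 2 /)   ! unmatched signaled by zero

   nunmatched_neg = 0
   nunmatched_zero = 0
   do i = 1, n
      ! Same style of test as the quoted line in mo_match
      if (cperm_neg(i) .lt. 0) nunmatched_neg = nunmatched_neg + 1
      if (cperm_zero(i) .lt. 0) nunmatched_zero = nunmatched_zero + 1
   end do

   print *, 'unmatched found with negative convention:', nunmatched_neg
   print *, 'unmatched found with zero convention:    ', nunmatched_zero
end program unmatched_convention
```

With the negative convention the test finds both unmatched columns; with the zero convention it finds none, which is consistent with a consumer that tests for negative values breaking once the producer switches to zeros.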

chrhansk · 2024-06-15T12:06:15Z

I was not aware of that problem. I have now tried to zero out the problematic entries manually in the postprocessing function. There still seems to be a problem in ssmfe_ciface_test, though. The problem is that I cannot reproduce the error on my system (the tests pass without any memory issues). Do you have any ideas?

jfowkes · 2024-06-15T15:20:28Z

Many thanks @chrhansk, the intermittent SSMFE C test failure is #204 (nothing to do with your changes), which annoyingly I also cannot reproduce on my system, making it very difficult to debug and fix. @mjacobse could you review?

mjacobse

Thanks, this is certainly an improvement over the current behavior, which causes a segfault. However, I am not sure that doing the negative-to-zero conversion in the postprocessing function match_postproc is ideal:

- It gives unnecessary responsibility to match_postproc and requires changing the match argument to intent(inout), which to my mind makes the calling contract less clear.
- When using the auction method, the conversion in match_postproc is done redundantly for a second time after

spral/src/scaling.f90

Lines 1487 to 1488 in 662c7ac

! We expect unmatched columns to have match(col) = 0

where(match(:) .eq. -1) match(:) = 0

already did it

Instead, I think it would be better to do the conversion in hungarian_wrapper before calling match_postproc. This is pretty minor and invisible to users though, so perhaps not relevant enough on its own.

More relevant though is that the way unmatched entries are returned from hungarian_scale_sym and hungarian_scale_unsym is now inconsistent. The unsymmetric version will now return 0 while the symmetric one continues to return negative entries, since the singular symmetric case does not call match_postproc as seen here:

spral/src/scaling.f90

Lines 679 to 686 in 662c7ac

    
   if ((.not. sym) .or. (inform%matched .eq. n)) then ! Unsymmetric or symmetric and full rank
      ! Note that in this case m=n
      rscaling(1:m) = dualu(1:m)
      cscaling(1:n) = dualv(1:n) - cmax(1:n)
      call match_postproc(m, n, ptr, row, val, rscaling, cscaling, &
           inform%matched, match, inform%flag, inform%stat)
      return
   end if

Curiously, the current documentation incorrectly claims to return zero for unmatched in both cases (hungarian_scale_sym and hungarian_scale_unsym), perhaps copy-pasted from the description of the auction method for which it is correct. Options I can think of:

1. Do the conversion from negative to zero on a temporary copy of match. That way, both hungarian_scale_sym and hungarian_scale_unsym continue to return negative entries. With this option, the incorrect documentation for both cases should be fixed (perhaps in a separate issue).
2. Accept this inconsistency. With this option, the incorrect documentation for the symmetric case should be fixed (perhaps in a separate issue).
3. Make the symmetric case work with and return zeros for unmatched entries too. The necessary changes should be limited to hungarian_wrapper and should work well with the changes for the unsymmetric case (when done in hungarian_wrapper, which would add a major reason for doing so to the above). Because of that, it might make sense to change both at once instead of in a separate issue.

The latter two options would break potential users who rely on the negative entries (despite the wrong documentation) or would like to do so in the future, but they would introduce consistency with how the auction method returns the matching and with the documented behavior. Not sure what the best call is here.
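The temporary-copy option can be sketched as follows. This is only an illustration under stated assumptions: `postproc_stub` is a hypothetical stand-in for match_postproc that merely reads `match`, and the input array is made up rather than produced by hungarian_match. The point is that the postprocessing argument can stay intent(in) while the caller's array keeps its negative entries.

```fortran
module postproc_sketch
   implicit none
contains
   ! Hypothetical stand-in for match_postproc: it only reads match and
   ! counts the entries it treats as unmatched (match(i) == 0).
   subroutine postproc_stub(m, match, nunmatched)
      integer, intent(in) :: m
      integer, dimension(m), intent(in) :: match
      integer, intent(out) :: nunmatched
      nunmatched = count(match(1:m) .eq. 0)
   end subroutine postproc_stub
end module postproc_sketch

program temporary_copy
   use postproc_sketch
   implicit none
   integer, parameter :: m = 4
   integer :: match(m), match_tmp(m), nunmatched

   match = (/ 2, -1, 1, -1 /)   ! made-up result, negative = unmatched

   ! Convert on a temporary copy so that match itself is left untouched.
   match_tmp = match
   where (match_tmp(:) .lt. 0) match_tmp(:) = 0
   call postproc_stub(m, match_tmp, nunmatched)

   print *, 'unmatched seen by postprocessing:', nunmatched
   print *, 'match returned to the caller:    ', match
end program temporary_copy
```

The cost of this variant is one extra integer array of length m per call, which is the trade-off against changing what the routines return.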

mjacobse · 2024-06-16T08:42:47Z

tests/scaling.f90

+
+  allocate(a%ptr(n+1))
+  allocate(a%row(nz), a%val(nz))
+  allocate(rscaling(m), cscaling(n), match(n))


Should be match(m), not match(n)

chrhansk · 2024-06-16

I am not sure about that. The docs state that it should be n (similarly for all matching algorithms). To be honest, I am not exactly sure why, but calling it with m in the unit test causes a segfault on my machine.

mjacobse · 2024-06-16

Hm, you are right, I do get an invalid write with valgrind for match(m). But with match(n), the last two entries in the example of that test are left uninitialized, which surely is not intended either? Unless the idea is to use the info struct to find out up to where the values are initialized. Though the signature for hungarian_scale_unsym does use m:

spral/src/scaling.f90

Line 188 in 077bd63

integer, dimension(m), optional, intent(out) :: match

There seems to be another (unrelated?) issue here... :(

mjacobse · 2024-06-16

But the existing random unsymmetric tests also use an overallocated match(maxn), i.e. leading not to invalid writes but only to uninitialized return values, so I agree with doing the same here. We can deal with that in a separate issue.

mjacobse · 2024-06-18

It turns out that match(m) is correct, but that this happened to reveal a secondary bug, which should probably be fixed before applying the changes proposed in this PR; see #200 (comment).

mjacobse · 2024-06-16T08:58:50Z

src/scaling.f90

@@ -1619,7 +1619,7 @@ subroutine match_postproc(m, n, ptr, row, val, rscaling, cscaling, nmatch, &
   real(wp), dimension(m), intent(inout) :: rscaling
   real(wp), dimension(n), intent(inout) :: cscaling
   integer, intent(in) :: nmatch
-   integer, dimension(m), intent(in) :: match
+   integer, dimension(m), intent(inout) :: match


Could be avoided by doing the conversion from negative to zero entries at the callsite instead of here, see detailed comment

mjacobse · 2024-06-18T20:54:16Z

tests/scaling.f90

+  a%n = n
+  a%m = m
+
+  a%ptr(1:n+1)          = (/   1,   3,   5,   5,   5,   7 /)


Should this be 1, 3, 5, 6, 6, 7 as in #200? As it stands, this matrix would have a duplicate entry.
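A quick way to see the effect of the two column-pointer arrays is a duplicate check on the CSC structure. The sketch below is an assumption-laden illustration: the `row` array and the row count `m = 3` are hypothetical (the actual test data is not reproduced in this comment), and `has_duplicates` is a helper written for this example, not a SPRAL routine.

```fortran
module csc_check
   implicit none
contains
   ! Returns .true. if any column of a CSC matrix lists the same row twice.
   logical function has_duplicates(m, n, ptr, row)
      integer, intent(in) :: m, n
      integer, intent(in) :: ptr(n+1)
      integer, intent(in) :: row(:)
      logical :: seen(m)
      integer :: j, k
      has_duplicates = .false.
      do j = 1, n
         seen(:) = .false.
         ! Entries of column j are row(ptr(j)) .. row(ptr(j+1)-1)
         do k = ptr(j), ptr(j+1) - 1
            if (seen(row(k))) then
               has_duplicates = .true.
               return
            end if
            seen(row(k)) = .true.
         end do
      end do
   end function has_duplicates
end module csc_check

program ptr_example
   use csc_check
   implicit none
   integer, parameter :: m = 3, n = 5
   ! Hypothetical row indices, chosen so that entries 5 and 6 share a row.
   integer :: row(6) = (/ 1, 2, 1, 2, 3, 3 /)

   print *, 'ptr = 1,3,5,5,5,7 ->', has_duplicates(m, n, (/ 1, 3, 5, 5, 5, 7 /), row)
   print *, 'ptr = 1,3,5,6,6,7 ->', has_duplicates(m, n, (/ 1, 3, 5, 6, 6, 7 /), row)
end program ptr_example
```

With the first pointer array, entries 5 and 6 both land in column 5, so a repeated row index there is a duplicate; with the second, entry 5 belongs to column 3 and entry 6 to column 5, so the same row indices are fine.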

chrhansk · 2024-06-21T13:53:17Z

Looking back at this, in particular with regard to #205: what is the proposed solution that you are converging on? As I understand #200 (comment), the lines setting the unmatched rows to negative values should be commented out, which to me implies that the values of unmatched rows should be set to zero instead. Am I correct in this regard?

Such a change would necessitate corresponding changes in the implementation of SSIDS, as mentioned here: #205 (comment)

Should I make those changes to the scaling and within SSIDS?

jfowkes · 2024-06-21T13:58:56Z

Yes, we have come to the conclusion that whilst negative row indices (to signal which of the rows are unmatched) make sense for square matrices, they do not make sense for general rectangular matrices. As far as we can tell, SSIDS does not make use of the negative row indices themselves but merely checks whether a row is unmatched, so in theory changing the values of unmatched rows to zero should work fine, provided we update SSIDS to check for zero rather than negative values.

chrhansk · 2024-06-21T18:55:25Z

I adjusted the scaling accordingly and added a test for the symmetric singular case. There is however still the problematic part here mentioned in #200. In this case it causes a segfault (if the match array has a size of m < n), so this needs to be addressed.

jfowkes · 2024-06-22T06:37:34Z

Many thanks @chrhansk, the suggestion has been to comment this problematic section out as it should no longer be required if we now return zero for unmatched entries. I guess it's a case of trying that and seeing if anything breaks?

mjacobse · 2024-06-22T09:20:42Z

Basically the suggestion was mjacobse@b815fac, and indeed it seems to be all that's needed to fix all issues at once. Personally, I would like to see randomly generated singular tests to confirm, since all the random tests right now are nonsingular.
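One possible way to generate such a test matrix is sketched below. This is not how the existing SPRAL random tests work (their generator is not shown in this thread); it is a standalone illustration that builds a random CSC pattern and forces structural singularity by leaving one randomly chosen column empty. All names are made up for the example.

```fortran
program random_singular_csc
   implicit none
   integer, parameter :: wp = kind(0d0)
   integer, parameter :: m = 6, n = 6
   integer :: ptr(n+1), row(m*n)
   real(wp) :: val(m*n), r
   integer :: i, j, nz, empty_col

   ! Choose one column at random and leave it empty. A matrix with an
   ! empty column is structurally singular regardless of the other entries.
   call random_number(r)              ! 0 <= r < 1
   empty_col = 1 + int(r * n)         ! in 1..n

   nz = 0
   do j = 1, n
      ptr(j) = nz + 1
      if (j == empty_col) cycle
      do i = 1, m
         call random_number(r)
         ! Always keep the diagonal so every other column is nonempty.
         if (i == j .or. r < 0.3_wp) then
            nz = nz + 1
            row(nz) = i
            call random_number(r)
            val(nz) = 1.0_wp + r      ! nonzero value in [1, 2)
         end if
      end do
   end do
   ptr(n+1) = nz + 1

   print *, 'empty column:', empty_col, ' nz:', nz
   if (ptr(empty_col+1) /= ptr(empty_col)) error stop 'column not empty'
end program random_singular_csc
```

A test built this way would then pass ptr, row and val to the scaling routines and check that the number of matched entries is less than n and that no memory fault occurs.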

jfowkes · 2024-06-24T07:36:17Z

Indeed many thanks! @chrhansk I suggest we apply mjacobse@b815fac to this PR and add a randomly generated singular matrix test to verify that things don't break. We think this should be all that is required now.

chrhansk · 2024-06-24T11:03:34Z

Great, I appreciate your effort.

jfowkes added the bug label Jun 15, 2024

jfowkes requested a review from mjacobse June 15, 2024 15:20

mjacobse requested changes Jun 16, 2024

mjacobse reviewed Jun 18, 2024

mjacobse mentioned this pull request Jun 19, 2024

Memory fault during unsymmetric scaling of singular matrix with Hungarian algorithm #200

Open

chrhansk added 3 commits June 21, 2024 15:34

Fix memory fault during scaling of singular matrix

1514595

Manually zero out negative matching entries

8363779

Fix wrong name of test case

8b27c63

chrhansk force-pushed the feature-singular-scaling branch from a5e5bb2 to 8b27c63 Compare June 21, 2024 13:54

chrhansk added 4 commits June 21, 2024 17:19

Fix up matrix entries in example

2b4d3ad

Avoid oversized arrays in singular scaling test

2676d78

Zero out negative entries only for unsymmetric scaling

80abd69

Add example for scaling of a symmetric singular matrix

cd86051

jfowkes mentioned this pull request Jun 25, 2024

something awry in documentation for spral_scaling #217

Open
