
Speed-up the creation of a zero matrix of type Matrix_modn_dense_template #36093

Merged 10 commits into sagemath:develop on Aug 27, 2023

Conversation


@marizee marizee commented Aug 16, 2023

Fixes #35961

For example, creating the zero matrix over GF(9001) is currently much slower than over the rationals:

sage: space = MatrixSpace(GF(9001), 50, 50)
sage: %time zmat = space.zero_matrix()
CPU times: user 131 µs, sys: 16 µs, total: 147 µs
Wall time: 150 µs
sage: space = MatrixSpace(GF(9001), 10000, 10000)
sage: %time zmat = space.zero_matrix()
CPU times: user 1.54 s, sys: 208 ms, total: 1.75 s
Wall time: 1.75 s
sage: space = MatrixSpace(QQ, 50, 50)
sage: %time zmat = space.zero_matrix()
CPU times: user 85 µs, sys: 9 µs, total: 94 µs
Wall time: 97 µs
sage: space = MatrixSpace(QQ, 10000, 10000)
sage: %time zmat = space.zero_matrix()
CPU times: user 167 ms, sys: 417 ms, total: 584 ms
Wall time: 582 ms

This can be improved by treating the entries iterator as in the sparse=True case in the constructor __init__, so that only nonzero entries are written:

sage: space = MatrixSpace(GF(9001), 50,50)
sage: %time zmat = space.zero_matrix()
CPU times: user 104 µs, sys: 9 µs, total: 113 µs
Wall time: 114 µs
sage: space = MatrixSpace(GF(9001), 10000, 10000)
sage: %time zmat = space.zero_matrix()
CPU times: user 135 µs, sys: 14 µs, total: 149 µs
Wall time: 152 µs
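The gain comes from touching only the nonzero entries instead of all n² positions. A rough Python model of the two strategies (the function names are hypothetical illustrations, not Sage's actual code, which does this in Cython):

```python
# Dense vs sparse-style initialization of a mostly-zero matrix,
# stored row-major in a flat list. Hypothetical model, not Sage code.

def init_dense(nrows, ncols, entries):
    """Visit every position, even though almost all of them are zero."""
    data = [0] * (nrows * ncols)
    for i in range(nrows):
        for j in range(ncols):
            data[i * ncols + j] = entries.get((i, j), 0)
    return data

def init_sparse(nrows, ncols, entries):
    """Memory is already zeroed; touch only the nonzero positions."""
    data = [0] * (nrows * ncols)
    for (i, j), v in entries.items():
        data[i * ncols + j] = v
    return data

diag = {(i, i): 1 for i in range(100)}
assert init_dense(100, 100, diag) == init_sparse(100, 100, diag)
```

For a zero or identity matrix, `entries` has at most n nonzero items, so the sparse-style loop does O(n) writes instead of O(n²), matching the timings above.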

However, this requires the memory to be calloc'ed, since zero_matrix will then only fill the diagonal coefficients of the new matrix. To avoid unnecessary work for other methods that call the constructor __cinit__ and do not need the memory zeroed beforehand, a new parameter zeroed_alloc chooses between calling check_calloc and check_allocarray (the malloc-based allocation currently implemented).
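As a sketch of the idea, here is a hypothetical Python analogue of the zeroed_alloc switch (the real code is the Cython __cinit__ using check_calloc / check_allocarray; the class and attribute names below are invented for illustration):

```python
# Hypothetical Python model of the zeroed_alloc parameter; not Sage code.

class DenseBuffer:
    def __init__(self, nrows, ncols, zeroed_alloc=True):
        n = nrows * ncols
        if zeroed_alloc:
            # check_calloc analogue: every entry guaranteed to be zero,
            # so zero_matrix only needs to write the diagonal
            self.entries = [0] * n
        else:
            # check_allocarray (malloc) analogue: contents unspecified;
            # the caller promises to overwrite every entry itself
            self.entries = [None] * n

zeroed = DenseBuffer(4, 4, zeroed_alloc=True)
print(all(e == 0 for e in zeroed.entries))  # prints True
```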

This also substantially improves identity_matrix() from special.py (first timing before this PR, second after; here field = GF(9001)):

sage: %timeit dmat = identity_matrix(GF(9001), 10000, sparse=False)
1.81 s ± 3.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
sage: %timeit imat = identity_matrix(field, 10000,10000)
2.53 ms ± 5.24 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

📝 Checklist

  • The title is concise, informative, and self-explanatory.
  • The description explains in detail what this PR is about.
  • I have linked a relevant issue or discussion.
  • I have created tests covering the changes.
  • I have updated the documentation accordingly.

⌛ Dependencies


vneiger commented Aug 17, 2023

Thank you for this. At first sight, it looks good. I will run some more benchmarks to make sure there is no performance regression due to the calloc.

Could you please check if there are other places where zeroed_alloc=False would make sense? Two such places were noted in #35961 (comment)

There are currently two testing errors. Could you please check if they are due to this PR?

sage -t --long --random-seed=286735480429121101562228604801325644303 sage/schemes/curves/projective_curve.py  # 1 doctest failed
sage -t --long --random-seed=286735480429121101562228604801325644303 sage/schemes/curves/zariski_vankampen.py  # 1 doctest failed


vneiger commented Aug 18, 2023

In terms of performance, this looks good to me.

Some more comments. Creating the zero matrix is always significantly faster than before. Regarding performance regressions for __neg__ and the like, where zeroed_alloc plays a role: there is sometimes a very slight performance decrease (due to the branching in __cinit__?), but this is largely compensated by the fact that __copy__ is much faster (sometimes by more than a factor of 2) with this branching than with a check_calloc followed by a memcpy.
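The __copy__ trade-off can be felt even from pure Python: copying into a pre-zeroed buffer writes the memory twice, while a single-pass copy writes it once. This is only a loose analogy for calloc+memcpy versus malloc+memcpy (Python has no uninitialized allocation), not a measurement of Sage itself:

```python
# Loose Python analogy for the calloc+memcpy vs malloc+memcpy copy cost.
import timeit

n = 8_000_000
src = bytes(bytearray(range(256)) * (n // 256))

def copy_two_pass():
    buf = bytearray(n)   # zero-fill first (calloc-like)
    buf[:] = src         # then overwrite everything (memcpy)
    return buf

def copy_one_pass():
    return bytearray(src)  # allocate and copy in a single pass

assert copy_two_pass() == copy_one_pass()
print("two-pass:", timeit.timeit(copy_two_pass, number=20))
print("one-pass:", timeit.timeit(copy_one_pass, number=20))
```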


marizee commented Aug 18, 2023

I have added the missing zeroed_alloc=False.

Regarding the testing errors, they do not seem to come from this PR; I have seen them in other PRs.


vneiger commented Aug 19, 2023

This looks good to me. (Looking closer at the new __init__ revealed a bug in the conversion; I filed it as #36104, and I think it should be treated separately from this PR.)

Some more details about performance. There is no need to comment on the significant speedups from the sparse iterator (factors of several hundreds or even thousands; cf. the timings above for creating the zero or identity matrix). The main question is whether the added branching in __cinit__ could have a negative impact on situations where we initialize a matrix without going through __init__, e.g. _neg_, _add_, _lmul_, all the calls to new_matrix, etc. I timed some of them, focusing on those where there is little computation besides the memory allocation. There is a very small performance penalty (< 5%), probably due to the branching in __cinit__. However, I do not see a better solution:

  • To avoid the branching, one could always calloc in __cinit__ even when a malloc would suffice; but this wastes quite some time at mid-range matrix dimensions (say around 500 to 4000) for operations that mainly consist of __cinit__, e.g. __copy__ or _neg_ (losing more than a factor of 2 in some cases).
  • One could keep the current malloc and instead use a memset at the beginning of __init__; this turns out to be very slow, including for the creation of the zero matrix, which itself becomes several times slower.

Timings are given below for three variants (current sage 10.1.rc0, proposed solution in this PR, and version using always calloc), using the following piece of code:

field = GF(9001)
small_spaces = [MatrixSpace(field, dim, dim) for dim in [2,5,10,20,50,100]]
medium_spaces = [MatrixSpace(field, dim, dim) for dim in [200,500]]
large_spaces = [MatrixSpace(field, dim, dim) for dim in [1000, 2000, 5000]]
small_timeit_args = dict(number=20000, repeat=20, seconds=True)
medium_timeit_args = dict(number=5000, repeat=8, seconds=True)
large_timeit_args = dict(number=100, repeat=3, seconds=True)
spaces = [small_spaces, medium_spaces, large_spaces]
timeit_args = [small_timeit_args, medium_timeit_args, large_timeit_args]

print("dim\tneg\t\tlmul\t\tcp\t\tadd\t\tsub\t\tmod\t\tred")
for i in range(3):
    ti_args = timeit_args[i]
    smaller_ti_args = copy(ti_args)
    smaller_ti_args['repeat'] //= 2   # integer division: timeit needs integer counts
    smaller_ti_args['number'] //= 50
    for space in spaces[i]:
        mat = space.random_element()
        mat2 = space.random_element()
        zmat = matrix.random(ZZ, space.nrows(), space.ncols(), x=-2**100, y=2**100)
        elt = field.random_element()
        tneg = timeit("nmat = mat.__neg__()", **ti_args)
        tlmul = timeit("lmulmat = elt*mat", **smaller_ti_args)
        tcp = timeit("cpmat = mat.__copy__()", **ti_args)
        tadd = timeit("addmat = mat+mat2", **ti_args)
        tsub = timeit("submat = mat-mat2", **ti_args)
        tmod = timeit("modmat = zmat._mod_int(9001)", **smaller_ti_args)
        tred = timeit("modmat = zmat._reduce([5,9001,67108879])", **smaller_ti_args)
        print(f"{space.nrows()}\t{tneg:.5e}\t{tlmul:.5e}\t{tcp:.5e}\t{tadd:.5e}\t{tsub:.5e}\t{tmod:.5e}\t{tred:.5e}")

Current sage 10.1.rc0:

dim     neg             lmul            cp              add             sub             mod             red
2       1.04920e-06     1.11478e-06     1.03573e-06     1.00645e-06     1.01389e-06     9.48737e-06     3.25811e-05
5       1.08373e-06     1.38123e-06     1.05366e-06     1.05228e-06     1.02428e-06     9.97889e-06     3.48243e-05
10      1.14022e-06     2.17511e-06     1.08747e-06     1.14064e-06     1.13249e-06     1.13768e-05     4.15419e-05
20      1.42487e-06     1.13507e-05     1.14868e-06     1.50340e-06     1.51895e-06     1.69409e-05     6.80913e-05
50      3.14611e-06     1.02661e-04     1.55741e-06     3.80037e-06     3.91791e-06     5.38073e-05     2.52186e-04
100     9.41528e-06     4.30621e-04     2.92653e-06     1.23570e-05     1.20083e-05     1.88305e-04     8.90810e-04
200     3.32081e-05     1.36992e-03     9.92656e-06     4.29698e-05     4.48893e-05     6.87796e-04     3.38875e-03
500     2.00812e-04     1.03811e-02     5.74063e-05     2.73444e-04     2.88948e-04     4.42284e-03     2.22136e-02
1000    9.96291e-04     7.72575e-01     3.75584e-04     1.32047e-03     1.36660e-03     1.83528e-02     9.21014e-02
2000    4.10982e-03     3.08341e+00     2.08157e-03     5.40352e-03     5.58764e-03     7.29788e-02     3.67008e-01
5000    7.15327e-02     1.94876e+01     6.28877e-02     8.10343e-02     8.23976e-02     5.14923e-01     2.37465e+00

With calloc:

dim     neg             lmul            cp              add             sub             mod             red
2       1.08760e-06     1.20931e-06     1.08560e-06     1.05942e-06     1.05036e-06     1.02416e-05     3.34315e-05
5       1.16667e-06     1.54108e-06     1.17219e-06     1.15436e-06     1.12991e-06     1.09015e-05     3.56833e-05
10      1.25365e-06     2.41498e-06     1.18510e-06     1.24296e-06     1.25182e-06     1.22087e-05     4.29959e-05
20      1.65854e-06     1.41743e-05     1.23270e-06     1.58763e-06     1.60569e-06     1.80272e-05     6.98943e-05
50      4.63961e-06     9.11760e-05     1.81459e-06     4.05638e-06     4.26118e-06     5.57278e-05     2.63128e-04
100     1.54378e-05     4.27696e-04     4.20998e-06     1.37665e-05     1.42754e-05     2.00314e-04     9.48851e-04
200     5.85901e-05     1.78478e-03     1.63732e-05     5.18305e-05     5.45510e-05     7.43514e-04     3.61925e-03
500     3.70120e-04     1.11751e-02     1.65742e-04     3.55436e-04     3.72249e-04     4.79963e-03     2.36530e-02
1000    1.74064e-03     8.31788e-01     8.42331e-04     1.59744e-03     1.65647e-03     1.97940e-02     9.70669e-02
2000    7.07259e-03     3.31929e+00     3.37274e-03     6.59989e-03     6.81077e-03     7.84418e-02     3.82814e-01
5000    8.65640e-02     2.10980e+01     6.47614e-02     8.36451e-02     8.52828e-02     5.32836e-01     2.46436e+00

This PR:

dim     neg             lmul            cp              add             sub             mod             red
2       1.16338e-06     1.20696e-06     1.15783e-06     1.14969e-06     1.13808e-06     1.00704e-05     3.36738e-05
5       1.22451e-06     1.58061e-06     1.18837e-06     1.20754e-06     1.17224e-06     1.04352e-05     3.56168e-05
10      1.25631e-06     2.35847e-06     1.19121e-06     1.29480e-06     1.28789e-06     1.19423e-05     4.29158e-05
20      1.54417e-06     1.39064e-05     1.25487e-06     1.71971e-06     1.68017e-06     1.74728e-05     6.84942e-05
50      3.24817e-06     9.96673e-05     1.69398e-06     4.55949e-06     4.10663e-06     5.45400e-05     2.45368e-04
100     9.33694e-06     4.30981e-04     3.02371e-06     1.46482e-05     1.28682e-05     1.96991e-04     9.01243e-04
200     3.35677e-05     1.28634e-03     1.05197e-05     5.45516e-05     4.59399e-05     7.14479e-04     3.45905e-03
500     2.01636e-04     1.08036e-02     5.95490e-05     3.43807e-04     2.95159e-04     4.61207e-03     2.23629e-02
1000    9.92690e-04     8.19338e-01     3.67429e-04     1.52554e-03     1.40650e-03     1.90070e-02     9.20208e-02
2000    4.11059e-03     3.25830e+00     2.08682e-03     6.28212e-03     5.65227e-03     7.55510e-02     3.68000e-01
5000    7.27222e-02     2.05609e+01     6.38271e-02     8.86618e-02     8.34894e-02     5.12081e-01     2.32309e+00

Code excerpt under review (the allocation in __cinit__; the first line is the existing check_allocarray context):

        self._matrix = <celement **>check_allocarray(self._nrows, sizeof(celement*))
    if zeroed_alloc:
        self._entries = <celement *>check_calloc(self._nrows * self._ncols, sizeof(celement))
        self._matrix = <celement **>check_calloc(self._nrows, sizeof(celement*))
_matrix never needs calloc since it is filled with values just after this allocation


vneiger commented Aug 20, 2023

Just added a comment about a minor improvement I had missed: there is no need to calloc the _matrix array of pointers. Relaunching the detailed benchmark above did not show any impact on timings, as expected. But tests on a very "tall" matrix, with many rows and just a few columns, may show a small difference in favor of the version not using calloc for _matrix.

@github-actions

Documentation preview for this PR (built with commit 513a5b5; changes) is ready! 🎉


vbraun commented Aug 21, 2023

Merge conflict


vneiger commented Aug 22, 2023

Merge conflict

Working with @marizee, we assumed this conflict was between this PR and #36059, another one also authored by @marizee. So she has worked on merging the former into the latter. I have checked the merge and it looks good to me, so I would propose to close this PR and merge #36059 into develop.

One question remains. The above is pure guesswork about the conflict, as we have not found where to look for more information about the merge conflict you announced. Is there a way for us to see that there is a merge conflict, and to get some information about it?


vneiger commented Aug 23, 2023

(Waiting for an answer by @vbraun to the last question; then we should be able to close this PR as resolved via #36059, unless the merge conflict has been misunderstood.)

@vbraun vbraun merged commit f42998b into sagemath:develop Aug 27, 2023
13 of 14 checks passed
@mkoeppe mkoeppe added this to the sage-10.2 milestone Aug 27, 2023
Linked issue this PR may close: Accelerating the construction of matrices of type Matrix_modn_dense (#35961)