Speed up Time for Coverage Combine #1483
Can you provide some more details? Ideally, you'd link us to your repo so that we can see the behavior you are seeing. But even if you can't do that: how long is combine taking? How many data files do you have, and how large are they? Can you show us the output of |
Linking the repo would not be possible, but I can add a little context here. P.S. My major goal is to report coverage on a PR, so if combine cannot be improved, is there a better way to do this? |
Why are you only updating to 5.0.0? Can you try 6.5.0? |
The eventual goal is to move to the latest version, but we currently need Python 2.7 support; that's why 5.0. |
Please try 5.5 ( |
Two more ideas for you:
import collections
import glob
import hashlib

files = list(glob.glob(".coverage.*"))
print(f"{len(files)} files")
hashes = collections.defaultdict(list)
for file in files:
    with open(file, "rb") as f:
        h = hashlib.new("sha256")
        h.update(f.read())
        sha = h.hexdigest()
        hashes[sha].append(file)
print(f"{len(hashes)} hashes")
for sha, files in hashes.items():
    if len(files) > 1:
        print(sha)
        for file in files:
            print(f"    {file}")
Oh, coverage.py writes a timestamp into the data file. If I remove that, then my 1111 files have only 761 unique hashes. |
So I found a similar issue: https://github.com/nedbat/coveragepy/pull/765/files
Not sure if this works as expected with multiprocessing, though. In my case the parent process creates a .coverage file and all the child processes keep rewriting that file, so I was having issues with it. Switching to parallel helped solve the issue. I will try with append, but is the behaviour the same with multiprocessing set as the concurrency?
How do we do that? |
When generating many parallel data files, often some data files will be exact copies of each other. Checking the hashes, we can avoid combining the duplicates, speeding the process.
You can install coverage.py from an experimental branch:
It will check the hashes of the files, and skip duplicates automatically. Let me know if you can try it, and what the results are. The final reports should all be exactly the same, but the |
Ahh, OK. I'll try it with coverage 5.5 as the base for the experimental branch and report any improvement here. |
When generating many parallel data files, often some data files will be exact copies of each other. Checking the hashes, we can avoid combining the duplicates, speeding the process. On a coverage.py metacov, we had 651 duplicates out of 2189 files (29%). The time to combine was reduced by 17%.
This is now merged to coverage.py master, but it won't run on Python 2. I'd be interested if you could find a way to run it in your environment. |
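The de-duplication idea described above can be sketched with the standard library alone. This is only an illustration of "hash each file, keep one per hash", not coverage.py's actual implementation; the function name `unique_data_files` is made up for this example. Note the caveat from earlier in the thread: if the data files embed a timestamp, otherwise-identical files will hash differently and nothing will be skipped.

```python
import hashlib
from pathlib import Path


def unique_data_files(paths):
    """Return one path per distinct file content, skipping byte-identical copies.

    Hypothetical helper for illustration; not part of coverage.py.
    """
    seen = set()
    kept = []
    for path in sorted(Path(p) for p in paths):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a file we already kept
        seen.add(digest)
        kept.append(path)
    return kept
```

For example, `unique_data_files(Path(".").glob(".coverage.*"))` would give the subset of parallel data files that actually differ, which is the set a combine step needs to read.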
Earlier today I discovered that I had ~68000 coverage files inside my .tox folder. We started to use coveragepy using multiprocessing and thread as we use xdist and we also use subprocesses. It was so bad that I was not even able to use globbing to clean the files as I was reaching too many args error. I managed to clean using I have ~700 tests and a single run produces ~600 coverage files. I really wonder why so many as I clearly do not run so many subprocesses. I am not sure when this happened but I suspect there is a regression in recent versions of coverage, maybe in 6.x? My version is 6.4.4 with C extensions.
As combine is run only in a different tox environment, I end up with a growing pile of files. Any ideas on how to limit its growth? Thanks |
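One way to clear out a very large number of leftover data files without hitting the shell's argument-length limit is to walk the tree from Python, so no glob is ever expanded on a command line. This is a minimal sketch, not necessarily what the commenter used; the function name `remove_data_files` and the default pattern are assumptions for this example.

```python
from pathlib import Path


def remove_data_files(root, pattern=".coverage.*"):
    """Delete files matching `pattern` anywhere under `root`; return the count.

    Hypothetical helper for illustration; not part of coverage.py.
    """
    removed = 0
    # Materialize the matches first so we never delete while still walking.
    for path in list(Path(root).rglob(pattern)):
        if path.is_file():
            path.unlink()
            removed += 1
    return removed
```

Calling `remove_data_files(".tox")` before a test run would keep stale parallel files from piling up between runs. It deletes data, so it should only be pointed at directories where uncombined results are no longer needed.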
@ssbarnea I'd be interested to investigate this, but it sounds like it should be a new issue. |
I made some more improvements to the de-duplication during combining. If anyone wants to try it, install from master, or the specific commit:
Let me know how it works for you. |
I'm going to close this now as "improved" :) If people have further problems, feel free to open a new issue. |
This is now released as part of coverage 7.0.0b1. |
Version 7.2.5 — 2023-04-30
--------------------------

- Fix: ``html_report()`` could fail with an AttributeError on ``isatty`` if run in an unusual environment where sys.stdout had been replaced. This is now fixed.

Version 7.2.4 — 2023-04-28
--------------------------

PyCon 2023 sprint fixes!

- Fix: with ``relative_files = true``, specifying a specific file to include or omit wouldn't work correctly (`issue 1604`_). This is now fixed, with testing help by `Marc Gibbons <pull 1608_>`_.

- Fix: the XML report would have an incorrect ``<source>`` element when using relative files and the source option ended with a slash (`issue 1541`_). This is now fixed, thanks to `Kevin Brown-Silva <pull 1608_>`_.

- When the HTML report location is printed to the terminal, it's now a terminal-compatible URL, so that you can click the location to open the HTML file in your browser. Finishes `issue 1523`_ thanks to `Ricardo Newbery <pull 1613_>`_.

- Docs: a new :ref:`Migrating page <migrating>` with details about how to migrate between major versions of coverage.py. It currently covers the wildcard changes in 7.x. Thanks, `Brian Grohe <pull 1610_>`_.

.. _issue 1523: nedbat/coveragepy#1523
.. _issue 1541: nedbat/coveragepy#1541
.. _issue 1604: nedbat/coveragepy#1604
.. _pull 1608: nedbat/coveragepy#1608
.. _pull 1609: nedbat/coveragepy#1609
.. _pull 1610: nedbat/coveragepy#1610
.. _pull 1613: nedbat/coveragepy#1613

Version 7.2.3 — 2023-04-06
--------------------------

- Fix: the :ref:`config_run_sigterm` setting was meant to capture data if a process was terminated with a SIGTERM signal, but it didn't always. This was fixed thanks to `Lewis Gaul <pull 1600_>`_, closing `issue 1599`_.

- Performance: HTML reports with context information are now much more compact. File sizes are typically as small as one-third the previous size, but can be dramatically smaller. This closes `issue 1584`_ thanks to `Oleh Krehel <pull 1587_>`_.

- Development dependencies no longer use hashed pins, closing `issue 1592`_.

.. _issue 1584: nedbat/coveragepy#1584
.. _pull 1587: nedbat/coveragepy#1587
.. _issue 1592: nedbat/coveragepy#1592
.. _issue 1599: nedbat/coveragepy#1599
.. _pull 1600: nedbat/coveragepy#1600

Version 7.2.2 — 2023-03-16
--------------------------

- Fix: if a virtualenv was created inside a source directory, and a sourced package was installed inside the virtualenv, then all of the third-party packages inside the virtualenv would be measured. This was incorrect, but has now been fixed: only the specified packages will be measured, thanks to `Manuel Jacob <pull 1560_>`_.

- Fix: the ``coverage lcov`` command could create a .lcov file with incorrect LF (lines found) and LH (lines hit) totals. This is now fixed, thanks to `Ian Moore <pull 1583_>`_.

- Fix: the ``coverage xml`` command on Windows could create a .xml file with duplicate ``<package>`` elements. This is now fixed, thanks to `Benjamin Parzella <pull 1574_>`_, closing `issue 1573`_.

.. _pull 1560: nedbat/coveragepy#1560
.. _issue 1573: nedbat/coveragepy#1573
.. _pull 1574: nedbat/coveragepy#1574
.. _pull 1583: nedbat/coveragepy#1583

Version 7.2.1 — 2023-02-26
--------------------------

- Fix: the PyPI page had broken links to documentation pages, but no longer does, closing `issue 1566`_.

- Fix: public members of the coverage module are now properly indicated so that mypy will find them, fixing `issue 1564`_.

.. _issue 1564: nedbat/coveragepy#1564
.. _issue 1566: nedbat/coveragepy#1566

Version 7.2.0 — 2023-02-22
--------------------------

- Added a new setting ``[report] exclude_also`` to let you add more exclusions without overwriting the defaults. Thanks, `Alpha Chen <pull 1557_>`_, closing `issue 1391`_.

- Added a :meth:`.CoverageData.purge_files` method to remove recorded data for a particular file. Contributed by `Stephan Deibel <pull 1547_>`_.

- Fix: when reporting commands fail, they will no longer congratulate themselves with messages like "Wrote XML report to file.xml" before spewing a traceback about their failure.

- Fix: arguments in the public API that name file paths now accept pathlib.Path objects. This includes the ``data_file`` and ``config_file`` arguments to the Coverage constructor and the ``basename`` argument to CoverageData. Closes `issue 1552`_.

- Fix: In some embedded environments, an IndexError could occur on stop() when the originating thread exits before completion. This is now fixed, thanks to `Russell Keith-Magee <pull 1543_>`_, closing `issue 1542`_.

- Added a ``py.typed`` file to announce our type-hintedness. Thanks, `KotlinIsland <pull 1550_>`_.

.. _issue 1391: nedbat/coveragepy#1391
.. _issue 1542: nedbat/coveragepy#1542
.. _pull 1543: nedbat/coveragepy#1543
.. _pull 1547: nedbat/coveragepy#1547
.. _pull 1550: nedbat/coveragepy#1550
.. _issue 1552: nedbat/coveragepy#1552
.. _pull 1557: nedbat/coveragepy#1557

Version 7.1.0 — 2023-01-24
--------------------------

- Added: the debug output file can now be specified with ``[run] debug_file`` in the configuration file. Closes `issue 1319`_.

- Performance: fixed a slowdown with dynamic contexts that's been around since 6.4.3. The fix closes `issue 1538`_. Thankfully this doesn't break the `Cython change`_ that fixed `issue 972`_. Thanks to Mathieu Kniewallner for the deep investigative work and comprehensive issue report.

- Typing: all product and test code has type annotations.

.. _Cython change: nedbat/coveragepy#1347
.. _issue 972: nedbat/coveragepy#972
.. _issue 1319: nedbat/coveragepy#1319
.. _issue 1538: nedbat/coveragepy#1538

Version 7.0.5 — 2023-01-10
--------------------------

- Fix: On Python 3.7, a file with type annotations but no ``from __future__ import annotations`` would be missing statements in the coverage report. This is now fixed, closing `issue 1524`_.

.. _issue 1524: nedbat/coveragepy#1524

Version 7.0.4 — 2023-01-07
--------------------------

- Performance: an internal cache of file names was accidentally disabled, resulting in sometimes drastic reductions in performance. This is now fixed, closing `issue 1527`_. Thanks to Ivan Ciuvalschii for the reproducible test case.

.. _issue 1527: nedbat/coveragepy#1527

Version 7.0.3 — 2023-01-03
--------------------------

- Fix: when using pytest-cov or pytest-xdist, or perhaps both, the combining step could fail with ``assert row is not None`` using 7.0.2. This was due to a race condition that has always been possible and is still possible. In 7.0.1 and before, the error was silently swallowed by the combining code. Now it will produce a message "Couldn't combine data file" and ignore the data file as it used to do before 7.0.2. Closes `issue 1522`_.

.. _issue 1522: nedbat/coveragepy#1522

Version 7.0.2 — 2023-01-02
--------------------------

- Fix: when using the ``[run] relative_files = True`` setting, a relative ``[paths]`` pattern was still being made absolute. This is now fixed, closing `issue 1519`_.

- Fix: if Python doesn't provide tomllib, then TOML configuration files can only be read if coverage.py is installed with the ``[toml]`` extra. Coverage.py will raise an error if TOML support is not installed when it sees your settings are in a .toml file. But it didn't understand that ``[tools.coverage]`` was a valid section header, so the error wasn't reported if you used that header, and settings were silently ignored. This is now fixed, closing `issue 1516`_.

- Fix: adjusted how decorators are traced on PyPy 7.3.10, fixing `issue 1515`_.

- Fix: the ``coverage lcov`` report did not properly implement the ``--fail-under=MIN`` option. This has been fixed.

- Refactor: added many type annotations, including a number of refactorings. This should not affect outward behavior, but they were a bit invasive in some places, so keep your eyes peeled for oddities.

- Refactor: removed the vestigial and long untested support for Jython and IronPython.

.. _issue 1515: nedbat/coveragepy#1515
.. _issue 1516: nedbat/coveragepy#1516
.. _issue 1519: nedbat/coveragepy#1519

Version 7.0.1 — 2022-12-23
--------------------------

- When checking if a file mapping resolved to a file that exists, we weren't considering files in .whl files. This is now fixed, closing `issue 1511`_.

- File pattern rules were too strict, forbidding plus signs and curly braces in directory and file names. This is now fixed, closing `issue 1513`_.

- Unusual Unicode or control characters in source files could prevent reporting. This is now fixed, closing `issue 1512`_.

- The PyPy wheel now installs on PyPy 3.7, 3.8, and 3.9, closing `issue 1510`_.

.. _issue 1510: nedbat/coveragepy#1510
.. _issue 1511: nedbat/coveragepy#1511
.. _issue 1512: nedbat/coveragepy#1512
.. _issue 1513: nedbat/coveragepy#1513

Version 7.0.0 — 2022-12-18
--------------------------

Nothing new beyond 7.0.0b1.

Version 7.0.0b1 — 2022-12-03
----------------------------

A number of changes have been made to file path handling, including pattern matching and path remapping with the ``[paths]`` setting (see :ref:`config_paths`). These changes might affect you, and require you to update your settings.

(This release includes the changes from `6.6.0b1 <changes_6-6-0b1_>`_, since 6.6.0 was never released.)

- Changes to file pattern matching, which might require updating your configuration:

  - Previously, ``*`` would incorrectly match directory separators, making precise matching difficult. This is now fixed, closing `issue 1407`_.

  - Now ``**`` matches any number of nested directories, including none.

- Improvements to combining data files when using the :ref:`config_run_relative_files` setting, which might require updating your configuration:

  - During ``coverage combine``, relative file paths are implicitly combined without needing a ``[paths]`` configuration setting. This also fixed `issue 991`_.

  - A ``[paths]`` setting like ``*/foo`` will now match ``foo/bar.py`` so that relative file paths can be combined more easily.

  - The :ref:`config_run_relative_files` setting is properly interpreted in more places, fixing `issue 1280`_.

- When remapping file paths with ``[paths]``, a path will be remapped only if the resulting path exists. The documentation has long said the prefix had to exist, but it was never enforced. This fixes `issue 608`_, improves `issue 649`_, and closes `issue 757`_.

- Reporting operations now implicitly use the ``[paths]`` setting to remap file paths within a single data file. Combining multiple files still requires the ``coverage combine`` step, but this simplifies some single-file situations. Closes `issue 1212`_ and `issue 713`_.

- The ``coverage report`` command now has a ``--format=`` option. The original style is now ``--format=text``, and is the default.

  - Using ``--format=markdown`` will write the table in Markdown format, thanks to `Steve Oswald <pull 1479_>`_, closing `issue 1418`_.

  - Using ``--format=total`` will write a single total number to the output. This can be useful for making badges or writing status updates.

- Combining data files with ``coverage combine`` now hashes the data files to skip files that add no new information. This can reduce the time needed. Many details affect the speed-up, but for coverage.py's own test suite, combining is about 40% faster. Closes `issue 1483`_.

- When searching for completely un-executed files, coverage.py uses the presence of ``__init__.py`` files to determine which directories have source that could have been imported. However, `implicit namespace packages`_ don't require ``__init__.py``. A new setting ``[report] include_namespace_packages`` tells coverage.py to consider these directories during reporting. Thanks to `Felix Horvat <pull 1387_>`_ for the contribution. Closes `issue 1383`_ and `issue 1024`_.

- Fixed environment variable expansion in pyproject.toml files. It was overly broad, causing errors outside of coverage.py settings, as described in `issue 1481`_ and `issue 1345`_. This is now fixed, but in rare cases will require changing your pyproject.toml to quote non-string values that use environment substitution.

- An empty file has a coverage total of 100%, but used to fail with ``--fail-under``. This has been fixed, closing `issue 1470`_.

- The text report table no longer writes out two separator lines if there are no files listed in the table. One is plenty.

- Fixed a mis-measurement of a strange use of wildcard alternatives in match/case statements, closing `issue 1421`_.

- Fixed internal logic that prevented coverage.py from running on implementations other than CPython or PyPy (`issue 1474`_).

- The deprecated ``[run] note`` setting has been completely removed.

.. _implicit namespace packages: https://peps.python.org/pep-0420/
.. _issue 608: nedbat/coveragepy#608
.. _issue 649: nedbat/coveragepy#649
.. _issue 713: nedbat/coveragepy#713
.. _issue 757: nedbat/coveragepy#757
.. _issue 991: nedbat/coveragepy#991
.. _issue 1024: nedbat/coveragepy#1024
.. _issue 1212: nedbat/coveragepy#1212
.. _issue 1280: nedbat/coveragepy#1280
.. _issue 1345: nedbat/coveragepy#1345
.. _issue 1383: nedbat/coveragepy#1383
.. _issue 1407: nedbat/coveragepy#1407
.. _issue 1418: nedbat/coveragepy#1418
.. _issue 1421: nedbat/coveragepy#1421
.. _issue 1470: nedbat/coveragepy#1470
.. _issue 1474: nedbat/coveragepy#1474
.. _issue 1481: nedbat/coveragepy#1481
.. _issue 1483: nedbat/coveragepy#1483
.. _pull 1387: nedbat/coveragepy#1387
.. _pull 1479: nedbat/coveragepy#1479

Version 6.6.0b1 — 2022-10-31
----------------------------

(Note: 6.6.0 final was never released. These changes are part of `7.0.0b1 <changes_7-0-0b1_>`_.)

- Changes to file pattern matching, which might require updating your configuration:

  - Previously, ``*`` would incorrectly match directory separators, making precise matching difficult. This is now fixed, closing `issue 1407`_.

  - Now ``**`` matches any number of nested directories, including none.

- Improvements to combining data files when using the :ref:`config_run_relative_files` setting:

  - During ``coverage combine``, relative file paths are implicitly combined without needing a ``[paths]`` configuration setting. This also fixed `issue 991`_.

  - A ``[paths]`` setting like ``*/foo`` will now match ``foo/bar.py`` so that relative file paths can be combined more easily.

  - The setting is properly interpreted in more places, fixing `issue 1280`_.

- Fixed environment variable expansion in pyproject.toml files. It was overly broad, causing errors outside of coverage.py settings, as described in `issue 1481`_ and `issue 1345`_. This is now fixed, but in rare cases will require changing your pyproject.toml to quote non-string values that use environment substitution.

- Fixed internal logic that prevented coverage.py from running on implementations other than CPython or PyPy (`issue 1474`_).

.. _issue 991: nedbat/coveragepy#991
.. _issue 1280: nedbat/coveragepy#1280
.. _issue 1345: nedbat/coveragepy#1345
.. _issue 1407: nedbat/coveragepy#1407
.. _issue 1474: nedbat/coveragepy#1474
.. _issue 1481: nedbat/coveragepy#1481
I'm working on a pretty big project - our coverage report is 412 MB. The entire codebase is about 8 million lines. We use combine to build the coverage database, and it's painfully slow - it takes almost 3 hours to build. It combines 690 different coverage files together. Are there any plans to introduce multithreading / multiprocessing into the process? This would be a huge win for us. |
Three hours is a really long time, that's surprising. What version of coverage and Python are you using? I haven't thought about multithreading or multiprocessing. It would be instructive to have a reproducible case so we can see where the time is being spent. |
@jbkkd (see above) |
@jbkkd BTW, I can consult hourly with an NDA if needed to work on private code. |
I fixed a few quadratic-time behaviors in the combining code. The 3+ hour combine now takes 7 minutes. |
Is your feature request related to a problem? Please describe.
I am trying to set up a code coverage report for my pull requests for a specific number of tests. Using coverage combine to combine all the data files takes a lot of time, which is not ideal. Is there a quick resolution for this? I tried moving to 5.0, but the combine performance degraded compared with previous versions, which used a pickled dictionary as the datastore instead of SQLite.
Describe the solution you'd like
My idea is to somehow bring down the time of the combine step in multiprocessing cases, where we run the tests in parallel and combine them at a later stage.
Describe alternatives you've considered
For per-PR coverage I did consider using the collector itself to hold the data, but I would want to avoid any use of a database on my side, and if possible I would prefer using the coverage data file, as that is much more convenient to debug.
Additional context
Add any other context about the feature request here.