Add support of Git mailmap files #303

Merged on Oct 14, 2024 (13 commits)

HelgeCPH
Contributor

PyDriller does not support
[Git `.mailmap` files](https://git-scm.com/docs/gitmailmap). That is, authors
and committers, which are represented as `Developer` objects, are always
assigned the name and email values stated in their respective commits.

I would like PyDriller to support `.mailmap` files directly. So far, I have
worked around the missing mailmap support by having external scripts map users
with multiple email addresses or slightly different names to canonical data.
I believe other users might be interested in this feature as well.
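
As a rough illustration of such an external workaround, a minimal sketch might look as follows. The alias table and the `canonicalize` helper are hypothetical; they are not part of PyDriller or of this PR.

```python
from typing import Dict, Tuple

Identity = Tuple[str, str]  # (name, email)

# Hypothetical alias table: identity as recorded in a commit -> canonical identity.
ALIASES: Dict[Identity, Identity] = {
    ("J. Doe", "jdoe@old.example.com"): ("Jane Doe", "jane@example.com"),
    ("jane", "jane@users.noreply.example.com"): ("Jane Doe", "jane@example.com"),
}


def canonicalize(name: str, email: str) -> Identity:
    """Return the canonical identity, or the input unchanged if no alias is known."""
    return ALIASES.get((name, email), (name, email))
```

A table like this has to be maintained by hand and does not follow Git's own matching rules, which is the gap that direct `.mailmap` support closes.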

Currently, PyDriller does not support `.mailmap` files because it relies on
GitPython to obtain the names and emails of `Developer`s. GitPython does not
support `.mailmap` files either, see
[here](gitpython-developers/GitPython#764).
Since its maintainers state that GitPython is in maintenance mode, I am
contributing `.mailmap` support to PyDriller rather than to the underlying
dependency. I also believe that this functionality is better suited to
PyDriller, a library that users run to analyze Git repositories, where analysis
benefits from "deduplication" of physical authors into logical authors.

Since parsing `.mailmap` files is not straightforward, and since I do not
want to introduce an algorithm that might produce results that differ from
Git's, I decided to wrap the `git check-mailmap`
[CLI command](https://git-scm.com/docs/git-check-mailmap/2.31.0).
Most of this is done in the new file
[pydriller/utils/mailmap.py](utils/mailmap.py). To make the feature work, I had
to extend the [`Developer`](pydriller/domain/developer.py) class slightly.

Together with the feature implementation, I provide a set of tests. They rely
on an example repository, which I adapted from
[here](https://github.com/ContentMine/mailmap). I needed a small repository
with a `.mailmap` file and a few commits, some of which are made by users with
differing name/email values. I do not have the time to craft one from scratch.
Therefore, I hope it is okay to include that repository in the test
repositories. I believe there is no issue, since the tests as well as the
commit message state the source of the data explicitly.
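
To make the approach concrete, the following is a minimal sketch of how `git check-mailmap` could be wrapped from Python. It is not the code in `pydriller/utils/mailmap.py`; the function names `parse_contact` and `check_mailmap` are hypothetical, and the sketch assumes that `git` is on the `PATH` and that the command prints one `Name <email>` line per contact.

```python
import re
import subprocess
from typing import Tuple

# Matches a contact line of the form "Name <email>"; the name may be empty.
_CONTACT = re.compile(r"^(?P<name>.*?)\s*<(?P<email>[^>]*)>$")


def parse_contact(line: str) -> Tuple[str, str]:
    """Split a 'Name <email>' line into a (name, email) tuple."""
    match = _CONTACT.match(line.strip())
    if match is None:
        raise ValueError(f"unexpected contact format: {line!r}")
    return match.group("name"), match.group("email")


def check_mailmap(repo_path: str, name: str, email: str) -> Tuple[str, str]:
    """Ask Git for the canonical identity of a contact (hypothetical helper).

    Runs `git -C <repo_path> check-mailmap "<name> <email>"` and parses the
    single line it prints. Raises CalledProcessError if Git exits non-zero.
    """
    result = subprocess.run(
        ["git", "-C", repo_path, "check-mailmap", f"{name} <{email}>"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_contact(result.stdout)
```

Delegating the lookup to Git leaves the matching rules to Git itself, at the cost of one subprocess call per lookup in this naive form.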

pydriller/utils/mailmap.py: two code scanning alerts, both marked Fixed

@HelgeCPH
Contributor Author

HelgeCPH commented Oct 11, 2024

Hmm, I can see the above messages. I would need help with them.

First, I have never used mypy before, so I am not sure how to address these issues. I believe I just fixed the signature/return issues reported by CodeQL/mypy. However, some mypy checks on my new tests seem to raise problems that I do not yet understand.

Second, I do not know what makes the other checks fail. Locally, I can run and pass all tests :)

Please let me know what I should do. Thank you in advance.

I believe that I figured out what mypy was complaining about and committed respective changes.

testpulseio bot commented Oct 11, 2024

TestPulse report

Test execution

🥳 Congrats, all your tests have passed!
See all builds of the PR: https://app.testpulse.io/oss/ishepard/pydriller/builds/pr/303

Files without coverage

⚠️ Some files changed in the PR are not covered by tests, careful!

  • pydriller/domain/commit.py
  • pydriller/domain/developer.py
  • pydriller/repository.py
  • pydriller/utils/conf.py
  • pydriller/utils/mailmap.py
  • tests/integration/test_mailmap.py
  • tests/test_developer.py
  • tests/test_mailmap.py

Coverage information

Unit tests

  • mailmap.py went from 0.00% to 100.00% (+100.00%)

codecov bot commented Oct 11, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.38%. Comparing base (51510ab) to head (4e6cb39).
Report is 14 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #303      +/-   ##
==========================================
+ Coverage   97.27%   97.38%   +0.10%     
==========================================
  Files          17       18       +1     
  Lines        1102     1146      +44     
==========================================
+ Hits         1072     1116      +44     
  Misses         30       30              

| Files with missing lines | Coverage Δ |
|---|---|
| pydriller/domain/commit.py | 97.27% <100.00%> (ø) |
| pydriller/domain/developer.py | 100.00% <100.00%> (ø) |
| pydriller/repository.py | 93.00% <ø> (ø) |
| pydriller/utils/conf.py | 96.21% <100.00%> (+0.08%) ⬆️ |
| pydriller/utils/mailmap.py | 100.00% <100.00%> (ø) |

@ishepard
Owner

Amazing PR! Thanks for adding many tests as well!
I don't see any issue with this, and all tests pass, so good to go for me!

@ishepard ishepard merged commit d99955f into ishepard:master Oct 14, 2024
19 checks passed