[BREAKING CHANGE] New Top Language Detection Method #481

anuraghazra · 2020-09-20T07:38:27Z

As you all might know, there are various bugs/issues regarding the top languages calculation.

The problem

The main issue I see is that people often get confused by how the calculations are done.

Currently the top languages are calculated from how much code, in bytes, you have in each language; we then pick the languages with the most bytes.
This method is the main reason people are confused, because users normally judge how much they code in a language by how many of their repositories use it.

Quirks with the current calculation method

Confusion about the calculations. (Top Languages Card not working properly #136 (comment))
Repositories might contain vendor code or auto-generated code, which would make the calculations wrong. ([Top Languages] Blog repositories(xxxx.github.io) should not be counted. #153)
If a language has an exaggerated byte count, it becomes the dominant language. (Top language card not showing Python #358 (comment))
Users aren't satisfied with the method.

The Solution

The most straightforward solution I see is that, instead of calculating how much code users have in each language, we calculate how many of their repositories use each language.

Related issues

#432 #403 #270 #136 #358

saurabhdaware · 2020-09-20T07:54:54Z

Not sure, but the current method seems right to me. Imagine having 2 HTML/JS repositories with more than 51% HTML in both.

According to the new proposed method, I wouldn't know JavaScript.

So in my opinion, the current method makes more sense.

anuraghazra · 2020-09-20T08:01:48Z

@saurabhdaware We would not just take the primary language of each repo; we would also take the top 10 languages of each repo.

That is what we are already doing:

github-readme-stats/src/fetchers/top-languages-fetcher.js

Line 14 in 6e73a00

languages(first: 10, orderBy: {field: SIZE, direction: DESC}) {

So it would look like:

HTML ---------- some%
JavaScript ------ some%

saurabhdaware · 2020-09-20T08:11:21Z

I am not sure if I understood. Wouldn't calculating the top 10 languages of each repository be the same as calculating how much code the user has in bytes? That's how GitHub calculates the percentage as well, no?

anuraghazra · 2020-09-20T08:14:56Z

No, we would just "count" them. In the current method we take each "language.size", reduce it to a per-language sum, and then sort by that sum.
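For readers following along, the current byte-based method described here (sum each language's size, then sort) might look roughly like the sketch below. The repo shape, the sample data, and the name `topLanguagesByBytes` are assumptions made for illustration; this is not the project's actual fetcher code.

```javascript
// Hypothetical repo shape: { name, languages: [{ name, size }] }, with size in bytes.
function topLanguagesByBytes(repos, limit = 5) {
  // Reduce every repo's language sizes into one per-language byte total.
  const totals = repos
    .flatMap((repo) => repo.languages)
    .reduce((acc, lang) => {
      acc[lang.name] = (acc[lang.name] || 0) + lang.size;
      return acc;
    }, {});

  const grandTotal = Object.values(totals).reduce((a, b) => a + b, 0);

  // Sort descending by bytes and keep only the top entries.
  return Object.entries(totals)
    .sort(([, a], [, b]) => b - a)
    .slice(0, limit)
    .map(([name, size]) => ({ name, size, percent: (size / grandTotal) * 100 }));
}

// Invented sample data: one large JavaScript repo outweighs everything else.
const sampleRepos = [
  { name: "site", languages: [{ name: "HTML", size: 600 }, { name: "JavaScript", size: 3400 - 3000 }] },
  { name: "app", languages: [{ name: "JavaScript", size: 3000 }] },
];
console.log(topLanguagesByBytes(sampleRepos));
```

Because the totals are in bytes, a single large repository can dominate the result, which is the quirk this issue is about.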

saurabhdaware · 2020-09-20T08:23:16Z

Oh ok so in this example

Imagine having 2 HTML/JS repositories with more than 51% HTML in both.

I would have
51% HTML
49% JavaScript
right?

anuraghazra · 2020-09-20T08:26:41Z

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1

HTML - 50%
JS - 50%

We would just count them.
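The counting idea can be sketched as follows. This is only an illustration under assumptions: the repo shape and the name `topLanguagesByRepoCount` are made up, and the percentage is taken over all language occurrences, which is what reproduces the 50% / 50% split in this example.

```javascript
// Hypothetical repo shape: { name, languages: ["JS", "HTML"] }.
function topLanguagesByRepoCount(repos) {
  const counts = {};
  for (const repo of repos) {
    // A Set guards against the same language being listed twice in one repo.
    for (const lang of new Set(repo.languages)) {
      counts[lang] = (counts[lang] || 0) + 1;
    }
  }

  // Normalize by the total number of language occurrences across all repos.
  const total = Object.values(counts).reduce((a, b) => a + b, 0);

  return Object.entries(counts)
    .sort(([, a], [, b]) => b - a)
    .map(([name, count]) => ({ name, count, percent: (count / total) * 100 }));
}

const twoRepos = [
  { name: "Repo 1", languages: ["JS", "HTML"] },
  { name: "Repo 2", languages: ["JS", "HTML"] },
];
console.log(topLanguagesByRepoCount(twoRepos));
```

Each language is counted once per repository, so byte sizes play no role in the result.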

saurabhdaware · 2020-09-20T09:40:39Z

Oh cool. Seems good to me then.

DenverCoder1 · 2020-09-21T09:50:15Z

I think I like this new way better. I did one very large C# project earlier this year and now it says C# is my top language (over 52%) even though it is definitely not what I do the most of.

DenverCoder1 · 2020-09-21T09:53:46Z

If there are people who are happier with the current way, there could possibly be an additional parameter that switches between the different calculation modes?

anuraghazra · 2020-09-21T14:04:45Z

If there are people who are happier with the current way, there could possibly be an additional parameter that switches between the different calculation modes?

Unfortunately we cannot do that. It would make the logic complex, and we would have two different statistics, which would hamper consistency.

anuraghazra · 2020-09-21T14:05:58Z

I will first publish an experimental query param to enable this, and then if people like it I will make it the default.

Bas950 · 2020-09-23T19:31:05Z

I personally wouldn't use this new one; I think the current one is better.

I think I like this new way better. I did one very large C# project earlier this year and now it says C# is my top language (over 52%) even though it is definitely not what I do the most of.

You could exclude the language, but in my PR (#480) I am making it so you can just exclude a repo, which is probably better.

anuraghazra · 2020-09-24T06:18:23Z

@Bas950 yeah, I can understand what you are saying, but the main reason why so many people use github-readme-stats is its simplicity and ease of use. Of course we can add "exclude_repo" options and make it better, but not many people have the time/patience to go through all of their repositories, check which ones have vendor code, and exclude them one by one. Not to mention this is impractical for users who have a lot of repos.

So this is why I'm considering this new approach, which would mitigate these issues.

crazy-max · 2020-09-24T09:10:15Z

@anuraghazra Check https://github.com/github/linguist

mchelen-gov · 2020-09-24T18:05:43Z

Quirks with the current calculation method

Confusion about the calculations. (#136 (comment))

Repositories might have vendor code or auto generated code which would make the calculations wrong. ([Top Languages] Blog repositories(xxxx.github.io) should not be counted. #153)

If a language has an exaggerated byte count, it becomes the dominant language. (#358 (comment))

Users aren't satisfied with the method.

@anuraghazra Does the current method only look at repos owned by the user or does it include other repos the user has contributed to?

ghost · 2020-09-25T10:26:10Z

Consider I have five repositories.

C++ 50k line Machine Learning Library I wrote.
html page with hello
html page with hello
html page with hello
html page with hello

Now the repo approach would give my top language as HTML! Right??
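To make this objection concrete, here is a small sketch that runs both weightings over the five repositories above. The helper `languageShare`, the repo names, and the byte sizes are all invented for illustration; only the shape of the example (one large C++ repo, four tiny HTML repos) comes from this comment.

```javascript
// Share of each language under a pluggable weight function (illustrative helper).
function languageShare(repos, weightOf) {
  const totals = {};
  for (const repo of repos) {
    for (const lang of repo.languages) {
      totals[lang.name] = (totals[lang.name] || 0) + weightOf(lang);
    }
  }
  const sum = Object.values(totals).reduce((a, b) => a + b, 0);
  const shares = {};
  for (const [name, value] of Object.entries(totals)) {
    shares[name] = (value / sum) * 100;
  }
  return shares;
}

// Byte sizes are invented; they only need to be "huge" versus "tiny".
const fiveRepos = [
  { name: "ml-library", languages: [{ name: "C++", size: 1500000 }] },
  ...Array.from({ length: 4 }, (_, i) => ({
    name: `hello-${i + 1}`,
    languages: [{ name: "HTML", size: 100 }],
  })),
];

const byBytes = languageShare(fiveRepos, (lang) => lang.size); // current method
const byCount = languageShare(fiveRepos, () => 1); // proposed method
console.log({ byBytes, byCount });
```

Under these assumptions the byte weighting puts C++ far ahead, while the repo count puts HTML ahead with four of the five occurrences, which is exactly the outcome the comment is worried about.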

anuraghazra · 2020-09-25T11:21:11Z

Consider I have five repositories.

C++ 50k line Machine Learning Library I wrote.

html page with hello

html page with hello

html page with hello

html page with hello

Now the repo approach would give my top language as HTML! Right??

Bas950 · 2020-09-25T11:37:06Z

Quirks with the current calculation method

Confusion about the calculations. (#136 (comment))

Repositories might have vendor code or auto generated code which would make the calculations wrong. ([Top Languages] Blog repositories(xxxx.github.io) should not be counted. #153)

If a language has an exaggerated byte count, it becomes the dominant language. (#358 (comment))

Users aren't satisfied with the method.

@anuraghazra Does the current method only look at repos owned by the user or does it include other repos the user has contributed to?

I will be adding a feature soon in a PR, though, that will allow you to opt in to or out of using forks, depending on what @anuraghazra wants to use.

benstigsen · 2020-10-21T13:32:47Z

Will this make it support other languages, or should I make a separate PR for this? Currently I can't see GDScript on my stats.

jcubic · 2021-01-06T15:01:51Z

Please check issue #450, where I've provided a GraphQL query (I don't remember if it works or if I tested it) that shows the most starred repos. An alternative might be to get the repos with the most commits (but there is no ordering by number of commits yet).

Using the default 100 repos is stupid because the order can be random, and those repos can contain code that you never added a single commit to and got from somewhere else. It's not only forks that contain forked code: if the code is not on GitHub you can't fork it, so you need to copy it into your own repo.

Potherca · 2021-01-30T21:08:26Z

Please check issue #450, where I've provided a GraphQL query

I think the suggestion (in that ticket) to use orderBy might be a good one. The next issue will, of course, be "Sort by what?", which will undoubtedly lead to "I want to sort by X, not Y, can you make it configurable?". But I think the basic premise is a good addition to resolving this rather sticky puzzle.

(I don't remember if it works or if I tested it)

I just ran it through the graphql explorer, it works 👍

those repos can contain code that you never added a single commit to and got from somewhere else. It's not only forks that contain forked code: if the code is not on GitHub you can't fork it, so you need to copy it into your own repo.

For such repos, I would suggest using the exclude_repo setting (or just creating a separate org and moving such repos there).
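As a rough illustration of what such an exclusion amounts to, the sketch below filters repositories by name before any statistics are computed. The comma-separated parameter format, the function names, and the sample repo names are assumptions for this example, not a description of how the project implements exclude_repo.

```javascript
// Parse a comma-separated exclude list such as "vendor-copy, old-fork" (format assumed).
function parseExcludeList(param) {
  return new Set(
    (param || "")
      .split(",")
      .map((name) => name.trim())
      .filter(Boolean)
  );
}

// Drop repos whose name is in the exclude set before computing any stats.
function withoutExcluded(repos, excludeParam) {
  const excluded = parseExcludeList(excludeParam);
  return repos.filter((repo) => !excluded.has(repo.name));
}

const kept = withoutExcluded(
  [{ name: "my-app" }, { name: "vendor-copy" }, { name: "notes" }],
  "vendor-copy, old-fork"
);
console.log(kept.map((repo) => repo.name));
```

Names in the list that match no repository are simply ignored, and an empty or missing parameter leaves the repo list unchanged.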
I don't think we should really expect such a level of intelligence from a project such as this. I think the KISS principle would apply here.

Using default 100 repos is stupid

Not stupid. Easy. The API won't let you get more in a single request, so multiple requests would need to be made.
Making more calls means more code, more work, potentially more issues, etc.

jcubic · 2021-01-30T21:23:24Z

By stupid I mean the default 100 with sorting like this. Even 10 repos is better if the sorting is done right: most starred repos, repos with the most commits, or maybe most recent commits. Anything but the default order, which looks random with a fixed seed.
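A minimal sketch of "sort first, then take a few", assuming the repos have already been fetched with a star count attached. The field name `stargazerCount`, the function name, and the sample data are assumptions here; the same idea applies to any other sort key.

```javascript
// Keep only the `limit` most-starred repos (field name `stargazerCount` is assumed).
function mostStarred(repos, limit = 10) {
  // Copy before sorting so the caller's array is left untouched.
  return [...repos]
    .sort((a, b) => b.stargazerCount - a.stargazerCount)
    .slice(0, limit);
}

const candidates = [
  { name: "tool", stargazerCount: 12 },
  { name: "copied-code", stargazerCount: 0 },
  { name: "library", stargazerCount: 340 },
];
console.log(mostStarred(candidates, 2).map((repo) => repo.name));
```

With a limit of 2, the repo with no stars drops out, so copied code that nobody starred would not influence the language stats.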

Potherca · 2021-01-31T08:48:25Z

Thank you for clarifying. 🙌

Even 10 repos is better if the sorting is done right

Yes! I completely agree with this!

I was thinking about this some more and I think the main issue here is conflicting use-cases... The GH API gives language stats for Repositories, not for Users. It might be easier to add a separate card and/or split the use-case to support both sides?

@anuraghazra If @jcubic and I were to set up some examples (using various queries and parameters mentioned in this thread), would you be available to play around with them? We could ask some of the other people in this (and linked) issues for feedback as well.

It would be a shame to let all the hard work and thoughts that went into this issue come to nothing...

anuraghazra · 2021-01-31T10:12:11Z

anuraghazra If jcubic and I were to set up some examples (using various queries and parameters mentioned in this thread), would you be available to play around with them? We could ask some of the other people in this (and linked) issues for feedback as well.

@Potherca feel free to experiment with different ways to make it more accurate. I can surely take a look at them and give some feedback.

anuraghazra · 2021-01-31T10:20:34Z

Another possible way to count language stats is by using GitHub's search API: https://docs.github.com/en/rest/reference/search#search-code

ghost · 2021-01-31T15:15:45Z

See this API (https://codetabs.com/count-loc/count-loc-online.html). It could help take into account the total lines of code of a specific repository and its language.

jcubic · 2021-01-31T15:24:19Z

It will not work for the case where someone "forks" a repo not as a git fork but as a copy with a single commit. I have one repo like this, written in C and Lua, and it would give me those languages even though I've never written a single line in them.

mushahidq · 2021-03-25T05:02:09Z

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1

HTML - 50%
JS - 50%

We would just count them.

Is the proposed solution just to count the languages that appear most in the GitHub API response?

anuraghazra · 2021-03-25T14:13:24Z

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1
HTML - 50%
JS - 50%
We would just count them.

Is the proposed solution just to count the languages that appear most in the GitHub API response?

No that sucks #481 (comment)

andreped · 2021-03-30T23:47:32Z

Repo 1 - JS x1 & HTML x1
Repo 2 - JS x1 & HTML x1
HTML - 50%
JS - 50%
We would just count them.

Is the proposed solution just to count the languages that appear most in the GitHub API response?

No that sucks #481 (comment)

Is it that bad though? I guess the problem here is that we are unsure what we want these numbers to represent. I wasn't expecting these numbers to represent the distribution of the total number of lines I wrote in each language. That would also be a bad estimate, since in C I write many more lines for a simple sum than in Python. That doesn't necessarily mean I do more C than Python. Again, it really depends on what one wants these estimates to measure.

At least for me, doing multiple projects, it would be cool to have an estimate stating, repository-wise, the most common language you have used. Initially, this is actually what I thought these numbers represented. But there are scenarios where such a measure might be suboptimal, as mentioned above, if one assumes otherwise.

I do not think there is an optimum here that suits all users. Perhaps it could be an option to support both designs, or even multiple? That would at least solve my issue, and thus make me happy :]

But perhaps having more than one design that estimates these measures might introduce even more noise into how to interpret these values... Idk anymore

foxt · 2021-04-13T14:42:23Z

It'd also be nice to be able to exclude forks.

My card currently shows 86% Python because I'm making a minor PR to a Python repo.
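Excluding forks could be as simple as the sketch below, assuming each fetched repo carries a boolean `isFork` flag. The function name, the `includeForks` switch, and the sample repo names are hypothetical; the switch only illustrates the opt-in idea mentioned earlier in the thread.

```javascript
// Drop forked repos unless the caller explicitly opts in (flag name is hypothetical).
function selectRepos(repos, includeForks = false) {
  return includeForks ? repos : repos.filter((repo) => !repo.isFork);
}

const mine = [
  { name: "my-site", isFork: false },
  { name: "big-python-project", isFork: true },
];
console.log(selectRepos(mine).map((repo) => repo.name));
console.log(selectRepos(mine, true).map((repo) => repo.name));
```

With forks excluded by default, a fork made only to send a minor PR would no longer contribute its languages to the card.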

andreped · 2021-04-13T14:58:39Z

I agree with @theLMGN .

I have a fork among my repos, which I haven't contributed to yet, but which apparently contains a shit ton of C#, which I have never used. This results in my "Most used languages" being roughly 80% C#, which is sort of funny considering I have contributed to roughly 40 open repos that are Python/C++.

I'm also wondering if C# and C++ are switched, or if C++ code is misinterpreted as C# in github-readme-stats. According to github-readme-stats I do not do any C++, but I have contributed to C++ projects (forks).

andreped · 2021-04-13T15:02:17Z

@theLMGN couldn't you just use the exclude_repo option to exclude that one repository? Since you only did a minor PR, including this repo in the calculation is probably not necessary?

anuraghazra added the help wanted, stats-card, and lang-card labels Sep 20, 2020

anuraghazra pinned this issue Sep 20, 2020

stale bot added the stale label Dec 6, 2020

anuraghazra removed the stale label Dec 8, 2020

Potherca mentioned this issue Jan 31, 2021

Language detection show languages that I've never used #450

Open

Repository owner deleted a comment from stale bot Jan 31, 2021

anuraghazra closed this as completed Apr 27, 2021

Repository owner locked and limited conversation to collaborators Apr 27, 2021

anuraghazra unpinned this issue Apr 27, 2021

This issue was moved to a discussion.
