please add an additional check for the chromium DGA test #161

Open
RoyArends opened this issue Apr 2, 2020 · 4 comments

Comments

@RoyArends

Christian, the Chromium DGA only ever issues a single label, i.e. a bare top-level domain with nothing in front of it. Your code tests for this kind of top-level domain, but it also classifies names that have subdomains under such a top-level domain as DGA, such as:

BT1SVWQM.NOE.BOUYGUESTELECOM
GT7TRSFP0.APPLIS.SI.INTERNE

etc.

Maybe add an additional classification: one category for DGA in general, of which Chromium DGA is a subset.

Roy
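A minimal sketch of the distinction requested above, assuming the 7 to 15 character TLD length range mentioned later in this thread. The function names `split_labels` and `classify` are hypothetical and do not come from ithitools.

```python
def split_labels(qname):
    """Split a DNS name into labels, ignoring a trailing root dot."""
    return [label for label in qname.rstrip(".").split(".") if label]


def classify(qname, min_len=7, max_len=15):
    """Return (is_dga, is_chromium_dga) for a query name.

    Assumption: "DGA in general" means the TLD length falls in
    [min_len, max_len]; "Chromium DGA" additionally requires that the
    name consists of that single label only.
    """
    labels = split_labels(qname)
    if not labels:
        return (False, False)
    tld = labels[-1]
    is_dga = min_len <= len(tld) <= max_len
    is_chromium = is_dga and len(labels) == 1
    return (is_dga, is_chromium)
```

With this split, `BT1SVWQM.NOE.BOUYGUESTELECOM` would still count as DGA in general, but not as Chromium DGA.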

@huitema
Collaborator

huitema commented Apr 11, 2020

@RoyArends There is a fundamental problem there. If the recursive resolver does QNAME minimization, the root will see only "BOUYGUESTELECOM" even when the client looked for "BT1SVWQM.NOE.BOUYGUESTELECOM". With your suggestion, the client's requests for BOUYGUESTELECOM would be split into two different buckets, depending on the resolver.
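To illustrate the effect, here is a simplified sketch of what name reaches the root with and without QNAME minimization. The helper `qname_seen_at_root` is hypothetical, and real resolvers differ in details that are ignored here.

```python
def qname_seen_at_root(qname, minimized):
    """Return the query name a root server would observe.

    Simplifying assumption: a minimizing resolver sends only the
    rightmost label to the root; a non-minimizing one sends the
    full name.
    """
    labels = [label for label in qname.rstrip(".").split(".") if label]
    if not labels:
        return "."
    if minimized:
        return labels[-1]
    return ".".join(labels)
```

The same client lookup thus shows up as a single-label name behind one resolver and as a multipart name behind another.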

My first suggestion would be to investigate how big the problem really is. We can do that by dumping all the NXDOMAIN query names whose TLD meets the DGA classification, and then checking how many of them are multipart names. Count those, and see what share of the DGA total they represent. If it is just a tiny fraction, then there is not much to worry about.
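A rough sketch of that measurement, assuming the NXDOMAIN names are available as a plain list of strings and that "meets the DGA classification" means a TLD length of 7 to 15. The function name `multipart_share` is hypothetical.

```python
def multipart_share(names, min_len=7, max_len=15):
    """Return (multipart_count, dga_total, fraction) for a list of names.

    A name is counted in dga_total when its TLD length is within
    [min_len, max_len]; it is counted as multipart when it has more
    than one label.
    """
    dga_total = 0
    multipart = 0
    for name in names:
        labels = [label for label in name.rstrip(".").split(".") if label]
        if not labels:
            continue
        if min_len <= len(labels[-1]) <= max_len:
            dga_total += 1
            if len(labels) > 1:
                multipart += 1
    fraction = multipart / dga_total if dga_total else 0.0
    return multipart, dga_total, fraction
```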

If the fraction is significant, then we need a secondary analysis. I would count the number of occurrences of each TLD in the "multipart DGA" category. If some TLD is used with significant frequency, we can add it to a list of "TLDs that should not be mistaken for DGA", and use that list as part of the DGA classification.
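The secondary analysis could look roughly like the following. This is only a sketch: the threshold `min_count` and the function name `frequent_multipart_tlds` are assumptions, and the 7 to 15 length range stands in for the actual DGA test.

```python
from collections import Counter


def frequent_multipart_tlds(names, min_count, min_len=7, max_len=15):
    """Return the set of TLDs seen at least min_count times in
    multipart names whose TLD length looks like DGA."""
    counts = Counter()
    for name in names:
        labels = [label for label in name.rstrip(".").split(".") if label]
        if len(labels) > 1 and min_len <= len(labels[-1]) <= max_len:
            # DNS names are case-insensitive, so normalize before counting.
            counts[labels[-1].upper()] += 1
    return {tld for tld, n in counts.items() if n >= min_count}
```

The returned set could then serve as the exclusion list consulted before a name is classified as DGA.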

If I remember correctly, BOUYGUESTELECOM used to be a registered TLD. Maybe, as a precaution, we could add all these "formerly registered TLDs" to the special-case list.

@huitema
Collaborator

huitema commented May 7, 2020

@RoyArends I am looking into this issue, and there is a tension between more precise accounting and compatibility with historic series. The current algorithm can be summarized as:

On capture, for all NXDOMAIN queries:
     If the TLD name matches one of "special use names", classify as RFC6761 (metric M3.3.1);
     Else if the TLD name belongs to the "most frequent" list, classify as frequent (M3.3.2);
     Else if the TLD matches one of the special patterns (numeric, ipv4, bad syntax...),
         list in the corresponding category;
     Else classify as "size(length)".

When computing metrics:
    Look at the most frequent patterns among length_N, numeric, IPv4, etc.
    If the pattern is seen often enough: display and count as part of metric M3.3.3,
    else do not display, summarize as part of M3.3.4.

When performing sum_m3 analysis:
    For all "length" patterns:
         If the length is between 7 and 15, count as "dga"
         If length is 16 or larger, count as "jumbo"
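The capture and sum_m3 steps summarized above could be sketched as follows. This is not the ithitools code: the sets `SPECIAL_USE` and `FREQUENT` hold illustrative placeholder values, only two of the special patterns are modeled, and the intermediate metric computation (M3.3.3 versus M3.3.4) is left out.

```python
import re

# Placeholder lists; the real lists are assumptions not shown in this thread.
SPECIAL_USE = {"LOCALHOST", "INVALID", "TEST", "EXAMPLE"}
FREQUENT = {"HOME", "CORP", "LAN"}
VALID_LABEL = re.compile(r"[A-Za-z0-9-]+")


def classify_on_capture(qname):
    """Assign an NXDOMAIN query name to a category based on its TLD only."""
    labels = [label for label in qname.rstrip(".").split(".") if label]
    tld = labels[-1].upper() if labels else ""
    if tld in SPECIAL_USE:
        return "rfc6761"
    if tld in FREQUENT:
        return "frequent"
    if re.fullmatch(r"[0-9]+", tld):
        return "numeric"
    if not VALID_LABEL.fullmatch(tld):
        return "bad_syntax"
    return "length_%d" % len(tld)


def sum_m3_bucket(category):
    """Map a length_N category to "dga" or "jumbo"; None otherwise."""
    if not category.startswith("length_"):
        return None
    n = int(category[len("length_"):])
    if 7 <= n <= 15:
        return "dga"
    if n >= 16:
        return "jumbo"
    return None
```

Because the category depends only on the TLD, `BT1SVWQM.NOE.BOUYGUESTELECOM` ends up in `length_15` and therefore in "dga", which is the behavior this issue is about.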

To avoid breaking compatibility with the existing statistics, I propose leaving the existing accounting unchanged, but adding two new listings to the summary files:

  • multi_N: number of names found with unknown TLD of length N and multiple name parts
  • alpha_N: number of single-part names of length N that contain only letters.

Alpha_7 to Alpha_15 would match the generation algorithm used by Google. This would become the new definition of "dga". We could then compute:

  • Other multi_part: sum of multi_N
  • Other single_part: sum of all Length_N, minus DGA, and minus "multi_part".

I think that would match expectations, but I would like confirmation.
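One possible reading of the proposed accounting, as a sketch. It assumes the input contains only names that already fell into the "length" categories (unknown TLD), that "letters" means ASCII letters, and that the sums run over all N. The function name `summarize` and the result keys are hypothetical.

```python
from collections import Counter


def summarize(names):
    """Compute length_N, multi_N and alpha_N counts plus derived totals."""
    length_n, multi_n, alpha_n = Counter(), Counter(), Counter()
    for name in names:
        labels = [label for label in name.rstrip(".").split(".") if label]
        if not labels:
            continue
        n = len(labels[-1])
        length_n[n] += 1
        if len(labels) > 1:
            multi_n[n] += 1
        elif labels[0].isascii() and labels[0].isalpha():
            alpha_n[n] += 1
    # New "dga" definition: single-part, letters only, length 7 to 15.
    dga = sum(alpha_n[n] for n in range(7, 16))
    other_multi_part = sum(multi_n.values())
    other_single_part = sum(length_n.values()) - dga - other_multi_part
    return {
        "dga": dga,
        "other_multi_part": other_multi_part,
        "other_single_part": other_single_part,
    }
```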

@huitema
Collaborator

huitema commented May 8, 2020

Actually, the way the ithitools program is structured, the "leak type" has to be a function of just the TLD. We could easily see some requests for "subdomain.no-such-domain" and others for "no-such-domain". What we can easily do is, for each such domain, count both the total number of references and, in parallel, the number of references with subdomains, and export all that in the "address and names" report.
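A sketch of that parallel counting, keyed on the TLD alone. The function name `count_references` and the dictionary keys are hypothetical; the actual report format of ithitools is not shown here.

```python
from collections import defaultdict


def count_references(names):
    """For each TLD, count all references and those that had subdomains."""
    stats = defaultdict(lambda: {"total": 0, "with_subdomain": 0})
    for name in names:
        labels = [label for label in name.rstrip(".").split(".") if label]
        if not labels:
            continue
        tld = labels[-1].upper()  # DNS names are case-insensitive
        stats[tld]["total"] += 1
        if len(labels) > 1:
            stats[tld]["with_subdomain"] += 1
    return dict(stats)
```

A TLD whose `with_subdomain` count is zero across many references is consistent with Chromium-style probes, while a high ratio points to a leaked private suffix.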

