Revise root server analysis script #164

Open
huitema opened this issue May 5, 2020 · 0 comments
Collaborator

huitema commented May 5, 2020

The secondary analysis of M3 data starts with the script
load_l_root_folders.py, which uses the Python module m3summary.py. I
just wrote a description of what this script does on the ithitools wiki
page:
https://github.com/private-octopus/ithitools/wiki/Secondary-root-server-statistics.

We have discussed the output of this script in the past. If I
summarize correctly, the issues were:

  1. The name "useless" carries a value judgement. It would be better to
    replace it by something neutral, like "repeated".

  2. The summary lines mix atomic counts, like the number of DGA
    queries, and subtotals, like the number of NX domain queries or the
    number of "other" queries.

  3. It is unclear how the categories relate to the published M3
    submetrics (M3.1, etc.).

  4. We may want to add a few more atomic categories.

I think there are some easy fixes, without changing the ithitools code:

  1. Replace "useless" by "repeated". Possibly replace "useful" by
    something else. Suggestion?

  2. Remove the "subtotals": "queries", "nx_domain", and "others".

  3. Replace the name "rfc6761" by "other_rfc6761" for clarity -- implying
    it excludes .local and .localhost. The sum of local, localhost and
    other_rfc6761 can be used to recompute the metric M3.3.1.

  4. Add "other_frequent_names" after .mail -- total queries for
    "frequently found TLD" excluding home, lan, internal, ip, localdomain,
    corp, and mail. The sum of other_frequent_names, home, lan, internal,
    ip, localdomain, corp, and mail can be used to recompute the metric M3.3.2.

  5. Move dga and jumbo after other_frequent_names, to make it clear that
    these do not include queries to RFC6761 or frequent names.

  6. Add columns for the frequently found categories listed in M3.3:
    bad_syntax, binary, ipv4, numeric. Question: should I also add entries
    for domain names of length 1 to 6? These account for a bit more than 1%
    of all queries, and the total of dga, jumbo, bad_syntax, binary, ipv4,
    numeric and length 1-6 would match the definition of M3.3.3.

  7. Define "other_names" as the count of all NX domain queries minus
    those listed in the previous categories, so the sum of all columns from
    "local" to "other_names" equals the total number of queries.
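To make the proposed column layout concrete, here is a minimal Python sketch of fixes 3 to 7. It is not the code in m3summary.py: the function names `regroup` and `group_sums`, the dictionary-based row, and the `nx_domain_total` argument are assumptions made for illustration, and the length 1-6 columns are left out because that question is still open. It returns raw sums only; how those sums are normalized into the published metrics is not shown.

```python
# Hypothetical column groups, using the names proposed in the list above.
RFC6761_COLUMNS = ["local", "localhost", "other_rfc6761"]
FREQUENT_COLUMNS = ["home", "lan", "internal", "ip", "localdomain",
                    "corp", "mail", "other_frequent_names"]
PATTERN_COLUMNS = ["dga", "jumbo", "bad_syntax", "binary", "ipv4", "numeric"]


def regroup(counts, nx_domain_total):
    """Build a row of atomic counts, with other_names as the remainder.

    counts: dict mapping column name to query count (missing names count as 0).
    nx_domain_total: total number of NX domain queries for the same period.
    """
    listed = RFC6761_COLUMNS + FREQUENT_COLUMNS + PATTERN_COLUMNS
    row = {name: counts.get(name, 0) for name in listed}
    # other_names is whatever is left once every listed category is removed.
    row["other_names"] = nx_domain_total - sum(row.values())
    return row


def group_sums(row):
    """Sums from which M3.3.1, M3.3.2 and M3.3.3 could be recomputed."""
    return {
        "rfc6761_sum": sum(row[c] for c in RFC6761_COLUMNS),
        "frequent_sum": sum(row[c] for c in FREQUENT_COLUMNS),
        "pattern_sum": sum(row[c] for c in PATTERN_COLUMNS),
    }
```

Because other_names is computed as a remainder, the columns of a row always add up to the total passed in, and no separate subtotal column is needed.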

If we do change the ithitools code, we can add the following:

  • Change the way the overflow of the "frequent names" is computed, so
    the names matching the encoded "frequent name" list are listed as
    "other_frequent_names", instead of being lumped with the dga, jumbo and
    short names categories.

  • Split the dga definition between dga_single (single-part DGA names)
    and dga_multi (multi-part DGA names).

  • Add the "jumbo" pattern to the M3.3.3 list.

  • Maybe remove the metric M3.3.4, since pretty much every component of
    it will match the "frequent names" or "frequent patterns" classes.

  • Isolate other categories per suggestion.
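As an illustration of the dga_single / dga_multi split, the following Python sketch assigns a name that has already been flagged as DGA to one of the two proposed categories. It assumes that a "part" means a DNS label, which the text above does not state explicitly; the function name `dga_subcategory` is hypothetical, and the existing DGA detection in ithitools is not reproduced here.

```python
def dga_subcategory(name):
    """Return "dga_single" or "dga_multi" for a name already flagged as DGA.

    Assumption: a "part" is a DNS label, so the split depends only on the
    number of non-empty labels in the queried name.
    """
    # Drop surrounding whitespace and the trailing root dot, then split.
    labels = [label for label in name.strip().rstrip(".").split(".") if label]
    return "dga_single" if len(labels) <= 1 else "dga_multi"
```

For example, under this assumption a query for "xkqjwzrtplm." would be counted as dga_single, and a query for "abcdefgh.qwertyui" as dga_multi.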
