New Centroid-based Classifier #5103

Merged: 18 commits, Aug 29, 2024

Commits on Jun 18, 2021

  1. New Centroid-based Classifier

    Training:
    
    * A fixed vocabulary is set to all tokens that appear in at least 2
      samples.
    * All out-of-vocabulary tokens are discarded.
    * For every token, we set its Inverse Class Frequency (ICF) to
    `log(ct / cf) + 1` where `ct` is the total number of classes and `cf` is
    the number of classes where the token occurs.
    * Each sample is converted to a vector of `tf * icf` for every token in
    the vocabulary. `tf` is `1 + log(freq)`, where `freq` is the
    number of occurrences of the token in the given sample.
    * Samples are L2-normalized.
    * For each class (language), we compute the centroid of all its training
    samples by averaging them and L2-normalizing the result.
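The training steps above can be sketched in a few lines of Ruby. This is a minimal illustration under stated assumptions, not the code from this PR: the names `l2_normalize`, `vectorize`, and `train`, the sparse `Hash` representation, and the input shape (class name mapped to an array of token arrays) are all hypothetical.

```ruby
# Hypothetical sketch of the training steps; not Linguist's actual code.

# L2-normalize a sparse vector stored as a Hash of token => weight.
def l2_normalize(vec)
  norm = Math.sqrt(vec.values.sum { |w| w * w })
  return vec.dup if norm.zero?
  vec.transform_values { |w| w / norm }
end

# Convert one sample's tokens into an L2-normalized tf * icf vector.
# Tokens missing from icf (out of vocabulary) are discarded.
def vectorize(tokens, icf)
  counts = tokens.tally.select { |t, _| icf.key?(t) }
  l2_normalize(counts.to_h { |t, freq| [t, (1 + Math.log(freq)) * icf[t]] })
end

# samples: Hash of class name => Array of samples, each an Array of tokens.
def train(samples)
  # Vocabulary: every token that appears in at least 2 samples.
  sample_freq = Hash.new(0)
  samples.each_value do |docs|
    docs.each { |tokens| tokens.uniq.each { |t| sample_freq[t] += 1 } }
  end
  vocab = sample_freq.select { |_, n| n >= 2 }.keys

  # ICF: log(ct / cf) + 1, where cf counts the classes containing the token.
  ct = samples.size.to_f
  icf = vocab.to_h do |t|
    cf = samples.count { |_, docs| docs.any? { |tokens| tokens.include?(t) } }
    [t, Math.log(ct / cf) + 1]
  end

  # Centroid per class: mean of the normalized sample vectors, re-normalized.
  centroids = samples.transform_values do |docs|
    sum = Hash.new(0.0)
    docs.each { |tokens| vectorize(tokens, icf).each { |t, w| sum[t] += w } }
    l2_normalize(sum.transform_values { |w| w / docs.size })
  end

  { icf: icf, centroids: centroids }
end
```

A token that occurs in every class gets an ICF of exactly 1, while a token unique to one class gets `log(ct) + 1`, so class-specific tokens dominate each centroid.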
    
    Classification:
    
    * For a new sample, we get the L2-normalized vector with `tf * icf`
    terms for every known token, then classify the sample using the nearest
    centroid. Cosine similarity is used as the similarity measure.
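The classification step can be sketched the same way. Again this is an illustration, not the PR's implementation: `cosine`, `sample_vector`, and `classify` are hypothetical names, and the `icf` and `centroids` hashes are assumed to come from a prior training pass. Because cosine similarity divides by both norms, the explicit L2 normalization of the sample is folded into `cosine` here.

```ruby
# Hypothetical sketch of nearest-centroid classification; not Linguist's actual code.

# Cosine similarity between two sparse vectors (Hash of token => weight).
def cosine(a, b)
  dot = a.sum { |t, w| w * b.fetch(t, 0.0) }
  norm_a = Math.sqrt(a.values.sum { |w| w * w })
  norm_b = Math.sqrt(b.values.sum { |w| w * w })
  return 0.0 if norm_a.zero? || norm_b.zero?
  dot / (norm_a * norm_b)
end

# tf * icf weights for the known tokens of a sample; unknown tokens are dropped.
def sample_vector(tokens, icf)
  counts = tokens.tally.select { |t, _| icf.key?(t) }
  counts.to_h { |t, freq| [t, (1 + Math.log(freq)) * icf[t]] }
end

# Returns [class_name, score] for the centroid most similar to the sample.
def classify(tokens, icf, centroids)
  vec = sample_vector(tokens, icf)
  centroids.map { |name, c| [name, cosine(vec, c)] }.max_by { |_, score| score }
end
```

With unit-length centroids and a unit-length sample vector, the cosine reduces to a plain dot product, which is presumably why every vector is normalized during training.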
    smola committed Jun 18, 2021
    6644c34
  2. Fixture file is now detected as Raku

    lildude authored and smola committed Jun 18, 2021
    20b33ee
  3. Update lib/linguist/samples.rb

    Co-authored-by: Colin Seymour <colin@github.com>
    smola and lildude committed Jun 18, 2021
    ec2ca35
  4. Update test/test_classifier.rb

    Co-authored-by: Colin Seymour <colin@github.com>
    smola and lildude committed Jun 18, 2021
    9b6fa51

Commits on Jul 1, 2022

  1. 6c13235

Commits on Oct 20, 2022

  1. 1712974

Commits on Nov 14, 2022

  1. a96276c
  2. Add exec bit

    lildude committed Nov 14, 2022
    7b80d5e

Commits on Mar 6, 2023

  1. 8b709be
  2. Adjust acceptable errors

    lildude committed Mar 6, 2023
    91de502
  3. afc0417

Commits on Sep 8, 2023

  1. 37db40b

Commits on Jun 8, 2024

  1. 8e475ff

Commits on Aug 6, 2024

  1. Remove two useless samples

    lildude committed Aug 6, 2024
    1d559dd
  2. Add a better R sample

    lildude committed Aug 6, 2024
    170ebda
  3. Remove fixmes

    lildude committed Aug 6, 2024
    43716ce
  4. Remove empty lines

    lildude committed Aug 6, 2024
    1d50126

Commits on Aug 14, 2024

  1. 9d922e3