Add Student's t-test aggregation support #53692

Closed
imotov opened this issue Mar 17, 2020 · 3 comments · Fixed by #54980

@imotov
Contributor

imotov commented Mar 17, 2020

I would like to discuss adding a multivalued metrics aggregation that will apply unpaired and paired two-sample t-tests to two samples selected based on filters or fields or a combination of both.

So, an unpaired t-test might look like this:

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "test" : {
      "t_test" : {
        "filters" : [
          { "match" : { "group" : "A" }},
          { "match" : { "group" : "B" }}
        ],
        "field": "value"
      }
    }
  }
}

The paired t-test might look something like this:

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "test" : {
      "t_test" : {
        "fields" : ["before", "after"]
      }
    }
  }
}

We can also add support for scripts.

The type of the test can be specified by the user, with defaults based on the presence or absence of filters. We can support a type parameter that can be specified as paired (the default, and only supported if filters are not present), homoscedastic (equal variance), or heteroscedastic (unequal variance, the default if filters are present).
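To make the three proposed type values concrete, here is a minimal Python sketch of the textbook t statistics they correspond to. It is not the aggregation's implementation: the function name t_statistic and its arguments are hypothetical, and it returns only t and the degrees of freedom, since the p value additionally needs the Student's t CDF, which the Python standard library does not provide.

```python
import math
from statistics import mean, variance


def t_statistic(a, b, test_type="heteroscedastic"):
    """Return (t, degrees_of_freedom) for two samples.

    Hypothetical helper illustrating the textbook formulas only.
    """
    if test_type == "paired":
        # Paired test: a one-sample t-test on the per-pair differences.
        if len(a) != len(b):
            raise ValueError("paired test needs samples of equal length")
        diffs = [x - y for x, y in zip(a, b)]
        n = len(diffs)
        return mean(diffs) / math.sqrt(variance(diffs) / n), n - 1

    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)

    if test_type == "homoscedastic":
        # Equal-variance test: pool the two sample variances.
        pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
        se = math.sqrt(pooled * (1 / na + 1 / nb))
        return (mean(a) - mean(b)) / se, na + nb - 2

    if test_type == "heteroscedastic":
        # Unequal-variance (Welch) test with Welch-Satterthwaite
        # degrees of freedom.
        sa, sb = va / na, vb / nb
        df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
        return (mean(a) - mean(b)) / math.sqrt(sa + sb), df

    raise ValueError(f"unknown test type: {test_type}")
```

With a library such as SciPy available, the two-sided p value could then be derived from t and the degrees of freedom using the t distribution's survival function.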

The output will be a typical metrics aggregation with t and p values.

Alternatively, we can implement this as a pipeline aggregation. That would simplify the implementation, but it might make usage a bit more difficult and could complicate Kibana adoption. We can also consider implementing it as both a pipeline and a metric aggregation, similar to stats.

cc: @jtibshirani, @polyfractal

@imotov imotov self-assigned this Mar 17, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@polyfractal
Contributor

Just to throw out another option, we could also have unpaired_t_test and paired_t_test if we wanted to keep the syntax completely separate.

At first I was against having field vs fields... but it might simplify parsing/validation and actually be easier for the user. Still not crazy about the distinction, but I don't think it's too terrible either. I think I prefer it over having a paired: {} or unpaired: {} settings object inside the agg... those arrangements never seem to work out well.

Alternatively, we can implement this as a pipeline aggregation. That would simplify the implementation, but it might make usage a bit more difficult and could complicate Kibana adoption. We can also consider implementing it as both a pipeline and a metric aggregation, similar to stats.

Agreed, I think this would make adoption and usage a lot more difficult even though it's technically possible (and probably less code to maintain). Relying on the user to set it up correctly sounds fragile and error prone. I'd be ++ a metric, and maybe later determine if we want to add a pipeline equivalent (e.g. could be useful to compare bucket values rather than raw docs).

@imotov
Contributor Author

imotov commented Mar 25, 2020

After experimenting with the parser a little and talking to @polyfractal, we decided to modify the syntax a bit and start with the paired t-test implementation. The request will look like this:

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "test" : {
      "t_test" : {
        "a" : { "field" : "A" },
        "b" : { "field" : "B" }
      }
    }
  }
}

The unpaired t-test will be implemented in a follow up PR and will look like this:

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "test" : {
      "t_test" : {
        "a" : { "filter" : { "match" : { "group" : "A" } }, "field": "val" },
        "b" : { "filter" : { "match" : { "group" : "B" } }, "field": "val" },
        "type" : "heteroscedastic"
      }
    }
  }
}
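For anyone scripting against this syntax, the two request bodies above can be generated from a single helper. This is only a sketch that mirrors the a/b structure proposed in this comment: build_t_test_request and its parameters are hypothetical names, not part of any Elasticsearch client.

```python
import json


def build_t_test_request(a, b, test_type=None, agg_name="test"):
    """Build a search body using the proposed a/b t_test syntax.

    Hypothetical helper: `a` and `b` are dicts describing each population,
    e.g. {"field": "A"} or {"filter": {...}, "field": "val"}.
    """
    t_test = {"a": a, "b": b}
    if test_type is not None:
        # Only emit "type" when the caller sets it explicitly.
        t_test["type"] = test_type
    return {"size": 0, "aggs": {agg_name: {"t_test": t_test}}}


# Paired test over two fields of the same documents.
paired = build_t_test_request({"field": "A"}, {"field": "B"})

# Unpaired test over two filtered populations of the same field.
unpaired = build_t_test_request(
    {"filter": {"match": {"group": "A"}}, "field": "val"},
    {"filter": {"match": {"group": "B"}}, "field": "val"},
    test_type="heteroscedastic",
)

print(json.dumps(unpaired, indent=2))
```

Either dictionary can be serialized with json.dumps and sent as the body of GET logs/_search.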

imotov added a commit to imotov/elasticsearch that referenced this issue Mar 30, 2020
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.

Relates to elastic#53692
imotov added a commit that referenced this issue Apr 3, 2020
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.

Relates to #53692
imotov added a commit to imotov/elasticsearch that referenced this issue Apr 3, 2020
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.

Relates to elastic#53692
imotov added a commit to imotov/elasticsearch that referenced this issue Apr 3, 2020
Update version in the t-test agg usage stats serialization
after backport to 7.8.0

Relates to elastic#53692
imotov added a commit that referenced this issue Apr 6, 2020
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.

Relates to #53692
imotov added a commit that referenced this issue Apr 6, 2020
Update version in the t-test agg usage stats serialization
after backport to 7.8.0

Relates to #53692
imotov added a commit to imotov/elasticsearch that referenced this issue Apr 8, 2020
Adds support for filters to T-Test aggregation. The filters can be used to
select populations based on some criteria and use values from the same or
different fields.

Closes elastic#53692
imotov added a commit that referenced this issue Apr 10, 2020
Adds support for filters to T-Test aggregation. The filters can be used to
select populations based on some criteria and use values from the same or
different fields.

Closes #53692
imotov added a commit to imotov/elasticsearch that referenced this issue Apr 10, 2020
Adds support for filters to T-Test aggregation. The filters can be used to
select populations based on some criteria and use values from the same or
different fields.

Closes elastic#53692
imotov added a commit that referenced this issue Apr 13, 2020
Adds support for filters to T-Test aggregation. The filters can be used to
select populations based on some criteria and use values from the same or
different fields.

Closes #53692