-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AppMesh metric integration test #247
Conversation
863d878
to
828a85e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see aws_appmesh_mesh so you created the mesh without terraform? You might want to add the instruction here.
I don't create appmesh using terraform. All the test resources are created using a single yaml. aws-otel-test-framework/terraform/testcases/container_insight_appmesh/appmesh_traffic_sample.tpl Lines 1 to 8 in 828a85e
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just some small nits, thanks
import lombok.Data; | ||
|
||
@Data | ||
public class MetricContext { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since this validator aims to support multiple component, not just cloudwatch, maybe giving it a concrete name would be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to CloudWatchContext
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would this context include "log" and "metric"? or just metric?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. Prometheus metrics are the only one being tested. However these metrics are pushed to CloudWatch using structured log, which explains why we test metrics and logs both.
validator/src/main/java/com/amazon/aoc/services/CloudWatchService.java
Outdated
Show resolved
Hide resolved
validator/src/main/java/com/amazon/aoc/services/CloudWatchService.java
Outdated
Show resolved
Hide resolved
validator/src/main/java/com/amazon/aoc/validators/ContainerInsightMetricsValidator.java
Show resolved
Hide resolved
} | ||
|
||
@Override | ||
public void validate() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference between appmesh validation and the normal cloudwatch metric validation? seems we just do a simple check on it? is it intended or we had a "todo" for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I guess we need a discussion on how to merge the two validators. There're some presumption that breaks container insight validations in the original CWMetricValidator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, what's the path forward now? we're gonna use this simple validation instead of making the validator more generic? i'd prefer we do the latter, but if we have to do so, let's add a "todo" here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer we add todo and deal with it later, this PR is already pretty big and we are not enabling it until the related features merge into adot repo. We have plenty time to refactor those validators, both me and @pxaws will use cw metrics and log validators for other container insight related features and we can figure out the generic way along the way.
c5301cd
to
6385cf4
Compare
please let me know when you are ready for review the code :) |
366f548
to
5315575
Compare
Thanks for keeping track of the PR. Please review again. |
terraform/eks-cloudwatch/main.tf
Outdated
image_pull_policy = "Always" | ||
args = [ | ||
"--config", | ||
"/etc/eks/prometheus/config-all.yaml"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: should we use config-all.yaml
? Or make it as variable and each test case (eg, appmesh, nginx. etc) only apply their own configurations?
validator/src/main/java/com/amazon/aoc/services/CloudWatchService.java
Outdated
Show resolved
Hide resolved
68bdee0
to
19ba1a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my major concern is still, we are creating a new folder for eks, i'm still thinking about how we can minimize the duplicate parts, for example:
- we are using testing_id to create namespace,
- we are deploying aoc with configmap.
- we will need to support mocked server which is already supported in eks folder today.
the eks folder today can serve all the testcases, cortex
is just an optional parameter. this file lists all the testcases which could be running via eks folder.
i'd like to ask myself if we want to have a separate folder for some testcases,
- if the difference is in ot config, we might be able to just configure it in testcase?
- if the difference is on how to deploy otcollector, why can't we add a new "starting mode" into eks folder?
- if the difference is on how to deploy sample apps, why can't we make the deployment of sample apps configurable? or add a file under eks folder?
- if the difference is on validation, we might be able to just configure the parameters in testcase?
- how hard would it be to make eks folder more generic?
If we can have a plan around how do we maintain the separate folders, either we have solid reasons to keep them separate, or we have a vision to merge them someday, i will be okay with that.
# permissions and limitations under the License. | ||
# ------------------------------------------------------------------------- | ||
|
||
variable "provider_url" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better put variables in variables.tf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make sure we're on the same page, appmesh works as an independent stateless module, which only accepts input vars, and export variables that would be used by its caller. It self doesn't rely on any other module, that's why we don't put the variables in a common place.
} | ||
|
||
variable "sample_app_image_repo" { | ||
type = string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder do we need to specify type here as string is the default type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and i remember this is a var in common.tf, why's the reason to create one here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also the variables in common.tf
is poorly documented. I'm not sure how it is used and how it should be used. As you may see appmesh requires multiple app image, and apparently a single variable naming sample_app_image
can't fulfill my needs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you want to add a comment above sample_app_image_repo
to explain it's for multiple app images, which i was not aware of when i saw this code.
default = "" | ||
} | ||
|
||
variable "region" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like many vars are duplicate with the ones in common.tf?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default = "kubeconfig" | ||
} | ||
|
||
output "metric_dimension_namespace" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use it for debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
variable "sample_app_image_repo" { | ||
default = "611364707713.dkr.ecr.us-west-2.amazonaws.com/otel-test/container-insight-samples" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where's the source code of this sample app? i wonder if we could put it under the sample-apps folder so that we can manage it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The source code can be found following our official setup guidance, it is maintained in another AWS repo: https://github.com/aws/aws-app-mesh-examples/tree/master/walkthroughs/howto-k8s-http-headers.
Ref:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ContainerInsights-Prometheus-Sample-Workloads-appmesh-EKS.html
I didn't put it under the sample-apps folder because I personally think:
- It increases the effort to sync this copy with the original one.
- I can't see any potential user that is interested in running the integration tests locally. If this image would only be used by aoc integration test, I don't see a strong reason to maintain it. What's the benefit?
- We don't aim to test sample apps, they serve for testing aoc. A stable version of sample app is enough to meet the requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It increases the effort to sync this copy with the original one. make sense.
- I can't see any potential user that is interested in running the integration tests locally. If this image would only be used by aoc integration test, I don't see a strong reason to maintain it. What's the benefit? -> who is maintaining this sample app now? does it have workflow for auto-build? how do we track the version?
- We don't aim to test sample apps, they serve for testing aoc. A stable version of sample app is enough to meet the requirement. -> would we need performance test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i wonder if this sample app is just for aoc integration-test, why don't we move it into this repo? i'm okay with that if this sample app has multiple usages than aoc inregration-test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
who is maintaining this sample app now? does it have workflow for auto-build?
I believe it's EKS team, checking the contributors.
Per README, they've provided a deploy script that will build the image, and deploy the relating resources to the cluster (cleansing should still be taken cared of by users).
how do we track the version?
Images are for sample apps, we don't need version tracking, or so I thought.
would we need performance test?
I'd like to see it happen. We don't have concrete plan. Yet that's unrelated to sample app images IIUC.
this sample app is just for aoc integration-test, why don't we move it into this repo?
Apparently it's not for aoc integration-test, but is recommended by CloudWatch for AppMesh quick start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls help to leave source code link in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added def and source code links to the variable.
@@ -26,6 +26,7 @@ services: | |||
- "ecsTaskDefFamily=${ecs_taskdef_family}" | |||
- "--ecs-context" | |||
- "ecsTaskDefVersion=${ecs_taskdef_version}" | |||
- "--cloudwatch-context=${metric_dimension_json}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these dimension values from eks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hoping containerinsight-ecs-prometheus would reuse this configuration too.
@@ -0,0 +1,2 @@ | |||
# this file is defined in validator/src/main/resources/validations | |||
validation_config="eks-container-insight.yml" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i guess this folder is just for eks container insight but not for ecs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed the folder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ship it. i still have many questions which i hope we address them with opening an issue though.
Add some thoughts for people who are interested in this comment. Please note this is still an open question needs to be addressed later in the future. Correction: Can containerinsight-eks as well as containerinsight-ecs be merged into existing modules? Without these instructions and agreed contracts, it's hard for contributors to follow what is unspoken. |
what's the biggest intention bring us to create a new folder? can we define more test cases in this folder? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. thx
Just to make myself clear, I'm not opposed to merge container insight test cases to
They clearly wouldn't be guided by my PR, but documents of
Add more thoughts on the testing,
|
Created an issue: #258. I thought merging the folders is the only one left unresolved, isn't it? |
and merging the validation part i guess? |
Yup. Added. Thanks! |
i'm not sure if it's easy to achieve running multiple test together, we made assumption that each test is running separately and rely on github workflow for parallelization. |
A separate |
probably check https://github.com/nektos/act, which enable us run workflow locally. It was not too mature last year where credentials can not be used, not sure if it's okay to use now. And we will need to think about do we have such requirement of running test simultaneously in local, normally we develop one testcase one time, and should be able to use workflow to run regression test. From resource consumption wise, although tests are running together, the resources are still the same, the cpu/memory consumption should be the same, each sample apps/collector will still take its resource. Furthermore, since these tests are short-time test, i'm not sure if we really care about the consumption. We might be concerning on soaking and performance test, but it's better to isolate these tests on instance level so that we can get a better tuning result. So far I don't see too much benefit to run test simultaneously, happy to know if we have any idea on it. |
Some statistics: We had 6 nodes in the testing cluster, for all the workload in containerinsight, it requires a successful creation of 19 Pods (sample, traffic generator, component control plane). How many integration flow can be executed at one time without possible impact of resource shortage. If contributors want to run on their on EKS cluster, how many nodes is recommended to be created for our framework? They pay for it. |
@wyTrivail Thanks for the careful review and patient discussion with me about the understanding of design and what the framework should/would be! It was a great honour to share opinions with you. But I think there's too much information in this PR and I like to end the discussion for now. I believe most of the topics can be better understood by future maintainers. |
Why do we need it?
Add container insight integration test of collecting AppMesh metrics on EKS