Add initial support for indexing to data streams #4409

Merged
axw merged 15 commits into elastic:master from apm-indexing-strategy on Nov 19, 2020

Member

@axw axw commented Nov 11, 2020

Motivation/summary

This PR introduces initial support for indexing events into data streams, with the new apm-server.data_streams.enabled config. When this is set, the server will:

  • add data_stream.{type,dataset,namespace} fields to each published event
  • use these fields to route the events to the appropriate data stream ($type-$dataset-$namespace)
  • disable ILM setup by default
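
The routing rule in the list above can be sketched as follows. This is a minimal Python illustration, not the server's Go implementation: the function name `data_stream_name` and the sample event are hypothetical, and only the `$type-$dataset-$namespace` pattern comes from the description.

```python
def data_stream_name(event: dict) -> str:
    """Build the target data stream name from an event's data_stream.* fields."""
    ds = event["data_stream"]
    # Pattern described in the PR: $type-$dataset-$namespace.
    return "{}-{}-{}".format(ds["type"], ds["dataset"], ds["namespace"])


# Hypothetical event; the dataset value is made up for illustration.
sample_event = {
    "data_stream": {"type": "traces", "dataset": "my_service", "namespace": "default"}
}
```

Calling `data_stream_name(sample_event)` on this made-up event would give `traces-my_service-default`.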

Still TODO, in followups:

  • system tests
  • set apm-server.data_streams.enabled when in Fleet mode
  • make data_stream.namespace configurable (using config received from Elastic Agent)
  • improve config validation, to ensure things like output.elasticsearch.indices cannot be set in conjunction with apm-server.data_streams.enabled

Checklist

I have considered changes for:
- [ ] documentation (later maybe; more likely this will be undocumented and set indirectly by using Fleet)
- [ ] logging (add log lines, choose appropriate log selector, etc.)
- [ ] metrics and monitoring (create issue for Kibana team to add metrics to visualizations, e.g. Kibana#44001)

How to test these changes

  • Run apm-server with -E apm-server.data_streams.enabled=true
  • Send some transactions, spans, errors, and runtime metrics
  • Confirm that transactions and spans go into traces-*, errors into logs-*, and runtime metrics into metrics-*.
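
The expected placement in the last step can be written down as a small check. This is a sketch under assumptions: the event kinds used as keys ("transaction", "span", "error", "metric") and the helper name `in_expected_stream` are illustrative, and any index names passed in are made up. Only the traces/logs/metrics mapping is taken from the steps above.

```python
# Mapping taken from the test steps: transactions and spans -> traces-*,
# errors -> logs-*, runtime metrics -> metrics-*.
EXPECTED_PREFIX = {
    "transaction": "traces-",
    "span": "traces-",
    "error": "logs-",
    "metric": "metrics-",
}


def in_expected_stream(event_kind: str, index_name: str) -> bool:
    """Return True if index_name carries the prefix expected for event_kind."""
    return index_name.startswith(EXPECTED_PREFIX[event_kind])
```

A manual check could feed in the index names returned by Elasticsearch for each kind of event and flag any that land under the wrong prefix.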

Related issues

Partially addresses #4378

@apmmachine
Contributor

apmmachine commented Nov 11, 2020

💔 Tests Failed

Build stats

  • Build Cause: [axw commented: jenkins run the tests please]

  • Start Time: 2020-11-19T02:59:58.618+0000

  • Duration: 58 min 21 sec

Test stats 🧪

Test Results
Failed 7
Passed 4747
Skipped 145
Total 4899

Test errors 7

Build and Test / APM Integration Tests / test_dotnet_error – tests.agent.test_dotnet

    Error details:

     AssertionError: Expected 1, queried 3555

    Stacktrace:

     dotnet = <tests.fixtures.agents.Agent object at 0x7f2a80eec850>
    
        @pytest.mark.version
        @pytest.mark.dotnet
        def test_dotnet_error(dotnet):
            utils.check_agent_error(
    >           dotnet.oof, dotnet.apm_server.elasticsearch)
    
    tests/agent/test_dotnet.py:18: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    tests/utils.py:23: in check_agent_error
        check_elasticsearch_count(elasticsearch, ct, processor='error')
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    elasticsearch = <tests.fixtures.es.es.<locals>.Elasticsearch object at 0x7f2a80eec9d0>
    expected = 1, processor = 'error'
    query = {'query': {'term': {'processor.name': 'error'}}}
    
        def check_elasticsearch_count(elasticsearch,
                                      expected,
                                      processor='transaction',
                                      query=None):
            if query is None:
                query = {'query': {'term': {'processor.name': processor}}}
        
            actual = 0
            retries = 0
            max_retries = 3
            while actual != expected and retries < max_retries:
                try:
                    actual = elasticsearch.count(query)
                    retries += 1
                    time.sleep(10)
                except TimeoutError:
                    retries += 1
                    actual = -1
        
            assert actual == expected, "Expected {}, queried {}".format(
    >           expected, actual)
    E       AssertionError: Expected 1, queried 3555
    
    tests/utils.py:62: AssertionError 
    

Build and Test / APM Integration Tests / test_go_nethttp_error – tests.agent.test_go

    Error details:

     AssertionError: Expected 1, queried 3549

    Stacktrace:

     go_nethttp = <tests.fixtures.agents.Agent object at 0x7f2a783c7d10>
    
        @pytest.mark.version
        @pytest.mark.go_nethttp
        def test_go_nethttp_error(go_nethttp):
            utils.check_agent_error(
    >           go_nethttp.oof, go_nethttp.apm_server.elasticsearch)
    
    tests/agent/test_go.py:18: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    tests/utils.py:23: in check_agent_error
        check_elasticsearch_count(elasticsearch, ct, processor='error')
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    elasticsearch = <tests.fixtures.es.es.<locals>.Elasticsearch object at 0x7f2a80eec9d0>
    expected = 1, processor = 'error'
    query = {'query': {'term': {'processor.name': 'error'}}}
    
        def check_elasticsearch_count(elasticsearch,
                                      expected,
                                      processor='transaction',
                                      query=None):
            if query is None:
                query = {'query': {'term': {'processor.name': processor}}}
        
            actual = 0
            retries = 0
            max_retries = 3
            while actual != expected and retries < max_retries:
                try:
                    actual = elasticsearch.count(query)
                    retries += 1
                    time.sleep(10)
                except TimeoutError:
                    retries += 1
                    actual = -1
        
            assert actual == expected, "Expected {}, queried {}".format(
    >           expected, actual)
    E       AssertionError: Expected 1, queried 3549
    
    tests/utils.py:62: AssertionError 
    

Build and Test / APM Integration Tests / test_java_spring_error – tests.agent.test_java

    Error details:

     AssertionError: Expected 1, queried 3525

    Stacktrace:

     java_spring = <tests.fixtures.agents.Agent object at 0x7f2a80e69490>
    
        @pytest.mark.java_spring
        def test_java_spring_error(java_spring):
            utils.check_agent_error(
    >           java_spring.oof, java_spring.apm_server.elasticsearch)
    
    tests/agent/test_java.py:16: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    tests/utils.py:23: in check_agent_error
        check_elasticsearch_count(elasticsearch, ct, processor='error')
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    elasticsearch = <tests.fixtures.es.es.<locals>.Elasticsearch object at 0x7f2a80eec9d0>
    expected = 1, processor = 'error'
    query = {'query': {'term': {'processor.name': 'error'}}}
    
        def check_elasticsearch_count(elasticsearch,
                                      expected,
                                      processor='transaction',
                                      query=None):
            if query is None:
                query = {'query': {'term': {'processor.name': processor}}}
        
            actual = 0
            retries = 0
            max_retries = 3
            while actual != expected and retries < max_retries:
                try:
                    actual = elasticsearch.count(query)
                    retries += 1
                    time.sleep(10)
                except TimeoutError:
                    retries += 1
                    actual = -1
        
            assert actual == expected, "Expected {}, queried {}".format(
    >           expected, actual)
    E       AssertionError: Expected 1, queried 3525
    
    tests/utils.py:62: AssertionError 
    

Build and Test / APM Integration Tests / test_express_error – tests.agent.test_nodejs

    Error details:

     AssertionError: Expected 1, queried 3593

    Stacktrace:

     express = <tests.fixtures.agents.Agent object at 0x7f2a5828d250>
    
        @pytest.mark.version
        def test_express_error(express):
            utils.check_agent_error(
    >           express.oof, express.apm_server.elasticsearch)
    
    tests/agent/test_nodejs.py:15: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    tests/utils.py:23: in check_agent_error
        check_elasticsearch_count(elasticsearch, ct, processor='error')
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    elasticsearch = <tests.fixtures.es.es.<locals>.Elasticsearch object at 0x7f2a80eec9d0>
    expected = 1, processor = 'error'
    query = {'query': {'term': {'processor.name': 'error'}}}
    
        def check_elasticsearch_count(elasticsearch,
                                      expected,
                                      processor='transaction',
                                      query=None):
            if query is None:
                query = {'query': {'term': {'processor.name': processor}}}
        
            actual = 0
            retries = 0
            max_retries = 3
            while actual != expected and retries < max_retries:
                try:
                    actual = elasticsearch.count(query)
                    retries += 1
                    time.sleep(10)
                except TimeoutError:
                    retries += 1
                    actual = -1
        
            assert actual == expected, "Expected {}, queried {}".format(
    >           expected, actual)
    E       AssertionError: Expected 1, queried 3593
    
    tests/utils.py:62: AssertionError 
    

Build and Test / APM Integration Tests / test_flask_error – tests.agent.test_python

    Error details:

     AssertionError: Expected 2, queried 3630

    Stacktrace:

     flask = <tests.fixtures.agents.Agent object at 0x7f2a58294190>
    
        @pytest.mark.version
        @pytest.mark.flask
        def test_flask_error(flask):
            # one from exception handler, one from logging handler
    >       utils.check_agent_error(flask.oof, flask.apm_server.elasticsearch, ct=2)
    
    tests/agent/test_python.py:17: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    tests/utils.py:23: in check_agent_error
        check_elasticsearch_count(elasticsearch, ct, processor='error')
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    elasticsearch = <tests.fixtures.es.es.<locals>.Elasticsearch object at 0x7f2a80eec9d0>
    expected = 2, processor = 'error'
    query = {'query': {'term': {'processor.name': 'error'}}}
    
        def check_elasticsearch_count(elasticsearch,
                                      expected,
                                      processor='transaction',
                                      query=None):
            if query is None:
                query = {'query': {'term': {'processor.name': processor}}}
        
            actual = 0
            retries = 0
            max_retries = 3
            while actual != expected and retries < max_retries:
                try:
                    actual = elasticsearch.count(query)
                    retries += 1
                    time.sleep(10)
                except TimeoutError:
                    retries += 1
                    actual = -1
        
            assert actual == expected, "Expected {}, queried {}".format(
    >           expected, actual)
    E       AssertionError: Expected 2, queried 3630
    
    tests/utils.py:62: AssertionError 
    

Build and Test / APM Integration Tests / test_django_error – tests.agent.test_python

    Error details:

     AssertionError: Expected 1, queried 3621

    Stacktrace:

     django = <tests.fixtures.agents.Agent object at 0x7f2a5828d610>
    
        @pytest.mark.version
        @pytest.mark.django
        def test_django_error(django):
    >       utils.check_agent_error(django.oof, django.apm_server.elasticsearch)
    
    tests/agent/test_python.py:66: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    tests/utils.py:23: in check_agent_error
        check_elasticsearch_count(elasticsearch, ct, processor='error')
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    elasticsearch = <tests.fixtures.es.es.<locals>.Elasticsearch object at 0x7f2a80eec9d0>
    expected = 1, processor = 'error'
    query = {'query': {'term': {'processor.name': 'error'}}}
    
        def check_elasticsearch_count(elasticsearch,
                                      expected,
                                      processor='transaction',
                                      query=None):
            if query is None:
                query = {'query': {'term': {'processor.name': processor}}}
        
            actual = 0
            retries = 0
            max_retries = 3
            while actual != expected and retries < max_retries:
                try:
                    actual = elasticsearch.count(query)
                    retries += 1
                    time.sleep(10)
                except TimeoutError:
                    retries += 1
                    actual = -1
        
            assert actual == expected, "Expected {}, queried {}".format(
    >           expected, actual)
    E       AssertionError: Expected 1, queried 3621
    
    tests/utils.py:62: AssertionError 
    

Build and Test / APM Integration Tests / test_rails_error – tests.agent.test_ruby

    Error details:

     AssertionError: Expected 1, queried 3655

    Stacktrace:

     rails = <tests.fixtures.agents.Agent object at 0x7f2a783ea110>
    
        @pytest.mark.version
        @pytest.mark.rails
        def test_rails_error(rails):
            utils.check_agent_error(
    >           rails.oof, rails.apm_server.elasticsearch)
    
    tests/agent/test_ruby.py:18: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    tests/utils.py:23: in check_agent_error
        check_elasticsearch_count(elasticsearch, ct, processor='error')
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    elasticsearch = <tests.fixtures.es.es.<locals>.Elasticsearch object at 0x7f2a80eec9d0>
    expected = 1, processor = 'error'
    query = {'query': {'term': {'processor.name': 'error'}}}
    
        def check_elasticsearch_count(elasticsearch,
                                      expected,
                                      processor='transaction',
                                      query=None):
            if query is None:
                query = {'query': {'term': {'processor.name': processor}}}
        
            actual = 0
            retries = 0
            max_retries = 3
            while actual != expected and retries < max_retries:
                try:
                    actual = elasticsearch.count(query)
                    retries += 1
                    time.sleep(10)
                except TimeoutError:
                    retries += 1
                    actual = -1
        
            assert actual == expected, "Expected {}, queried {}".format(
    >           expected, actual)
    E       AssertionError: Expected 1, queried 3655
    
    tests/utils.py:62: AssertionError 
    

Steps errors 3

Compress

  • Took 0 min 0 sec
  • Description: tar --exclude=coverage-files.tgz -czf coverage-files.tgz coverage

Compress

  • Took 0 min 0 sec
  • Description: tar --exclude=system-tests-linux-files.tgz -czf system-tests-linux-files.tgz system-tests

Test Sync

  • Took 3 min 23 sec
  • Description: ./script/jenkins/sync.sh

Log output (last 100 lines)

[2020-11-19T03:28:30.546Z] --- PASS: TestTransactionAggregationShutdown (3.21s)
[2020-11-19T03:28:30.546Z] === RUN   TestServiceDestinationAggregation
[2020-11-19T03:28:30.546Z] --- PASS: TestServiceDestinationAggregation (2.43s)
[2020-11-19T03:28:30.546Z] === RUN   TestAPIKeyCreate
[2020-11-19T03:28:30.546Z] --- PASS: TestAPIKeyCreate (2.44s)
[2020-11-19T03:28:30.546Z] === RUN   TestAPIKeyCreateExpiration
[2020-11-19T03:28:30.546Z] --- PASS: TestAPIKeyCreateExpiration (2.02s)
[2020-11-19T03:28:30.546Z] === RUN   TestAPIKeyInvalidateName
[2020-11-19T03:28:30.546Z] --- PASS: TestAPIKeyInvalidateName (3.04s)
[2020-11-19T03:28:30.546Z] === RUN   TestAPIKeyInvalidateID
[2020-11-19T03:28:30.546Z] --- PASS: TestAPIKeyInvalidateID (2.01s)
[2020-11-19T03:28:30.546Z] === RUN   TestAPMServerEnvironment
[2020-11-19T03:28:30.546Z] === RUN   TestAPMServerEnvironment/container
[2020-11-19T03:28:30.546Z] === PAUSE TestAPMServerEnvironment/container
[2020-11-19T03:28:30.546Z] === RUN   TestAPMServerEnvironment/systemd
[2020-11-19T03:28:30.546Z] === PAUSE TestAPMServerEnvironment/systemd
[2020-11-19T03:28:30.546Z] === RUN   TestAPMServerEnvironment/macos_service
[2020-11-19T03:28:30.546Z] === PAUSE TestAPMServerEnvironment/macos_service
[2020-11-19T03:28:30.546Z] === RUN   TestAPMServerEnvironment/windows_service
[2020-11-19T03:28:30.546Z] === PAUSE TestAPMServerEnvironment/windows_service
[2020-11-19T03:28:30.546Z] === CONT  TestAPMServerEnvironment/container
[2020-11-19T03:28:30.546Z] === CONT  TestAPMServerEnvironment/macos_service
[2020-11-19T03:28:30.546Z] === CONT  TestAPMServerEnvironment/systemd
[2020-11-19T03:28:30.546Z] === CONT  TestAPMServerEnvironment/windows_service
[2020-11-19T03:28:30.546Z] --- PASS: TestAPMServerEnvironment (0.00s)
[2020-11-19T03:28:30.546Z]     --- PASS: TestAPMServerEnvironment/container (0.40s)
[2020-11-19T03:28:30.546Z]     --- PASS: TestAPMServerEnvironment/macos_service (0.43s)
[2020-11-19T03:28:30.546Z]     --- PASS: TestAPMServerEnvironment/windows_service (0.43s)
[2020-11-19T03:28:30.546Z]     --- PASS: TestAPMServerEnvironment/systemd (0.49s)
[2020-11-19T03:28:30.547Z] === RUN   TestAPMServerInstrumentation
[2020-11-19T03:28:30.547Z] --- PASS: TestAPMServerInstrumentation (2.87s)
[2020-11-19T03:28:30.547Z] === RUN   TestJaegerGRPC
[2020-11-19T03:28:30.547Z] --- PASS: TestJaegerGRPC (2.64s)
[2020-11-19T03:28:30.547Z] === RUN   TestJaegerGRPCSampling
[2020-11-19T03:28:30.547Z] --- PASS: TestJaegerGRPCSampling (2.40s)
[2020-11-19T03:28:30.547Z] === RUN   TestAPMServerRequestLoggingValid
[2020-11-19T03:28:30.547Z] --- PASS: TestAPMServerRequestLoggingValid (0.23s)
[2020-11-19T03:28:30.547Z] === RUN   TestAPMServerMonitoring
[2020-11-19T03:28:30.547Z] --- PASS: TestAPMServerMonitoring (1.28s)
[2020-11-19T03:28:30.547Z] === RUN   TestAPMServerMonitoringBuiltinUser
[2020-11-19T03:28:30.547Z] --- PASS: TestAPMServerMonitoringBuiltinUser (1.99s)
[2020-11-19T03:28:30.547Z] === RUN   TestAPMServerOnboarding
[2020-11-19T03:28:30.547Z] --- PASS: TestAPMServerOnboarding (2.32s)
[2020-11-19T03:28:30.547Z] === RUN   TestRUMXForwardedFor
[2020-11-19T03:28:30.547Z] --- PASS: TestRUMXForwardedFor (2.43s)
[2020-11-19T03:28:30.547Z] === RUN   TestKeepUnsampled
[2020-11-19T03:28:30.547Z] === RUN   TestKeepUnsampled/false
[2020-11-19T03:28:30.547Z] === RUN   TestKeepUnsampled/true
[2020-11-19T03:28:30.547Z] --- PASS: TestKeepUnsampled (4.86s)
[2020-11-19T03:28:30.547Z]     --- PASS: TestKeepUnsampled/false (2.44s)
[2020-11-19T03:28:30.547Z]     --- PASS: TestKeepUnsampled/true (2.42s)
[2020-11-19T03:28:30.547Z] === RUN   TestTailSampling
[2020-11-19T03:28:30.547Z]     sampling_test.go:136: waiting for 100 "parent" transactions
[2020-11-19T03:28:30.547Z]     sampling_test.go:136: waiting for 100 "child" transactions
[2020-11-19T03:28:30.547Z] --- PASS: TestTailSampling (3.86s)
[2020-11-19T03:28:30.547Z] === RUN   TestTailSamplingUnlicensed
[2020-11-19T03:28:30.547Z] 2020/11/19 03:27:54 Starting container id: 993fe2cc0a34 image: quay.io/testcontainers/ryuk:0.2.3
[2020-11-19T03:28:30.547Z] 2020/11/19 03:27:54 Waiting for container id 993fe2cc0a34 image: quay.io/testcontainers/ryuk:0.2.3
[2020-11-19T03:28:30.547Z] 2020/11/19 03:27:55 Container is ready id: 993fe2cc0a34 image: quay.io/testcontainers/ryuk:0.2.3
[2020-11-19T03:28:30.547Z] 2020/11/19 03:27:55 Starting container id: 0b71551a704a image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0-SNAPSHOT
[2020-11-19T03:28:30.547Z] 2020/11/19 03:27:55 Waiting for container id 0b71551a704a image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0-SNAPSHOT
[2020-11-19T03:28:30.547Z] 2020/11/19 03:28:14 Container is ready id: 0b71551a704a image: docker.elastic.co/elasticsearch/elasticsearch:8.0.0-SNAPSHOT
[2020-11-19T03:28:30.547Z] --- PASS: TestTailSamplingUnlicensed (35.43s)
[2020-11-19T03:28:30.547Z] PASS
[2020-11-19T03:28:30.547Z] ok  	github.com/elastic/apm-server/systemtest	81.014s
[2020-11-19T03:28:30.547Z] === RUN   TestAPMServer
[2020-11-19T03:28:30.547Z] 2020/11/19 03:27:07 Building apm-server...
[2020-11-19T03:28:30.547Z] 2020/11/19 03:27:09 Built /var/lib/jenkins/workspace/pm-server_apm-server-mbp_PR-4409/src/github.com/elastic/apm-server/apm-server
[2020-11-19T03:28:30.547Z] --- PASS: TestAPMServer (4.80s)
[2020-11-19T03:28:30.547Z] === RUN   TestUnstartedAPMServer
[2020-11-19T03:28:30.547Z] --- PASS: TestUnstartedAPMServer (0.00s)
[2020-11-19T03:28:30.547Z] === RUN   TestExpvar
[2020-11-19T03:28:30.547Z] --- PASS: TestExpvar (0.28s)
[2020-11-19T03:28:30.547Z] PASS
[2020-11-19T03:28:30.547Z] ok  	github.com/elastic/apm-server/systemtest/apmservertest	5.088s
[2020-11-19T03:28:30.547Z] ?   	github.com/elastic/apm-server/systemtest/estest	[no test files]
[2020-11-19T03:28:30.547Z] + cleanup
[2020-11-19T03:28:30.547Z] + rm -rf /tmp/tmp.jHyWJbYSyx
[2020-11-19T03:28:30.547Z] + .ci/scripts/docker-get-logs.sh
[2020-11-19T03:28:31.529Z] Post stage
[2020-11-19T03:28:31.539Z] Running in /var/lib/jenkins/workspace/pm-server_apm-server-mbp_PR-4409/src/github.com/elastic/apm-server
[2020-11-19T03:28:31.563Z] Archiving artifacts
[2020-11-19T03:28:31.868Z] Recording test results
[2020-11-19T03:28:32.679Z] [WARN] tar: pathPrefix parameter is deprecated.
[2020-11-19T03:28:33.036Z] + tar --version
[2020-11-19T03:28:33.362Z] + tar --exclude=system-tests-linux-files.tgz -czf system-tests-linux-files.tgz system-tests
[2020-11-19T03:28:33.362Z] tar: system-tests: Cannot stat: No such file or directory
[2020-11-19T03:28:33.362Z] tar: Exiting with failure status due to previous errors
[2020-11-19T03:28:33.374Z] [INFO] system-tests-linux-files.tgz was not compressed or archived : script returned exit code 2
[2020-11-19T03:57:54.941Z] [INFO] For detailed information see: https://apm-ci.elastic.co/job/apm-integration-tests-selector-mbp/job/master/11890/display/redirect
[2020-11-19T03:58:16.786Z] Copied 64 artifacts from "APM Integration Test MBP Selector » master" build number 11890
[2020-11-19T03:58:17.852Z] Post stage
[2020-11-19T03:58:17.861Z] Recording test results
[2020-11-19T03:58:19.054Z] Running on Jenkins in /var/lib/jenkins/workspace/pm-server_apm-server-mbp_PR-4409
[2020-11-19T03:58:19.136Z] [INFO] getVaultSecret: Getting secrets
[2020-11-19T03:58:19.409Z] Masking supported pattern matches of $VAULT_ADDR or $VAULT_ROLE_ID or $VAULT_SECRET_ID
[2020-11-19T03:58:20.148Z] + chmod 755 generate-build-data.sh
[2020-11-19T03:58:20.149Z] + ./generate-build-data.sh https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-server/apm-server-mbp/PR-4409/ https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-server/apm-server-mbp/PR-4409/runs/18 UNSTABLE 3501269
[2020-11-19T03:58:20.699Z] INFO: curl https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-server/apm-server-mbp/PR-4409/runs/18/steps/?limit=10000 -o steps-info.json
[2020-11-19T03:58:20.950Z] INFO: curl https://apm-ci.elastic.co/blue/rest/organizations/jenkins/pipelines/apm-server/apm-server-mbp/PR-4409/runs/18/tests/?status=FAILED -o tests-errors.json

Introduce `apm-server.data_streams.enabled` config,
which will be used by idxmgmt to route events to
data streams based on data_stream.* fields that are
expected to be in each published event.

When transforming model objects into beat.Events,
set the data_stream.{type,dataset} fields. We add
data_stream.namespace elsewhere, using an event
processor.

Handle the new apm-server.data_streams.enabled
config:
 - when enabled, we add data_stream.namespace to
   all published events
 - when disabled, we remove data_stream.* fields
   from all published events

For now we just set data_stream.namespace to "default".
Later this will be based on the config received from Fleet.
There is a hack in here to inject data_stream.namespace
in all published events, since the tests do not use the
standard libbeat pipeline code.
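
The enabled/disabled behaviour described in these commit messages can be sketched as below. This is an illustration only, written in Python rather than the server's Go, and it does not reproduce the actual event processor: the function name `apply_data_streams_setting` and the dict-shaped event are assumptions, while the "default" namespace and the add/remove rules come from the text above.

```python
import copy

DEFAULT_NAMESPACE = "default"  # fixed for now, per the commit message


def apply_data_streams_setting(event: dict, enabled: bool) -> dict:
    """Return a copy of event adjusted for the data_streams.enabled setting."""
    out = copy.deepcopy(event)
    if enabled:
        # Keep data_stream.{type,dataset} as set by the model; add the namespace.
        out.setdefault("data_stream", {})["namespace"] = DEFAULT_NAMESPACE
    else:
        # Data streams are off: drop all data_stream.* fields.
        out.pop("data_stream", None)
    return out
```

Working on a copy keeps the input event untouched, which makes the two branches easy to compare side by side in a test.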
@codecov-io

codecov-io commented Nov 11, 2020

Codecov Report

Merging #4409 (1e7379e) into master (f0dc152) will increase coverage by 0.05%.
The diff coverage is 86.02%.

@@            Coverage Diff             @@
##           master    #4409      +/-   ##
==========================================
+ Coverage   76.11%   76.17%   +0.05%     
==========================================
  Files         156      158       +2     
  Lines        9713     9782      +69     
==========================================
+ Hits         7393     7451      +58     
- Misses       2320     2331      +11     

| Impacted Files | Coverage Δ |
|---|---|
| tests/fields.go | 35.77% <0.00%> (-0.34%) ⬇️ |
| idxmgmt/supporter.go | 77.98% <58.33%> (-6.40%) ⬇️ |
| beater/beater.go | 63.28% <88.23%> (+1.82%) ⬆️ |
| beater/config/config.go | 65.00% <100.00%> (+0.59%) ⬆️ |
| beater/config/data_streams.go | 100.00% <100.00%> (ø) |
| beater/http.go | 68.33% <100.00%> (+0.53%) ⬆️ |
| beater/telemetry.go | 84.00% <100.00%> (+0.66%) ⬆️ |
| datastreams/servicename.go | 100.00% <100.00%> (ø) |
| idxmgmt/supporter_factory.go | 69.69% <100.00%> (+1.95%) ⬆️ |
| model/error.go | 98.59% <100.00%> (+0.03%) ⬆️ |

... and 11 more

@axw axw requested a review from a team November 11, 2020 05:37
@axw axw marked this pull request as ready for review November 11, 2020 05:37
Contributor

@simitt simitt left a comment

Generally looks good to me, left a couple of comments.

I realized that apm-* templates are still set up when data streams are enabled, while they wouldn't be applied any more. Doesn't have to be changed in this PR but we could disable template setup in this case.

Out of curiosity - Is there a Kibana/APM issue already to adapt the UI to the data stream changes? (right now it is ofc broken when data streams are enabled)

@axw
Member Author

axw commented Nov 16, 2020

> I realized that apm-* templates are still set up when data streams are enabled, while they wouldn't be applied any more. Doesn't have to be changed in this PR but we could disable template setup in this case.

Existing integrations disable this by injecting config, e.g. https://github.com/elastic/beats/blob/b6896ee6892da8ce4567d63ee0fa962512fb8701/x-pack/elastic-agent/spec/filebeat.yml#L3. I don't mind disabling it by default when enabling data streams though. Probably a bit neater.

> Out of curiosity - Is there a Kibana/APM issue already to adapt the UI to the data stream changes? (right now it is ofc broken when data streams are enabled)

Not yet AFAIK. It's broken by default, but you can load a template and update the APM app settings in Kibana to fetch transactions and spans from traces-* instead of apm-*, etc.

@@ -111,7 +112,13 @@ func (e *Transaction) fields() common.MapStr {
func (e *Transaction) Transform(_ context.Context, _ *transform.Config) []beat.Event {
	transactionTransformations.Inc()

	// Transactions are stored in a "traces" data stream along with spans.
	dataset := datastreams.NormalizeServiceName(e.Metadata.Service.Name)
Contributor

The assumption that we don't need to differentiate spans from transactions is strong, but what if we are wrong? Isn't it cheap and harmless to add an apm.span or apm.transaction dataset prefix, just in case?

Member Author

> Isn't it cheap and harmless to add an apm.span or apm.transaction dataset prefix, just in case?

It's not necessarily harmless, as more datasets will lead to more shards. We can always revisit this in the future.

Contributor

So how much would it hurt?
I guess it's fine, but a similar argument can be made the other way around: by separating spans and transactions into different indices, it is likely that some queries in the APM UI can be changed to faster alternatives.
We would not get this benefit if we revisit this decision in the future because the UI would have to query for both index name and index content...
The opposite is not true: if we go with separated spans and transactions to start with, and later we find they don't add value (performance or user-flexibility wise), we would be able to optimize for fewer shards without any drawbacks.

Member Author

> So how much would it hurt?

We won't know for sure until we test.

This was discussed in the proposal, and the discussion concluded with this comment (from me):

"It seems to me that there's no clear answer here. In order to make progress I propose that we start out conservatively w.r.t. sharding, and have a combined data stream as described.
This will go out as experimental to start with, and we can adjust as necessary. We can replay some real data through the options to compare performance."

> I guess it's fine, but a similar argument can be made the other way around: by separating spans and transactions into different indices, it is likely that some queries in the APM UI can be changed to faster alternatives.

Span docs are currently used in two places in the UI, AFAIK:

  1. displaying an individual trace waterfall
  2. walking traces for the service map

I'm confident (1) is going to be cheap either way. We know (2) is a sub-optimal approach to building the service map, and we're investigating options to move away from it.

> We would not get this benefit if we revisit this decision in the future because the UI would have to query for both index name and index content...

Not sure I follow. We don't query spans by themselves; only ever spans and transactions together. If we did want to query only spans in the future, then we could split the data sets and make processor.event a constant_keyword; that would mean only the spans data streams would be considered in the query. This can be done without changing the index pattern, and without breaking backwards compatibility.

Anyway, as mentioned above: let's go with this for now and test it at scale. We can use an ingest node processor to split the transactions and spans into separate streams for testing.

model/profile.go (outdated, resolved)
model/metricset.go (outdated, resolved)
@axw
Member Author

axw commented Nov 17, 2020

jenkins run the tests please

@axw axw closed this Nov 17, 2020
@axw axw reopened this Nov 17, 2020
@axw
Member Author

axw commented Nov 17, 2020

jenkins run the tests please

@axw
Member Author

axw commented Nov 19, 2020

jenkins run the tests please

@axw
Member Author

axw commented Nov 19, 2020

Something very wrong is going on in the integration tests...

@axw
Member Author

axw commented Nov 19, 2020

(I'm going to try reproducing this locally, so I can debug...)

@axw
Member Author

axw commented Nov 19, 2020

Same error is occurring in #4445 ...

@axw
Member Author

axw commented Nov 19, 2020

The integration tests have been failing for a while, and the failures are unrelated to the server as far as I can see. I've captured my findings in elastic/apm-integration-testing#987.

@axw axw merged commit 09951ba into elastic:master Nov 19, 2020
@axw axw deleted the apm-indexing-strategy branch November 19, 2020 09:40
@jalvz jalvz mentioned this pull request Nov 23, 2020
15 tasks
axw added a commit to axw/apm-server that referenced this pull request Dec 10, 2020
* idxmgmt: add support for data streams

Introduce `apm-server.data_streams.enabled` config,
which will be used by idxmgmt to route events to
data streams based on data_stream.* fields that are
expected to be in each published event.

* _meta/fields.common.yml: add data_stream.* fields

* model: add data_stream.{type,dataset} fields

When transforming model objects into beat.Events,
set the data_stream.{type,dataset} fields. We add
data_stream.namespace elsewhere, using an event
processor.

* beater: handle apm-server.data_streams.enabled

Handle the new apm-server.data_streams.enabled
config:
 - when enabled, we add data_stream.namespace to
   all published events
 - when disabled, we remove data_stream.* fields
   from all published events

For now we just set data_stream.namespace to "default".
Later this will be based on the config received from Fleet.

* processor/stream/package_tests: update tests

There is a hack in here to inject data_stream.namespace
in all published events, since the tests do not use the
standard libbeat pipeline code.

* Update approvals

* Add changelog entry

* datastreams: address review comments

* changelogs: clarify feature is experimental

* _meta: specify allowed values for data_stream.type

* model: update datasets

Use a common "apm." prefix, and place the service name last.
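
As a rough illustration of the routing and namespace handling described in the commit message above, here is a minimal sketch using plain Go maps rather than the real libbeat event types. The names `event`, `dataStreamName`, and `applyDataStreams` are hypothetical and are not taken from apm-server. The flat dotted keys and the `apm.example` dataset are also assumptions made for brevity. The `$type-$dataset-$namespace` format and the `"default"` namespace come from the PR description and commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// event is a simplified stand-in for the fields of a published event.
// Assumption: flat dotted keys are used here for brevity.
type event map[string]interface{}

// dataStreamName builds "$type-$dataset-$namespace" from the data_stream.*
// fields, returning an error if any of them is missing or empty.
func dataStreamName(e event) (string, error) {
	keys := []string{"data_stream.type", "data_stream.dataset", "data_stream.namespace"}
	parts := make([]string, 0, len(keys))
	for _, key := range keys {
		v, ok := e[key].(string)
		if !ok || v == "" {
			return "", fmt.Errorf("missing field %s", key)
		}
		parts = append(parts, v)
	}
	return strings.Join(parts, "-"), nil
}

// applyDataStreams mirrors the behaviour described in the commit message:
// when enabled, set data_stream.namespace to "default"; when disabled,
// remove all data_stream.* fields from the event.
func applyDataStreams(e event, enabled bool) {
	if enabled {
		e["data_stream.namespace"] = "default"
		return
	}
	for key := range e {
		if strings.HasPrefix(key, "data_stream.") {
			delete(e, key)
		}
	}
}

func main() {
	e := event{
		"data_stream.type":    "traces",
		"data_stream.dataset": "apm.example", // placeholder dataset name
	}
	applyDataStreams(e, true)
	name, err := dataStreamName(e)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name)
}
```

With the placeholder values above, the program prints `traces-apm.example-default`. Calling `applyDataStreams(e, false)` instead strips the `data_stream.*` fields, so `dataStreamName` would then return an error.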
axw added a commit that referenced this pull request Dec 10, 2020
@simitt simitt self-assigned this Dec 23, 2020
@simitt
Contributor

simitt commented Dec 23, 2020

Tested with BC1 - sent some event data for errors, spans, transactions and metrics - everything ended up in the expected data streams.

5 participants