Data partition algorithm failing with parallel indexing tasks #334

Closed
fdeschamps opened this issue Oct 19, 2017 · 1 comment
Labels
bug Something's wrong :Track Management New operations, changes in the track format, track download changes and the like

@fdeschamps

Rally version: esrally 0.7.2

Invoked command: esrally --track=logs --offline --target-hosts=127.0.0.1:9200 --pipeline=benchmark-only --team-repository=private --challenge=index-logs

Configuration file (located in ~/.rally/rally.ini):

[meta]
config.version = 11

[system]
env.name = lu_bench

[node]
root.dir = /home/fdeschamps/.rally/benchmarks

[runtime]
java.home = /usr/lib/jvm/jdk-8-oracle-x64

[benchmarks]
local.dataset.cache = ${node:root.dir}/data

[reporting]
datastore.type = elasticsearch
datastore.host = localhost
datastore.port = 9200
datastore.secure = False
datastore.user = 
datastore.password = 

[tracks]
default.url = https://github.com/elastic/rally-tracks

[teams]
default.url = https://github.com/elastic/rally-teams

[defaults]
preserve_benchmark_candidate = False

[distributions]
release.1.url = https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-{{VERSION}}.tar.gz
release.2.url = https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/{{VERSION}}/elasticsearch-{{VERSION}}.tar.gz
release.url = https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-{{VERSION}}.tar.gz
release.cache = true

JVM version:

java version "1.8.0_112"
Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)

OS version: Ubuntu 16.04.2 LTS
Description of the problem including expected versus actual behavior:
When several index operations run in parallel, esrally fails to partition the data correctly between the clients, so documents are not indexed for the second and subsequent indices.

Steps to reproduce:

  1. Use this track:
{
  "challenges": [
    {
      "name": "index-logs",
      "schedule": [
        {
          "parallel": {
            "tasks": [
              {
                "operation": "index_logs1"
              },
              {
                "operation": "index_logs2"
              }
            ]
          }
        }
      ],
      "description": "parallel indexing of logs",
      "default": true,
      "index-settings": {
        "index.number_of_replicas": 1
      }
    }
  ],
  "description": "This test indexes all logs from 2 sources in parallel",
  "operations": [
    {
      "name": "index_logs1",
      "bulk-size": 5000,
      "operation-type": "index",
      "index": "logs1"
    },
    {
      "name": "index_logs2",
      "bulk-size": 5000,
      "operation-type": "index",
      "index": "logs2"
    }
  ],
  "indices": [
    {
      "name": "logs1",
      "types": [
        {
          "name": "logs",
          "document-count": 303058,
          "documents": "logs1.json.bz2",
          "uncompressed-bytes": 242118291,
          "compressed-bytes": 5147740,
          "mapping": "mappings.json"
        }
      ]
    },
    {
      "name": "logs2",
      "types": [
        {
          "name": "logs",
          "document-count": 15483,
          "documents": "logs2.json.bz2",
          "uncompressed-bytes": 8046294,
          "compressed-bytes": 124503,
          "mapping": "mappings.json"
        }
      ]
    }
  ],
  "short-description": "Parallel logs indexing"
}

  2. Launch the benchmark:
    esrally --track=logs --offline --target-hosts=127.0.0.1:9200 --pipeline=benchmark-only --team-repository=private --challenge=index-logs

  3. In the output, we can see that Rally uses a wrong offset for index logs2:

2017-10-19 09:49:35,715 PID:6043 rally.track INFO Client [0] will index [303058] docs starting from line offset [0] for [logs1/logs]
2017-10-19 09:49:35,715 PID:6044 rally.track INFO Client [1] will index [15483] docs starting from line offset [15483] for [logs2/logs]

It should be starting at offset 0 for both indices.

From what I've seen in the code, the partition algorithm uses client_id and total_clients to partition the data. This doesn't work with parallel index tasks because it does not take into account that the data source is different for each task.
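
To illustrate the suspected cause, here is a minimal Python sketch. It is not Rally's actual code: the `bounds` and `partition` functions, the `tasks` layout, and the assumption that each parallel task gets exactly one client with a globally numbered id are all hypothetical and chosen only to reproduce the offsets seen in the log above. It compares partitioning by the global client id with partitioning by the client's position within its own task.

```python
def bounds(total_docs, client_index, num_clients):
    """Return (offset, count) of the document slice for one client.

    Hypothetical helper: splits total_docs as evenly as possible,
    giving one extra document to the first `extra` clients.
    """
    base, extra = divmod(total_docs, num_clients)
    offset = client_index * base + min(client_index, extra)
    count = base + (1 if client_index < extra else 0)
    return offset, count


# Assumed layout: two parallel tasks, one client each, ids assigned globally.
tasks = [
    {"index": "logs1", "docs": 303058, "clients": [0]},
    {"index": "logs2", "docs": 15483, "clients": [1]},
]


def partition(task, global_client_id, use_local_index):
    """Compute the slice for a client, by global id or by task-local position."""
    num_clients = len(task["clients"])
    if use_local_index:
        # Position of this client among the clients of its own task.
        client_index = task["clients"].index(global_client_id)
    else:
        # Suspected faulty behavior: the global id is used as the slice index.
        client_index = global_client_id
    return bounds(task["docs"], client_index, num_clients)


for task in tasks:
    for client_id in task["clients"]:
        by_global = partition(task, client_id, use_local_index=False)
        by_local = partition(task, client_id, use_local_index=True)
        print(task["index"], "client", client_id,
              "global-id slice:", by_global, "task-local slice:", by_local)
```

Under these assumptions, the global-id variant gives client 1 an offset of 15483 for logs2, which is past the end of the file, while the task-local variant starts at offset 0 for both indices.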

@fdeschamps fdeschamps changed the title Data partition algorithm failing with parellel indexing tasks Data partition algorithm failling with parallel indexing tasks Oct 19, 2017
@danielmitterdorfer
Member

danielmitterdorfer commented Oct 20, 2017

That's a valid point, @fdeschamps. I have not looked closely at the code yet, but I am also pretty sure that this case is not handled so far. Thanks for reporting!

@danielmitterdorfer danielmitterdorfer added :Track Management New operations, changes in the track format, track download changes and the like bug Something's wrong labels Oct 20, 2017
@danielmitterdorfer danielmitterdorfer modified the milestones: 0.7.x, 0.7.4 Oct 20, 2017