Minimize interference of concurrent index requests #58

Closed
danielmitterdorfer opened this issue Feb 4, 2016 · 2 comments
Labels
bug Something's wrong

Comments

@danielmitterdorfer
Member

This ticket originates from a finding in #9. There we noted two things:

  1. Indexing throughput drops over time when using multiple threads. This is related to contention when reading the data file.
  2. Total indexing throughput decreases as the number of threads increases (contrary to what is expected).

In this ticket we want to tackle these two issues. The idea revolves around mmapping the data file and using the Python multiprocessing library for indexing instead of threads. We'll implement the following steps:

  1. mmap the file in the main process and determine file offsets so that all child bulk processes do roughly the same amount of work. We mmap the file to allow for fast concurrent random access.
  2. Spawn subprocesses (instead of threads) with multiprocessing.Pool which will then bulk-index the relevant parts of the data file.
  3. Collect and aggregate the results in the main process (see the sketch below).
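
A minimal sketch of these steps, assuming a newline-delimited JSON data file and the elasticsearch-py bulk helpers (the file name, index name and boundary-alignment logic are illustrative assumptions, not the final implementation):

import mmap
import multiprocessing

from elasticsearch import Elasticsearch, helpers

DATA_FILE = "documents.json"   # assumption: one JSON document per line
INDEX_NAME = "benchmark"       # assumption: target index
NUM_PROCESSES = 4


def chunk_offsets(file_name, num_chunks):
    """Step 1: mmap the file in the main process and compute (start, end)
    byte ranges of roughly equal size, advancing each boundary to the next
    newline so no document is split between two workers."""
    with open(file_name, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = mm.size()
            bounds = [0]
            for i in range(1, num_chunks):
                nl = mm.find(b"\n", size * i // num_chunks)
                bounds.append(size if nl == -1 else nl + 1)
            bounds.append(size)
    return list(zip(bounds, bounds[1:]))


def bulk_index(offsets):
    """Step 2 (worker): re-mmap the file (mappings are not picklable) and
    bulk-index this worker's byte range."""
    start, end = offsets
    es = Elasticsearch()
    with open(DATA_FILE, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            actions = ({"_index": INDEX_NAME, "_source": line.decode("utf-8")}
                       for line in mm[start:end].splitlines())
            success, _ = helpers.bulk(es, actions, raise_on_error=False)
    return success


if __name__ == "__main__":
    # Step 3: collect and aggregate the per-worker results in the main process.
    with multiprocessing.Pool(NUM_PROCESSES) as pool:
        indexed = sum(pool.map(bulk_index, chunk_offsets(DATA_FILE, NUM_PROCESSES)))
    print("documents indexed:", indexed)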
@danielmitterdorfer
Member Author

Another idea: have one reader thread that puts data on a queue, while all other threads just take data from this queue. This could provide similar benefits without adding too much complexity.
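
A rough sketch of that variant, using a bounded queue.Queue as the hand-off between a single reader thread and several indexing threads (the queue size, sentinel handling and file name are illustrative):

import queue
import threading

NUM_WORKERS = 4
SENTINEL = None  # tells a worker that the reader is done


def reader(file_name, q):
    """The only thread that touches the data file."""
    with open(file_name, "r") as f:
        for line in f:
            q.put(line)
    for _ in range(NUM_WORKERS):
        q.put(SENTINEL)


def indexer(q):
    """Takes documents off the queue; the actual indexing is elided here."""
    while True:
        doc = q.get()
        if doc is SENTINEL:
            break
        # bulk-index `doc` here (typically batching several docs per request)


if __name__ == "__main__":
    # A bounded queue keeps the reader from running arbitrarily far ahead.
    q = queue.Queue(maxsize=10000)
    threads = [threading.Thread(target=reader, args=("documents.json", q))]
    threads += [threading.Thread(target=indexer, args=(q,)) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()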

@danielmitterdorfer
Member Author

I've written a demo program that mmaps a file and reads it with 1, 2, 4 and 8 subprocesses (using the multiprocessing library), and we indeed see a speedup:

python3 demo.py 1  2.83s user 0.60s system 93% cpu 3.665 total
python3 demo.py 2  2.53s user 0.51s system 183% cpu 1.658 total
python3 demo.py 4  2.81s user 0.56s system 325% cpu 1.036 total
python3 demo.py 8  4.66s user 0.77s system 580% cpu 0.935 total
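
The demo program itself is not attached here, but an equivalent mmap-based reader could look roughly like this (the file name and the per-process work, counting newline-delimited documents, are assumptions):

import mmap
import multiprocessing
import os
import sys

DATA_FILE = "documents.json"  # assumption: the benchmarked data file


def count_docs(offsets):
    """Worker: mmap the file in this process and count the newline-delimited
    documents in its byte range; random access into the mapping is cheap."""
    start, end = offsets
    with open(DATA_FILE, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[start:end].count(b"\n")


if __name__ == "__main__":
    num_procs = int(sys.argv[1])
    size = os.path.getsize(DATA_FILE)
    bounds = [size * i // num_procs for i in range(num_procs + 1)]
    with multiprocessing.Pool(num_procs) as pool:
        print(sum(pool.map(count_docs, list(zip(bounds, bounds[1:])))))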

Contrast this with reading the same file with 1, 2, 4 and 8 threads:

python3 testReadDocs.py  1  3.51s user 0.78s system 99% cpu 4.314 total
python3 testReadDocs.py  2  3.52s user 0.78s system 99% cpu 4.316 total
python3 testReadDocs.py  4  3.48s user 0.79s system 99% cpu 4.285 total
python3 testReadDocs.py  8  3.53s user 0.78s system 99% cpu 4.318 total
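
For comparison, a thread-based equivalent of the same read test might look as follows. The flat ~99% CPU above, regardless of the thread count, is consistent with CPython's GIL serializing the Python-level work across threads:

import os
import sys
import threading

DATA_FILE = "documents.json"  # assumption, as above


def count_docs(start, end, results, i):
    """Each thread reads and scans its own byte range of the same file."""
    with open(DATA_FILE, "rb") as f:
        f.seek(start)
        results[i] = f.read(end - start).count(b"\n")


if __name__ == "__main__":
    num_threads = int(sys.argv[1])
    size = os.path.getsize(DATA_FILE)
    bounds = [size * i // num_threads for i in range(num_threads + 1)]
    results = [0] * num_threads
    threads = [threading.Thread(target=count_docs, args=(s, e, results, i))
               for i, (s, e) in enumerate(zip(bounds, bounds[1:]))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sum(results))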

Test platform in both cases:

dm@io:~ $ uname -a
Darwin io 15.3.0 Darwin Kernel Version 15.3.0: Thu Dec 10 18:40:58 PST 2015; root:xnu-3248.30.4~1/RELEASE_X86_64 x86_64
dm@io:~ $ python3 --version
Python 3.5.1
dm@io:~ $ system_profiler | grep "Apple SSD Controller" -A 21
    Apple SSD Controller:

      Vendor: Apple
      Product: SSD Controller
      Physical Interconnect: PCI
      Link Width: x4
      Link Speed: 8.0 GT/s
      Description: AHCI Version 1.30 Supported

        APPLE SSD SM0256G:

          Capacity: 251 GB (251.000.193.024 bytes)
          Model: APPLE SSD SM0256G
          Revision: BXW1SA0Q
          Serial Number: XXXXXXXXXXXXXX
          Native Command Queuing: Yes
          Queue Depth: 32
          Removable Media: No
          Detachable Drive: No
          BSD Name: disk0
          Medium Type: Solid State
          TRIM Support: Yes

But implementing this with multiprocessing adds significant complexity, which we'd like to avoid. Hence, the ticket moves to the backlog for now but stays open as a reminder.
