Minimize interference of concurrent index requests #58

Closed
danielmitterdorfer opened this issue Feb 4, 2016 · 2 comments
Labels
bug Something's wrong

Comments

@danielmitterdorfer
Member

This ticket originates from a finding in #9. There we noted two things:

  1. Indexing throughput drops over time when using multiple threads. This is related to contention when reading the data file.
  2. Total indexing throughput decreases as the number of threads increases (contrary to what is expected).

In this ticket we want to tackle these two issues. The idea revolves around mmapping the data file and using the Python multiprocessing library for indexing instead of threads. We'll implement the following steps:

  1. mmap the file in the main process and determine file offsets so that all child bulk processes do roughly the same amount of work. We mmap the file to allow for fast concurrent random access.
  2. Spawn subprocesses (instead of threads) with multiprocessing.Pool which will then bulk-index the relevant parts of the data file.
  3. Collect and aggregate the results in the main process (see the sketch below).
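
A minimal sketch of these steps, assuming a newline-delimited JSON data file and the elasticsearch-py bulk helpers (the file name, index name and boundary-alignment logic are illustrative assumptions, not the final implementation):

import mmap
import multiprocessing

from elasticsearch import Elasticsearch, helpers

DATA_FILE = "documents.json"   # assumption: one JSON document per line
INDEX_NAME = "benchmark"       # assumption: target index
NUM_PROCESSES = 4


def chunk_offsets(file_name, num_chunks):
    """Step 1: mmap the file in the main process and compute (start, end)
    byte ranges of roughly equal size, advancing each boundary to the next
    newline so no document is split between two workers."""
    with open(file_name, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            size = mm.size()
            bounds = [0]
            for i in range(1, num_chunks):
                nl = mm.find(b"\n", size * i // num_chunks)
                bounds.append(size if nl == -1 else nl + 1)
            bounds.append(size)
    return list(zip(bounds, bounds[1:]))


def bulk_index(offsets):
    """Step 2 (worker): re-mmap the file (mappings are not picklable) and
    bulk-index this worker's byte range."""
    start, end = offsets
    es = Elasticsearch()
    with open(DATA_FILE, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            actions = ({"_index": INDEX_NAME, "_source": line.decode("utf-8")}
                       for line in mm[start:end].splitlines())
            success, _ = helpers.bulk(es, actions, raise_on_error=False)
    return success


if __name__ == "__main__":
    # Step 3: collect and aggregate the per-worker results in the main process.
    with multiprocessing.Pool(NUM_PROCESSES) as pool:
        indexed = sum(pool.map(bulk_index, chunk_offsets(DATA_FILE, NUM_PROCESSES)))
    print("documents indexed:", indexed)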
@danielmitterdorfer
Member Author

Another idea: have one reader thread that puts data on a queue, while all other threads just take data from this queue. This could provide similar benefits without adding too much complexity.
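
A rough sketch of that variant, using a bounded queue.Queue as the hand-off between a single reader thread and several indexing threads (the queue size, sentinel handling and file name are illustrative):

import queue
import threading

NUM_WORKERS = 4
SENTINEL = None  # tells a worker that the reader is done


def reader(file_name, q):
    """The only thread that touches the data file."""
    with open(file_name, "r") as f:
        for line in f:
            q.put(line)
    for _ in range(NUM_WORKERS):
        q.put(SENTINEL)


def indexer(q):
    """Takes documents off the queue; the actual indexing is elided here."""
    while True:
        doc = q.get()
        if doc is SENTINEL:
            break
        # bulk-index `doc` here (typically batching several docs per request)


if __name__ == "__main__":
    # A bounded queue keeps the reader from running arbitrarily far ahead.
    q = queue.Queue(maxsize=10000)
    threads = [threading.Thread(target=reader, args=("documents.json", q))]
    threads += [threading.Thread(target=indexer, args=(q,)) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()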

@danielmitterdorfer
Member Author

I've written a demo program that mmaps a file and reads it with 1, 2, 4 and 8 subprocesses (using the multiprocessing library), and we indeed see a speedup:

python3 demo.py 1  2.83s user 0.60s system 93% cpu 3.665 total
python3 demo.py 2  2.53s user 0.51s system 183% cpu 1.658 total
python3 demo.py 4  2.81s user 0.56s system 325% cpu 1.036 total
python3 demo.py 8  4.66s user 0.77s system 580% cpu 0.935 total
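
The demo program itself is not attached here, but an equivalent mmap-based reader could look roughly like this (the file name and the per-process work, counting newline-delimited documents, are assumptions):

import mmap
import multiprocessing
import os
import sys

DATA_FILE = "documents.json"  # assumption: the benchmarked data file


def count_docs(offsets):
    """Worker: mmap the file in this process and count the newline-delimited
    documents in its byte range; random access into the mapping is cheap."""
    start, end = offsets
    with open(DATA_FILE, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[start:end].count(b"\n")


if __name__ == "__main__":
    num_procs = int(sys.argv[1])
    size = os.path.getsize(DATA_FILE)
    bounds = [size * i // num_procs for i in range(num_procs + 1)]
    with multiprocessing.Pool(num_procs) as pool:
        print(sum(pool.map(count_docs, list(zip(bounds, bounds[1:])))))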

Contrast this with reading the same file with 1, 2, 4 and 8 threads:

python3 testReadDocs.py  1  3.51s user 0.78s system 99% cpu 4.314 total
python3 testReadDocs.py  2  3.52s user 0.78s system 99% cpu 4.316 total
python3 testReadDocs.py  4  3.48s user 0.79s system 99% cpu 4.285 total
python3 testReadDocs.py  8  3.53s user 0.78s system 99% cpu 4.318 total
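
For comparison, a thread-based equivalent of the same read test might look as follows. The flat ~99% CPU above, regardless of the thread count, is consistent with CPython's GIL serializing the Python-level work across threads:

import os
import sys
import threading

DATA_FILE = "documents.json"  # assumption, as above


def count_docs(start, end, results, i):
    """Each thread reads and scans its own byte range of the same file."""
    with open(DATA_FILE, "rb") as f:
        f.seek(start)
        results[i] = f.read(end - start).count(b"\n")


if __name__ == "__main__":
    num_threads = int(sys.argv[1])
    size = os.path.getsize(DATA_FILE)
    bounds = [size * i // num_threads for i in range(num_threads + 1)]
    results = [0] * num_threads
    threads = [threading.Thread(target=count_docs, args=(s, e, results, i))
               for i, (s, e) in enumerate(zip(bounds, bounds[1:]))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sum(results))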

Test platform in both cases:

dm@io:~ $ uname -a
Darwin io 15.3.0 Darwin Kernel Version 15.3.0: Thu Dec 10 18:40:58 PST 2015; root:xnu-3248.30.4~1/RELEASE_X86_64 x86_64
dm@io:~ $ python3 --version
Python 3.5.1
dm@io:~ $ system_profiler | grep "Apple SSD Controller" -A 21
    Apple SSD Controller:

      Vendor: Apple
      Product: SSD Controller
      Physical Interconnect: PCI
      Link Width: x4
      Link Speed: 8.0 GT/s
      Description: AHCI Version 1.30 Supported

        APPLE SSD SM0256G:

          Capacity: 251 GB (251.000.193.024 bytes)
          Model: APPLE SSD SM0256G
          Revision: BXW1SA0Q
          Serial Number: XXXXXXXXXXXXXX
          Native Command Queuing: Yes
          Queue Depth: 32
          Removable Media: No
          Detachable Drive: No
          BSD Name: disk0
          Medium Type: Solid State
          TRIM Support: Yes

But implementing this with multiprocessing adds significant complexity, which we'd like to avoid. Hence, the ticket moves to the backlog for now but stays open as a reminder.
