
Implement Batch doc indexing #24

Closed
polyfractal opened this issue Mar 23, 2013 · 0 comments
The most basic form of indexing involves adding a single document at a time. This is easily accomplished with the document() method:

$doc = $sherlock->document()->index('testindexing')
                            ->type('tweet')
                            ->document(array("field" => "test"));
$response = $doc->execute();

However, there are many situations where you have a large quantity of docs to index, but not enough to justify a full Bulk index. If you index 1000 documents through the singular interface above, each request blocks the next: everything is processed serially and, in many cases, very slowly.

In these situations, Sherlock supports a Batch method, which allows parallel execution of the commands without blocking. This uses cURL's "multi-handle" feature, which lets cURL perform concurrent HTTP requests even though PHP itself is single-threaded.

Sherlock uses RollingCurl, an efficient implementation of cURL's multi-handle. This library processes requests as they complete, "rolling" through the pending queue rather than blocking on individual groups.
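
For reference, the underlying mechanism looks roughly like the sketch below. This is not Sherlock's (or RollingCurl's) actual code, just a minimal illustration of the rolling multi-handle pattern; the rollingRequests() name, the $urls list and the $windowSize parameter are assumptions made for the example:

// A minimal sketch of the "rolling" multi-handle pattern, for illustration only
function rollingRequests(array $urls, $windowSize = 10)
{
    $multi    = curl_multi_init();
    $pending  = $urls;
    $inFlight = 0;
    $running  = 0;

    // Prime the window with the first $windowSize requests
    while ($inFlight < $windowSize && !empty($pending)) {
        $ch = curl_init(array_shift($pending));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($multi, $ch);
        $inFlight++;
    }

    while ($inFlight > 0) {
        curl_multi_exec($multi, $running);
        curl_multi_select($multi, 1.0);

        // As each request completes, pull its result and immediately
        // "roll" a new request into the freed slot
        while ($info = curl_multi_info_read($multi)) {
            $done = $info['handle'];
            $body = curl_multi_getcontent($done);
            // ... process $body here ...
            curl_multi_remove_handle($multi, $done);
            curl_close($done);
            $inFlight--;

            if (!empty($pending)) {
                $ch = curl_init(array_shift($pending));
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
                curl_multi_add_handle($multi, $ch);
                $inFlight++;
            }
        }
    }

    curl_multi_close($multi);
}

Sherlock drives all of this internally; the sketch is only meant to show why the batch interface does not block on each individual document.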

To use this feature, simply add more documents to your request before calling execute():

$doc = $sherlock->document()->index('testindexing')->type('tweet');

for ($i = 0; $i < 2000; $i++) {
    $doc->document('{"field":"test"}');
}

$response = $doc->execute();

The document() syntax shown above is the cleanest, but it can also be inflexible. For example, what if some of your documents require an explicit ID, while others should have ES autogenerate one? The above syntax is insufficient.

In these cases, you may need to use an internal Sherlock class directly: Sherlock\requests\Command. This class provides full access to the properties of a command. The following example shows both PUTs (docs with IDs) and POSTs (docs without IDs) being inserted at the same time. You could even insert into different types or indices in the same request:

$batch = array();
for ($i = 0; $i < 2000; $i++) {
    $tDoc = new Sherlock\requests\Command();
    $tDoc->action('post')
         ->index('testindexing')
         ->type('tweet')
         ->data('{"field":"test"}');

    $batch[] = $tDoc;
}

for ($i = 0; $i < 2000; $i++) {
    $tDoc = new Sherlock\requests\Command();
    $tDoc->action('put')
         ->index('testindexing')
         ->type('tweet')
         ->data('{"field":"test"}')
         ->id($i);

    $batch[] = $tDoc;
}

$batchDocs = $sherlock->document();
$response = $batchDocs->documents($batch)->execute();

However, even this interface does not provide enough flexibility for all circumstances. For example, what if your data set is too large to load into memory? This is a common situation for users of ElasticSearch. Rather than making you manage loading and unloading batches from memory, Sherlock provides a BatchCommandInterface which you can implement on top of your own classes.

The BatchCommandInterface is effectively a wrapper for an iterable object. This provides maximum flexibility in how you implement the loading of data into Sherlock. A common use would be streaming data from disk, one document at a time. Rather than loading the entire data set, the iterator streams through the data only as needed.

This example shows an implementation of BatchCommandInterface where the data is pregenerated in the constructor. Obviously, that defeats the purpose of the interface, but it is sufficient to demonstrate how it works:

class CustomBatch implements Sherlock\requests\BatchCommandInterface
{
    private $commands = array();

    /**
     * Pregenerate 2000 docs to insert, just as a demonstration
     * This could easily be opening a filestream, etc
     */
    public function __construct()
    {

        for ($i = 0; $i < 2000; $i++) {
            $tDoc = new Sherlock\requests\Command();
            $tDoc->action('post')
                 ->index('testindexing')
                 ->type('tweet')
                 ->data('{"field":"test"}');

            $this->commands[] = $tDoc;
        }
    }

    /**
     * Rewind the iterator to the first command
     */
    public function rewind()
    {
        reset($this->commands);
    }

    /**
     * @return Command The command at the current position
     */
    public function current()
    {
        return current($this->commands);
    }

    /**
     * @return mixed The key of the current command
     */
    public function key()
    {
        return key($this->commands);
    }

    /**
     * Advance the iterator to the next command
     *
     * @return mixed
     */
    public function next()
    {
        return next($this->commands);
    }

    /**
     * @return bool Whether the current position holds a valid command
     */
    public function valid()
    {
        return false !== current($this->commands);
    }
}

The CustomBatch class implements the required methods of an Iterator (current(), next(), etc.). It is then used with Sherlock like this:

$batch = new CustomBatch();
$batchDocs = $sherlock->document();

$response = $batchDocs->documents($batch)->execute();

Sherlock will then internally manage streaming data from your CustomBatch object, using parallel cURL handles just as before.
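
For the streaming-from-disk use case mentioned earlier, an implementation of BatchCommandInterface might look like the following sketch. The FileStreamBatch name and the one-JSON-document-per-line file format are assumptions for illustration, not part of Sherlock:

class FileStreamBatch implements Sherlock\requests\BatchCommandInterface
{
    private $handle;
    private $line;
    private $key = 0;

    /**
     * Open the file - one JSON document per line (an assumption for this sketch)
     */
    public function __construct($path)
    {
        $this->handle = fopen($path, 'r');
    }

    public function rewind()
    {
        rewind($this->handle);
        $this->key  = 0;
        $this->line = fgets($this->handle);
    }

    /**
     * Build the Command lazily from the current line, so only one
     * document is ever held in memory at a time
     */
    public function current()
    {
        $tDoc = new Sherlock\requests\Command();
        $tDoc->action('post')
             ->index('testindexing')
             ->type('tweet')
             ->data(trim($this->line));

        return $tDoc;
    }

    public function key()
    {
        return $this->key;
    }

    public function next()
    {
        $this->key++;
        $this->line = fgets($this->handle);
    }

    public function valid()
    {
        return false !== $this->line;
    }
}

Usage is then identical to the CustomBatch example above:

$response = $sherlock->document()->documents(new FileStreamBatch('docs.json'))->execute();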
