
Implement Batch doc indexing #24

Closed
polyfractal opened this issue Mar 23, 2013 · 0 comments
The most basic form of indexing involves adding a single document at a time. This is easily accomplished with the document() method:

$doc = $sherlock->document()->index('testindexing')
                            ->type('tweet')
                            ->document(array("field" => "test"));
$response = $doc->execute();

However, there are many situations where you have a large quantity of docs to index, but not enough to justify a full Bulk index. If you index 1000 documents through the singular interface above, each request blocks the next: everything is processed serially and, in many cases, very slowly.

In these situations, Sherlock supports a Batch method, which allows parallel execution of the commands without blocking. This uses cURL's "multi-handle" feature, which lets cURL perform concurrent HTTP requests even though PHP itself is single-threaded.

Sherlock uses RollingCurl, an efficient implementation of cURL's multi-handle. This library processes requests as they complete, "rolling" through the pending queue rather than blocking on individual groups.
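
For reference, the underlying mechanism looks roughly like the sketch below. This is not Sherlock's (or RollingCurl's) actual code, just a minimal illustration of the rolling multi-handle pattern; the rollingRequests() name, the $urls list and the $windowSize parameter are assumptions made for the example:

// A minimal sketch of the "rolling" multi-handle pattern, for illustration only
function rollingRequests(array $urls, $windowSize = 10)
{
    $multi    = curl_multi_init();
    $pending  = $urls;
    $inFlight = 0;
    $running  = 0;

    // Prime the window with the first $windowSize requests
    while ($inFlight < $windowSize && !empty($pending)) {
        $ch = curl_init(array_shift($pending));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($multi, $ch);
        $inFlight++;
    }

    while ($inFlight > 0) {
        curl_multi_exec($multi, $running);
        curl_multi_select($multi, 1.0);

        // As each request completes, pull its result and immediately
        // "roll" a new request into the freed slot
        while ($info = curl_multi_info_read($multi)) {
            $done = $info['handle'];
            $body = curl_multi_getcontent($done);
            // ... process $body here ...
            curl_multi_remove_handle($multi, $done);
            curl_close($done);
            $inFlight--;

            if (!empty($pending)) {
                $ch = curl_init(array_shift($pending));
                curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
                curl_multi_add_handle($multi, $ch);
                $inFlight++;
            }
        }
    }

    curl_multi_close($multi);
}

Sherlock drives all of this internally; the sketch is only meant to show why the batch interface does not block on each individual document.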

To use this feature, simply add more documents to your request before calling execute():

$doc = $sherlock->document()->index('testindexing')->type('tweet');

for ($i = 0; $i < 2000; $i++) {
    $doc->document('{"field":"test"}');
}

$response = $doc->execute();

The document() syntax shown above is the cleanest, but it can also be inflexible. For example, what if some of your documents require an explicit ID, while others should have ES autogenerate one? The above syntax is insufficient.

In these cases, you may need to use an internal Sherlock class directly: Sherlock\requests\Command. This class provides full access to the properties of a command. The following example shows both PUTs (docs with IDs) and POSTs (docs without IDs) being inserted at the same time. You could even insert into different types or indices in the same request:

$batch = array();
for ($i = 0; $i < 2000; $i++) {
    $tDoc = new Sherlock\requests\Command();
    $tDoc->action('post')
         ->index('testindexing')
         ->type('tweet')
         ->data('{"field":"test"}');

    $batch[] = $tDoc;
}

for ($i = 0; $i < 2000; $i++) {
    $tDoc = new Sherlock\requests\Command();
    $tDoc->action('put')
         ->index('testindexing')
         ->type('tweet')
         ->data('{"field":"test"}')
         ->id($i);

    $batch[] = $tDoc;
}

$batchDocs = $sherlock->document();
$response = $batchDocs->documents($batch)->execute();

However, even this interface does not provide enough flexibility for all circumstances. For example, what if your data set is too large to load into memory? This is a common situation for users of ElasticSearch. Rather than making you manage loading and unloading batches from memory, Sherlock provides a BatchCommandInterface which you can implement on top of your own classes.

The BatchCommandInterface is effectively a wrapper for an iterable object. This provides maximum flexibility in how you implement the loading of data into Sherlock. A common use would be streaming data from disk, one document at a time. Rather than loading the entire data set, the iterator streams through the data only as needed.

This example shows an implementation of BatchCommandInterface where the data is pregenerated in the constructor. Obviously, that defeats the purpose of the interface, but it is sufficient to demonstrate how it works:

class CustomBatch implements Sherlock\requests\BatchCommandInterface
{
    private $commands = array();

    /**
     * Pregenerate 2000 docs to insert, just as a demonstration
     * This could easily be opening a filestream, etc
     */
    public function __construct()
    {

        for ($i = 0; $i < 2000; $i++) {
            $tDoc = new Sherlock\requests\Command();
            $tDoc->action('post')
                 ->index('testindexing')
                 ->type('tweet')
                 ->data('{"field":"test"}');

            $this->commands[] = $tDoc;
        }
    }

    /**
     * Rewind the iterator to the first command
     */
    public function rewind()
    {
        reset($this->commands);
    }

    /**
     * @return Command The command at the current position
     */
    public function current()
    {
        return current($this->commands);
    }

    /**
     * @return mixed The key of the current command
     */
    public function key()
    {
        return key($this->commands);
    }

    /**
     * Advance the iterator to the next command
     *
     * @return mixed
     */
    public function next()
    {
        return next($this->commands);
    }

    /**
     * @return bool Whether the current position holds a valid command
     */
    public function valid()
    {
        return false !== current($this->commands);
    }
}

The CustomBatch class implements the required methods of an Iterator (current(), next(), etc.). It is then used with Sherlock like this:

$batch = new CustomBatch();
$batchDocs = $sherlock->document();

$response = $batchDocs->documents($batch)->execute();

Sherlock will then internally manage streaming data from your CustomBatch object, using parallel cURL handles just as before.
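
For the streaming-from-disk use case mentioned earlier, an implementation of BatchCommandInterface might look like the following sketch. The FileStreamBatch name and the one-JSON-document-per-line file format are assumptions for illustration, not part of Sherlock:

class FileStreamBatch implements Sherlock\requests\BatchCommandInterface
{
    private $handle;
    private $line;
    private $key = 0;

    /**
     * Open the file - one JSON document per line (an assumption for this sketch)
     */
    public function __construct($path)
    {
        $this->handle = fopen($path, 'r');
    }

    public function rewind()
    {
        rewind($this->handle);
        $this->key  = 0;
        $this->line = fgets($this->handle);
    }

    /**
     * Build the Command lazily from the current line, so only one
     * document is ever held in memory at a time
     */
    public function current()
    {
        $tDoc = new Sherlock\requests\Command();
        $tDoc->action('post')
             ->index('testindexing')
             ->type('tweet')
             ->data(trim($this->line));

        return $tDoc;
    }

    public function key()
    {
        return $this->key;
    }

    public function next()
    {
        $this->key++;
        $this->line = fgets($this->handle);
    }

    public function valid()
    {
        return false !== $this->line;
    }
}

Usage is then identical to the CustomBatch example above:

$response = $sherlock->document()->documents(new FileStreamBatch('docs.json'))->execute();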
