However, there are many situations where you have a large quantity of docs to index, but perhaps not enough to justify a full Bulk index. If you index 1000 documents one at a time using the singular interface above, each request blocks the next: all requests are processed serially and, in many cases, very slowly.
In these situations, Sherlock supports a Batch method, which executes the commands in parallel without blocking. This uses cURL's "multi-handle" feature, which lets cURL perform concurrent HTTP requests even though PHP itself is single-threaded.
Sherlock uses RollingCurl, an efficient implementation of cURL's multi-handle. This library processes requests as they complete, "rolling" through the pending queue rather than blocking on individual groups.
To use this feature, simply add more documents to your request before calling execute():
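A sketch of what that looks like, assuming the same document()/execute() builder style used elsewhere in these docs (the exact chaining may differ in your Sherlock version):

```php
<?php
// Sketch only: builder method names are assumed from the document()/execute()
// pattern described above and may differ in your Sherlock version.
$request = $sherlock->document()
    ->index('testindexing')
    ->type('tweet');

// Queue several documents on the same request. Nothing is sent yet;
// execute() dispatches all queued documents in parallel via RollingCurl.
foreach ($docs as $doc) {
    $request->document($doc);
}

$response = $request->execute();
```

The key point is that every document added before execute() becomes part of one batch, rather than one blocking HTTP round-trip per document.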
The document() syntax shown above is the cleanest, but it can be inflexible. For example, what if some of your documents require an explicit ID, while others should have ES autogenerate one? The above syntax cannot express that.
In these cases, you may need to use an internal Sherlock class directly: Sherlock\requests\Command. This class provides full access to the properties of a command. This example shows both PUTs (docs with IDs) and POSTs (docs without IDs) being inserted at the same time. You could even insert into different types or indices in the same request:
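For instance, a sketch along these lines; the action()/index()/type()/data() calls mirror the Command usage in the iterator example further down, while the id() setter is an assumption about the Command API:

```php
<?php
// Two commands in one batch: a PUT with an explicit ID, and a POST
// that lets ES autogenerate the ID. The id() setter is an assumption.
$withId = new Sherlock\requests\Command();
$withId->action('put')
    ->index('testindexing')
    ->type('tweet')
    ->id('my-doc-1')              // explicit ID -> PUT
    ->data('{"field":"test"}');

$autoId = new Sherlock\requests\Command();
$autoId->action('post')           // no ID -> POST, ES autogenerates one
    ->index('testindexing')
    ->type('retweet')             // commands may target different types/indices
    ->data('{"field":"test"}');
```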
However, even this interface does not provide enough flexibility for all circumstances. For example, what if your data set is too large to load into memory? This is a common situation for users of ElasticSearch. Rather than making you manage batch loading and unloading from memory, Sherlock provides a BatchCommandInterface, which you can implement in your own classes.
The BatchCommandInterface is effectively a wrapper for an iterable object. This provides maximum flexibility in how you implement the loading of data into Sherlock. A common use would be streaming data from disk, one document at a time. Rather than loading the entire data set, the iterator streams through the data only as needed.
This example shows an implementation of BatchCommandInterface where the data is pregenerated in the constructor. Obviously, this is not a realistic use of the interface (the whole data set still sits in memory), but it is sufficient to demonstrate the mechanics:
class CustomBatch implements Sherlock\requests\BatchCommandInterface
{
    private $commands = array();

    /**
     * Pregenerate 2000 docs to insert, just as a demonstration.
     * This could easily be opening a filestream, etc.
     */
    public function __construct()
    {
        for ($i = 0; $i < 2000; $i++) {
            $tDoc = new Sherlock\requests\Command();
            $tDoc->action('post')
                 ->index('testindexing')
                 ->type('tweet')
                 ->data('{"field":"test"}');
            $this->commands[] = $tDoc;
        }
    }

    public function rewind()
    {
        reset($this->commands);
    }

    /**
     * @return Command
     */
    public function current()
    {
        return current($this->commands);
    }

    /**
     * @return mixed
     */
    public function key()
    {
        return key($this->commands);
    }

    /**
     * @return Command|void
     */
    public function next()
    {
        return next($this->commands);
    }

    /**
     * @return bool
     */
    public function valid()
    {
        return false !== current($this->commands);
    }
}
The CustomBatch class implements the required functions for an Iterator (current, next, etc). It is then used with Sherlock like this:
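A minimal sketch of that call, assuming a hypothetical batch() entry point that accepts any BatchCommandInterface implementation (the actual method name may differ in your Sherlock version):

```php
<?php
// Hypothetical submission call: Sherlock iterates the CustomBatch lazily,
// feeding commands into parallel cURL handles as slots become free, so
// the full 2000-command set never needs to be materialized up front.
$batch = new CustomBatch();
$response = $sherlock->batch($batch)->execute();
```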
Sherlock will then internally manage streaming data from your CustomBatch object, feeding commands to parallel cURL handles as needed.