This repository has been archived by the owner on May 29, 2019. It is now read-only.
Alex Paliarush edited this page Feb 8, 2018 · 8 revisions

Technical Vision

1. Introduce Separate Bulk API Endpoints

There were multiple proposals to introduce a queue for product import, from both Balance Internet and ComWrap. The proposals give the queue two opposite purposes:

  • parallelize product save operations, with multiple workers reading products from the queue. The problem with this approach is that concurrent saves can deadlock (see Persistence Layer)
  • guarantee that only one product is saved at a time

Queue support is necessary for scalability because it allows requests to be processed asynchronously and allows the number of workers processing products to be regulated. However, since Magento already supports a Message Queue working on top of service contracts, queue support is a secondary task relative to the Bulk API endpoint.

The Bulk API should solve the following problems:

  • have an API which accepts multiple products to be persisted at a time
  • the internal implementation of the Bulk API should guarantee that the products coming into the Bulk API are persisted without deadlocks
  • an implementation may optimize the performance of Bulk API persistence but, more importantly, it should remain backward compatible with current customizations around persistence operations
  • It must be possible for the client to send multiple Bulk API requests in parallel
  • It should be possible for Magento to process multiple Bulk API workers in parallel
  • Allow for support of other entities in the future
  • HTTP requests must complete quickly. Processing of bulk data is performed asynchronously.

Design

Interfaces will consist of a list of entities to persist, together with some metadata. Metadata can contain things like the id of the global operation or the id of an individual batch of entities.

Clients may supply a UUID for bulk items, bulk groups or bulk requests to assist with later interrogation of errors or operation / item status. This allows a client to bulk-generate its request bodies, and then assign identifiers, persist and monitor these in an efficient manner.

If a client does not specify an identifier (such as may be the case where contracts are used within Magento, or a client does not have the capability to generate one), these will be allocated by Magento.
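The identifier rules above can be illustrated with a minimal sketch. The class and function names (`BulkItem`, `BulkRequest`, `assign_missing_ids`) are hypothetical and are not part of any Magento interface; the sketch only assumes that client-supplied identifiers are kept and that missing ones are allocated as UUIDs on the server side.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class BulkItem:
    # Payload of one entity to persist (for example, product data).
    data: Dict[str, Any]
    # Client-supplied identifier; None means "let the server allocate one".
    item_id: Optional[str] = None


@dataclass
class BulkRequest:
    # The list of entities to persist.
    items: List[BulkItem]
    # Identifier of the whole bulk request (the "global operation").
    bulk_id: Optional[str] = None


def assign_missing_ids(request: BulkRequest) -> BulkRequest:
    """Keep client-supplied identifiers and allocate UUIDs for the rest."""
    if request.bulk_id is None:
        request.bulk_id = str(uuid.uuid4())
    for item in request.items:
        if item.item_id is None:
            item.item_id = str(uuid.uuid4())
    return request
```

A client that pre-generates its own identifiers can later query status by those values, while a client that sends none would read the allocated identifiers from the response.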

Bulk API endpoints should be introduced separately for every entity which is persisted in bulk.

2. Incremental Import

Support for incremental import should address the problem of third-party systems that persist multiple entities without checking whether they really changed. Every persisted entity should have a hash generated for its data. When a persistence operation starts, it loads the entity from the database and calculates the hash on the key data being imported. If the hash has not changed, the entity should be skipped.
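A minimal sketch of the hash comparison described above, under stated assumptions: the set of key fields (`KEY_FIELDS`), the use of SHA-256 over canonical JSON, and the function names are all illustrative choices, not something the proposal prescribes.

```python
import hashlib
import json
from typing import Any, Dict, Optional, Sequence

# Hypothetical set of "key data" fields that take part in the hash.
KEY_FIELDS = ("sku", "name", "price")


def entity_hash(entity: Dict[str, Any], fields: Sequence[str] = KEY_FIELDS) -> str:
    """Hash only the key fields, serialized in a stable (sorted) form."""
    key_data = {name: entity.get(name) for name in fields}
    canonical = json.dumps(key_data, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_persist(incoming: Dict[str, Any], stored: Optional[Dict[str, Any]]) -> bool:
    """Return True for new entities and for entities whose key data changed."""
    if stored is None:
        return True
    return entity_hash(incoming) != entity_hash(stored)
```

Fields outside the key set (for example, a timestamp that the source system rewrites on every export) do not affect the result, which is what lets unchanged entities be skipped.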

3. Partial Updates

Additionally, instead of requiring the whole entity to be sent when only one field has changed, the Bulk API can support partial updates to an entity.
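As an illustration only, a partial update can be thought of as merging the supplied fields over the stored entity. The function name and the flat dictionary representation are assumptions made for this sketch; nested attributes would need a deeper merge.

```python
from typing import Any, Dict


def apply_partial_update(stored: Dict[str, Any], patch: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the stored entity with only the patched fields replaced."""
    updated = dict(stored)  # shallow copy, so the stored entity is left untouched
    updated.update(patch)
    return updated
```

For example, sending `{"price": 12}` for an existing product would change the price and leave every other field as stored.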

4. Bulk API Queue

The Bulk API Queue will be built on top of the Bulk API and will map Bulk API messages to queues and consumers. Optionally, it can introduce interfaces to track batches of entities and the data synchronization process.
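The mapping of messages to a queue and consumers, with per-item status tracking, might look roughly like the following. This is a sketch using Python's in-process `queue` and `threading` modules rather than a real message broker; the message shape (`id`, `data`), the status strings and `run_consumers` are hypothetical.

```python
import queue
import threading
from typing import Any, Callable, Dict, List


def run_consumers(
    messages: List[Dict[str, Any]],
    handler: Callable[[Dict[str, Any]], None],
    worker_count: int = 2,
) -> Dict[str, str]:
    """Process queued messages with several consumers and record a status per item."""
    work: "queue.Queue[Dict[str, Any]]" = queue.Queue()
    for message in messages:
        work.put(message)

    statuses: Dict[str, str] = {}
    lock = threading.Lock()

    def consume() -> None:
        while True:
            try:
                message = work.get_nowait()
            except queue.Empty:
                return  # nothing left to process, so the consumer stops
            try:
                handler(message)
                status = "complete"
            except Exception:
                status = "failed"
            with lock:
                statuses[message["id"]] = status

    workers = [threading.Thread(target=consume) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return statuses
```

The `worker_count` parameter stands in for the ability to regulate the number of workers; the returned mapping is the kind of data a batch-tracking interface could expose.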

5. Import/Export on Top of Bulk API

Additionally, the Bulk API can be used to serve the Import / Export functionality.

Pros:

  • resolves a range of inconsistencies between Import / Export and the current APIs

Cons:

  • Bulk API performance might be worse than that of the existing Import / Export

6. Events

The bulk API should ideally support existing model events to the extent that performance is not adversely impacted.

Third-party extensions commonly perform HTTP requests, or other long-running synchronous operations based on dispatch of events such as product save. Many hundreds or thousands of these events executing during the creation or update of individual entities has the potential to reduce performance of the bulk API. In some cases, these have been implemented in a manner that will require their execution to occur within a certain context or execute some logic (such as validation) prior to an entity being persisted.

Currently, the Import / Export module and mass update actions within Magento Admin do not trigger model events, so the addition of a specific set of bulk-API related events may be considered a suitable resolution to support high-throughput, while still allowing third party modules to be notified of changes to model data.

Should new bulk events be required to achieve required performance, those considered 'blocking' may be designed to allow for parallel execution in one or more workers. Parallel processing of bulk API events will:

  • Reduce the overall processing time of the bulk API worker(s)
  • Allow an accurate operation status to be reported at the time that the bulk API worker has completed processing a message (when compared to non-blocking event execution)
  • Reduce the amount of time that the worker thread needs to sustain a database transaction (where events dispatch during existing save implementations), thereby reducing the opportunity for database deadlocks

Additionally, should new bulk API events be added, these should include events that do not block the processing and persistence of the initial request, and the documentation should state a preference for such events. This reduces the amount of time that a database transaction must be maintained.

Examples of "blocking" events:

  • bulk_update_products_before
  • bulk_update_products_prepare_before

The results of these may affect the eventual status of the bulk / bulk item operation which is reported by the API.

Examples of "non-blocking" events:

  • bulk_update_products_after

The results of these are outside of the concern for the bulk API client, so the operation status can be reported to the bulk API client prior to their execution.
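The distinction between the two kinds of events can be sketched as follows. The dispatcher class, its method names and the status strings are hypothetical; the sketch only assumes that blocking listeners run before persistence and can fail the item, while non-blocking listeners are deferred until after the status has been determined.

```python
from typing import Any, Callable, Dict, List, Tuple

Item = Dict[str, Any]
Listener = Callable[[Item], None]


class BulkEventDispatcher:
    def __init__(self) -> None:
        self.blocking: List[Listener] = []      # e.g. bulk_update_products_before
        self.non_blocking: List[Listener] = []  # e.g. bulk_update_products_after
        self.deferred: List[Tuple[Listener, Item]] = []

    def process(self, item: Item, persist: Callable[[Item], Any]) -> str:
        """Run blocking listeners, persist, and return the item status."""
        try:
            for listener in self.blocking:
                listener(item)  # may raise, e.g. on failed validation
            persist(item)
        except Exception:
            return "failed"
        # Non-blocking listeners are only queued here; they cannot change the status.
        for listener in self.non_blocking:
            self.deferred.append((listener, item))
        return "complete"

    def run_deferred(self) -> None:
        """Execute queued non-blocking listeners, e.g. in a separate worker."""
        while self.deferred:
            listener, item = self.deferred.pop(0)
            listener(item)
```

In this arrangement a slow third-party listener attached to the "after" event delays neither persistence nor the status reported to the client.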

7. Persistence Layer

The bulk API should use service contracts and thereby maintain support for existing customisations (events, plugins etc) made to entity resource models wherever possible.

Currently, deadlocks created during simultaneous insert (and to a lesser degree, update) operations occur as a result of many concurrent database connections each saving a different (single) product per HTTP request.

Persisting multiple entities via a single database transaction (and therefore connection) in the form of a Bulk API persistence worker may provide sufficient throughput. Compared to a separate performance-oriented implementation of the persistence layer, it also maintains compatibility with existing customisations made to resource models. This would not necessarily prevent multiple workers from pre-processing or preparing data / deltas for eventual insert via a single worker where data preparation proves to be an intensive operation.
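The single-transaction idea can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module and a hypothetical two-column `product` table purely for demonstration; Magento's actual schema, database engine and resource models are not represented here.

```python
import sqlite3
from typing import Iterable, Tuple


def persist_batch(conn: sqlite3.Connection, products: Iterable[Tuple[str, str]]) -> None:
    """Insert or replace all products of a batch in one transaction on one connection."""
    # Used as a context manager, the connection commits if the block succeeds
    # and rolls back if an exception is raised inside it.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO product (sku, name) VALUES (?, ?)",
            products,
        )
```

Because every row of the batch goes through one connection, rows within the batch cannot contend with each other for locks the way many concurrent single-product requests can.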