Significant API call overhead and substantial additional processing overhead on the server side when doing batch processing #381

CS76 · 2023-08-31T10:44:57Z

Maintainer

A significant shortcoming of the API presented in the paper is that every method processes just one item (typically a chemical structure encoded as a SMILES string) per call. For batch processing, this results in considerable API call overhead and substantial additional processing overhead on the server side.

For example, each getCDKSDGMol(smiles) call makes several JClass(java_class_name) calls, which are quite computationally expensive.

I tested the overall processing speed of this method against the public API endpoint (https://api.naturalproducts.net/latest) with a relatively simple molecule ('CC1(C)OC2COC3(COS(N)(=O)=O)OC(C)(C)OC3C2O1') and toolkit='cdk'. It maxed out at about 6 molecules/s with serial calls (one call at a time in a loop) and about 17 molecules/s when I sent a large number of requests concurrently.

There does not appear to be any server-side caching, so calling the function repeatedly with the same molecule (rather than supplying a different molecule in each call) should not bias the results of the timing experiment.
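For anyone who wants to repeat this kind of measurement, here is a minimal sketch of a serial-versus-concurrent throughput harness. It is not the script used for the numbers in this post. The names `throughput` and `fake_endpoint` are hypothetical, and `fake_endpoint` only sleeps to imitate a network round trip; to test a real server you would pass in a function that performs the actual HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def throughput(call: Callable[[str], object], smiles: str, n: int, workers: int = 1) -> float:
    """Return completed calls per second for n invocations of call(smiles)."""
    start = time.perf_counter()
    if workers == 1:
        # Serial case: one call at a time in a loop.
        for _ in range(n):
            call(smiles)
    else:
        # Concurrent case: up to `workers` calls in flight at once.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            list(pool.map(call, [smiles] * n))
    elapsed = time.perf_counter() - start
    return n / elapsed


def fake_endpoint(smiles: str) -> str:
    """Hypothetical stand-in for one HTTP round trip; sleeps instead of calling a server."""
    time.sleep(0.01)
    return smiles


if __name__ == "__main__":
    smi = "CC1(C)OC2COC3(COS(N)(=O)=O)OC(C)(C)OC3C2O1"
    print("serial     (molecules/s):", throughput(fake_endpoint, smi, 50))
    print("concurrent (molecules/s):", throughput(fake_endpoint, smi, 50, workers=10))
```

Threads are enough here because the work being timed is waiting on I/O, not local computation.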

With toolkit='rdkit', the measured speeds were 8 molecules/s and 36 molecules/s, respectively. I don’t have CDK installed, but the ‘native’ 2D-generation speed for the same example molecule with RDKit via the Python API on my modest ThinkPad P51 laptop was 450 molecules/s without any parallel processing optimization.

I suggest adding ‘bulk processing’ functions to the API, which would take arrays of structures (encoded as SMILES strings or molfile strings) and return arrays of results. These ‘bulk’ methods should use HTTP POST rather than GET.
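To make the suggestion concrete, here is a rough client-side sketch of what a bulk call could look like. Everything about the endpoint is an assumption: the URL `BULK_URL`, the JSON field names `toolkit` and `smiles`, and the helpers `chunked` and `build_bulk_request` are hypothetical and do not exist in the current API. The code only builds the request object and does not send it.

```python
import json
import urllib.request
from typing import Iterator, List

# Hypothetical URL: the API discussed here does not currently define a bulk endpoint.
BULK_URL = "https://example.invalid/bulk/mol2d"


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Split a list of structures into batches of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def build_bulk_request(url: str, smiles_batch: List[str], toolkit: str = "cdk") -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a whole batch as a JSON body."""
    # Assumed payload shape; a real endpoint would define its own schema.
    payload = json.dumps({"toolkit": toolkit, "smiles": smiles_batch}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    structures = ["CCO", "c1ccccc1", "CC(=O)O"]
    for batch in chunked(structures, 2):
        req = build_bulk_request(BULK_URL, batch, toolkit="rdkit")
        print(req.get_method(), len(req.data), "bytes for", len(batch), "structures")
```

Putting the batch in a POST body avoids packing many SMILES strings into a URL, where length limits and escaping of characters such as `#` and `=` become a problem.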

Enable/Support batch processing #380

CS76 · 2023-09-08T10:37:53Z

Maintainer Author

We stress-tested our current public instance (https://api.naturalproducts.net/latest) by converting the example molecule ('CC1(C)OC2COC3(COS(N)(=O)=O)OC(C)(C)OC3C2O1') to a 2D Mol (this invokes getCDKSDGMol for CDK and the corresponding RDKit equivalent). The server performance is as follows. RDKit-based SMILES to 2D Mol conversion has a 100% success rate for up to 2000 requests per second (latency below 1 s), and the server managed to process up to 10000 requests, but at a higher latency. CDK conversion maxed out at 200 requests per second (latency up to 10 s). Several factors could be influencing these timings, but we agree that bulk processing could improve CDK-based endpoint performance considerably, with classes loaded once per batch instead of once per molecule.
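The "classes loaded once per batch" idea can be illustrated with a small sketch. This is not the service's actual code: `CLASS_NAMES`, `_convert`, `convert_per_molecule`, `convert_batch`, and `CountingLoader` are hypothetical, and the loader is a counting stub so the example runs without a JVM. In the real service, `load_class` would presumably be the `JClass` lookup mentioned in the first post.

```python
from typing import Callable, Dict, List

# Hypothetical Java class names, used only to illustrate the lookup pattern.
CLASS_NAMES = ["org.example.SmilesParser", "org.example.LayoutGenerator"]


def _convert(smiles: str, classes: Dict[str, object]) -> str:
    """Placeholder for the real conversion; records how many classes were available."""
    return f"{smiles}|{len(classes)}"


def convert_per_molecule(smiles_list: List[str], load_class: Callable[[str], object]) -> List[str]:
    """Current pattern: every molecule triggers a fresh round of class lookups."""
    results = []
    for smi in smiles_list:
        classes = {name: load_class(name) for name in CLASS_NAMES}
        results.append(_convert(smi, classes))
    return results


def convert_batch(smiles_list: List[str], load_class: Callable[[str], object]) -> List[str]:
    """Bulk pattern: look the classes up once, then reuse them for the whole batch."""
    classes = {name: load_class(name) for name in CLASS_NAMES}
    return [_convert(smi, classes) for smi in smiles_list]


class CountingLoader:
    """Stub loader that counts calls; stands in for an expensive class lookup."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, name: str) -> str:
        self.calls += 1
        return name


if __name__ == "__main__":
    molecules = ["CCO", "CCN", "CCC"]
    per_molecule, batch = CountingLoader(), CountingLoader()
    convert_per_molecule(molecules, per_molecule)
    convert_batch(molecules, batch)
    print("lookups per molecule:", per_molecule.calls, "lookups per batch:", batch.calls)
```

With this stub, the per-molecule version performs one lookup per class per molecule, while the batch version performs one lookup per class regardless of batch size.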

RDKIT