
Implement creation of ip2geo feature #257

Merged
merged 10 commits into feature/ip2geo on Apr 26, 2023

Conversation

heemin32
Collaborator

  • Implementation of ip2geo datasource creation
  • Implementation of ip2geo processor creation

Description

With this implementation, a user can create an IP2Geo datasource which is updated automatically at a specified interval. After the datasource is created, a user can create an Ip2Geo processor with the datasource to convert IP addresses to geolocation data.

Issues Resolved

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov-commenter

codecov-commenter commented Apr 17, 2023

Codecov Report

Merging #257 (2752338) into feature/ip2geo (e594bf3) will decrease coverage by 24.80%.
The diff coverage is 26.69%.


@@                  Coverage Diff                  @@
##             feature/ip2geo     #257       +/-   ##
=====================================================
- Coverage             84.89%   60.09%   -24.80%     
- Complexity              378      424       +46     
=====================================================
  Files                    52       67       +15     
  Lines                  1185     2065      +880     
  Branches                 98      187       +89     
=====================================================
+ Hits                   1006     1241      +235     
- Misses                  137      777      +640     
- Partials                 42       47        +5     
Impacted Files Coverage Δ
...al/ip2geo/action/PutDatasourceTransportAction.java 0.00% <0.00%> (ø)
...geospatial/ip2geo/common/Ip2GeoExecutorHelper.java 0.00% <0.00%> (ø)
...atial/ip2geo/jobscheduler/DatasourceExtension.java 0.00% <0.00%> (ø)
...ospatial/ip2geo/jobscheduler/DatasourceRunner.java 0.00% <0.00%> (ø)
...h/geospatial/ip2geo/processor/Ip2GeoProcessor.java 5.22% <5.22%> (ø)
...arch/geospatial/ip2geo/common/GeoIpDataHelper.java 17.64% <17.64%> (ø)
...rch/geospatial/ip2geo/common/DatasourceHelper.java 31.42% <31.42%> (ø)
...rch/geospatial/ip2geo/jobscheduler/Datasource.java 41.32% <41.32%> (ø)
...geospatial/ip2geo/action/PutDatasourceRequest.java 49.12% <49.12%> (ø)
...patial/ip2geo/action/RestPutDatasourceHandler.java 50.00% <50.00%> (ø)
... and 6 more

... and 1 file with indirect coverage changes


@heemin32
Collaborator Author

Will increase test coverage through integration tests while working on the delete datasource API.

@navneet1v
Collaborator

> Will increase test coverage through integration tests while working on the delete datasource API.

did we not add enough unit tests?

Collaborator

@navneet1v left a comment

This is a big PR, is there a way we can split the PR into smaller PRs?

@heemin32
Collaborator Author

> Will increase test coverage through integration tests while working on the delete datasource API.
>
> did we not add enough unit tests?

I did, but some classes are better tested in RestTestCase. I am planning to write those when the delete datasource API is ready.

@heemin32
Collaborator Author

> This is a big PR, is there a way we can split the PR into smaller PRs?

Sorry for the big PR. I wanted to have an end-to-end complete feature for datasource creation and the Ip2Geo processor. As they are all new files, it would be multiple PRs for each package if I split them now. Will do better on the next PR for the GET and DELETE APIs.

@navneet1v
Collaborator

@heemin32 why have we added a skip change log label?

@heemin32
Collaborator Author

> @heemin32 why have we added a skip change log label?

Because the feature is disabled for now, there is no change from the end user's perspective. Planning to add a change log entry when I enable it later.


@heemin32 heemin32 requested a review from VijayanB April 20, 2023 17:44
build.gradle (outdated; resolved)
build.gradle (resolved)
Comment on lines 123 to 133
if (clusterService == null) {
throw new IllegalStateException("ClusterService is not initialized.");
}

if (threadPool == null) {
throw new IllegalStateException("ThreadPool is not initialized.");
}

if (client == null) {
throw new IllegalStateException("Client is not initialized.");
}
Collaborator

These can all be avoided if you use an AllArgsConstructor and check every parameter for null in that constructor. Doing these null checks here pollutes the function.

Collaborator Author

AllArgsConstructor cannot be used here because of the way the job scheduler creates an instance.
Let me think about what I can do here.
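A minimal sketch of the reviewer's suggestion, validating dependencies once at construction instead of inside each method. The class and field names here are hypothetical, and plain Object stands in for the real dependency types:

```java
import java.util.Objects;

// Hypothetical sketch: fail fast at construction time with a descriptive
// message, so per-call null checks are no longer needed.
class DatasourceRunnerSketch {
    private final Object clusterService;
    private final Object threadPool;
    private final Object client;

    DatasourceRunnerSketch(Object clusterService, Object threadPool, Object client) {
        this.clusterService = Objects.requireNonNull(clusterService, "ClusterService is not initialized.");
        this.threadPool = Objects.requireNonNull(threadPool, "ThreadPool is not initialized.");
        this.client = Objects.requireNonNull(client, "Client is not initialized.");
    }

    String run() {
        // The method body stays free of defensive null checks.
        return "running";
    }
}
```

Whether this is applicable depends on how the job scheduler instantiates the class, which is exactly the constraint raised in the reply above.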

@heemin32 heemin32 requested a review from VijayanB April 25, 2023 04:46
Member

@VijayanB left a comment

LGTM.

@VijayanB
Member

> Will increase test coverage through integration tests while working on the delete datasource API.
>
> did we not add enough unit tests?
>
> I did, but some classes are better tested in RestTestCase. I am planning to write those when the delete datasource API is ready.

Both unit tests and integration tests go hand in hand. I would recommend writing both as they serve different purposes.

@heemin32
Collaborator Author

The coverage test reports include both unit and integ tests. If I add integ tests to increase coverage, can you identify and tell me whether I should add more unit tests? Do we want to set the coverage report to include only unit tests? Just want to make sure that there is no double standard.

Updated: Let me add more unit tests. Please continue the review.

@VijayanB
Member

IMO, we should cover as much as possible with unit tests and shouldn't defer a use case to integration tests if it is possible with unit tests. Integration tests help us test all the connections, which is not possible with unit tests.

@heemin32
Collaborator Author

> Will increase test coverage through integration tests while working on the delete datasource API.
>
> did we not add enough unit tests?
>
> I did, but some classes are better tested in RestTestCase. I am planning to write those when the delete datasource API is ready.
>
> Both unit tests and integration tests go hand in hand. I would recommend writing both as they serve different purposes.
>
> The coverage test reports include both unit and integ tests. If I add integ tests to increase coverage, can you identify and tell me whether I should add more unit tests? Do we want to set the coverage report to include only unit tests? Just want to make sure that there is no double standard.
> Updated: Let me add more unit tests. Please continue the review.
>
> IMO, we should cover as much as possible with unit tests and shouldn't defer a use case to integration tests if it is possible with unit tests. Integration tests help us test all the connections, which is not possible with unit tests.

Yes. It looks like the coverage report is generated solely from unit tests. As this PR is already big enough, let me follow up with another PR to increase coverage. Cut a GH issue as well: #262

try {
manifest = DatasourceManifest.Builder.build(url);
} catch (Exception e) {
log.info("Error occurred while reading a file from {}", url, e);
Member

Should this be info or error?

Collaborator Author
@heemin32 commented Apr 26, 2023

Because this is a customer input error, not a system error.

this.clusterService = clusterService;
this.threadPool = threadPool;
timeout = Ip2GeoSettings.TIMEOUT_IN_SECONDS.get(settings);
clusterSettings.addSettingsUpdateConsumer(Ip2GeoSettings.TIMEOUT_IN_SECONDS, newValue -> timeout = newValue);
Member

Why add settings update consumer here?

Collaborator Author

So that the timeout value can be updated dynamically.

Member

Oh I see. So if there are multiple datasources, they will all be controlled by the same setting?

Collaborator Author

Yes.
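The pattern being discussed, a field that follows a dynamically updatable setting shared by all datasources, can be sketched in plain Java. These classes are illustrative stand-ins only, not the actual OpenSearch ClusterSettings API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Illustrative stand-in for a dynamic settings registry: updating the
// setting invokes every registered consumer with the new value.
class DynamicSettingSketch<T> {
    private T value;
    private final List<Consumer<T>> consumers = new ArrayList<>();

    DynamicSettingSketch(T initial) { this.value = initial; }

    void addUpdateConsumer(Consumer<T> consumer) { consumers.add(consumer); }

    void update(T newValue) {
        this.value = newValue;
        consumers.forEach(c -> c.accept(newValue));
    }

    T get() { return value; }
}

// A component whose timeout field tracks the shared setting, mirroring the
// clusterSettings.addSettingsUpdateConsumer(...) call in the diff. Every
// component registered against the same setting sees the same updated value,
// which is the "one setting controls all datasources" behavior confirmed above.
class TimedComponentSketch {
    volatile long timeoutSeconds;

    TimedComponentSketch(DynamicSettingSketch<Long> setting) {
        this.timeoutSeconds = setting.get();
        setting.addUpdateConsumer(newValue -> this.timeoutSeconds = newValue);
    }
}
```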

.source(jobParameter.toXContent(JsonXContent.contentBuilder(), null))
.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE)
.opType(DocWriteRequest.OpType.CREATE);
client.index(indexRequest, new ActionListener<>() {
Member

See https://github.com/opensearch-project/k-NN/pull/859/files for interacting with system index when security is enabled. Maybe take this for a future PR

Collaborator Author

Thanks

@Override
public void onResponse(final IndexResponse indexResponse) {
// This is user initiated request. Therefore, we want to handle the first datasource update task in a generic thread
// pool.
Member

qq: Could you explain this a little bit more?

Collaborator Author

There are two flows where GeoIP data is created: 1. a user calls the create datasource API; 2. auto-update after the specified interval. For the second one, we use a dedicated Ip2Geo thread pool with very limited resources, such as a single thread, to minimize the impact on cluster availability. If we used the Ip2Geo thread pool for the first case, the very first datasource creation would be delayed whenever another update is going on. Therefore, I chose to use the generic thread pool for the first case.

Member

Why would another update be going on? I'm not sure we should use a different thread pool for initial vs. update.

Collaborator Author

There can be multiple datasources. One of them might be running the auto-update process. During that time, a user can send a request to create another datasource. If we used the same thread pool, this initial creation would be delayed, as it cannot start before the ongoing update finishes.
Also, it might become more difficult to handle race conditions between multiple requests on the same datasource if handling of the first request is delayed.
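The separation described above can be sketched with two executors: a single-thread pool standing in for the dedicated Ip2Geo update pool, and a second pool standing in for the generic one. This is illustrative only; the plugin's actual thread pools are created and managed by OpenSearch:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ThreadPoolSketch {
    // Demonstrates that a user-initiated creation submitted to a separate
    // "generic" pool completes even while the dedicated single-thread
    // update pool is busy with a long-running auto-update.
    static boolean userRequestRunsWhileUpdaterBusy() {
        ExecutorService ip2geoPool = Executors.newFixedThreadPool(1);  // dedicated update pool
        ExecutorService genericPool = Executors.newFixedThreadPool(1); // stand-in for generic pool
        try {
            CountDownLatch updateRunning = new CountDownLatch(1);
            CountDownLatch releaseUpdate = new CountDownLatch(1);

            // Occupy the dedicated pool, simulating an ongoing auto-update.
            ip2geoPool.submit(() -> {
                updateRunning.countDown();
                try { releaseUpdate.await(); } catch (InterruptedException ignored) { }
            });
            updateRunning.await();

            // The creation request does not queue behind the busy update thread.
            Future<String> creation = genericPool.submit(() -> "datasource created");
            String result = creation.get(5, TimeUnit.SECONDS);

            releaseUpdate.countDown();
            return "datasource created".equals(result);
        } catch (Exception e) {
            return false;
        } finally {
            ip2geoPool.shutdown();
            genericPool.shutdown();
        }
    }
}
```

Had both tasks been submitted to the single-thread pool, the creation would have waited for the update to finish, which is the delay described in the comment.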

* PUT /_plugins/geospatial/ip2geo/datasource/{id}
* {
* "endpoint": {endpoint},
* "update_interval_in_days": 3
Member

Can we make this of type "time" so that we dont have to include "in_days" in the parameter?

Collaborator Author

Is there a "time" type? Or do we want to create our own rule to parse it?


*
* When data source is created, it starts with PREPARING state. Once the first GeoIP data is generated, the state changes to AVAILABLE.
* Only when the first GeoIP data generation failed, the state changes to FAILED.
* Subsequent GeoIP data failure won't change data source state from AVAILABLE to FAILED.
Member

Why won't it change state to FAILED?

Collaborator Author

The state here describes the entire datasource, not just one failure during an update. When an update fails, a user can still use the datasource until it expires. I think I should change the state name from FAILED to CREATE_FAILED.

Collaborator Author

Let me handle that in the next PR.
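The lifecycle described in the code comment above could be modeled as a small state machine. This is a sketch only; the actual enum in the plugin may differ, and the rename to CREATE_FAILED was only proposed in this thread:

```java
// Sketch of the datasource lifecycle: a datasource starts in PREPARING,
// moves to AVAILABLE when the first GeoIP data generation succeeds, and
// moves to FAILED only if that first generation fails. Subsequent update
// failures are recorded in update stats, not in this state.
enum DatasourceStateSketch {
    PREPARING, AVAILABLE, FAILED;

    static DatasourceStateSketch afterFirstGeneration(boolean succeeded) {
        return succeeded ? AVAILABLE : FAILED;
    }

    static DatasourceStateSketch afterSubsequentUpdate(DatasourceStateSketch current, boolean succeeded) {
        // Subsequent failures never demote AVAILABLE to FAILED, because the
        // existing data remains usable until it expires.
        return current;
    }
}
```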

Comment on lines +108 to +116
} catch (Exception e) {
log.error("Failed to create datasource for {}", jobParameter.getId(), e);
jobParameter.getUpdateStats().setLastFailedAt(Instant.now());
jobParameter.setState(DatasourceState.FAILED);
try {
DatasourceHelper.updateDatasource(client, jobParameter, timeout);
} catch (Exception ex) {
log.error("Failed to mark datasource state as FAILED for {}", jobParameter.getId(), ex);
}
Collaborator

In this case, even when there is a failure, we send the ack response as true. Any reason why we are doing that?

Collaborator Author

This is a long-running background process. Therefore, we cannot return the final result to the user as the API response. If something goes wrong during the background process, we set the state to FAILED.

Collaborator

With the FAILED state, will there be failure reasons?

Collaborator Author

Not as of now, but I will add them in the next PR.

Comment on lines 137 to 139
} catch (IOException e) {
throw new RuntimeException(e);
}
Collaborator

Throw a proper exception from here.

Collaborator Author

What do you think is a proper exception here?
throw new OpenSearchException("failed to read GeoIP data", e);?

Collaborator

Replace RuntimeException with some other exception that tells what happened. OpenSearchException can be a good candidate, along with an exception message.
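The suggestion amounts to wrapping the checked IOException in an unchecked exception whose message says what failed, rather than a bare RuntimeException. A generic sketch, using a plain RuntimeException subclass as a stand-in for OpenSearchException:

```java
import java.io.IOException;

// Stand-in for OpenSearchException: an unchecked exception carrying a
// descriptive message plus the original cause.
class GeoIpDataExceptionSketch extends RuntimeException {
    GeoIpDataExceptionSketch(String message, Throwable cause) {
        super(message, cause);
    }
}

class ExceptionWrappingSketch {
    static String readGeoIpData() {
        try {
            throw new IOException("disk read failed"); // simulate the failing read
        } catch (IOException e) {
            // Descriptive wrapper instead of: throw new RuntimeException(e);
            // The caller sees what failed without losing the root cause.
            throw new GeoIpDataExceptionSketch("failed to read GeoIP data", e);
        }
    }
}
```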

/**
* The in-memory cache for the ip2geo data. There should only be 1 instance of this class.
*/
public class Ip2GeoCache {
Collaborator

Why do we need a cache? There is already a query cache present, so why are we building another cache layer?

Collaborator Author

With this cache, we don't even need to serialize and deserialize the request and response. I can do a performance test on it.

Collaborator

I think we can add this cache implementation later on. The problems with caching are the invalidation logic, the size of the cache, and space in the heap.

Benchmarks should tell us whether we should add a cache or not.

Collaborator Author

The cache can be turned off. I will do some benchmark tests in the next PR and remove it if there is no performance advantage.
This caching strategy was there for the legacy geoip processor.
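For reference, the kind of small bounded cache under discussion can be sketched with a LinkedHashMap in access order. This is illustrative only; the plugin's Ip2GeoCache has its own implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch for ip -> geo data lookups. Bounding the size and
// evicting least-recently-used entries addresses two of the concerns raised
// above: cache size and heap usage. Invalidation still needs its own strategy.
class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCacheSketch(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU iteration order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; returning true evicts the eldest entry.
        return size() > maxEntries;
    }
}
```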

data,
new ActionListener<>() {
@Override
public void onResponse(final Object obj) {
Collaborator

Can we combine the logic of converting a list to geo data and a single element into one common place? I see a lot of commonalities.

I think there is an opportunity.

Collaborator Author

I removed some duplicated code in my next PR, which is ready. Combining them is not straightforward.

* The second call is made by ClusterApplierService. Therefore, we cannot access cluster state in the call.
* That means, we cannot even query an index inside the call.
*
* Because the processor is validated in the first call, we skip the validation in the second call.
Collaborator

If we don't skip the validation, what will happen?

Collaborator Author

It throws an exception.

Comment on lines +20 to +22
/**
* The in-memory cache for the ip2geo data. There should only be 1 instance of this class.
*/
Collaborator

How are we enforcing single-instance creation of this class?

Collaborator Author
@heemin32 commented Apr 26, 2023

We create the cache in the createComponents method of the plugin and pass it to the Ip2Geo factory. Nothing actually enforces it, but the comment describes how the cache should be used.

* Implementation of ip2geo datasource creation
* Implementation of ip2geo processor creation

Signed-off-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
1. Added ip2geo thread pool
2. Code refactoring
3. Added more comments

Signed-off-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Collaborator

@navneet1v left a comment

All good from my side on this PR. We discussed some improvements in this PR and I expect them to be added in later PRs.

This was a big PR. We could have split it into 3 PRs.

Please make sure to resolve other comments before merging the PR.

@heemin32 heemin32 merged commit 4b29b4e into opensearch-project:feature/ip2geo Apr 26, 2023
@heemin32 heemin32 deleted the ip2geo branch May 4, 2023 01:07
heemin32 added a commit to heemin32/geospatial that referenced this pull request Jul 14, 2023
* Update gradle version to 7.6 (opensearch-project#265)

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Implement creation of ip2geo feature

* Implementation of ip2geo datasource creation
* Implementation of ip2geo processor creation

Signed-off-by: Heemin Kim <heemin@amazon.com>
---------

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Co-authored-by: Vijayan Balasubramanian <balasvij@amazon.com>
heemin32 added a commit to heemin32/geospatial that referenced this pull request Jul 21, 2023
heemin32 added a commit that referenced this pull request Jul 21, 2023
* Update gradle version to 7.6 (#265)

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Exclude lombok generated code from jacoco coverage report (#268)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Make jacoco report to be generated faster in local (#267)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Update dependency org.json:json to v20230227 (#273)

Co-authored-by: mend-for-github-com[bot] <50673670+mend-for-github-com[bot]@users.noreply.github.com>

* Baseline owners and maintainers (#275)

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Add Auto Release Workflow (#288)

Signed-off-by: Naveen Tatikonda <navtat@amazon.com>

* Change package for Strings.hasText (#314)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Adding release notes for 2.8 (#323)

Signed-off-by: Martin Gaievski <gaievski@amazon.com>

* Add 2.9.0 release notes (#350)

Signed-off-by: Junqiu Lei <junqiu@amazon.com>

* Update packages according to a change in OpenSearch core (#353)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Implement creation of ip2geo feature (#257)

* Update gradle version to 7.6 (#265)

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Implement creation of ip2geo feature

* Implementation of ip2geo datasource creation
* Implementation of ip2geo processor creation

Signed-off-by: Heemin Kim <heemin@amazon.com>
---------

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Co-authored-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Added unit tests with some refactoring of codes (#271)

* Add Unit tests
* Set cache true for search query
* Remove in memory cache implementation (Two way door decision)
 * Relying on search cache without custom cache
* Renamed datasource state from FAILED to CREATE_FAILED
* Renamed class name from *Helper to *Facade
* Changed updateIntervalInDays to updateInterval
* Changed value type of default update_interval from TimeValue to Long
* Read setting value from cluster settings directly

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Sync from main (#280)

* Update gradle version to 7.6 (#265)

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

* Exclude lombok generated code from jacoco coverage report (#268)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Make jacoco report to be generated faster in local (#267)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Update dependency org.json:json to v20230227 (#273)

Co-authored-by: mend-for-github-com[bot] <50673670+mend-for-github-com[bot]@users.noreply.github.com>

* Baseline owners and maintainers (#275)

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>

---------

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Co-authored-by: Vijayan Balasubramanian <balasvij@amazon.com>
Co-authored-by: mend-for-github-com[bot] <50673670+mend-for-github-com[bot]@users.noreply.github.com>

* Add datasource name validation (#281)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Refactoring of code (#282)

1. Change variable name from datasourceName to name
2. Change variable name from id to name
3. Added helper methods in test code

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Change field name from md5 to sha256 (#285)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Implement get datasource api (#279)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Update index option (#284)

1. Make geodata index as hidden
2. Make geodata index as read only allow delete after creation is done
3. Refresh datasource index immediately after update

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Make some fields in manifest file as mandatory (#289)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Create datasource index explicitly (#283)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add wrapper class of job scheduler lock service (#290)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Remove all unused client attributes (#293)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Update copyright header (#298)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Run system index handling code with stashed thread context (#297)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Reduce lock duration and renew the lock during update (#299)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Implements delete datasource API (#291)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Set User-Agent in http request (#300)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Implement datasource update API (#292)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Refactoring test code (#302)

Make buildGeoJSONFeatureProcessorConfig method to be more general

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add ip2geo processor integ test for failure case (#303)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Bug fix and refactoring of code (#305)

1. Bugfix: Ingest metadata can be null if there is no processor created
2. Refactoring: Moved private method to another class for better testing support
3. Refactoring: Set some private static final variable as public so that unit test can use it
4. Refactoring: Changed string value to static variable

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add integration test for Ip2GeoProcessor (#306)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add ConcurrentModificationException (#308)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add integration test for UpdateDatasource API (#307)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Bug fix on lock management and few performance improvements (#310)

* Release lock before response back to caller for update/delete API
* Release lock in background task for creation API
* Change index settings to improve indexing performance

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Change index setting from read_only_allow_delete to write (#311)

read_only_allow_delete does not block write to an index.
The disk-based shard allocator may add and remove this block automatically.
Therefore, use index.blocks.write instead.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Fix bug in get datasource API and improve memory usage (#313)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Change package for Strings.hasText (#314) (#317)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Remove jitter and move index setting from DatasourceFacade to DatasourceExtension (#319)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Do not index blank value and do not enrich null property (#320)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Move index setting keys to constants (#321)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Return null index name for expired data (#322)

Return a null index name for expired data so that it can be deleted by the cleanup process. The cleanup process excludes the current index from deletion.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add new fields in datasource (#325)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Delete index once it is expired (#326)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add restoring event listener (#328)

In the listener, we trigger a geoip data update

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Reverse forcemerge and refresh order (#331)

Otherwise, opensearch does not clear old segment files

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Removed parameter and settings (#332)

* Removed first_only parameter
* Removed max_concurrency and batch_size setting

The first_only parameter was added because the current geoip processor has it. However, the parameter has no benefit for the ip2geo processor, as we don't do a sequential search for array data but use multi search.

The max_concurrency and batch_size settings are removed because they only reveal internal implementation and could block future performance improvements.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add a field in datasource for current index name (#333)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Delete GeoIP data indices after restoring complete (#334)

We don't want to use restored GeoIP data indices. Therefore, we delete those indices once the restore process completes.

When the GeoIP metadata index is restored, we create a new GeoIP data index instead.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Use bool query for array form of IPs (#335)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Run update/delete request in a new thread (#337)

This is not to block transport thread

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Remove IP2Geo processor validation (#336)

Cannot query index to get data to validate IP2Geo processor.
Will add validation when we decide to store some of data in cluster state metadata.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Acquire lock synchronously (#339)

By acquiring the lock asynchronously, the remaining part of the code is run by the transport thread, which does not allow blocking code. We want only a single update to happen on a node using a single thread. However, that cannot be achieved if I acquire the lock asynchronously and pass the listener.

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Added a cache to store datasource metadata (#338)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Changed class name and package (#341)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Refactoring of code (#342)

1. Changed class name from Ip2GeoCache to Ip2GeoCachedDao
2. Moved the Ip2GeoCachedDao from cache to dao package

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add geo data cache (#340)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Add cache layer to reduce GeoIp data retrieval latency (#343)

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Use _primary in query preference and few changes (#347)

1. Use the _primary preference to get datasource metadata so that it reads the latest data, since RefreshPolicy.IMMEDIATE won't refresh replica shards immediately according to #346
2. Update datasource metadata index mapping
3. Move batch size from static value to setting

Signed-off-by: Heemin Kim <heemin@amazon.com>
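
As a sketch, a search with the `_primary` preference routes the request to primary shards only, so it sees writes that replicas have not yet refreshed (the index name and document id here are hypothetical):

```json
GET /.geospatial-datasources/_search?preference=_primary
{
  "query": { "ids": { "values": ["my-datasource"] } }
}
```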

* Wait until GeoIP data is replicated to all data nodes (#348)

Signed-off-by: Heemin Kim <heemin@amazon.com>
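
One way to wait for replication is the cluster health API; a hedged example (index name hypothetical, and not necessarily how the plugin implements the wait):

```json
GET /_cluster/health/geoip-data?wait_for_active_shards=all&timeout=1m
```

The call blocks until every shard copy of the index is active or the timeout elapses.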

* Update packages according to a change in OpenSearch core (#354)

* Update packages according to a change in OpenSearch core

Signed-off-by: Heemin Kim <heemin@amazon.com>

* Update packages according to a change in OpenSearch core (#353)

Signed-off-by: Heemin Kim <heemin@amazon.com>

---------

Signed-off-by: Heemin Kim <heemin@amazon.com>

---------

Signed-off-by: Vijayan Balasubramanian <balasvij@amazon.com>
Signed-off-by: Heemin Kim <heemin@amazon.com>
Signed-off-by: Naveen Tatikonda <navtat@amazon.com>
Signed-off-by: Martin Gaievski <gaievski@amazon.com>
Signed-off-by: Junqiu Lei <junqiu@amazon.com>
Co-authored-by: Vijayan Balasubramanian <balasvij@amazon.com>
Co-authored-by: mend-for-github-com[bot] <50673670+mend-for-github-com[bot]@users.noreply.github.com>
Co-authored-by: Naveen Tatikonda <navtat@amazon.com>
Co-authored-by: Martin Gaievski <gaievski@amazon.com>
Co-authored-by: Junqiu Lei <junqiu@amazon.com>
heemin32 added a commit that referenced this pull request Jul 24, 2023
* Implement creation of ip2geo feature (#257)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 24, 2023
* Implement creation of ip2geo feature (#257)

(cherry picked from commit 0cd9153)
heemin32 added a commit that referenced this pull request Jul 24, 2023
* Implement creation of ip2geo feature (#257)

(cherry picked from commit 0cd9153)

Co-authored-by: Heemin Kim <heemin@amazon.com>