choose different serialization scheme for storing configuration #2329

Open
vladak opened this issue Aug 31, 2018 · 6 comments

@vladak
Member

vladak commented Aug 31, 2018

The XML encoder used for configuration serialization is not very robust (e.g. when the class hierarchy changes or configuration options are removed) and has some quirks (#2002). We should consider using something else (YAML or JSON?).

Also, this serialization is used not only for configuration but also elsewhere (IndexAnalysisSettings).

@tulinkry
Contributor

Yes, finally.

@tulinkry
Contributor

tulinkry commented Feb 4, 2019

Looks like YAML would be the way to go.

@vladak
Member Author

vladak commented Apr 12, 2019

Also, the configuration should be treated as data, not serialized objects, to avoid security vulnerabilities that can arise when deserializing XML into Java objects.
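
To illustrate what "configuration as data" could look like, here is a minimal sketch that uses only `java.util.Properties` from the JDK. It is not OpenGrok code: the class `ConfigAsData`, the `Settings` holder, and the key names and defaults are hypothetical. The point is that the file supplies only strings, the code decides which keys it knows and how to convert them, and keys that no longer exist are skipped instead of breaking the load.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.Set;

class ConfigAsData {
    /** Hypothetical typed view of the configuration; not an OpenGrok class. */
    static final class Settings {
        final String sourceRoot;
        final int hitsPerPage;
        final List<String> ignoredKeys;

        Settings(String sourceRoot, int hitsPerPage, List<String> ignoredKeys) {
            this.sourceRoot = sourceRoot;
            this.hitsPerPage = hitsPerPage;
            this.ignoredKeys = ignoredKeys;
        }
    }

    // Keys this (illustrative) version of the program understands.
    private static final Set<String> KNOWN_KEYS = Set.of("sourceRoot", "hitsPerPage");

    static Settings parse(String text) throws IOException {
        // The file is read as plain key/value strings; nothing in it can
        // name a class to instantiate or a method to call.
        Properties raw = new Properties();
        raw.load(new StringReader(text));

        // Unknown keys (e.g. options removed in a later release) are
        // collected so they can be reported, not treated as fatal.
        List<String> ignored = new ArrayList<>();
        for (String key : raw.stringPropertyNames()) {
            if (!KNOWN_KEYS.contains(key)) {
                ignored.add(key);
            }
        }

        // Conversion to typed values happens in code, with assumed defaults.
        String sourceRoot = raw.getProperty("sourceRoot", "/var/opengrok/src");
        int hitsPerPage = Integer.parseInt(raw.getProperty("hitsPerPage", "25"));
        return new Settings(sourceRoot, hitsPerPage, ignored);
    }

    public static void main(String[] args) throws IOException {
        Settings s = parse("sourceRoot=/src\nhitsPerPage=50\nremovedOption=x\n");
        System.out.println(s.sourceRoot + " " + s.hitsPerPage + " ignored=" + s.ignoredKeys);
    }
}
```

The same shape would apply with a YAML, JSON or TOML parser in place of `Properties`, as long as the parser is used in a mode that yields maps and scalars and not arbitrary Java objects.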

@vladak
Member Author

vladak commented Mar 28, 2022

Another reason to use something else is performance. I recently realized that XMLEncoder does not scale when the configuration is retrieved through the RESTful API. With a multithreaded program in which each thread just retrieves the configuration in a loop, and the number of threads matches the number of CPUs, retrieval times shoot up to almost 2 seconds, compared to 0.4 seconds for a single-threaded program. The XML file with the configuration is about 1.38 MB. A jstack snapshot revealed that most of the threads doing XMLEncoder processing (about 25 of the 32 threads I was using) were waiting on an internal synchronization object, with the top of the stack looking like this:

"http-nio-8080-exec-1427" #29360 daemon prio=5 os_prio=64 cpu=38052.59ms elapsed=2859934.54s tid=0x000000000531c000 nid=0x7981 waiting for monitor entry  [0x00007fff808fa000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at com.sun.beans.util.Cache.get(java.desktop@11.0.7-internal/Cache.java:119)
        - waiting to lock <0x00007ff387d7f320> (a java.lang.ref.ReferenceQueue)
        at com.sun.beans.finder.MethodFinder.findMethod(java.desktop@11.0.7-internal/MethodFinder.java:81)
        at java.beans.Statement.getMethod(java.desktop@11.0.7-internal/Statement.java:369)
        at java.beans.Statement.invokeInternal(java.desktop@11.0.7-internal/Statement.java:273)
        at java.beans.Statement$2.run(java.desktop@11.0.7-internal/Statement.java:187)
        at java.security.AccessController.doPrivileged(java.base@11.0.7-internal/Native Method)
        at java.beans.Statement.invoke(java.desktop@11.0.7-internal/Statement.java:184)
        at java.beans.Expression.getValue(java.desktop@11.0.7-internal/Expression.java:155)
        at java.beans.Encoder.getValue(java.desktop@11.0.7-internal/Encoder.java:105)
        at java.beans.Encoder.get(java.desktop@11.0.7-internal/Encoder.java:252)
        at java.beans.PersistenceDelegate.writeObject(java.desktop@11.0.7-internal/PersistenceDelegate.java:112)
        at java.beans.Encoder.writeObject(java.desktop@11.0.7-internal/Encoder.java:74)
        at java.beans.XMLEncoder.writeObject(java.desktop@11.0.7-internal/XMLEncoder.java:326)

I did this exercise to simulate the read timeout problems that occur right after running an all-project sync with the sync.py command. This command runs a number of reindex_project.py programs in parallel, and each reindex_project.py retrieves the configuration from the web app at startup. Increasing --api_timeout for the Python tools works as a workaround; however, my expectation is that this should scale.
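
For anyone who wants to look at the encoder in isolation, the measurement above could be approximated without the web app with something like the sketch below. This is not the program used for the numbers in this comment: `XmlEncoderScaling` and the `DemoConfig` bean are hypothetical stand-ins, and the bean is tiny compared to a 1.38 MB configuration, so absolute times will differ. It only runs `XMLEncoder.writeObject()` from one thread and then from as many threads as there are CPUs, so the two timings can be compared.

```java
import java.beans.XMLEncoder;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class XmlEncoderScaling {
    /** Hypothetical stand-in for the real configuration bean. */
    public static class DemoConfig {
        private String sourceRoot = "";
        private int hitsPerPage = 0;

        public DemoConfig() { }

        public String getSourceRoot() { return sourceRoot; }
        public void setSourceRoot(String sourceRoot) { this.sourceRoot = sourceRoot; }
        public int getHitsPerPage() { return hitsPerPage; }
        public void setHitsPerPage(int hitsPerPage) { this.hitsPerPage = hitsPerPage; }
    }

    static DemoConfig sampleConfig() {
        // Non-default values, so that XMLEncoder actually emits the properties.
        DemoConfig config = new DemoConfig();
        config.setSourceRoot("/var/opengrok/src");
        config.setHitsPerPage(25);
        return config;
    }

    /** Serializes the bean to an XML string with java.beans.XMLEncoder. */
    static String encode(Object bean) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (XMLEncoder encoder = new XMLEncoder(out)) {
            encoder.writeObject(bean);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Runs `iterations` encodes on each of `threads` threads; returns elapsed ms. */
    static long timeMillis(int threads, int iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            long start = System.nanoTime();
            List<Future<?>> futures = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                futures.add(pool.submit(() -> {
                    for (int i = 0; i < iterations; i++) {
                        encode(sampleConfig());
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // propagate any failure from the worker
            }
            return (System.nanoTime() - start) / 1_000_000;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        int cpus = Runtime.getRuntime().availableProcessors();
        int iterations = 2000; // arbitrary; same per-thread work in both runs
        System.out.println("1 thread:  " + timeMillis(1, iterations) + " ms");
        System.out.println(cpus + " threads: " + timeMillis(cpus, iterations) + " ms");
    }
}
```

Each thread does the same amount of work in both runs, so if the encoder scaled, the two times would be roughly equal. Taking a jstack snapshot during the multithreaded run would show whether threads are blocked in the same place as in the trace above.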

@vladak
Member Author

vladak commented Oct 19, 2022

Another feature that a new serialization scheme could bring is wildcards. For instance, I'd like to be able to set project properties for a set of projects specified with wildcards (or even regexps), similar to what is done in the opengrok-mirror configuration:

projects:
  apache-httpd-.*:
     proxy: true
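
On the Java side, resolving such wildcard entries could look roughly like the sketch below. It assumes the configuration has already been parsed into pattern/property pairs; `WildcardProjectSettings` and its methods are hypothetical names, and full-match regexp semantics with "later rules override earlier ones" is just one possible choice.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class WildcardProjectSettings {
    /** One configuration entry: a project name regexp and the properties it sets. */
    private static final class Rule {
        final Pattern pattern;
        final Map<String, Object> properties;

        Rule(Pattern pattern, Map<String, Object> properties) {
            this.pattern = pattern;
            this.properties = properties;
        }
    }

    // Rules are kept in the order they appear in the configuration.
    private final List<Rule> rules = new ArrayList<>();

    void addRule(String regex, Map<String, Object> properties) {
        rules.add(new Rule(Pattern.compile(regex), properties));
    }

    /** Merges the properties of every rule whose regexp matches the whole name. */
    Map<String, Object> resolve(String projectName) {
        Map<String, Object> merged = new LinkedHashMap<>();
        for (Rule rule : rules) {
            if (rule.pattern.matcher(projectName).matches()) {
                merged.putAll(rule.properties); // later rules override earlier ones
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        WildcardProjectSettings settings = new WildcardProjectSettings();
        settings.addRule("apache-httpd-.*", Map.of("proxy", true));
        System.out.println(settings.resolve("apache-httpd-2.4"));
        System.out.println(settings.resolve("nginx"));
    }
}
```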

@vladak
Member Author

vladak commented Dec 1, 2022

YAML is probably not so great, so something like TOML might be a better idea. However, we would still need to address serialization of objects like Project and RepositoryInfo. It seems that some TOML Java implementations support serialization.
