Remove azure directory #226
Conversation
…ob-v1 # Conflicts: # build/build.xml
…ob-v1 # Conflicts: # src/Examine.AzureDirectory/AzureDirectory.cs # src/Examine.AzureDirectory/Properties/AssemblyInfo.cs # src/Examine/Properties/AssemblyInfo.cs
…ons. Refactor to simplify code/ improve code reuse. Change framework to 4.6.1 for azuredirectory, test and demo projects as required by blob storage package
…h namespace. Integration test is not very reliable with the azure storage emulator.
Use IndexReader in lucene searcher - based on Shannon comments, add check sync on create input/output, add lock on rebuilding of cache
…giri-1.11.1' into feature/v1/azure-blob-1-fixes # Conflicts: # build/build.xml
…lob already started
# Conflicts: # build/build.xml # src/Examine.AzureDirectory/AzureDirectoryFactory.cs # src/Examine/Logging/LogEntry.cs # src/Examine/LuceneEngine/Providers/LuceneIndex.cs
update also target framework
# Conflicts: # src/Examine.AzureDirectory/AzureDirectoryFactory.cs
…lob-1-fixes # Conflicts: # src/Examine.Test/App.config # src/Examine.Test/Examine.Test.csproj # src/Examine.Test/packages.config # src/Examine.Web.Demo/Web.config # src/Examine/Examine.csproj # src/Examine/Properties/AssemblyInfo.cs
Thanks, have left a bunch of questions/comments.
@@ -0,0 +1,47 @@
<?xml version="1.0" encoding="utf-8"?>
What is this file doing?
@Shazwazza that was added by @nzdev, if I remember correctly. I will check :)
@Shazwazza I think @nzdev is using some static code analysis software, and the build was failing for him without that file.
It would be good to remove this file
namespace Examine.LuceneEngine.DeletePolicies
{
    public class NoDeletionPolicy : IndexDeletionPolicy
I'm unsure if these new bits are actually needed, versus just using the built-in SnapshotDeletionPolicy. For example, here are some docs (there are plenty more out there) on hot backups with older Lucene, and 4.8 has a built-in Replicator. Just wanted to mention this since SnapshotDeletionPolicy potentially solves these issues: https://freecontent.manning.com/hot-backups-with-lucene/
@Shazwazza that one will actually still remove files, so it will cause an end-of-file exception. So sadly we need to deal with it this way, as we are not making backups; we are actively using the indexes.
But for 4.8 I will look into the Replicator, as it is actually more or less what I am doing now :)
I had a read of that document along with https://cwiki.apache.org/confluence/display/solr/SolrReplication. Using the SnapshotDeletionPolicy sounded like the way to go to me. It allows for hot backups. It wouldn't be a duplex R/W approach; instead, use the on-commit hook to choose how many versions to hold and send to blob storage. At the same time you could queue a message to signal the other instances to sync index files. This way you would not need to check the storage account files on every query.
@nzdev you forgot about one thing: Solr is using that internally. We have a delay between push and pull, so that will cause EOF :) Sadly that approach would be perfect for doing syncTemp or backups, but not live synchronization :)
As long as we don't want to build a messaging system between instances, there is no really good way of doing that based on SnapshotDeletionPolicy, from my point of view ;/
I think it could be made to work, as the SnapshotDeletionPolicy can prevent files from being deleted, so there is no risk of EOF. Each IndexCommit tracks the changed files. We could then write the files plus the list of changes to blob storage and have the replicas/clients either poll for changes or check on index reader creation. The Replicator looks like the way to go for 4.8 though, but ultimately it will be using the same approach as Solr. Have a read of the "How does it work" section of https://cwiki.apache.org/confluence/display/solr/SolrReplication. The delay between push and pull can be handled by versioning segments.gen in blob storage.
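To make the suggestion concrete, here is a minimal sketch of a hot backup using Lucene.NET 3.x's SnapshotDeletionPolicy. This is not code from this PR; the `copyFile` callback and the `HotBackupSketch` wrapper are hypothetical, and it assumes the standard Lucene.NET 3.0.3 API where `Snapshot()` pins a commit's files against deletion until `Release()` is called.

```csharp
// Hypothetical sketch: snapshot the current commit so its files cannot be
// deleted while they are copied (e.g. uploaded to blob storage).
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class HotBackupSketch
{
    public static void CopyCurrentCommit(Directory dir, Action<string> copyFile)
    {
        var snapshotter = new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
        using (var writer = new IndexWriter(dir,
                   new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
                   snapshotter, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            // Files referenced by this commit are now protected from deletion,
            // even if the writer keeps committing in the meantime.
            IndexCommit commit = snapshotter.Snapshot();
            try
            {
                foreach (var fileName in commit.FileNames)
                    copyFile(fileName); // e.g. upload to blob storage
            }
            finally
            {
                snapshotter.Release(); // allow normal file deletion again
            }
        }
    }
}
```

Replicas could then pull the copied files and the commit's file list, which avoids checking the storage account on every query.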
@nzdev hmmm, I did read that, but yeah, it is something that will need further investigation. So far I have a working production site based on the code that is already used in that repo:
https://github.com/bielu/Examine.Directories
So if we make changes there, I guess we will need to modify the remote directory and test again :)
@nzdev @Shazwazza I looked into that more deeply, and there are a few limitations:
- all writers have to be closed when restoring (so, when restoring from blob storage)
- based on point 1, we are still stuck with shuffling folders at restore time (so the event on the Lucene index is still needed)
- hmmm, with NoDeletionPolicy I hit an interesting issue (some documents are still present even though they no longer exist in the newest commit)
I started looking into the snapshot policy as it would make for a much faster restore, since it would only duplicate the last commit, but from what I am seeing we are stuck with the same issue. I will investigate further this weekend, as there might be a better solution.
namespace Examine.LuceneEngine.MergePolicies
{
    public class NoMergePolicy : MergePolicy
See other comment about SnapshotDeletionPolicy. Could that 'just work' by acquiring a snapshot for syncing indexes?
#endif

if (!RunAsync)
{
    var msg = "Indexing Error Occurred: " + e.Message;
    if (e.InnerException != null)
        msg += ". ERROR: " + e.InnerException.Message;
    //if(this._directory is ExamineDirectory examineDirectory && examineDirectory.IsReadOnly)
Do we need this comment?
@Shazwazza will remove that asap :)
}

_writer = null;
Maybe a note here to explain what this is doing.
if (_nrtWriter != null)
{
    if (_directory != null && _nrtWriter.Directory != _directory)
What is this code doing?
@Shazwazza I added a comment, and because you commented I noticed I forgot to update that part to use WriterTracker.Current.MapAsOld( instead of a direct dispose :)
There is a lot of edge-case logic in here to support scenarios that are totally outside the core of Examine, and I worry that these changes will bleed into creating strange underlying issues and code maintenance questions (i.e. why is this code here if it never needs to exist to support what Examine itself is doing?). My other worry is that porting this to 2.0.0 on netcore will be problematic. We don't have any WriterTracker anymore; there is a single writer for an index, and it is managed by the index and disposed when it shuts down. Maybe there's a better way to support all of the changes you want to make by exposing protected properties, so that you can subclass some of these objects to control things the way you want. I would very much prefer not to have code like this in here and instead make the code more extensible, so the logic can be changed outside of this library. Does that make sense?
@Shazwazza in 2.0.0 we don't need that at all, as Lucene 4.8 supports the Replicator, so we don't really need to play around with swapping indexes :). So we can discuss what we can do there :)
MaybeReopen();

}
catch (Exception)
What exception is actually thrown? That's the one that should be caught.
@Shazwazza I think I was actually planning to add a logger there and forgot; will do that soon :)
@Shazwazza I will add logging there, but I can't really remember what exception I hit here 🤦♂️
@@ -41,9 +44,71 @@ public IndexWriter GetWriter(Directory dir, bool throwIfEmpty)

public IndexWriter GetWriter(Directory dir, Func<Directory, IndexWriter> factory)
{
    var resolved = _writers.GetOrAdd(dir.GetLockId(), s => factory(dir));
    return resolved;
    lock (_locker)
_writers is a ConcurrentDictionary, so why do we need another lock? If we do need it because of the logic in MapAsOld, we can probably just make it a plain Dictionary and manage the locking ourselves?
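To illustrate the suggestion, here is a minimal sketch (type and member names are hypothetical, not the actual WriterTracker code) of a tracker that uses a plain Dictionary guarded by one lock. Once several entries must be mutated atomically, a single lock is simpler and safer than mixing a ConcurrentDictionary with an extra lock, and it also guarantees the factory runs at most once per key.

```csharp
// Hypothetical sketch: one lock guards all dictionary access, so GetOrAdd
// and bulk mutations (like a MapAsOld-style sweep) stay consistent.
using System;
using System.Collections.Generic;

public class TrackerSketch<TKey, TValue>
{
    private readonly object _locker = new object();
    private readonly Dictionary<TKey, TValue> _items = new Dictionary<TKey, TValue>();

    public TValue GetOrAdd(TKey key, Func<TKey, TValue> factory)
    {
        lock (_locker)
        {
            if (!_items.TryGetValue(key, out var value))
            {
                value = factory(key); // runs at most once per key, under the lock
                _items[key] = value;
            }
            return value;
        }
    }

    public bool Remove(TKey key)
    {
        lock (_locker)
        {
            return _items.Remove(key);
        }
    }
}
```

Note that ConcurrentDictionary.GetOrAdd does not give this guarantee: its value factory can be invoked more than once under contention, which matters when the factory creates an IndexWriter.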
@Shazwazza I had some weird issue with just the ConcurrentDictionary, so I added the lock, but I will play around with a plain Dictionary :)
}

public void MapAsOld(Directory dir)
Some code docs explaining what this does would be good.
@Shazwazza will add :)
Comment added, please let me know if you need a more descriptive comment.
Have added a few more comments on code
writer = new ExamineIndexWriter(d, FieldAnalyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
}

if (examineDirectory.GetMergeScheduler() != null)
Minor, but calling a method multiple times shouldn't be done, since that normally incurs some performance cost. Better to store the result of GetMergeScheduler() and reuse it.
    }
}

public ExamineDirectory ExamineDir { get; set; }
Making this settable is problematic; it needs to be read-only and injected.
/// Called on commit
/// </summary>
/// <param name="action"></param>
public void SetOnCommitAction(Action<ExamineIndexWriter> action)
All of these Set* methods mean that you can mutate this instance at runtime, which may be problematic. Since this needs to be injected into the ctor of the index, all of these things should be injected into the ctor of this object too. This is a new object, so we don't need to work around breaking changes and introduce spaghetti code. I feel like this whole class can be simplified: it takes its requirements in the ctor and exposes them as properties (not methods).
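A minimal sketch of that shape (the type and member names here are invented for illustration, not the PR's actual class): everything arrives through the constructor, and the properties have no setters, so the instance cannot be mutated after it has been handed to the index.

```csharp
// Hypothetical sketch: an immutable options object. Dependencies come in
// via the ctor and are exposed as get-only properties, so there is no
// Set* surface to mutate at runtime.
using System;

public class WriterOptionsSketch
{
    public WriterOptionsSketch(Action<object> onCommit, bool isReadOnly)
    {
        OnCommit = onCommit ?? throw new ArgumentNullException(nameof(onCommit));
        IsReadOnly = isReadOnly;
    }

    // Get-only: fixed at construction, no SetOnCommitAction needed.
    public Action<object> OnCommit { get; }
    public bool IsReadOnly { get; }
}
```

Anything that genuinely must vary per call can be a parameter of the method that uses it, rather than mutable state on the options object.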
Any thoughts on this?
@Shazwazza I am going to look into that soon. I need to figure out a better way of handling this, as I also need to introduce a background queue and figure out the best way of handling that :)
I think at this point I wouldn't be able to progress with this approach without some sort of distributed log, which I would say is more of a server implementation issue (think Solr). I'd suggest closing this PR.
@nzdev yeah, it is something to close.
@Shazwazza as agreed, the PR has been changed to remove the Azure directory. I am going to create a new repository, Examine.Directories, today, and I will play there with the Azure and S3 directories :)