
Remove azure directory #226

Closed
wants to merge 99 commits
Conversation

@bielu bielu commented May 1, 2021

@Shazwazza as agreed, the PR has been changed to remove the Azure Directory. I am going to create the new Examine.Directories repository today, and I will experiment there with Azure and S3 directories :)

Arkadiusz Biel and others added 30 commits November 8, 2020 00:02
…ob-v1

# Conflicts:
#	src/Examine.AzureDirectory/AzureDirectory.cs
#	src/Examine.AzureDirectory/Properties/AssemblyInfo.cs
#	src/Examine/Properties/AssemblyInfo.cs
…ons. Refactor to simplify code/ improve code reuse. Change framework to 4.6.1 for azuredirectory, test and demo projects as required by blob storage package
…h namespace.

Integration test is not very reliable with the azure storage emulator.
Arkadiusz Biel added 19 commits February 12, 2021 14:50
Use IndexReader in lucene searcher - based on Shannon comments, add check sync on create input/output, add lock on rebuilding of cache
…giri-1.11.1' into feature/v1/azure-blob-1-fixes

# Conflicts:
#	build/build.xml
# Conflicts:
#	build/build.xml
#	src/Examine.AzureDirectory/AzureDirectoryFactory.cs
#	src/Examine/Logging/LogEntry.cs
#	src/Examine/LuceneEngine/Providers/LuceneIndex.cs
update also target framework
# Conflicts:
#	src/Examine.AzureDirectory/AzureDirectoryFactory.cs
…lob-1-fixes

# Conflicts:
#	src/Examine.Test/App.config
#	src/Examine.Test/Examine.Test.csproj
#	src/Examine.Test/packages.config
#	src/Examine.Web.Demo/Web.config
#	src/Examine/Examine.csproj
#	src/Examine/Properties/AssemblyInfo.cs
Owner

@Shazwazza Shazwazza left a comment

Thanks, have left a bunch of questions/comments.

@@ -0,0 +1,47 @@
<?xml version="1.0" encoding="utf-8"?>
Owner

What is this file doing?

Author

@Shazwazza that was added by @nzdev, if I remember correctly; I will check :)

Author

@Shazwazza I think @nzdev is using some static code analysis software and his build was failing because of that.

Owner

It would be good to remove this file


namespace Examine.LuceneEngine.DeletePolicies
{
public class NoDeletionPolicy : IndexDeletionPolicy
Owner

I'm unsure if these new bits are actually needed, or whether just using the built-in SnapshotDeletionPolicy would do. For example, here are some docs (there's a bunch more out there) on hot backups with older Lucene; 4.8 has a built-in Replicator. Just wanted to mention this since SnapshotDeletionPolicy potentially solves these issues: https://freecontent.manning.com/hot-backups-with-lucene/
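
For reference, a minimal sketch of the hot-backup pattern that article describes, assuming Lucene.NET 3.x-style APIs (member and overload names differ slightly between Lucene.NET versions):

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class HotBackupSketch
{
    public static void BackupCurrentCommit(Directory dir)
    {
        // SnapshotDeletionPolicy wraps another policy and "pins" a commit's files
        // so the writer cannot delete them while they are being copied elsewhere.
        var policy = new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
        using (var writer = new IndexWriter(dir,
                   new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
                   policy, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            IndexCommit commit = policy.Snapshot();   // pin the latest commit
            try
            {
                foreach (var fileName in commit.FileNames)
                {
                    // Copy/upload each pinned file here (blob storage, disk, ...).
                    Console.WriteLine(fileName);
                }
            }
            finally
            {
                policy.Release();                     // unpin; normal deletion resumes
            }
            // The writer stays open and can keep indexing while the backup runs.
        }
    }
}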

Author

@Shazwazza that one will actually delete files, so it will cause end-of-file exceptions. Sadly we need to deal with it this way, since we are not making backups, we are actively using the indexes.
But for 4.8 I will look into the Replicator, as it is more or less what I am doing now :)

Contributor

I had a read of that document along with https://cwiki.apache.org/confluence/display/solr/SolrReplication. Using the SnapshotDeletionPolicy sounded like the way to go to me. It allows for hot backups. It wouldn't be a duplex R/W approach; instead, use the on-commit method to choose how many versions to hold and send to blob storage. At the same time you could queue a message to signal the other instances to sync the index files. This way you would not need to check the storage account files on every query.
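
A hypothetical sketch of the "signal the other instances" part of this idea; IMessagePublisher and IndexSyncMessage are illustrative abstractions, not part of Examine:

// After a commit's files have been pushed to blob storage, publish a small
// message so replicas know there is a new commit to pull.
public interface IMessagePublisher
{
    void Publish(IndexSyncMessage message);
}

public class IndexSyncMessage
{
    public string IndexName { get; set; }
    public long CommitGeneration { get; set; }   // identifies the commit to pull
}

public class CommitNotifier
{
    private readonly IMessagePublisher _publisher;

    public CommitNotifier(IMessagePublisher publisher)
    {
        _publisher = publisher;
    }

    // Called once the pinned commit files have been uploaded.
    public void NotifyCommitted(string indexName, long generation)
    {
        _publisher.Publish(new IndexSyncMessage
        {
            IndexName = indexName,
            CommitGeneration = generation
        });
    }
}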

Author

@bielu bielu May 4, 2021

@nzdev you forgot about one thing: Solr uses that internally, and we have a delay between push and pull, so that will cause EOF :) Sadly that approach would be perfect for doing syncTemp or backups, but not live synchronization :)
As long as we don't want to build a messaging system between instances, there is no really good way of doing that based on SnapshotDeletionPolicy, from my point of view ;/

Contributor

@nzdev nzdev May 4, 2021

I think it could be made to work, as the SnapshotDeletionPolicy can prevent files from being deleted, so there is no risk of EOF. Each IndexCommit tracks the changed files. We could then write the files plus the list of changes to blob storage and have the replicas/clients either poll for changes or check on index reader creation. The Replicator looks like the way to go for 4.8 though, but ultimately it will be using the same approach as Solr. Have a read of the "How does it work" section of https://cwiki.apache.org/confluence/display/solr/SolrReplication. The delay between push and pull can be handled by versioning segments.gen in blob storage.
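
A hypothetical sketch of the replica-side check described here, where a small version marker (akin to a versioned segments.gen) lives in blob storage and files are only pulled when it advances; IBlobStore and the other names are illustrative:

using System.Threading.Tasks;

public interface IBlobStore
{
    Task<long> GetLatestGenerationAsync(string indexName);   // reads the version marker
    Task DownloadCommitAsync(string indexName, long generation, string localPath);
}

public class ReplicaSync
{
    private readonly IBlobStore _store;
    private long _localGeneration;

    public ReplicaSync(IBlobStore store, long localGeneration)
    {
        _store = store;
        _localGeneration = localGeneration;
    }

    // Call this when (re)opening an IndexReader, or from a periodic poll.
    public async Task<bool> SyncIfNewerAsync(string indexName, string localPath)
    {
        long remote = await _store.GetLatestGenerationAsync(indexName);
        if (remote <= _localGeneration)
            return false;                 // nothing new to pull

        await _store.DownloadCommitAsync(indexName, remote, localPath);
        _localGeneration = remote;
        return true;                      // caller should reopen its reader
    }
}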

Author

@nzdev hmmm, I did read that, but yeah, it is something that will need further investigation. So far I already have a working production site based on the code used in this repo:
https://github.com/bielu/Examine.Directories
So if we make changes there, I guess we will need to modify the remote directory and test again :)

Author

@bielu bielu Jul 23, 2021

@nzdev @Shazwazza I looked into that more deeply, and there are a few limitations:

  • all writers have to be closed when restoring (i.e. restoring from blob storage)
  • because of point 1 we are still stuck with shuffling folders at restore time (so the event on the Lucene index is still needed)
  • hmmm, with NoDeletionPolicy I hit an interesting issue (some documents are still present even though they no longer exist in the newest commit)

I started looking into the snapshot policy, as it would make restores much faster (it would only duplicate the last commit), but from what I am seeing we are stuck with the same issue. I will investigate further this weekend, as there might be a better solution.


namespace Examine.LuceneEngine.MergePolicies
{
public class NoMergePolicy : MergePolicy
Owner

See other comment about SnapshotDeletionPolicy. Could that 'just work' by acquiring a snapshot for syncing indexes?

#endif

if (!RunAsync)
{
var msg = "Indexing Error Occurred: " + e.Message;
if (e.InnerException != null)
msg += ". ERROR: " + e.InnerException.Message;
//if(this._directory is ExamineDirectory examineDirectory && examineDirectory.IsReadOnly)
Owner

Do we need this comment?

Author

@Shazwazza will remove that asap :)


}

_writer = null;
Owner

Maybe a note here to explain what this is doing.

if (_nrtWriter != null)
{
if (_directory != null && _nrtWriter.Directory != _directory)
Owner

What is this code doing?

Author

@Shazwazza I added a comment, and because you commented I noticed that I forgot to update this part to use WriterTracker.Current.MapAsOld() instead of a direct dispose :)

Owner

There is a lot of edge case logic in here to support scenarios that are totally outside of the core of Examine, and I worry that these changes will bleed into creating strange underlying issues and code maintenance questions (i.e. why is this code here if it never needs to exist to support what Examine itself is doing?).

My other worry is that porting this to 2.0.0 on netcore will be problematic. We don't have any WriterTracker anymore; there is a single writer for an index, and it is managed by the index and disposed when it shuts down.

Maybe there's a better way to support all of the changes that you are wanting to make by exposing protected properties so that you can subclass some of these objects to control things the way that you want. I would very much prefer to not have code like this in here and instead make the code better extensible so the logic can be changed outside of this library. Does that make sense?
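
As an illustration only (names are invented, not the actual Examine API), the kind of extension point being suggested might look like a protected virtual factory that a separate package can override:

using Lucene.Net.Analysis;
using Lucene.Net.Index;
using Lucene.Net.Store;

public abstract class ExtensibleLuceneIndexSketch
{
    private IndexWriter _writer;

    protected abstract Directory GetLuceneDirectory();
    protected abstract Analyzer FieldAnalyzer { get; }

    // A subclass in an external package (e.g. a blob-storage-backed index) can
    // override this to plug in its own deletion/merge policies, without that
    // edge-case logic living inside Examine itself.
    protected virtual IndexWriter CreateIndexWriter(Directory directory)
    {
        return new IndexWriter(directory, FieldAnalyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    }

    protected IndexWriter Writer
    {
        get { return _writer ?? (_writer = CreateIndexWriter(GetLuceneDirectory())); }
    }
}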

Author

@Shazwazza in 2.0.0 we don't need that at all, as Lucene 4.8 supports the Replicator, so we don't really need to play around with swapping indexes :). So we can discuss what we can do there :)

MaybeReopen();

}
catch (Exception)
Owner

What exception is actually thrown? That's the one that should be used.

Author

@Shazwazza I think I was actually planning to add a logger there and forgot; will do that soon :)

Author

@Shazwazza I will add logging there, but I can't really remember what exception I hit here 🤦‍♂️

@@ -41,9 +44,71 @@ public IndexWriter GetWriter(Directory dir, bool throwIfEmpty)

public IndexWriter GetWriter(Directory dir, Func<Directory, IndexWriter> factory)
{
var resolved = _writers.GetOrAdd(dir.GetLockId(), s => factory(dir));
return resolved;
lock (_locker)
Owner

_writers is a ConcurrentDictionary, why do we need another lock? Or, if we do need it because of the logic in MapAsOld, we can probably just make it a plain Dictionary and manage the locking ourselves?

Author

@Shazwazza I had some weird issue with just the ConcurrentDictionary, so I added that, but I will play around with a plain Dictionary :)
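
A minimal sketch of the plain Dictionary plus shared lock approach suggested above (field and method names follow the snippet in the diff; the surrounding WriterTracker members are assumed):

using System;
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Store;

internal class WriterTrackerSketch
{
    private readonly object _locker = new object();
    private readonly Dictionary<string, IndexWriter> _writers = new Dictionary<string, IndexWriter>();

    public IndexWriter GetWriter(Directory dir, Func<Directory, IndexWriter> factory)
    {
        // One lock guards all reads/writes of _writers, so GetWriter and MapAsOld
        // can share the same synchronization instead of mixing a ConcurrentDictionary
        // with a separate lock.
        lock (_locker)
        {
            var key = dir.GetLockId();
            IndexWriter writer;
            if (!_writers.TryGetValue(key, out writer))
            {
                writer = factory(dir);
                _writers[key] = writer;
            }
            return writer;
        }
    }
}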

}

public void MapAsOld(Directory dir)
Owner

Some code docs explaining what this does would be good.

Author

@Shazwazza will add :)

Author

Comment added, please let me know if you need a more descriptive one.

Owner

@Shazwazza Shazwazza left a comment

Have added a few more comments on code

writer = new ExamineIndexWriter(d, FieldAnalyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
}

if (examineDirectory.GetMergeScheduler() != null)
Owner

Minor, but a method shouldn't be called multiple times like this since that normally incurs some performance cost. Better to store the result of GetMergeScheduler() and re-use it.
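
For example (the exact call that consumes the scheduler depends on the surrounding PR code, so treat this as a sketch):

// Call GetMergeScheduler() once, keep the result, and reuse it.
var mergeScheduler = examineDirectory.GetMergeScheduler();
if (mergeScheduler != null)
{
    // Assumption: the result is handed to the writer, e.g. via SetMergeScheduler.
    writer.SetMergeScheduler(mergeScheduler);
}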

}
}

public ExamineDirectory ExamineDir { get; set; }
Owner

Making this settable is problematic, it needs to be readonly and injected

/// Called on commit
/// </summary>
/// <param name="action"></param>
public void SetOnCommitAction(Action<ExamineIndexWriter> action)
Owner

All of these Set methods mean that you can mutate this instance at runtime, which may be problematic. Since this needs to be injected into the ctor of the index, all of these things should be injected into the ctor of this object too. This is a new object so we don't need to work around breaking changes and introduce spaghetti code. I feel like this whole class can be simplified: it takes its requirements in the ctor and exposes them as properties (not methods).
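
A minimal sketch of the ctor-injected shape being described; the class and member names are illustrative, not the actual Examine API (the PR's on-commit callback takes an ExamineIndexWriter, simplified here to IndexWriter):

using System;
using Lucene.Net.Index;

public class ExamineDirectoryOptionsSketch
{
    public ExamineDirectoryOptionsSketch(
        MergePolicy mergePolicy,
        MergeScheduler mergeScheduler,
        IndexDeletionPolicy deletionPolicy,
        Action<IndexWriter> onCommit)
    {
        MergePolicy = mergePolicy;
        MergeScheduler = mergeScheduler;
        DeletionPolicy = deletionPolicy;
        OnCommit = onCommit;
    }

    // Everything is supplied once via the ctor and exposed as read-only
    // properties; there are no Set methods to mutate the instance at runtime.
    public MergePolicy MergePolicy { get; private set; }
    public MergeScheduler MergeScheduler { get; private set; }
    public IndexDeletionPolicy DeletionPolicy { get; private set; }
    public Action<IndexWriter> OnCommit { get; private set; }
}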

Owner

Any thoughts on this?

Author

@Shazwazza I am going to look into that soon, as I need to figure out a better way of handling this; I also need to introduce a background queue and figure out the best way of handling that :)

Contributor

nzdev commented Aug 15, 2023

I think at this point I wouldn't be able to progress with this approach without some sort of distributed log, which I would say is more of a server implementation concern (think Solr). I'd suggest closing this PR.

Author

bielu commented Aug 15, 2023

@nzdev yeah it is something to close

@bielu bielu closed this Aug 15, 2023