Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recursion is used for OS X adapters when it need not be #307

Closed
driskell opened this issue May 4, 2015 · 23 comments
Closed

Recursion is used for OS X adapters when it need not be #307

driskell opened this issue May 4, 2015 · 23 comments

Comments

@driskell
Copy link

driskell commented May 4, 2015

Hello,

On OS X I've noticed recursion is used when an event is received. So even though events are received on a per folder basis it will recurse into them. This means if you copy in a large tree of files, it will recurse the tree exponentially dependent on depth. For example a simple 1/2/3/4/5 would mean a recursion over the 4/5 layer 3 times. It also means a single change to a file in the root triggers a recursion of the entire watched folder. If the folder contains thousands of folders/files it will then delay the next change by up to 10-15 seconds sometimes in the tests I've done while the CPU sits at 100%.

Is there a reason for this recursion? Do any of the adapters (except poll) even need this at all?

(On a separate note - I notice recursion is implied... the actual recursive flag passed by the adapters isn't even referenced.)

There is one reason I can think - and that is if a new folder is added - it would need to Directory.scan it in order to process its contents, thus it would (kind of) need to recurse into it. This can easily be made a special case, and full recursion only performed if requested (by polling adapter for example.)

Does this make sense? Have I misunderstood the code?

@e2
Copy link
Contributor

e2 commented May 4, 2015

Recursion is a limitation of rb-fsevent. There might be a way to avoid it, but that requires OSX development, which I can't do (I'm Linux-only).

Overall, rb-fsevent is capable of monitoring changes to files, but this was never enabled because AFAIR watching files (see rb-fsevent docs) generated too many events.

So ... the Darwin adapter in Listen is basically "half-polling" - when something changes in a directory, that directory is scanned.

You're right - there's no reason other adapters need it (except polling).

I wouldn't mind accepting patches that add adapter-specific options (open an issue + ask for help if you want to take a stab at it).

The best "workaround" is probably to ignore directories you don't care about (so they won't be scanned).

If you think OSX adapter is scanning more than it should - you may be right. The LISTEN_GEM_DEBUGGING=2 environment variable should provide enough info to see what's going on (you can upload a gist if you're unsure).

It's possible to speed up the scanning a lot - mostly by avoiding Celluloid and avoiding using Pathnames where strings could be used (Pathname is quite slow). I've wanted to get to this for a while - but it might take time to make it stable enough to release.

One final note: it's best to be specific when talking about the use case, because there are so many caveats. E.g. instead of copying files into the project, you could copy them outside the project and then move them into the project (the operation should be faster and may/should generate less events).

@driskell
Copy link
Author

driskell commented May 4, 2015

@e2 Thanks for the overview! It fills in a lot of background. I guess I can narrow this issue to just OS X for now.

So ... the Darwin adapter in Listen is basically "half-polling" - when something changes in a directory, that directory is scanned.
...
If you think OSX adapter is scanning more than it should - you may be right. The LISTEN_GEM_DEBUGGING=2 environment variable should provide enough info to see what's going on (you can upload a gist if you're unsure).

This is interesting. What I see is that not just the directory is scanned, but all subdirectories also. For a large folder this can take some time. I've already enabled LISTEN_GEM_DEBUGGING=2 and it will spam for nearly a minute scanning the entire directory tree when I merely touch a single file in the root of the directory tree.

I notice that the Darwin adapter (and BSD, and a few parts of others) pass in "recursive: true" - however, as mentioned, this is not referenced, the action is ALWAYS recursive. What I don't understand though, is why recursive is required?

Darwin adapter recursive: true

_queue_change(:dir, dir, rel_path, recursive: true)

BSD adapter recursive: true
_queue_change(:dir, dir, '.', recursive: true)

For example, if I have:
folder/file.rb
folder/second.rb
folder/subdir/third.rb
folder/subdir/another/fourth.rb

If I modify file.rb, the entire tree is crawled for changes. Only the root should be crawled.
If I modify third.rb, the fourth.rb is scanned. Only the subdir folder should be crawled.
If I copy subdir to subdir2, the new fourth.rb in subdir2 would be scanned twice, once by the recursion started at subdir2, and again by the recursion started at subdir2/another (an event is raised for both new directories which form part of the new tree.)

I guess one problem is that Directory::scan calls back to Change::change for each entry, and in the case of directory this calls Directory::scan again. If I perform a PR to prevent recursion when recursion is not true, and to remove the recursion true from OS X - would that be acceptable?

@driskell driskell changed the title Recursion is used for all adapters Recursion is used for OS X adapters when it need not be May 4, 2015
@driskell
Copy link
Author

driskell commented May 4, 2015

Updated comment regarding copying subdir to subdir2

@driskell
Copy link
Author

driskell commented May 4, 2015

Copying and moving is a good point that will need testing. I'll explore that as it may not generate events for subtree entries.

In which case I do have another experimental patch that ONLY recurses new folders in order to capture this.

You are right though it does seem a minefield of possible inputs and will be hard to optimise without impacting any. I'll report back soon!

@driskell
Copy link
Author

driskell commented May 5, 2015

OK - I've been playing around in the root of the watched directory and it seems behaviour here is completely different to that in deeper directories.

The record builder places files in the root in the same place as paths are placed. This means Record::dir_entries(watched_dir, '.') actually returns a list of files in the root as well as every folder in the structure. In deeper folders, it just returns the files (and not folders.)

So actually seems theres an issue with the way root files are stored in Record.

This somewhat explains why recursion is probably implied at the moment. There's no way to check a subfolder was removed without starting a Change::change on it, since the Record::dir_entries only returns files (though the code within in hints it would return directories without data due to an :mtime check) - so some very very strange and confusing things in the code it seems!

@e2
Copy link
Contributor

e2 commented May 5, 2015

@driskell - it's best to handle this case-by-case, or it can be very confusing.

One key distinction here: watching recursively is unrelated to scanning recursively. (the first is setting up watched directories, the second is adapter-specific to detect the changes based on events).

I think the confusing part is: polling needs recursion (otherwise it can't detect a change deep in the tree, e.g. its scans foo recursively so it can detect a change in foo/bar/baz.rb), while OSX just "piggy-backs" on this functionality, even though if something changes in foo/bar/baz.rb, it already knows there's a change in foo/bar/baz.rb.

I'm a bit "scared" to optimize and check OSX, because I'm worried I'll break something not handled by tests. But - if foo/bar/baz.rb is changed and you get an rb-fsevent :dir event on foo/bar, there's no reason to scan foo recursively.

Also, by changing baz.rb, you're changing the directory entry in baz - I don't know if there's a bar event if that happens.

It's tricky - and I don't know enough about how rb-fsevent behaves (which events it gets in which scenarios) to optimize for it. If you can setup a os-agnostic test case (see the darwin_spec.rb adapter tests), I could help with the rest.

The Record is optimized for size somewhat - if you think there's an error there, setup a basic project, modify a file in root and attach a link to the gist with the output (again, I don't have OSX, so I have no idea what rb-fsevent provides).

Without a failing test case it's hard to move forward - so let me know if you have a failing example, and I'll help turn it into a test case. As soon as I can reproduce the problem on Linux, I can take over and fix + refactor things.

Summary: I don't know how OSX behaves and there are no unit tests which tell me what rb-fsevent events happen in which scenarios. The codepath for OSX and polling maybe share a bit more than they should (ideally, polling should handle recursion by itself - and not create fake dir events going through the whole stack).

@driskell
Copy link
Author

driskell commented May 5, 2015

@e2 As I delve deeper I can absolutely agree, case by case basis is best, with sound tests and reproduction cases. What I started out (naively) thinking was fairly straight forward, was certainly not the case. I do firmly believe there's some enhancements to be made here though, and maybe better documentation in code to help. Appreciate the feedback, it's been extremely useful, and I'll take it all away and come back with something cleaner and narrowly targeted 👍

I'm a bit "scared" to optimize and check OSX, because I'm worried I'll break something not handled by tests. But - if foo/bar/baz.rb is changed and you get an rb-fsevent :dir event on foo/bar, there's no reason to scan foo recursively.

What happens here, is that bar is scanned recursively (not foo), but yes there is still no reason. So if next to baz.rb you have many folders and many levels deep, a change to just baz.rb recursively scans them all.

Also, by changing baz.rb, you're changing the directory entry in baz - I don't know if there's a bar event if that happens.

Not sure what you mean by baz as you did not mention it? When a directory changes (added/removed files) a :dir event raises for the directory. When a modification happens to a file, a :dir event is raised for the parent directory.

Regarding the Record issue - I will raise that in a new issue with an example of output so you may reproduce better the problem. Then I'll get some better information regarding the recursion so you can better see. I'm purely OS X client so I should be able write all the tests needed to give confidence 👍

Thanks again.

@driskell driskell closed this as completed May 5, 2015
@e2
Copy link
Contributor

e2 commented May 5, 2015

@driskell - I meant "foo" there: if you change baz.rb, you change the directory bar - and I don't know if that causes rb-fsevent to also generate an event on foo (all because something in the directory bar changed). If you change a file, the mtime changes - which is a change on the directory, and not the file. I don't know if that causes a change on the parent directory (or if it should) on OSX.

Don't worry about creating multiple issues and asking lots of questions - I'm more worried about my changing breaking things for OSX users, so if you can provide cases I can turn into tests on Linux, it would be very, very helpful.

If it helps, at some point I wanted to create a "virtual adapter", where you treat it like a filesystem (changing a file, touching a file, removing a directory, moving, etc.) and it returns a set of events for whatever backend you want. You could say it would be like a fakefs adapter. It worked pretty well, but when I got to stubbing rb-fsevent, I realized I have no idea how that adapter behaves (in order to simulate it).

That's the biggest issue for me - not knowing exactly how rb-fsevent behaves (which makes it very hard for me to even try to refactor/improve things).

For reference, you can check out Record.build works (all the fast_ methods) - it can scan tens of thousands of directories in a fraction of a second, even without multiple threads. I wanted to reimplement the directory scanning with the same approach - but any problems or unnecessary overhead you find will still be very valuable (because they mean potential bugs).

@driskell
Copy link
Author

driskell commented May 5, 2015

I meant "foo" there: if you change baz.rb, you change the directory bar - and I don't know if that causes rb-fsevent to also generate an event on foo (all because something in the directory bar changed). If you change a file, the mtime changes - which is a change on the directory, and not the file. I don't know if that causes a change on the parent directory (or if it should) on OSX.

@e2 It doesn't generate an event for foo. Only the modified directory bar.

That virtual adapter would be a perfect way to do reasonable test cross-platform. Though I guess testing on the platform itself needs doing eventually.

I've uploaded a gist regarding the Record.build issue I noticed. This is all OS X.

I ran LISTEM_GEM_DEBUGGING=2 bundle exec listen and then modified the LICENSE.txt in the root folder:
https://gist.github.com/driskell/367ee8dd3ea673367432
You can see it recursively hits everything in the listen repository. This seems to be because when Directory.scan calls Record::dir_entries for the folder the :dir event occurred on (which is the watched folder / the root) it returns the full path hash with ALL paths. So when it removes the current from previous it's left with pretty much with a full folder list such as lib/listen....

If I perform the same test again but modify lib/listen.rb it's better, because Record::dir_entries behaves fine when it's not the root. There's one fsevent here, which is a :dir for lib only. And removing current from previous leaves nothing so all is good.
https://gist.github.com/driskell/274ad8eec25c7cbfb09e
I've got a patch that fixes the behaviour for the root (but breaks many tests as they rely on existing behaviour when checking @paths contents!). Rather than ignoring . entries, or returning the @paths[root] for them, actually use a virtual . folder entry, and everything begins to work.
driskell@555ac23
This is the new gist of log with the patch: https://gist.github.com/driskell/477145e9cc96b4b7407b

You can also see in all of the above, that even though only LICENSE.txt was changed, or only lib/listen.rb was changed, it will recurses into subdirectories, even though the only fsevent was for . or lib and not any of their children.

@driskell driskell reopened this May 5, 2015
@driskell
Copy link
Author

driskell commented May 5, 2015

I can also see once a directory is added to Record::paths it is never removed, only entries within each directory hash are removed in _fast_unset_path, so a tiny memory leak I guess.

@e2
Copy link
Contributor

e2 commented May 5, 2015

I've tested it and it didn't work on my simulation, because when files are moved, (e.g. dir1 moved to dir/), the virtual entry doesn't contain entries from the initial build (e.g. initially there's dir1 and dir2, but record['.'] is empty).

The "structure" of the record is supposed to be "flatter" (not a tree):

record[watched_directory][full_relative_path_to_dir_or_file][dir_or_file_basename] = options (mtime, etc)
record[watched_directory][dir_or_file_basename_in_root] = options (mtime, etc)

e.g.

record['foo']['bar/baz/bay']['file.txt']
record['foo']['file.txt']

I think the problem here is that dir_entries(watched_dir, '.') should return just a hash with files.

@driskell
Copy link
Author

driskell commented May 5, 2015

What was your simulation? In my test @paths[root]['.'] is populated.
The only difference between old and new is that root files are stored in ['.'] rather than directly in the hash, this ensures they are stored separately to the dirnames and so dir_entries() returns correctly.

Actually, using master branch, if I move a directory from listen/lib to listen/lib2, on OS X, I get additions for listen/lib2 but removed is empty. If I then move listen/lib2 back to listen/lib it doesn't even call the callback.

I guess my patch has made the bad behaviour of subdirectory moves on OS X impact the root as well. I think this is due to the dir_entries only returning file entries, so when it goes through calling _change on no longer existing entries, it only does so for files and not directories, and once a directory exists, it will never be involved in a removed if the directory itself is moved away (deletion would work as it would remove the files first, generating events for the files.)

@e2
Copy link
Contributor

e2 commented May 5, 2015

@driskell - sorry, you're absolutely right. Having ['.'] explicitly is the way to go. dir_entries is broken anyway, so I'll check some things out and I'll get back to you.

Your patch looks really good btw.

@e2 e2 mentioned this issue May 5, 2015
@e2
Copy link
Contributor

e2 commented May 5, 2015

@driskell - I've included your patch and reworked the build logic here: #308

Previously, the Record#build didn't actually enumerate directory entries - now it does (should).

Can you test this on OSX?

@e2
Copy link
Contributor

e2 commented May 5, 2015

On Linux, I set the env LISTEN_GEM_SIMULATE_FSEVENT=true and work on a mounted FAT directory (to simulate HFS mtime granularity). What it basically does is: it translates inotify events into rb-fsevent directory events.

If you can, setup a small repo with a watched subdirectory - and reproduce problems there, so I can compare with the output I get (I may have bugs in the translation logic, so this may help me find them).

I'm not too sure about dealing with the recursion (your other branch) - directory mtime could be used instead, although it may be hard to detect changes without recursion on OSX (because mtime isn't reliable to filter out irrelevant changes).

I really appreciate the work you've done on this - thanks!.

@e2
Copy link
Contributor

e2 commented May 6, 2015

@driskell - I had some additional thoughts and I'm interested in what you think:

Current Listen assumptions:

  1. Record stores only file entries (and it not reliable for storing directories)
  2. Because of 1., this means directories have to be scanned every time
  3. Because of unreliable mtime on directories (1sec precision on HFS), directories have to be scanned anyway
  4. For directory scanning optimizations, the scanning would have to be sequential (otherwise subdirectories could be scanned before the parent directories - thus the need for auto_hash and unnecessary scanning happening). This may actually speed up scanning, since MRI has GIL (so using multiple threads doesn't improve performance) and there's no Celluloid overhead (which as as of 0.16.0 is pretty significant, because every class method is sent through a mailbox - so the more methods a Celluloid class has (like Record), the slower).
  5. Any optimization on OSX would have to work with Polling (Polling would benefit too) - this is because most optimizations for OSX will apply to Polling anyway (OSX adapter is basically "selective-subtree-scan" or "subtree polling without the loop").
  6. The current code makes it hard to debug and rework - ideally the Record should behave like a normal filesystem (and maybe have methods like mkdir, rmdir, rm). This way, Record could be tested against these filesystem methods, while Directory.scan would "detect" these operations based on differences with the Record.
  7. The ['.'] workaround shouldn't be used at this point. Instead, the actual entries should be filtered (anything unless it has a "") - though dir_entries is currently designed to return just files. This is because Record is built around the idea of storing info about files - and not optimized for providing a Dir.entries like API.
  8. If done properly, the logic of Listen::File (mtime, md5sum) could be shared with directories (for detecting changes)
  9. The md5sum checking is filesystem specific (not adapter specific), though dir-scanning is OS specific (OSX, Polling) - this means optimizations for OSX may slow down other adapters (though they shouldn't). This means adding dir info (dir entries + mtime) to Record is OS specific (OSX-only optimization).
  10. You might consider just switching to using file events from the rb-fsevent adapter. I'm not sure what the performance is (perhaps handling the FSevent events isn't that slow?). It may be a much better use of your time to check this.

@driskell
Copy link
Author

driskell commented May 6, 2015

Sure. Most of this is from OS X perspective and directory events but I believe would apply elsewhere - testing would be needed though to confirm behaviour, but I would say any directory-event like behaviour that does not match rb-fsevent for OS X would be somewhat intuitive or essentially useless for the purpose listen needs it for.

  1. It really needs to store both so it can be sure it knows which directories existed before and after events.
  2. Directories only need to be scanned if they are added or removed, in order to process addition and removal results. Events for a directory are never raised if a subdirectory contents are modified - the event would instead be for the subdirectory. As such recursion is not always necessary, it is only necessary for new folder entries that were moved into the tree, or entries moved out of the tree, and in the removal case recursion would be over the entries in Record.
  3. I disagree. mtime is only useful for directory contents modification. However, as mentioned in 2, the directory event covers this - so mtime can be ignored, and recursion avoided.
  4. I do not understand why directory scanning needs to be sequential, or how performing unnecessary scanning means it doesn't need to be. Subdirectories already can be scanned before the parent directories finish. We're not changing the ordering of anything or the synchronisation of anything, we're just reducing work, so I cannot see how it means it needs to be sequential.
  5. Yes, polling should be able to force recursion always. You in essence have three types of change discovery: file-based (I believe this is the :change option), directory-based (this is when :change is missing), and recursion based (polling would use this.)
  6. Sounds good. I actually think the way Record is at the moment is somewhat like that already. dir_entries = 'ls', fast_update_file = 'touch', fast_unset_path = 'rm/rmdir' (needs splitting to remove actual dir keys when they are removed to prevent that memory leak I mentioned.)
  7. I would say if Record is for files only, it is not fit for purpose. As per 1 it needs to do directories too. I do not consider '.' a workaround. @paths is a Hash, where keys are the paths, and values are the directory contents. For the root folder, this means it has to have a key, to use the filenames as keys just breaks this logic, and means you have to filter by '' - which for large structures is inefficient. Why store something somewhere where it cannot be easily accessed? If the @paths were different, if it was indeed like a filesystem structure, with multiple levels of depth, then I would agree, since you would never have @paths['.git/refs/head'] it would be @paths['.git']['refs']['head'], and a key would be an entry, rather than a path.
  8. This would only be useful for polling and I actually think you still need to scan for file changes (directory mtime in my experience only changes when new files added or files removed, that is, contents change - if a file changes, it doesn't affect directory). I guess it would mean you need to iterate the directory for additions - and can just iterate whats in Record to see if any of those changed. Possibly marginally faster as it prevents a directory listing.
  9. Would adding dir entries to Record slow down other adapters? If they don't use Directory.scan, it wouldn't impact them. If they do use Directory.scan, in my opinion it needs the directory entries otherwise Directory.scan doesn't do what it should do - notice new and removed folders.
  10. It's a possibility. I'm quite keen to get directory events working though as it seems like after some TLC it could be logically efficient (improved behaviour), and maybe later down the line algorithmically efficient (optimisations in the code but with the same behaviour). Also looks like file-events in rb-fsevent would be fairly new to the code-base, and so more likely to hit problems.

Hope I'm not direct in some of my thoughts, but I hope it helps in decisions and planning. As mentioned I'm happy to help out with any bits - and hopefully will be better suited to help once I've played with the acceptance testing more so rather than rambling I can actually provide reproducing code!

@e2
Copy link
Contributor

e2 commented May 6, 2015

any directory-event like behaviour that does not match rb-fsevent (...) would be essentially useless (...)

Polling "assumes" every directory has been modified - so it shares the scanning logic with Darwin.

It really needs to store both so it can be sure it knows which directories existed before and after events.

I'm not sure this will play well if a file is removed and a directory is created with the same name (or vice verse)

Or, if directory foo is quickly renamed to tmp, bar is renamed to foo, and tmp is renamed to bar. Listing is the same, but the directory contents don't match any more (only mtime is different).

Events for a directory are never raised if a subdirectory contents are modified - the event would instead be for the subdirectory.

What events to do get if a directory tree is added? What about an empty directory?

As such recursion is not always necessary

Is it necessary for added directory trees? (related to above question about events)

I do not understand why directory scanning needs to be sequential, or how performing unnecessary scanning means it doesn't need to be.

Mostly because Record uses ||= for hashes, and so it won't work reliably for storing directories without an overhaul. Celluloid also adds overhead to the the Record every time any of it's methods are called. add_dir was supposed to make sure directory entries exist as scanning is not ordered.

Some optimizations you mention may need sequential access, or they may be too complex to implement - especially to work reliably with JRuby and Rubinius.

Subdirectories already can be scanned before the parent directories finish.

And I don't think this is a good thing. Itm makes the code very complex, difficult to maintain and debug and quite hard to understand.

The performance gain from concurency on MRI is minimal (scanning is CPU bound, not IO-bound as one would expect).

Yes, polling should be able to force recursion always.

Polling should be optimizable on high-precision-mtime filesystems (anything except FAT and HFS). This is important for filesystems where inotify or FSEvent are unavailable. At the very least, optimizing Darwin adapter shouldn't slow down Polling (and e.g. Polling doesn't need directory info in Record if it's indiscriminately recursive).

You in essence have three types of change discovery: file-based (I believe this is the :change option), directory-based (this is when :change is missing), and recursion based (polling would use this.)

Polling probably should have less in common (in the codebase) with Darwin than it has now. The API should reflect more what happens than using magic "existing-or-not-existing" parameters.

I actually think the way Record is at the moment is somewhat like that already. dir_entries = 'ls', fast_update_file = 'touch', fast_unset_path = 'rm/rmdir' (needs splitting to remove actual dir keys when they are removed to prevent that memory leak I mentioned.)

That shouldn't be too difficult, given specs are added to Record.

As per 1 it needs to do directories too.

I'm not sure - is FSEvent providing insufficient events? (e.g. when directories are added). Otherwise, having directories is a Darwin-specific features - and so maybe there should be a Darwin-specific Record instead? Also, it may help to populate this "extra" Record at startup (to reduce the need for scanning on the first run).

This would only be useful for polling and I actually think you still need to scan for file changes (directory mtime in my experience only changes when new files added or files removed, that is, contents change - if a file changes, it doesn't affect directory).

mtime is stored in the directory (usually?) - so changing a file does change the directory.

Would adding dir entries to Record slow down other adapters?

Accessing the Record may be slow (the calls go through Celluloid mailboxes), and Polling uses Directory.scan (and doesn't store directory info - since it doesn't need it because of recursion).

There are also symlinks, which can add lots of other cases - calling real_path on every directory may be expensive. And you'd probably have to differentiate between files, dirs and symlinks.

Also looks like file-events in rb-fsevent would be fairly new to the code-base, and so more likely to hit problems.

I'd say less likely - both Windows and Linux adapters use file events, and they do rely on the Record in some cases anyway (so there wouldn't be a difference).

I translate file-events into dir events for simulation - so you can pick which events should be handled as file events, and which as dir events.

Hope I'm not direct in some of my thoughts, but I hope it helps in decisions and planning.

It's too easy to break something, and there are not enough tests to really considering optimizing. "Works" is better than "fast but broken", especially with so many users relying on Listen. Crashing is much better than not-responding to changes. And failing tests are much better than crashing.

As mentioned I'm happy to help out with any bits - and hopefully will be better suited to help once I've played with the acceptance testing more so rather than rambling I can actually provide reproducing code!

It's a complex and confusing topic - and we're both missing information, so neither of us can see the big picture.

If you have any suggestions (minor or major) that would help, let me know. The current code isn't good at all - so anything that confused you or didn't work as advertised should be changed.

@e2
Copy link
Contributor

e2 commented May 6, 2015

@driskell - I've updated,rebased and squashed: #308

If you can, try it out. If you have any changes you'd like on top, just attach links to the commits.

Major changes:

  • bugfix (cloberring the Record entries in Directory.scan)
  • initially including directories in ['.'] entry (otherwise existing files in root are later marked as added)
  • no more autohash (so stuff can fail, but it's easier to work with)

The changes are pretty drastic, and even though all the tests pass, this doesn't mean things work.

@driskell
Copy link
Author

driskell commented May 6, 2015

Thanks I'll give that a go.

On the previous:
Polling does need to recurse. Though Darwin shouldn't need to as it will be told which directory contents are changed.

It really needs to store both so it can be sure it knows which directories existed before and after events.

I'm not sure this will play well if a file is removed and a directory is created with the same name (or vice verse)
Or, if directory foo is quickly renamed to tmp, bar is renamed to foo, and tmp is renamed to bar. Listing is the same, but the directory contents don't match any more (only mtime is different).

This is a good spot. This would reach File::change which currently assumes Record entries to be for files only. So it needs to be updated to account for existing entry not being a file. Without the change, we don't detect folder removals if they are moved, so this needs to happen I believe. It will likely mean Record::add_dir then needs to verify if the existing entry is a file or not and fix it up. I can see this is deepening somewhat - and fixing this moving folder problem may be a large task!

Events for a directory are never raised if a subdirectory contents are modified - the event would instead be for the subdirectory.

What events to do get if a directory tree is added? What about an empty directory?

Just tested this scenario and it is as expected. A directory change event is raised for the parent directory in both cases - no even is raised for the new directory in either cases. If files are copied in, however, you receive events for each directory as files are added (but not for an empty directory as it never has files added.)

As such recursion is not always necessary

Is it necessary for added directory trees? (related to above question about events)

Yes, this is what I mean by "not always necessary" - recursion into a subfolder is ONLY needed if that subfolder is detected as new (or if Record remembers it as a file for example) or if the subfolder used to be there and isn't we should recurse the Record tree. Interestingly, back to your point about file becoming a folder - we now have a potential blocker - if a folder becomes a file, we need to maintain existing Record contents before file overwrites it. This would have to be handled carefully when File::change detects an entry used to be a directory, maybe even maintaining reference to the Record dir_entries and passing to Directory::scan - plot thickens here!

I do not understand why directory scanning needs to be sequential, or how performing unnecessary scanning means it doesn't need to be.

Mostly because Record uses ||= for hashes, and so it won't work reliably for storing directories without an overhaul. Celluloid also adds overhead to the the Record every time any of it's methods are called. add_dir was supposed to make sure directory entries exist as scanning is not ordered.

Some optimizations you mention may need sequential access, or they may be too complex to implement - especially to work reliably with JRuby and Rubinius.

I'm still not sure the problem with ||=. Storing directories is nothing different than storing files is it not?. The only difference is the data stored with the name entry. One is the data for a file, one is the data for a directory. To distinguish, a :type is stored or simply directories have :mtime missing which seems to possibly have been past behaviour as there is a reference to this in the old Directory::dir_entries

As per 1 it needs to do directories too.

I'm not sure - is FSEvent providing insufficient events? (e.g. when directories are added). Otherwise, having directories is a Darwin-specific features - and so maybe there should be a Darwin-specific Record instead? Also, it may help to populate this "extra" Record at startup (to reduce the need for scanning on the first run).

Not really. A subfolder is moved to trash for example - we get an event for the parent folder where that folder used to be. We have to work out what the change was in that folder - how do we work out if it was a moved away folder, and what it name was, if we don't remember it?
This is less a Darwin specific problem, and more a directory-event related problem. The way I see it is that Listen handles three types of systems, file-event, directory-event and polling. So rather than being treated as a Darwin specific problem, it's a directory-event specific behaviour that should be handled by the API.

This would only be useful for polling and I actually think you still need to scan for file changes (directory mtime in my experience only changes when new files added or files removed, that is, contents change - if a file changes, it doesn't affect directory).

mtime is stored in the directory (usually?) - so changing a file does change the directory.

I've tested on Mac and Linux. Modifying a file in a directory does not change the directory. (Using an editor may well do so, however, as most editors create a new file, delete the old, and then rename - which in essence generated three directory changes.) Specifically, mtime only changes for a directory when the directory list changes. Individual file content changes don't impact it as the list remains the same.

There are also symlinks, which can add lots of other cases - calling real_path on every directory may be expensive. And you'd probably have to differentiate between files, dirs and symlinks.

May be open for discussion this. I would say symlinks should be treated as files. Otherwise, we're watching a directory potentially outside the watched root.

Hope I'm not direct in some of my thoughts, but I hope it helps in decisions and planning.

It's too easy to break something, and there are not enough tests to really considering optimizing. "Works" is better than "fast but broken", especially with so many users relying on Listen. Crashing is much better than not-responding to changes. And failing tests are much better than crashing.

I appreciate this and perfectly understand the need for stringent testing, and our discussions I do hope are helping create a good log of understanding for anyone else wondering about the internals and what does what. It's definitely helped me appreciate the code a lot more, but also see some room for improvement. I agree with you here mostly. However, I would say it currently does not reliably work for OS X / directory-events, due to the missing directory lists in Record that mean moved folders are not raising the correct remove events. Preventing recursion I would say is n optimisation to drop for now and look at later, once the actual problems are solved.

@e2
Copy link
Contributor

e2 commented May 6, 2015

I'm still not sure the problem with ||=

It's related to logic regarding directories, because "empty" may mean "empty directory" or "not scanned yet". Also, ||= does not include merging data (entries added) - merging does happen in some Record methods, but there may be bugs here. (Bugs regarding optimization, but everything may still work well).

I've tested on Mac and Linux. Modifying a file in a directory does not change the directory.

rb-inotify works only on directories (BSD use kqueue to actually watch files). rb-inotify gets a list of directories to watch - and reports changes to files. This is basically via the :modified attribute (technically due to the mtime changing). I'm not talking about the mtime of the directory - I'm talking about the mtime of the file - which is stored in the directory (thus changed mtime of a file = "modified" directory contents).

I appreciate this and perfectly understand the need for stringent testing

I mentioned that to suggest that you don't put too much faith in the existing code (respecting your time). Expect the existing code to be buggy and "prove" it with test cases. Once we run out of things to test, we can release.

It's definitely helped me appreciate the code a lot more

My feelings are the opposite, lol!

Let me know if you're ready to move ahead, or if you have a specific error you want to fix (or need help writing a test for).

@e2
Copy link
Contributor

e2 commented May 30, 2015

@driskell - I'd like to resurrect this (PR #308 updated). I've tried to simulate the rb-fsevent adapter as close as possible on Linux.

I'd take the approach of disabling recursion (now possible with the LISTEN_GEM_FSEVENT_NO_RECURSION=1 env var - see PR) and then "fixing" things to detect moving trees.

So until all issues are sorted out, people could "opt-out" of recursion (to get the performance boost) and by disabling recursion, meaningful work and testing could be done to improve things.

If something doesn't work, there should be failing tests anyway (which can lead to quick fixes).

So if you can create a failing example, and I can reproduce it as a test case on Linux, I can take a stab at getting directory info in the record, adding missing tests, etc.

There are a lot of changes (based on your fork) - and quite a bit of refactoring I need to do (the whole recursion thing should be object oriented - so it can be tested without having to understand the full stack).

Any step-by-step suggestions from your side would be very helpful - feel free to open a "Refactor: " issue with a list of changes you think would help (so we can track this separate from this issue) - or open a PR, so I can keep track of what works for you.

Thanks!

@ColinDKelley
Copy link
Collaborator

This issue is over 6 years old, so I'm closing it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants