Add watch_index method and ark-cli watch command #36

Open
wants to merge 12 commits into
base: main

tareknaser
Collaborator

Description

This pull request adds a new method, watch_index, to the fs-index crate. It monitors file system changes and automatically updates the index.
It also adds a new ark-cli command that makes this functionality accessible to users.
This change is the first step toward addressing issue #21.

Testing

An example of the new method's usage is in the fs-index crate at fs-index/examples/index_watch.rs.
To run the example, run the following command:

cargo run --example index_watch

This command watches the test-assets/ directory and automatically updates its index upon any file system change.

Benchmark for 341c426
Test Base PR %
../test-assets/lena.jpg/compute_bytes 13.6±0.51µs 13.3±0.09µs -2.21%
../test-assets/test.pdf/compute_bytes 139.0±2.61µs 107.6±0.80µs -22.59%
compute_bytes_large/compute_bytes 467.9±9.08µs 139.9±1.85µs -70.10%
compute_bytes_medium/compute_bytes 26.8±0.25µs 27.7±0.79µs +3.36%
compute_bytes_small/compute_bytes 127.2±1.07ns 128.0±6.04ns +0.63%
index_build/index_build/../test-assets/ 161.3±5.81µs 160.5±1.53µs -0.50%

@kirillt
Member

kirillt commented May 11, 2024

It's a good PR, and it seems pretty straightforward to complete, but I'm afraid that merging it before the other index refactorings could make porting ARK-Builders/arklib#72 too difficult: we'd need to add one more function to the index while also checking diffs during the port of ARK-Builders/arklib#72.

Benchmark for 0332bd7
Test Base PR %
../test-assets/lena.jpg/compute_bytes 13.3±0.16µs 13.3±0.08µs 0.00%
../test-assets/test.pdf/compute_bytes 109.8±2.17µs 111.6±0.55µs +1.64%
compute_bytes_large/compute_bytes 471.0±0.78µs 139.5±3.27µs -70.38%
compute_bytes_medium/compute_bytes 30.9±0.20µs 27.7±0.21µs -10.36%
compute_bytes_small/compute_bytes 127.8±1.66ns 128.2±3.58ns +0.31%
index_build/index_build/../test-assets/ 163.0±1.45µs 161.0±0.53µs -1.23%

@tareknaser tareknaser marked this pull request as draft May 12, 2024 18:25
@tareknaser tareknaser mentioned this pull request Sep 1, 2024
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
@tareknaser
Collaborator Author

Updated the watch API to call ResourceIndex::update_one() for files created, removed, or modified based on streams from notify events.

@tareknaser tareknaser marked this pull request as ready for review September 9, 2024 09:23
github-actions bot commented Sep 9, 2024

Benchmark for dc67cc3
Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.7±1.19µs 247.7±0.87µs -0.80%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.06µs 15.6±0.08µs +0.65%
blake3_resource_id_creation/compute_from_bytes:small 1350.7±8.76ns 1358.8±6.23ns +0.60%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.2±0.54µs 197.0±0.67µs -0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1757.1±7.55µs 1763.0±13.37µs +0.34%
crc32_resource_id_creation/compute_from_bytes:large 86.7±0.24µs 86.8±0.34µs +0.12%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.01µs 5.4±0.03µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.4±0.55ns 92.4±0.33ns 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.8±0.29µs 64.8±0.86µs 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 946.9±3.53µs 949.6±4.11µs +0.29%
resource_index/index_build//tmp/ark-fs-index-benchmarks94k72W 106.6±3.35ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksYCHMXF 105.0±2.17ms N/A N/A
resource_index/index_get_resource_by_id 97.1±0.25ns 99.2±0.37ns +2.16%
resource_index/index_get_resource_by_path 52.8±0.26ns 55.1±0.33ns +4.36%
resource_index/index_update_all 1135.9±41.95ms 1137.9±59.79ms +0.18%
resource_index/index_update_one 684.1±33.46ms 693.3±33.36ms +1.34%

Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
@tareknaser
Collaborator Author

There appear to be some unexpected events coming from the notify stream. For example, I've identified a potential flaw with the following steps:

  1. Run the watch API on a folder.
  2. Copy a file multiple times (e.g., file copy.txt, file copy 2.txt).
  3. Up until this point, the index updates correctly.
  4. Delete both files simultaneously (select and delete them together).
    This results in a panic in ResourceIndex::update_one().

This situation requires further investigation. Additionally, we need to test this scenario alongside other ResourceIndex tests to be implemented for #88.

Signed-off-by: Tarek <tareknaser360@gmail.com>
@kirillt
Member

kirillt commented Sep 9, 2024

@tareknaser does only simultaneous deletion cause problems? Does simultaneous addition work fine?

github-actions bot commented Sep 9, 2024

Benchmark for b8ed0bd
Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 250.8±0.85µs 249.0±1.70µs -0.72%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.03µs 15.6±0.04µs +0.65%
blake3_resource_id_creation/compute_from_bytes:small 1357.4±3.61ns 1363.3±8.30ns +0.43%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.8±2.52µs 197.6±0.65µs -0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1762.4±4.96µs 1768.8±36.65µs +0.36%
crc32_resource_id_creation/compute_from_bytes:large 86.9±0.69µs 86.9±0.42µs 0.00%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.01µs 5.4±0.02µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.4±0.70ns 92.7±1.67ns +0.32%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.5±0.27µs 64.9±1.47µs +0.62%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 945.7±4.82µs 946.3±5.29µs +0.06%
resource_index/index_build//tmp/ark-fs-index-benchmarks61KWbS 106.6±1.98ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksLAAoc8 111.8±0.74ms N/A N/A
resource_index/index_get_resource_by_id 97.1±0.37ns 96.7±0.50ns -0.41%
resource_index/index_get_resource_by_path 52.6±0.24ns 52.7±0.25ns +0.19%
resource_index/index_update_all 1089.8±34.10ms 1115.0±32.55ms +2.31%
resource_index/index_update_one 669.3±24.95ms 660.4±22.62ms -1.33%

@tareknaser
Collaborator Author

does only simultaneous deletion cause problems? Does simultaneous addition work fine?

Yes and yes.
Even simultaneous deletion works fine in some cases, but I was able to reproduce the error more than once.

Signed-off-by: Tarek <tareknaser360@gmail.com>
@kirillt
Member

kirillt commented Sep 9, 2024

README should be updated to explicitly state in which folder this command should be run:

cargo run --example resource_index

If ark-cli can handle a similar scenario, it should be mentioned in the README, too.

github-actions bot commented Sep 9, 2024

Benchmark for 4fe7076
Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 248.4±1.85µs 250.4±3.74µs +0.81%
blake3_resource_id_creation/compute_from_bytes:medium 15.6±0.15µs 16.9±0.17µs +8.33%
blake3_resource_id_creation/compute_from_bytes:small 1360.0±4.45ns 1356.6±5.93ns -0.25%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.4±1.32µs 197.7±1.08µs +0.15%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1757.9±10.11µs 1769.9±21.12µs +0.68%
crc32_resource_id_creation/compute_from_bytes:large 87.0±0.89µs 86.8±0.62µs -0.23%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.06µs 5.4±0.09µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.6±1.26ns 92.6±1.39ns 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 65.0±0.47µs 65.0±0.57µs 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 953.0±24.12µs 967.2±2.47µs +1.49%
resource_index/index_build//tmp/ark-fs-index-benchmarks0HA7fz 106.8±2.43ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksf1Yq2s 112.6±1.21ms N/A N/A
resource_index/index_get_resource_by_id 97.4±0.67ns 94.9±1.11ns -2.57%
resource_index/index_get_resource_by_path 52.9±0.60ns 50.2±0.26ns -5.10%
resource_index/index_update_all 1091.9±36.89ms 1118.4±43.34ms +2.43%
resource_index/index_update_one 653.8±22.83ms 668.3±19.57ms +2.22%

Signed-off-by: Tarek <tareknaser360@gmail.com>
Signed-off-by: Tarek <tareknaser360@gmail.com>
@tareknaser
Collaborator Author

README should be updated to explicitly state in which folder this command should be run:

I added a note on how to run the example and mentioned that more can be done with ark-cli watch.

Benchmark for 1bbb897
Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.6±1.44µs 249.2±1.63µs -0.16%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.06µs 15.5±0.06µs 0.00%
blake3_resource_id_creation/compute_from_bytes:small 1362.9±7.31ns 1357.6±7.05ns -0.39%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.5±0.41µs 197.5±0.85µs 0.00%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1760.6±8.18µs 1770.1±29.87µs +0.54%
crc32_resource_id_creation/compute_from_bytes:large 86.6±0.29µs 86.8±0.19µs +0.23%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.03µs 5.4±0.06µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.4±0.54ns 92.3±0.30ns -0.11%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.5±0.30µs 64.9±0.52µs +0.62%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 947.4±4.87µs 952.9±3.74µs +0.58%
resource_index/index_build//tmp/ark-fs-index-benchmarksdlp6ac 117.7±2.70ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksiBuvQD 114.6±2.21ms N/A N/A
resource_index/index_get_resource_by_id 96.8±0.37ns 98.4±0.45ns +1.65%
resource_index/index_get_resource_by_path 52.6±0.15ns 54.4±0.39ns +3.42%
resource_index/index_update_all 1134.7±54.53ms 1169.1±51.93ms +3.03%
resource_index/index_update_one 688.0±29.08ms 701.6±31.71ms +1.98%

fs-index/src/watch.rs

let relative_path = file.strip_prefix(&root_path)?;
log::info!("Relative path: {:?}", relative_path);
index.update_one(relative_path)?;
Member

The result should be used to provide the user with the actual updates.

Member

So, we have 2 approaches to choose from:

  1. Make update_one return the same IndexUpdate type as update_all (simple). Then watch_index would return the same type to the user. We could batch updates made within some interval to pack events together, but that's optional (if you find this idea useful, we can create a follow-up task).
  2. Alternatively, we could specialize update_one, so the Track API and Watch API would become more powerful compared to the Reactive API. By extra power I mean more fine-grained events, similar to what notify-rs provides: not only add/remove, but also rename/modify. This is more difficult though, and we would need to unify the results from update_all and update_one before returning from watch_index. I suggest creating a follow-up issue for future consideration of this approach.
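
To make the optional batching idea from option 1 concrete, here is a minimal sketch rather than the crate's actual code. The IndexUpdate below is a simplified stand-in (plain PathBuf instead of timestamped paths), and merge is a hypothetical helper; its "later event wins" policy is only one possible choice.

```rust
use std::collections::{HashMap, HashSet};
use std::hash::Hash;
use std::path::PathBuf;

/// Simplified stand-in for the crate's `IndexUpdate` (assumption: the real
/// type stores timestamped paths, which are omitted here).
#[derive(Debug, PartialEq)]
pub struct IndexUpdate<Id: Eq + Hash> {
    pub added: HashMap<Id, HashSet<PathBuf>>,
    pub removed: HashSet<Id>,
}

impl<Id: Eq + Hash> IndexUpdate<Id> {
    /// An update that reports no changes.
    pub fn new() -> Self {
        Self {
            added: HashMap::new(),
            removed: HashSet::new(),
        }
    }

    /// Hypothetical helper: fold a later update into this batch.
    /// Policy (an assumption): the later event wins.
    pub fn merge(&mut self, later: IndexUpdate<Id>) {
        for id in later.removed {
            // A removal cancels an addition that is still pending in the
            // batch; otherwise it is recorded as a removal.
            if self.added.remove(&id).is_none() {
                self.removed.insert(id);
            }
        }
        for (id, paths) in later.added {
            // An addition overrides a pending removal of the same id.
            self.removed.remove(&id);
            self.added.entry(id).or_default().extend(paths);
        }
    }
}
```

With this policy, a file that is created and deleted within the same interval produces no event at all, which may or may not be what the user of watch_index expects.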

Collaborator Author

I chose the first approach to get things running faster and keep it simpler for now. I also plan to add tests soon.

Next, I want to add integration tests for this functionality. We could either add integration tests for fs-index directly or implement CI shell scripts to test ark-cli watch <PATH> for an end-to-end approach—possibly both?

Do you think CI shell scripts for ark-cli watch would be sufficient, or should we also include programmatic tests? For example, running the watcher in a separate thread and doing many create/delete operations to verify the results. Now that I think about it, writing these tests programmatically could be complex. What’s your take?

Collaborator Author

I suggest creating a follow-up issue for future consideration of this approach.

tracked in #89

Member

Do you think CI shell scripts for ark-cli watch would be sufficient, or should we also include programmatic tests? For example, running the watcher in a separate thread and doing many create/delete operations to verify the results. Now that I think about it, writing these tests programmatically could be complex. What’s your take?

Agree, I think we can achieve a proper result with a simple shell script. I imagine it like this:

  1. Run ark-cli watch in the background and direct its output to a dedicated log file.
  2. Randomly modify the folder content and write the performed actions into another log file.
  3. Then compare the two log files.
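
The comparison in step 3 could look roughly like the sketch below. It is written in Rust only to stay in the project's language; a shell script would do the same with sort and diff. It assumes both logs use one identical line per action (for example "added a.txt", a made-up format) and it ignores ordering. compare_logs and line_counts are hypothetical names.

```rust
use std::collections::BTreeMap;

/// Counts how often each non-empty, trimmed line occurs in a log.
fn line_counts(log: &str) -> BTreeMap<&str, usize> {
    let mut counts = BTreeMap::new();
    for line in log.lines().map(str::trim).filter(|l| !l.is_empty()) {
        *counts.entry(line).or_insert(0) += 1;
    }
    counts
}

/// Returns `(missed, unexpected)`:
/// - `missed`: actions that were performed more often than they were observed
/// - `unexpected`: events that were observed more often than they were performed
pub fn compare_logs<'a>(
    performed: &'a str,
    observed: &'a str,
) -> (Vec<&'a str>, Vec<&'a str>) {
    let expected = line_counts(performed);
    let actual = line_counts(observed);

    let missed = expected
        .iter()
        .filter(|(line, n)| actual.get(*line).copied().unwrap_or(0) < **n)
        .map(|(line, _)| *line)
        .collect();
    let unexpected = actual
        .iter()
        .filter(|(line, n)| expected.get(*line).copied().unwrap_or(0) < **n)
        .map(|(line, _)| *line)
        .collect();

    (missed, unexpected)
}
```

Ignoring order keeps the check robust against reordering by the watcher, but it would not catch the out-of-order events discussed elsewhere in this PR; an order-sensitive comparison would be a stricter variant.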

@tareknaser
Collaborator Author

I spent some time today looking into different ways to use notify by going through the docs and examples. Right now, we're using async_monitor, but it might not be the best choice for us.

Order is very important in our case because we need events to happen in the right order (for example, we don’t want to see "file1.txt removed" before "file1.txt created", since this would mess up our update_one() logic). Using an asynchronous watcher could cause issues with keeping the events in order.

Btw, I think this might be why we saw this error. As I reported, the error wasn’t consistent, which could be because asynchronous events don't always happen in the right order.

While looking through the examples, I also noticed that we might want to use the Debouncer. The file system can sometimes send multiple events for what is really just one change, which could cause problems. For example, it might trigger update_one() several times when a file is created.

I'm now testing this in a smaller example and looking at how to set up the event stream properly.
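
To illustrate the two concerns above (ordering and repeated events), here is a small std-only sketch. It is not notify's Debouncer API; FsEvent, Kind, and debounce are hypothetical names. It processes events strictly in arrival order and drops an event only when it repeats the previous event for the same path within a time window.

```rust
use std::path::PathBuf;
use std::time::Duration;

/// Hypothetical event kinds, loosely modelled on create/modify/remove.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Created,
    Modified,
    Removed,
}

/// Hypothetical event record; `at` is the time since the watcher started.
#[derive(Debug, Clone, PartialEq)]
pub struct FsEvent {
    pub path: PathBuf,
    pub kind: Kind,
    pub at: Duration,
}

/// Keeps events in arrival order and drops an event when the most recent
/// kept event for the same path has the same kind and happened less than
/// `window` earlier.
pub fn debounce(events: &[FsEvent], window: Duration) -> Vec<FsEvent> {
    let mut kept: Vec<FsEvent> = Vec::new();
    for event in events {
        let is_repeat = kept
            .iter()
            .rev()
            .find(|prev| prev.path == event.path)
            .map_or(false, |prev| {
                prev.kind == event.kind
                    && event.at.saturating_sub(prev.at) < window
            });
        if !is_repeat {
            kept.push(event.clone());
        }
    }
    kept
}
```

Because a "removed" following a "created" has a different kind, it is never collapsed, so update_one() would still see both events in the order they arrived.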

- Refactored `watch_index` to return an async stream of `WatchEvent` using `async_stream` and `tokio`
- Integrated notify debouncer to collapse multiple rapid events into a single event
- Updated the `watch_index` example to reflect the new implementation
- Marked `ResourceId` as `Sync` and `Send` to allow usage in async contexts
- Modified `ark-cli watch` to align with these changes

Signed-off-by: Tarek <tareknaser360@gmail.com>
…Update<Id>`

- Updated `update_one()` to return `IndexUpdate<Id>` with detailed changes
- Modified `WatchEvent::UpdatedOne` to include all file changes
- Added getters for `Timestamped<Item>` for easier access to data
- Updated `index_watch` example and `ark-cli watch` to adapt to these changes

Signed-off-by: Tarek <tareknaser360@gmail.com>
Benchmark for e4987f5
Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.1±0.61µs 248.5±2.46µs -0.24%
blake3_resource_id_creation/compute_from_bytes:medium 15.8±1.14µs 15.6±0.12µs -1.27%
blake3_resource_id_creation/compute_from_bytes:small 1364.5±2.05ns 1365.9±1.64ns +0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.3±3.99µs 197.1±3.13µs -0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1699.3±3.73µs 1718.3±22.19µs +1.12%
crc32_resource_id_creation/compute_from_bytes:large 86.7±1.35µs 86.8±1.13µs +0.12%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.13µs 5.4±0.01µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.3±0.19ns 92.3±0.30ns 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.1±0.32µs 64.2±0.60µs +0.16%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 953.1±88.59µs 933.0±5.78µs -2.11%
resource_index/index_build//tmp/ark-fs-index-benchmarkst4cPto 109.9±1.24ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarkszKpEve 105.9±3.20ms N/A N/A
resource_index/index_get_resource_by_id 99.6±0.38ns 95.0±2.01ns -4.62%
resource_index/index_get_resource_by_path 55.6±2.38ns 50.6±0.75ns -8.99%
resource_index/index_update_all 1117.9±32.54ms 1125.5±52.37ms +0.68%
resource_index/index_update_one 666.8±18.20ms 667.6±28.00ms +0.12%

Signed-off-by: Tarek <tareknaser360@gmail.com>
Benchmark for a95e06a
Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 251.3±1.17µs 249.3±0.46µs -0.80%
blake3_resource_id_creation/compute_from_bytes:medium 15.6±0.06µs 15.6±0.07µs 0.00%
blake3_resource_id_creation/compute_from_bytes:small 1362.8±6.87ns 1355.7±8.95ns -0.52%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 196.8±0.35µs 196.8±0.52µs 0.00%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1705.4±4.30µs 1703.1±20.62µs -0.13%
crc32_resource_id_creation/compute_from_bytes:large 86.8±0.38µs 86.7±0.15µs -0.12%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.03µs 5.4±0.05µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.3±0.18ns 92.4±0.48ns +0.11%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.5±0.37µs 64.7±1.40µs +0.31%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 939.6±4.55µs 941.8±11.76µs +0.23%
resource_index/index_build//tmp/ark-fs-index-benchmarks4VHJsr 109.1±2.05ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksjA6AY8 108.2±2.18ms N/A N/A
resource_index/index_get_resource_by_id 100.5±0.31ns 94.4±0.22ns -6.07%
resource_index/index_get_resource_by_path 53.8±0.30ns 50.4±0.18ns -6.32%
resource_index/index_update_all 1114.8±29.19ms 1118.7±41.25ms +0.35%
resource_index/index_update_one 660.5±20.57ms 665.9±23.63ms +0.82%

Comment on lines +68 to +77
impl<Item> Timestamped<Item> {
    pub fn item(&self) -> &Item {
        &self.item
    }

    /// Return the last modified time
    pub fn last_modified(&self) -> SystemTime {
        self.last_modified
    }
}
Member

Just curious, why do we need these getters?

@@ -515,6 +531,7 @@ impl<Id: ResourceId> ResourceIndex<Id> {
self.id_to_paths.remove(&id.item);
}

result.removed.insert(id.item);
Member

So, if I'm not mistaken, this line should be moved under the condition above:

if self.id_to_paths[&id.item].is_empty() {
    self.id_to_paths.remove(&id.item);
    // emit the removed event
}

I.e., only the removal of the last path with the same id should be considered a real removal.

And we could log duplicate removals in the else branch of the same condition.
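
A minimal sketch of that rule, under simplifying assumptions: ids are plain u64 values, paths carry no timestamps, and MiniIndex, Removal, and remove_path are hypothetical names rather than the crate's API. It also uses get_mut instead of indexing, so an id that is missing from the map cannot cause a panic.

```rust
use std::collections::{HashMap, HashSet};
use std::path::{Path, PathBuf};

/// Outcome of removing one path from the index.
#[derive(Debug, PartialEq)]
pub enum Removal {
    /// The last path with this id is gone: a real removal.
    Resource(u64),
    /// Other paths with the same id remain: only a duplicate went away.
    Duplicate(u64),
    /// The path was not in the index.
    Unknown,
}

/// Hypothetical, stripped-down index with the two maps discussed above.
pub struct MiniIndex {
    path_to_id: HashMap<PathBuf, u64>,
    id_to_paths: HashMap<u64, HashSet<PathBuf>>,
}

impl MiniIndex {
    pub fn new() -> Self {
        Self {
            path_to_id: HashMap::new(),
            id_to_paths: HashMap::new(),
        }
    }

    /// Registers `path` under `id` (assumes the path is not indexed yet).
    pub fn insert(&mut self, path: &Path, id: u64) {
        self.path_to_id.insert(path.to_path_buf(), id);
        self.id_to_paths
            .entry(id)
            .or_default()
            .insert(path.to_path_buf());
    }

    /// Removes `path` and reports whether the resource itself disappeared.
    pub fn remove_path(&mut self, path: &Path) -> Removal {
        let Some(id) = self.path_to_id.remove(path) else {
            return Removal::Unknown;
        };
        let now_empty = match self.id_to_paths.get_mut(&id) {
            Some(paths) => {
                paths.remove(path);
                paths.is_empty()
            }
            None => true,
        };
        if now_empty {
            // Only here would the "removed" event be emitted.
            self.id_to_paths.remove(&id);
            Removal::Resource(id)
        } else {
            // This is the else branch where a duplicate removal could be logged.
            Removal::Duplicate(id)
        }
    }
}
```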

Comment on lines +571 to +573
result
    .added
    .insert(id, HashSet::from([timpestamped_path]));
Member

We could wrap this in a similar condition, checking that the number of paths mapped to the id is zero, and log that a duplicate was introduced in the corresponding else branch.

This should allow us to relax this requirement on update_one:

    /// - In case of a addition, the resource was not already in the index
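
A sketch of the addition-side check, with the same caveats as before: u64 ids, plain paths, and a hypothetical free function record_path instead of the real update_one internals.

```rust
use std::collections::{HashMap, HashSet};
use std::path::PathBuf;

/// Records `path` under `id` and returns `true` only when no path was
/// mapped to `id` before, i.e. when this is a real addition. A `false`
/// result means a duplicate was introduced, which the caller could log
/// instead of emitting an "added" event.
pub fn record_path(
    id_to_paths: &mut HashMap<u64, HashSet<PathBuf>>,
    id: u64,
    path: PathBuf,
) -> bool {
    let paths = id_to_paths.entry(id).or_default();
    let is_new_resource = paths.is_empty();
    paths.insert(path);
    is_new_resource
}
```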

@kirillt
Member

kirillt commented Sep 16, 2024

Nitpick, but we can define aliases IndexUpdate::addition(id, path) and IndexUpdate::removal(id) for these snippets:

result.removed.insert(id.item);
result
    .added
    .insert(id, HashSet::from([timpestamped_path]));

Then we could immediately return from update_one once we determined the update:

return Ok(IndexUpdate::removal(id.item));
return Ok(IndexUpdate::addition(id, timpestamped_path));

This should be more readable.
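
For illustration, the two constructors could be sketched as follows. The struct is a simplified stand-in for IndexUpdate (assumption: plain PathBuf instead of the timestamped path used in the real code); only the names addition and removal come from the suggestion above.

```rust
use std::collections::{HashMap, HashSet};
use std::hash::Hash;
use std::path::PathBuf;

/// Simplified stand-in for the crate's `IndexUpdate`.
#[derive(Debug)]
pub struct IndexUpdate<Id: Eq + Hash> {
    pub added: HashMap<Id, HashSet<PathBuf>>,
    pub removed: HashSet<Id>,
}

impl<Id: Eq + Hash> IndexUpdate<Id> {
    /// An update that reports exactly one added resource.
    pub fn addition(id: Id, path: PathBuf) -> Self {
        Self {
            added: HashMap::from([(id, HashSet::from([path]))]),
            removed: HashSet::new(),
        }
    }

    /// An update that reports exactly one removed resource.
    pub fn removal(id: Id) -> Self {
        Self {
            added: HashMap::new(),
            removed: HashSet::from([id]),
        }
    }
}
```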


By the way, it seems that we could simplify the added field of the IndexUpdate structure. Since we don't distinguish duplicates, we can take any path as the representative of the group, so the app can do something with it. In practice, when a unique resource is detected, we take its path as the representative. When a duplicate appears, we skip it. If several paths are introduced at once during a unique addition, we take an arbitrary one (options: 1) random; 2) the first in the vector; 3) the shortest path).

From the API point of view, we don't need a collection of paths attached to the addition event, only one path (the representative).
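
As a sketch of option 3, the function below picks a representative from a set of paths. It interprets "shortest" as "fewest path components" and breaks ties lexicographically so the result is deterministic; both choices are assumptions, and representative is a hypothetical name.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Picks the path with the fewest components as the representative.
/// Ties are broken by lexicographic order; returns `None` for an empty set.
pub fn representative(paths: &HashSet<PathBuf>) -> Option<&Path> {
    paths
        .iter()
        .min_by(|a, b| {
            a.components()
                .count()
                .cmp(&b.components().count())
                .then_with(|| a.cmp(b))
        })
        .map(|p| p.as_path())
}
```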


We might need separate events to track duplicates, something like DuplicateAdded(id, path) and DuplicateRemoved(id, path). I'm not sure that duplicate removal would be useful, though; maybe duplicate addition is enough. It could be used to let the user select the representative manually. Just an idea for the future.
