Refactor DocumentRoom, throttle saves instead of debouncing saves
#250
base: main
Conversation
Thanks for submitting your first pull request! You are awesome! 🤗
At first sight, this breaks the dirty indicator in JupyterLab's UI.
I created a test that passes on main and fails with this PR.
How do you handle the case where the room is created at the same time from two different clients?
On removing the locks: I don't think it's any simpler than previously, as there is now a possible recursive call to `self._maybe_save_document()`.
To clarify, JupyterLab's UI sets a graphical dirty indicator whenever a change occurs, and listens to the shared model dirty attribute to clear this dirty indicator (on save).
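For illustration, here is a minimal sketch of that mechanism using pycrdt directly (not the actual JupyterLab or jupyter_ydoc code; the exact shape of `event.keys` is my assumption and may vary by pycrdt version):

```python
from pycrdt import Doc, Map

doc = Doc()
doc["state"] = state = Map({"dirty": True})

def on_state_change(event) -> None:
    # MapEvent.keys describes each changed key; entries are assumed to carry
    # "action" / "oldValue" / "newValue" fields, as in Yjs map events.
    change = event.keys.get("dirty")
    if change is not None and change.get("newValue") is False:
        print("clear the dirty indicator in the UI")

state.observe(on_state_change)

# The server would do something like this after a successful save:
state["dirty"] = False
```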
@davidbrochart Thank you for your feedback! It has allowed me to simplify and correct the implementation in this branch. Let me address the rest of your feedback here:
The current implementation doesn't handle this case either. Having a lock within `initialize()` only prevents concurrent calls on the same instance:

```python
room1 = DocumentRoom(...)
room2 = DocumentRoom(...)

async with asyncio.TaskGroup() as tg:
    task1 = tg.create_task(room1.initialize())
    task2 = tg.create_task(room2.initialize())
```

This will still result in the main body of `initialize()` running concurrently across the two instances. I don't think this is an issue, since Tornado awaits the handler method that calls `initialize()`. However, we should probably open an issue to Tornado to verify that this method isn't ultimately run as a concurrent task upstream somewhere.
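For context, one common way to make this safe is to key room creation by path so both clients share a single, fully initialized instance. This is only a sketch with hypothetical names (`get_room`, a module-level registry, a stub `DocumentRoom`), not the fix that later landed in #255:

```python
import asyncio


class DocumentRoom:
    """Stub standing in for the real DocumentRoom, just for this sketch."""

    def __init__(self, path: str) -> None:
        self.path = path

    async def initialize(self) -> None:
        await asyncio.sleep(0)  # placeholder for loading the file, etc.


_rooms: dict[str, DocumentRoom] = {}
_rooms_lock = asyncio.Lock()


async def get_room(path: str) -> DocumentRoom:
    # Serialize room creation per path so two clients connecting at the same
    # time get the same instance and initialize() runs at most once per path.
    async with _rooms_lock:
        room = _rooms.get(path)
        if room is None:
            room = DocumentRoom(path)
            await room.initialize()
            _rooms[path] = room
        return room
```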
Thanks for clarifying! I've reverted that change and added some logic to prevent a save loop without needing a lock.
Proof that this PR correctly sets the dirty indicator: Screen.Recording.2024-03-19.at.5.43.22.PM.mov
This PR also seems to set the dirty indicator faster than `main`: Screen.Recording.2024-03-19.at.6.12.15.PM.mov. I think this is mainly due to the performance benefits of simplifying the async code and removing the lock waits.
Just opening JupyterLab and creating a new notebook, it seems to be saved twice without doing any modification. Can you explain why?
Good point. I opened #255, which includes a test for concurrent room creation that fails on this PR.
I still think that the recursive call to `self._maybe_save_document()` should be avoided.
@davidbrochart Thanks for the feedback! Let me address it here:
I'm happy to change this, since I'm personally indifferent. I've changed this in the latest revision; this branch now uses a different approach that avoids the recursive call.
Sure! The current implementation ensures that all updates are written to disk by using the `self._should_resave` flag.

I tried to determine what Ydoc updates were occurring when the room is initialized (to see if they can be removed), but for some reason, I wasn't able to access every attribute on the change events:

```python
def _on_document_change(self, target: str, event: Any) -> None:
    """
    ...
    """
    # Temporary debug logging; `MapEvent` comes from pycrdt.
    self.log.error(target)
    self.log.error(event)
    if isinstance(event, MapEvent):
        self.log.error(event.keys)
```

I don't think this issue should block this PR, however; we should track this in a separate issue and address it later if possible.
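As an aside, here is a rough sketch of inspecting pycrdt change events by type; the attribute names (`keys`, `delta`) reflect my understanding of pycrdt's event API and may differ between versions:

```python
from pycrdt import ArrayEvent, Doc, Map, MapEvent, Text, TextEvent

doc = Doc()
doc["meta"] = meta = Map()
doc["source"] = source = Text()

def log_event(event) -> None:
    # Each event type exposes its changes differently.
    if isinstance(event, MapEvent):
        print("map changes:", event.keys)       # per-key old/new values
    elif isinstance(event, (ArrayEvent, TextEvent)):
        print("sequence delta:", event.delta)   # insert/retain/delete ops

meta.observe(log_event)
source.observe(log_event)

meta["dirty"] = True
source += "print('hello')"
```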
High level comments... I do think this PR dramatically improves the readability of the code. Great job there, @dlqqq! Reviewing the code, though, I'm a bit uneasy about changing so much under the hood of the DocumentRoom without more unit test coverage (and possibly some integration tests). The current codebase has been "cooking" in a released state for a little while, so I have some confidence (albeit limited) that it works relatively well. We've heard edge-case issues, but we don't have unit tests to measure whether we're getting any closer to a better state with these changes. Before merging impactful rewrites like this, I'd prefer we increase our test coverage first.
Personally, if we're aiming for readability, I like this new commit over the recursive `asyncio.create_task()` call.
@Zsailer I can add more unit test coverage in this PR. 👍
@dlqqq Thanks for identifying the bug in concurrent room initialization. A fix for it was merged in #255, and this PR now has conflicts, but it shouldn't be a problem since you are essentially rewriting the `DocumentRoom` class anyway.
Description
- Refactors the `DocumentRoom` class to not rely on the `asyncio.Lock` context manager. This greatly improves readability by reducing nesting and reliance on the `asyncio` API.
- Throttles saves instead of debouncing them, ensuring that document changes will be flushed to disk on a minimum interval (see the sketch below). Related: Automatic file save strategies #244
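To illustrate the distinction (a generic asyncio sketch, not the code in this PR): a debounce restarts its timer on every change, so continuous edits can postpone the save indefinitely, while a throttle lets the pending save run on schedule and saves at most once per interval.

```python
import asyncio
from collections.abc import Awaitable, Callable


class Saver:
    """Sketch contrasting the two strategies; `save` is any coroutine function."""

    def __init__(self, save: Callable[[], Awaitable[None]], delay: float = 1.0) -> None:
        self._save = save
        self._delay = delay
        self._task: asyncio.Task | None = None

    def on_change_debounced(self) -> None:
        # Debounce: restart the timer on every change. Frequent edits keep
        # cancelling the pending save, so it may never run.
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = asyncio.create_task(self._sleep_then_save())

    def on_change_throttled(self) -> None:
        # Throttle: if a save is already pending, let it pick up this change.
        # A save happens at most once per delay, and never later than one
        # delay after the first change in a burst.
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._sleep_then_save())

    async def _sleep_then_save(self) -> None:
        await asyncio.sleep(self._delay)
        await self._save()
```

This sketch doesn't handle a change that arrives while the save coroutine is already writing to disk; the PR covers that case with `self._should_resave`, as described in the change summary below.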
All existing unit tests pass locally.
Change summary
- Removes `self._initialization_lock`.
  - This lock is unnecessary as long as the `initialize()` method is only called once and awaited, as is the case in this extension. This is a very reasonable usage constraint. Furthermore, in `asyncio`, locks do not provide thread safety, contrary to the original docstring of this method.
- Removes `self._update_lock`.
  - The only purpose of this lock was to prevent `self._on_document_change()` from being called while the lock is held. Removing the locks results in a save loop, as without the lock, `self._maybe_save_document()` would trigger `self._on_document_change()`. The save loop is caused by the `self._document.dirty = False` statement, which could have been removed without consequence. See below.
- Throttles saves, instead of debouncing saves by cancelling the previous task.
  - Throttling seems preferable to debouncing here, as debouncing could result in the document not being saved if the document is being changed too frequently, which may arise in rooms with lots of collaborators.
  - Previously, every time a new `self._maybe_save_document()` task was started, the previous task was cancelled if it was in progress. The method required the previous task as an argument, which resulted in a weird way of calling the method.
  - The new implementation of `self._maybe_save_document()` does not require an extra argument or `task.cancel()`; instead it relies on knowledge of its own state, stored in a couple of new instance attributes. Here is an overview of the algorithm (a rough sketch follows this list):
    - If a previous `self._maybe_save_document()` task is waiting on `self._save_delay`, then the current task can return early, as that previous task will save the Ydoc anyways.
    - If a previous `self._maybe_save_document()` task is currently saving via `FileLoader`, then the current task should set `self._should_resave = True` and then return. Later, when the previous task is done saving, if this attribute is `True`, then it will re-run itself via `asyncio.create_task(self.maybe_save_document())`.
- Removed the `self._document.dirty = False` statements.
  - The `dirty` attribute was only referenced in a single unit test, and I could not find it mentioned in `pycrdt` or `pycrdt-websocket`. I removed this because setting `self._document.dirty` triggers the `_on_document_change()` observer, causing a save loop in this branch. Removing the statement from `self._maybe_save_document()` allows for `self._update_lock` to be removed.
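For concreteness, here is a rough sketch of the algorithm described above. It is a simplification and not the PR's actual code: `save_to_disk` is a hypothetical stand-in for the `FileLoader` call, and the `_waiting`/`_saving` flags stand in for however the branch actually tracks task state.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Any


class DocumentRoomSketch:
    """Illustrates the throttled-save algorithm described in this PR."""

    def __init__(self, save_to_disk: Callable[[], Awaitable[None]], save_delay: float = 1.0) -> None:
        self._save_to_disk = save_to_disk
        self._save_delay = save_delay
        self._waiting = False        # a task is sleeping on the save delay
        self._saving = False         # a task is currently writing to disk
        self._should_resave = False  # a change arrived during a write

    def _on_document_change(self, target: str, event: Any) -> None:
        # Called by the Ydoc observer; schedule a save without awaiting it.
        asyncio.create_task(self._maybe_save_document())

    async def _maybe_save_document(self) -> None:
        if self._waiting:
            # Another task is already waiting on the delay; it will pick up
            # this change when it saves, so return early.
            return
        if self._saving:
            # A write is in progress; ask it to save once more when done.
            self._should_resave = True
            return

        self._waiting = True
        await asyncio.sleep(self._save_delay)  # throttle interval
        self._waiting = False

        self._saving = True
        try:
            await self._save_to_disk()
        finally:
            self._saving = False

        if self._should_resave:
            self._should_resave = False
            # Re-run to flush the changes that arrived during the write.
            asyncio.create_task(self._maybe_save_document())
```

Note that the discussion above indicates the recursive `asyncio.create_task()` re-run was later replaced with a different approach; this sketch reflects the algorithm as described in this change summary.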