Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notebook source control / diff #94810

Closed
rebornix opened this issue Apr 9, 2020 · 9 comments
Closed

Notebook source control / diff #94810

rebornix opened this issue Apr 9, 2020 · 9 comments
Assignees
Labels
api-proposal feature-request Request for new features or functionality notebook
Milestone

Comments

@rebornix
Copy link
Member

rebornix commented Apr 9, 2020

This is the summary of the discussions @kieferrm and me had about how to support diff in VS Code, what the UX is like and what are the gaps in the API to get it working properly.

UX (Rich/Plaintext diff)

Notebooks are reprenseted in rich UI (markdown previews, code editors, and outputs rendered in various forms) but the notebook documents are usually stored to file systems in text form and then it can be tracked by source control systems. Today in VS Code you can already do text form diffing but that's limited, for example, if an image/chart output changes, we can't tell what's being changed through text diff.

Thus we want to do rich diffs by rendering notebooks in Notebook Editor side by side, align the cells positions (similar to how we align lines in text diff editor). The catch with this approach is it doesn't present all the data in the document. One example is Jupyter Notebook stores custom metadatas (kernel info, document schema version, etc) but they are never presented in the VS Code UI. To allow users still have a full picture of what's being changed behind the scenes, we may want to still support text based diffing and users can easily switch between them.

FS & Source Control

Currently the two responsbilities of a vscode.NotebookContentProvider are

  1. resolving the content for a resource (identified by vscode.Uri) and converting to structured data vscode.NotebookDocument, and
  2. serializing vscode.NotebookDocument and saving its text form onto file system.

Since the identifiers for resources are vscode.Uri, which are always backed up by a file system provider, notebook content providers should use vscode.workspace.fs to resolve its raw content, instead of using node's fs.

The source control API in VS Code works seemlessly with vscode.Uri. For example, if you have a pending file change in a git repo, git extension can provide two resource Uri for the file, one file:/// uri for current content and one git:/// uri for the content prior to the change. Then we can ask notebook content provider to resolve the content for both Uris.

SCM API

export interface SourceControl {
    quickDiffProvider?: QuickDiffProvider;
}

interface QuickDiffProvider {
    provideOriginalResource?(uri: Uri, token: CancellationToken): ProviderResult<Uri>;
}

Notebook Content Provider

export interface NotebookContentProvider {
    /**
     * Resolve content from `uri` and convert it to `NotebookData`
     * Extensions should use `vscode.workspace.fs` for resolving the raw content for `uri`.
     */
    openNotebook(uri: Uri, openContext: NotebookDocumentOpenContext): NotebookData | Promise<NotebookData>;
}

Dirty changes in workspace

Uris work great for source control as the content changes are already saved to file system. However if users have a dirty notebook document in the workspace (say auto saved is turned off), we can't differenciate content on disk and content in workspace as they share the same Uri and VS Code core doesn't know how to turn the dirty vscode.NotebookDocument to text.

Since vscode.NotebookContentProvider is the only one who knows how to convert a vscode.NotebookDocument to text, we will delegate this to the content provider:

export interface NotebookContentProvider {
    /**
     * Save the text form of `notebookDocument` into `textDocument`
     */
    notebookAsText(notebookDocument: NotebookDocument, textDocument: TextDocument): Promise<void>;
}

Diff

We have a complex two way diff algorithm for the text files, which can probably be used for the notebook document too. The catch is how we are going to compare NotebookCells efficiently. If notebook providers can provide an unique id for each cell, that would be great. If not, we have to do deep comparison for NotebookCell content.

The comparison algorithm for NotebookCell might differ for different notebook providers. For example the GitHub Notebook wants to exclude outputs but Jupyter Notebook may include them. Not sure yet whether this can be described descriptively through metadata or we need to introduce new APIs.

@rebornix rebornix added this to the April 2020 milestone Apr 9, 2020
@rebornix rebornix added the feature-request Request for new features or functionality label Apr 9, 2020
@rebornix rebornix modified the milestones: April 2020, May 2020 Apr 24, 2020
@jrieken jrieken modified the milestones: May 2020, Backlog Jun 3, 2020
@rebornix rebornix modified the milestones: Backlog, July 2020 Jul 8, 2020
@rebornix rebornix self-assigned this Jul 8, 2020
@rebornix
Copy link
Member Author

Some useful links:

@rebornix
Copy link
Member Author

rebornix commented Jul 30, 2020

While building the first prototype for inline diff decorators (gutter) and side by side diff view, I didn't run into API limitations but quite a few UI/UX challenges and it has several new requirements for our notebook infrastructure (but it's already pretty powerful and doesn't require too many changes to build the prototype).

Diff Decorator

  • We listen to model content change of active notebook document, resolve the original resource URI from the scm service
    • It works the same as text files, the implementations are identical
  • We resolve the NotebookData of original resource URI, through NotebookContentProvider#openNotebook API.
    • openNotebook can be costly (parsing file from disk to json) so we don't want to do that after every content change. Thus we watch the original resource URI through FileService#watch, and only resolve notebook data when it changes.
    • It means we assume that the resource is always a file and watchable.
  • Compare
    • Delay/Debounce by 200ms
    • Currently we compare the diff in UI thread while text file diffing is performed in editorWorker. We need to introduce a worker for notebook documents and perform costly operations there
    • Currently use LCS diff algorithm, which is performant but doesn't tell a cell is deleted-then-inserted or modified.

Side by Side rich diff

  • Resolving original document and diff are the same as above
  • Align cells across original & modified notebook editor. There are two solutions
    • Whitespace/Viewzone API in the list view to allow gaps between list items
    • Insert dummy items into the list view
  • Need to figure out how to present UI changes caused by metadata, user preference, etc
    • cell metadata
    • document display order
    • user setting display order
    • multiple output mimetypes

@DonJayamanne
Copy link
Contributor

Thoughts:

  • I hope changes to VS Code specific metadata such as runState, runnable, editable and the like can be displayed in the GUI without having to resort to some text diffing (note, I'm referring to VSCode specific metadata, not custom).
  • Assume a user opens a notebook and runs a cell & the output is identical. I'm of the opinion this is a very likely/common scenario. At this point, I'd like to ensure the user knows immediately that nothing has changed except for executionOrder through the Gui.
    • I.e. in such a situation the Gui will show execution order has changed and no other changes , not even metadata.
  • Another side effect of the above is runStartTime and lastRunDuration.
    • For every run, this metadata will change. Again, I would expect the user to see the difference in the execution time to be the only change (apart from executionOrder).

I believe these two would be crucial in addressing the pain points users have when dealing with Jupyter ipynb notebooks. Others have been or I expect they will be resolved with your proposals (rich diffing of output, code and the support for metadata).

Question:

  • Is it possible to detect differences in order of the output? I can't think a use case for this, however its a good thing to have, this way the GUI can handle changes in order of output as well.

@rebornix Lets discuss offline how trusted notebooks could impact this design. FYI -if a notebook is not trusted, we return different metadata & no output.

@rebornix
Copy link
Member Author

cc @roblourens as we have quite a few metadata which are used for UI states (runnable, collapse, editable).

@rebornix rebornix modified the milestones: July 2020, August 2020 Aug 3, 2020
@rebornix
Copy link
Member Author

One main API gap/challenge raised from the offline discussions is that whether we should show inline SCM decorators (on gutter) when there are metadata or output changes is unknown. Currently when metadata or output changes, the core don't make the document dirty, instead content provide decides if they want to mark it dirty (by triggering content change event) as only the content provider has the knowledge if a specific metadata or output should be saved to disk.

For example, GitHub Issue Notebook extension implements a command to lock the cell (to disallow editing of the content), the command will

cell.editable = false;
this._onDidChangeNotebook.fire({ document } );

However, there is a bug with this approach: any extension can modify a cell's metadata, but they don't have the ability to trigger content change event. Imagine that another extension modifies cell.editable but the document doesn't become dirty, the data might be lost if users don't save the document intentionally.

One proposal from @jrieken is embedding persistent/transient info into metadata values, e.g.:

interface Cell {
  metadata: {
    editable: { value: boolean, persistent: boolean}
  };
}

The content provider will tell the core what the value of a metadata and whether it's persistent or transient. The core is then responsible for marking the document dirty and computing Diff (and also ignore the metadata if it's transient).

Whether a metadata will lead to content change on disk, is now a contract between the content provider and the core. We can apply this to cell outputs in the same way

type NotebookOptionalData<T> = { value: T, persistent: boolean }

interface Cell {
  outputs: Readonly<NotebookOptionalData<Output[]>>;

  metadata: Readonly<{
    editable: Readonly<NotebookOptionalData<boolean>>;
    executionCount: Readonly<NotebookOptionalData<number>>;
  }>;
}

Considering that modifying a metadata or output might have side effects (dirty), extensions should not modify them directly. All updates should come through a set of cell editing API provided by the core:

// pseudo code
updateMetadata(cell, { editing: false, executionCount: 2 });\
updateOutputs(cell, []);

The core will update the model based on the current state of the document (for example, reject the edit if the version is wrong), and the contract with the content provider (if the change will trigger content change).

@rebornix
Copy link
Member Author

Several proposals from the discussion in the morning, we didn't finalize yet but listing all of them helps us have a better understanding of the problem.

proposal 1
Instead of asking content provider to specify persistency on each metadata, we can split the metadata into two groups, one for persistable, the other for transient ones.

interface ICellMetadata {
 editable?: boolean;
}

interface INotebookData {
  cells: CellData[];
}

interface CellData {
  persistant: ICellMetadata;
  transient: ICellMetadata;
}

Extensions still use the updateMetadata API provided by the core to modify cell metadata. With this approach, we will need a unified property for evaluated metadata on NotebookCell.

interface NotebookCell {
  metadata: Readonly<ICellMetadata>
}

proposal 2

The other proposal is making outputs and all metadata persistent. Firstly, it's inconsistent and confusing that some content providers save the outputs but some don't, we might want to leave the decision to users instead of content provider:

  • By default the document becomes dirty when outputs change. If users set to ignore outputs in the file, it will no longer trigger content change.
  • Content provider reads this property to decide if they should save the outputs into the document (or we pass the option in save/save as a save option)

And since metadata will all be persistent, we will need to move execution and view related information out:

  • execution related states (e.g., runState): they will only be set during execution
  • view states (e.g., input/outputCollapsed): extensions may need new API to restore its state when a document is opened and also listen to its state change and save them into NotebookCell#metadata.custom if they want to store them in the document. Changes to NotebookCell#metadata.custom will trigger content change so they won't be lost.

@rebornix
Copy link
Member Author

rebornix commented Aug 17, 2020

Firstly, it's inconsistent and confusing that some content providers save the outputs but some don't, we might want to leave the decision to users instead of content provider

wrong assumption though 😢 . whether outputs should be saved to disk can be a preference, however whether outputs can be saved to disk is decided by the content provider. One example is powershell notebook (which allows to open ps1 file in notebook editor) can't do any output persistence.

@rebornix
Copy link
Member Author

My current thinking is for outputs, content provider tells us if they can be saved to disk; for metadata, we try to make them minimal and all persistent, execution and view related infos are moved out of metadata but stored as properties of NotebookCell.

With this approach, we can introduce a content provider ctor options

export function registerNotebookContentProvider(
	notebookType: string,
	provider: NotebookContentProvider,
	options?: {
		persistOutputs: boolean;
	}
): Disposable;

Content provider tells VS Code if they can/want to store outputs at registration time.

export interface NotebookCell {
  outputs: ReadonlyArray<CellOutput>;
}

@rebornix rebornix modified the milestones: August 2020, September 2020 Sep 1, 2020
@rebornix rebornix modified the milestones: September 2020, October 2020 Sep 30, 2020
@rebornix rebornix removed this from the October 2020 milestone Oct 26, 2020
@rebornix rebornix added this to the November 2020 milestone Oct 26, 2020
@rebornix rebornix modified the milestones: January 2021, Backlog Jan 28, 2021
@rebornix
Copy link
Member Author

rebornix commented Aug 2, 2021

We now support diffing notebook with output rendered.

@rebornix rebornix closed this as completed Aug 2, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Sep 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-proposal feature-request Request for new features or functionality notebook
Projects
None yet
Development

No branches or pull requests

3 participants