Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: rfc for supporting tables in excalidraw #1

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

ad1992
Copy link
Member

@ad1992 ad1992 commented Jul 13, 2024

@ad1992 ad1992 marked this pull request as ready for review July 13, 2024 12:26
Comment on lines 148 to 158
{
"id": "cell-1",
"row": 0,
"column": 0,
"x": 100,
"y": 100,
"width": 100,
"height": 50,
"text":"Cell 1",
"fontSize": 16;
"fontFamily": 1;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dwelle, as per my understanding, can you confirm if this is what you meant by virtual elements? These elements would be drawn at the time of rendering so there will be no existence of actual separate text elements

@ad1992
Copy link
Member Author

ad1992 commented Jul 15, 2024

@Mrazator I have tried incorporating some of your suggestions and answered some of the queries in the updated RFC.

Could you elaborate on what you meant by Defining the order of rows and columns to allow drag and drop?

@ad1992 ad1992 requested review from dwelle and Mrazator July 15, 2024 09:49
@Mrazator
Copy link
Member

@Mrazator I have tried incorporating some of your suggestions and answered some of the queries in the updated RFC.

Could you elaborate on what you meant by Defining the order of rows and columns to allow drag and drop?

Sure, the flat array of keeping all the cells and the related metadata seems quite brittle & inflexible long-term. I would aim at lifting most of the cell metadata above, so it's easily accessible (table dimensions, row order, column order, etc.). It should make the data also more understandable and flexible, so we don't have to iterate through all the cells and calculate the mental model for each potential use case - swap rows/columns (more below), column sorting, ensuring table integrity (no duplicate rows/columns), increasing/decreasing table dimensions, etc.

The example given was about the common drag-and-drop swapping behaviour of being able to pick a row/column and drop it into a different position within a table (see below examples from Notion and Notes). With a flat hierarchy (one array) to keep all the cells, such an operation would rely on a sequential search going through all the cells.

Screen.Recording.2024-07-15.at.12.27.07.mov
Screen.Recording.2024-07-15.at.12.28.48.mov

One simple way of improving this is to have a matrix (simplified):

{
  header: ["A", "B", "C"],
  rows: [
    [1, 2, 3],
    [4, 5, 6]
  ]
}

This way you can easily perform the following:

  • maintain table dimensions
    • # of columns is header.length
    • # of rows is rows.length
  • swap first and second row/columns
  • sort the first or second column

Such a format is probably fine for import/export or even simple single-player scenarios, but it raises issues in multiplayer, affecting both History and Collab (array order changes can't be safely applied/reconciled, unless we have some additional metadata, i.e. fractional indices). We should think through the multiplayer use cases and impliciation in both of these components, otherwise, it will lead to major usability & complexity issues.

I will post an update here once I sort my thoughts out.


Yes we will need write custom logic to handle reconcilation in `tables`, mainly attaching `timestamp` to each cell and writing a custom algorithm to resolve the conflicts and ensuring latest changes are applied to all the collaborators.

### Storing cells with respect to rows to ease reordering of rows and columns
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Mrazator as per your suggestion, I have added an alternative approach for taking care of swapping rows and columns.
I have added index and lastUpdated fields to resolve conflicts during collaboration.
Let me know what you think

Copy link
Member

@Mrazator Mrazator Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that in order to support concurrent changes out of the box, we cannot rely on an array at all and thus we would have to replace the whole thing with objects, which gets ugly pretty quickly. Very rough format idea:

// initial state
{
  header: {
    "column_A": { value: "A", index: "a0", isDeleted: false },
    "column_B": { value: "B", index: "a1", isDeleted: false },
    "column_C": { value: "C", index": "a2", isDeleted: false },
  },
  rows: {
    "row_0": {
    	cells: {
    	  "cell_0": { value: 1, index: "a0" },
          "cell_1": { value: 2, index: "a1" },
          "cell_2": { value: 3, index: "a2" },
    	},
        index: "a0"
        isDeleted: false,
    },
    "row_1": {
     	cells: {
     	   "cell_0": { value: 4, index: "a0" },
           "cell_1": { value: 5, index: "a1" },
           "cell_2": { value: 6, index: "a2" },
     	},
        index: "a1",
        isDeleted: false,
    },
  }
}

That is all based on simple scenarios, which modify columns & rows (notice the likely necessity for soft deletions - isDeleted).

A) To the table defined above, "A" added a row [7, 8, 9], "B" removed a column "C", "A" performs undo, "B" performs undo
B) To the table defined above, "A" added a column D, "A" performs undo, "B" added a column "E", "A" performs redo

Questionably, similar soft deletions might be / might not be needed on the cell level, but I still didn't go through all the possible cases.


Btw. I think we will likely want to avoid relations with created with boundElements, as those introduced a lot of challenges during multiplayer undo-redo, so I am guessing it would end in similar cases here. For details check the conflict resolution of relations defined with boundElements as part of the History tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Mrazator in the nested tree structure above, unless the object order is significant, you'd need to sort by index first during render if you didn't want to keep explicit x & y (and potentially width & height).

At that point we can go with a flat structure and it'll have similar perf profile (except you sort once, but longer array which might be slower depending on number of cells).

Ideally we get rid of sorting.

Below, I'm using a flat structure, but need to track x/y/width/height.

{
  "type": "table",
  "cells": {
    "cell-id-1": {
      "id": "cell-id-1",
      "row": 0,
      "column": 0,
      "x": 0,
      "y": 0,
      "width": 100,
      "height": 50
    },
    "cell-id-2": {
      "id": "cell-id-2",
      "row": 0,
      "x": 100,
      "y": 0,
      "column": 1,
      "width": 100,
      "height": 50
    }
  }
}

If we want to get rid of x & y, we'd have to keep the auxiliary rows mapping as @Mrazator suggested).

As for width & height, IMO we need to track that regardless, else we'd need to compute it dynamically from the text content and find a maxHeight for each row, an maxWidth for each column, before we could render any cell.

But by tracking these, a simple change in dimension for once cell would cascade for all the cells in given row/column based on what dimension changed.

If we track x & y as well, updates would cascade even more.

We would need to sync on property-level instead of element-level.


As for isDeleted, I'm still not convinced we need it for either.

Comment on lines 418 to 484
{
"id": "table-id",
"type": "table",
"x": 100,
"y": 100,
"title": "Table Title",
"rows": [{
"id": "row-1",
"index": 0,
"lastUpdated": 1633257617000
"cells": [{
"id": "cell-1",
"x": 100,
"y": 100,
"width": 100,
"height": 50,
"content": "Cell-11",
"timestamp": 1633257618000
},
{
"id": "cell-2",
"x": 200,
"y": 100,
"row": 0,
"column": 1,
"width": 150,
"height": 50,
"content": "Cell-12",
"timestamp": 1633257619000
}
]},
{
"id": "row-2",
"index": 1,
"lastUpdated": 1633257620000
"cells:[{
"id": "cell-3",
"x": 100,
"y": 150,
"width": 120,
"height": 60,
"content": "Cell-21",
"timestamp": 1633257621000
},
{
"id": "cell-4",
"x": 220,
"y": 150,
"width": 130,
"height": 60,
"content": "Cell-22",
"timestamp": 1633257617000

}
}],
"columns:": [{
"id": "column-1",
"index": 0,
"title": "Column 1",
"timestamp": 1633257617000
},
{
"id": "column-2",
"index": 1,
"title": "Column 2",
"timestamp": 1633257617000
}],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the discussion here so it reflects all the fields (lastUpdated, timestamps index` and more) and the high-level approach.

I agree using arrays will impact the perf especially when it comes to lookup so moving to objects will be a better option hence here is the proposed structure 👇🏻

{
  "id": "table-id",
  "type": "table",
  "x": 100,
  "y": 100,
  "title": "Table Title",
  "cells": {
    "cell-11": {
      "id": "cell-11",
      "width": 100,
      "height": 50,
      "content": "Cell-11",
      "rowId": "row-1",
      "columnId": "col-1",
      "timestamp": 1633257618000
    },
    "cell-12": {
      "id": "cell-12",
      "row": 0,
      "column": 1,
      "width": 150,
      "height": 50,
      "content": "Cell-12",
       "rowId": "row-1",
      "columnId": "col-2",
      "timestamp": 1633257619000
    },
    "cell-21":{
      "id": "cell-21",
      "width": 120,
      "height": 60,
      "content": "Cell-21",
       "rowId": "row-2",
      "columnId": "col-1",
      "timestamp": 1633257621000
    },
    "cell-22": {
      "id": "cell-22",
      "width": 130,
      "height": 60,
      "content": "Cell-22",
       "rowId": "row-2",
      "columnId": "col-2",
      "timestamp": 1633257617000
    }
  },
  "rows": {
    "row-1": {
    "id": "row-1",
    "index": 0,
    "cellIds":["cell-11", "cell-12"],
    "lastUpdated": 1633257617000,
    "isDeleted": false,
    },
    "row-2": {
    "id": "row-2",
    "index": 1,
    "cellIds":["cell-21", "cell-22"],
    "lastUpdated": 1633257620000,
    "isDeleted": false,
    }
  },
  "columns:": {
    "column-1": {
    "id": "column-1",
    "index": 0,
    "title": "Column 1",
    "timestamp": 1633257617000,
     "isDeleted": false,
  },
    "column-2": {
    "id": "column-2",
    "index": 1,
    "title": "Column 2",
    "timestamp": 1633257617000,
     "isDeleted": false,
    }
  },
  1. Each row has a mapping of cellIds to identify which cells are mapped to that row.
  2. Depending on whether the virtual text elements approach works out, the content field will be replaced with boundTextElementIds / boundTextId to keep it simple and the timestamp field will be removed as well from each cell. Can rename the content to text to keep it simple.
  3. As per isDeletedfor tracking row and column deletion we will need it. As per isDeleted for each cell, I am not sure if we will need it so can revisit this later.
  4. As per width and height, we need it as mentioned in RFC since each cell may have a different dimension
  5. As per x/y , since cellIds are there for each cell so we can compute the coordinates for each cell with cellIds and width and height
  6. I have kept rowId and columnId for each cell to identify the rows and columns for quick look up when cell updated since now it's a flatten structure. Can remove later if not needed but most likely will be needed during updates.

If the above structure looks good, I will get started on implementation and we can explore the scenarios on the way.
cc @Mrazator @dwelle

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1. Each row has a mapping of cellIds to identify which cells are mapped to that row.

do you anticipate we'll be doing lookups from cell -> row/column to warrant the expanded data structure of rows: {}, columns: {} instead of simple rows: string[][] @Mrazator suggested?

EDIT: Actually yes, during local edits so we can easily update applicable row + column cells when source cell dimensions change. Though I assume we could do array lookup or in-memory cache and it'd be fine, too.

The way I look at it, the renderer will go row by row and look up the cell by id to render it (sow row -> cells), instead.

3. As per isDeleted for tracking row and column deletion we will need it. As per isDeleted for each cell, I am not sure if we will need it so can revisit this later.

Actually scratch my previous comment, as I was considering isDeleted flag only from the perspective of the history. We will need the flag for collab/reconciliation. So we'll need it for each cell separately.

But why track it for whole columns/rows?

4. As per width and height, we need it as mentioned in RFC since each cell may have a different dimension

True, didn't consider users can make the columns/rows bigger than the text itself. Although perhaps the MVP doesn't need to support manual resizing (but it probably won't be that hard to implement).


Now that I'm thinking about it more, we'll definitely need per-property syncing & reconciliation, otherwise if someone is editing two different cells in the same row/column, there would be conflicts such as when one user writes enough text it expands the row height, while the other user edits a different cell in that row.

Row/column reordering edits would have similar conflicts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually scratch my previous comment, as I was considering isDeleted flag only from the perspective of the history. We will need the flag for collab/reconciliation. So we'll need it for each cell separately.

Why for each cell? Users can delete a row or a column but not individual cell

But why track it for whole columns/rows?

To track the deletion of entire row / column and this will be used in undo/redo and collab reconciliation as well.

Now that I'm thinking about it more, we'll definitely need per-property syncing & reconciliation, otherwise if someone is editing two different cells in the same row/column, there would be conflicts such as when one user writes enough text it expands the row height, while the other user edits a different cell in that row.

Per property syncing as in? With the help of index, lastUpdated in row and column, and timestamp for each cell we could come up with a reconciliation algorithm (theoretically as I think but will be more clear once implemented)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why for each cell? Users can delete a row or a column but not individual cell
..
To track the deletion of entire row / column and this will be used in undo/redo and collab reconciliation as well.

Yeah nvm, I didn't think this one through.

Per property syncing as in? With the help of index, lastUpdated in row and column, and timestamp for each cell we could come up with a reconciliation algorithm (theoretically as I think but will be more clear once implemented)

So that the remote client doesn't throw away someone's text update just because someone (e.g. themselves) have updated width of that cell more recently due to table reflow.

But unless we want to keep separate version for each respective attribute (or ones we deem should be synced per-attribute), we'll have to come up with something else, such as in-memory versioning for attributes. But that feels brittle and would sidestep element.version entirely, adding more complexity.

Or we can ignore it, but then we should come up with a data structure which requires the least updates of neighboring cells during reflow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I get the point, this is a valid use case, I have updated the RFC with the latest structure, queries/updates so we can start the implementation.
As per the syncing, I will come back to this later and we can decide how we want to go about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, most likely we will have to add per-property syncing for tables as @dwelle was also suggesting.
Each cell property can have a timestamp property as well and the conflicts will be resolved based on that to ensure all collaborators' updates are applied correctly and there isn't randomness of who wins.

Let's say user A updates the width of the cell and user B updates the text of the cell, so the updates would look like 👇🏻

{
  "id": "cell-11",
  {
     width: 60,
      timestamp: 1633257618000
   },
   {
       height: 50,
       timestamp: 1633257616000,
    },
   
      content: "Hello",
      timestamp: 1633257616000,
    }
}

And updates of user B would look like 👇🏻

{
  "id": "cell-11",
  {
     width: 60,
      timestamp: 1633257616000
   },
   {
       height: 50,
       timestamp: 1633257616000,
    },
   
    content: "Hello",
    timestamp: 1633257618000,
    }
}

Now each property timestamp will be compared and one with a greater timestamp value will be applied for that property
width -> User A wins
height -> Both are equal
content -> User B wins

However, I strongly think we are going towards OT so most likely we might want to try out OT (Operational Transform) for tables to resolve the concurrent conflicts efficiently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You cannot rely on regular timestamps (in such format) for synchronisation purposes (in general in a distributed system), as you cannot ensure the clocks on the client are in sync. Instead, lamports clocks (~version) and their combinations with versionNonce/clientId or in the form of HLC are usually used. Also for a traditional OT, you would need a server, which does not go well with E2EE and local-first principles.

I am also not persuaded we need syncing on the properties of the individual cells, as it will become unnecessarily complex not just for reconciliation, but also for the multiplayer history. Instead, as I suggested above, I would aim at lifting most of the metadata from the cells up. For example, if the width is defined in a column and the height is in the row, you don't need property-based syncing on the cell.

Surely there will be more properties on the cell level like fontFamily, but if we survived with element-based sync on all the other elements till now, we should be good to go for the cells as well. I would either attack property-level syncing before or after we have tables, but not as part of the tables and not just suddenly for the cells.

In its simplest form, when it comes to concurrent updates to cells, it would look something like:

User A updated cell-11:

{
  "id": "cell-11",
  "originalText": "Hello from A",
  "fontFamily": 1,
   "version": 100
}

User B updated cell-11:

{
  "id": "cell-11",
   "originalText": "Hello from B",
   "fontFamily": 2,
   "version": 100,
}

As we got used to now, since versions are the same, we take the one with a higher versioNonce. We could have versionNonce specified alongside the version. But ideally, as mentioned above. versionNonce should be the same for all the captured updates, so that one client consistently wins over the other, and the table data won't just become a mess.

This means we could just have one versionNonce on the whole table level, and in case of conflicts, take all the updates from the table with higher versionNonce. I.e. if User A update generated a versionNonce: 111 and User B generate a versionNonce: 222, User B should win for all conflicting changes - cells with the same version. Similar versioning can be applied to rows and columns.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a combination of version and versionNonce is what we use now as well.

For dimensions we can have width defined on column and height defined in rowand then the top level version and versionNonce combination should work well (apart from selecting random client when version is same and whichever version nonce is higher, but thats ok as we do that now as well)

Even if I ignore text-related properties (font size, font family etc for now where we need property-based syncing), later when we have full-fledged WYSIWYG, more properties on the cell will be added and syncing could become more challenging unless we check each type of property update or have some property based syncing algorithm - each cell has a version and version nonce (That's what feels theoretically rn).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rich text will be another challenge, but IMHO should not be done sooner than properties level sync. Once we have sync on properties, we will need to re-think the whole text-level reconciliation and after that, we will likely do something like https://www.inkandswitch.com/peritext/ for rich text (either ourselves or partially reuse an existing CRDT).

Copy link
Member

@Mrazator Mrazator Aug 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got me thinking, what's stopping us from a flat hierarchy (elements JSON) with the additional row element type?

  1. conceptually enforces a parent-child table > row > cell hierarchy (runtime)
  2. loading each element into an in-memory Map would rely on its insertions order (~ordering based on index just on init) and provide 0(1) lookups
  3. we could keep a plain-old text element for cells (hence once we tackle rich-text / text-level-reconcilliation, tables would be solved automatically)
  4. reconciliation would stay more or less untouched (for now)
  5. multiplayer history would mostly be provided out of the box - at least until we recognize some conflicts worth tweaking
  6. render step would skip rendering of all rows/cells (text elements) and render it with the table element (similarly as we do with bound text now) - hence keeping elements' order denormalized (similar to frames > frame childs)

tables

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants