Making the Metadata API public #40676

Closed
27 of 29 tasks
sorbaugh opened this issue Sep 28, 2023 · 4 comments
Labels
0. Needs triage Pending check for reproducibility or if it fits our roadmap enhancement ❄️ 2023-Winter

Contributor

sorbaugh commented Sep 28, 2023

Requirements

Use cases

Existing

  • Store width and height for pictures
  • Store GPS coordinate for pictures
  • Store user readable location for pictures

Planned

Potential

  • Tags on files

Features

Already implemented

  • Store arbitrary metadata as a string and link it to a fileid and a data name.
  • Allow the metadata to be manually and granularly exposed to WebDAV requests.
  • Trigger providers to generate metadata during files:scan and on file updates (NodeWrittenEvent).
  • Delete metadata upon file deletion (NodeDeletedEvent). This still needs adjustments, see: [Bug]: oc_file_metadata preservation issues with trashbin #34424

Still missing

  • Support dependency among providers to allow providers to depend on the work of other providers. Example: MapMediaToPlaceJob in the Photos app.
  • Execute the provider in a background process to not slow down the request. Example: MapMediaToPlaceJob in the Photos app.
  • Support listing all distinct metadata values for a given user and metadata name. Example: listing all places linked to pictures in the Photos app.
  • Support listing all files with a given metadata value and name for a given user. Example: listing all pictures linked to a place in the Photos app.

Needed for the new use cases

  • Ability to use a metadata key in an orderby directive in a WebDAV SEARCH request (sorting pictures by taken date)
  • Automatically expose the metadata in WebDAV requests. (show EXIF data in sidebar)
  • Allow clients to set the content of a metadata entry. (live photos)

PRs

sorbaugh added enhancement 0. Needs triage Pending check for reproducibility or if it fits our roadmap ❄️ 2023-Winter labels Sep 28, 2023
sorbaugh added this to the Nextcloud 28 milestone Sep 28, 2023
sorbaugh moved this to 📄 To do (~10 entries) in 📁 Files team Sep 28, 2023
Contributor

artonge commented Sep 28, 2023

Technical requirements

Database

Indexing all the metadata is unnecessary, so we have one table to store all the metadata, and another table to store indexed metadata.
The second table contains a copy of some metadata contained in the first table.
The second table can be rebuilt from the first one. This opens the way to be smarter about which data we keep in the indexed table. In the future, we could drop some of its data, or populate it on demand.

The last_update column is used to know how old a given set of metadata is.
The unique column is used to avoid race conditions when updating metadata.

The indexed table contains two columns for the value. One is a varchar, the other a bigint. This lets us optimize queries for both types of data.

oc_metadata

| Name | Type |
|------|------|
| fileid | varchar |
| metadata | text |
| unique | varchar |
| last_update | datetime |

oc_metadata_index

| Name | Type |
|------|------|
| fileid | varchar |
| key | varchar |
| value_string | varchar |
| value_int | bigint |
| last_update | datetime |
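
To make the two tables concrete, here is a rough sketch using SQLite from Python. Column sizes, the NOT NULL constraints, the primary key on oc_metadata and the two lookup indexes are my assumptions for illustration; they are not part of the proposal, and the real migration would be written against Nextcloud's own schema tooling.

```python
import sqlite3

# Hypothetical DDL mirroring the two tables above. Sizes, constraints and
# the lookup indexes are assumptions made for this sketch only.
SCHEMA = """
CREATE TABLE oc_metadata (
    fileid      VARCHAR(64) PRIMARY KEY,
    metadata    TEXT NOT NULL,          -- JSON document holding all metadata
    "unique"    VARCHAR(32) NOT NULL,   -- token replaced on every write
    last_update DATETIME NOT NULL
);
CREATE TABLE oc_metadata_index (
    fileid       VARCHAR(64) NOT NULL,
    "key"        VARCHAR(255) NOT NULL,
    value_string VARCHAR(255),          -- filled for string values
    value_int    BIGINT,                -- filled for integer values
    last_update  DATETIME NOT NULL
);
CREATE INDEX idx_metadata_string ON oc_metadata_index ("key", value_string);
CREATE INDEX idx_metadata_int ON oc_metadata_index ("key", value_int);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create both metadata tables on the given connection."""
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = {
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
}
```

Keeping one index per value column is what would let string and integer lookups each use an index suited to their type.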

Data format in the database

Prefixing the name of a metadata key prevents conflicts, and allows us to know who created it. It is not decided yet whether we should enforce the prefixing.

A special key _indexed_values keeps track of which properties should be indexed.

{
	"files:exif": {
		"value": {
			"width": 0,
			"height": 0,
			"coordinate": {
				"latitude": 0,
				"longitude": 0
			},
			"taken_date": 123456789
		},
		"type": "array",
		"indexed": true
	},
	"files:blurhash": {
		"value": "azertyuiop",
		"type": "string"
	},
	"photos:place": {
		"value": "Paris",
		"type": "string",
		"indexed": true
	},
	"files:last_access": {
		"value": ["user1", "user2"],
		"type": "string[]"
	},
	"photos:taken_date": {
		"value": 123456789,
		"type": "int",
		"indexed": true
	},
	"files:live_photos": {
		"value": "1234",
		"type": "string"
	},
	"files:state": {
		"value": "editing",
		"type": "string"
	},
	"files:tags": {
		"value": ["tag1", "tag2"],
		"type": "string[]"
	}
}
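
Since the indexed table can be rebuilt from the first one, here is a minimal sketch of how index rows could be derived from a JSON document like the one above. It relies on the per-entry "indexed" flag shown in the example. How array values such as files:exif would be indexed is not specified here, so this sketch simply skips them; the function name index_rows is hypothetical.

```python
import json


def index_rows(fileid, document):
    """Return (fileid, key, value_string, value_int) tuples for indexed entries."""
    rows = []
    for key, entry in json.loads(document).items():
        if not entry.get("indexed"):
            continue
        value = entry["value"]
        # Assumption: only plain strings and integers are indexed in this sketch.
        if isinstance(value, bool) or not isinstance(value, (int, str)):
            continue
        if isinstance(value, int):
            rows.append((fileid, key, None, value))
        else:
            rows.append((fileid, key, value, None))
    return rows


doc = json.dumps({
    "files:blurhash": {"value": "azertyuiop", "type": "string"},
    "photos:place": {"value": "Paris", "type": "string", "indexed": True},
    "photos:taken_date": {"value": 123456789, "type": "int", "indexed": True},
})
rows = index_rows("42", doc)
```

Each returned tuple maps onto the fileid, key, value_string and value_int columns of oc_metadata_index.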

Populating metadata

  • On file creation or edit, an event is broadcast with the current metadata. Apps can listen to this event and change the metadata. Metadata are updated in the database at the end of the event.
  • Another event is dispatched in the same way, but inside a background job, allowing apps to do heavier work to generate the metadata.
  • Clients should be able to update the value of a metadata entry in PROPPATCH requests.
  • Metadata can be set as read (default) or indexed. All 'read' metadata related to a file are stored in oc_metadata as a single JSON document.
  • If set as indexed, the metadata will be stored in oc_metadata and an entry will also be generated in the table oc_metadata_index.
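
The flow in the list above could look roughly like this. It is a sketch in Python, not the Nextcloud event dispatcher: MetadataEvent, Dispatcher and the two listeners are hypothetical names, and the save callback stands in for the database write.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataEvent:
    """Carries the current metadata of one file to the listeners."""
    fileid: str
    metadata: dict = field(default_factory=dict)


class Dispatcher:
    def __init__(self, save):
        self._listeners = []
        self._save = save  # callback standing in for the database write

    def listen(self, listener):
        self._listeners.append(listener)

    def dispatch(self, fileid, current):
        event = MetadataEvent(fileid, dict(current))
        for listener in self._listeners:
            listener(event)  # each app may read or change event.metadata
        self._save(fileid, event.metadata)  # single write at the end of the event
        return event.metadata


store = {}
dispatcher = Dispatcher(save=lambda fileid, metadata: store.__setitem__(fileid, metadata))


def exif_listener(event):
    # Hypothetical provider: a real one would read the file's EXIF data.
    event.metadata["files:exif"] = {"value": {"width": 0, "height": 0}, "type": "array"}


def place_listener(event):
    event.metadata["photos:place"] = {"value": "Paris", "type": "string", "indexed": True}


dispatcher.listen(exif_listener)
dispatcher.listen(place_listener)
result = dispatcher.dispatch("42", {})
```

The background variant would call the same dispatch from inside a background job, so heavier listeners do not slow down the request.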

WebDAV

Two options for requesting metadata, either as individual properties:

<nc:metadata:files:exif></nc:metadata:files:exif>
<nc:metadata:files:blurhash></nc:metadata:files:blurhash>

or grouped under a single parent property:

<nc:metadata>
    <nc:metadata:files:exif />
    <nc:metadata:files:blurhash />
</nc:metadata>

Contributor

artonge commented Sep 28, 2023

TODO

  • MetadataModel

    • getMetadataKey(key: string): mixed
    • setMetadataKey(key: string, value: mixed)
    • addIndex(key: string)
    • removeIndex(key: string)
    • listIndexes(): string[]
    • isIndex(key: string): bool
  • Extend Node API to add:

    • getMetadata(): MetadataModel
    • setMetadata(metadata: MetadataModel)
  • Broadcast two new events MetadataRequestEvent and BackgroundMetadataRequestEvent.

    • getNode(): Node
  • Create basic API to retrieve the metadata

    • getMetadataForFile(fileid: string): MetadataModel
    • setMetadataForFile(fileid: string, metadata: MetadataModel)
    • setMetadataKeyForFile(fileid: string, key: string, value: mixed, indexed: bool)
  • Create advanced API to retrieve the metadata

    • getIndexedMetadataValueForUserAndKey(userId: string, key: string): <string|number>[]
    • getFilesForUserAndKeyAndIndexedValue(userId: string, key: string, value: string): string[]
    • ... Query helper ...
  • Plug into the WebDAV server to react to <nc:metadata> requests. PROPFIND and PROPPATCH

  • Plug into the WebDAV server to react to SEARCH requests

  • Create a migration to migrate to the new tables

    • Create oc_metadata
    • Create oc_metadata_index
    • Populate oc_metadata
    • Populate oc_metadata_index
    • Delete oc_files_metadata
  • Update the code in occ files:scan --generate-metadata

  • Update the documentation

  • Migrate the existing providers to the new API

    • EXIF
    • Places (Photos)
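
For discussion, a minimal sketch of what the MetadataModel from the first TODO item could look like. It is written in Python with snake_case names, so it only mirrors the listed method signatures; the internal storage (a dict plus a set of indexed keys) is an assumption.

```python
class MetadataModel:
    """In-memory model of one file's metadata and its indexed keys."""

    def __init__(self):
        self._values = {}
        self._indexes = set()

    def get_metadata_key(self, key):
        return self._values.get(key)

    def set_metadata_key(self, key, value):
        self._values[key] = value

    def add_index(self, key):
        self._indexes.add(key)

    def remove_index(self, key):
        self._indexes.discard(key)

    def list_indexes(self):
        return sorted(self._indexes)

    def is_index(self, key):
        return key in self._indexes


model = MetadataModel()
model.set_metadata_key("photos:place", "Paris")
model.set_metadata_key("files:blurhash", "azertyuiop")
model.add_index("photos:place")
indexes = model.list_indexes()
```

Whether add_index should reject keys that have no value yet is left open here.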

Contributor

artonge commented Sep 28, 2023

@artonge
While I like the 2 separate events, we might lose the safety feature built into background jobs that limits the possibility of running the same job multiple times in parallel?
With 2 separate events, we store a unique key/timestamp and compare it with the value seen at read time before writing, to avoid race conditions or data loss when updating the item in the database.
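
The unique key check could be sketched like this, as a compare-and-swap on the unique column. SQLite, the save_metadata name and the use of a random UUID as the token are assumptions for illustration.

```python
import sqlite3
import uuid


def save_metadata(conn, fileid, new_json, token_read):
    """Write only if the row still carries the token seen at read time."""
    cur = conn.execute(
        'UPDATE oc_metadata SET metadata = ?, "unique" = ?, '
        "last_update = CURRENT_TIMESTAMP "
        'WHERE fileid = ? AND "unique" = ?',
        (new_json, uuid.uuid4().hex, fileid, token_read),
    )
    conn.commit()
    return cur.rowcount == 1  # False means someone else wrote in between


conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE oc_metadata '
    '(fileid TEXT PRIMARY KEY, metadata TEXT, "unique" TEXT, last_update TEXT)'
)
conn.execute("INSERT INTO oc_metadata VALUES ('42', '{}', 't0', CURRENT_TIMESTAMP)")

# Two writers both read token 't0'; only the first write goes through.
first = save_metadata(conn, "42", '{"files:state": {"value": "editing", "type": "string"}}', "t0")
second = save_metadata(conn, "42", "{}", "t0")
```

A rejected writer would reload the row, reapply its change and retry, so no update is silently lost.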

It might also be better for your getMetadata() to return an object, based on a model that contains an array (to store the metadata) and individual setters/getters for each type (bool, string, int, array, ...)

We load the object within the event and each app can read from it. If an app decides to update some data (setters will update an internal boolean 'updated'), we store the updated version of the JSON within the database.

The object is serialized to be stored in the database, and deserialized when needed.
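
A small sketch of that idea, with typed setters that flip an 'updated' flag only when a value actually changes. The class and method names are hypothetical, and only string and int are covered.

```python
import json


class Metadata:
    def __init__(self, data=None):
        self._data = dict(data or {})
        self.updated = False  # set by setters when a value really changes

    def _set(self, key, value, type_name):
        # Keep extra fields of an existing entry, such as "indexed".
        entry = {**self._data.get(key, {}), "value": value, "type": type_name}
        if self._data.get(key) != entry:
            self._data[key] = entry
            self.updated = True

    def set_string(self, key, value):
        self._set(key, str(value), "string")

    def set_int(self, key, value):
        self._set(key, int(value), "int")

    def get_string(self, key):
        return str(self._data[key]["value"])

    def get_int(self, key):
        return int(self._data[key]["value"])

    def serialize(self):
        return json.dumps(self._data)

    @classmethod
    def deserialize(cls, raw):
        return cls(json.loads(raw))


meta = Metadata.deserialize(
    '{"photos:place": {"value": "Paris", "type": "string", "indexed": true}}'
)
meta.set_string("photos:place", "Paris")  # same value, nothing to store
unchanged = meta.updated
meta.set_int("photos:taken_date", 123456789)  # new value, needs a write
```

The caller would check meta.updated at the end of the event and write meta.serialize() to the database only when it is true.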


Also, I would separate the creation of a new metadata entry from its configuration as indexable:

  • the metadata object contains a few methods: listIndexes(): array; addIndex(string); removeIndex(string);
  • when an entry is set as indexed, we keep the key/value pair within the metadata, and store the key in an array in the JSON itself. Then we create/update/maintain the related entry within the metadata_index table.

getMetadataForFile(fileid: string): {...}
setMetadataForFile(fileid: string, key: string, value: mixed, indexed: bool): {...}

maybe getMetadata(): Metadata; and saveMetadata(Metadata);

Should we add some lazy loading getMetadata(): Metadata; in Node? It might help a lot when handling a list of files that all need their metadata.

If we start adding methods to Node, we could go with setMetadata(Metadata), so we can directly feed the object freshly created from filecache with the metadata left-joined in the same select statement.

getIndexedMetadataValueForUserAndKey(userId: string, key: string): <string|number>[]
getFilesForUserAndKeyAndIndexedValue(userId: string, key: string, value: string): string[]

Would this be used to search for files in the database?
I would rather provide a small tool (QueryHelper?) to help developers join the right table and apply the correct where conditions to an already existing query.

edit: both (queryhelper+prepared methods) will be available
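
To illustrate the QueryHelper idea, here is a rough sketch that extends an existing query with the join and where conditions on the index table. The function name, the requirement that the base query aliases the files table as f and has no WHERE clause yet, and the simplified oc_filecache table are all assumptions for this example.

```python
import sqlite3


def join_indexed_metadata(base_sql, key, value):
    """Append the index-table join and conditions to an existing query.

    Assumes base_sql aliases the files table as "f" and has no WHERE clause.
    """
    column = "value_int" if isinstance(value, int) else "value_string"
    sql = (
        f"{base_sql} "
        "JOIN oc_metadata_index AS mi ON mi.fileid = f.fileid "
        f'WHERE mi."key" = ? AND mi.{column} = ?'
    )
    return sql, (key, value)


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE oc_filecache (fileid TEXT PRIMARY KEY, path TEXT);
CREATE TABLE oc_metadata_index (
    fileid TEXT, "key" TEXT, value_string TEXT, value_int INTEGER, last_update TEXT
);
INSERT INTO oc_filecache VALUES ('1', 'a.jpg'), ('2', 'b.jpg');
INSERT INTO oc_metadata_index VALUES
    ('1', 'photos:place', 'Paris', NULL, '2023-09-28'),
    ('2', 'photos:place', 'Lyon', NULL, '2023-09-28');
""")

sql, params = join_indexed_metadata(
    "SELECT f.path FROM oc_filecache AS f", "photos:place", "Paris"
)
paths = [row[0] for row in conn.execute(sql, params)]
```

The prepared methods such as getFilesForUserAndKeyAndIndexedValue could then be thin wrappers around a helper of this kind.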


PhilippSchlesinger commented Nov 21, 2023

Is there an overview ticket that tracks client implementations using the metadata API?
As nextcloud/photos#87 is listed as planned in this ticket, you may also want to add (or track) the corresponding requests for Android nextcloud/android#10425 and Windows nextcloud/desktop#6052.
