[Data Liberation] Entity Stream Importer #1980

Open
adamziel opened this issue Nov 2, 2024 · 2 comments

adamziel commented Nov 2, 2024

Let's build plumbing to load data into WordPress.

I think any data source can be represented as a stream of structured entities.

  • WP_WXR_Reader sources them from a WXR file
  • A markdown importer could do the same for markdown files
  • WordPress -> WordPress could be the same story
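
To make that concrete, the shared contract could be as small as this (a hypothetical sketch; the interface and method names below are illustrative, not an existing API):

```php
<?php
/**
 * Hypothetical contract every data source could implement. A WXR reader,
 * a markdown reader, or a reader backed by a live WordPress site would all
 * yield the same shape of records.
 */
interface Entity_Reader {
	/**
	 * Advances to the next entity. Returns false once the stream is exhausted.
	 */
	public function next_entity(): bool;

	/**
	 * Returns the current entity, e.g.
	 * [ 'type' => 'post', 'id' => 123, 'data' => [ 'post_title' => '...' ] ].
	 */
	public function get_entity(): array;
}
```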

Importing data

WXR importers must answer these questions:

  • What if a post with a given ID does or doesn't exist?
  • What if there's a partial difference between the two posts? Do we ignore it? Reconcile? Ask the user? Which post wins?
  • What if the author does or doesn't exist in the database?
  • Ditto for tags, categories, post meta, etc.

Let's view a WXR file as a flat list of entity objects such as posts, comments, meta, etc. We can now represent a lot of scenarios as list concatenation:

  • Importing WXR into a WordPress site is WordPress entities ++ WXR Entities
  • Importing two WXR files is WXR Entities ++ WXR Entities
  • Pausing and resuming WXR import is Entities before pause ++ Entities after pause
  • Importing WordPress -> WordPress is WordPress 1 Entities ++ WordPress 2 Entities.
  • Syncing WP -> WP is WordPress 1 Entities ++ WordPress 2 entities ++ WordPress 1 deletions ++ WordPress 2 deletions

From there, we'd need to reduce those lists to contain zero or one entries representing each object.
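
A rough sketch of that reduction, assuming each entity carries a type and an ID (the helper below is hypothetical, just to pin down the semantics):

```php
<?php
/**
 * Hypothetical reduction: concatenate entity lists, then keep at most one
 * record per (type, id) pair. Later entries win, which encodes
 * "the imported data overrides the existing data".
 */
function reduce_entities( array ...$entity_lists ): array {
	$reduced = [];
	foreach ( array_merge( ...$entity_lists ) as $entity ) {
		$key             = $entity['type'] . ':' . $entity['id'];
		$reduced[ $key ] = $entity; // Last writer wins.
	}
	return array_values( $reduced );
}

// Importing a WXR file into an existing site:
// $final = reduce_entities( $wordpress_entities, $wxr_entities );
```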

This is already similar to journaling MEMFS to OPFS in the Playground webapp. It also resembles map/reduce problems where parts of the processing can be parallelized while other parts must be processed sequentially.

I bet we can find a unified way of reasoning about all these scenarios and build a single data ingestion pipeline for any data source.

Let's see how far we can get with symbols and reasoning before writing code. I'm sure there are existing white papers and open source projects working through this exact problem.

Resources

  • Existing WXR importers
  • Importers from other data formats
  • Site sync plugins

cc @brandonpayton


adamziel commented Nov 2, 2024

For decision points such as "does an element with this ID exist?", we could support large element sets via Bloom filters. On a "match" we'd optimistically try to insert and then backtrack on failure.
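
A minimal sketch of that idea (the Bloom filter below is a toy implementation for illustration; $existing_guids and $entity are assumed inputs):

```php
<?php
// Toy Bloom filter: two CRC32 hashes over a sparse bit set.
class Tiny_Bloom_Filter {
	private array $bits = [];

	public function __construct( private int $size = 1 << 20 ) {}

	public function add( string $key ): void {
		foreach ( $this->hashes( $key ) as $h ) {
			$this->bits[ $h ] = true;
		}
	}

	public function possibly_contains( string $key ): bool {
		foreach ( $this->hashes( $key ) as $h ) {
			if ( empty( $this->bits[ $h ] ) ) {
				return false; // Definitely absent: the filter has no false negatives.
			}
		}
		return true; // Possible match (could be a false positive).
	}

	private function hashes( string $key ): array {
		return [ crc32( $key ) % $this->size, crc32( strrev( $key ) ) % $this->size ];
	}
}

// Seed the filter with identifiers that already exist in the database.
$seen = new Tiny_Bloom_Filter();
foreach ( $existing_guids as $guid ) {
	$seen->add( $guid );
}

if ( ! $seen->possibly_contains( $entity['guid'] ) ) {
	// Fast path: the entity definitely doesn't exist, insert without a database lookup.
} else {
	// Possible duplicate: optimistically insert, backtrack/reconcile on conflict.
}
```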


adamziel commented Nov 17, 2024

A Review of Existing WXR Importers

I've been diving deep into various WordPress WXR importers and exporters. Here are the key patterns and insights I've gathered that could be valuable for this project:

Import steps

  1. Make the WXR file available to WordPress – e.g. upload from disk, paste, provide external URL
  2. Validate the WXR file
  3. Download the attachments
  4. Give the user a chance to provide alternative assets for any failed downloads
  5. Start inserting/updating posts, comments, and other entities into the database

Imported data formats

  • WXR (1.1, 1.0, older versions)
  • Compressed WXR (gzipped, zipped, many WXR chunks in a single zip file)
  • Non-WXR formats – via *_to_WXR converters

Streaming and Memory Management

  1. Streaming data – some importers stream to avoid loading 100GB of XML into memory. Instead of libxml, which isn't universally available, we'll rely on WP_XML_Processor
    • Byte offset tracking – after every successful insertion, save progress information and the byte offset in the XML stream to enable resuming later on (see the sketch after this list).
  2. File Splitting – other importers split large WXR files into hundreds of smaller ones and then import them one by one
  3. Soft-limit for resource usage – some importers monitor their memory, disk, and network usage and short-circuit before exceeding quotas and exhausting the available resources
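
Here's the shape of that resume loop (a sketch only; the reader factory, helper functions, and cursor method below are assumptions, not the actual WP_XML_Processor API):

```php
<?php
// Resume from the last persisted cursor, if any.
$cursor = get_option( 'wxr_import_cursor', null );
$reader = create_wxr_entity_reader( 'export.wxr', $cursor ); // hypothetical factory

while ( $reader->next_entity() ) {
	insert_entity( $reader->get_entity() ); // hypothetical helper

	// After every successful insertion, persist the byte offset so a crashed
	// or paused import can pick up exactly where it left off.
	update_option( 'wxr_import_cursor', $reader->get_reentrancy_cursor() );
}

delete_option( 'wxr_import_cursor' ); // The import finished; clear the checkpoint.
```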

Handling Attachments

  1. Download Management:

    • Support loading attachments from a local directory instead of downloading from the network.
    • Counting download attempts and retrying up to n times.
    • Retrying the download with a different user agent.
    • Using file hashes for deduplication. It's always a URL hash. I haven't seen a single content hash-based approach.
    • Supporting various compression formats (zip, tar.gz)
    • Implementing timeouts and rate limiting
    • Call wp_suspend_cache_invalidation() before updating the post and restore cache invalidation after the update
    • Store attachments using wp_handle_sideload followed by wp_insert_attachment (see the sketch after this list)
    • Backfill missing image extensions from the remote server’s content-type response.
  2. Security:

    • Validating file types and extensions
    • Quotas: max_attachment_size, max_number_of_attachments_per_post
    • Checking remote IP addresses and domains
    • Resolving the IP once, pinning it to the download to avoid DNS rebinding attacks
    • Using WordPress's built-in security checks through functions like wp_handle_sideload
    • Allowlist of extensions (e.g. no .php files). Add a filter; maybe use existing WordPress upload functions to apply the default filtering?
  3. Temporary Storage:

    • Streaming data to a temporary file during downloads, moving it to a final location when the download is complete
    • Cronjob to cleanup temporary files
    • Managing disk quota limits
    • Placeholders – insert a post of type attachment_proxy to act as a placeholder for a file we intend to download and create an attachment for. Update its metadata as the download progresses, succeeds, or fails.
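
For reference, the wp_handle_sideload() + wp_insert_attachment() flow mentioned above looks roughly like this (a minimal sketch: retries, quotas, and placeholder posts are omitted, and $post_id is assumed to be the already-imported parent post):

```php
<?php
require_once ABSPATH . 'wp-admin/includes/file.php';
require_once ABSPATH . 'wp-admin/includes/image.php';

$url = 'https://example.com/image.jpg'; // attachment URL from the WXR
$tmp = download_url( $url );            // streams the download to a temporary file

if ( ! is_wp_error( $tmp ) ) {
	$file = array(
		'name'     => basename( parse_url( $url, PHP_URL_PATH ) ),
		'tmp_name' => $tmp,
	);
	// Runs WordPress's type/extension checks and moves the file into uploads.
	$sideloaded = wp_handle_sideload( $file, array( 'test_form' => false ) );

	if ( empty( $sideloaded['error'] ) ) {
		$attachment_id = wp_insert_attachment(
			array(
				'post_mime_type' => $sideloaded['type'],
				'post_title'     => sanitize_file_name( $file['name'] ),
				'post_status'    => 'inherit',
			),
			$sideloaded['file'],
			$post_id
		);
		wp_update_attachment_metadata(
			$attachment_id,
			wp_generate_attachment_metadata( $attachment_id, $sideloaded['file'] )
		);
	}
}
```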

Entity Processing

Several patterns emerge around handling WordPress entities:

  1. Topological Sorting of WXR entities before starting the import to ensure parent posts are imported before child posts. Processing WXR in a topological order may require an index with all offsets and lengths of items in the WXR (see the sketch after this list).

  2. Flexible Updates: Support both creating new database entries and updating the existing ones

    • Updating a post creates a new revision
    • Reconciling related entities, e.g. do not insert duplicate comments or post meta fields
  3. Data Sanitization and Validation:

    • Validate XML syntax and encoding before starting the import
    • Gracefully import posts even when something obvious is missing from WXR, e.g. a post title
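
A sketch of the topological sort, assuming each entity exposes post_id and post_parent (cycles and parents missing from the file would need explicit handling):

```php
<?php
/**
 * Orders posts so that every parent precedes its children.
 */
function topologically_sort_posts( array $posts ): array {
	$by_id = array();
	foreach ( $posts as $post ) {
		$by_id[ $post['post_id'] ] = $post;
	}

	$sorted  = array();
	$visited = array();
	$visit   = function ( $id ) use ( &$visit, &$visited, &$sorted, $by_id ) {
		if ( isset( $visited[ $id ] ) || ! isset( $by_id[ $id ] ) ) {
			return;
		}
		$visited[ $id ] = true;
		$parent         = $by_id[ $id ]['post_parent'] ?? 0;
		if ( $parent ) {
			$visit( $parent ); // Ensure the parent is emitted first.
		}
		$sorted[] = $by_id[ $id ];
	};

	foreach ( array_keys( $by_id ) as $id ) {
		$visit( $id );
	}
	return $sorted;
}
```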

Progress Tracking and Recovery

  1. Progress Monitoring:

    • Provide status information to the user
    • Track elapsed and remaining time
    • Track completed and remaining entities
    • Calculate total number of entities and attachments to import before starting
  2. Recovery Mechanisms:

    • Each import request gets its own "session ID"; progress information is associated with it (see the sketch after this list).
    • Saving import state after every successful insertion
    • When restarted, resume from the last processed location
    • Gracefully recover both from PHP fatal errors and the user pressing a "stop import" button
  3. Error Management:

    • Detailed error logging
    • UI for the user to provide images that couldn't be downloaded, fix post content with invalid encoding, etc.
    • On attachment download failure: add_post_meta( $post_id, self::$FAILURE_LAST_ERROR_KEY
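
The session bookkeeping could be as simple as one record per run (the option shape below is an assumption, just to illustrate what needs tracking):

```php
<?php
// One record per import run; counters, the resume cursor, and errors all hang
// off the session ID so abandoned runs don't clobber each other's state.
$session_id = wp_generate_uuid4();
update_option(
	"wxr_import_session_{$session_id}",
	array(
		'started_at'         => time(),
		'total_entities'     => $total_entities, // counted before the import starts
		'processed_entities' => 0,
		'cursor'             => null,            // byte offset to resume from
		'last_error'         => null,
	),
	false // don't autoload import bookkeeping on every request
);
```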

Performance Optimizations

  1. Resource Management:

    • HTTP rate limiting and connection pooling
    • Maximum time between download attempts of a file
    • Check if we're near exceeding the disk quota
    • Soft memory limit: clean up memory when it's exceeded; if that doesn't help, proactively kill the import job.
    • Cache intermediate state on the disk or in the database to avoid recomputing it on each restart
  2. Parallel Processing:

    • Process forking, spinning up async jobs for downloads
    • Prevent running overlapping imports (see the sketch after this list)
    • Support for async processing via wp cron, custom job queues, and WP CLI
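
A sketch of the overlap guard plus WP-Cron hand-off (the lock key and hook name are made up for illustration):

```php
<?php
if ( false === get_transient( 'data_liberation_import_lock' ) ) {
	// Hold the lock for at most an hour so a crashed run can't block imports forever.
	set_transient( 'data_liberation_import_lock', $session_id, HOUR_IN_SECONDS );

	// Process one batch now, then let WP-Cron continue asynchronously.
	if ( ! wp_next_scheduled( 'data_liberation_continue_import', array( $session_id ) ) ) {
		wp_schedule_single_event(
			time() + 10,
			'data_liberation_continue_import',
			array( $session_id )
		);
	}
}
```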

Extensibility Patterns

  • Hooks for pre- and post-processing entities to skip them, modify them, or alter how updates are reconciled with existing database records
  • Filter to adjust or reject the URL before fetching any remote attachment.
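
Something along these lines (the filter names are placeholders):

```php
<?php
// Let plugins skip or rewrite an entity before it is inserted.
$entity = apply_filters( 'data_liberation_pre_insert_entity', $entity, $session_id );
if ( false !== $entity ) {
	// ...insert or update the entity as usual...
}

// Let plugins adjust or reject an attachment URL before it is downloaded.
add_filter( 'data_liberation_attachment_url', function ( $url ) {
	// e.g. rewrite a dead domain to a mirror.
	return str_replace( 'http://old-site.example/', 'https://old-site-mirror.example/', $url );
} );
```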
