[Data Liberation] Entity Stream Importer #1980

Open
adamziel opened this issue Nov 2, 2024 · 2 comments

adamziel commented Nov 2, 2024

Let's build plumbing to load data into WordPress.

I think any data source can be represented as a stream of structured entities.

  • WP_WXR_Reader sources them from a WXR file
  • A markdown importer could do the same for markdown files
  • WordPress -> WordPress could be the same story
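
To make that concrete, the shared contract could be as small as this (a hypothetical sketch; the interface and method names below are illustrative, not an existing API):

```php
<?php
/**
 * Hypothetical contract every data source could implement. A WXR reader,
 * a markdown reader, or a reader backed by a live WordPress site would all
 * yield the same shape of records.
 */
interface Entity_Reader {
	/**
	 * Advances to the next entity. Returns false once the stream is exhausted.
	 */
	public function next_entity(): bool;

	/**
	 * Returns the current entity, e.g.
	 * [ 'type' => 'post', 'id' => 123, 'data' => [ 'post_title' => '...' ] ].
	 */
	public function get_entity(): array;
}
```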

Importing data

WXR importers must answer these questions:

  • What if a post with a given ID does or doesn't exist?
  • What if there's a partial difference between the two posts? Do we ignore it? Reconcile? Ask the user? Which post wins?
  • What if the author does or doesn't exist in the database?
  • Ditto for tags, categories, post meta, etc.

Let's view a WXR file as a flat list of entity objects such as posts, comments, meta, etc. We can now represent a lot of scenarios as list concatenation:

  • Importing WXR into a WordPress site is WordPress entities ++ WXR Entities
  • Importing two WXR files is WXR Entities ++ WXR Entities
  • Pausing and resuming WXR import is Entities before pause ++ Entities after pause
  • Importing WordPress -> WordPress is WordPress 1 Entities ++ WordPress 2 Entities.
  • Syncing WP -> WP is WordPress 1 Entities ++ WordPress 2 entities ++ WordPress 1 deletions ++ WordPress 2 deletions

From there, we'd need to reduce those lists to contain zero or one entries representing each object.
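
A rough sketch of that reduction, assuming each entity carries a type and an ID (the helper below is hypothetical, just to pin down the semantics):

```php
<?php
/**
 * Hypothetical reduction: concatenate entity lists, then keep at most one
 * record per (type, id) pair. Later entries win, which encodes
 * "the imported data overrides the existing data".
 */
function reduce_entities( array ...$entity_lists ): array {
	$reduced = [];
	foreach ( array_merge( ...$entity_lists ) as $entity ) {
		$key             = $entity['type'] . ':' . $entity['id'];
		$reduced[ $key ] = $entity; // Last writer wins.
	}
	return array_values( $reduced );
}

// Importing a WXR file into an existing site:
// $final = reduce_entities( $wordpress_entities, $wxr_entities );
```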

This is already similar to journaling MEMFS to OPFS in the Playground webapp. It also resembles map/reduce problems where parts of the processing can be parallelized while other parts must be processed sequentially.

I bet we can find a unified way of reasoning about all these scenarios and build a single data ingestion pipeline for any data source.

Let's see how far we can get with symbols and reasoning before writing code. I'm sure there are existing white papers and open source projects working through this exact problem.

Resources

  • Existing WXR importers
  • Importers from other data formats
  • Site sync plugins

cc @brandonpayton


adamziel commented Nov 2, 2024

For decision points such as "does an element with this ID exist?", we could support large element sets via Bloom filters. On a "match" we'd optimistically try to insert and then backtrack on failure.
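
A minimal sketch of that idea (the Bloom filter below is a toy implementation for illustration; $existing_guids and $entity are assumed inputs):

```php
<?php
// Toy Bloom filter: two CRC32 hashes over a sparse bit set.
class Tiny_Bloom_Filter {
	private array $bits = [];

	public function __construct( private int $size = 1 << 20 ) {}

	public function add( string $key ): void {
		foreach ( $this->hashes( $key ) as $h ) {
			$this->bits[ $h ] = true;
		}
	}

	public function possibly_contains( string $key ): bool {
		foreach ( $this->hashes( $key ) as $h ) {
			if ( empty( $this->bits[ $h ] ) ) {
				return false; // Definitely absent: the filter has no false negatives.
			}
		}
		return true; // Possible match (could be a false positive).
	}

	private function hashes( string $key ): array {
		return [ crc32( $key ) % $this->size, crc32( strrev( $key ) ) % $this->size ];
	}
}

// Seed the filter with identifiers that already exist in the database.
$seen = new Tiny_Bloom_Filter();
foreach ( $existing_guids as $guid ) {
	$seen->add( $guid );
}

if ( ! $seen->possibly_contains( $entity['guid'] ) ) {
	// Fast path: the entity definitely doesn't exist, insert without a database lookup.
} else {
	// Possible duplicate: optimistically insert, backtrack/reconcile on conflict.
}
```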


adamziel commented Nov 17, 2024

A Review of Existing WXR Importers

I've been diving deep into various WordPress WXR importers and exporters. Here are the key patterns and insights I've gathered that could be valuable for this project:

Import steps

  1. Make the WXR file available to WordPress – e.g. upload from disk, paste, provide external URL
  2. Validate the WXR file
  3. Download the attachments
  4. Give the user a chance to provide alternative assets for any failed downloads
  5. Start inserting/updating posts, comments, and other entities into the database

Imported data formats

  • WXR (1.1, 1.0, older versions)
  • Compressed WXR (gzipped, zipped, many WXR chunks in a single zip file)
  • Non-WXR formats – via *_to_WXR converters

Streaming and Memory Management

  1. Streaming data – some importers stream to avoid loading 100GB of XML into memory. Instead of libxml, which isn't universally available, we'll rely on WP_XML_Processor
    • Byte offset tracking – after every successful insertion, save progress information and the byte offset in the XML stream to enable resuming later on (see the sketch after this list).
  2. File Splitting – other importers split large WXR files into hundreds of smaller ones and then import them one by one
  3. Soft-limit for resource usage – some importers monitor their memory, disk, and network usage and short-circuit before exceeding quotas and exhausting the available resources
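
Here's the shape of that resume loop (a sketch only; the reader factory, helper functions, and cursor method below are assumptions, not the actual WP_XML_Processor API):

```php
<?php
// Resume from the last persisted cursor, if any.
$cursor = get_option( 'wxr_import_cursor', null );
$reader = create_wxr_entity_reader( 'export.wxr', $cursor ); // hypothetical factory

while ( $reader->next_entity() ) {
	insert_entity( $reader->get_entity() ); // hypothetical helper

	// After every successful insertion, persist the byte offset so a crashed
	// or paused import can pick up exactly where it left off.
	update_option( 'wxr_import_cursor', $reader->get_reentrancy_cursor() );
}

delete_option( 'wxr_import_cursor' ); // The import finished; clear the checkpoint.
```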

Handling Attachments

  1. Download Management:

    • Support loading attachments from a local directory instead of downloading from the network.
    • Counting download attempts and retrying up to n times.
    • Retrying the download with a different user agent.
    • Using file hashes for deduplication. It's always a URL hash. I haven't seen a single content hash-based approach.
    • Supporting various compression formats (zip, tar.gz)
    • Implementing timeouts and rate limiting
    • Call wp_suspend_cache_invalidation() before updating the post and restore cache invalidation after the update
    • Store attachments using wp_handle_sideload followed by wp_insert_attachment (see the sketch after this list)
    • Backfill missing image extensions from the remote server’s content-type response.
  2. Security:

    • Validating file types and extensions
    • Quotas: max_attachment_size, max_number_of_attachments_per_post
    • Checking remote IP addresses and domains
    • Resolving the IP once, pinning it to the download to avoid DNS rebinding attacks
    • Using WordPress's built-in security checks through functions like wp_handle_sideload
    • Allowlist of extensions (e.g. no .php files). Add a filter; maybe use existing WordPress upload functions to apply the default filtering?
  3. Temporary Storage:

    • Streaming data to a temporary file during downloads, moving it to a final location when the download is complete
    • Cronjob to cleanup temporary files
    • Managing disk quota limits
    • Placeholders – insert a post of type attachment_proxy to act as a placeholder for a file we intend to download and create an attachment for. Update its metadata as the download progresses, succeeds, or fails.
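
For reference, the wp_handle_sideload() + wp_insert_attachment() flow mentioned above looks roughly like this (a minimal sketch: retries, quotas, and placeholder posts are omitted, and $post_id is assumed to be the already-imported parent post):

```php
<?php
require_once ABSPATH . 'wp-admin/includes/file.php';
require_once ABSPATH . 'wp-admin/includes/image.php';

$url = 'https://example.com/image.jpg'; // attachment URL from the WXR
$tmp = download_url( $url );            // streams the download to a temporary file

if ( ! is_wp_error( $tmp ) ) {
	$file = array(
		'name'     => basename( parse_url( $url, PHP_URL_PATH ) ),
		'tmp_name' => $tmp,
	);
	// Runs WordPress's type/extension checks and moves the file into uploads.
	$sideloaded = wp_handle_sideload( $file, array( 'test_form' => false ) );

	if ( empty( $sideloaded['error'] ) ) {
		$attachment_id = wp_insert_attachment(
			array(
				'post_mime_type' => $sideloaded['type'],
				'post_title'     => sanitize_file_name( $file['name'] ),
				'post_status'    => 'inherit',
			),
			$sideloaded['file'],
			$post_id
		);
		wp_update_attachment_metadata(
			$attachment_id,
			wp_generate_attachment_metadata( $attachment_id, $sideloaded['file'] )
		);
	}
}
```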

Entity Processing

Several patterns emerge around handling WordPress entities:

  1. Topological Sorting of WXR entities before starting the import to ensure parent posts are imported before child posts. Processing WXR in a topological order may require an index with all offsets and lengths of items in the WXR (see the sketch after this list).

  2. Flexible Updates: Support both creating new database entries and updating the existing ones

    • Updating a post creates a new revision
    • Reconciling related entities, e.g. do not insert duplicate comments or post meta fields
  3. Data Sanitization and Validation:

    • Validate XML syntax and encoding before starting the import
    • Gracefully import posts even when something obvious is missing from WXR, e.g. a post title
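
A sketch of the topological sort, assuming each entity exposes post_id and post_parent (cycles and parents missing from the file would need explicit handling):

```php
<?php
/**
 * Orders posts so that every parent precedes its children.
 */
function topologically_sort_posts( array $posts ): array {
	$by_id = array();
	foreach ( $posts as $post ) {
		$by_id[ $post['post_id'] ] = $post;
	}

	$sorted  = array();
	$visited = array();
	$visit   = function ( $id ) use ( &$visit, &$visited, &$sorted, $by_id ) {
		if ( isset( $visited[ $id ] ) || ! isset( $by_id[ $id ] ) ) {
			return;
		}
		$visited[ $id ] = true;
		$parent         = $by_id[ $id ]['post_parent'] ?? 0;
		if ( $parent ) {
			$visit( $parent ); // Ensure the parent is emitted first.
		}
		$sorted[] = $by_id[ $id ];
	};

	foreach ( array_keys( $by_id ) as $id ) {
		$visit( $id );
	}
	return $sorted;
}
```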

Progress Tracking and Recovery

  1. Progress Monitoring:

    • Provide status information to the user
    • Track elapsed and remaining time
    • Track completed and remaining entities
    • Calculate total number of entities and attachments to import before starting
  2. Recovery Mechanisms:

    • Each import request gets its own "session ID"; progress information is associated with it (see the sketch after this list).
    • Saving import state after every successful insertion
    • When restarted, resume from the last processed location
    • Gracefully recover both from PHP fatal errors and the user pressing a "stop import" button
  3. Error Management:

    • Detailed error logging
    • UI for the user to provide images that couldn't be downloaded, fix post content with invalid encoding, etc.
    • On attachment download failure: add_post_meta( $post_id, self::$FAILURE_LAST_ERROR_KEY
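
The session bookkeeping could be as simple as one record per run (the option shape below is an assumption, just to illustrate what needs tracking):

```php
<?php
// One record per import run; counters, the resume cursor, and errors all hang
// off the session ID so abandoned runs don't clobber each other's state.
$session_id = wp_generate_uuid4();
update_option(
	"wxr_import_session_{$session_id}",
	array(
		'started_at'         => time(),
		'total_entities'     => $total_entities, // counted before the import starts
		'processed_entities' => 0,
		'cursor'             => null,            // byte offset to resume from
		'last_error'         => null,
	),
	false // don't autoload import bookkeeping on every request
);
```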

Performance Optimizations

  1. Resource Management:

    • HTTP rate limiting and connection pooling
    • Maximum time between download attempts of a file
    • Check if we're near exceeding the disk quota
    • Soft memory limit: clean up memory when it's exceeded; if that doesn't help, proactively kill the import job.
    • Cache intermediate state on the disk or in the database to avoid recomputing it on each restart
  2. Parallel Processing:

    • Process forking, spinning up async jobs for downloads
    • Prevent running overlapping imports (see the sketch after this list)
    • Support for async processing via wp cron, custom job queues, and WP CLI
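
A sketch of the overlap guard plus WP-Cron hand-off (the lock key and hook name are made up for illustration):

```php
<?php
if ( false === get_transient( 'data_liberation_import_lock' ) ) {
	// Hold the lock for at most an hour so a crashed run can't block imports forever.
	set_transient( 'data_liberation_import_lock', $session_id, HOUR_IN_SECONDS );

	// Process one batch now, then let WP-Cron continue asynchronously.
	if ( ! wp_next_scheduled( 'data_liberation_continue_import', array( $session_id ) ) ) {
		wp_schedule_single_event(
			time() + 10,
			'data_liberation_continue_import',
			array( $session_id )
		);
	}
}
```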

Extensibility Patterns

  • Hooks for pre- and post-processing entities to skip them, modify them, or alter how updates are reconciled with existing database records
  • Filter to adjust or reject the URL before fetching any remote attachment.
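
Something along these lines (the filter names are placeholders):

```php
<?php
// Let plugins skip or rewrite an entity before it is inserted.
$entity = apply_filters( 'data_liberation_pre_insert_entity', $entity, $session_id );
if ( false !== $entity ) {
	// ...insert or update the entity as usual...
}

// Let plugins adjust or reject an attachment URL before it is downloaded.
add_filter( 'data_liberation_attachment_url', function ( $url ) {
	// e.g. rewrite a dead domain to a mirror.
	return str_replace( 'http://old-site.example/', 'https://old-site-mirror.example/', $url );
} );
```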
