Remove images from posts containing a photo #52

Open
aaronpk opened this issue Jan 12, 2018 · 5 comments

@aaronpk
Owner

aaronpk commented Jan 12, 2018

Discussion from IRC:

@aaronpk
Owner Author

aaronpk commented Jan 12, 2018

Encountered two blockers working on this:

  1. In a simple example of an img tag inside an e-content tag, the parsers are using the img tag as an implied photo property. This seems wrong to me. (Example) This means XRay sees a post like this as a photo post and would remove the img tag from the content, which is definitely not the right thing to do.
<div class="h-entry"><p class="e-content p-name">Hello World <img src="example.jpg"></p></div>
{
    "type": [
        "h-entry"
    ],
    "properties": {
        "name": [
            "Hello World http://example.com/example.jpg"
        ],
        "content": [
            {
                "html": "Hello World <img src=\"http://example.com/example.jpg\">",
                "value": "Hello World http://example.com/example.jpg"
            }
        ],
        "photo": [
            "http://example.com/example.jpg"
        ]
    }
}
  2. At the point that XRay is sanitizing the HTML value, the Microformats parser has already converted the HTML to plaintext.

For example, XRay sees this object and runs the HTML sanitizer on the HTML value:

{
    "html": "Hello World <img src=\"http://example.com/example.jpg\">",
    "value": "Hello World http://example.com/example.jpg"
}

This means I can't remove the img tag from the plaintext value, since the plaintext has already been generated. I think my only solution is going to be to create my own plaintext value out of the sanitized HTML. Unfortunately, that is not a straightforward process, as demonstrated by this relatively long function that does this in the PHP parser. However, that might be the technically better option anyway, since XRay can't be sure exactly which method was used to generate the plaintext value from the original HTML.
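
For illustration only, here is a minimal Python sketch of that idea: drop the img tags that duplicate a photo URL from the HTML, then build the plaintext value from what is left. This is not XRay's code (XRay is PHP), and the names `ImgStripper` and `strip_photos` are hypothetical. It assumes the HTML has already been sanitized, and its plaintext step only joins text nodes and collapses whitespace, which is far simpler than what a real Microformats parser does.

```python
from html import escape
from html.parser import HTMLParser


class ImgStripper(HTMLParser):
    """Re-emit HTML without <img> tags whose src is a known photo URL,
    collecting the remaining text nodes for a plaintext value."""

    def __init__(self, photo_urls):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.photo_urls = set(photo_urls)
        self.html_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and dict(attrs).get("src") in self.photo_urls:
            return  # this image is already in the photo property
        self.html_parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> get the same treatment.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        self.html_parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.html_parts.append(escape(data, quote=False))
        self.text_parts.append(data)


def strip_photos(html_value, photo_urls):
    """Return a content object with photo images removed from both values."""
    parser = ImgStripper(photo_urls)
    parser.feed(html_value)
    parser.close()  # flush any buffered trailing text
    return {
        "html": "".join(parser.html_parts).strip(),
        "value": " ".join("".join(parser.text_parts).split()),
    }


result = strip_photos(
    'Hello World <img src="http://example.com/example.jpg">',
    ["http://example.com/example.jpg"],
)
print(result)
```

Because the plaintext is derived from the cleaned HTML, it no longer depends on how the upstream parser produced its own value. The caller decides which URLs to pass in, so an implied photo could simply be left out of `photo_urls`.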

@aaronpk
Owner Author

aaronpk commented Jan 12, 2018

Another question/problem: what should I do when the img tag in the e-content contains alt text? That alt text will already have been brought into the plaintext values for e-content, and maybe even the p-name.

<div class="h-entry"><p class="e-content p-name">Hello World <img src="example.jpg" class="u-photo" alt="An Example Photo"></p></div>
{
  "type": [
    "h-entry"
  ],
  "properties": {
    "name": [
      "Hello World An Example Photo"
    ],
    "photo": [
      "http://example.com/example.jpg"
    ],
    "content": [
      {
        "html": "Hello World <img src=\"http://example.com/example.jpg\" class=\"u-photo\" alt=\"An Example Photo\">",
        "value": "Hello World An Example Photo"
      }
    ]
  }
}

(example)

This is filed as an issue on the parsing spec here: microformats/microformats2-parsing#16

Ideally the parsing spec would not have included that alt text in the plaintext values in the first place.
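
To make the difference concrete, here is a small Python sketch that produces the plaintext both ways. It is an assumption-laden illustration, not the parsing spec's algorithm or any parser's real code: `TextExtractor` and `plaintext` are hypothetical names, and the "include" mode only mimics the substitution visible in the examples in this thread (alt text if present, otherwise the src).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, optionally substituting text for <img> tags."""

    def __init__(self, include_img):
        super().__init__()
        self.include_img = include_img
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.include_img:
            attrs = dict(attrs)
            alt = attrs.get("alt")
            # Assumption: prefer alt when present, else fall back to src.
            replacement = alt if alt is not None else (attrs.get("src") or "")
            self.parts.append(f" {replacement} ")

    def handle_data(self, data):
        self.parts.append(data)


def plaintext(html_value, include_img):
    """Return whitespace-collapsed text for the given HTML fragment."""
    parser = TextExtractor(include_img)
    parser.feed(html_value)
    parser.close()  # flush any buffered trailing text
    return " ".join("".join(parser.parts).split())


html_value = (
    'Hello World <img src="http://example.com/example.jpg" '
    'class="u-photo" alt="An Example Photo">'
)
print(plaintext(html_value, include_img=True))
print(plaintext(html_value, include_img=False))
```

With `include_img=False` the image contributes nothing to the plaintext, which is the behavior wished for above when the image is already exposed as a photo property.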

@aaronpk
Owner Author

aaronpk commented Jan 12, 2018

Current status: blocked. Breaking this out into separate issues so they can be tracked.

@aaronpk aaronpk closed this as completed Jan 12, 2018
@aaronpk
Owner Author

aaronpk commented Jan 12, 2018

whoops, not actually closing this because I haven't committed the code that actually does that logic yet.

@grantcodes

An example feed where a photo property is created, but not removed from the content: http://feeds2.feedburner.com/thenextweb
