Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a minimal action I/O implementation to improve SearchAction support #237

Merged
merged 2 commits into from
Nov 8, 2024

Conversation

Gallaecio
Copy link
Member

No description provided.

@Gallaecio Gallaecio requested review from kmike, wRAR and lopuhin July 25, 2024 08:04
@@ -235,6 +235,17 @@ def _extract_property_value(self, node, items_seen, base_url, itemids, force=Fal
elif node.get("content"):
return node.get("content")

# https://schema.org/docs/actions.html#part-4
Copy link
Member

@kmike kmike Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems most example in this link use an alternative representation, with a string ("url-output": "required").

  1. How is it handled in this PR, and how was it handled before? Could it be that we start to lose some data which was easily available before? I.e. previously we just returned "url-output": "required", but now it's "url-output": {}, something like this?
  2. Do you know which approach is more common? Is it common on real websites to use the "Textual representations of Input and Output", or is it a rare edge case?

Copy link
Member Author

@Gallaecio Gallaecio Jul 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it handled in this PR, and how was it handled before?

As far as I know, before it was not handled at all, at least for the required and name fields of query-input. Now it is handled but only those 2 fields.

I could extend the scope to all fields from those docs, but I am not sure how to map defaultValue given its documented data type (Thing, DataType), which was the main reason I stuck to the fields that were actually of interest to me.

Do you know which approach is more common?

The string representation you mention is the only one I have actually seen, in JSON-LD structures of websites (e.g. required name=search_terms). But it was my impression that the output of extruct’s microdata extractor tends to be more structured (e.g. it has property fields that are not used in JSON-LD), so I assumed this output was preferred here. I am OK with going the string approach, though.

@kmike kmike merged commit 9b937ad into scrapinghub:master Nov 8, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants