Accept JSON parsing errors in JSON-LD extractor #45
Comments
Hi @giordand , thanks for the report.
|
Yes, I remember that in my case there were HTML comments too, so it should be fixed once you commit and push those changes. One question: once you commit those changes, will they be available in the extruct library through a pip update? |
I'll need to release a new version of extruct for the change to be available directly from PyPI via pip. |
@giordand , it would be most helpful if you can provide a real example of a URL (or the HTML of it) where extruct failed, just to check if my patch really does solve your issue. |
@redapple here is the JSON-LD script which json.loads cannot load:
{
"@context": "http://schema.org",
"@type": "Organization",
"name": "Action Car and Truck Accessories",
"url": "http://www.actiontrucks.com",
"sameAs" : [ "https://twitter.com/actioncar_truck",
"https://www.youtube.com/user/actioncarandtruck",
https://www.facebook.com/actioncarandtruck],
"logo": " http://actiontrucks.com/files/images/logo.png",
"contactPoint" : [
{ "@type" : "ContactPoint",
"telephone" : "+1-855-560-2233",
"contactType" : "sales"} ]
}
Look at the Facebook URL in "sameAs" (the line shown in red in the original post): the double quotes are missing around that array element. I ran the test again after adding the double quotes and no error was raised. So this is an example that apparently has no real fix, because the original JSON object is malformed and surely is not being loaded correctly by the web page either. I think the only solution that does not alter the data is to catch the error and return an empty list. |
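The experiment described above can be reproduced with the standard library alone. This is a minimal sketch, not extruct code: `try_parse` is a hypothetical helper, and the repaired text is produced by adding the missing quotes by hand, exactly as in the comment.

```python
import json

# JSON-LD text as reported, with the unquoted Facebook URL in "sameAs".
MALFORMED = '''{
"@context": "http://schema.org",
"@type": "Organization",
"name": "Action Car and Truck Accessories",
"url": "http://www.actiontrucks.com",
"sameAs" : [ "https://twitter.com/actioncar_truck",
"https://www.youtube.com/user/actioncarandtruck",
https://www.facebook.com/actioncarandtruck],
"logo": " http://actiontrucks.com/files/images/logo.png",
"contactPoint" : [
{ "@type" : "ContactPoint",
"telephone" : "+1-855-560-2233",
"contactType" : "sales"} ]
}'''

# Same text with the missing double quotes added around the third URL.
REPAIRED = MALFORMED.replace(
    'https://www.facebook.com/actioncarandtruck]',
    '"https://www.facebook.com/actioncarandtruck"]',
)


def try_parse(text):
    """Hypothetical helper: return the decoded object, or None on invalid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


print(try_parse(MALFORMED) is None)        # the reported script does not decode
print(len(try_parse(REPAIRED)["sameAs"]))  # the hand-repaired script does
```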
Thanks for the feedback @giordand . |
Observed something similar while working on the same website as in #57; in here
Notice the missing double quotes around
json_str = re.sub(
    pattern=r'(\"\:\s)([^"\{\[])',
    repl=r'":""\2',
    string=json_str
) |
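The page snippet this substitution was written for did not survive in the thread, so the inputs below are assumptions chosen only to show what the expression does. The sketch wraps the same `re.sub` call, unchanged, in a hypothetical function `apply_substitution`. On a key whose value is missing entirely it inserts an empty string; on a bare number it also inserts `""` and leaves the text invalid, which is worth keeping in mind before applying it to arbitrary pages.

```python
import json
import re


def apply_substitution(json_str):
    """Hypothetical wrapper around the substitution quoted in the comment."""
    return re.sub(
        pattern=r'(\"\:\s)([^"\{\[])',
        repl=r'":""\2',
        string=json_str,
    )


# Assumed input: a key with no value at all.
missing_value = '{"name": , "url": "http://example.com"}'
fixed = apply_substitution(missing_value)
print(fixed)
print(json.loads(fixed))

# Assumed input: already valid JSON with a numeric value.
numeric = '{"count": 3}'
broken = apply_substitution(numeric)
print(broken)  # the number now follows an inserted empty string
try:
    json.loads(broken)
except ValueError as exc:
    print("no longer valid JSON:", exc)
```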
I’m looking at the code, and I see that When using a specific parser, I think it makes sense to keep the current behavior; users are free to catch the exception or let it propagate. |
Add jsonStringFixer.py, which has a function to add quotes around any required text in a json string. Used this in jsonld.py to handle invalid jsonld string.
When the JsonLdExtractor tries to parse the JSON-LD in some web pages, it raises ValueError: No JSON object could be decoded. My solution was to catch the error in JsonLdExtractor._extract_items(self, node) and return an empty list by default. I catch it there because the extractor may have detected microdata or RDFa in the same page while the error only occurs with JSON-LD; if we caught the error in extruct.extract instead, we would lose that data.
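The original patch is not preserved in this thread. As a rough sketch of the idea only, the standalone function below (hypothetical name `extract_items`, taking the script text directly instead of a node) decodes one JSON-LD script and falls back to an empty list when decoding fails. It is not the extruct implementation.

```python
import json


def extract_items(script_text):
    """Hypothetical stand-in for a per-script JSON-LD extraction step.

    Returns a list of decoded items, or an empty list when the script
    text is not valid JSON, so that one bad script does not abort the
    extraction of other syntaxes from the same page.
    """
    try:
        data = json.loads(script_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return []
    # A JSON-LD script may hold a single object or a list of objects.
    return data if isinstance(data, list) else [data]


print(extract_items('{"@type": "Organization", "name": "Example"}'))
print(extract_items('{"sameAs": [https://www.facebook.com/actioncarandtruck]}'))
```

Catching only ValueError keeps unrelated failures visible, and doing it per script is what preserves any microdata or RDFa results collected elsewhere.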