Some URLs don't return Content-type #1591

sindikat · 2013-11-07T09:45:55Z

Some URLs don't return a Content-Type header in their response:

>>> from rdflib import Graph
>>> g = Graph()
>>> g.parse('http://sw.opencyc.org/2009/04/07/concept/en/_http___dbpedia_org_ontology_PokerPlayer_')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/rdflib/graph.py", line 1018, in parse
    data=data, format=format)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/parser.py", line 166, in create_input_source
    input_source = URLInputSource(absolute_location, format)
  File "/usr/local/lib/python2.7/dist-packages/rdflib/parser.py", line 102, in __init__
    self.content_type = self.content_type.split(";", 1)[0]
AttributeError: 'NoneType' object has no attribute 'split'
>>> from requests import get
>>> get('http://sw.opencyc.org/2009/04/07/concept/en/_http___dbpedia_org_ontology_PokerPlayer_').headers
CaseInsensitiveDict({'content-length': '7828', 'accept-ranges': 'bytes', 'server': 'Apache/2.2.22 (Ubuntu)', 'last-modified': 'Tue, 12 Mar 2013 21:27:21 GMT', 'etag': '"9500b8-1e94-4d7c0f4682040"', 'date': 'Thu, 07 Nov 2013 09:43:13 GMT'})

Should Graph.parse account for such cases? The URL itself returns valid RDF/XML.

ghost · 2013-11-07T12:44:46Z

A fair question. Section 7 of the HTTP/1.1 spec [1] (i) declares Content-Type as a header that SHOULD be sent and (ii) lets the client work out the media type itself ("MAY attempt to guess the media type via inspection of its content and/or the name extension(s) of the URI used to identify the resource"), with the instruction that "application/octet-stream" SHOULD be used as a fallback if no media type can be guessed.

But it looks like Real Work(tm) will be required in this particular case:

>>> import mimetypes
>>> mimetypes.guess_type('http://sw.opencyc.org/2009/04/07/concept/en/_http___dbpedia_org_ontology_PokerPlayer_')
(None, None)
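
For anyone who wants to see the spec's "guess, then fall back" rule spelled out, here is a minimal sketch using only the standard library. The function name guess_media_type and the constant are made up for illustration; nothing like this is claimed to exist in RDFLib.

```python
import mimetypes

# Fallback named by the HTTP/1.1 spec for an unknown media type.
FALLBACK_MEDIA_TYPE = "application/octet-stream"


def guess_media_type(url, header_value=None):
    """Return a media type for url (hypothetical helper, not RDFLib API).

    Prefers the Content-Type header value when one was sent, otherwise
    guesses from the URL's name extension, otherwise uses the fallback.
    """
    if header_value:
        # Drop parameters such as "; charset=utf-8".
        return header_value.split(";", 1)[0].strip().lower()
    guessed, _encoding = mimetypes.guess_type(url)
    return guessed or FALLBACK_MEDIA_TYPE
```

For the OpenCyc URL above there is no header and no extension, so this ends up at the fallback, which tells a parser nothing useful about the serialization.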

We should arrange matters so that RDFLib emits a more informative exception message: "AttributeError: 'NoneType' object has no attribute 'split'" falls well short of, for example, "No Content-Type header, refusing to guess".

There might be an argument for just bailing out: using "application/octet-stream" as the media type and registering that media type against the RDFLib XML parser as a kind of "best we can do in the circumstances" approach. It's a fix in this particular instance but I don't have any kind of a feel for what media type is actually typically served in the absence of a Content-Type header.

But this approach, apart from just papering over the cracks, is both arbitrary and "magical", neither of which is a good idea when consuming graph content. I'm inclined to believe that RDFLib should simply raise an informative exception.
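
As a rough illustration of what "informative" could look like, here is a sketch of the check that would replace the bare split() call. The exception class and function name are hypothetical, not RDFLib's, and headers is assumed to be a mapping with lower-case keys (requests' headers object is case-insensitive, a plain dict is not).

```python
class MissingContentTypeError(ValueError):
    """Hypothetical exception for illustration; not part of RDFLib."""


def media_type_or_raise(headers, url):
    """Return the bare media type from headers, or raise a clear error."""
    content_type = headers.get("content-type")
    if content_type is None:
        raise MissingContentTypeError(
            "No Content-Type header in response from %s, refusing to guess; "
            "fetch the data yourself and pass format= explicitly" % url
        )
    # Strip parameters such as "; charset=utf-8".
    return content_type.split(";", 1)[0].strip()
```

A caller can then catch one specific exception type and decide for themselves which format to assume.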

(For the benefit of future seekers of enlightenment) - your code example contains all the elements of a workaround:

>>> import rdflib
>>> import requests
>>> resp = requests.get('http://sw.opencyc.org/2009/04/07/concept/en/_http___dbpedia_org_ontology_PokerPlayer_')
>>> g = rdflib.Graph()
>>> g.parse(data=resp.content, format="xml")
<Graph identifier=Nd4a95afa438444babea2cc5c368a34d1 (<class 'rdflib.graph.Graph'>)>
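
The same workaround can be wrapped in a small helper so the fallback format is an explicit decision by the caller. This is only a sketch: parse_url and its fetch parameter are names invented here, and it assumes that any media type the server does send is one RDFLib has a parser registered for (otherwise graph.parse will raise).

```python
def parse_url(graph, url, fetch, fallback_format="xml"):
    """Fetch url and parse it into graph, tolerating a missing Content-Type.

    graph: object with an rdflib-style parse(data=..., format=...) method.
    fetch: callable such as requests.get (hypothetical injection point).
    fallback_format: format the caller asserts when the server names none.
    """
    resp = fetch(url)
    content_type = resp.headers.get("content-type")
    if content_type is None:
        # No header: use the caller's explicit choice instead of guessing.
        fmt = fallback_format
    else:
        # Strip parameters such as "; charset=utf-8".
        fmt = content_type.split(";", 1)[0].strip()
    graph.parse(data=resp.content, format=fmt)
    return graph
```

With the real libraries this would be called as parse_url(rdflib.Graph(), url, requests.get); passing the fetch function in keeps the helper easy to exercise without network access.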

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec7.html

HTH

sindikat (Author) · 2013-11-07T12:54:19Z

> We should arrange matters so that RDFLib emits a more informative exception message: "AttributeError: 'NoneType' object has no attribute 'split'" falls well short of, for example, "No Content-Type header, refusing to guess".

That's a good idea! Even if the RDFLib developers refuse to implement "smart" parsers that consider such cases, I could handle this particular exception and work around it myself.

ghost · 2013-11-07T14:48:03Z

You've pretty much accurately summarised RDFLib's approach to its users. RDFLib is a library, not an application, so the aim is to give users the relevant information at the appropriate level of detail and allow them space to develop their own solutions to the inevitable wrinkles that appear.

Just a small point of information: "refusing to guess" != "refuse to implement". The latter seems a little harsh and my off-the-cuff exception message may have given the wrong impression.

All of the RDFLib dev team have very strong professional and personal demands on their time, so people do what they can, when they can. This has recently included some outstanding contributions: Gunnar's SPARQL 1.1 algebra implementation, Ivan's RDFa 1.1 parser implementation and Niklas' JSON-LD implementation, all of which are significant pieces of high-level work.

In my case, I need to eat this month :-)

HTH

joernhees (Maintainer) · 2013-11-08T02:07:52Z

👍

sindikat (Author) · 2013-11-08T05:25:04Z

Oh, I didn't want to sound mean at all. The developers could refuse for various fair reasons.

I think the best solution currently is to provide meaningful errors and document all known corner cases in the manuals. Programmers who use RDFLib would then be given example workarounds. Something like:

You know, some websites don't provide Content-type, so you might want to use the following workaround: ...
