validate URIs somewhere #266

Closed
gromgull opened this issue Apr 13, 2013 · 9 comments

@gromgull
Member

Currently you can pass any old nonsense to the URIRef constructor, and get very broken serializations out of RDFLib:

In [13]: g=rdflib.Graph()

In [14]: g.add((rdflib.URIRef("LOL I'm not really a URI at all! <> <> :)"), rdflib.RDF.type, rdflib.RDFS.Class))

In [15]: print g.serialize(format='n3')
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<LOL I'm not really a URI at all! <> <> :)> a rdfs:Class .

We should probably validate URIs at some point.

Graham points to https://pypi.python.org/pypi/rfc3987

I am slightly worried about performance, maybe a test is in order.

This issue was distilled out of #263

@drewp
Contributor

drewp commented Apr 16, 2013

This may even be a security hole:
catPic = URIRef(raw_input("URI to your cat picture: "))
I can enter "http://example.com/user/drewp> a :Admin . <http://example.com/my/cat/pic"
which will generate an extra n3 statement.
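
A minimal, self-contained sketch of the injection described above. It does not use RDFLib: `naive_statement` is a hypothetical writer that, like the serializer behaviour reported in this issue, wraps strings in angle brackets without checking them. The predicate and object URIs are made up for illustration.

```python
# Hypothetical, deliberately naive statement writer (not RDFLib code):
# it wraps whatever string it is given in angle brackets, unchecked.
def naive_statement(subject, predicate, obj):
    return "<%s> <%s> <%s> ." % (subject, predicate, obj)

# Attacker-controlled value, modelled on the example above. It closes the
# first URI early, smuggles in a full statement, then opens a new URI.
cat_pic = ("http://example.com/user/drewp> <http://example.com/role> "
           "<http://example.com/Admin> .\n<http://example.com/my/cat/pic")

output = naive_statement(cat_pic,
                         "http://example.com/depicts",
                         "http://example.com/Cat")

# A single call produced two lines that each look like a complete statement.
lines = output.split("\n")
for line in lines:
    print(line)
```

The first printed line is the injected statement; the second is the one the caller intended to write.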

@joernhees
Member

RDF injection, little bobby tables calling

This depends on the chosen serialization though: xml seems to be more robust:

In [1]: import rdflib

In [2]: g=rdflib.Graph()

In [3]: g.add((rdflib.URIRef("http://foo\">'\\\""), rdflib.RDF.type, rdflib.RDFS.Class))

In [4]: print g.serialize(format='xml')
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>
  <rdf:Description rdf:about="http://foo&quot;&gt;'\&quot;">
    <rdf:type rdf:resource="http://www.w3.org/2000/01/rdf-schema#Class"/>
  </rdf:Description>
</rdf:RDF>

From this POV it is a serializer bug.
Still, I'd also be in favor of checking URI validity on URIRef creation.

@uholzer
Contributor

uholzer commented Apr 29, 2013

Same problem with URIRef.n3(), which I sometimes use to construct SPARQL queries:

>>> URIRef("foo:> <bar:").n3()
'<foo:> <bar:>'

@gromgull
Member Author

gromgull commented May 2, 2013

I tested the rfc module, it is a bit slow - instead I just check for some invalid characters: <>" {}|^`

In my opinion this is a good tradeoff between speed, correctness and not allowing you to shoot yourself TOO much in the foot.
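
A rough sketch of what such a character check could look like. The function name `looks_like_valid_uri` is hypothetical, and the character set is only the one listed in the comment above; this is not necessarily the code that went into RDFLib.

```python
# Characters named in the comment above: < > " space { } | ^ `
_INVALID_URI_CHARS = '<>" {}|^`'

def looks_like_valid_uri(uri):
    """Return True if none of the blacklisted characters occur in uri.

    This is a cheap sanity check, not RFC 3987 validation: plenty of
    strings that are not valid IRIs will still pass.
    """
    return not any(ch in uri for ch in _INVALID_URI_CHARS)

print(looks_like_valid_uri("http://example.org/a"))
print(looks_like_valid_uri("foo:> <bar:"))
```

The check rejects the injection strings from earlier in this thread because they all rely on `<`, `>` or a space, while leaving non-ASCII IRIs alone.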

@dgerber

dgerber commented May 2, 2013

Did you test it with the compiled regex? It takes about 1e-6 to 1e-3 sec. on my cheap laptop (although I'm not sure how slow it would be for really pathological test cases):

In [2]: setup = '''import rfc3987
match = rfc3987.get_compiled_pattern("^%(IRI)s$").match
test = '''
In [3]: stmt = 'match(test)'

In [6]: timeit.timeit(stmt, setup+'u"http://ヒキワリ.ナットウ.ニホン"') / 1e6
Out[6]: 6.132636070251465e-06

In [7]: timeit.timeit(stmt, setup+'u"http://www.w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html"') / 1e6
Out[7]: 4.8285482883453366e-05

In [10]: timeit.timeit(stmt, setup+'u"http://stackoverflow.com/questions/2891574/how-do-i-resolve-a-http-414-request-uri-too-long-error?length=4000#'+4000*'0'+'"', number=1000) / 1e3
Out[10]: 0.0010755550861358642

In [11]: setup0 = '''import rdflib
match = rdflib.term._is_valid_uri
test = '''

In [12]: timeit.timeit(stmt, setup0+'u"http://stackoverflow.com/questions/2891574/how-do-i-resolve-a-http-414-request-uri-too-long-error?length=4000#'+4000*'0'+'"', number=1000) / 1e3
Out[12]: 6.966686248779297e-05

In [13]: timeit.timeit(stmt, setup0+'u"http://www.w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html"') / 1e6
Out[13]: 5.024517059326172e-06

In [14]: timeit.timeit(stmt, setup0+'u"http://ヒキワリ.ナットウ.ニホン"') / 1e6
Out[14]: 4.335633993148804e-06

@gromgull
Member Author

gromgull commented May 2, 2013

This is a whole 6 times slower than the crappy solution! We don't have that sort of time - RDFLib is a high-performance project! ;)

I was also reluctant to add another dependency. It's tiny, and if you install with pip or something else that fetches dependencies automatically it's not a problem, but not everyone does.

I guess I could just copy the regex in question into rdflib ...

(Also, I wonder about that regex: it has several character ranges in the high unicode range (>#ffff). As I discovered when doing the sparql parser, these do not work at all in "narrow python" builds, i.e. the default osx/win builds. Some more details here: https://github.com/RDFLib/rdflib/blob/master/rdflib/plugins/sparql/parser.py#L135)

@gromgull
Member Author

gromgull commented May 2, 2013

@dgerber Oh I see you are even the rfc3987 author! Then it may interest you that on the pypi page you say that rfc3896 is the URI one, but it's actually 3986...

@uholzer
Contributor

uholzer commented May 2, 2013

I just discovered that we have the same problem with languages of literals!

>>> l = Literal('foo', lang='en . <my> <own> <triple>')
>>> l.n3()
'"foo"@en . <my> <own> <triple>'
>>> g = Graph()
>>> g.add((URIRef("a:a"), URIRef("a:b"), l))
>>> print(g.serialize(format="turtle").decode("UTF-8"))
@prefix ns1: <a:> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .

ns1:a ns1:b "foo"@en . <my> <own> <triple> .
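
One possible guard against this, sketched under the assumption that a language tag should match the `LANGTAG` production of the Turtle grammar (letters, then optional `-` separated alphanumeric subtags). The helper name `looks_like_lang_tag` is hypothetical and this is not necessarily the check RDFLib adopted.

```python
import re

# Shape of a language tag per the Turtle LANGTAG production, minus the
# leading "@": [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
# \Z is used instead of $ so that a trailing newline is rejected too.
_LANG_TAG = re.compile(r"[a-zA-Z]+(?:-[a-zA-Z0-9]+)*\Z")

def looks_like_lang_tag(tag):
    """Return True if tag has the syntactic shape of a language tag."""
    return _LANG_TAG.match(tag) is not None

print(looks_like_lang_tag("en-GB"))
print(looks_like_lang_tag("en . <my> <own> <triple>"))
```

This only checks syntax; it does not verify that the tag names a registered language.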

@dgerber

dgerber commented May 3, 2013

OK, this should give an incorrect pattern, but one that at least compiles in narrow-minded builds:

'^%s$' % rfc3987._interpret_unicode_escapes(re.sub(r'\\U[0-9A-F]{8}-\\U[0-9A-F]{8}', '', rfc3987.format_patterns()['IRI']))

Note that the ~4000-character URI is actually 14 times slower... Maybe allowing for a non-validating constructor would make those expecting high performance happier.

(@gromgull Thanks for the typo)

gromgull added a commit that referenced this issue May 21, 2013
…method. Validating all URIs on creation time was creating too many problems. Related #288, #285, #287, #279, #266
mamash pushed a commit to TritonDataCenter/pkgsrc-wip that referenced this issue Feb 15, 2014
2013/12/31 RELEASE 4.1
======================

This is a new minor version of RDFLib, which includes a handful of new features:

* A TriG parser was added (we already had a serializer) - it is
  up-to-date wrt. the newest spec from: http://www.w3.org/TR/trig/

* The Turtle parser was made up to date wrt. the latest Turtle spec.

* Many more tests have been added - RDFLib now has over 2000
  (passing!) tests. This is mainly thanks to the NT, Turtle, TriG,
  NQuads and SPARQL test-suites from W3C. This also included many
  fixes to the nt and nquad parsers.

* ```ConjunctiveGraph``` and ```Dataset``` now support directly adding/removing
  quads with ```add/addN/remove``` methods.

* ```rdfpipe``` command now supports datasets, and reading/writing context
  sensitive formats.

* Optional graph-tracking was added to the Store interface, allowing
  empty graphs to be tracked for Datasets. The DataSet class also saw
  a general clean-up, see: RDFLib/rdflib#309

* After long deprecation, ```BackwardCompatibleGraph``` was removed.

Minor enhancements/bugs fixed:
------------------------------

* Many code samples in the documentation were fixed thanks to @PuckCh

* The new ```IOMemory``` store was optimised a bit

* ```SPARQL(Update)Store``` has been made more generic.

* MD5 sums were never reinitialized in ```rdflib.compare```

* Correct default value for empty prefix in N3
  [#312]RDFLib/rdflib#312

* Fixed tests when running in a non UTF-8 locale
  [#344]RDFLib/rdflib#344

* Prefixes in the original Turtle have an impact on SPARQL query
  resolution
  [#313]RDFLib/rdflib#313

* Duplicate BNode IDs from N3 Parser
  [#305]RDFLib/rdflib#305

* Use QNames for TriG graph names
  [#330]RDFLib/rdflib#330

* \uXXXX escapes in Turtle/N3 were fixed
  [#335]RDFLib/rdflib#335

* A way to limit the number of triples retrieved from the
  ```SPARQLStore``` was added
  [#346]RDFLib/rdflib#346

* Dots in localnames in Turtle
  [#345]RDFLib/rdflib#345
  [#336]RDFLib/rdflib#336

* ```BNode``` as Graph's public ID
  [#300]RDFLib/rdflib#300

* Introduced ordering of ```QuotedGraphs```
  [#291]RDFLib/rdflib#291

2013/05/22 RELEASE 4.0.1
========================

Following RDFLib tradition, some bugs snuck into the 4.0 release.
This is a bug-fixing release:

* the new URI validation caused lots of problems, but is
  necessary to avoid "RDF injection" vulnerabilities. In the
  spirit of "be liberal in what you accept, but conservative in
  what you produce", we moved validation to serialisation time.

* the   ```rdflib.tools```   package    was   missing   from   the
  ```setup.py```  script, and  was therefore  not included  in the
  PYPI tarballs.

* RDF parser choked on empty namespace URI
  [#288](RDFLib/rdflib#288)

* Parsing from ```sys.stdin``` was broken
  [#285](RDFLib/rdflib#285)

* The new IO store had problems with concurrent modifications if
  several graphs used the same store
  [#286](RDFLib/rdflib#286)

* Moved HTML5Lib dependency to the recently released 1.0b1 which
  supports Python 3
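
The first item in the list above, accepting anything at construction and validating only at serialisation time, could be sketched roughly as follows. `LenientURI` is a hypothetical stand-in, not RDFLib's `URIRef`, and raising `ValueError` is an assumption made for the sketch; the release notes do not say how RDFLib reports invalid URIs.

```python
# Same illustrative blacklist as discussed in the issue: < > " space { } | ^ `
_INVALID_URI_CHARS = '<>" {}|^`'

class LenientURI(str):
    """Hypothetical URI term: construction never fails."""

    def n3(self):
        # Validation happens only here, when text is about to be emitted.
        bad = sorted(set(ch for ch in self if ch in _INVALID_URI_CHARS))
        if bad:
            raise ValueError("%r contains characters not allowed in a URI: %r"
                             % (str(self), bad))
        return "<%s>" % self

suspicious = LenientURI("foo:> <bar:")   # liberal in what is accepted
try:
    suspicious.n3()                      # conservative in what is produced
except ValueError as err:
    print("refused to serialise:", err)
```

Deferring the check means code that merely stores or compares such terms keeps working; only output paths need to handle the failure.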

2013/05/16 RELEASE 4.0
======================

This release includes several major changes:

* The new SPARQL 1.1 engine (rdflib-sparql) has been included in
  the core distribution. SPARQL 1.1 queries and updates should
  work out of the box.

  * SPARQL paths are exposed as operators on ```URIRefs```, these can
    then be used with graph.triples and friends:

    ```py
    # List names of friends of Bob:
    g.triples(( bob, FOAF.knows/FOAF.name , None ))

    # All super-classes:
    g.triples(( cls, RDFS.subClassOf * '+', None ))
    ```

  * A new ```graph.update``` method will apply SPARQL update statements

* Several RDF 1.1 features are available:
  * A new ```DataSet``` class
  * ```XMLLiteral``` and ```HTMLLiterals```
  * ```BNode``` (de)skolemization is supported through ```BNode.skolemize```,
    ```URIRef.de_skolemize```, ```Graph.skolemize``` and ```Graph.de_skolemize```

* Handling of Literal equality was split into lexical comparison
  (for normal ```==``` operator) and value space (using new ```Node.eq```
  methods). This introduces some slight backwards incompatible
  changes, but was necessary, as the old version had
  inconsistent hash and equality methods that could lead to
  literals not working correctly in dicts/sets.
  The new way is more in line with how SPARQL 1.1 works.
  For the full details, see:

  https://github.com/RDFLib/rdflib/wiki/Literal-reworking

* Iterating over ```QueryResults``` will generate ```ResultRow``` objects,
  these allow access to variable bindings as attributes or as a
  dict. I.e.

  ```py
  for row in graph.query('select ... ') :
     print row.age, row["name"]
  ```

* "Slicing" of Graphs and Resources as syntactic sugar:
  ([#271](RDFLib/rdflib#271))

  ```py
  graph[bob : FOAF.knows/FOAF.name]
            -> generator over the names of Bobs friends
  ```

* The ```SPARQLStore``` and ```SPARQLUpdateStore``` are now included
  in the RDFLib core

* The documentation has been given a major overhaul, and examples
  for most features have been added.


Minor Changes:
--------------

* String operations on URIRefs return new URIRefs: ([#258](RDFLib/rdflib#258))
  ```py
  >>> URIRef('http://example.org/')+'test'
  rdflib.term.URIRef('http://example.org/test')
  ```

* Parser/Serializer plugins are also found by mime-type, not just
  by plugin name:  ([#277](RDFLib/rdflib#277))
* ```Namespace``` is no longer a subclass of ```URIRef```
* URIRefs and Literal language tags are validated on construction,
  avoiding some "RDF-injection" issues ([#266](RDFLib/rdflib#266))
* A new memory store needs much less memory when loading large
  graphs ([#268](RDFLib/rdflib#268))
* Turtle/N3 serializer now supports the base keyword correctly ([#248](RDFLib/rdflib#248))
* py2exe support was fixed ([#257](RDFLib/rdflib#257))
* Several bugs in the TriG serializer were fixed
* Several bugs in the NQuads parser were fixed