System emits triples that cause warnings #112

Open
paulhoule opened this issue Mar 17, 2014 · 2 comments

Comments

@paulhoule
Owner

Andy Seaborne from the Jena project ran :BaseKB through a validator. He did not find any errors, but he found plenty of warnings. A sample of some common types:

WARN  [line: 1966326, col: 94] Bad IRI: <http://www.michaelnugent.comhttp://twitter.com/micknugent> Code: 12/PORT_SHOULD_NOT_BE_EMPTY in PORT: The colon introducing an empty port component should be omitted entirely, or a port number should be specified.
WARN  [line: 2870786, col: 95] Bad IRI: <http://81.173.3.20:80> Code: 13/DEFAULT_PORT_SHOULD_BE_OMITTED in PORT: If the port is the default one for the scheme it should be omitted.
WARN  [line: 2870786, col: 95] Bad IRI: <http://81.173.3.20:80> Code: 14/PORT_SHOULD_NOT_BE_WELL_KNOWN in PORT: Ports under 1024 should be accessed using the appropriate scheme name.
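
To find IRIs like these before the data reaches a validator, something along the following lines might work. It is only a rough sketch that mimics the three port warnings quoted above using Python's standard `urllib.parse`; it is not how Jena implements the checks. The function name `port_warnings` and the `DEFAULT_PORTS` table are made up for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical table of scheme defaults; extend as needed.
DEFAULT_PORTS = {"http": "80", "https": "443"}


def port_warnings(iri):
    """Return the names of port-related warnings that apply to an IRI.

    Simplification: userinfo ("user@host") and bracketed IPv6 hosts
    are not handled specially; bracketed hosts are skipped.
    """
    parts = urlsplit(iri)
    netloc = parts.netloc
    if netloc.endswith("]"):
        # Bracketed IPv6 literal without a port, e.g. "[::1]".
        return []
    _host, colon, port = netloc.rpartition(":")
    if not colon:
        # No colon in the authority, so no port component at all.
        return []
    if port == "":
        # A trailing colon with nothing after it, as in "...comhttp:".
        return ["PORT_SHOULD_NOT_BE_EMPTY"]
    warnings = []
    if DEFAULT_PORTS.get(parts.scheme) == port:
        warnings.append("DEFAULT_PORT_SHOULD_BE_OMITTED")
    if port.isdigit() and int(port) < 1024:
        warnings.append("PORT_SHOULD_NOT_BE_WELL_KNOWN")
    return warnings
```

The first sample IRI trips the empty-port check because the two URLs were concatenated, so the authority component ends in `comhttp:` and the colon is read as introducing a port.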

There are also some problems with literals, most frequently xsd:dateTime literals:

WARN  [line: 2748739, col: 124] Lexical form 'T00:00' not valid for datatype http://www.w3.org/2001/XMLSchema#dateTime

However, there are also problems with literals of type gYear:

WARN  [line: 1603109, col: 84] Lexical form '-0410-09' not valid for datatype http://www.w3.org/2001/XMLSchema#gYear

and also with one quirky date:

WARN  [line: 337710, col: 96] Lexical form '0000-08-27' not valid for datatype http://www.w3.org/2001/XMLSchema#date

I think this is because there isn't really a year zero.
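
The three literal warnings above can be reproduced with a simplified lexical check. The sketch below assumes the validator follows XML Schema 1.0 rules, under which the year 0000 is not a legal lexical value. It only checks the shape of the string (it does not verify month or day ranges), and the names `PATTERNS` and `is_valid_lexical` are hypothetical.

```python
import re

# Optional timezone: "Z" or a +hh:mm / -hh:mm offset.
_TZ = r"(?:Z|[+-]\d{2}:\d{2})?"
# Year: optional minus sign, at least four digits, no leading zero when
# longer than four digits, and never 0000 (XML Schema 1.0 rule).
_YEAR = r"-?(?!0000)(?:[1-9]\d{3,}|0\d{3})"

PATTERNS = {
    "gYear": re.compile(_YEAR + _TZ),
    "date": re.compile(_YEAR + r"-\d{2}-\d{2}" + _TZ),
    "dateTime": re.compile(
        _YEAR + r"-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?" + _TZ
    ),
}


def is_valid_lexical(datatype, lexical):
    """Shape-only check of a lexical form against a datatype's pattern."""
    return PATTERNS[datatype].fullmatch(lexical) is not None
```

Under these patterns 'T00:00' fails because it has no date part, '-0410-09' fails as a gYear because it carries a month (it has the shape of a gYearMonth), and '0000-08-27' fails because of the year-zero rule.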

There's another problem that occasionally affects URLs:

INFO  File: sieved/webpages/webpages-m-00003.nt.gz
WARN  [line: 1231492, col: 103] Bad IRI: <http://??.??/> Code: 57/REQUIRED_COMPONENT_MISSING in HOST: A component that is required by the scheme is missing.

There might be something weird going on here, such as Unicode characters that got dumbed down to '?', or maybe not.

@smyth64

smyth64 commented Nov 10, 2014

My solution: I run :BaseKB through the parser and remove all invalid lines. Afterwards Stardog is happy to get fresh, consistent data :)
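
One way such a filtering pass could look, as a rough sketch only: collect the line numbers from the validator's WARN messages (in the `[line: N, col: M]` format shown earlier in this issue) and skip those lines when copying the file. This assumes the log covers a single N-Triples file and that line numbers are 1-based; the function names are made up, and the parser actually used for this workaround is not stated.

```python
import re

# Matches the "[line: N, col: M]" position marker in validator output.
LINE_REF = re.compile(r"\[line: (\d+), col: \d+\]")


def flagged_line_numbers(log_lines):
    """Collect the line numbers mentioned in WARN entries of a log."""
    flagged = set()
    for entry in log_lines:
        if entry.startswith("WARN"):
            match = LINE_REF.search(entry)
            if match:
                flagged.add(int(match.group(1)))
    return flagged


def drop_flagged(triples, flagged):
    """Yield only the lines whose 1-based number was not flagged."""
    for number, triple in enumerate(triples, start=1):
        if number not in flagged:
            yield triple
```

Both functions work on any iterable of strings, so they can be fed open file handles (for example from `gzip.open(path, "rt")`) without loading a whole dump into memory.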

@bfreeman19871987

How curated is the data you are able to validate?


3 participants