##Overview
XML is a markup language used to encode data in a document. It is both human-readable and machine-readable. A benefit of this markup language is that there are no predefined tags: the author of a given XML document may create any tags needed to fit whatever structure the data logically requires.
###Sample Document
```xml
<?xml version='1.0'?>
<!-- Sample Dataset-->
<dataset>
  <observation>
    <dependent-variable>James Blonde</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>0034773019</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>
  <observation>
    <dependent-variable>Boston Powers</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>007000007</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>
  ...
</dataset>
```
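As a rough illustration of how such a document can be consumed, the sketch below reads a trimmed copy of the sample dataset with Python's standard `xml.etree.ElementTree` module. The function name `read_observations` and the dictionary layout are assumptions made for this example, not part of the original document.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the sample dataset (one observation only).
SAMPLE = """<?xml version='1.0'?>
<dataset>
  <observation>
    <dependent-variable>James Blonde</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>0034773019</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>
</dataset>"""


def read_observations(xml_text):
    """Return one dict per <observation>: its name plus label/value pairs."""
    root = ET.fromstring(xml_text)
    observations = []
    for obs in root.findall('observation'):
        record = {'name': obs.findtext('dependent-variable')}
        for var in obs.findall('independent-variable'):
            # Values stay strings so leading zeros (e.g. in SSN) survive.
            record[var.findtext('label')] = var.findtext('value')
        observations.append(record)
    return observations


records = read_observations(SAMPLE)
print(records)
```

Values are deliberately left as strings here; converting them to numbers would be a separate decision that depends on the data.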
###XML Declaration
An XML document may begin with an optional declaration. If one is used, it is important to remember that nothing may precede the declaration, not even whitespace or comments.
Generally, an XML declaration is as follows:

```xml
<?xml version='1.0'?>
```

where the `version` attribute indicates the XML version being used. Another optional attribute may be defined in the same declaration. Specifically, the `encoding` attribute indicates the encoding standard being used in the XML document:

```xml
<?xml version='1.0' encoding='UTF-8'?>
```

The XML standard states that all XML software must understand both `UTF-8` and `UTF-16`. When this attribute is not defined (and the document carries no byte order mark), the XML document defaults to `UTF-8`.
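The effect of the `encoding` attribute can be sketched with Python's standard `xml.etree.ElementTree` module. The variable names and the sample text below are assumptions chosen for illustration: the same text is encoded once as UTF-8 without a declaration, once as ISO-8859-1 with a matching declaration, and once as ISO-8859-1 without one.

```python
import xml.etree.ElementTree as ET

text = 'caf\u00e9'  # contains one non-ASCII character

# No encoding attribute: the bytes are read as UTF-8.
utf8_doc = ('<name>' + text + '</name>').encode('utf-8')

# Explicit encoding attribute: the bytes must really use that encoding.
latin1_doc = ("<?xml version='1.0' encoding='ISO-8859-1'?>"
              '<name>' + text + '</name>').encode('iso-8859-1')

# ISO-8859-1 bytes with no declaration: the parser assumes UTF-8 and fails.
unlabeled_doc = ('<name>' + text + '</name>').encode('iso-8859-1')

utf8_name = ET.fromstring(utf8_doc).text
latin1_name = ET.fromstring(latin1_doc).text

try:
    ET.fromstring(unlabeled_doc)
    unlabeled_ok = True
except ET.ParseError:
    unlabeled_ok = False

print(utf8_name, latin1_name, unlabeled_ok)
```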
Note: an XML declaration is case sensitive, and cannot begin as `<?XML ..?>`.
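The placement and case rules for the declaration can be tried out with a small sketch. It assumes Python's standard `xml.etree.ElementTree` parser; the helper name `parses` is made up for this example.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Declaration at the very start of the document: accepted.
ok = parses("<?xml version='1.0'?><dataset/>")
# The declaration is optional, so leaving it out is fine too.
no_declaration = parses("<dataset/>")
# Whitespace or a comment before the declaration: rejected.
leading_space = parses(" <?xml version='1.0'?><dataset/>")
leading_comment = parses("<!-- c --><?xml version='1.0'?><dataset/>")
# Upper-case spelling of the declaration: rejected.
upper_case = parses("<?XML version='1.0'?><dataset/>")

print(ok, no_declaration, leading_space, leading_comment, upper_case)
```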
###XML Document:
An XML document is syntactically similar to HTML, but the two serve different purposes: HTML was designed to display data (presentation), whereas XML was designed to describe data, with a focus on what the data means.
XML syntax requirements:
- An XML document must have exactly one root element (see `<dataset>` above)
- The root element encapsulates all other elements
- XML element names are case sensitive
- Every XML element with an opening tag must have a corresponding closing tag
- A closing tag must contain a slash (i.e. `</xxx>`)
- XML elements may be nested
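The requirements above can be exercised with a short sketch that feeds a few small documents to Python's standard `xml.etree.ElementTree` parser. The helper name `check` and the test strings are assumptions for this example; the exact error wording comes from the parser and may differ between versions.

```python
import xml.etree.ElementTree as ET


def check(xml_text):
    """Return None if xml_text is well formed, else the parser's message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        return str(err)


cases = {
    'one root, nested children': '<dataset><observation/></dataset>',
    'two root elements': '<dataset/><dataset/>',
    'case mismatch in closing tag': '<dataset></Dataset>',
    'missing closing tag': '<dataset><observation></dataset>',
}

results = {name: check(doc) for name, doc in cases.items()}
for name, problem in results.items():
    print(name, '->', problem or 'well formed')
```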
###XML Validation
An XML document can generally be validated against a document type definition or an XML schema. Both choices require validation logic, which compares the XML document against the defined rule set (i.e. DTD, XML schema).
####Document Type Definition
A document type definition (DTD) defines the following properties:
- what elements are allowed in the XML document
- what attributes each element is allowed to have
- the ordering and nesting of these elements

DTDs are declared within the `DOCTYPE` declaration, under the XML declaration.

The following is an example of an inline definition:

```xml
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE documentelement [definition]>
```

while the following is an example of an external definition:

```xml
<?xml version="1.0"?>
<!DOCTYPE documentelement SYSTEM "https://localhost/dataset.dtd">
```

Both options can either expand `definition` (the code below goes inside the square brackets), or define `dataset.dtd`, as follows:
```xml
<!ELEMENT dataset (observation+)>
<!ELEMENT observation (dependent-variable,independent-variable+)>
<!ELEMENT dependent-variable (#PCDATA)>
<!ELEMENT independent-variable (label,value)>
<!ELEMENT label (#PCDATA)>
<!ELEMENT value (#PCDATA)>
```

The above DTD defines the following structure:
- a `dataset` contains at least one `observation`
- an `observation` contains one `dependent-variable`, and at least one `independent-variable`
- a `dependent-variable` contains `#PCDATA` text
- an `independent-variable` contains a `label`, and a `value`
- both `label`, and `value` contain `#PCDATA` text (parsed character data: the parser reads the text and expands any entity references in it)

Note: `#PCDATA` is the only keyword for text in an element declaration. `CDATA` (character data that is not parsed) is an attribute type used in `<!ATTLIST>` declarations, and cannot appear in an `<!ELEMENT>` content model. A further alternative is `ANY`, which means an element may contain any content.
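As a rough illustration of the structure this DTD describes, the sketch below checks the same nesting rules by hand with Python's standard `xml.etree.ElementTree` module. It is not a DTD validator and it ignores text content; the function name `structure_errors` and the sample strings are made up for this example.

```python
import xml.etree.ElementTree as ET


def structure_errors(xml_text):
    """Hand-written check of the nesting rules (not a DTD validator)."""
    errors = []
    root = ET.fromstring(xml_text)
    if root.tag != 'dataset':
        return ['root element must be <dataset>']
    observations = list(root)
    if not observations:
        errors.append('dataset needs at least one observation')
    for i, obs in enumerate(observations, start=1):
        tags = [child.tag for child in obs]
        if obs.tag != 'observation':
            errors.append('child %d of dataset is <%s>' % (i, obs.tag))
        elif (len(tags) < 2 or tags[0] != 'dependent-variable'
              or any(t != 'independent-variable' for t in tags[1:])):
            # Needs one dependent-variable, then 1+ independent-variable.
            errors.append('observation %d has children %s' % (i, tags))
        else:
            for var in obs[1:]:
                if [c.tag for c in var] != ['label', 'value']:
                    errors.append('observation %d: bad independent-variable' % i)
    return errors


good = ('<dataset><observation>'
        '<dependent-variable>a</dependent-variable>'
        '<independent-variable><label>x</label><value>1</value>'
        '</independent-variable></observation></dataset>')
no_vars = ('<dataset><observation>'
           '<dependent-variable>a</dependent-variable>'
           '</observation></dataset>')

print(structure_errors(good))
print(structure_errors(no_vars))
print(structure_errors('<dataset/>'))
```

A real DTD validator performs these checks generically from the declarations; the point here is only to make the listed rules concrete.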
Note: if `observation+` was replaced with `observation*`, then there would be 0 or more observations.
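The difference between `+` and `*` behaves much like the same operators in regular expressions. The sketch below is only an analogy, not how a DTD processor works: it joins the child tag names of an element into a string and matches that string against a pattern. The helper `children_match` and both pattern constants are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET


def children_match(element, pattern):
    """Match the sequence of child tag names against a regular expression."""
    # Each child contributes "<tag>," so the pattern sees names in order.
    sequence = ''.join(child.tag + ',' for child in element)
    return re.fullmatch(pattern, sequence) is not None


ONE_OR_MORE = r'(observation,)+'   # analogous to (observation+)
ZERO_OR_MORE = r'(observation,)*'  # analogous to (observation*)

empty = ET.fromstring('<dataset/>')
filled = ET.fromstring('<dataset><observation/><observation/></dataset>')

print(children_match(empty, ONE_OR_MORE))    # empty dataset vs. "+"
print(children_match(empty, ZERO_OR_MORE))   # empty dataset vs. "*"
print(children_match(filled, ONE_OR_MORE))   # two observations vs. "+"
```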
####XML Schema
XML schemas are a more powerful alternative to the above document type definition. Instead of adhering to the DTD syntax to customize a particular rule set, XML schemas are themselves XML documents. This allows the corresponding rule set to be as granular as needed. Also, schemas can be implemented inline, or as an external `dataset.xsd`.

The following `dataset.xsd` closely mirrors the above `dataset.dtd`, with two differences: it allows more than one `dependent-variable` per `observation`, and it restricts `value` to a decimal number:
```xml
<?xml version='1.0'?>
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='dataset'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='observation' maxOccurs='unbounded'>
          <xs:complexType>
            <xs:sequence>
              <xs:element name='dependent-variable' maxOccurs='unbounded' type='xs:string'/>
              <xs:element name='independent-variable' maxOccurs='unbounded'>
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name='label' type='xs:string'/>
                    <xs:element name='value' type='xs:decimal'/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```
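Because a schema is itself an XML document, it can be read with an ordinary XML parser. The sketch below, which assumes Python's standard `xml.etree.ElementTree` module and a hypothetical helper named `declared_elements`, lists every `xs:element` in a copy of the schema together with its `type` and `maxOccurs` attributes.

```python
import xml.etree.ElementTree as ET

XS = '{http://www.w3.org/2001/XMLSchema}'

# Copy of the schema shown above, without the XML declaration.
XSD = """<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='dataset'>
    <xs:complexType><xs:sequence>
      <xs:element name='observation' maxOccurs='unbounded'>
        <xs:complexType><xs:sequence>
          <xs:element name='dependent-variable' maxOccurs='unbounded' type='xs:string'/>
          <xs:element name='independent-variable' maxOccurs='unbounded'>
            <xs:complexType><xs:sequence>
              <xs:element name='label' type='xs:string'/>
              <xs:element name='value' type='xs:decimal'/>
            </xs:sequence></xs:complexType>
          </xs:element>
        </xs:sequence></xs:complexType>
      </xs:element>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""


def declared_elements(xsd_text):
    """List (name, type, maxOccurs) for every xs:element, in document order."""
    root = ET.fromstring(xsd_text)
    rows = []
    for el in root.iter(XS + 'element'):
        rows.append((
            el.get('name'),
            # Elements without a type attribute carry a nested complexType.
            el.get('type', '(inline complexType)'),
            # Reported as '1' when the attribute is absent.
            el.get('maxOccurs', '1'),
        ))
    return rows


rows = declared_elements(XSD)
for row in rows:
    print(row)
```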
To use the above XML schema, the following needs to be present at the top of each corresponding XML document:

```xml
<?xml version='1.0'?>
<dataset xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
         xsi:noNamespaceSchemaLocation='dataset.xsd'>
...
```
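The `xsi:noNamespaceSchemaLocation` attribute is an ordinary namespaced attribute, so a program can read it back. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` module and a hypothetical helper named `schema_location`:

```python
import xml.etree.ElementTree as ET

XSI = 'http://www.w3.org/2001/XMLSchema-instance'

# A minimal, complete document carrying the schema hint (made up here).
DOC = """<dataset xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
         xsi:noNamespaceSchemaLocation='dataset.xsd'/>"""


def schema_location(xml_text):
    """Return the xsi:noNamespaceSchemaLocation hint, or None if absent."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced attribute names as {namespace-uri}local.
    return root.get('{%s}noNamespaceSchemaLocation' % XSI)


location = schema_location(DOC)
print(location)
```

The attribute is only a hint: a validator may use it to locate the schema, or it may require the schema to be supplied explicitly.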
A second alternative is an inline XML schema:

```xml
<?xml version='1.0'?>
<dataset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="#mySchema">
  <xs:schema id="mySchema" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name='dataset'>
      <xs:complexType>
        <xs:sequence>
          <xs:element name='observation' maxOccurs='unbounded'>
            <xs:complexType>
              <xs:sequence>
                <xs:element name='dependent-variable' maxOccurs='unbounded' type='xs:string'/>
                <xs:element name='independent-variable' maxOccurs='unbounded'>
                  <xs:complexType>
                    <xs:sequence>
                      <xs:element name='label' type='xs:string'/>
                      <xs:element name='value' type='xs:decimal'/>
                    </xs:sequence>
                  </xs:complexType>
                </xs:element>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>
  <observation>
    <dependent-variable>dep-variable-1</dependent-variable>
    <independent-variable>
      <label>indep-variable-1</label>
      <value>23.45</value>
    </independent-variable>
    ...
  </observation>
  ...
</dataset>
```
Note: the above inline schema is equivalent to the former external `dataset.xsd`. Not every validator supports inline schemas, so the external form is the more portable choice.
Note: the following may be reviewed for more explicit syntax understanding:
- `schema` (including namespacing)
- `complexType`
- `element`
- `type` attribute
- indicators
####Validation
Validation requires additional logic. The examples below use Python's lxml library to perform the validation logic.
DTD Validation:

```python
from lxml import etree

# Load the rule set and the document to check.
with open('schema.dtd', 'rb') as dtd_file:
    dtd = etree.DTD(dtd_file)
tree = etree.parse('document.xml')

# validate() returns True or False; details are collected in dtd.error_log.
if dtd.validate(tree):
    print('XML was valid!')
else:
    print('XML was not valid!')
    for error in dtd.error_log.filter_from_errors():
        print('Error on line %s:%s, %s' % (error.line, error.column, error.message))
```
Schema Validation:

```python
from lxml import etree

# Compile the schema once; the same object can validate many documents.
xmlschema = etree.XMLSchema(etree.parse('schema.xsd'))

try:
    document = etree.parse('document.xml')
    print('Parse complete!')
except etree.XMLSyntaxError as e:
    # A document that is not well formed cannot be validated at all.
    print(e)
    document = None

if document is not None:
    if xmlschema.validate(document):
        print('XML was valid!')
    else:
        for error in xmlschema.error_log:
            print('Error on line %s:%s, %s' % (error.line, error.column, error.message))
```
Note: each lxml error log entry carries additional attributes (for example `level_name`, `domain_name`, and `type_name`) that can be included in the reported message.