Jeff Levesque edited this page Nov 23, 2015 · 32 revisions

##Overview

XML is a markup language used to encode data into a document. It is both human-readable and machine-readable. A benefit of this markup language is that it has no predefined tags: the author of an XML document may create any tags needed to express whatever structure the data logically requires.

###Sample Document

<?xml version='1.0'?>

<!-- Sample Dataset-->
<dataset>
  <observation>
    <dependent-variable>James Blonde</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>0034773019</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>

  <observation>
    <dependent-variable>Boston Powers</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>007000007</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>

  ...
</dataset>
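
As a rough illustration of how such a document can be consumed, the sketch below reads a trimmed copy of the sample with Python's standard-library `xml.etree.ElementTree` and flattens each observation into a dictionary. The `SAMPLE` constant and the `read_observations` helper are hypothetical names introduced for this example only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample dataset above (one observation only).
SAMPLE = b"""<?xml version='1.0'?>
<dataset>
  <observation>
    <dependent-variable>James Blonde</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>0034773019</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>
</dataset>
"""


def read_observations(xml_bytes):
    """Return one dict per <observation>: its dependent variable plus label/value pairs."""
    root = ET.fromstring(xml_bytes)
    observations = []
    for obs in root.findall('observation'):
        # Map each <label> to the <value> found in the same <independent-variable>.
        features = {
            iv.findtext('label'): iv.findtext('value')
            for iv in obs.findall('independent-variable')
        }
        observations.append({
            'dependent': obs.findtext('dependent-variable'),
            'independent': features,
        })
    return observations


if __name__ == '__main__':
    for row in read_observations(SAMPLE):
        print(row['dependent'], row['independent'])
```

Values come back as strings, so leading zeros (as in the SSN) are preserved; any numeric conversion is left to the caller.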

###XML Declaration

An XML document may begin with an optional declaration. If one is used, nothing may precede it, not even whitespace or comments.

Generally, an XML declaration is as follows:

<?xml version='1.0'?>

where the version attribute indicates the XML version being used. An optional encoding attribute may be defined in the same declaration to indicate the character encoding used in the XML document:

<?xml version='1.0' encoding='UTF-8'?>

The XML standard requires all XML processors to understand both UTF-8 and UTF-16. When this attribute is not defined, and the document has no byte order mark, the document is treated as UTF-8.

Note: an XML declaration is case sensitive, and cannot begin as <?XML ..?>.
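
The placement and case rules above can be observed with a small sketch. It assumes Python's standard-library `xml.etree.ElementTree` parser; the `parses` helper and the sample byte strings are hypothetical names made up for this illustration.

```python
import xml.etree.ElementTree as ET


def parses(xml_bytes):
    """Return True when the standard-library parser accepts the bytes as XML."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# Declaration starts on the very first byte: accepted.
good = b"<?xml version='1.0' encoding='UTF-8'?><dataset/>"
# A single space before the declaration: rejected.
leading_space = b" <?xml version='1.0' encoding='UTF-8'?><dataset/>"
# Upper-case target in the declaration: rejected.
upper_case = b"<?XML version='1.0'?><dataset/>"
# No declaration at all: accepted, since the declaration is optional.
no_declaration = b"<dataset/>"

if __name__ == '__main__':
    for name, doc in [('good', good), ('leading_space', leading_space),
                      ('upper_case', upper_case), ('no_declaration', no_declaration)]:
        print(name, parses(doc))
```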

###XML Document

An XML document is syntactically similar to HTML. The difference is one of purpose: HTML was designed to display data (presentation), whereas XML was designed to describe data, with a focus on what the data means.

XML syntax requirements:

  • An XML document must have exactly one root element (see <dataset> above)
  • The root element encapsulates all other elements
  • XML element names are case sensitive
  • Every XML element with an opening tag must have a corresponding closing tag (an empty element may instead be written as a single self-closing tag, i.e. <xxx/>)
  • A closing tag must contain a slash (i.e. </xxx>)
  • XML elements may be nested
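
A minimal sketch of these requirements in practice, assuming Python's standard-library `xml.etree.ElementTree` as the parser. The `well_formed_error` helper and the `CASES` strings are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET


def well_formed_error(xml_text):
    """Return None if the text is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as exc:
        return str(exc)


CASES = {
    'single root, nested elements': '<dataset><observation/></dataset>',
    'two root elements': '<dataset/><dataset/>',
    'case mismatch in closing tag': '<dataset></Dataset>',
    'missing closing tag': '<dataset><observation></dataset>',
}

if __name__ == '__main__':
    for description, text in CASES.items():
        print(description, '->', well_formed_error(text) or 'well-formed')
```

Only the first case parses; the other three each break one of the rules listed above and raise a parse error.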

###XML Validation

An XML document can generally be validated against a document type definition or an XML schema. Either choice requires validation logic that compares the XML document against the defined rule set (i.e. DTD, XML schema).

####Document Type Definition

A document type definition (DTD) defines the following properties:

  • what elements are allowed in the XML document
  • what attributes each element is allowed to have
  • the ordering and nesting of these elements

DTDs are declared within the DOCTYPE declaration, below the XML declaration.

The following is an example of an inline definition:

<?xml version='1.0' encoding='UTF-8'?>

<!DOCTYPE documentelement [definition]>

while the following is an example of an external definition:

<?xml version="1.0"?> 

<!DOCTYPE documentelement SYSTEM "https://localhost/dataset.dtd">

In both options the same rules are supplied, either by expanding definition (the code below goes inside the square brackets), or by defining dataset.dtd as follows:

<!ELEMENT dataset (observation+)>
<!ELEMENT observation (dependent-variable,independent-variable+)>
<!ELEMENT dependent-variable (#PCDATA)>
<!ELEMENT independent-variable (label,value)>
<!ELEMENT label (#PCDATA)>
<!ELEMENT value (#PCDATA)>

The above DTD defines the following structure:

  • a dataset contains at least one observation
  • an observation contains one dependent-variable, followed by at least one independent-variable
  • a dependent-variable contains PCDATA text
  • an independent-variable contains a label, followed by a value
  • both label and value contain PCDATA text (parsed character data: text that the parser scans for entity references and markup)

Note: #PCDATA is the only keyword for text content in an element declaration. CDATA is an attribute type, used in ATTLIST declarations, and is not valid in an ELEMENT content model. An alternative content specification is ANY, which means an element may contain any content.

Note: if observation+ were replaced with observation*, a dataset could contain zero or more observations.
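
To make the element rules concrete, the sketch below checks the same structure by hand with Python's standard library. It is not a DTD validator, only a partial illustration of what the rules assert (it does not inspect text content or attributes), and `check_structure` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def check_structure(xml_text):
    """Return a list of problems found when comparing a document to the DTD rules."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != 'dataset':
        problems.append('root element must be dataset')
    children = list(root)
    # dataset (observation+): one or more observation children, nothing else.
    if not children or any(child.tag != 'observation' for child in children):
        problems.append('dataset must contain one or more observation elements only')
    for index, obs in enumerate(root.findall('observation'), start=1):
        tags = [child.tag for child in obs]
        # observation (dependent-variable, independent-variable+): order matters.
        expected = ['dependent-variable'] + ['independent-variable'] * (len(tags) - 1)
        if len(tags) < 2 or tags != expected:
            problems.append(
                'observation %d: expected dependent-variable, then independent-variable+' % index)
        for iv in obs.findall('independent-variable'):
            # independent-variable (label, value): exactly these two, in this order.
            if [child.tag for child in iv] != ['label', 'value']:
                problems.append(
                    'observation %d: independent-variable must hold label, value' % index)
    return problems


if __name__ == '__main__':
    document = (
        '<dataset><observation>'
        '<dependent-variable>a</dependent-variable>'
        '<independent-variable><label>x</label><value>1</value></independent-variable>'
        '</observation></dataset>'
    )
    print(check_structure(document) or 'structure matches')
```

A real DTD validator performs these checks, and more, from the declarations themselves, which is why the hand-written version is only useful for understanding the content model.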

####XML Schema

XML schemas are a more powerful alternative to the above document type definition. Instead of using DTD syntax to express a rule set, XML schemas are themselves XML documents. This allows the corresponding rule set to be as granular as needed, for example by constraining the data type of an element's text. Also, schemas can be implemented inline, or as an external dataset.xsd.

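The following dataset.xsd is roughly equivalent to the above dataset.dtd. It differs in two places: dependent-variable may repeat (maxOccurs='unbounded'), and value must be a decimal number rather than arbitrary text: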

<?xml version='1.0'?>

<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='dataset'>
    <xs:complexType>

      <xs:sequence>
        <xs:element name='observation' maxOccurs='unbounded'>
          <xs:complexType>

            <xs:sequence>
              <xs:element name='dependent-variable' maxOccurs='unbounded' type='xs:string'/>
              <xs:element name='independent-variable' maxOccurs='unbounded'>
                <xs:complexType>

                  <xs:sequence>
                    <xs:element name='label' type='xs:string'/>
                    <xs:element name='value' type='xs:decimal'/>
                  </xs:sequence>

                </xs:complexType>
              </xs:element>
            </xs:sequence>

           </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

To apply the above XML schema, the following needs to be present at the top of each corresponding XML document:

<?xml version='1.0'?>

<dataset xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xsi:noNamespaceSchemaLocation='dataset.xsd'>
...
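
A validation script may want to discover which schema a document points to. The sketch below, which assumes Python's standard-library `xml.etree.ElementTree`, reads the xsi:noNamespaceSchemaLocation attribute from the root element. The `schema_location` helper and the `DOCUMENT` string are hypothetical names used only here.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the xsi prefix in the snippet above.
XSI = 'http://www.w3.org/2001/XMLSchema-instance'


def schema_location(xml_text):
    """Return the xsi:noNamespaceSchemaLocation value of the root element, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree stores namespaced attribute names as '{namespace-uri}local-name'.
    return root.get('{%s}noNamespaceSchemaLocation' % XSI)


DOCUMENT = """<dataset xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xsi:noNamespaceSchemaLocation='dataset.xsd'>
</dataset>"""

if __name__ == '__main__':
    print(schema_location(DOCUMENT))
```

The attribute is only a hint to the validator; the caller still decides whether to load the referenced file.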

A second alternative is an inline XML schema:

<?xml version='1.0'?>

<dataset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="#mySchema">
  <xs:schema id="mySchema" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name='dataset'>
    <xs:complexType>

      <xs:sequence>
        <xs:element name='observation' maxOccurs='unbounded'>
          <xs:complexType>

            <xs:sequence>
              <xs:element name='dependent-variable' maxOccurs='unbounded' type='xs:string'/>
              <xs:element name='independent-variable' maxOccurs='unbounded'>
                <xs:complexType>

                  <xs:sequence>
                    <xs:element name='label' type='xs:string'/>
                    <xs:element name='value' type='xs:decimal'/>
                  </xs:sequence>

                </xs:complexType>
              </xs:element>
            </xs:sequence>

           </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
 </xs:schema>

  <observation>
    <dependent-variable>dep-variable-1</dependent-variable>
    <independent-variable>
      <label>indep-variable-1</label>
      <value>23.45</value>
    </independent-variable>
...
  </observation>
...
</dataset>

Note: the above inline schema is equivalent to the former external dataset.xsd.

Note: the following may be reviewed for more explicit syntax understanding:

####Validation

Validation requires additional logic. The examples below use Python's lxml library to perform the validation.

DTD Validation:

from lxml import etree, objectify

# Load the DTD, then parse the document to be checked.
with open('schema.dtd', 'rb') as dtd_file:
    dtd = etree.DTD(dtd_file)
with open('document.xml', 'rb') as xml_file:
    tree = objectify.parse(xml_file)

valid = dtd.validate(tree)

if valid:
    print('XML was valid!')
else:
    print('XML was not valid!')
    # Report each validation error with its position in the document.
    for error in dtd.error_log.filter_from_errors():
        print('Error on line %s:%s, %s' % (error.line, error.column, error.message))

Schema Validation:

import sys

from lxml import etree

schema = etree.parse('schema.xsd')
xmlschema = etree.XMLSchema(schema)

try:
    document = etree.parse('document.xml')
    print('Parse complete!')
except etree.XMLSyntaxError as e:
    # A document that is not well-formed cannot be validated, so stop here.
    print(e)
    sys.exit(1)

valid = xmlschema.validate(document)

if valid:
    print('XML was valid!')
else:
    # Report each validation error with its position in the document.
    for error in xmlschema.error_log:
        print('Error on line %s:%s, %s' % (error.line, error.column, error.message))

Note: each entry in an lxml error log exposes further attributes (for example filename, level_name and type_name) that can be included in the report.
