XML

##Overview

XML is a markup language used to encode data in a document. It is both human-readable and machine-readable. One benefit of this markup language is that it has no predefined tags: the author of a given XML document may create any tags needed to express the logical structure of the data.

###Sample Document

<?xml version='1.0'?>

<!-- Sample Dataset-->
<dataset>
  <observation>
    <dependent-variable>James Blonde</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>0034773019</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>

  <observation>
    <dependent-variable>Boston Powers</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>007000007</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>

  ...
</dataset>
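As a rough illustration of how a program might read a document shaped like the sample above, the following sketch uses Python's standard-library `xml.etree.ElementTree`. The `SAMPLE` value is a trimmed copy of the sample (only the first observation, with the elided `...` portion left out), and the function name `read_observations` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample dataset (first observation only).
SAMPLE = b"""<?xml version='1.0'?>
<dataset>
  <observation>
    <dependent-variable>James Blonde</dependent-variable>
    <independent-variable>
      <label>SSN</label>
      <value>0034773019</value>
    </independent-variable>
    <independent-variable>
      <label>Salary</label>
      <value>88500</value>
    </independent-variable>
  </observation>
</dataset>
"""


def read_observations(xml_bytes):
    """Return a list of (dependent, {label: value}) tuples."""
    root = ET.fromstring(xml_bytes)
    observations = []
    for obs in root.findall('observation'):
        dependent = obs.findtext('dependent-variable')
        # Map each independent-variable label to its value (both as text).
        independents = {
            var.findtext('label'): var.findtext('value')
            for var in obs.findall('independent-variable')
        }
        observations.append((dependent, independents))
    return observations


if __name__ == '__main__':
    for dependent, independents in read_observations(SAMPLE):
        print(dependent, independents)
```

All values come back as text, so the leading zeros in the SSN value are preserved; converting Salary to a number would be up to the caller.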

###XML Declaration

An XML document may begin with an optional declaration. If one is used, nothing may precede it, not even whitespace or comments.

Generally, an XML declaration is as follows:

<?xml version='1.0'?>

where the version attribute indicates the XML version being used. An optional encoding attribute may be defined in the same declaration; it indicates the character encoding used in the XML document:

<?xml version='1.0' encoding='UTF-8'?>

The XML standard requires all XML processors to understand both UTF-8 and UTF-16. When the encoding attribute is not defined, and neither a byte order mark nor external encoding information says otherwise, the document is treated as UTF-8.

Note: an XML declaration is case sensitive, and cannot begin as <?XML ..?>.
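The placement and case rules for the declaration can be tried out with a small sketch like the one below. It assumes Python's standard-library `xml.etree.ElementTree` parser; the helper name `is_well_formed` and the four test documents are invented for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_bytes):
    """Return True if the standard-library parser accepts the bytes."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# Declaration at the very start of the document: accepted.
ok = b"<?xml version='1.0' encoding='UTF-8'?><dataset/>"

# A single space before the declaration: rejected.
leading_space = b" <?xml version='1.0'?><dataset/>"

# A comment before the declaration: rejected.
leading_comment = b"<!-- note --><?xml version='1.0'?><dataset/>"

# Upper-case target: rejected, since the declaration is case sensitive.
upper_case = b"<?XML version='1.0'?><dataset/>"

if __name__ == '__main__':
    cases = [('ok', ok), ('leading_space', leading_space),
             ('leading_comment', leading_comment), ('upper_case', upper_case)]
    for name, doc in cases:
        print(name, is_well_formed(doc))
```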

###XML Document

An XML document is syntactically similar to HTML. HTML, however, was designed to display data (presentation), whereas XML was designed to describe data, with a focus on what the data means.

XML syntax requirements:

- An XML document must have exactly one root element (see <dataset> in the sample above)
- The root element encapsulates all other elements
- XML element names are case sensitive
- Every XML element with an opening tag must have a corresponding closing tag (an empty element may instead be written as a single self-closing tag, i.e. <xxx/>)
- A closing tag must contain a slash (i.e. </xxx>)
- XML elements may be nested
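The requirements above are what a parser checks when it decides whether a document is well formed. The sketch below feeds a few deliberately broken snippets to Python's standard-library parser; the helper name `parse_error` and the snippets in `CASES` are made up for illustration, and the exact error wording depends on the parser.

```python
import xml.etree.ElementTree as ET


def parse_error(xml_text):
    """Return the parser's error message, or None if the text is well formed."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as exc:
        return str(exc)


CASES = {
    'single root, nested elements': '<dataset><observation/></dataset>',
    'two root elements': '<dataset/><dataset/>',
    'case mismatch in tags': '<dataset></Dataset>',
    'missing closing tag': '<dataset><observation></dataset>',
}

if __name__ == '__main__':
    for description, text in CASES.items():
        print(description, '->', parse_error(text) or 'well formed')
```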

###XML Validation

An XML document can generally be validated against a document type definition or an XML schema. Both choices require validation logic, which compares the XML document against the defined rule set (i.e. DTD, XML schema).

####Document Type Definition

A document type definition (DTD) defines the following properties:

- what elements are allowed in the XML document
- what attributes each element is allowed to have
- the ordering and nesting of these elements

DTDs are declared within the DOCTYPE declaration, below the XML declaration.

The following is an example of an inline definition:

<?xml version='1.0' encoding='UTF-8'?>

<!DOCTYPE documentelement [definition]>

while the following is an example of an external definition:

<?xml version="1.0"?> 

<!DOCTYPE documentelement SYSTEM "https://localhost/dataset.dtd">

In both cases the rule set is the same: it is either written in place of definition (inside the square brackets), or saved as dataset.dtd, as follows:

<!ELEMENT dataset (observation+)>
<!ELEMENT observation (dependent-variable,independent-variable+)>
<!ELEMENT dependent-variable (#PCDATA)>
<!ELEMENT independent-variable (label,value)>
<!ELEMENT label (#PCDATA)>
<!ELEMENT value (#PCDATA)>

The above DTD defines the following structure:

- a dataset contains at least one observation
- an observation contains one dependent-variable, followed by at least one independent-variable
- a dependent-variable contains PCDATA text
- an independent-variable contains a label, followed by a value
- both label and value contain PCDATA text (parsed character data)

Note: #PCDATA means the corresponding text is parsed by the parser, so entity references inside it are expanded. CDATA is not a valid content model for an element; in a DTD it is an attribute type, used in ATTLIST declarations. Another alternative is ANY, which means an element may contain any content.

Note: if observation+ was replaced with observation*, then there would be 0 or more observations.
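To make the structure rules above concrete, here is a hand-written check that mirrors them using only Python's standard library. It is a sketch of what a DTD validator verifies, not a replacement for one: it ignores attributes and does not confirm that the leaf elements contain only text. The function name `check_structure` and the messages are invented for this example.

```python
import xml.etree.ElementTree as ET


def check_structure(xml_text):
    """Return a list of rule violations; an empty list means none were found."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != 'dataset':
        problems.append('root element must be dataset')
    observations = list(root)
    if not observations:
        problems.append('dataset needs at least one observation')
    for obs in observations:
        if obs.tag != 'observation':
            problems.append('unexpected element in dataset: ' + obs.tag)
            continue
        tags = [child.tag for child in obs]
        # One dependent-variable first, then one or more independent-variable.
        expected = ['dependent-variable'] + ['independent-variable'] * (len(tags) - 1)
        if len(tags) < 2 or tags != expected:
            problems.append('observation must hold one dependent-variable '
                            'followed by at least one independent-variable')
        for var in obs.findall('independent-variable'):
            if [child.tag for child in var] != ['label', 'value']:
                problems.append('independent-variable must hold label, then value')
    return problems


if __name__ == '__main__':
    print(check_structure('<dataset/>'))
```

In practice a real validator does this work from the DTD itself, so rules like these do not have to be re-coded by hand.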

####XML Schema

XML schemas are a more powerful alternative to the above document type definition. Instead of using the DTD syntax to define a rule set, XML schemas are themselves XML documents, and they support data types. This allows the corresponding rule set to be as granular as needed. Also, schemas can be implemented inline, or as an external dataset.xsd.

The following dataset.xsd is roughly equivalent to the above dataset.dtd, except that it additionally restricts value to a decimal number:

<?xml version='1.0'?>

<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='dataset'>
    <xs:complexType>

      <xs:sequence>
        <xs:element name='observation' maxOccurs='unbounded'>
          <xs:complexType>

            <xs:sequence>
              <xs:element name='dependent-variable' type='xs:string'/>
              <xs:element name='independent-variable' maxOccurs='unbounded'>
                <xs:complexType>

                  <xs:sequence>
                    <xs:element name='label' type='xs:string'/>
                    <xs:element name='value' type='xs:decimal'/>
                  </xs:sequence>

                </xs:complexType>
              </xs:element>
            </xs:sequence>

          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
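Because an XML schema is itself an XML document, it can be read with the same tools as any other XML file. The sketch below lists the element declarations in a condensed copy of the schema above, using Python's standard-library parser. The names `XSD` and `declared_elements` are made up for this example, and the copy omits everything except element names and types.

```python
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as {namespace-uri}local-name.
XS = '{http://www.w3.org/2001/XMLSchema}'

# Condensed copy of dataset.xsd (whitespace collapsed for brevity).
XSD = """<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='dataset'>
    <xs:complexType><xs:sequence>
      <xs:element name='observation' maxOccurs='unbounded'>
        <xs:complexType><xs:sequence>
          <xs:element name='dependent-variable' type='xs:string'/>
          <xs:element name='independent-variable' maxOccurs='unbounded'>
            <xs:complexType><xs:sequence>
              <xs:element name='label' type='xs:string'/>
              <xs:element name='value' type='xs:decimal'/>
            </xs:sequence></xs:complexType>
          </xs:element>
        </xs:sequence></xs:complexType>
      </xs:element>
    </xs:sequence></xs:complexType>
  </xs:element>
</xs:schema>"""


def declared_elements(xsd_text):
    """Return (name, type) pairs for every xs:element, in document order.

    The type is None when the element declares an anonymous complexType.
    """
    root = ET.fromstring(xsd_text)
    return [(el.get('name'), el.get('type')) for el in root.iter(XS + 'element')]


if __name__ == '__main__':
    for name, type_name in declared_elements(XSD):
        print(name, type_name)
```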

To apply the above XML schema, the root element of each corresponding XML document needs to reference it as follows:

<?xml version='1.0'?>

<dataset xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
xsi:noNamespaceSchemaLocation='dataset.xsd'>
...
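The schema reference is an ordinary namespaced attribute on the root element, so a program can read it back. The following sketch does so with Python's standard library; the helper name `schema_hint` and the small `DOC` string are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the xsi prefix in the document.
XSI = 'http://www.w3.org/2001/XMLSchema-instance'

DOC = """<?xml version='1.0'?>
<dataset xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'
         xsi:noNamespaceSchemaLocation='dataset.xsd'>
  <observation/>
</dataset>"""


def schema_hint(xml_text):
    """Return the xsi:noNamespaceSchemaLocation value on the root, or None."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced attributes as {namespace-uri}local-name.
    return root.get('{%s}noNamespaceSchemaLocation' % XSI)


if __name__ == '__main__':
    print(schema_hint(DOC))
```

Note that the attribute only tells a validator where the schema may be found; reading it does not validate anything by itself.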

A second alternative is an inline XML schema:

<?xml version='1.0'?>

<dataset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="#mySchema">
  <xs:schema id="mySchema" xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name='dataset'>
    <xs:complexType>

      <xs:sequence>
        <xs:element name='observation' maxOccurs='unbounded'>
          <xs:complexType>

            <xs:sequence>
              <xs:element name='dependent-variable' type='xs:string'/>
              <xs:element name='independent-variable' maxOccurs='unbounded'>
                <xs:complexType>

                  <xs:sequence>
                    <xs:element name='label' type='xs:string'/>
                    <xs:element name='value' type='xs:decimal'/>
                  </xs:sequence>

                </xs:complexType>
              </xs:element>
            </xs:sequence>

          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
 </xs:schema>

  <observation>
    <dependent-variable>dep-variable-1</dependent-variable>
    <independent-variable>
      <label>indep-variable-1</label>
      <value>23.45</value>
    </independent-variable>
...
  </observation>
...
</dataset>

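Note: the above inline schema declares the same rules as the former external dataset.xsd. Support for inline schemas varies between validators, so the external form is generally the more portable choice.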

Note: the following topics may be reviewed for a more explicit understanding of the syntax:

- schema (including namespacing)
  - noNamespaceSchemaLocation
- complexType
- element
- type attribute
- indicators

####Validation

Validation requires additional logic. The examples below use Python's lxml library to perform the validation.

DTD Validation:

from lxml import etree, objectify

# Load the DTD rule set, then parse the document to validate
with open('schema.dtd', 'rb') as dtd_file:
    dtd = etree.DTD(dtd_file)
with open('document.xml', 'rb') as xml_file:
    tree = objectify.parse(xml_file)
valid = dtd.validate(tree)

if valid:
    print('XML was valid!')
else:
    print('XML was not valid!')
    # Report only the entries that are actual errors
    for error in dtd.error_log.filter_from_errors():
        print('Error on line %s:%s, %s' % (error.line, error.column, error.message))

Schema Validation:

from lxml import etree

# Compile the rule set from schema.xsd
schema = etree.parse('schema.xsd')
xmlschema = etree.XMLSchema(schema)

try:
    document = etree.parse('document.xml')
    print('Parse complete!')
except etree.XMLSyntaxError as e:
    # A document that is not well formed cannot be validated
    print(e)
    raise SystemExit(1)

valid = xmlschema.validate(document)

if valid:
    print('XML was valid!')
else:
    for error in xmlschema.error_log:
        print('Error on line %s:%s, %s' % (error.line, error.column, error.message))

Note: each entry in the lxml error log exposes additional attributes, such as level_name, domain_name, type_name, and filename.
