Structure of ALTO files

Nikolay Karelin edited this page Sep 28, 2017 · 3 revisions

Documentation

This document contains a listing of elements and their related attributes in ALTO version 2.0 with values or value sources where applicable. It is an "outline" of the schema, detailed by:

Root <alto> element
Top-level ALTO elements
    <Description> elements
    <Styles> elements
    <Layout> elements
ALTO attributes

ALTO requires use of the <Layout> element as a child under the root <alto> element. The <Layout> element requires use of a <Page> child element, which must carry a valid ID attribute value and a PHYSICAL_IMG_NR attribute value.

The 2.0 schema now has a target namespace URI: http://www.loc.gov/standards/alto/ns-v2#, to reflect that the standard is now maintained by the Library of Congress. The previous namespace URI reflected maintenance by CCS.
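
As an illustration of the namespace and of the elements marked "Required: Yes" on this page, the following is a minimal sketch in Python using only the standard library. The sample document and the helper name `check_required_structure` are hypothetical. The check covers only the basics described here and is not a substitute for validating against the schema.

```python
import xml.etree.ElementTree as ET

# Namespace URI for ALTO 2.0, as given in the text above.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
NS = {"alto": ALTO_NS}

# Hypothetical minimal document: only the required Layout and Page elements.
MINIMAL_ALTO = f"""<alto xmlns="{ALTO_NS}">
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1"/>
  </Layout>
</alto>"""


def check_required_structure(xml_text):
    """Return a list of problems; an empty list means the basics are present."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.tag != f"{{{ALTO_NS}}}alto":
        problems.append("root element is not alto in the v2 namespace")
    layout = root.find("alto:Layout", NS)
    if layout is None:
        problems.append("missing Layout")
        return problems
    pages = layout.findall("alto:Page", NS)
    if not pages:
        problems.append("Layout has no Page")
    for page in pages:
        for attr in ("ID", "PHYSICAL_IMG_NR"):
            if attr not in page.attrib:
                problems.append(f"Page is missing {attr}")
    return problems


print(check_required_structure(MINIMAL_ALTO))
```

Files written against the earlier namespace would fail the first check, so a real tool would probably need to accept more than one namespace URI.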

Root element in ALTO element set

alto

  • Required: Yes.
  • Usage: Root Element for bundling text layout technical metadata.
  • Attributes: None.
  • Contains AS SEQUENCE: Description, Styles, Layout.
  • Contained by: None.

Top-level ALTO elements

These elements are direct children of the root element. The sorting is based on the accepted sequence in which they may be used.

Description

  • Required: No.
  • Usage: Describes general settings of the alto file like measurement units and metadata.
  • Attributes: None.
  • Contains AS SEQUENCE: MeasurementUnit, sourceImageInformation, OCRProcessing.
  • Contained by: alto.

Styles

  • Required: No.
  • Usage: Styles define properties of layout elements. A style defined in a parent element is used as the default style for all related child elements.
  • Attributes: None.
  • Contains AS SEQUENCE: TextStyle, ParagraphStyle.
  • Contained by: alto.

Layout

  • Required: Yes.
  • Usage: The root Layout element.
  • Attributes: STYLEREFS.
  • Contains AS SEQUENCE: Page.
  • Contained by: alto.
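
The "Contains AS SEQUENCE" and "Required" entries above can be read as an ordering rule: the children of alto appear in the order Description, Styles, Layout, and only Layout must be present. A rough sketch of that rule follows. The function names are hypothetical, and it assumes each top-level element occurs at most once.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"

# Order taken from the "Contains AS SEQUENCE" entry of the alto element.
EXPECTED_ORDER = ["Description", "Styles", "Layout"]


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def top_level_in_sequence(root):
    """True if the children of root follow the expected order and include Layout."""
    names = [local_name(child.tag) for child in root]
    if any(name not in EXPECTED_ORDER for name in names):
        return False
    positions = [EXPECTED_ORDER.index(name) for name in names]
    # Strictly increasing positions: correct order, no element repeated
    # (the at-most-once part is an assumption of this sketch).
    in_order = all(a < b for a, b in zip(positions, positions[1:]))
    return in_order and "Layout" in names


good = ET.fromstring(f'<alto xmlns="{ALTO_NS}"><Description/><Layout/></alto>')
bad = ET.fromstring(f'<alto xmlns="{ALTO_NS}"><Layout/><Styles/></alto>')
print(top_level_in_sequence(good), top_level_in_sequence(bad))
```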

<Description> elements

These elements are contained by the <Description> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.

MeasurementUnit

  • Required: No.
  • Usage: All measurement values in the ALTO file, except font size, are expressed in this unit. The default is 1/10 mm (mm10).
  • Attributes: None.
  • Contains ENUMERATED VALUES: dpi, pixel, mm10, inch1200.
  • Contained by: Description.
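
To make the units concrete, here is a small conversion sketch. It assumes that mm10 means tenths of a millimetre and inch1200 means 1/1200 of an inch, as the value names suggest, and that pixel values can only be converted when the scan resolution is known. The `dpi` parameter and the function name are hypothetical, and the dpi enumeration value is deliberately left unsupported because this page does not define how it scales.

```python
MM_PER_INCH = 25.4


def to_millimetres(value, unit="mm10", dpi=None):
    """Convert an ALTO measurement to millimetres (sketch, see assumptions above)."""
    if unit == "mm10":
        # Tenths of a millimetre, the default unit.
        return value / 10
    if unit == "inch1200":
        # 1/1200 of an inch.
        return value / 1200 * MM_PER_INCH
    if unit == "pixel":
        # Pixels carry no physical size without a resolution.
        if dpi is None:
            raise ValueError("pixel values need a resolution (dpi) to convert")
        return value / dpi * MM_PER_INCH
    raise ValueError(f"unsupported unit: {unit!r}")


# A width of 2100 in the default unit corresponds to 210 mm.
print(to_millimetres(2100))
```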

sourceImageInformation

  • Required: No.
  • Usage: Information to identify the image file from which the OCR text was created.
  • Attributes: None.
  • Contains SEQUENCE: fileName, fileIdentifier.
  • Contained by: Description.

OCRProcessing

  • Required: No.
  • Usage: Information on how the text was created, including preprocessing, OCR processing, and postprocessing steps. Where possible, this draws from MIX's change history.
  • Attributes: ID.
  • Contains: preProcessingStep, ocrProcessingStep, postProcessingStep.
  • Contained by: Description.

<Styles> elements

These elements are contained by the <Styles> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.

TextStyle

  • Required: No.
  • Usage: A text style defines font properties of text.
  • Attributes: ID, FONTWIDTH, FONTTYPE, FONTSTYLE, FONTFAMILY, FONTCOLOR, FONTSIZE.
  • Contains: EMPTY ELEMENT.
  • Contained by: Styles.

ParagraphStyle

  • Required: No.
  • Usage: A paragraph style defines formatting properties of text blocks.
  • Attributes: ID, RIGHT, LEFT, ALIGN, LINESPACE, FIRSTLINE.
  • Contains: EMPTY ELEMENT.
  • Contained by: Styles.

<Layout> elements

These elements are contained by the <Layout> element underneath <alto>. The sorting is based on the accepted sequence in which they may be used.

Page

  • Required: Yes.
  • Usage: One page of a book or journal.
  • Attributes: ID, PHYSICAL_IMG_NR, PRINTED_IMG_NR, PAGECLASS, PROCESSING, STYLEREFS, HEIGHT, WIDTH, QUALITY, POSITION.
  • Contains SEQUENCE: TopMargin, LeftMargin, RightMargin, BottomMargin, PrintSpace.
  • Contained by: Layout.
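
A sketch of how the Page listing might be consumed: it walks each Page, picks up a few of the attributes named above, and records which child regions are present. The sample document is hypothetical and has not been validated against the schema. The region elements are left without attributes because this page does not list any for them.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"
NS = {"alto": ALTO_NS}

# Hypothetical sample; attribute values are made up for illustration.
SAMPLE_PAGE_DOC = f"""<alto xmlns="{ALTO_NS}">
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" HEIGHT="2970" WIDTH="2100">
      <TopMargin/>
      <PrintSpace/>
    </Page>
  </Layout>
</alto>"""


def summarise_pages(xml_text):
    """Return one dict per Page with selected attributes and child region names."""
    root = ET.fromstring(xml_text)
    summary = []
    for page in root.iterfind("alto:Layout/alto:Page", NS):
        # Child tags come back as '{namespace}Name'; keep only the local name.
        regions = [child.tag.rsplit("}", 1)[-1] for child in page]
        summary.append({
            "id": page.get("ID"),
            "image_nr": page.get("PHYSICAL_IMG_NR"),
            "height": page.get("HEIGHT"),  # attribute values are strings
            "width": page.get("WIDTH"),
            "regions": regions,
        })
    return summary


for entry in summarise_pages(SAMPLE_PAGE_DOC):
    print(entry)
```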

ALTO attributes

These attributes may appear on given elements within ALTO. The sorting is alphabetical.

ID

  • Usage: A valid identifier as defined by the XML Schema specification.
  • Contained by: OCRProcessing, TextStyle, ParagraphStyle, Page.

STYLEREFS

  • Usage: To bind to IDREFs of various Text* elements.
  • Contained by: Layout.