AMI2-SVG2XML Semantic document
==============================
Rough nesting is shown. Element order determines the order of operations. Checking of children is simplistic.
Most attributes are specific to a given element. Some (e.g. debug) are global.
Items marked "may" are lower-priority ideas.
<semanticDocument>
Holds global vars in name-value store.
Uses s.*, d.*, p.* for vars with total, doc, and page scope
Vars can be
(1) injected by commandline
(2) explicit attributes on elements
(3) computed
Accessible to all elements
the run() command sets these values:
s.inputfile
s.outputfile
before analyzing the commands
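A minimal sketch of the scoped name-value store described above, assuming the prefix before the first dot selects the scope. The class name VarStore and its methods are hypothetical and are not taken from AMI2.

```python
class VarStore:
    """Hypothetical name-value store keyed by scope prefix (s., d., p.)."""

    SCOPES = ("s", "d", "p")

    def __init__(self):
        # One dictionary per scope.
        self._vars = {scope: {} for scope in self.SCOPES}

    def set(self, name, value):
        # "s.inputfile" -> scope "s", key "inputfile"
        scope, _, key = name.partition(".")
        if scope not in self._vars or not key:
            raise KeyError(f"unknown variable name: {name!r}")
        self._vars[scope][key] = value

    def get(self, name, default=None):
        scope, _, key = name.partition(".")
        return self._vars.get(scope, {}).get(key, default)

    def clear_scope(self, scope):
        """Drop all vars of one scope, e.g. 'p' when moving to the next page."""
        self._vars[scope].clear()


store = VarStore()
store.set("s.inputfile", "in.pdf")    # as run() might do before analysis
store.set("s.outputfile", "out.xml")
store.set("p.number", 3)
store.clear_scope("p")                # page-scoped vars vanish, others stay
```

Keeping one dictionary per scope makes it cheap to discard page or document variables when the corresponding iterator advances.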
<documentIterator>
determines policy of which documents to process.
$d.indir (input directory, may have recursive descent)
$d.outdir (output dir, may be computed or default. Always one directory per PDF)
$d.docpattern (Pattern of documents to filter, default *//*.pdf)
$d.maxdocs (maximum number of documents to process)
Always holds SVG in memory. May write intermediate files if directed
May create overall fontList from all documents (and can write it)
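The document selection policy above could be sketched as follows. This is an illustration only: the function iter_documents, its parameters, and the simple "*.pdf" pattern are assumptions, not the AMI2 API or its default pattern.

```python
from itertools import islice
from pathlib import Path


def iter_documents(indir, outdir, pattern="*.pdf", maxdocs=None, recursive=True):
    """Yield (pdf_path, per_document_output_dir) pairs.

    One output directory per PDF, named after the file stem (an assumption).
    """
    root = Path(indir)
    found = root.rglob(pattern) if recursive else root.glob(pattern)
    # islice with stop=None yields everything, so maxdocs is optional.
    for pdf in islice(sorted(found), maxdocs):
        yield pdf, Path(outdir) / pdf.stem
```

A caller would loop over the pairs, for example `for pdf, out in iter_documents("papers", "out", maxdocs=10): ...`.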
<documentAnalyzer>
holds per-document info; starts with List<SVG>
may guess publisher/source
may make a two-pass analysis to tune parameters
<pageIterator>
iterates over pages in the current document
pages are selected by regex or other selector
<normalizeCharacters>
check and transform unusual or ugly characters (e.g. ligatures)
reduce to ASCII where reasonable
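One possible way to do this kind of character normalization is Unicode compatibility decomposition (NFKD). This is a swapped-in technique for illustration, not necessarily what AMI2 does, and the function name normalize_characters is hypothetical.

```python
import unicodedata


def normalize_characters(text):
    """Expand compatibility characters (e.g. ligatures) and drop combining accents."""
    # NFKD splits ligatures such as U+FB01 into "f" + "i" and separates
    # accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove the combining marks; characters with no decomposition are kept.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Characters without an ASCII equivalent (Greek letters, for example) pass through unchanged, which matches the "where reasonable" caveat.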
<normaliseRotatedText>
<normalizeGraphicsPrimitives>
create rect, circle, line, polyline, symbols where possible
<whitespaceChunker>
currently 3 passes (horiz, vert, horiz).
Uses default gap settings (e.g. 10 pixels).
May be refined by multipass extraction of gaps
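A rough sketch of gap-based chunking on bounding boxes follows. It assumes boxes are (x0, y0, x1, y1) tuples and that "horiz" means splitting on horizontal bands of whitespace (gaps in y) and "vert" means splitting on vertical gutters (gaps in x); the function names are hypothetical.

```python
def split_boxes(boxes, axis, min_gap=10.0):
    """Split boxes (x0, y0, x1, y1) into groups separated by a gap along one axis.

    axis=0 looks for vertical gutters (gaps in x), axis=1 for horizontal
    bands of whitespace (gaps in y).
    """
    groups, reach = [], None
    for box in sorted(boxes, key=lambda b: b[axis]):
        lo, hi = box[axis], box[axis + 2]
        if reach is None or lo - reach >= min_gap:
            groups.append([])          # gap found: start a new group
            reach = hi
        else:
            reach = max(reach, hi)     # extend the extent covered so far
        groups[-1].append(box)
    return groups


def chunk_page(boxes, min_gap=10.0):
    """Three passes: split on y gaps, then x gaps, then y gaps again."""
    chunks = [list(boxes)]
    for axis in (1, 0, 1):
        chunks = [part for chunk in chunks
                  for part in split_boxes(chunk, axis, min_gap)]
    return chunks
```

For a page with a full-width title above two columns, the first pass separates the title from the body and the second separates the columns.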
<suscriptProcessor>
works within a chunk
results in escaped <sub>I am a subscript</sub> for possible text searching
<createLines>
processes characters on geometric basis into words and lines
may need to create experimental per-character widths
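The geometric grouping of characters into words and lines might look like the sketch below. It assumes each character arrives as an (x, y, width, char) tuple with y as the baseline; the tolerances and the function name create_lines are illustrative guesses.

```python
def create_lines(chars, y_tol=1.0, space_gap=2.0):
    """Group (x, y, width, char) tuples into text lines.

    Characters whose baselines differ by at most y_tol share a line; a
    horizontal gap wider than space_gap between neighbours starts a new word.
    """
    lines = []
    for ch in sorted(chars, key=lambda c: (c[1], c[0])):
        if lines and abs(ch[1] - lines[-1][-1][1]) <= y_tol:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    out = []
    for line in lines:
        line.sort(key=lambda c: c[0])      # left to right within the line
        text, prev_end = "", None
        for x, _, width, char in line:
            if prev_end is not None and x - prev_end > space_gap:
                text += " "                # wide gap: word boundary
            text += char
            prev_end = x + width
        out.append(text)
    return out
```

The width field is where experimental per-character widths would plug in when the source does not supply reliable ones.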
<chunkSemantics>
annotates chunks as text, headings, pageDecor, maths, chemistry (maybe), Figures, Captions, Tables
<mergeText>
merges the text chunks in the page
chunks must be contiguous and share a common font size.
may create paragraphs
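The merge rule above (contiguous chunks with a common font size) can be sketched as follows. The dict layout with 'text', 'fontsize', 'top' and 'bottom' keys, the tolerances, and the name merge_text are all assumptions made for the example.

```python
def merge_text(chunks, max_gap=5.0, size_tol=0.1):
    """Merge consecutive text chunks that are contiguous and share a font size.

    Each chunk is a dict with 'text', 'fontsize', 'top' and 'bottom' keys
    (a hypothetical layout, with y increasing down the page).
    """
    merged = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        if (prev is not None
                and abs(prev["fontsize"] - chunk["fontsize"]) <= size_tol
                and chunk["top"] - prev["bottom"] <= max_gap):
            prev["text"] += " " + chunk["text"]
            prev["bottom"] = chunk["bottom"]
        else:
            merged.append(dict(chunk))     # copy so the input is not mutated
    return merged
```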
</pageIterator>
<mergePages>
merge running text from contiguous pages
create list of figures, tables (check numbering)
creates XHTML, Figures, Tables with links
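As an illustration of the numbering check on figures and tables, the sketch below reports gaps in a list of captions. It assumes captions start with a label and a number, such as "Figure 3."; the function name missing_numbers is hypothetical.

```python
import re


def missing_numbers(captions, label="Figure"):
    """Return numbers absent from a run of captions such as 'Figure 3. ...'."""
    pattern = re.compile(rf"^{re.escape(label)}\s+(\d+)")
    seen = {int(m.group(1)) for c in captions if (m := pattern.match(c))}
    if not seen:
        return []
    # Expect numbering to run from 1 up to the highest number found.
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

A non-empty result suggests a figure or table was missed during chunking or spans a page boundary.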
<stmChunker>
breaks running text into logical chunks (Biblio, abstract, body (? subdivide), references)
<documentWriter>
XHTML, Menu (for per-page display), Fig/Table list, maybe PDF
</documentAnalyzer>
<domainPlugin>
discipline-specific plugins for specific chunks
figure -> CML chemistry
equations -> MathML
figure -> NEXML (phylo-tree)
figure -> XYPlot
</documentIterator>
</semanticDocument>