<style>/*
Some simple Github-like styles, with syntax highlighting CSS via Pygments.
*/
html {
background:#fff;
margin:0;
padding:0;
}
body {
font: 14px helvetica,arial,freesans,clean,sans-serif;
line-height: 1.6;
margin: 0 auto;
padding: 20px;
text-align:left;
color: #333;
width:920px;
}
.md {
background-color: #eee;
border-radius:3px;
margin: 0;
padding:3px;
}
article {
padding: 30px;
border: 1px solid #cacaca;
background-color: white;
}
article > :first-child {
margin-top: 0!important;
}
h1 {
font-size:28px;
margin-bottom:10px;
color: black;
}
h2 {
font-size:24px;
margin:20px 0 10px;
color: black;
border-bottom: 1px solid #ccc;
}
h3 {
font-size:18px;
margin:20px 0 10px;
}
h4 {
font-size:16px;
font-weight:bold;
margin:20px 0 10px;
}
h5 {
font-size:14px;
font-weight:bold;
margin:20px 0 10px;
}
h6 {
color:#777;
font-size:14px;
font-weight:bold;
margin:20px 0 10px;
}
hr {
background: transparent url() repeat-x 0 0;
border: 0 none;
height: 4px;
margin: 15px 0;
padding: 0;
}
p {
margin: 15px 0;
}
pre, code {
font: 12px 'Bitstream Vera Sans Mono','Courier',monospace;
}
.highlight pre, pre {
background-color:#f8f8f8;
border:1px solid #ccc;
font-size:13px;
line-height:19px;
overflow:auto;
border-radius:3px;
-moz-border-radius:3px;
-webkit-border-radius:3px;
padding:6px 10px;
}
code {
white-space:nowrap;
border:1px solid #eaeaea;
background-color:#f8f8f8;
border-radius:3px;
-moz-border-radius:3px;
-webkit-border-radius:3px;
margin:0 2px;
padding:0 5px;
}
pre>code
{
white-space:pre;
border:none;
background:transparent;
margin:0;
padding:0;
}
a, a code {
color: #4183C4;
text-decoration:none;
}
blockquote
{
border-left:4px solid #ddd;
padding-left:11px;
color:#555;
margin:14px 0;
}
table
{
font-size: 14px;
border-collapse:collapse;
margin:20px 0 0;
padding:0;
}
table tr
{
border-top:1px solid #ccc;
background-color:#fff;
margin:0;
padding:0;
}
table tr:nth-child(2n)
{
background-color:#f8f8f8;
}
table tr th[align="center"], table tr td[align="center"] {
text-align:center;
}
table tr th, table tr td
{
border:1px solid #ccc;
text-align:left;
margin:0;
padding:6px 13px;
}
ul, ol
{
margin:15px 0;
}
ul li, ol li
{
margin-top:7px;
margin-bottom:7px;
}
.shadow {
-webkit-box-shadow:0 5px 15px #000;
-moz-box-shadow:0 5px 15px #000;
box-shadow:0 5px 15px #000;
}
/* Pygments coloring */
.highlight .c{color:#998;font-style:italic;}
.highlight .err{color:#a61717;background-color:#e3d2d2;}
.highlight .k{font-weight:bold;}
.highlight .o{font-weight:bold;}
.highlight .cm{color:#998;font-style:italic;}
.highlight .cp{color:#999;font-weight:bold;}
.highlight .c1{color:#998;font-style:italic;}
.highlight .cs{color:#999;font-weight:bold;font-style:italic;}
.highlight .gd{color:#000;background-color:#fdd;}
.highlight .gd .x{color:#000;background-color:#faa;}
.highlight .ge{font-style:italic;}
.highlight .gr{color:#a00;}
.highlight .gh{color:#999;}
.highlight .gi{color:#000;background-color:#dfd;}
.highlight .gi .x{color:#000;background-color:#afa;}
.highlight .go{color:#888;}
.highlight .gp{color:#555;}
.highlight .gs{font-weight:bold;}
.highlight .gu{color:#800080;font-weight:bold;}
.highlight .gt{color:#a00;}
.highlight .kc{font-weight:bold;}
.highlight .kd{font-weight:bold;}
.highlight .kn{font-weight:bold;}
.highlight .kp{font-weight:bold;}
.highlight .kr{font-weight:bold;}
.highlight .kt{color:#458;font-weight:bold;}
.highlight .m{color:#099;}
.highlight .s{color:#d14;}
.highlight .na{color:#008080;}
.highlight .nb{color:#0086B3;}
.highlight .nc{color:#458;font-weight:bold;}
.highlight .no{color:#008080;}
.highlight .ni{color:#800080;}
.highlight .ne{color:#900;font-weight:bold;}
.highlight .nf{color:#900;font-weight:bold;}
.highlight .nn{color:#555;}
.highlight .nt{color:#000080;}
.highlight .nv{color:#008080;}
.highlight .ow{font-weight:bold;}
.highlight .w{color:#bbb;}
.highlight .mf{color:#099;}
.highlight .mh{color:#099;}
.highlight .mi{color:#099;}
.highlight .mo{color:#099;}
.highlight .sb{color:#d14;}
.highlight .sc{color:#d14;}
.highlight .sd{color:#d14;}
.highlight .s2{color:#d14;}
.highlight .se{color:#d14;}
.highlight .sh{color:#d14;}
.highlight .si{color:#d14;}
.highlight .sx{color:#d14;}
.highlight .sr{color:#009926;}
.highlight .s1{color:#d14;}
.highlight .ss{color:#990073;}
.highlight .bp{color:#999;}
.highlight .vc{color:#008080;}
.highlight .vg{color:#008080;}
.highlight .vi{color:#008080;}
.highlight .il{color:#099;}
</style><div class="md"><article>
<h1>Parse-EZ : Clojure Parser Library</h1>
<p><a href="http://www.protoflex.com/parse-ez/api-doc/protoflex.parse-api.html" title="Parse-EZ API">API Documentation</a></p>
<p>Parse-EZ is a parser library for Clojure programmers. It allows easy
mixing of declarative and imperative styles and does not
require any special constructs, macros, monads, etc. to write custom parsers.
All the parsing is implemented using regular Clojure functions.</p>
<p>The library provides a number of
parse functions and combinators and comes with a built-in customizable infix
expression parser and evaluator. It allows the programmer to concisely specify
the structure of input text using Clojure functions and easily build parse trees
without having to step out of Clojure. Whether you are writing a parser
for well-structured data, scraping data, or prototyping a new language,
you can make use of this library to quickly create a parser.</p>
<h2>Features</h2>
<ul>
<li>Parse functions and Combinators</li>
<li>Automatic handling of whitespace and comments</li>
<li>Marking positions and backtracking</li>
<li>Seek, read, skip string/regex patterns</li>
<li>Built-in customizable expression parser and evaluator</li>
<li>Exception-based error handling</li>
<li>Custom error messages</li>
</ul>
<h2>Usage</h2>
<h3>Installation</h3>
<p>Just add Parse-EZ as a dependency to your Leiningen project:</p>
<div class="highlight"><pre><span class="p">[</span><span class="nv">protoflex/parse-ez</span> <span class="s">"0.4.2"</span><span class="p">]</span>
</pre></div>
<p>and run</p>
<div class="highlight"><pre><span class="nv">lein</span> <span class="nv">deps</span>
</pre></div>
<h2>A Taste of Parse-EZ</h2>
<p>Here are a couple of sample parsers to give you a taste of the parser library.</p>
<h3>CSV Parser</h3>
<p>A CSV file contains multiple records, one record per line, with field values separated by a delimiter
such as a comma or a tab. The field values may optionally be quoted using either single or double
quotes. Quoted field values may contain the field-delimiter characters; in such
cases they are not treated as field separators.</p>
<p>First, let us define a parse function for parsing one line of a CSV file:</p>
<div class="highlight"><pre><span class="p">(</span><span class="kd">defn </span><span class="nv">csv-1</span> <span class="p">[</span><span class="nv">sep</span><span class="p">]</span>
<span class="p">(</span><span class="nf">sep-by</span> <span class="o">#</span><span class="p">(</span><span class="nf">any-string</span> <span class="nv">sep</span><span class="p">)</span> <span class="o">#</span><span class="p">(</span><span class="nf">chr</span> <span class="nv">sep</span><span class="p">)))</span>
</pre></div>
<p>In the above function definition, we make use of the parse combinator <code>sep-by</code>
which takes two arguments: the first one to read a field-value and the second
one to read the separator. Here, we have used Clojure's anonymous function shortcuts to
specify the desired behavior succinctly. The <code>any-string</code> function matches a single-quoted
string or a double-quoted string or a plain-string that is followed by the specified separator
<code>sep</code>. This is exactly the function that we need to read the field-value. The second argument
provided to <code>sep-by</code> above uses the primitive parse function <code>chr</code> which succeeds only when
the next character in the input matches its argument (the <code>sep</code> parameter in this case). The <em>csv-1</em> function returns the field values as a vector.</p>
<p>The <code>sep-by</code> function actually takes a third, optional argument as record-separator
function with the default value of a function that matches a newline. We didn't
pass the third argument above because the default behavior suits our purpose.
Had the default behavior of <code>sep-by</code> been different, we would have written the
above function as:</p>
<div class="highlight"><pre><span class="p">(</span><span class="kd">defn </span><span class="nv">csv-1</span> <span class="p">[</span><span class="nv">sep</span><span class="p">]</span>
<span class="p">(</span><span class="nf">sep-by</span> <span class="o">#</span><span class="p">(</span><span class="nf">any-string</span> <span class="nv">sep</span><span class="p">)</span> <span class="o">#</span><span class="p">(</span><span class="nf">chr</span> <span class="nv">sep</span><span class="p">)</span> <span class="o">#</span><span class="p">(</span><span class="nf">regex</span> <span class="o">#</span><span class="s">"\r?\n"</span><span class="p">)))</span>
</pre></div>
<p>Now that we have created a parse function to parse a single line of a CSV
file, let us write another parse function that parses the entire CSV file
content and returns the result as a vector of vectors of field values
(one vector per record/line). All we need to do is to repeatedly apply the
<code>csv-1</code> function defined above, and the <code>multi*</code> parse combinator does
just that.</p>
<p>Just one small but important detail: by default, Parse-EZ
automatically trims whitespace after successfully applying a parse function.
This means that the newline at the end of line would be consumed after reading
the last field value and the <code>sep-by</code> would be unable to match the end-of-line
which is the record-separator in this case. So, we will disable the newline
trimming functionality using the <code>no-trim</code> combinator.</p>
<div class="highlight"><pre><span class="p">(</span><span class="kd">defn </span><span class="nv">csv</span> <span class="p">[</span><span class="nv">sep</span><span class="p">]</span>
<span class="p">(</span><span class="nf">multi*</span> <span class="p">(</span><span class="k">fn </span><span class="p">[]</span> <span class="p">(</span><span class="nf">no-trim</span> <span class="o">#</span><span class="p">(</span><span class="nf">csv-1</span> <span class="nv">sep</span><span class="p">)))))</span>
</pre></div>
<p>Alternatively, you can express the above function a bit more easily using the macro versions of combinators introduced in Version 0.3.0 as follows:</p>
<div class="highlight"><pre><span class="p">(</span><span class="kd">defn </span><span class="nv">csv</span> <span class="p">[</span><span class="nv">sep</span><span class="p">]</span>
<span class="p">(</span><span class="nf">multi*</span> <span class="p">(</span><span class="nf">no-trim_</span> <span class="p">(</span><span class="nf">csv-1</span> <span class="nv">sep</span><span class="p">))))</span>
</pre></div>
<p>Now, let us try out our csv parser. First let us define a couple of test
strings containing a couple of records (lines) each. Note that the second
string contains a comma inside the first cell (a quoted string). </p>
<div class="highlight"><pre><span class="nv">user></span> <span class="p">(</span><span class="k">def </span><span class="nv">s1</span> <span class="s">"1abc,def,ghi\n2jkl,mno,pqr\n"</span><span class="p">)</span>
<span class="o">#</span><span class="ss">'user/s1</span>
<span class="nv">user></span> <span class="p">(</span><span class="k">def </span><span class="nv">s2</span> <span class="s">"'1a,bc',def,ghi\n2jkl,mno,pqr\n"</span><span class="p">)</span>
<span class="o">#</span><span class="ss">'user/s2</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">csv</span> <span class="sc">\,</span><span class="p">)</span> <span class="nv">s1</span><span class="p">)</span>
<span class="p">[[</span><span class="s">"1abc"</span> <span class="s">"def"</span> <span class="s">"ghi"</span><span class="p">]</span> <span class="p">[</span><span class="s">"2jkl"</span> <span class="s">"mno"</span> <span class="s">"pqr"</span><span class="p">]]</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">csv</span> <span class="sc">\,</span><span class="p">)</span> <span class="nv">s2</span><span class="p">)</span>
<span class="p">[[</span><span class="s">"1a,bc"</span> <span class="s">"def"</span> <span class="s">"ghi"</span><span class="p">]</span> <span class="p">[</span><span class="s">"2jkl"</span> <span class="s">"mno"</span> <span class="s">"pqr"</span><span class="p">]]</span>
<span class="nv">user></span>
</pre></div>
<p>Well, all we had to do was to write two lines of Clojure code to implement the CSV parser.
Let's add a bit more functionality: the CSV files may use a comma or a tab character to
separate the field values. Let's say we don't know ahead of time which character
a file uses as a separator and we want to detect the separator automatically. Note
that both characters may occur in a data file, but only one acts as the field separator, and
only when it is not inside a quoted string.</p>
<p>Here is our strategy to detect the separator:</p>
<ul>
<li>if the first field value is quoted (single or double), read the quoted string</li>
<li>else, read until one of comma or tab occurs</li>
<li>the next char is our delimiter</li>
</ul>
<p>Here is the code:</p>
<div class="highlight"><pre><span class="p">(</span><span class="kd">defn </span><span class="nv">detect-sep</span> <span class="p">[]</span>
<span class="p">(</span><span class="k">let </span><span class="p">[</span><span class="nv">m</span> <span class="p">(</span><span class="nf">mark-pos</span><span class="p">)</span>
<span class="nv">s</span> <span class="p">(</span><span class="nf">attempt</span> <span class="o">#</span><span class="p">(</span><span class="nf">any</span> <span class="nv">dq-str</span> <span class="nv">sq-str</span><span class="p">))</span>
<span class="nv">s</span> <span class="p">(</span><span class="k">if </span><span class="nv">s</span> <span class="nv">s</span> <span class="p">(</span><span class="nf">no-trim</span> <span class="o">#</span><span class="p">(</span><span class="nf">read-to-re</span> <span class="o">#</span><span class="s">",|\t"</span><span class="p">)))</span>
<span class="nv">sep</span> <span class="p">(</span><span class="nf">read-ch</span><span class="p">)]</span>
<span class="p">(</span><span class="nf">back-to-mark</span> <span class="nv">m</span><span class="p">)</span>
<span class="nv">sep</span><span class="p">))</span>
</pre></div>
<p>Note how we used the <code>mark-pos</code> and <code>back-to-mark</code> Parse-EZ functions to 'unconsume'
the consumed input. </p>
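<p>The same mark-and-rewind pattern can be wrapped in a small helper. The sketch below is not part of Parse-EZ: <code>peek-with</code>, <code>detect-sep-2</code> and the namespace name are hypothetical, and the code uses only the calls that appear in <code>detect-sep</code> above. It assumes, as <code>detect-sep</code> does, that <code>attempt</code> returns <code>nil</code> when its parse function fails. Note that the position is not restored if the parse function throws.</p>
<div class="highlight"><pre>(ns example.peek                ; hypothetical namespace
  (:use [protoflex.parse]))

(defn peek-with
  "Hypothetical helper: runs parse-fn, rewinds the input to where it
  started, and returns whatever parse-fn returned."
  [parse-fn]
  (let [m      (mark-pos)
        result (parse-fn)]
    (back-to-mark m)
    result))

(defn detect-sep-2
  "Same steps as detect-sep, expressed with peek-with."
  []
  (peek-with
    (fn []
      ;; skip a quoted first field, or else read up to the first comma or tab
      (when-not (attempt #(any dq-str sq-str))
        (no-trim #(read-to-re #",|\t")))
      ;; the next character is the separator
      (read-ch))))
</pre></div>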
<p>The complete code for the sample CSV parser with the separator-detection functionality is
listed below (you can find it in the <code>csv_parse.clj</code> file under the <code>examples</code> directory).</p>
<div class="highlight"><pre><span class="p">(</span><span class="kd">ns </span><span class="nv">protoflex.examples.csv_parse</span>
<span class="p">(</span><span class="ss">:use</span> <span class="p">[</span><span class="nv">protoflex.parse</span><span class="p">]))</span>
<span class="p">(</span><span class="kd">declare </span><span class="nv">detect-sep</span> <span class="nv">csv-1</span><span class="p">)</span>
<span class="p">(</span><span class="kd">defn </span><span class="nv">csv</span>
<span class="s">"Reads and returns one or more records as a vector of vector of field-values"</span>
<span class="p">([]</span> <span class="p">(</span><span class="nf">csv</span> <span class="p">(</span><span class="nf">no-trim</span> <span class="o">#</span><span class="p">(</span><span class="nf">detect-sep</span><span class="p">))))</span>
<span class="p">([</span><span class="nv">sep</span><span class="p">]</span> <span class="p">(</span><span class="nf">multi*</span> <span class="p">(</span><span class="k">fn </span><span class="p">[]</span> <span class="p">(</span><span class="nf">no-trim-nl</span> <span class="o">#</span><span class="p">(</span><span class="nf">csv-1</span> <span class="nv">sep</span><span class="p">))))))</span>
<span class="p">(</span><span class="kd">defn </span><span class="nv">csv-1</span>
<span class="s">"Reads and returns the fields of one record (line)"</span>
<span class="p">[</span><span class="nv">sep</span><span class="p">]</span> <span class="p">(</span><span class="nf">sep-by</span> <span class="o">#</span><span class="p">(</span><span class="nf">any-string</span> <span class="nv">sep</span><span class="p">)</span> <span class="o">#</span><span class="p">(</span><span class="nf">chr</span> <span class="nv">sep</span><span class="p">)))</span>
<span class="p">(</span><span class="kd">defn </span><span class="nv">detect-sep</span>
<span class="s">"Detects the separator used in a csv file (a comma or a tab)"</span>
<span class="p">[]</span> <span class="p">(</span><span class="k">let </span><span class="p">[</span><span class="nv">m</span> <span class="p">(</span><span class="nf">mark-pos</span><span class="p">)</span>
<span class="nv">s</span> <span class="p">(</span><span class="nf">attempt</span> <span class="o">#</span><span class="p">(</span><span class="nf">any</span> <span class="nv">dq-str</span> <span class="nv">sq-str</span><span class="p">))</span>
<span class="nv">s</span> <span class="p">(</span><span class="k">if </span><span class="nv">s</span> <span class="nv">s</span> <span class="p">(</span><span class="nf">no-trim</span> <span class="o">#</span><span class="p">(</span><span class="nf">read-to-re</span> <span class="o">#</span><span class="s">",|\t"</span><span class="p">)))</span>
<span class="nv">sep</span> <span class="p">(</span><span class="nf">read-ch</span><span class="p">)]</span>
<span class="p">(</span><span class="nf">back-to-mark</span> <span class="nv">m</span><span class="p">)</span>
<span class="nv">sep</span><span class="p">))</span>
</pre></div>
<p>Let's try out the new auto-detect functionality. Let us define two new test
strings, <code>s3</code> and <code>s4</code>, that use the <code>tab</code> character as the field separator.</p>
<div class="highlight"><pre><span class="nv">user></span> <span class="p">(</span><span class="nf">use</span> <span class="ss">'protoflex.examples.csv_parse</span><span class="p">)</span>
<span class="nv">nil</span>
<span class="nv">user></span> <span class="p">(</span><span class="k">def </span><span class="nv">s3</span> <span class="s">"1abc\tdef\tghi\n2jkl\tmno\tpqr\n"</span><span class="p">)</span>
<span class="o">#</span><span class="ss">'user/s3</span>
<span class="nv">user></span> <span class="p">(</span><span class="k">def </span><span class="nv">s4</span> <span class="s">"'1a\tbc'\tdef\tghi\n2jkl\tmno\tpqr\n"</span><span class="p">)</span>
<span class="o">#</span><span class="ss">'user/s4</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="nv">csv</span> <span class="nv">s3</span><span class="p">)</span>
<span class="p">[[</span><span class="s">"1abc"</span> <span class="s">"def"</span> <span class="s">"ghi"</span><span class="p">]</span> <span class="p">[</span><span class="s">"2jkl"</span> <span class="s">"mno"</span> <span class="s">"pqr"</span><span class="p">]]</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="nv">csv</span> <span class="nv">s4</span><span class="p">)</span>
<span class="p">[[</span><span class="s">"1a\tbc"</span> <span class="s">"def"</span> <span class="s">"ghi"</span><span class="p">]</span> <span class="p">[</span><span class="s">"2jkl"</span> <span class="s">"mno"</span> <span class="s">"pqr"</span><span class="p">]]</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="nv">csv</span> <span class="nv">s1</span><span class="p">)</span>
<span class="p">[[</span><span class="s">"1abc"</span> <span class="s">"def"</span> <span class="s">"ghi"</span><span class="p">]</span> <span class="p">[</span><span class="s">"2jkl"</span> <span class="s">"mno"</span> <span class="s">"pqr"</span><span class="p">]]</span>
<span class="nv">user></span>
</pre></div>
<p>As you can see, this time we didn't specify what field-separator to use: the parser
itself detected the field-separator character and used it, returning us the desired
results.</p>
<h3>XML Parser</h3>
<p>Here is the listing of a sample XML parser implemented using Parse-EZ. You can find the
source file in the examples directory. The parser returns a map containing keys and values
for <code>:tag</code>, <code>:attributes</code> and <code>:children</code> for the root element. The value for <code>:attributes</code> key
is itself another map containing attribute names and their values. The value for <code>:children</code>
key is a vector (potentially empty) containing string content and/or maps for child elements.</p>
<div class="highlight"><pre><span class="p">(</span><span class="kd">ns </span><span class="nv">protoflex.examples.xml_parse</span>
<span class="p">(</span><span class="ss">:use</span> <span class="p">[</span><span class="nv">protoflex.parse</span><span class="p">]))</span>
<span class="p">(</span><span class="kd">declare </span><span class="nv">pi</span> <span class="nv">prolog</span> <span class="nv">element</span> <span class="nv">attributes</span> <span class="nv">children-and-close</span> <span class="nv">cdata</span> <span class="nv">elem-or-text</span> <span class="nv">close-tag</span><span class="p">)</span>
<span class="p">(</span><span class="kd">defn </span><span class="nv">parse-xml</span> <span class="p">[</span><span class="nv">xml-str</span><span class="p">]</span>
  <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">between</span> <span class="nv">prolog</span> <span class="nv">element</span> <span class="nv">pi</span><span class="p">)</span> <span class="nv">xml-str</span> <span class="ss">:blk-cmt-delim</span> <span class="p">[</span><span class="s">"&lt;!--"</span> <span class="s">"--&gt;"</span><span class="p">]</span> <span class="ss">:line-cmt-start</span> <span class="nv">nil</span><span class="p">))</span>
<span class="p">(</span><span class="kd">defn- </span><span class="nv">pi</span> <span class="p">[]</span> <span class="p">(</span><span class="nf">while</span> <span class="p">(</span><span class="nf">starts-with?</span> <span class="s">"&lt;?"</span><span class="p">)</span> <span class="p">(</span><span class="nf">skip-over</span> <span class="s">"?&gt;"</span><span class="p">)))</span>
<span class="p">(</span><span class="kd">defn- </span><span class="nv">prolog</span> <span class="p">[]</span> <span class="p">(</span><span class="nf">pi</span><span class="p">)</span> <span class="p">(</span><span class="nf">attempt</span> <span class="o">#</span><span class="p">(</span><span class="nf">regex</span> <span class="o">#</span><span class="s">"(?s)&lt;!DOCTYPE([^&lt;]+?&gt;)|(.*?\]\s*&gt;)"</span><span class="p">))</span> <span class="p">(</span><span class="nf">pi</span><span class="p">))</span>
</pre></div>
<p>The function <em>parse-xml</em> is the entry point that kicks off parsing of the input XML string <em>xml-str</em>. It passes the <em>between</em> combinator to <strong>Parse-EZ</strong>'s <em>parse</em> function. Here, the call to <em>between</em> returns the value returned by the <em>element</em> parse function, ignoring the content surrounding it (matched by the <em>prolog</em> and <em>pi</em> functions). The block-comment delimiters are set to match XML's, and the line-comment delimiter is cleared (by default, these match Java comments).</p>
<p>The parse function <em>pi</em> is used to skip consecutive processing instructions by using the delimiters <strong>&lt;?</strong> and <strong>?&gt;</strong>.</p>
<p>The parse function <em>prolog</em> is used to skip the DTD declaration (if any) and any surrounding processing instructions. Note that the regex used to match the DTD declaration is meant for illustration only. It isn't complete but will work in most cases.</p>
<div class="highlight"><pre><span class="p">(</span><span class="k">def </span><span class="nv">name-start</span> <span class="s">":A-Z_a-z\\xC0-\\xD6\\xD8-\\xF6\\xF8-\\u02FF\\u0370-\\u037D\\u037F-\\u1FFF\\u200C-\\u200D\\u2070-\\u218F\\u2C00-\\u2FEF\\u3001-\\uD7FF\\uF900-\\uFDCF\\uFDF0-\\uFFFD"</span><span class="p">)</span>
<span class="p">(</span><span class="k">def </span><span class="nv">name-char</span> <span class="p">(</span><span class="nb">str </span><span class="nv">name-start</span> <span class="s">"\\-.0-9\\xB7\\u0300-\\u036F\\u203F-\\u2040"</span><span class="p">))</span>
<span class="p">(</span><span class="k">def </span><span class="nv">name-re</span> <span class="p">(</span><span class="nb">-> </span><span class="p">(</span><span class="nf">format</span> <span class="s">"[%s][%s]*"</span> <span class="nv">name-start</span> <span class="nv">name-char</span><span class="p">)</span> <span class="nv">re-pattern</span><span class="p">))</span>
</pre></div>
<p><em>name-re</em> is a regular expression that matches XML element and attribute names.</p>
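<p>To see how the start-character and name-character classes combine, here is a minimal sketch that uses only core Clojure. The <code>ascii-*</code> names are hypothetical and cover only the ASCII subset of the ranges above, so this illustrates the construction and is not a replacement for <em>name-re</em>.</p>
<div class="highlight"><pre>;; ASCII-only subset of the character classes above, for illustration only
(def ascii-name-start ":A-Z_a-z")
(def ascii-name-char (str ascii-name-start "\\-.0-9"))
(def ascii-name-re
  (re-pattern (format "[%s][%s]*" ascii-name-start ascii-name-char)))

(re-matches ascii-name-re "xs:element") ; =&gt; "xs:element"
(re-matches ascii-name-re "item-1.a")   ; =&gt; "item-1.a"
(re-matches ascii-name-re "1abc")       ; =&gt; nil (a digit cannot start a name)
</pre></div>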
<div class="highlight"><pre><span class="p">(</span><span class="kd">defn </span><span class="nv">element</span> <span class="p">[]</span>
  <span class="p">(</span><span class="k">let </span><span class="p">[</span><span class="nv">tag</span> <span class="p">(</span><span class="k">do </span><span class="p">(</span><span class="nf">chr</span> <span class="sc">\&lt;</span><span class="p">)</span> <span class="p">(</span><span class="nf">regex</span> <span class="nv">name-re</span><span class="p">))</span>
        <span class="nv">attrs</span> <span class="p">(</span><span class="nf">attributes</span><span class="p">)</span>
        <span class="nb">children </span><span class="p">(</span><span class="nf">look-ahead*</span> <span class="p">[</span>
                   <span class="s">"&gt;"</span> <span class="o">#</span><span class="p">(</span><span class="nf">children-and-close</span> <span class="nv">tag</span><span class="p">)</span>
                   <span class="s">"/&gt;"</span> <span class="p">(</span><span class="k">fn </span><span class="p">[]</span> <span class="p">[])])]</span>
    <span class="p">{</span><span class="ss">:tag</span> <span class="nv">tag</span>, <span class="ss">:attributes</span> <span class="nv">attrs</span>, <span class="ss">:children</span> <span class="nv">children</span><span class="p">}))</span>
</pre></div>
<p>The <em>element</em> parse function matches an XML element and returns the tag, attribute map and children in a hash map. Note the use of the <em>look-ahead*</em> combinator to handle both cases: with children and without children. If it sees a "&gt;" after reading the attributes, the <em>look-ahead*</em> function calls the <em>children-and-close</em> parse function to read the children and the element's close tag. On the other hand, if it sees "/&gt;" after the attributes, it calls the (almost) empty parse function that simply returns an empty vector.</p>
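<p>To make the shape of that map concrete, here is a hand-written illustration for a small hypothetical input. It is not captured output: it follows the description of <code>:tag</code>, <code>:attributes</code> and <code>:children</code> given earlier, and it assumes that quoted attribute values come back without their quotes, as the quoted fields do in the CSV example.</p>
<div class="highlight"><pre>;; Hypothetical input: &lt;a href="x"&gt;hi&lt;b/&gt;&lt;/a&gt;
;; Illustrative shape only, written by hand from the description above
{:tag "a"
 :attributes {"href" "x"}
 :children ["hi" {:tag "b", :attributes {}, :children []}]}
</pre></div>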
<div class="highlight"><pre><span class="p">(</span><span class="kd">defn </span><span class="nv">attr</span> <span class="p">[]</span>
<span class="p">(</span><span class="k">let </span><span class="p">[</span><span class="nv">n</span> <span class="p">(</span><span class="nf">regex</span> <span class="nv">name-re</span><span class="p">)</span> <span class="nv">_</span> <span class="p">(</span><span class="nf">chr</span> <span class="sc">\=</span><span class="p">)</span>
<span class="nv">v</span> <span class="p">(</span><span class="nf">any</span> <span class="nv">sq-str</span> <span class="nv">dq-str</span><span class="p">)]</span>
<span class="p">[</span><span class="nv">n</span> <span class="nv">v</span><span class="p">]))</span>
<span class="p">(</span><span class="kd">defn </span><span class="nv">attributes</span> <span class="p">[]</span> <span class="p">(</span><span class="nb">apply hash-map </span><span class="p">(</span><span class="nf">flatten</span> <span class="p">(</span><span class="nf">multi*</span> <span class="nv">attr</span><span class="p">))))</span>
</pre></div>
<p>The <em>attr</em> parse function matches a single attribute. The attribute value may be
a single-quoted or double-quoted string. Note the usage of <em>any</em> parse combinator for this purpose.</p>
<p>The <em>attributes</em> parse function matches multiple attribute specifications by passing the <em>attr</em> parse function to <em>multi*</em> parse combinator.</p>
<div class="highlight"><pre><span class="p">(</span><span class="kd">defn- </span><span class="nv">children-and-close</span> <span class="p">[</span><span class="nv">tag</span><span class="p">]</span>
<span class="p">(</span><span class="k">let </span><span class="p">[</span><span class="nb">children </span><span class="p">(</span><span class="nf">multi*</span> <span class="o">#</span><span class="p">(</span><span class="nf">between</span> <span class="nv">pi</span> <span class="nv">elem-or-text</span> <span class="nv">pi</span><span class="p">))]</span>
<span class="p">(</span><span class="nf">close-tag</span> <span class="nv">tag</span><span class="p">)</span>
<span class="nv">children</span><span class="p">))</span>
</pre></div>
<p>Each child item is read using the <em>elem-or-text</em> parse function while ignoring any surrounding processing instructions using the <em>between</em> combinator; the combinator <em>multi*</em> is used to read all the child items.</p>
<div class="highlight"><pre><span class="p">(</span><span class="kd">defn- </span><span class="nv">elem-or-text</span> <span class="p">[]</span>
<span class="p">(</span><span class="nf">look-ahead</span> <span class="p">[</span>
<span class="s">"<![CDATA["</span> <span class="nv">cdata</span>
<span class="s">"</"</span> <span class="p">(</span><span class="k">fn </span><span class="p">[]</span> <span class="nv">nil</span><span class="p">)</span>
<span class="s">"<"</span> <span class="nv">element</span>
<span class="s">""</span> <span class="o">#</span><span class="p">(</span><span class="nf">read-to</span> <span class="s">"<"</span><span class="p">)]))</span>
</pre></div>
<p>The <em>look-ahead</em> parse combinator is used to call different parse functions
based on different lookahead strings. The empty lookahead string in the last entry matches any input, so it serves as the default case. Note that, unlike the <em>look-ahead*</em> function used
earlier (in the definition of the <em>element</em> parse function), the <em>look-ahead</em> function
doesn't consume the lookahead string.</p>
<div class="highlight"><pre><span class="p">(</span><span class="kd">defn- </span><span class="nv">cdata</span> <span class="p">[]</span>
<span class="p">(</span><span class="nf">string</span> <span class="s">"<![CDATA["</span><span class="p">)</span>
<span class="p">(</span><span class="k">let </span><span class="p">[</span><span class="nv">txt</span> <span class="p">(</span><span class="nf">read-to</span> <span class="s">"]]>"</span><span class="p">)]</span> <span class="p">(</span><span class="nf">string</span> <span class="s">"]]>"</span><span class="p">)</span> <span class="nv">txt</span><span class="p">))</span>
<span class="p">(</span><span class="kd">defn- </span><span class="nv">close-tag</span> <span class="p">[</span><span class="nv">tag</span><span class="p">]</span>
<span class="p">(</span><span class="nf">string</span> <span class="p">(</span><span class="nb">str </span><span class="s">"</"</span> <span class="nv">tag</span><span class="p">))</span>
<span class="p">(</span><span class="nf">chr</span> <span class="sc">\></span><span class="p">))</span>
</pre></div>
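<p>These two functions are straightforward: <em>cdata</em> returns the text between the <code>&lt;![CDATA[</code> and <code>]]&gt;</code> delimiters, and <em>close-tag</em> matches the closing tag for the given tag name.</p>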
<p>Well, an XML parser in under 50 lines. Let's try it with a few sample inputs:</p>
<div class="highlight"><pre><span class="nv">user></span> <span class="p">(</span><span class="nf">use</span> <span class="ss">'protoflex.examples.xml_parse</span><span class="p">)</span>
<span class="nv">nil</span>
<span class="nv">user></span> <span class="p">(</span><span class="nf">parse-xml</span> <span class="s">"<abc>text</abc>"</span><span class="p">)</span>
<span class="p">{</span><span class="ss">:tag</span> <span class="s">"abc"</span>, <span class="ss">:attributes</span> <span class="p">{}</span>, <span class="ss">:children</span> <span class="p">[</span><span class="s">"text"</span><span class="p">]}</span>
<span class="nv">user></span> <span class="p">(</span><span class="nf">parse-xml</span> <span class="s">"<abc a1=\"1\" a2=\"attr 2\">sample text</abc>"</span><span class="p">)</span>
<span class="p">{</span><span class="ss">:tag</span> <span class="s">"abc"</span>, <span class="ss">:attributes</span> <span class="p">{</span><span class="s">"a1"</span> <span class="s">"1"</span>, <span class="s">"a2"</span> <span class="s">"attr 2"</span><span class="p">}</span>, <span class="ss">:children</span> <span class="p">[</span><span class="s">"sample text"</span><span class="p">]}</span>
<span class="nv">user></span> <span class="p">(</span><span class="nf">parse-xml</span> <span class="s">"<abc a1=\"1\" a2=\"attr 2\"><def d1=\"99\">xxx</def></abc>"</span><span class="p">)</span>
<span class="p">{</span><span class="ss">:tag</span> <span class="s">"abc"</span>, <span class="ss">:attributes</span> <span class="p">{</span><span class="s">"a1"</span> <span class="s">"1"</span>, <span class="s">"a2"</span> <span class="s">"attr 2"</span><span class="p">}</span>, <span class="ss">:children</span> <span class="p">[{</span><span class="ss">:tag</span> <span class="s">"def"</span>, <span class="ss">:attributes</span> <span class="p">{</span><span class="s">"d1"</span> <span class="s">"99"</span><span class="p">}</span>, <span class="ss">:children</span> <span class="p">[</span><span class="s">"xxx"</span><span class="p">]}]}</span>
<span class="nv">user></span>
</pre></div>
<h2>Comments and Whitespaces</h2>
<p>By default, Parse-EZ automatically handles comments and whitespace. This
behavior can be turned on or off temporarily using the macros <code>with-trim-on</code>
and <code>with-trim-off</code> respectively. The parser option <code>:auto-trim</code> can be used to
enable or disable the automatic handling of whitespace and comments. Use the parser
option <code>:blk-cmt-delim</code> to specify the begin and end delimiters for block
comments. The parser option <code>:line-cmt-start</code> can be used to specify the line
comment marker. By default, these options are set to the Java/C++ block and line
comment markers respectively. You can alter the whitespace recognizer by setting
the <code>:ws-regex</code> parser option. By default it is set to <code>#"\s+"</code>.</p>
<p>Alternatively, you can turn off auto-handling of whitespace and comments and use
the <code>lexeme</code> function, which trims the whitespace/comments after applying the
parse function passed as its argument.</p>
<p>Also see the <code>no-trim</code> and <code>no-trim-nl</code> functions.</p>
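<p>As a rough sketch of the <code>lexeme</code> approach, the call below turns auto-trimming off and wraps each integer in <code>lexeme</code>. It assumes that <code>:auto-trim</code> can be passed to <code>parse</code> as a trailing key/value pair, in the same way as the <code>:eof</code> option; check the API documentation before relying on it.</p>
<div class="highlight"><pre>(use 'protoflex.parse)

;; Assumption: parser options are given to parse as trailing key/value pairs.
;; With auto-trim off, lexeme is what skips the whitespace after each integer.
(parse #(multi* (fn [] (lexeme integer))) "1 2 3" :auto-trim false)
</pre></div>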
<h2>Primitive Parse Functions</h2>
<p>Parse-EZ provides a number of primitive parse functions such as: <code>chr</code>,
<code>chr-in</code>, <code>string</code>, <code>string-in</code>, <code>word</code>, <code>word-in</code>, <code>sq-str</code>, <code>dq-str</code>,
<code>any-string</code>, <code>regex</code>, <code>read-to</code>, <code>skip-over</code>, <code>read-re</code>, <code>read-to-re</code>,
<code>skip-over-re</code>, <code>read-n</code>, <code>read-ch</code>, <code>read-ch-in-set</code>, etc.
<a href="http://www.protoflex.com/parse-ez/api-doc/protoflex.parse-api.html" title="Parse-EZ API">See API Documentation</a></p>
<p>Let us try some of the builtin primitive parse functions:</p>
<div class="highlight"><pre><span class="nv">user></span> <span class="p">(</span><span class="nf">use</span> <span class="ss">'protoflex.parse</span><span class="p">)</span>
<span class="nv">nil</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="nv">integer</span> <span class="s">"12"</span><span class="p">)</span>
<span class="mi">12</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="nv">decimal</span> <span class="s">"12.5"</span><span class="p">)</span>
<span class="mf">12.5</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">chr</span> <span class="sc">\a</span><span class="p">)</span> <span class="s">"a"</span><span class="p">)</span>
<span class="sc">\a</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">chr-in</span> <span class="s">"abc"</span><span class="p">)</span> <span class="s">"b"</span><span class="p">)</span>
<span class="sc">\b</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">string-in</span> <span class="p">[</span><span class="s">"abc"</span> <span class="s">"def"</span><span class="p">])</span> <span class="s">"abc"</span><span class="p">)</span>
<span class="s">"abc"</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">string-in</span> <span class="p">[</span><span class="s">"abc"</span> <span class="s">"def"</span><span class="p">])</span> <span class="s">"abcx"</span><span class="p">)</span>
<span class="nv">Parse</span> <span class="nv">Error</span><span class="err">:</span> <span class="nv">Extraneous</span> <span class="nv">text</span> <span class="nv">at</span> <span class="nv">line</span> <span class="mi">1</span>, <span class="nv">col</span> <span class="mi">4</span>
<span class="p">[</span><span class="nv">Thrown</span> <span class="nb">class </span><span class="nv">java.lang.Exception</span><span class="p">]</span>
</pre></div>
<p>Note the parse error for the last parse call. By default, the <code>parse</code> function parses to the
end of the input text. Even though the first 3 characters of the input text are recognized
as valid input, a parse error is generated because the input cursor would not be at the
end of the input text after recognizing "abc".</p>
<p>The parser option <code>:eof</code> can be set to false to allow recognition of partial input:</p>
<div class="highlight"><pre><span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">string-in</span> <span class="p">[</span><span class="s">"abc"</span> <span class="s">"def"</span><span class="p">])</span> <span class="s">"abcx"</span> <span class="ss">:eof</span> <span class="nv">false</span><span class="p">)</span>
<span class="s">"abc"</span>
<span class="nv">user></span>
</pre></div>
<p>You can begin parsing at a marker pattern by first advancing the input cursor with the <code>read-to</code>,
<code>read-to-re</code>, <code>skip-over</code>, or <code>skip-over-re</code> functions.</p>
<div class="highlight"><pre><span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="k">do </span><span class="p">(</span><span class="nf">skip-over</span> <span class="s">">>"</span><span class="p">)</span> <span class="p">(</span><span class="nf">number</span><span class="p">))</span> <span class="s">"ignore upto this>> 456.7"</span><span class="p">)</span>
<span class="mf">456.7</span>
</pre></div>
<h2>Parse Combinators</h2>
<p>Parse Combinators in Parse-EZ are higher-order functions that take other parse
functions as input arguments and combine/apply them in different ways to
implement new parse functionality. Parse-EZ provides parse combinators such as:
<code>opt</code>, <code>attempt</code>, <code>any</code>, <code>series</code>, <code>multi*</code>, <code>multi+</code>, <code>between</code>, <code>look-ahead</code>, <code>lexeme</code>,
<code>expect</code>, etc.
<a href="http://www.protoflex.com/parse-ez/api-doc/protoflex.parse-api.html" title="Parse-EZ API">See API Documentation</a></p>
<p>Let us try some of the builtin parse combinators:</p>
<div class="highlight"><pre><span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">opt</span> <span class="nv">integer</span><span class="p">)</span> <span class="s">"abc"</span> <span class="ss">:eof</span> <span class="nv">false</span><span class="p">)</span>
<span class="nv">nil</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">opt</span> <span class="nv">integer</span><span class="p">)</span> <span class="s">"12"</span><span class="p">)</span>
<span class="mi">12</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">any</span> <span class="nv">integer</span> <span class="nv">decimal</span><span class="p">)</span> <span class="s">"12"</span><span class="p">)</span>
<span class="mi">12</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">any</span> <span class="nv">integer</span> <span class="nv">decimal</span><span class="p">)</span> <span class="s">"12.3"</span><span class="p">)</span>
<span class="mf">12.3</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">series</span> <span class="nv">integer</span> <span class="nv">decimal</span> <span class="nv">integer</span><span class="p">)</span> <span class="s">"3 4.2 6"</span><span class="p">)</span>
<span class="p">[</span><span class="mi">3</span> <span class="mf">4.2</span> <span class="mi">6</span><span class="p">]</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">multi*</span> <span class="nv">integer</span><span class="p">)</span> <span class="s">"1 2 3 4"</span><span class="p">)</span>
<span class="p">[</span><span class="mi">1</span> <span class="mi">2</span> <span class="mi">3</span> <span class="mi">4</span><span class="p">]</span>
<span class="nv">user></span> <span class="p">(</span><span class="nb">parse </span><span class="o">#</span><span class="p">(</span><span class="nf">multi*</span> <span class="p">(</span><span class="k">fn </span><span class="p">[]</span> <span class="p">(</span><span class="nf">string-in</span> <span class="p">[</span><span class="s">"abc"</span> <span class="s">"def"</span><span class="p">])))</span> <span class="s">"abcabcdefabc abcdef"</span><span class="p">)</span>
<span class="p">[</span><span class="s">"abc"</span> <span class="s">"abc"</span> <span class="s">"def"</span> <span class="s">"abc"</span> <span class="s">"abc"</span> <span class="s">"def"</span><span class="p">]</span>
<span class="nv">user></span>
</pre></div>
<p>You can create your own parse functions on top of primitive parse-functions and/or
parse combinators provided by Parse-EZ.</p>
<h2>Committing to a particular parse branch</h2>
<p>Version 0.4.0 added support for committing to a particular parse branch via
the new parse combinators <code>commit</code> and <code>commit-on</code>. These functions make the
parser commit to the current parse branch, making the parser report subsequent
parse-failures in the current branch as parse-errors and preventing it
from trying other alternatives at higher levels.</p>
<h2>Nesting Parse Combinators Using Macros</h2>
<p>Version 0.3.0 of Parse-EZ adds macro versions of parse combinator functions
to make it easy to nest calls to parse combinators without having to write
nested anonymous functions using the "(fn [] ...)" syntax. Note that Clojure
does not allow nesting of anonymous functions of "#(...)" forms. Whereas
the existing parse combinators take parse functions as arguments and actually
perform parsing and return the parse results, the newly added macros take
parse expressions as arguments and return parse functions (to be passed
to other parse combinators). These macros are named the same as the
corresponding parse combinators but with an underscore ("_") suffix. For example
the macro version of "any" is named "any_".</p>
<h2>Error Handling</h2>
<p>Parse Errors are handled in Parse-EZ using Exceptions. The default error messages generated
by Parse-EZ include line and column number information and in some cases what is expected
at that location. However, you can provide your own custom error messages by using the
<code>expect</code> parse combinator.</p>
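<p>Because parse errors surface as ordinary exceptions, they can be caught with a regular <code>try</code>/<code>catch</code>. The sketch below relies only on the failing <code>string-in</code> call shown earlier, which throws a <code>java.lang.Exception</code>; the helper name <code>try-parse</code> is hypothetical and not part of Parse-EZ.</p>
<div class="highlight"><pre>(use 'protoflex.parse)

;; Hypothetical helper: returns the parse result, or the error message on failure.
(defn try-parse [parse-fn text]
  (try
    (parse parse-fn text)
    (catch Exception e
      (.getMessage e))))

;; The trailing "x" makes this parse fail, so the error message is returned
;; instead of the exception propagating to the caller.
(try-parse #(string-in ["abc" "def"]) "abcx")
</pre></div>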
<h2>Expressions</h2>
<p>Parse-EZ includes a customizable expression parser <code>expr</code> for parsing expressions in infix
notation and an expression evaluator function <code>eval-expr</code> to evaluate infix expressions.
You can customize the operators, their precedences and associative properties using
<code>:operators</code> option to the <code>parse</code> function. For evaluating expressions, you can optionally
specify the functions to invoke for each operator using the <code>:op-fn-map</code> option.</p>
<h2>Parser State</h2>
<p>The parser state consists of the input cursor and various parser options (specified or derived)
such as those affecting whitespace and comment parsing, word recognizers, expression parsing,
etc. The parser options can be changed any time in your own parse functions using <code>set-opt</code>.</p>
<p>Note that most of the parse functions affect the parser state (e.g. the input cursor) and hence
are not pure functions. The side effects could be avoided by making the parser state an explicit
parameter to all the parse functions and returning the changed parser state along with the parse
value from each of the parse functions. However, the result would be a significantly less
programmer-friendly API. We made a design decision to keep the parse functions simple and easy to use
rather than to fanatically keep the functions "pure".</p>
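<p>To make the trade-off concrete, here is a small sketch in plain Clojure that does not use Parse-EZ at all. All names in it (<code>read-char-pure</code>, <code>*state*</code>, <code>run</code> and so on) are invented for illustration, and it is not meant to reflect how Parse-EZ is implemented internally; it only contrasts the two calling styles.</p>
<div class="highlight"><pre>;; Style 1: pure. The state is passed in and [value new-state] is returned.
(defn read-char-pure [{:keys [text pos] :as state}]
  [(nth text pos) (assoc state :pos (inc pos))])

(defn read-two-pure [state]
  ;; Every call has to thread the updated state to the next one by hand.
  (let [[a s1] (read-char-pure state)
        [b s2] (read-char-pure s1)]
    [[a b] s2]))

;; Style 2: implicit state. The cursor lives in an atom bound to a dynamic var.
(def ^:dynamic *state* nil)

(defn read-char! []
  (let [{:keys [text pos]} @*state*]
    (swap! *state* update-in [:pos] inc) ; side effect: advance the cursor
    (nth text pos)))

(defn read-two! []
  [(read-char!) (read-char!)])

;; Sets up a fresh state for one parse run and calls the parse function.
(defn run [parse-fn text]
  (binding [*state* (atom {:text text :pos 0})]
    (parse-fn)))

(first (read-two-pure {:text "ab" :pos 0})) ; returns [\a \b]
(run read-two! "ab")                        ; returns [\a \b]
</pre></div>
<p>Both styles read the same two characters, but the second keeps the parse functions free of state plumbing, which is the convenience the design decision above favors.</p>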
<h2>Relation to Parsec</h2>
<p>Parsec is a popular parser combinator library written in Haskell. While Parse-EZ
makes use of some of its ideas, it is <em>not</em> a port of Parsec to Clojure.</p>
<h2>License</h2>
<p>Copyright (C) 2012 Protoflex Software</p>
<p>Distributed under the Eclipse Public License, the same as Clojure.</p>
</article></div>