Add a basic EDN parser #149

joewalker · 2017-01-06T15:17:01Z

The parser mostly works and has a decent test suite. It parses all the
queries issued by tofino-user-agent with some caveats. Known flaws:

* No support for tagged elements, comments, discarded elements or "'"
* Incomplete support for escaped characters in strings and the range of characters that are allowed in keywords and symbols
* Possible whitespace handling problems
* Possibly poor memory handling

rnewman

First pass!

rnewman · 2017-01-06T15:33:58Z

edn/src/edn.rustpeg

@@ -0,0 +1,106 @@
+


License blocks in all files, please.

rnewman · 2017-01-06T15:37:50Z

edn/build.rs

+extern crate peg;
+
+fn main() {
+    peg::cargo_build("src/edn.rustpeg");


Could you add .gitattributes and a modeline to have the .rustpeg file marked as Rust?

Details:

https://github.com/github/linguist/blob/master/README.md#using-gitattributes

It's fairly explicitly not Rust, though. It happens to have a use keyword that works like Rust's, along with vaguely Rust-like code blocks (except that the braces must match, even in strings) and comments (except that they're C-style rather than Rust-style, i.e. multi-line comments don't nest).

Hm. I suppose I was thinking that Rust syntax highlighting, indenting, etc. would be better than nothing? Up to you.

There are some syntax highlighting plugins for it:

https://github.com/treycordova/rustpeg.vim

It sets the filetype with set filetype=rust.rustpeg, so perhaps just add that as a Vim modeline and hope Linguist figures it out?

rnewman · 2017-01-06T15:39:58Z

edn/src/types.rs

+use std::cmp::{Ordering, Ord, PartialOrd};
+use ordered_float::OrderedFloat;
+
+/// We're using BTree{Set, Map} rather than Hash{Set, Map} because the BTree variants implement Hash


For ///, add a little bit of explanatory documentation about what this enum is.

I agree (and I've made it better). That said, we don't really have a defined API yet, so I'm OK with better docs being lower priority.
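For readers following the doc comment under review, here is a minimal sketch of why the BTree collections matter. The enum is a hypothetical, cut-down stand-in for the one in edn/src/types.rs, and hash_of is an illustrative helper, not something from the PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

// Hypothetical, cut-down stand-in for the Value enum in edn/src/types.rs.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
enum Value {
    Integer(i64),
    Text(String),
    // BTreeSet<T> implements Hash when T does, so the derive above compiles.
    // With HashSet<Value> here instead, deriving Hash would be rejected,
    // because HashSet does not implement Hash.
    Set(BTreeSet<Value>),
}

// Illustrative helper: hash a Value with the standard library's default hasher.
fn hash_of(value: &Value) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}
```

Because a BTreeSet iterates in sorted order, two sets built from the same elements in a different insertion order compare equal and hash identically, which is what lets a Value be used as a key in another collection.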

rnewman · 2017-01-06T15:43:00Z

edn/src/types.rs

+    }
+}
+
+// TODO: There has to be a better way to do `as i32` for Value


Not after FromPrimitive was removed. You can unsafely transmute, or you can use e.g., the num crate's automatically derived version, but it basically turns into this.
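A sketch of what "it basically turns into this" tends to look like. The enum and the function name variant_rank are hypothetical; the point is only that an `as i32` cast is not available on an enum whose variants carry data, so the mapping ends up as a hand-written match.

```rust
// Hypothetical, cut-down stand-in for the Value enum under review.
enum Value {
    Nil,
    Boolean(bool),
    Integer(i64),
    Text(String),
}

// Maps each variant to an integer rank. `value as i32` only works on
// enums with no payloads, so the mapping is spelled out by hand.
fn variant_rank(value: &Value) -> i32 {
    match *value {
        Value::Nil => 0,
        Value::Boolean(_) => 1,
        Value::Integer(_) => 2,
        Value::Text(_) => 3,
    }
}
```

A rank like this is typically used to order values of different variants against each other before comparing payloads.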

rnewman · 2017-01-06T15:45:17Z

edn/src/types.rs

+/// (unlike the Hash variants which don't in order to preserve O(n) hashing time which is hard given
+/// recurrsive data structures)
+/// See https://internals.rust-lang.org/t/implementing-hash-for-hashset-hashmap/3817/1
+/// TODO: We should probably Box the collection types


Why? Collection types are effectively boxes already, no? (That is, a Vec is a relatively small copyable structure…)

Do you have some interesting heterogeneous trait object issue in mind?

Well TBH I'm copying @jimblandy who uses Object(Box<HashMap<String, Json>>) in his JSON example (see enum chapter). Except that I note the reality is different.
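To make the "effectively boxes already" point concrete, a small sketch with a hypothetical, cut-down Value: the recursive enum compiles without any Box because Vec and BTreeMap keep their elements on the heap. The helpers nested, depth_of and inline_sizes are illustrative only.

```rust
use std::collections::BTreeMap;
use std::mem::size_of;

// Hypothetical, cut-down recursive Value. No Box is needed: Vec and
// BTreeMap already store their elements behind a heap allocation.
#[derive(Clone, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum Value {
    Integer(i64),
    Vector(Vec<Value>),
    Map(BTreeMap<Value, Value>),
}

// Wraps an integer in `depth` levels of single-element vectors.
fn nested(depth: usize) -> Value {
    let mut value = Value::Integer(0);
    for _ in 0..depth {
        value = Value::Vector(vec![value]);
    }
    value
}

// Counts how many collection levels surround the deepest leaf.
fn depth_of(value: &Value) -> usize {
    match value {
        Value::Integer(_) => 0,
        Value::Vector(items) => 1 + items.iter().map(depth_of).max().unwrap_or(0),
        Value::Map(entries) => {
            1 + entries
                .iter()
                .map(|(k, v)| depth_of(k).max(depth_of(v)))
                .max()
                .unwrap_or(0)
        }
    }
}

// Inline sizes of an unboxed and a boxed Vec<Value>, in bytes.
fn inline_sizes() -> (usize, usize) {
    (size_of::<Vec<Value>>(), size_of::<Box<Vec<Value>>>())
}
```

Boxing a collection would shrink its inline footprint to a single pointer at the cost of one more allocation and indirection, which is probably the trade-off the JSON example is making; whether it pays off here would need measuring.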

rnewman · 2017-01-06T15:46:07Z

edn/src/tests.rs

@@ -0,0 +1,770 @@
+// TODO: Can't we do this just for tests?


Why isn't this file in edn/tests/?

rnewman · 2017-01-06T15:48:32Z

edn/src/edn.rustpeg

+
+#[export]
+integer -> Value = i:$( sign? digit+ ) {
+    Value::Integer(i.parse::<i32>().unwrap())


I think i32 is probably wrong. Clojure automatically walks the numeric tower:

user=> {:foo 12345423857458454}
{:foo 12345423857458454}
user=> {:foo 123454238574584543434343434343}
{:foo 123454238574584543434343434343N}

but at the very least we will want to support microsecond timestamps, which are bigger than a 32-bit signed int.

I'm not good enough at Rust to be able to make more than a stab at the right answer here.

The trivial thing would be s/i32/i64/g "because 64 bits should be enough for anyone". But then, because it wouldn't be, s/i64/i128/g or s/i64/BigInt/g, which suddenly gets weird and/or slow.

We could have the parser auto-select the smallest type that fits. Except that it seems awkward if a parse of well-defined data based on a known schema results in different types depending on the data.

My current thought is that i64 is safe, predictable, easily converted and good enough for 99.9% of cases, and that we should fail to parse numbers bigger than i64. When this becomes a problem, we allow global configuration of the number type to use, and when that becomes cumbersome, we allow some fancy case-by-case configuration.

Or Rust might do something fancy to help us with this problem that I'm not aware of?

I think i64 is a good default, and we might consider adding a BigInt alongside it — even if we only stub out the enum case at this point, and don't parse it.

The edn format docs say:

64-bit (signed integer) precision is expected. An integer can have the suffix N to indicate that arbitrary precision is desired

Thanks @palango
i64 fix is in 29bc808 and BigInt addition is in 41f5997
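As a rough illustration of where this thread ended up (i64 by default, an N suffix for arbitrary precision, failure otherwise), here is a sketch that does not reproduce the code in 29bc808 or 41f5997. The names parse_integer and is_signed_digits are hypothetical, and the BigInteger variant is only a stub that keeps the digits as text.

```rust
#[derive(Debug, PartialEq)]
enum Value {
    Integer(i64),
    // Stub only: keeps the digits as text. A real implementation would
    // presumably hold an arbitrary-precision integer type instead.
    BigInteger(String),
}

// True for an optional single sign followed by at least one ASCII digit.
fn is_signed_digits(s: &str) -> bool {
    let body = s
        .strip_prefix('+')
        .or_else(|| s.strip_prefix('-'))
        .unwrap_or(s);
    !body.is_empty() && body.bytes().all(|b| b.is_ascii_digit())
}

// Parses an EDN-style integer token. Tokens ending in `N` become
// BigInteger; everything else must fit in an i64 or parsing fails.
fn parse_integer(token: &str) -> Option<Value> {
    match token.strip_suffix('N') {
        Some(digits) if is_signed_digits(digits) => {
            Some(Value::BigInteger(digits.to_string()))
        }
        Some(_) => None,
        None if is_signed_digits(token) => token.parse::<i64>().ok().map(Value::Integer),
        None => None,
    }
}
```

With this shape the result type depends only on the presence of the suffix, not on the magnitude of the number, which avoids the "different types depending on the data" concern raised above.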

rnewman · 2017-01-06T15:49:22Z

edn/src/tests.rs

+}
+
+#[test]
+fn test_symbol() {


Please also test the symbols . and $.

rnewman · 2017-01-06T15:52:37Z

edn/src/edn.rustpeg

+
+keyword_char_initial = ":"
+// TODO: More chars here?
+keyword_char_subsequent = [a-z] / [A-Z] / [0-9] / "/"


For future correction: both keywords and symbols can contain /, but only once, and result in a namespaced keyword. That is, :foo/bar has namespace "foo" and name "bar".

(There are similar rules around ., which divides up the namespace into a hierarchy, but can also appear as part of the name.)
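A minimal sketch of the single-slash rule, assuming a hypothetical Keyword struct and parse_keyword function (the richer type discussed later in this review may look different). It only checks the slash structure; it does not validate which characters are allowed, and it ignores special cases such as a bare "/".

```rust
// Hypothetical keyword type with an optional namespace.
#[derive(Debug, PartialEq)]
struct Keyword {
    namespace: Option<String>,
    name: String,
}

// Splits a token such as ":foo/bar" into namespace and name.
// Returns None if the leading ':' is missing, if there is more than
// one '/', or if either side of the '/' is empty.
fn parse_keyword(token: &str) -> Option<Keyword> {
    let body = token.strip_prefix(':')?;
    let mut parts = body.split('/');
    let first = parts.next()?;
    let second = parts.next();
    if parts.next().is_some() {
        return None; // more than one '/'
    }
    match second {
        None if !first.is_empty() => Some(Keyword {
            namespace: None,
            name: first.to_string(),
        }),
        Some(name) if !first.is_empty() && !name.is_empty() => Some(Keyword {
            namespace: Some(first.to_string()),
            name: name.to_string(),
        }),
        _ => None,
    }
}
```

Dots are left untouched here, so ":foo.bar/baz" yields the namespace "foo.bar"; splitting that into a hierarchy would be a separate step.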

bgrins · 2017-01-06T20:26:49Z

edn/.gitignore

@@ -0,0 +1,7 @@
+# Generated by Cargo
+# will have compiled files and executables
+/target/


I think this file shouldn't be needed - our toplevel gitignore should handle these paths (or be updated to handle them)

ncalexan · 2017-01-06T22:35:55Z

I've been using https://github.com/Marwes/combine to parse an EDN AST into a yet more specialized transaction AST. One of the nice things about combine is that it makes providing decent error messages possible (if not easy).

While thinking about how to provide pleasing error messages that include input positions, I realized that I want my EDN input to include its input positions as well. (Either that or I want to parse my own text, so that I can provide meaningful error messages that include input positions and ranges.)

This may not be something we want to handle at this time, but how hard is it to include input ranges using peg? At least that way I could join peg ranges to provide some hints as to which part of your input failed.

joewalker · 2017-01-09T20:55:34Z

In answer to ncalexan's question about retaining input positions, I think the answer is to include some sort of line+col / offset structure alongside each Value. I don't have a concrete plan, but I believe it should work.

mozilla#149 (comment)
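One possible shape for the "line+col / offset structure alongside each Value" idea, offered only as a sketch since no concrete plan exists yet. Position, Spanned and position_at are hypothetical names, and nothing here depends on how peg reports offsets.

```rust
// Hypothetical 1-based line/column plus 0-based byte offset.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Position {
    line: usize,
    column: usize,
    offset: usize,
}

// Hypothetical wrapper pairing any parsed value with its input range.
#[derive(Debug, PartialEq)]
struct Spanned<T> {
    value: T,
    start: Position,
    end: Position,
}

// Converts a byte offset into a Position by scanning the input before it.
// Returns None if the offset is out of range or splits a UTF-8 character.
fn position_at(input: &str, offset: usize) -> Option<Position> {
    if !input.is_char_boundary(offset) {
        return None;
    }
    let before = &input[..offset];
    let line = before.matches('\n').count() + 1;
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = before[line_start..].chars().count() + 1;
    Some(Position { line, column, offset })
}
```

Storing only byte offsets during parsing and deriving line and column lazily, as position_at does, keeps the per-value overhead small; joining two ranges for an error message is then just taking the earlier start and the later end.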

joewalker · 2017-01-10T14:52:05Z

It's worth preserving a link to 29bc808 and 41f5997 as I'm about to rebase and squash and @rnewman probably wants to see those before any final r+.

rnewman · 2017-01-10T17:19:01Z

edn/Cargo.toml

+
+license = "Apache-2.0"
+repository = "https://github.com/mozilla/datomish"
+description = "EDN Parser for Datomish"


Two 'Datomish' to replace.

rnewman · 2017-01-10T17:19:14Z

edn/README.md

@@ -0,0 +1,2 @@
+# barnardsstar
+An experimental EDN parser for Datomish


rnewman · 2017-01-10T17:19:37Z

edn/src/edn.rustpeg

+
+#[export]
+integer -> Value = i:$( sign? digit+ ) {
+    Value::Integer(i.parse::<i64>().unwrap())


It occurs to me that unwrap here might be a bad thing — if I provide a query like

[:find ?foo :in $ :where [_ _ 12345678901234567890123456789012345678901234567890]]

— forgetting the 'N' — the parser will panic. Now, we could build a strategy that always handles panics in the parser, allowing us to avoid error handling, but can we instead signal a failure to parse at this point?
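A sketch of the non-panicking alternative, outside the grammar itself. ParseError and integer_from_token are hypothetical names, and how such a failure would be surfaced from a rustpeg action is left open here.

```rust
use std::num::IntErrorKind;

#[derive(Debug, PartialEq)]
enum Value {
    Integer(i64),
}

// Hypothetical error type that remembers the offending token.
#[derive(Debug, PartialEq)]
enum ParseError {
    IntegerOutOfRange(String),
    NotAnInteger(String),
}

// Converts a matched token to an i64 without unwrap(): overflow and
// malformed input both come back as an Err instead of a panic.
fn integer_from_token(token: &str) -> Result<Value, ParseError> {
    token
        .parse::<i64>()
        .map(Value::Integer)
        .map_err(|err| match err.kind() {
            IntErrorKind::PosOverflow | IntErrorKind::NegOverflow => {
                ParseError::IntegerOutOfRange(token.to_string())
            }
            _ => ParseError::NotAnInteger(token.to_string()),
        })
}
```

With the 50-digit literal from the query above, this returns IntegerOutOfRange rather than aborting, so the caller can report that the N suffix was probably forgotten.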

rnewman · 2017-01-10T17:21:28Z

edn/src/lib.rs

@@ -8,4 +8,17 @@
 // CONDITIONS OF ANY KIND, either express or implied. See the License for the
 // specific language governing permissions and limitations under the License.

-pub mod keyword;


You've unrooted the existing Keyword module with this change…

rnewman · 2017-01-10T17:22:13Z

edn/src/lib.rs

+    include!(concat!(env!("OUT_DIR"), "/edn.rs"));
+}
+
+fn main() {


This shouldn't be in a lib.rs.

rnewman · 2017-01-10T17:23:12Z

edn/src/types.rs

+    Float(OrderedFloat<f64>),
+    Text(String),
+    Symbol(String),
+    Keyword(String),


Probably this should be Keyword(Keyword), with the second being a reference to the richer namespace/name-containing Keyword type defined in keyword.rs.

… but that's #154. So roll on for now.

rnewman · 2017-01-11T21:05:40Z

Landed in c473511, with the fixes from f12fbf2 squashed in.

joewalker added the in progress label Jan 6, 2017

joewalker requested a review from rnewman January 6, 2017 15:17

rnewman reviewed Jan 6, 2017

bgrins reviewed Jan 6, 2017

joewalker self-assigned this Jan 9, 2017

joewalker mentioned this pull request Jan 9, 2017

Make the EDN parser support the correct tokens for Keyword and Symbol #154

Closed

joewalker added a commit to joewalker/mentat that referenced this pull request Jan 10, 2017

Add support for big integers to EDN parser

41f5997

mozilla#149 (comment)

joewalker force-pushed the rust branch from 41f5997 to 2fc57bd Compare January 10, 2017 14:53

rnewman suggested changes Jan 10, 2017

rnewman added a commit that referenced this pull request Jan 11, 2017

Address some review comments for #149.

f12fbf2

rnewman approved these changes Jan 11, 2017

rnewman closed this Jan 11, 2017

rnewman removed the in progress label Jan 11, 2017

ncalexan mentioned this pull request Feb 8, 2017

[edn] Expose line/column/character span position information from parsed EDN streams #258

Closed
