Skip to content
This repository has been archived by the owner on Sep 12, 2018. It is now read-only.

Add a basic EDN parser #149

Closed
wants to merge 1 commit into from
Closed

Add a basic EDN parser #149

wants to merge 1 commit into from

Conversation

joewalker
Copy link
Member

The parser mostly works and has a decent test suite. It parses all the
queries issued by tofino-user-agent with some caveats. Known flaws:

  • No support for tagged elements, comments, discarded elements or "'"
  • Incomplete support for escaped characters in strings and the range of
    characters that are allowed in keywords and symbols
  • Possible whitespace handling problems
  • Possibly poor memory handling

Copy link
Collaborator

@rnewman rnewman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass!

@@ -0,0 +1,106 @@

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

License blocks in all files, please.

extern crate peg;

fn main() {
peg::cargo_build("src/edn.rustpeg");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add .gitattributes and a modeline to have the .rustpeg file marked as Rust?

Details:

https://github.com/github/linguist/blob/master/README.md#using-gitattributes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fairly explicitly not rust though. It happens to have a use keyword that works like Rust's, along with vaguely Rust like code blocks (except that the braces must match, even in strings), and comments (except that they're 'C' style not Rust style, i.e. multi-line comments don't nest)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. I suppose I was thinking that Rust syntax highlighting, indenting, etc. would be better than nothing? Up to you.

There are some syntax highlighting plugins for it:

https://github.com/treycordova/rustpeg.vim

with filetype set filetype=rust.rustpeg, so perhaps just add that as a Vim modeline and hope Linguist figures it out?

use std::cmp::{Ordering, Ord, PartialOrd};
use ordered_float::OrderedFloat;

/// We're using BTree{Set, Map} rather than Hash{Set, Map} because the BTree variants implement Hash
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ///, add a little bit of explanatory documentation about what this enum is.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree (and I've made it better) That said, we don't really have a defined API yet, so I'm OK with better docs being lower priority.

}
}

// TODO: There has to be a better way to do `as i32` for Value
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not after FromPrimitive was removed. You can unsafely transmute, or you can use e.g., the num crate's automatically derived version, but it basically turns into this.

/// (unlike the Hash variants which don't in order to preserve O(n) hashing time which is hard given
/// recurrsive data structures)
/// See https://internals.rust-lang.org/t/implementing-hash-for-hashset-hashmap/3817/1
/// TODO: We should probably Box the collection types
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Collection types are effectively boxes already, no? (That is, a Vec is a relatively small copyable structure…)

Do you have some interesting heterogeneous trait object issue in mind?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well TBH I'm copying @jimblandy who uses Object(Box<HashMap<String, Json>>) in his JSON example (see enum chapter). Except that I note the reality is different.

@@ -0,0 +1,770 @@
// TODO: Can't we do this just for tests?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this file in edn/tests/?


#[export]
integer -> Value = i:$( sign? digit+ ) {
Value::Integer(i.parse::<i32>().unwrap())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i32 is probably wrong. Clojure automatically walks the numeric tower:

user=> {:foo 12345423857458454}
{:foo 12345423857458454}
user=> {:foo 123454238574584543434343434343}
{:foo 123454238574584543434343434343N}

but at the very least we will want to support microsecond timestamps, which are bigger than a 32-bit signed int.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not good enough at Rust enough to be able to make more than a stab at the right answer here.

The trivial thing would be s/i32/i64/g "because 64 bits should be enough for anyone". But then because it wouldn't, s/i64/i128/g or s/i64/BigInt/g which suddenly gets weird and or slow.

We could have the parser auto select the smallest type that fits. Except that it seems awkward if a parse of a well defined data based on a known schema results in different types depending on the data.

My currently thought is that i64 is safe, predictable, easily converted and good enough for 99.9% of cases, and that we should fail to parse for numbers bigger than i64. And that when this becomes a problem we allow global configuration of the number type to use, and when this becomes cumbersome we allow some fancy case-by-case configuration.

Or Rust might do something fancy to help us with this problem that I'm not aware of?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think i64 is a good default, and we might consider adding a BigInt alongside it — even if we only stub out the enum case at this point, and don't parse it.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The edn format docs say:

64-bit (signed integer) precision is expected. An integer can have the suffix N to indicate that arbitrary precision is desired

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @palango
i64 fix is in 29bc808 and BigInt addition is in 41f5997

}

#[test]
fn test_symbol() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also test the symbols . and $.


keyword_char_initial = ":"
// TODO: More chars here?
keyword_char_subsequent = [a-z] / [A-Z] / [0-9] / "/"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future correction: both keywords and symbols can contain /, but only once, and result in a namespaced keyword. That is, :foo/bar has namespace "foo" and name "bar".

(There are similar rules around ., which divides up the namespace into a hierarchy, but can also appear as part of the name.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,7 @@
# Generated by Cargo
# will have compiled files and executables
/target/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this file shouldn't be needed - our toplevel gitignore should handle these paths (or be updated to handle them)

@ncalexan
Copy link
Member

ncalexan commented Jan 6, 2017

I've been using https://github.com/Marwes/combine to parse an EDN AST into a yet more specialized transaction AST. One of the nice things about combine is that it makes providing decent error messages possible (if not easy).

While thinking about how to provide pleasing error messages that include input positions, I realized that I want my EDN input to includes its input positions as well. (Either that or I want to parse my own text, so that I can provide meaningful error messages that include input positions and ranges.)

This may not be something we want to handle at this time, but how hard is it to include input ranges using peg? At least that way I could join peg ranges to provide some hints as to which part of your input failed.

@joewalker
Copy link
Member Author

In answer to ncalexan's question about retaining input positions, I think the answer is to include some sort of line+col / offset structure alongside each Value. I don't have a concrete plan, but I believe it should work.

joewalker added a commit to joewalker/mentat that referenced this pull request Jan 10, 2017
@joewalker
Copy link
Member Author

It's worth preserving a link to 29bc808 and 41f5997 as I'm about to rebase and squash and @rnewman probably wants to see those before any final r+.

The parser mostly works and has a decent test suite. It parses all the
queries issued by tofino-user-agent with some caveats. Known flaws:

* No support for tagged elements, comments, discarded elements or "'"
* Incomplete support for escaped characters in strings and the range of
  characters that are allowed in keywords and symbols
* Possible whitespace handling problems

license = "Apache-2.0"
repository = "https://github.com/mozilla/datomish"
description = "EDN Parser for Datomish"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two 'Datomish' to replace.

@@ -0,0 +1,2 @@
# barnardsstar
An experimental EDN parser for Datomish
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same.


#[export]
integer -> Value = i:$( sign? digit+ ) {
Value::Integer(i.parse::<i64>().unwrap())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It occurs to me that unwrap here might be a bad thing — if I provide a query like

[:find ?foo :in $ :where [_ _ 12345678901234567890123456789012345678901234567890]]

— forgetting the 'N' — the parser will panic. Now, we could build a strategy that always handles panics in the parser, allowing us to avoid error handling, but can we instead signal a failure to parse at this point?

@@ -8,4 +8,17 @@
// CONDITIONS OF ANY KIND, either express or implied. See the License for the
// specific language governing permissions and limitations under the License.

pub mod keyword;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've unrooted the existing Keyword module with this change…

include!(concat!(env!("OUT_DIR"), "/edn.rs"));
}

fn main() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't be in a lib.rs.

Float(OrderedFloat<f64>),
Text(String),
Symbol(String),
Keyword(String),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably this should be Keyword(Keyword), with the second being a reference to the richer namespace/name-containing Keyword type defined in keyword.rs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

… but that's #154. So roll on for now.

rnewman added a commit that referenced this pull request Jan 11, 2017
rnewman pushed a commit that referenced this pull request Jan 11, 2017
The parser mostly works and has a decent test suite. It parses all the
queries issued by the Tofino UAS, with some caveats. Known flaws:

* No support for tagged elements, comments, discarded elements or "'".
* Incomplete support for escaped characters in strings and the range of
  characters that are allowed in keywords and symbols.
* Possible whitespace handling problems.
@rnewman
Copy link
Collaborator

rnewman commented Jan 11, 2017

Landed in c473511, with the fixes from f12fbf2 squashed in.

@rnewman rnewman closed this Jan 11, 2017
victorporof pushed a commit that referenced this pull request Feb 2, 2017
The parser mostly works and has a decent test suite. It parses all the
queries issued by the Tofino UAS, with some caveats. Known flaws:

* No support for tagged elements, comments, discarded elements or "'".
* Incomplete support for escaped characters in strings and the range of
  characters that are allowed in keywords and symbols.
* Possible whitespace handling problems.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants