Parsing a complex language without a parse tree

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Parsing a complex language without a parse tree

Michael Matthews-2
(I am new to mailing lists, so please be polite about formatting problems. Thank you!)

I am developing a programming language. The compiler is written in C++. The semantic objects (e.g. function definitions, variables, expressions) are modeled as C++ classes. Instances of these classes are to be organized into a tree-like structure to create an in-memory representation of the source program. I would like to use a generic algorithm to populate this semantics tree from the result of a Spirit parser. Sadly, all my attempts to do this have failed.

(My grammar compiles cleanly and produces a full match, BTW. I only need help with populating a semantics tree.)

ATTEMPT #1

I decided to use ast_parse to create an AST. Each significant AST node was given an ID corresponding to a particular semantic class. A recursive function would take the Spirit-generated AST and create semantic objects which would be placed into the semantics tree according to their nesting in the AST.

Problem:

Spirit culls _all_ nodes which have only one child node.

Proposed Solution 1:

Searching spirit.sourceforge.net, I found a patch submitted in 2003 which alters this behavior so that nodes for which I have explicitly set an ID will not be culled. But that fix is not in my Boost 1.3.2 package, so it is meaningless to me, and the default behavior is, frankly, silly. If I have a parameter_list in my grammar, I do not want it being replaced with a "literal" in my function definition; otherwise, processing the AST could get very, very messy.

Proposed Solution 2:

Searching the Spirit documentation, I found a rather lengthy page about Parse Trees and ASTs. There I found a cryptic paragraph about using gen_pt_node_d and gen_ast_node_d to switch tree generation between the parse tree and AST behaviors - exactly what I needed. Or so I thought. The page says nothing about how to use these directives. AFAIK, my grammar classes are defined _exactly_ as discussed in the Spirit docs. Results:

 * Using gen_pt_node_d on a grammar being parsed by ast_parse makes it fail to compile.
 * Using gen_ast_node_d with ast_parse appears to have absolutely no effect on the generated parse tree or AST.

So naturally I gave up on this approach and tried something else.

ATTEMPT #2

I thought that instead of trying to force ast_parse to generate certain nodes, I would try to force pt_parse to avoid generating certain nodes. However, when I switched from ast_parse to pt_parse, suddenly I had tons of extra objects populating my tree. Four program objects, in fact, even though the program grammar is defined to match the only whole source file.

Problem:

While I can specify custom rule IDs, Spirit will happily apply them to all the unnamed temporary rules, too. So they work wonderfully as indicators of which parser processed those nodes, but to me this behavior is silly. When I define an ID for a rule, I expect only that rule to have that ID. That ID means something in my application. I do not expect it to be propagated through other parts of the parse tree to nodes I couldn't possibly care less about. Note that I didn't catch this problem earlier since I was using no_node_d extensively with ast_parse, which just happened to remove all the spuriously ID'd nodes.

Proposed Solution:

This spurious rule ID duplication only seems to occur with static IDs. So I tried setting them dynamically using dynamic_tag and set_id. This does work; however, Spirit happily sets all the other rule IDs equal to the pointers of parsers. Which means I now have a parse tree containing IDs with two different meanings. While I could tell my IDs from Spirits by assuming that pointers never have values below 0x00001000 and my enumerated IDs will never have values above 0x00000FFF, it is absolutely absurd to have to make these assumptions in a relatively high-level thing like a compiler.

ATTEMPT #3

With neither of the tree parse functions doing what I wanted, I turned to semantic actions. If I could write a pair of prolog/epilog actions that are called when a match is found and after the match has been processed, I could build my semantics tree. Just use a stack and the order of the prolog/epilog calls to deduce the nesting level of each semantic object and add it to the tree in the proper place. How simple!

Problem:

As if. But when you have a grammar b as the first rule of a grammar a, there is no way for a to trigger its semantic action before b. You would get enter (b) enter (a) leave (b) leave (a) which is useless for generating a tree.

Proposed Solution:

Eh, so I thought I would simply create a custom parser that matched zero-length sequences. I could insert these parsers anywhere, and they would not affect the parser's logic in any way except to trigger my prolog/epilog functions. (And I later ran into epsilon_p, which does precisely that anyway. D'oh.) However, the ability to place semantic actions at any point in the tree is essentially useless to me because they are called for incomplete matches. That is, given a = x >> y >> z, x's actions will trigger as soon as x is matched. Even if z fails to match and causes the parser to backtrack completely out of a, x's action will have triggered. Which means that my epsilon_p semantic actions will _always_ trigger, and I'll have no way to tell when they've triggered on false positives. Great. Score one for parametric parsers. :P

ATTEMPT #4

[ ][ ]
[ ][x] <-you are here

Presently, I wonder whether parser contexts could be used to implement my prolog/epilog semantics-tree-building functions. According to the Spirit docs, they have pre_parse and post_parse functions which are both always called regardless of whether there is a match. Yay. However, I would need some way to backtrack on semantic node creation until I know that a match for its grammar has occurred. Is this doable? If so, how can those functions discern between a match and a failed match? I am thinking of running through the parse tree whilst cueing up semantic creation functions in some kind of stack, but removing the ones that happen to fail parser matching. Could such a stack also keep track of node nesting so that I would know which semantic nodes would need to be placed under which other nodes?

Having spent so many hours trying to get Spirit to do such an incredibly simple thing, I am reluctant to try this method until I know it is feasible.

An Aside:

I do not understand why people use parse trees and ASTs to glue generated parsers to applications. The coupling is ridiculous. Applications rarely need the exact structure of the parse tree, and ASTs are so generic that using them is invariably messy and error-prone. All I really need for my programming language  from the parser is a few enumerated IDs to determine which semantic objects to create and the occasional bit of token text to float up to the most recently created semantic object. This can be easily handled with three or four several-line callbacks - not a huge mass of spaghetti code if's and switch's.

I want something like this trivial example, which _incorrectly_ assumes that semantic actions are only called on complete match of all contained rules:

  program = epsilon_p  [tree_builder::create_and_push ("program")]
         >> str_p ("program")
         >> ch_p ('{')
         >> string     [tree_builder::send_token_to_top_object ()]
         >> ch_p ('}')
         >> epsilon_p  [tree_builder::pop ("program")];

It should be so simple! I must decide the structure of the grammars in the grammar definitions. Why duplicate those decisions within a huge AST evaluation function? I may have to fall back on that approach. *sigh* I hate writing crud.

 - Michael Matthews

--
_______________________________________________

Search for businesses by name, location, or phone number.  -Lycos Yellow Pages

http://r.lycos.com/r/yp_emailfooter/http://yellowpages.lycos.com/default.asp?SRC=lycos10



-------------------------------------------------------
This SF.Net email is sponsored by:
Power Architecture Resource Center: Free content, downloads, discussions,
and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
Spirit-general mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/spirit-general
Reply | Threaded
Open this post in threaded view
|

Re: Parsing a complex language without a parse tree

Dan Nuffer-2
Michael Matthews wrote:
> Problem:
>
> Spirit culls _all_ nodes which have only one child node.
>
> Proposed Solution 1:
>
> Searching spirit.sourceforge.net, I found a patch submitted in 2003 which alters this behavior so that nodes for which I have explicitly set an ID will not be culled. But that fix is not in my Boost 1.3.2 package, so it is meaningless to me, and the default behavior is, frankly, silly. If I have a parameter_list in my grammar, I do not want it being replaced with a "literal" in my function definition; otherwise, processing the AST could get very, very messy.
>

Are you referring to the code which can be activated by #define
BOOST_SPIRIT_NO_TREE_NODE_COLLAPSING ? See ast.hpp

Joel, maybe it would be worth breaking backward compatibility and
defaulting that to being on, and then have a macro to get back the old
behavior? Or if breaking bw compat is too painful, create an
ast2_parse() which does the right thing.




-------------------------------------------------------
This SF.Net email is sponsored by:
Power Architecture Resource Center: Free content, downloads, discussions,
and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
Spirit-general mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/spirit-general
Reply | Threaded
Open this post in threaded view
|

Re: Parsing a complex language without a parse tree

Michael Matthews-2
In reply to this post by Michael Matthews-2
> > Problem:
> >
> > Spirit culls _all_ nodes which have only one child node.
> >
> > Proposed Solution 1:
> >
> > Searching spirit.sourceforge.net, I found a patch
> > submitted in 2003 which alters this behavior so that
> > nodes for which I have explicitly set an ID will not be
> > culled. But that fix is not in my Boost 1.3.2 package,
> > so it is meaningless to me, and the default behavior
> > is, frankly, silly. If I have a parameter_list in my
> > grammar, I do not want it being replaced with a
> > "literal" in my function definition; otherwise,
> > processing the AST could get very, very messy.
> >
>
> Are you referring to the code which can be activated by
> #define BOOST_SPIRIT_NO_TREE_NODE_COLLAPSING ? See ast.hpp

Before I reply, I would like to apologize for my unusually tense and cynical request. I spent many hours rewriting and reorganizing my code, hoping each time that I had found a solution to this problem, but when that fell through I failed to control my frustration. I should have read the docs more carefully, too.

Thank you for your quick response! Unfortunately, there is no BOOST_SPIRIT_NO_TREE_NODE_COLLAPSING present in my ast.hpp, so I must have an older version. My computer has Boost 1.32.0, but AFAIK there are no updated Boost RPMs for my linux distribution, and I do not know how to create them.

If you could give me a rough idea of how Spirit's parse tree and AST generators are glued to the core, that's probably all I'd need.

The following callbacks were written to create my semantics tree:

 * enter (id) - creates a semantic object corresponding to the id. The newly formed object is pushed onto the parent stack. If there was a previous parent, it is set as the parent of the newly formed object.

 * leave () - pops the top object from the parent stack.

The following pseudocode demonstrates the desired order of callbacks invoked through spirit::parse (indented to show their effect on the parent stack):

enter (program)
  enter (visibility-modifier)
  leave ()
  enter (object-definition)
    enter (name-reference)
    leave ()
    enter (name-definition)
    leave ()
  leave ()
  enter (function-definition)
    enter (parameter-list)
    leave ()
    enter (name-definition)
    leave ()
    enter (parameter-list)
    leave ()
    enter (function-body)
      enter (function-call)
        enter (name-reference)
        leave ()
        enter (parameter-list)
          enter (string)
          leave ()
        leave ()
      leave ()
    leave ()
  leave ()
leave ()

I just need a way to connect spirit::parse to these callbacks. If it requires backtracking, I can easily adapt these callbacks provided that "leave" is called reliably and that it has some way to determine whether the match failed or succeeded.

Thank you for your time,

 - Michael Matthews

--
_______________________________________________

Search for businesses by name, location, or phone number.  -Lycos Yellow Pages

http://r.lycos.com/r/yp_emailfooter/http://yellowpages.lycos.com/default.asp?SRC=lycos10



-------------------------------------------------------
This SF.Net email is sponsored by:
Power Architecture Resource Center: Free content, downloads, discussions,
and more. http://solutions.newsforge.com/ibmarch.tmpl
_______________________________________________
Spirit-general mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/spirit-general