It doesn't matter whether it's optimized or not: the amount of work this entails is ridiculous. Writing a C++11 compiler requires intimate knowledge of the standard (which is a massive, hard-to-read tome) and implementing pretty much all of the standard library from the ground up. And honestly, good luck writing all of that: that's all the containers (set, vector, list, map, unordered_map, multimap, unordered_multimap, forward_list, stack, deque, tuple), the algorithms, the time and regex libraries (chrono and regex), threading and thread models, type_traits, and so on (actually, I just realized, they have it all listed on the right-hand side).
Honestly, it's worded very poorly. "Compliant with the latest 2011 standard (C++11)" suggests all of this. I have a hard time believing this isn't some kind of joke: there is literally no one alive who could write all of this in the timeframe of a course.
I disagree that there's something wrong with writing your own parser. A simple recursive descent parser is very easy to write and, more importantly, very easy for someone else to understand. It does not require any esoteric knowledge of lookup tables and "advanced" parsing algorithms. I far prefer a stack trace from inside someone's hand-written recursive descent parser to the cryptic mess you get from Yacc- or Antlr-generated code. I also prefer not having to add a whole code-generation stage to my build process just to support some small DSL I have embedded.
This whole "you should never write your own parser" thing is so often parroted out as "wisdom"... But I honestly believe that most of the people that say this don't realise how easy it is to "roll your own".
Oh, writing parsers is useful outside of writing compilers. But yes, you shouldn't be writing parsers in low level languages like C++. Write them in OCaml or Haskell. It's a joy there.
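To make that concrete: here is a minimal sketch of a hand-rolled recursive descent parser in OCaml. The grammar (integer arithmetic with precedence and parentheses) is invented purely for illustration:

    (* Hypothetical grammar, one function per nonterminal:
         expr   := term   (('+' | '-') term)*
         term   := factor (('*' | '/') factor)*
         factor := INT | '(' expr ')'                                  *)

    type token = INT of int | PLUS | MINUS | STAR | SLASH | LPAREN | RPAREN

    exception Parse_error of string

    (* Each parser consumes tokens from the front of the list and
       returns (value, remaining tokens). *)
    let rec parse_expr toks =
      let lhs, rest = parse_term toks in
      let rec loop acc = function
        | PLUS  :: r -> let rhs, r' = parse_term r in loop (acc + rhs) r'
        | MINUS :: r -> let rhs, r' = parse_term r in loop (acc - rhs) r'
        | rest       -> (acc, rest)
      in
      loop lhs rest

    and parse_term toks =
      let lhs, rest = parse_factor toks in
      let rec loop acc = function
        | STAR  :: r -> let rhs, r' = parse_factor r in loop (acc * rhs) r'
        | SLASH :: r -> let rhs, r' = parse_factor r in loop (acc / rhs) r'
        | rest       -> (acc, rest)
      in
      loop lhs rest

    and parse_factor = function
      | INT n :: rest -> (n, rest)
      | LPAREN :: rest ->
          (match parse_expr rest with
           | v, RPAREN :: rest' -> (v, rest')
           | _ -> raise (Parse_error "expected ')'"))
      | _ -> raise (Parse_error "expected integer or '('")

    (* parse_expr [INT 1; PLUS; INT 2; STAR; INT 3]  ==>  (7, []) *)

Each nonterminal is just an ordinary function, so a failure gives you a stack trace that points straight at the grammar rule involved, and there is no code-generation stage in the build.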
Still no pedagogical value. Any smart undergrad can pick up how to write a parser with a little background reading and a couple hours. The trick is that you spend the rest of your life fixing maddening corner-case bugs in the thing.
Parsers, from a practical perspective, are child's play for people who seek to eventually master a compiler.
New file formats of all kinds are invented every day, often with some proprietary tool attached. Writing your own parser allows you to add value and/or interoperate with the proprietary system. Even in the cases where an open source parser already exists it may have performance issues, or not support the latest version of the format. Being able to roll your own in those cases is empowering.
Of course you asked for examples... let me give that a try:
1) data from your favorite application that's been end-of-lifed and you're thinking of replacing with a competitor's tool
2) data from a later version of your favorite application that you want to use with an earlier version, because you don't want to upgrade
3) configuration information from some part of your IT infrastructure that you need to refer to as you restructure and upgrade
4) a big config file for some software that contains an error somewhere that "grep" won't find (maybe a semantic error, for example).
5) a config file for some ancient crufty software you're replacing, but the config file is huge and contains a lot of institutional knowledge, so you want to automatically translate it to the new system's setup.
Being able to generate even simple parsers gives you a lot of power. It's not as uncommon as you might imagine.
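As a concrete sketch of cases 4 and 5, here is the kind of quick, throwaway parser this calls for, in OCaml. The "[section] / key = value" format, the "port" field, and the range rule are all made up for illustration:

    (* Parse one line of a hypothetical INI-style config.  Returns
       None for blanks/comments, a section header, or a binding. *)
    let parse_line lineno line =
      let line = String.trim line in
      if line = "" || line.[0] = '#' then None
      else if line.[0] = '[' && line.[String.length line - 1] = ']' then
        Some (`Section (String.sub line 1 (String.length line - 2)))
      else
        match String.index_opt line '=' with
        | Some i ->
            let key = String.trim (String.sub line 0 i) in
            let value =
              String.trim (String.sub line (i + 1) (String.length line - i - 1))
            in
            Some (`Binding (key, value))
        | None -> failwith (Printf.sprintf "line %d: not a binding" lineno)

    (* The semantic check grep can't do: every "port" must be a
       number in range. *)
    let check lineno = function
      | `Binding ("port", v) ->
          (match int_of_string_opt v with
           | Some p when p > 0 && p < 65536 -> ()
           | _ -> Printf.printf "line %d: bad port %S\n" lineno v)
      | _ -> ()

Feed it the file line by line (say, List.iteri over String.split_on_char '\n' on the contents) and it names the exact line with the bad value; swapping the print for an emitter in the new system's syntax gets you case 5.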
Having been in the business of reversing data formats in two different real-world contexts, I feel comfortable saying that just about the last thing I would do is write a parser. One context was building a system to pull live financial data feeds; the other is the software security business. In the former, what you got was often CSVs or otherwise fielded data. We built an engine that could easily inhale these, including the Bloomberg feed, which was unusually complex.
In the security business, one is often asked to assess some not-very-well-specified protocol, or some protocol for which there is no documentation. So to deal with it you 1) fuzz the hell out of it to make the endpoint fall over, or 2) hexdump the protocol and write pieces of it in Ruby or Python to get messages through, so that you can fuzz the hell out of it in a structured way.
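That "write pieces of it" step usually amounts to a tiny codec. Here is a sketch in OCaml (one language for all the examples in this thread; in practice it would be that Ruby or Python script) of a hypothetical length-prefixed wire format. The layout is invented, since in reality it is whatever the hexdump tells you:

    (* Hypothetical framing: 2-byte big-endian length, 1-byte message
       type (assumed 0-255), then the payload. *)
    let encode ~msg_type ~payload =
      let len = String.length payload in
      let b = Bytes.create (3 + len) in
      Bytes.set b 0 (Char.chr ((len lsr 8) land 0xff));  (* length, hi *)
      Bytes.set b 1 (Char.chr (len land 0xff));          (* length, lo *)
      Bytes.set b 2 (Char.chr msg_type);                 (* msg type   *)
      Bytes.blit_string payload 0 b 3 len;
      Bytes.to_string b

    let decode s =
      if String.length s < 3 then failwith "short read";
      let len = (Char.code s.[0] lsl 8) lor Char.code s.[1] in
      if String.length s < 3 + len then failwith "truncated payload";
      (Char.code s.[2], String.sub s 3 len)              (* type, payload *)

    (* decode (encode ~msg_type:1 ~payload:"hello") ==> (1, "hello") *)

With encode in hand you can mutate individual fields while keeping the framing valid, which is exactly what structured fuzzing needs.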
And if there was some need to write a parser, you can bet it ain't gonna be LALR, it will be hand-crafted, likely recursive descent.
To reply to each of your points:
1) If you are lucky, this is XML, and I don't need to know how to write a parser if the data is XML. If it is some sort of Java serialization, dejad is your friend; no parser required. If it is binary, you are going to use the protocol-reversing route mentioned above.
2) See #1
3) Maybe just insert parentheses around the whole bit of data, and insert more strategically, and you are all but done.
4) See #3 or #1.
5) See #4.
If I was working on a team, and I saw someone writing a parser for a data-related problem, I would seriously question what they are doing.
It's nice that things have gone in the direction of XML and JSON lately, and many people devise formats that build upon those. I was thinking more of arbitrary text formats. Even if it's XML or JSON though, the existing parsers only handle comprehending the structure of the data itself. You have to write some semantic analysis on top of those, because the standard "parser" will just give you the input data as a tree - but the semi-standard format certainly helps a lot.
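For instance, here is what that semantic layer can look like in OCaml, assuming the yojson library; the "every record needs a unique id" rule and the field names are made up for illustration. The JSON parser hands you a generic tree, and everything below is the part you still write yourself:

    (* Sketch, assuming yojson (opam install yojson).  The parser
       yields a generic tree; the semantic rule is ours. *)
    let check_users json =
      match json with
      | `List users ->
          let seen = Hashtbl.create 16 in
          List.iter
            (fun u ->
              match u with
              | `Assoc fields ->
                  (match List.assoc_opt "id" fields with
                   | Some (`Int id) ->
                       if Hashtbl.mem seen id then
                         Printf.printf "duplicate id: %d\n" id
                       else Hashtbl.add seen id ()
                   | _ -> print_endline "record missing integer \"id\"")
              | _ -> print_endline "expected an object")
            users
      | _ -> print_endline "expected a top-level array"

    let () =
      check_users
        (Yojson.Basic.from_string
           {|[ {"id": 1, "name": "a"}, {"id": 1, "name": "b"} ]|})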
I do think you overestimate the work required to write a parser for simpler formats - for someone familiar with one of the popular parser generators this can be a handful of hours, and the quality of results should be much higher than an ad hoc method. This can be a good design decision.
I disagree about the Dragon book. In particular, the code generation part was very valuable to me when I wrote my own code generator.
But I think I see where you are going with this. On using a parser generator for compiling, in the words of Dave Conroy (author of MicroEmacs and many other things): "A Parser Generator makes the hard part harder and the easy part easier."
Edit: Wait--1985? Ah, that is the problem. I used the first Dragon book, not the second. I only got the second after I did the compiler work. Was it much worse than the first?
No, even the second edition is hugely out of date.
First, note that basically the first half of the book is about parser implementation. Then it just teaches you that there's only one parser and it's called LALR. Of course, they didn't include anything like packrat parsing, but even worse, they pretend that an LALR generator is still preferable to full LR(1), even though LR(1) was made tractable for large grammars years ago (Pager's algorithm, for instance).
And even with all that, it's far too high level to be of use in actually engineering a compiler (and Engineering a Compiler, by the way, is a good intro book).
I can't argue with your "hugely out of date" note. It is likely that I am also out of date, not having written compilers since the mid-80s.
Do you have an opinion about the Holub book? Also I have a collection of Davidson papers about code generation that I haven't looked at since back then.
Unless you need to for speed or compatibility reasons, I don't like writing parsers in C.
Remember, a compiler is a translator from text to (often) text. You're going to be dealing with a lot of strings, and probably a lot of allocations for your AST. You should think about using a language that doesn't make string manipulation and memory management feel like pulling teeth.
In general, I'd say the ML family is best for writing a general-purpose compiler. SML even gives your compiler a formal semantics for free.
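A sketch of why the fit is good, in OCaml: the AST is a few lines of algebraic data types, the GC owns all of it, and the text-out half of the translator is a pattern match. The toy stack machine targeted here is invented for illustration:

    (* A toy AST and a translation to text for a made-up stack machine. *)
    type expr =
      | Num of int
      | Add of expr * expr
      | Mul of expr * expr

    (* Post-order traversal: operands first, then the operation. *)
    let rec compile = function
      | Num n      -> Printf.sprintf "push %d" n
      | Add (a, b) -> compile a ^ "\n" ^ compile b ^ "\nadd"
      | Mul (a, b) -> compile a ^ "\n" ^ compile b ^ "\nmul"

    let () =
      (* (1 + 2) * 3 ==> push 1 / push 2 / add / push 3 / mul *)
      print_endline (compile (Mul (Add (Num 1, Num 2), Num 3)))

No destructors, no string builders, no visitor boilerplate; and if you add a node to the type, the compiler warns at every match you forgot to extend.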
As @wglb said below, I can't disagree with you more on the Dragon book. Yes, there are probably better resources out there for lexing and parsing, but have you looked at the rest of the book? It actually covers most of the important static analysis techniques in sufficient detail to implement many of the important optimizations. In short: there is a reason people keep recommending it. It is a really great resource.
Also, if you do want to implement a traditional Yacc-like parser generator, it is a pretty good resource for that as well (having done it). Finally, while writing a parser might be a "character-building exercise," sometimes there is also no getting around it.
How can you say there are better references for lexing and parsing and leave out that the entire first half of the book is lexing and parsing?
There are a ton of books that are better at every single thing you would want: Engineering a Compiler, Modern Compiler Implementation in ML, The Compiler Handbook.
I usually reference a combination of the Dragon Book, Muchnick's Advanced Compiler Design, and Semantics with Applications. I have been meaning to pick up a copy of Engineering a Compiler. (I should also note I am currently reading Principles of Program Analysis; it is a good book focusing on just the theory. Read Semantics with Applications first; Principles of Program Analysis is basically the follow-up.)
I guess my point is this: I have learned a great deal from the Dragon book. I think it is a solid book that has taught me a lot. There may be better books out there, but I haven't read one yet (Advanced Compiler Design is great, but it really only covers optimization and analysis; you need an undergrad book to supplement it).
Finally, I have encountered worse books on the subject of compilers. So yes, this is a book that I would recommend and continue to recommend.
PS: You mischaracterize the length of the lexing and parsing coverage. It starts on page 109 and ends on page 302, and the content goes to page 964. Chapter-wise: chapters 3-4 are lexing and parsing (chapters 1-2 are really an introduction and an illustrative example, so they don't count). Chapters 5-8 cover the rest of what you need to get a working compiler, plus some other material. Chapters 9-12 (pages 583-964) cover optimization and analysis in depth. So really, nearly 40% is optimization while about 20% is syntax analysis. This book has a lot of good material, most of it isn't to do with syntax analysis, and the syntax analysis is for the most part high quality.
Except when you work on embedded systems (I'm talking about the type where you have a few hundred kilobytes of RAM and flash, no OS under your ass, and crap libraries). I've used, to great effect, a recursive descent parsing technique that's described in a paper of about 5 pages by David Hanson [1]. Depending on the situation you find yourself in, sometimes it's unavoidable to reinvent the wheel.
If you have a functional bent, Simon Peyton Jones' book (https://research.microsoft.com/en-us/um/people/simonpj/Paper...) is worth reading too. His book, however, is not a complete treatment. It assumes you know, e.g., how to write a parser, and concentrates on the challenges unique to lazy functional languages.
Is there any parser-generation tool that can cope with C++11's grammar, including weird edge cases like template vs. comparison depending on previously declared types?
C++ is messed up enough that you're gonna need to do some weird stuff. First, the grammar is ambiguous. Second, the template language is undecidable. Nothing you can do about the second.
For the first, you'll at least need something capable of parsing general context-free languages (a GLR parser, for instance). My recommendation is to start here[1].
I was just replying to the comment regarding GCC and MSVC -- the argument seemed to be that most of the development man-hours were devoted to C++11, when in fact most of them were devoted to optimization strategies and cross-architectural compatibility.