It doesn't matter whether it's optimized or not: the amount of work this entails is ridiculous. Writing a C++11 compiler requires intimate knowledge of the standard (which is a massive, hard-to-read tome) and implementing pretty much all of the standard library from the ground up. And honestly, good luck writing all of that: that's all the containers (set, vector, list, map, unordered_map, multimap, unordered_multimap, forward_list, stack, deque, tuple), the algorithms, the time and regex libraries (chrono and regex), threading and thread models, type_traits, and so on (actually, I just realized, they have it all listed on the right-hand side).
Honestly, it's worded very poorly. "Compliant with the latest 2011 standard (C++11)" suggests all of this. I have a hard time believing this isn't some kind of joke: there is literally no one alive who could write all of this in the timeframe of a course.
I disagree that there's something wrong with writing your own parser. A simple recursive descent parser is very easy to write and, more importantly, very easy for someone else to understand. It does not require any esoteric knowledge of lookup tables and "advanced" parsing algorithms. I far prefer a stack trace from inside someone's hand-written recursive descent parser to the cryptic mess you get from Yacc- or Antlr-generated code. I also prefer not having to add a whole code-generation stage to my build process just to support some small DSL I have embedded.
This whole "you should never write your own parser" thing is so often parroted out as "wisdom"... But I honestly believe that most of the people that say this don't realise how easy it is to "roll your own".
Oh, writing parsers is useful outside of writing compilers. But yes, you shouldn't be writing parsers in low level languages like C++. Write them in OCaml or Haskell. It's a joy there.
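To make that concrete: here is a minimal sketch of a hand-rolled recursive descent parser in OCaml. The grammar (integer arithmetic with precedence and parentheses) is invented purely for illustration:

    (* Hypothetical grammar, one function per nonterminal:
         expr   := term   (('+' | '-') term)*
         term   := factor (('*' | '/') factor)*
         factor := INT | '(' expr ')'                                  *)

    type token = INT of int | PLUS | MINUS | STAR | SLASH | LPAREN | RPAREN

    exception Parse_error of string

    (* Each parser consumes tokens from the front of the list and
       returns (value, remaining tokens). *)
    let rec parse_expr toks =
      let lhs, rest = parse_term toks in
      let rec loop acc = function
        | PLUS  :: r -> let rhs, r' = parse_term r in loop (acc + rhs) r'
        | MINUS :: r -> let rhs, r' = parse_term r in loop (acc - rhs) r'
        | rest       -> (acc, rest)
      in
      loop lhs rest

    and parse_term toks =
      let lhs, rest = parse_factor toks in
      let rec loop acc = function
        | STAR  :: r -> let rhs, r' = parse_factor r in loop (acc * rhs) r'
        | SLASH :: r -> let rhs, r' = parse_factor r in loop (acc / rhs) r'
        | rest       -> (acc, rest)
      in
      loop lhs rest

    and parse_factor = function
      | INT n :: rest -> (n, rest)
      | LPAREN :: rest ->
          (match parse_expr rest with
           | v, RPAREN :: rest' -> (v, rest')
           | _ -> raise (Parse_error "expected ')'"))
      | _ -> raise (Parse_error "expected integer or '('")

    (* parse_expr [INT 1; PLUS; INT 2; STAR; INT 3]  ==>  (7, []) *)

Each nonterminal is just an ordinary function, so a failure gives you a stack trace that points straight at the grammar rule involved, and there is no code-generation stage in the build.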
Still no pedagogical value. Any smart undergrad can pick up how to write a parser with a little background reading and a couple hours. The trick is that you spend the rest of your life fixing maddening corner-case bugs in the thing.
Parsers, from a practical perspective, are child's play for people who seek to eventually master a compiler.
New file formats of all kinds are invented every day, often with some proprietary tool attached. Writing your own parser allows you to add value and/or interoperate with the proprietary system. Even in the cases where an open source parser already exists it may have performance issues, or not support the latest version of the format. Being able to roll your own in those cases is empowering.
Of course you asked for examples... let me give that a try:
1) data from your favorite application that's been end-of-lifed and you're thinking of replacing with a competitor's tool
2) data from a later version of your favorite application that you want to use with an earlier version, because you don't want to upgrade
3) configuration information from some part of your IT infrastructure that you need to refer to as you restructure and upgrade
4) a big config file for some software that contains an error somewhere that "grep" won't find (maybe a semantic error, for example).
5) a config file for some ancient crufty software you're replacing, but the config file is huge and contains a lot of institutional knowledge, so you want to automatically translate it to the new system's setup.
Being able to generate even simple parsers gives you a lot of power. It's not as uncommon as you might imagine.
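As a concrete sketch of cases 4 and 5, here is the kind of quick, throwaway parser this calls for, in OCaml. The "[section] / key = value" format, the "port" field, and the range rule are all made up for illustration:

    (* Parse one line of a hypothetical INI-style config.  Returns
       None for blanks/comments, a section header, or a binding. *)
    let parse_line lineno line =
      let line = String.trim line in
      if line = "" || line.[0] = '#' then None
      else if line.[0] = '[' && line.[String.length line - 1] = ']' then
        Some (`Section (String.sub line 1 (String.length line - 2)))
      else
        match String.index_opt line '=' with
        | Some i ->
            let key = String.trim (String.sub line 0 i) in
            let value =
              String.trim (String.sub line (i + 1) (String.length line - i - 1))
            in
            Some (`Binding (key, value))
        | None -> failwith (Printf.sprintf "line %d: not a binding" lineno)

    (* The semantic check grep can't do: every "port" must be a
       number in range. *)
    let check lineno = function
      | `Binding ("port", v) ->
          (match int_of_string_opt v with
           | Some p when p > 0 && p < 65536 -> ()
           | _ -> Printf.printf "line %d: bad port %S\n" lineno v)
      | _ -> ()

Feed it the file line by line (say, List.iteri over String.split_on_char '\n' on the contents) and it names the exact line with the bad value; swapping the print for an emitter in the new system's syntax gets you case 5.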
Having been in the business of reversing data formats in two different real-world contexts, I feel comfortable saying that just about the last thing I would do is write a parser. One context was building a system to pull live financial data feeds; the other is the software security business. In the former, what you got was often CSVs or otherwise fielded data. We built an engine that could easily inhale these, including the Bloomberg feed, which was unusually complex.
In the security business, one is often asked to assess some not-very-well-specified protocol, or some protocol for which there is no documentation. So to deal with it you 1) fuzz the hell out of it to make the endpoint fall over, or 2) hexdump the protocol and write pieces of it in Ruby or Python to get messages through, so that you can fuzz the hell out of it in a structured way.
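That "write pieces of it" step usually amounts to a tiny codec. Here is a sketch in OCaml (one language for all the examples in this thread; in practice it would be that Ruby or Python script) of a hypothetical length-prefixed wire format. The layout is invented, since in reality it is whatever the hexdump tells you:

    (* Hypothetical framing: 2-byte big-endian length, 1-byte message
       type (assumed 0-255), then the payload. *)
    let encode ~msg_type ~payload =
      let len = String.length payload in
      let b = Bytes.create (3 + len) in
      Bytes.set b 0 (Char.chr ((len lsr 8) land 0xff));  (* length, hi *)
      Bytes.set b 1 (Char.chr (len land 0xff));          (* length, lo *)
      Bytes.set b 2 (Char.chr msg_type);                 (* msg type   *)
      Bytes.blit_string payload 0 b 3 len;
      Bytes.to_string b

    let decode s =
      if String.length s < 3 then failwith "short read";
      let len = (Char.code s.[0] lsl 8) lor Char.code s.[1] in
      if String.length s < 3 + len then failwith "truncated payload";
      (Char.code s.[2], String.sub s 3 len)              (* type, payload *)

    (* decode (encode ~msg_type:1 ~payload:"hello") ==> (1, "hello") *)

With encode in hand you can mutate individual fields while keeping the framing valid, which is exactly what structured fuzzing needs.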
And if there was some need to write a parser, you can bet it ain't gonna be LALR, it will be hand-crafted, likely recursive descent.
To reply to each of your points:
1) If you are lucky, this is XML, and I don't need to know how to write a parser if the data is XML. If it is some sort of Java serialization, dejad is your friend; no parser required. If it is binary, you are going to use the protocol-reversing route mentioned above.
2) See #1
3) Maybe just insert parentheses around the whole bit of data, and insert more strategically, and you are all but done.
4) See #3 or #1.
5) See #4.
If I was working on a team, and I saw someone writing a parser for a data-related problem, I would seriously question what they are doing.
It's nice that things have gone in the direction of XML and JSON lately, and many people devise formats that build upon those. I was thinking more of arbitrary text formats. Even if it's XML or JSON though, the existing parsers only handle comprehending the structure of the data itself. You have to write some semantic analysis on top of those, because the standard "parser" will just give you the input data as a tree - but the semi-standard format certainly helps a lot.
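For instance, here is what that semantic layer can look like in OCaml, assuming the yojson library; the "every record needs a unique id" rule and the field names are made up for illustration. The JSON parser hands you a generic tree, and everything below is the part you still write yourself:

    (* Sketch, assuming yojson (opam install yojson).  The parser
       yields a generic tree; the semantic rule is ours. *)
    let check_users json =
      match json with
      | `List users ->
          let seen = Hashtbl.create 16 in
          List.iter
            (fun u ->
              match u with
              | `Assoc fields ->
                  (match List.assoc_opt "id" fields with
                   | Some (`Int id) ->
                       if Hashtbl.mem seen id then
                         Printf.printf "duplicate id: %d\n" id
                       else Hashtbl.add seen id ()
                   | _ -> print_endline "record missing integer \"id\"")
              | _ -> print_endline "expected an object")
            users
      | _ -> print_endline "expected a top-level array"

    let () =
      check_users
        (Yojson.Basic.from_string
           {|[ {"id": 1, "name": "a"}, {"id": 1, "name": "b"} ]|})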
I do think you overestimate the work required to write a parser for simpler formats - for someone familiar with one of the popular parser generators this can be a handful of hours, and the quality of results should be much higher than an ad hoc method. This can be a good design decision.
I disagree about the Dragon book. In particular, the code generation part was very valuable to me when I wrote my own code generator.
But I think I see where you are going with this. On using a parser generator for compiling, in the words of Dave Conroy (author of MicroEmacs and many other things): "A Parser Generator makes the hard part harder and the easy part easier."
Edit: Wait--1985? Ah, that is the problem. I used the first Dragon book, not the second. I only got the second after I did the compiler work. Was it much worse than the first?
No, even the second edition is hugely out of date.
First, note that basically the first half of the book is about parser implementation. Then it just teaches you that there's only one parser and it's called LALR. Of course, they didn't include anything like packrat parsing, but even worse, they pretend that an LALR generator is still preferable to full LR(1), even though LR(1) was made tractable for large grammars years ago (Pager's algorithm, for instance).
And even with all that, it's far too high level to be of use in actually engineering a compiler (and Engineering a Compiler, by the way, is a good intro book).
I can't argue with your "hugely out of date" note. It is likely that I am also out of date, not having written compilers since the mid-80s.
Do you have an opinion about the Holub book? Also I have a collection of Davidson papers about code generation that I haven't looked at since back then.
Unless you need to for speed or compatibility reasons, I don't like writing parsers in C.
Remember, a compiler is a translator from text to (often) text. You're going to be dealing with a lot of strings, and probably a lot of allocations for your AST. You should think about using a language that doesn't make string manipulation and memory management feel like pulling teeth.
In general, I'd say the ML family is best for writing a general-purpose compiler. SML even gives your compiler a formal semantics for free.
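A sketch of why the fit is good, in OCaml: the AST is a few lines of algebraic data types, the GC owns all of it, and the text-out half of the translator is a pattern match. The toy stack machine targeted here is invented for illustration:

    (* A toy AST and a translation to text for a made-up stack machine. *)
    type expr =
      | Num of int
      | Add of expr * expr
      | Mul of expr * expr

    (* Post-order traversal: operands first, then the operation. *)
    let rec compile = function
      | Num n      -> Printf.sprintf "push %d" n
      | Add (a, b) -> compile a ^ "\n" ^ compile b ^ "\nadd"
      | Mul (a, b) -> compile a ^ "\n" ^ compile b ^ "\nmul"

    let () =
      (* (1 + 2) * 3 ==> push 1 / push 2 / add / push 3 / mul *)
      print_endline (compile (Mul (Add (Num 1, Num 2), Num 3)))

No destructors, no string builders, no visitor boilerplate; and if you add a node to the type, the compiler warns at every match you forgot to extend.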
As @wglb said below, I can't disagree with you more on the Dragon book. Yes, there are probably better resources out there for lexing and parsing, but have you looked at the rest of the book? It actually covers most of the important static analysis techniques in sufficient detail to implement many of the important optimizations. In short: there is a reason people keep recommending it. It is a really great resource.
Also, if you do want to implement a traditional Yacc-like parser generator, it is a pretty good resource for that as well (having done it). Finally, while writing a parser might be a "character-building exercise," sometimes there is also no getting around it.
How can you say there are better references for lexing and parsing and leave out that the entire first half of the book is lexing and parsing?
There are a ton of books that are better at every single thing you would want: Engineering a Compiler, Modern Compiler Implementation in ML, The Compiler Handbook.
I usually reference a combination of the Dragon Book, Muchnick's Advanced Compiler Design, and Semantics with Applications. I have been meaning to pick up a copy of Engineering a Compiler. (I should also note I am currently reading Principles of Program Analysis; it is a good book focusing on just the theory. Read Semantics with Applications first; Principles of Program Analysis is basically the follow-up.)
I guess my point is this: I have learned a great deal from the Dragon book. I think it is a solid book that has taught me a lot. There may be better books out there, but I haven't read one yet (Advanced Compiler Design is great, but it really only covers optimization and analysis; you need an undergrad book to supplement it).
Finally, I have encountered worse books on the subject of compilers. So yes, this is a book that I would recommend and continue to recommend.
PS: You mischaracterize the length of the lexing and parsing coverage. It starts on page 109 and ends on page 302, and the content goes to page 964. Chapter-wise: chapters 3-4 are lexing and parsing (chapters 1-2 are really an introduction and an illustrative example, so they don't count). Chapters 5-8 cover the rest of what you need to get a working compiler, plus some other material. Chapters 9-12 (pages 583-964) cover optimization and analysis in depth. So really, nearly 40% is optimization while about 20% is syntax analysis. This book has a lot of good material, most of it isn't to do with syntax analysis, and the syntax analysis is for the most part high quality.
Except when you work on embedded systems (I'm talking about the type where you have a few hundred kilobytes of RAM and flash, no OS under your ass, and crap libraries). I've used, to great effect, a recursive descent parsing technique that's described in a paper of about 5 pages by David Hanson [1]. Depending on the situation you find yourself in, sometimes it's unavoidable to reinvent the wheel.
If you have a functional bent, Simon Peyton Jones' book (https://research.microsoft.com/en-us/um/people/simonpj/Paper...) is worth reading too. His book, however, is not a complete treatment. It assumes you know, e.g., how to write a parser, and concentrates on the challenges unique to lazy functional languages.
Is there any parser-generation tool that can cope with C++11's grammar, including weird edge cases like template vs. comparison depending on previously declared types?
C++ is messed up enough that you're gonna need to do some weird stuff. First, the grammar is ambiguous. Second, the template language is undecidable. Nothing you can do about the second.
For the first, you'll at least need something capable of parsing general context-free languages (a GLR parser, for instance). My recommendation is to start here[1].
I was just replying to the comment regarding GCC and MSVC -- the argument seemed to be that most of the development man-hours were devoted to C++11, when in fact most of them were devoted to optimization strategies and cross-architectural compatibility.