Some years ago I posted an article about the code size in the MySQL Workbench project and talked a bit about the different subprojects and modules. At that time the project consisted of ~400K LOC (including third-party code) and already then the parser was the second biggest part with nearly a forth of the size of the entire project. This parser project back then used the yacc grammar from the MySQL server codebase and was our base for all parsing tasks in the product. Well, things have changed a lot since these days and this blog post discusses the current parsing infrastructure in MySQL Workbench.
We started looking into a more flexible way of creating our parser infrastructure. Especially the generation of lexer and parser from the grammar was a long winded process, that included a lot of manual tweaking. The most important advantage of using the MySQL server yacc grammar is however that we always stay in sync easily. Though, this is true only for the server version the grammar we picked is for. But MySQL Workbench needs more flexibility, supporting a whole range of server versions (from 5.1 up to the latest 5.7.8). Hence we decided to switch to a different tool: ANTLR. Not so surprising, however: the yacc based parser is still part of MySQL Workbench, because it’s not possible to switch such an important part in one single step. However over time the ANTLR based parsers will ultimately become our central parsing engine and one day we can rule out the yacc parser entirely.
Files created by ANTLR are the biggest single source files I have ever seen. The MySQLLexer.c file is 40MB in size with almost 590K LOC. No wonder our project metrics have changed remarkably, though not only because of the ANTLR based parser. Here are the current numbers (collected by a script shipped with the source zip):
machine:workbench Mike$ sh tools/count_loc.sh c (antlr parser): 1484033 loc 6 files cpp: 418494 loc 704 files cxx: 28484 loc 2 files mm: 31926 loc 97 files m: 9795 loc 37 files py: 87652 loc 170 files cs: 43149 loc 150 files h: 143743 loc 928 files Total: 2247276 (763243 without ANTLR parser) Total Files: 2094 (1166 without headers)
The reason for the big size is the support of the full Unicode BMP for identfiers, which requires some really big state tables in the lexer.
ANTLR 3 – The Workhorse
The current version of ANTLR is 4, published almost 2 years ago. However, we are still on version 3, for a good reason. ANTLR can generate parsers for various target languages, like C#, Java and Python. However, still today, there is no C or C++ target for ANTLR 4, while ANTLR 3 supports both languages well. Hence we decided to stay with ANTLR 3 and with every addition we do (e.g. see the code completion engine) we are more tight to it and unlikely to upgrade to version 4 any time soon. At least a C target should have been one of the first targets, really.
But why not stay with the server’s parser, you might ask. It’s thoroughly tested and obviously is as compatible as a parser can be for the MySQL language. Well, a flexible client tool has different needs compared to a server and that’s why. It starts with the ability to support multiple server versions (the server parser always only supports its current version), continues with different requirements for handling errorneous sql code and really goes own ways when it comes to tooling (like the mentioned code completion engine or the quick syntax checker). ANTLR 3 generates socalled top-down parsers (recursive descent), while YACC creates bottom-up parsers, which use a different approach to parse text. Our ANTLR based parser usually gives better error message, e.g. for a query like:
select 1, from sakila.actor;
the server shows that dreaded “Error Code: 1064. You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ‘from sakila’ at line 1”), while the ANTLR parser says: Syntax error: unexpected comma (with precise position). You can see this in action in the MySQL Workbench SQL editors (via error tooltips).
Another really useful ability in an ANTLR based parser is that you can use any rule in the grammar as starting point, which is not easily possible with a YACC based parser. All grammar rules are generated as functions (remember: recursive descent parser). So you can always call any of them with your input, e.g. you can easily only parse expressions, instead of a full query. We use this ability to parse SQL code in our object editors (stored procedures, triggers etc.) which implicitly disallows any other SQL code not allowed at this point. Also datatype parsing uses the new parser, allowing for maximum conformity of user specified data types and good feedback to the user in case of an error. For developers very important is also the fact that you can easily debug the parser code, if needed. Try that with a YACC based parser which is only iterating over states.
Finally, the server grammar has got quite a number of problems, like a big number of shift-reduce conflicts, a handwritten lexer which is hard to maintain, trouble with semantic action execution (multi execution if not carefully placed) and others. However, the server team is constantly working on improving their parser. It’s just not a good choice for MySQL Workbench.
We decided to use the C target from ANTLR because the C++ support was not only incomplete (and still is) but lead to huge compilation times. Integrating a C module in a C++ environment is trivial and rewards us with a high performance parser.
A Dynamic Parser
Above I mentioned that a GUI tool like MySQL Workbench needs to support multiple server version. Additionally, it must be possible to switch behavior based on the SQL mode. All this is possible by use of socalled semantic predicates. These constructs allow to switch rules or alternatives off and on based on some condition (e.g. the server version). This will ensure a user will always get the right syntax check and proper code completion, regardless which server he actually connects to. This goes so far that we can easily toggle language parts that were only valid for a certain version range (e.g. the NONBLOCKING keyword, which was valid only between 5.7.0 and 5.7.6).
Though we not only use the generated files for standard parsing tasks, but also have a dedicated scanner (based on the lexer) that helps us in determining a context for the context sensitive help. This way we can even handle partially invalid SQL code easily.
In opposition to the server grammar and parser generation from it, our ANTLR parser is generated and compiled automatically during a build and only when there was a change in the grammar file. This allows for easy development of the parser grammar. A test tool exists that uses the same parser and allows to take a deeper look at a query and the parsing process, by printing the AST (abstract syntax tree) in a windows. Additionally it allows to run a subset of our parser unit tests and even can generate test result code we use to feed certain parser tests.
The Parsers in Code
Generating the parsers requires Java installed (because ANTLR is a Java tool), which is the reason why we include the generated files in the source package. This way you are not forced to install Java when you want to build MySQL Workbench. The generation step is simply skipped as the grammar and generated files have the same current timestamp. However, as soon as you change the grammar you will need Java (and the ANTLR jar) to regenerate the parser files, when you build MySQL Workbench yourself.
Starting with Workbench 6.3 we use 2 parser variants: one that generates an AST (abstract syntax tree) and one without. The latter is used for our quick syntax checker as it is twice as fast as the one generating an AST (generation cannot be switched dynamically). The AST however is needed to easily walk the parsed elements, e.g. to find the type of a statement, convert query details into our internal representation, manipulate (rewrite) queries and other things.
The entire parsing engine is organized in 3 layers:
- The generated C parsers wrapped by small classes to provide a C++ interface (including a tree walker, the mention syntax checker and a scanner). You can find all related files in the “library/mysql.parser” subfolder. The generated and wrapper files are:
- MySQL.tokens (a list of token names and their values)
- MySQLLexer.h/c (the generated lexer)
- mysql-scanner.h/cpp (the C++ wrapper around the generated lexer)
- MySQLParserc.h/c (the generated (full) parser)
- mysql-parser.h/cpp (the C++ wrapper around the generated (full) parser)
- MySQLSimpleParser.h/c (the generated parser without AST creation) + its token file
- mysql-syntax-check.h/cpp (the C++ wrapper for that, it shares the lexer with the main parser)
- Some support files (mysql-parser-common.h/cpp, mysql-recognition-types.h)
- The “library/mysql.parser/grammar” folder contains the 2 grammar files (full + simplified parser), build scripts for each platform and the test application (currently for OSX only).
- A module providing our socalled parsing services, including parsing of individual create statements (e.g. for our table or view editors). The parsing services mostly deal with conversion of sql text into our grt tree structure, which is the base for all object editors etc. Currently this is separated into a dynamically loadable module, containing the actual implementation and an abstract class for direct use of the module within Workbench. The related files are:
- modules/db.mysql.parser/src/mysql_parser_module.h/cpp (the module with most of the actual code)
- backend/wbpublic/grtsqlparser/mysql_parser_services.h/cpp (the abstract interface for the module + some support code)
- The editor backend driving the UI, connecting the parsing services and implementing error checking + markup as well as code completion. This layer is spread over a couple of files, all dealing with a specific aspect of handling sql code, which includes query determination and data type parsing as well as our object editors and the sql editor backend. This backend is a good example of the integration of GUI, Workbench backend and parser services including syntax checks and code completion (backend/wbpublic/sqlide/sql_editor_be.h/cpp).
After some years while our grammar and parser matured we reached not only full conformity with the server parser, but could even add language features that aren’t released yet. Our grammar is as close to the server parser as one can be and is the most complete grammar you can get for free (as part of the MySQL Workbench package, which ships the grammar not only in the source zip, but also in the binary package, because it is needed for code completion, or download it MySQL.g). Once in a while (also before big releases) we scan all changes in the server’s sql_yacc.y file and incorporate them, to stay up to date.
Additionally, we have a large set of unit tests that check proper behavior of the generated parser. Some of them (e.g. the sql mode and operator precedence tests) were taken from the MySQL server tests. We have a set of ~1000 queries of all types, to cover most of the language and a special set of commands that stress the use of identifiers (as documented for MySQL) as well as huge table definitions with hundreds of columns and indices etc.
Finally I’d like to express my respect and thankfulness for the guys that stand behind such an extremely useful and well done tool as ANTLR is. These are mainly Prof. Terence Parr (the ANTLR guy) and Sam Harwell for their dedication to ANTLR over all the years as well as Jim Idle for solving the complex task of converting an OOP ANTLR target (Java) to a non OOP language (C), which is the foundation we build everything on.
ANTLR has got an own Google discussion group. Please join our discussion there about ANTLR in MySQL Workbench.