Tree-sitter: Revolutionizing Parsing with an Incremental Parsing Library
Programmers love to add the word “tree” to everything with concepts like binary trees, red-black trees, trie, tree-shaking, and now, Tree-sitter. While 'Tree-sitter' might sound whimsical, it represents a significant innovation in parsing, the fundamental task in understanding and processing code. Traditional parsing methods often face limitations in flexibility and efficiency, but Tree-sitter seeks to improve on this and more, while also doing this in real time with every key stroke. This is an ambitious goal and requires solving many difficult problems.
The Fundamentals of Parsing
To understand the significance of Tree-sitter, it is essential to understand the basics of parsing. Parsing involves analyzing a sequence of tokens to determine its syntactic structure, often represented by a parse tree or an abstract syntax tree (AST). Traditional parsing methods, such as hand-written parsers or parser generators like Bison or ANTLR, have long been the standard tools for this task.
They work by using a grammar file often written in a version of Extended Backus-Naur Form. For many languages this is challenging due to the many quirks and edge cases. Here is a sample of some C grammar.
grammar C;
primaryExpression
: Identifier
| Constant
| StringLiteral+
| '(' expression ')'
| genericSelection
| '__extension__'? '(' compoundStatement ')' // Blocks (GCC extension)
| '__builtin_va_arg' '(' unaryExpression ',' typeName ')'
| '__builtin_offsetof' '(' typeName ',' unaryExpression ')'
;
genericSelection
: '_Generic' '(' assignmentExpression ',' genericAssocList ')'
;
genericAssocList
: genericAssociation (',' genericAssociation)*
;
genericAssociation
: (typeName | 'default') ':' assignmentExpression
;
postfixExpression
:
( primaryExpression
| '__extension__'? '(' typeName ')' '{' initializerList ','? '}'
)
('[' expression ']'
| '(' argumentExpressionList? ')'
| ('.' | '->') Identifier
| '++'
| '--'
)*
;
argumentExpressionList
: assignmentExpression (',' assignmentExpression)*
;
unaryExpression
:
('++' | '--' | 'sizeof')*
(postfixExpression
| unaryOperator castExpression
| ('sizeof' | '_Alignof') '(' typeName ')'
| '&&' Identifier // GCC extension address of label
)
;
unaryOperator
: '&' | '*' | '+' | '-' | '~' | '!'
;
castExpression
: '__extension__'? '(' typeName ')' castExpression
| unaryExpression
| DigitSequence // for
;
For other languages like Schemes or Common Lisp, the grammar is much simpler due to how the languages’ syntax is constructed. Pharo, a Smalltalk, even boasts about how simple it is to parse, highlighting their LL(1) grammar, which only needs one symbol of look ahead. This is why Pharo's syntax fits on a postcard.
But since most programming languages are not developed by people with PhDs in Programming Language Theory, they usually accumulate language specific quirks that need to be handled as special cases within a compiler to fully parse the language. Because of this, creating plugins for IDEs that actually understand a programming language on a fundamental level is rare. The ugly truth is that most of the syntax highlighting you see for plugins on your favorite text editor are just a series of regex hacks. And we all know the issue with regexes
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
Enter Tree-sitter.
The Birth of Tree-sitter:
Tree-sitter was announced to the public in a GitHub blog post in 2018. When it was first announced it had support for 11 languages.
Bash
C
C++
ERB
EJS
Go
HTML
JavaScript
Python
Ruby
TypeScript
The author of the post, Max Brunsfeld was the creator of Tree-sitter. He developed it out of the need for an incremental parsing solution that was faster than the state of the art at the time. Incremental parsing allows IDEs to parse the file when it is first opened into a syntax tree. Then when edits are made in the file, Tree-sitter can give you a new syntax tree that reflects the edit. This is in contrast to how most IDEs worked pre Tree-sitter. They would reparse the whole file from the beginning to show you the new syntax highlighting. On very large files this contributed to noticeable lags in highlighting.
In Tree-sitter since the new tree shares the part of the old tree that you didn’t edit, creating the new syntax tree is fast and does not consume much memory. This makes it possible to parse source code every time a new key is pressed in the editor. But that is only part of the solution.
If an IDE tries to perform syntax highlighting on every keystroke it will invariably run into the situation where the code is in an incomplete state. This can cause errors when trying to build the syntax tree. But Tree-sitter is able to perform error recovery by determining where the start and end of every error is and give you back a working syntax tree to that point.
How it works
To use Tree-sitter you need to hope that someone has written a context free grammar for your language, or write it yourself. You do this by creating a grammar.js
file. By developing the grammar in a programming language like Javascript it makes it easy to programmatically define grammars, and define grammar in terms of other grammars. An example of this is the C++ parser which is defined in terms of the C grammar. This grammar can then be parsed using a "GLR-based" algorithm. This algorithm allows Tree-sitter to handle ambiguous grammars effectively, where multiple parse trees can be generated for a given input and the right one can be picked.
After the grammar is written, you run a command line tool that spits out a parser.c
file. The first part of the parser.c
file is a lexer function that tokenizes the characters in a file. The second part is the parse table, which is a large 2 dimensional array that tells the parser what to do in a given state.
From there you can compile the parser.c
and put the compiled library in a place that an editor can find it. See my most recent article to see what that looks like.
While parsing is Tree-sitter's primary focus, it offers a flexible API that allows developers to leverage its capabilities for various use cases beyond parsing. For example, since the code is parsed into a tree, the tree can be navigated and modified using the parse tree. This leads to navigation in files that is "aware" of the language it is in.. For example, many editors have the ability to jump to the end of a function from the beginning of a function, by finding the closing curly brace
int find_max(int num1, int num2) {
if (num1 > num2) {
return num1;
} else {
return num2;
}
}
But that same logic fails in a language like Python which uses whitespace for scope.
def find_max(num1, num2):
if num1 > num2:
return num1
else:
return num2
With Tree-sitter, when you execute a "move to the end of the function" command, it knows where the end of a function is based on the tree, and does what you expect. This structured navigation and editing makes it easy to do all sorts of things that used to require ugly hacks, or horrific regexes. See Mickey Peterson's Combobulate library for examples of how this works in Emacs.
Tree-sitter's impact has been undeniable, with adoption in Vscode, Emacs, Neovim, and many other editors. It unlocks structured editing which allows unified movement and behavior between editors based on a parse tree. It advances on what came before, and is one of the few tools that lets you have your cake and eat it to. To top it all off it is open source, allowing everyone to benefit from its incredible achievements.
Call To Action 📣
Hi 👋 my name is Diego Crespo and I like to talk about technology, niche programming languages, and AI. I have a Twitter and a Mastodon, if you’d like to follow me on other social media platforms. If you liked the article, consider liking and subscribing. And if you haven’t why not check out another article of mine listed below! Thank you for reading and giving me a little of your valuable time. A.M.D.G