Tree-sitter is a parser generator tool: it generates parsers for programming languages from their grammars. With these parsers, you can build a Concrete Syntax Tree (CST) from your source code. Tree-sitter also supports incremental parsing, which means it can update the CST in real time as the source code is edited, regenerating only the part of the tree that corresponds to the edited code.
[Figure: Source code to CST conversion]
A CST is a hierarchical representation of your source code that includes every detail, including punctuation and comments. Because of its hierarchical structure, it can be navigated and queried far more easily than the linear source text.
The Tree-sitter parser for a given language converts your code from a linear text format into a hierarchical structure that is easier to analyse and query. In this post, we will take JavaScript as the example: we will generate the JavaScript parser and then use the tree-sitter CLI to query and highlight JavaScript files with it. We will also look at how to set up Tree-sitter's features within Neovim. So let's go over this tree format and understand what exactly it comprises.
The CST is made up of nodes. Each node corresponds to a syntactic element or construct in the source code, such as an expression, a statement, or a piece of punctuation.
Tree-sitter produces two types of nodes: named nodes and anonymous nodes.
Tree-sitter distinguishes between anonymous and named nodes to make it easier to analyze code. Anonymous nodes represent things like commas or parentheses, which are important for code structure but not for understanding what the code does. Named nodes represent meaningful elements like functions or variables. By focusing on named nodes, you can ignore unnecessary details, making code analysis simpler and more like working with an abstract syntax tree (AST).
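As a rough illustration of the named/anonymous split, the sketch below models a CST for a small add function as plain JavaScript objects. The node shape ({ type, named, children }) and the helper names namedOnly and countNodes are assumptions made for this example, not Tree-sitter's API, and the tree is written by hand rather than produced by a real parser.

```javascript
// Toy model of a CST for: function add(a, b) { return a + b; }
// Hand-built for illustration; this is not output from Tree-sitter itself.
const node = (type, named, children = []) => ({ type, named, children });

const cst = node('function_declaration', true, [
  node('function', false),
  node('identifier', true),
  node('formal_parameters', true, [
    node('(', false),
    node('identifier', true),
    node(',', false),
    node('identifier', true),
    node(')', false),
  ]),
  node('statement_block', true, [
    node('{', false),
    node('return_statement', true, [
      node('return', false),
      node('binary_expression', true, [
        node('identifier', true),
        node('+', false),
        node('identifier', true),
      ]),
      node(';', false),
    ]),
    node('}', false),
  ]),
]);

// Keep only named nodes, which gives an AST-like view of the same code.
function namedOnly(n) {
  return {
    type: n.type,
    children: n.children.filter((c) => c.named).map(namedOnly),
  };
}

// Count every node in a tree, including the root.
function countNodes(n) {
  return 1 + n.children.reduce((sum, c) => sum + countNodes(c), 0);
}

const ast = namedOnly(cst);
console.log(countNodes(cst), 'nodes in the full tree');
console.log(countNodes(ast), 'nodes after dropping anonymous ones');
```

Dropping the anonymous nodes removes keywords and punctuation but leaves the overall shape of the function intact, which is why working with named nodes feels close to working with an AST.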
[Figure: Various parts of a CST]
Every named node has the following details in the CST:
Nodes in a CST are related through parent, child, and sibling relationships. To understand these relationships, we need a good grasp of the language's grammar. For example, if you look at the JavaScript grammar in the tree-sitter-javascript repo, you will see that the grammar file has a rules property. Tree-sitter uses these rules, together with other details in the grammar file and some other files in the repository, to generate the parser.
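To give a feel for what a rules property looks like, here is a heavily simplified sketch. The rule bodies are invented for illustration and are far smaller than the real JavaScript grammar. The helpers seq, repeat, and field share their names with functions in Tree-sitter's grammar DSL, but they are redefined here as minimal stand-ins (an assumption of this sketch) so that the snippet runs under plain Node without the tree-sitter CLI.

```javascript
// Minimal stand-ins for the grammar DSL helpers, so this runs under plain Node.
// In a real grammar file the tree-sitter CLI provides these functions.
const seq = (...members) => ({ type: 'SEQ', members });
const repeat = (content) => ({ type: 'REPEAT', content });
const field = (name, content) => ({ type: 'FIELD', name, content });

// `$` turns a reference such as $.identifier into a symbol naming that rule.
const $ = new Proxy({}, {
  get: (_target, name) => ({ type: 'SYMBOL', name }),
});

// Evaluate every rule function once, passing `$` so rules can refer to each other.
const grammar = ({ name, rules }) => ({
  name,
  rules: Object.fromEntries(
    Object.entries(rules).map(([ruleName, build]) => [ruleName, build($)]),
  ),
});

// Hypothetical toy grammar: a program is a list of empty, parameterless functions.
const toyGrammar = grammar({
  name: 'toy_javascript',
  rules: {
    program: ($) => repeat($.function_declaration),
    function_declaration: ($) =>
      seq('function', field('name', $.identifier), '(', ')', '{', '}'),
    identifier: () => /[a-zA-Z_]\w*/,
  },
});

console.log(Object.keys(toyGrammar.rules));
console.log(JSON.stringify(toyGrammar.rules.function_declaration, null, 2));
```

Each rule name becomes a node type in the tree, and a field such as name gives a child a label that its parent can refer to, which is where the parent/child relationships in the CST come from.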
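Let's use the tree-sitter CLI to parse and highlight a sample JavaScript file.
1. Install the CLI:
npm install -g tree-sitter-cli
2. Clone the JavaScript grammar repository:
git clone https://github.com/tree-sitter/tree-sitter-javascript
3. Generate the parser:
cd tree-sitter-javascript
tree-sitter generate
4. Create the CLI config file:
tree-sitter init-config
5. Edit the config file created in the previous step so that it contains: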
{
  "parser-directories": [
    "[parent directory name where you cloned the tree-sitter-javascript repo in step 2]"
  ],
  "file-types": {
    "js": "javascript"
  },
  rest of the config...
}
Now that we have the JavaScript parser set up with Tree-sitter, we can finally use it to parse and highlight JavaScript code.
Create a JavaScript file named app.js with the following contents:
function add(a, b) {
  return a + b;
}
Then run
tree-sitter parse app.js
to produce the CST, and
tree-sitter highlight app.js
to highlight the file.
If you are interested in exploring the tree-sitter CLI further, you can run tree-sitter -h for help.
To analyse our code, Tree-sitter provides a simple but powerful query language. Queries are patterns that match nodes in the CST. Each pattern is written as an S-expression and can match zero or more nodes in the tree.
To match a function declaration, for example, we can use the query:
(function_declaration)
However, matching alone is not enough. We also need to capture the node, which we can do with the following syntax:
(function_declaration) @function.declaration
Why capture, you ask? Well, sometimes we write queries to find specific code patterns. For example, if we want to find all the functions that contain if statements, but are only interested in capturing the function names, we could write a query like this:
; matches functions with an if statement directly inside their body
(function_declaration
  name: (identifier) @function.name
  body: (statement_block
    (if_statement)))
In the above case, we are using the @function.name capture to extract just the function names, which is the information we care about. This is why capturing is useful: it allows us to focus on the specific data we need, even when querying for larger patterns.
Captures allow you to associate names with specific nodes in a pattern, so that you can later refer to those nodes by those names. Capture names are written after the nodes that they are capturing, and start with an @ character.
To match child nodes, nest them inside the parent pattern. For example, to capture a function's name:
(function_declaration
  name: (identifier) @function.name)
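To make the idea of patterns and captures more concrete, here is a toy matcher written in plain JavaScript. It is only a sketch of the concept: the node shape, the object-based pattern format, and the functions matchAt and query are all hypothetical and invented for this example. Tree-sitter's real query engine works on S-expression queries and has richer matching rules than this.

```javascript
// Toy CST node: `field` is the label the parent gives this child, if any.
const node = (type, { field = null, text = '', children = [] } = {}) =>
  ({ type, field, text, children });

// Hand-built tree of named nodes for two functions; only `check` has an `if`.
const program = node('program', { children: [
  node('function_declaration', { children: [
    node('identifier', { field: 'name', text: 'add' }),
    node('statement_block', { field: 'body', children: [node('return_statement')] }),
  ] }),
  node('function_declaration', { children: [
    node('identifier', { field: 'name', text: 'check' }),
    node('statement_block', { field: 'body', children: [node('if_statement')] }),
  ] }),
] });

// Mirrors: (function_declaration name: (identifier) @function.name)
const namePattern = {
  type: 'function_declaration',
  children: [{ field: 'name', type: 'identifier', capture: 'function.name' }],
};

// Same, but the body must also directly contain an if_statement.
const withIfPattern = {
  type: 'function_declaration',
  children: [
    { field: 'name', type: 'identifier', capture: 'function.name' },
    { field: 'body', type: 'statement_block', children: [{ type: 'if_statement' }] },
  ],
};

// Try to match `pattern` at node `n`; return its captures, or null on failure.
function matchAt(n, pattern) {
  if (n.type !== pattern.type) return null;
  if (pattern.field && n.field !== pattern.field) return null;
  let captures = pattern.capture ? [{ name: pattern.capture, text: n.text }] : [];
  for (const childPattern of pattern.children ?? []) {
    let found = null;
    for (const child of n.children) {
      found = matchAt(child, childPattern);
      if (found) break;
    }
    if (!found) return null; // a required child pattern matched nothing
    captures = captures.concat(found);
  }
  return captures;
}

// Walk the whole tree and collect captures from every node that matches.
function query(root, pattern) {
  const here = matchAt(root, pattern) ?? [];
  return root.children.reduce((all, c) => all.concat(query(c, pattern)), here);
}

console.log(query(program, namePattern).map((c) => c.text));
console.log(query(program, withIfPattern).map((c) => c.text));
```

The second pattern matches a larger structure than the first, yet both return only the captured identifier, which is the point of captures: the pattern decides what qualifies, and the capture decides what you get back.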
If you want to learn more about the query language and its syntax, the Tree-sitter documentation covers it in full detail.
Neovim has Tree-sitter support built in, but you still need to install parsers for your languages. You can install them with the nvim-treesitter plugin. Once you have the plugin, you can run the following commands:
Below is a short video which shows how to do it
That's it for this post. Tree-sitter is a game-changer for syntax parsing and code analysis. Its flexibility and efficiency make it a must-explore tool for developers. If you'd like to talk more, feel free to message or email me. I'm always happy to chat about dev tools and programming languages. Thanks for reading, and happy coding!