catwithtudou

一直是阵雨


🔬Build A Parser By Rust (Part 1)

The content of this document is copied from the original Feishu document for search purposes, so there may be formatting incompatibilities; viewing the original Feishu document is recommended.

Background#

Recently, I have been following mbrubeck's Robinson project to learn how to write a simple browser engine in Rust. Since this involves parsing formats such as HTML and CSS, you need to write the relevant parsers yourself to complete the task.

Writing a parser from scratch is a tedious and error-prone task: you not only need to handle the specific protocol rules being parsed, but also think about error handling, extensibility, and parsing performance. For this reason, the author also recommends in the article building on existing third-party libraries like pest instead.

Reflecting on my daily development work, I rarely need to construct a parser myself. When I need to parse information in some format or protocol, such as JSON or CSV, I usually reach directly for a third-party library dedicated to that format for efficiency. However, not all protocols or formats have ready-made parsers, especially the various network communication protocols, and customizing existing parsers is often difficult. So this is a good opportunity to learn about parsers and the practice of constructing them, which will be useful in similar scenarios later.

Note:

  • This document will not delve deeply into the principles of parsers and parser libraries; it is more about "an introduction to parsers and their practice."
  • This series of documents is divided into two parts: the first part mainly discusses understanding parsers and using third-party libraries, while the second part focuses on specific practices.
  • The source code mentioned in this series of documents can be viewed at https://github.com/catwithtudou/parser_toy.

Prerequisite Knowledge#

Below are some introductory background knowledge related to parsers to help understand the following content.

Parser#

The term "parser" here is used in a broad sense, usually referring to a component that converts information in a certain format into a certain data structure.

It is like "decoding" information of a certain format into an organized data structure, making it easier to understand and process the information.

For example 🌰: There is a text of a mathematical expression "1 + 2", and we expect the program to be able to calculate the result.

To allow the program to recognize the arithmetic expression, a parser for arithmetic expressions can be used to convert it into a structure (left, op, right) for calculation.
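As a toy sketch of this idea (all names here are hypothetical, not from any library), a parser for this single fixed shape could split the text into three parts and fill a structure:

```rust
// Hypothetical sketch: parse "1 + 2" into a (left, op, right) structure and evaluate it.
#[derive(Debug, PartialEq)]
struct BinExpr {
    left: i64,
    op: char,
    right: i64,
}

// Split on whitespace and read the three expected parts in order;
// any missing or malformed part makes the whole parse fail with None.
fn parse_bin_expr(input: &str) -> Option<BinExpr> {
    let mut parts = input.split_whitespace();
    let left = parts.next()?.parse().ok()?;
    let op = parts.next()?.chars().next()?;
    let right = parts.next()?.parse().ok()?;
    Some(BinExpr { left, op, right })
}

// Evaluate the parsed structure.
fn eval(expr: &BinExpr) -> i64 {
    match expr.op {
        '+' => expr.left + expr.right,
        '-' => expr.left - expr.right,
        _ => panic!("unsupported operator"),
    }
}
```

Here `parse_bin_expr("1 + 2")` yields `BinExpr { left: 1, op: '+', right: 2 }`, which `eval` reduces to `3`.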

In the field of computer science, parsers are essential in the process of handling data and can be applied in various data processing scenarios, such as the more common ones:

  • In the underlying compiler or interpreter, the parser's main role is to perform lexical and syntactic analysis on the source code, extracting the abstract syntax tree (AST).
  • For the commonly used data exchange format JSON in web applications, the corresponding parser can serialize the required data structure for processing.
  • Others include parsing network communication protocols, scripting languages, database languages, etc.

PEG#

Before introducing PEG (Parsing Expression Grammar), we can start from the more familiar regular expressions for easier understanding.

The connection between regular expressions and PEG is mainly that both can match and parse character text using specific syntax. The differences are:

  • 【Syntax aspect】The former uses a specific syntax to describe string patterns, usually for simple string matching and searching. The latter uses a more complex syntax to describe language structures, typically for handling complex language parsing and analysis needs.
  • 【Application domain】The former is mainly used for simple text processing needs, such as finding specific patterns in text or validating input formats. The latter is primarily used for handling complex language structures, such as syntax analysis and interpreter construction for programming languages.

Through this comparison, I believe everyone now has a rough sense of what PEG is.

The reason for introducing PEG is that tools built on PEG (called Parser Generators) can be used to create custom parsers.

Next, let's formally introduce PEG, which is Parsing Expression Grammar:

  1. Introduction to PEG

Parsing Expression Grammar, abbreviated as PEG:

  • It is a type of analytical formal grammar. Introduced by Bryan Ford in 2004, it is related to the family of top-down parsing languages introduced in the 1970s.
  • As a grammar that describes language structures, it can handle more complex language structures compared to regular expressions, due to its recursive nature that can describe infinitely nested structures.
  • It uses a simple and flexible way to define grammar rules, which can be used to parse input strings and generate syntax trees.
  • It has advantages in ease of use, correctness, and performance, providing features such as error reporting and reusable rule templates, thus widely used in parsing and analyzing text.
  2. Introduction to PEG Applications

The syntax of PEG is similar to programming languages, using operators and rules to describe language structures:

  • Operators include "|" (ordered choice; written "/" in classical PEG notation), "&" (and-predicate), "?" (optional), etc., while rules are used to describe the specific structure of the language.

  • For example, here is a simple PEG rule that describes the syntax of an integer:

    int := [0-9]+
    

Because it can be directly converted into efficient parser code, there are currently many parsers implemented using PEG, such as ANTLR, PEG.js, etc.

Parser Combinator#

With the previous understanding of Parsers, understanding Parser Combinators becomes easier.

  1. Definition and Concept of Parser Combinator

In simple terms, a Parser Combinator is a higher-order function that combines several smaller parsers into a new parser.

The idea of Parser Combinators aligns well with software engineering; it is a technique for building parsers based on function composition, allowing for the construction of complex parsers by combining small, reusable, and testable parser components. This approach makes the construction of parsers more flexible and extensible, greatly improving development efficiency and facilitating subsequent maintenance.

  2. Parser Combinator vs. Parser Generator

Parser Combinators are actually parallel concepts to the previously mentioned Parser Generators. Here is an example:

  • If we consider the parser we want to implement (such as a JSON Parser) as a large building,
  • Using a Parser Generator would mean starting from scratch each time to build that building, and similar parts (like doors and windows) between buildings cannot be reused.
  • However, using Parser Combinators is like building with LEGO blocks, continuously constructing small, reusable, and testable components, and then using these components to build the building. If we want to construct a new building, we can reuse the components created earlier, which is very convenient. At the same time, when a parser has issues, it is easy to locate a specific component, making subsequent maintenance easier.
  3. Parser Combinator vs. PEG-based Parser Generator

【Expression aspect】Parser Combinators are more flexible in their expressive power, allowing for the direct use of programming language features to combine and define parsers. In contrast, PEG-based Parser Generators describe parsers through specific syntax rules, and their expressive power is limited by those syntax rules, meaning you need to learn how to use the Parser Generator's interface and also master the syntax rules of PEG.

【Performance aspect】The performance comparison between Parser Combinators and Parser Generators depends on the specific implementation and usage scenarios. However, generally speaking, Parser Generators tend to generate efficient parser code, which may perform better when handling large grammars and complex inputs. On the other hand, Parser Combinators typically incur some performance overhead because they dynamically combine parsers at runtime.

However, currently in Rust, the performance of nom (based on Parser Combinators) is somewhat higher than that of pest (based on PEG).

Rust Parser Library#

Next, we will introduce classic third-party libraries used to implement parsers in Rust, namely Pest (based on PEG) and Nom (based on Parser Combinators).

pest#

Introduction#

Pest is a general-purpose parser written in Rust, focusing on accessibility, correctness, and performance. It uses PEG as input, providing enhanced expressive power needed to parse complex languages while defining and generating parsers in a concise and elegant manner.

It also features automatic error reporting, automatic generation of parser trait implementations through derive attributes, and the ability to define multiple parsers in a single file.

Usage Example#

  1. Add the pest dependency in Cargo.toml
[dependencies]
pest = "2.6"
pest_derive = "2.6"
  2. Create a new src/grammar.pest file to write the parsing expression grammar

This grammar defines the parsing rule field: each character is an ASCII digit, a decimal point, or a minus sign, and the + indicates that this pattern occurs one or more times.

field = { (ASCII_DIGIT | "." | "-")+ }
  3. Create a new src/parser.rs file to define the parser

The following code defines a struct MyParser; through the derive macro, a parser satisfying the grammar file's rules is generated automatically every time the file is compiled. (Note that the struct must not itself be named Parser, since the pest::Parser trait needs to be in scope to call parse.)

use pest_derive::Parser;

#[derive(Parser)]
#[grammar = "grammar.pest"]
pub struct MyParser;

// Every time this file is compiled, pest regenerates the parser from the grammar file
#[cfg(test)]
mod test {
    use pest::Parser;

    use crate::{MyParser, Rule};

    #[test]
    pub fn test_parse() {
        let successful_parse = MyParser::parse(Rule::field, "-273.15");
        println!("{:?}", successful_parse);

        let unsuccessful_parse = MyParser::parse(Rule::field, "China");
        println!("{:?}", unsuccessful_parse);
    }
}

Specific Usage#

Official Documentation

Parser API#

Pest provides various methods to access successful parsing results. Below, we will introduce its methods according to the following syntax example:

number = { ASCII_DIGIT+ }                // one or more decimal digits
enclosed = { "(.." ~ number ~ "..)" }    // for instance, "(..1024..)"
sum = { number ~ " + " ~ number }        // for instance, "1024 + 12"
  1. Tokens

Pest uses tokens to represent success, generating two tokens each time a rule matches, representing the start and end of the match, such as:

"3130 abc"
 |   ^ end(number)
 ^ start(number)

Currently, RustRover has a plugin supporting the pest format, which can validate rules and view tokens, among other features.

  2. Nested Rules

If a named rule contains another named rule, tokens will be generated for both, as shown below:

"(..6472..)"
 |  |   |  ^ end(enclosed)
 |  |   ^ end(number)
 |  ^ start(number)
 ^ start(enclosed)

At the same time, in certain scenarios, tokens may not appear at different character positions:

"1773 + 1362"
 |   |  |   ^ end(sum)
 |   |  |   ^ end(number)
 |   |  ^ start(number)
 |   ^ end(number)
 ^ start(number)
 ^ start(sum)
  3. Interface

Tokens will be exposed in the form of a Token enum, which has Start and End variants, allowing you to call tokens on the parsing result to get an iterator:

let parse_result = DemoParser::parse(Rule::sum, "1773 + 1362").unwrap();
let tokens = parse_result.tokens();

for token in tokens {
    println!("{:?}", token);
}


  4. Pairs

If considering matching pairs of tokens to explore the parsing tree, Pest provides the Pair type to represent a pair of matched tokens, mainly used as follows:

  • Determine which rule produced the Pair
  • Use Pair as the original &str
  • Check the internal named rules that generated the Pair
let pair = DemoParser::parse(Rule::enclosed, "(..6472..) and more text")
    .unwrap().next().unwrap();

assert_eq!(pair.as_rule(), Rule::enclosed);
assert_eq!(pair.as_str(), "(..6472..)");

let inner_rules = pair.into_inner();
println!("{}", inner_rules); // --> [number(3, 7)]

Pairs can have any number of internal rules, and you can return Pairs as an iterator through Pair::into_inner():

let pairs = DemoParser::parse(Rule::sum, "1773 + 1362")
    .unwrap().next().unwrap()
    .into_inner();

let numbers = pairs
    .clone()
    .map(|pair| str::parse(pair.as_str()).unwrap())
    .collect::<Vec<i32>>();
assert_eq!(vec![1773, 1362], numbers);

for (found, expected) in pairs.zip(vec!["1773", "1362"]) {
    assert_eq!(Rule::number, found.as_rule());
    assert_eq!(expected, found.as_str());
}
  5. Parse Method

The derived Parser provides a parse method that returns Result<Pairs, Error>. To access the underlying parsing tree, you need to match or unwrap the result:

// check whether parse was successful
match DemoParser::parse(Rule::enclosed, "(..6472..)") {
    Ok(mut pairs) => {
        let enclosed = pairs.next().unwrap();
        // ...
    }
    Err(error) => {
        // ...
    }
}

Parsing Expression Syntax#

The basic logic of PEG syntax is actually very simple and straightforward, which can be summarized in three steps:

  • Try to match the rule
  • If successful, try the next step
  • If failed, try another rule

The characteristics of its syntax mainly include the following four points:

  1. Eagerness

When running a repeated PEG expression on an input string, it will greedily (as many times as possible) run the expression, resulting in the following:

  • If the match is successful, it will consume any content it matched and pass the remaining input to the next step of the parser.
  • If the match fails, it will not consume any characters, and if that failure propagates, it will ultimately lead to a parsing failure unless the failure is caught during propagation.
// Expression
ASCII_DIGIT+      // one or more characters from '0' to '9'

// Matching process
"42 boxes"
 ^ Running ASCII_DIGIT+

"42 boxes"
   ^ Successfully took one or more digits!

" boxes"
 ^ Remaining unparsed input.
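The eager behavior can be mimicked in plain Rust (a sketch of the idea, not pest's actual implementation): consume the longest possible run of digits, and hand whatever remains to the next step without ever backtracking.

```rust
// Sketch of an eager digit matcher: take as many leading digits as possible.
// Returns (remaining input, matched digits); for a `+` rule, an empty match means failure.
fn take_digits(input: &str) -> (&str, &str) {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    (&input[end..], &input[..end])
}
```

On `"42 boxes"` this returns `(" boxes", "42")`: the digits are consumed greedily and the rest of the input is passed on.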
  2. Ordered Choice

The syntax includes an ordered choice operator |, for example, one|two, which means first try the former one, and if it fails, then try the latter one.

If there is a requirement for order, it is important to pay attention to the placement of rules in the expression. For example:

  • In the expression "a"|"ab", when matching the string "abc", hitting the earlier rule "a" will prevent parsing of the subsequent "bc".

Therefore, when writing a choice parser, it is usually best to place the longest or most specific choice first and the shortest or most general choice last.
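To see why order matters, here is a hypothetical hand-rolled ordered choice over string prefixes (not pest's implementation): alternatives are tried left to right, and the first success is committed to.

```rust
// Try each alternative in order and commit to the first one that matches,
// returning (remaining input, matched text).
fn ordered_choice<'a>(input: &'a str, alternatives: &[&str]) -> Option<(&'a str, &'a str)> {
    for alt in alternatives {
        if input.starts_with(alt) {
            return Some((&input[alt.len()..], &input[..alt.len()]));
        }
    }
    None
}
```

With `&["a", "ab"]` on `"abc"` the choice commits to `"a"` and leaves `"bc"` behind; placing the longer `"ab"` first instead matches `"ab"` and leaves `"c"`.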

  3. Non-backtracking

During the parsing process, an expression either succeeds or fails.

If successful, it proceeds to the next step; if it fails, the expression will fail, and the engine will not backtrack to try again, which is different from regular expressions that have backtracking.

For example, in the following case (where ~ indicates that the next step will be executed after the preceding rule matches successfully):

word = {     // to recognize a word...
    ANY*     //   take any character, zero or more times...
    ~ ANY    //   followed by any character
}

"frumious"

When matching the string "frumious", ANY* will first consume the entire string, while the next step ANY will not match any content, causing it to fail.

"frumious"
 ^ (word)

"frumious"
         ^ (ANY*) Success! Continue to ANY with remaining input "".
 
""
 ^ (ANY) Failure! Expected one character, but found end of string.

In this scenario, for systems with backtracking (like regular expressions), it would backtrack one step, "spitting out" a character and trying again.

  4. Unambiguous

Each rule in PEG runs on the remaining part of the input string, consuming as much input as possible. Once a rule completes, the remaining input will be passed to other parts of the parser. For example, the expression ASCII_DIGIT+ indicates matching one or more digits, which will always match the largest possible contiguous sequence of digits. There is no unexpected backtracking that allows later rules to "steal" some digits in a non-intuitive and non-local manner.

This sharply contrasts with other parsing tools, such as regular expressions and CFG, where the results of rules often depend on some distant code.

Parser Syntax & Built-in Rules#

  1. Important Syntax

Pest has significantly fewer syntax constructs than regular expressions. Below is a brief display of the main syntax and meanings; for more details, you can look them up yourself:

| Syntax | Meaning | Syntax | Meaning |
|---|---|---|---|
| foo = { ... } | regular rule | baz = @{ ... } | atomic |
| bar = _{ ... } | silent | qux = ${ ... } | compound-atomic |
| #tag = ... | tags | plugh = !{ ... } | non-atomic |
| "abc" | exact string | ^"abc" | case insensitive |
| 'a'..'z' | character range | ANY | any character |
| foo ~ bar | sequence | baz \| qux | ordered choice |
| foo* | zero or more | bar+ | one or more |
| baz? | optional | qux{n} | exactly n |
| qux{m, n} | between m and n (inclusive) | &foo | positive predicate |
| PUSH(baz) | match and push | !bar | negative predicate |
| POP | match and pop | PEEK | match without pop |
| DROP | pop without matching | PEEK_ALL | match entire stack |
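As a hypothetical illustration combining several of these constructs (the rule names are made up, not from any real grammar):

```
ident   = @{ ASCII_ALPHA+ }                       // atomic: one or more letters
list    =  { "[" ~ ident ~ ("," ~ ident)* ~ "]" } // sequence + repetition
keyword =  { !"end" ~ ident }                     // negative predicate: an ident that is not "end"
```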
  2. Built-in Rules

In addition to ANY, Pest also provides many built-in rules to make parsing text more convenient. Here are a few commonly used ones; for more details, you can look it up yourself:

| Built-in Rule | Equivalent | Built-in Rule | Equivalent |
|---|---|---|---|
| ASCII_DIGIT | '0'..'9' | ASCII_ALPHANUMERIC | any digit or letter (ASCII_DIGIT \| ASCII_ALPHA) |
| UPPERCASE_LETTER | uppercase letter | NEWLINE | any line-feed format ("\n" \| "\r\n" \| "\r") |
| LOWERCASE_LETTER | lowercase letter | SPACE_SEPARATOR | space separator |
| MATH_SYMBOL | math symbol | EMOJI | emoji |

nom#

Introduction#

Nom is a parser combinator library written in Rust, which has the following features:

  • Build safe parsers without compromising speed or memory consumption.
  • Rely on Rust's powerful type system and memory safety to generate both correct and efficient parsers.
  • Provide functions, macros, and traits that abstract away the most error-prone plumbing, while also allowing for easy combination and reuse of parsers to build complex parsers.

Nom can support a wide range of application scenarios, such as the following common scenarios:

  • Binary format parsers: The performance of Nom is as fast as parsers written in C, and it is not susceptible to buffer overflow vulnerabilities, with common processing patterns built-in.
  • Text format parsers: Capable of handling formats like CSV and more complex nested formats like JSON, it can not only manage data but also has multiple useful tools built-in.
  • Programming language parsers: Nom can serve as a prototype parser for languages, supporting custom error types and reports, automatically handling whitespace, and constructing AST in place.
  • In addition to the above scenarios, it also supports streaming formats (like HTTP network processing) and parser combinators that are more in line with software engineering.

Usage Example#

Here, we will introduce the "Hex Color Parser" example provided in the nom repo README:

The specific format of hex colors is:

  • Starts with "#", followed by six characters, with every two characters representing the values of the red, green, and blue color channels.

For example, "#2F14DF" means "2F" represents the value of the red channel, "14" represents the value of the green channel, and "DF" represents the value of the blue channel.

  1. Add the nom dependency in Cargo.toml
[dependencies]
nom = "7.1.3"
  2. Create a new src/nom/hex_color.rs file and use nom to construct the hex color parsing method hex_color
  • tag matches a fixed starting pattern: tag("#") returns a function whose return value is IResult<Input, Input, Error>
    • Here, Input is the type of the parser's input; on success the first value is the input with the matched pattern removed, the second is the matched content, and the last is the error value.
  • The take_while_m_n method provided by nom takes the minimum and maximum match counts as its first two parameters and the matching rule as the last, returning a result in the same shape as above.
  • The map_res method provided by nom applies the conversion function given as its second argument to the output of the parser given as its first argument.
  • The tuple method provided by nom accepts a group of combinators, applies them to the input in order, and returns the parsing results as a tuple.
use nom::{AsChar, IResult};
use nom::bytes::complete::tag;
use nom::bytes::complete::take_while_m_n;
use nom::combinator::map_res;
use nom::sequence::tuple;

#[derive(Debug, PartialEq)]
pub struct Color {
    pub red: u8,
    pub green: u8,
    pub blue: u8,
}


// Check if it is a hex digit
pub fn is_hex_digit(c: char) -> bool {
    c.is_hex_digit()
}

// Convert string to decimal result
pub fn to_num(input: &str) -> Result<u8, std::num::ParseIntError> {
    u8::from_str_radix(input, 16)
}

// Match exactly two characters according to the is_hex_digit rule and convert the result to decimal using to_num
pub fn hex_primary(input: &str) -> IResult<&str, u8> {
    map_res(
        take_while_m_n(2, 2, is_hex_digit),
        to_num,
    )(input)
}

// Hex color parser
pub fn hex_color(input: &str) -> IResult<&str, Color> {
    let (input, _) = tag("#")(input)?;
    let (input, (red, green, blue)) = tuple((hex_primary, hex_primary, hex_primary))(input)?;

    Ok((input, Color { red, green, blue }))
}

#[cfg(test)]
mod test {
    use super::*;

    #[test]
    fn test_hex_color() {
        assert_eq!(hex_color("#2F14DF"), Ok(("", Color {
            red: 47,
            green: 20,
            blue: 223,
        })))
    }
}

Specific Usage#

Parser Result#

In the previous example, the return type of the nom parsing method is IResult, which is one of the core structures of nom, representing the return result of nom parsing.

First, the results defined by the parsers constructed by nom are as follows:

  • Ok(...) indicates the content found after successful parsing, while Err(...) indicates that parsing failed.
  • If parsing is successful, it returns a tuple whose first element is the input the parser did not consume and whose second element is the parsed output.
  • If parsing fails, an error is returned; nom distinguishes several error variants, which are introduced below.
                                   ┌─► Ok(
                                   │      what the parser didn't touch,
                                   │      what the parser matched
                                   │   )
             ┌─────────┐           │
 my input───►│my parser├──►either──┤
             └─────────┘           └─► Err(...)

Thus, to represent this model, nom defines the structure IResult<Input, Output, Error>:

  • It can be seen that Input and Output can actually be defined as different types, while Error can be any type that implements the ParseError trait.

Tag & Character Classes#

  1. Tags

Nom refers to simple byte sets as tags. Since these are very common, it also has a built-in tag() function that returns a parser for the given string.

For example, if you want to parse the string "abc", you can use tag("abc").

It should be noted that nom has multiple different definitions of tag, and unless otherwise specified, it is usually best to use the following definition to avoid unexpected errors:

pub use nom::bytes::complete::tag;

The signature of the tag function is as follows, showing that tag returns a function, and that function is a parser that takes &str and returns IResult:

At the same time, here, the function that creates the parser returns its parser function, which is a common pattern in nom.

pub fn tag<T, Input, Error: ParseError<Input>>(
    tag: T
) -> impl Fn(Input) -> IResult<Input, Input, Error> where
    Input: InputTake + Compare<T>,
    T: InputLength + Clone, 

Here is an example of a function that implements the use of tag:

use nom::bytes::complete::tag;
use nom::IResult;

pub fn parse_input(input: &str) -> IResult<&str, &str> {
    tag("abc")(input)
}

#[cfg(test)]
mod test {
    use super::*;
    #[test]
    fn test_parse_input() {
        let (leftover_input, output) = parse_input("abcWorld!").unwrap();
        assert_eq!(leftover_input, "World!");
        assert_eq!(output, "abc");
        assert!(parse_input("defWorld").is_err());
    }
}
  2. Character Classes

Considering that tag can only be used for characters in the starting sequence, nom provides another pre-written parser known as character classes, which allows for accepting any one of a group of characters. Below are some commonly used built-in parsers:

| Parser | Function | Parser | Function |
|---|---|---|---|
| alpha0/alpha1 | recognize zero or more letters; the `1` variant requires at least one character | multispace0/multispace1 | recognize zero or more spaces, tabs, carriage returns, and line feeds; the `1` variant requires at least one character |
| alphanumeric0/alphanumeric1 | recognize zero or more digits or letters; the `1` variant requires at least one character | space0/space1 | recognize zero or more spaces and tabs; the `1` variant requires at least one character |
| digit0/digit1 | recognize zero or more digits; the `1` variant requires at least one character | newline | recognize a line-feed character |

Here is a simple example to see how it is used:

use nom::character::complete::alpha0;
use nom::IResult;

fn parse_alpha(input: &str) -> IResult<&str, &str> {
    alpha0(input)
}

#[test]
fn test_parse_alpha() {
    let (remaining, letters) = parse_alpha("abc123").unwrap();
    assert_eq!(remaining, "123");
    assert_eq!(letters, "abc");
}

Alternatives & Composition#

  1. Alternatives

Nom provides the alt() combinator to satisfy the choice of multiple parsers, which will execute each parser in the tuple until it finds a successful match.

If all parsers in the tuple fail to parse, then an error will be returned.

Below is a simple example to illustrate this:

use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::IResult;

fn parse_abc_or_def(input: &str) -> IResult<&str, &str> {
    alt((
        tag("abc"),
        tag("def"),
    ))(input)
}

#[test]
fn test_parse_abc_or_def() {
    let (leftover_input, output) = parse_abc_or_def("abcWorld").unwrap();
    assert_eq!(leftover_input, "World");
    assert_eq!(output, "abc");
    let (_, output) = parse_abc_or_def("defWorld").unwrap();
    assert_eq!(output, "def");
    assert!(parse_abc_or_def("ghiWorld").is_err());
}
  2. Composition

In addition to the choice of multiple parsers, combining parsers is also a very common requirement, so nom provides built-in combinators.

For example, tuple(), which takes a tuple of parsers and returns Ok along with all successfully parsed results as a tuple, or returns the first failing Err parser.

use nom::branch::alt;
use nom::bytes::complete::tag_no_case;
use nom::IResult;
use nom::sequence::tuple;

fn parse_base(input: &str) -> IResult<&str, &str> {
    alt((
        tag_no_case("a"), // Case-insensitive tag compared to tag
        tag_no_case("t"),
        tag_no_case("c"),
        tag_no_case("g"),
    ))(input)
}

fn parse_pair(input: &str) -> IResult<&str, (&str, &str)> {
    tuple((
        parse_base, parse_base
    ))(input)
}

#[test]
fn test_parse_pair() {
    let (remaining, parsed) = parse_pair("aTcG").unwrap();
    assert_eq!(parsed, ("a", "T"));
    assert_eq!(remaining, "cG");
    assert!(parse_pair("Dct").is_err());
}

In addition to what has been mentioned, nom also provides the following sequence combinators with similar behavior:

| Combinator | Usage | Input | Output |
|---|---|---|---|
| delimited | delimited(char('('), take(2), char(')')) | "(ab)cd" | Ok(("cd", "ab")) |
| preceded | preceded(tag("ab"), tag("XY")) | "abXYZ" | Ok(("Z", "XY")) |
| terminated | terminated(tag("ab"), tag("XY")) | "abXYZ" | Ok(("Z", "ab")) |
| pair | pair(tag("ab"), tag("XY")) | "abXYZ" | Ok(("Z", ("ab", "XY"))) |
| separated_pair | separated_pair(tag("hello"), char(','), tag("world")) | "hello,world!" | Ok(("!", ("hello", "world"))) |

Parsers With Custom Return Types#

As mentioned, the Input and Output in IResult can actually be different types. If we want to convert the results of tags into specific values, we can use the nom provided value combinator to convert the successfully parsed results into specific values. For example, in the following example:

use nom::branch::alt;
use nom::bytes::complete::tag;
use nom::combinator::value;
use nom::IResult;

fn parse_bool(input: &str) -> IResult<&str, bool> {
    alt((
        value(true, tag("true")),    // Convert to bool type
        value(false, tag("false")),
    ))(input)
}

#[test]
fn test_parse_bool() {
    let (remaining, parsed) = parse_bool("true|false").unwrap();
    assert_eq!(parsed, true);
    assert_eq!(remaining, "|false");
    assert!(parse_bool(remaining).is_err());
}

Repeating Predicates and Parsers#

  1. Repeating With Predicates

The predicates here act like the condition of a while loop. To support repeatedly consuming input while a condition holds, nom provides several categories of predicate-based parsers, mainly take_till, take_while, and take_until:

| Combinator | Function | Usage | Input | Output |
|---|---|---|---|---|
| take_while | continuously consume input as long as the predicate holds | take_while(is_alphabetic) | "abc123" | Ok(("123", "abc")) |
| take_till | continuously consume input until the predicate holds | take_till(is_alphabetic) | "123abc" | Ok(("abc", "123")) |
| take_until | consume input until the first occurrence of the given literal | take_until("world") | "Hello world" | Ok(("world", "Hello ")) |

Here, we can also add some:

  • The combinators mentioned above actually have "twins," that is, names ending with 1, which differ in that they require at least one matching character to be returned, otherwise an error will occur.
  • The take_while_m_n used earlier is actually similar to take_while, serving as a special case that guarantees consuming [m,n] bytes.
  2. Repeating Parsers

In addition to repeating predicates for single parsers, nom also provides combinators for repeating parsers, such as many0, which can apply a parser as many times as possible and return a vector of these parsing results, as shown in the following example:

use nom::bytes::complete::tag;
use nom::IResult;
use nom::multi::many0;

fn repeat_parser(s: &str) -> IResult<&str, Vec<&str>> {
    many0(tag("abc"))(s)
}

#[test]
fn test_repeat_parser() {
    assert_eq!(repeat_parser("abcabc"), Ok(("", vec!["abc", "abc"])));
    assert_eq!(repeat_parser("abc123"), Ok(("123", vec!["abc"])));
    assert_eq!(repeat_parser("123123"), Ok(("123123", vec![])));
    assert_eq!(repeat_parser(""), Ok(("", vec![])));
}

Below are some commonly used combinators:

| Combinator | Usage | Input | Output |
|---|---|---|---|
| count | count(take(2), 3) | "abcdefgh" | Ok(("gh", vec!["ab", "cd", "ef"])) |
| many0 | many0(tag("ab")) | "abababc" | Ok(("c", vec!["ab", "ab", "ab"])) |
| many_m_n | many_m_n(1, 3, tag("ab")) | "ababc" | Ok(("c", vec!["ab", "ab"])) |
| many_till | many_till(tag("ab"), tag("ef")) | "ababefg" | Ok(("g", (vec!["ab", "ab"], "ef"))) |
| separated_list0 | separated_list0(tag(","), tag("ab")) | "ab,ab,ab." | Ok((".", vec!["ab", "ab", "ab"])) |
| fold_many0 | fold_many0(be_u8, \|\| 0, \|acc, item\| acc + item) | (byte input) | folds all parsed values with the accumulator |
| fold_many_m_n | fold_many_m_n(1, 2, be_u8, \|\| 0, \|acc, item\| acc + item) | (byte input) | folds between 1 and 2 parsed values with the accumulator |
| length_count | length_count(number, tag("ab")) | "2ababab" | Ok(("ab", vec!["ab", "ab"])) |

Error Management#

Nom's errors are designed with multiple needs in mind:

  • Indicate which parser failed and the position in the input data.
  • Accumulate more context as errors propagate up the parser chain.
  • Have very low overhead since calling parsers typically discards errors.
  • Can be modified according to user needs, as some languages require more information.

To meet these needs, the result type of nom parsers is designed as follows:

pub type IResult<I, O, E=nom::error::Error<I>> = Result<(I, O), nom::Err<E>>;

pub enum Err<E> {
    Incomplete(Needed),    // Indicates that the parser does not have enough data to make a decision, usually encountered in streaming scenarios
    Error(E),              // Normal parser error, such as when a child parser of the alt combinator returns Error, it will try another child parser
    Failure(E),            // Unrecoverable error, such as when a child parser returns Failure, the alt combinator will not try other branches
}
  1. Common Error Types in nom::Err<E>
  • The default error type nom::error::Error, which will return the specific parser's error and the position of the error in the input.

      #[derive(Debug, PartialEq)]
      pub struct Error<I> {
        /// position of the error in the input data
        pub input: I,
        /// nom error code
        pub code: ErrorKind,
      }
    
    • This error type is fast and has low overhead, making it suitable for parsers that are called repeatedly, but it is also limited in functionality, such as not returning call chain information.
  • To obtain more information, nom::error::VerboseError will return more information about the error encountered in the parser chain, such as the type of parser, etc.

      #[derive(Clone, Debug, PartialEq)]
      pub struct VerboseError<I> {
        /// List of errors accumulated by `VerboseError`, containing the affected
        /// part of input data, and some context
        pub errors: crate::lib::std::vec::Vec<(I, VerboseErrorKind)>,
      }
      
      #[derive(Clone, Debug, PartialEq)]
      /// Error context for `VerboseError`
      pub enum VerboseErrorKind {
        /// Static string added by the `context` function
        Context(&'static str),
        /// Indicates which character was expected by the `char` function
        Char(char),
        /// Error kind given by various nom parsers
        Nom(ErrorKind),
      }
    
    • By viewing the original input and the error chain, more user-friendly error messages can be constructed. The nom::error::convert_error function can be used to build such messages.
  2. Custom Error Types via the ParseError Trait

You can define your own error types by implementing the ParseError<I> trait.

Since all nom combinators are generic over the error type, you only need to name it in the parser's result type, and it will be used throughout the parser chain.

pub trait ParseError<I>: Sized {
    // Indicate which parser encountered an error based on the input position and ErrorKind enum
    fn from_error_kind(input: I, kind: ErrorKind) -> Self;
    // Allow creating a series of errors when backtracking the parser tree (various combinators will add more context)
    fn append(input: I, kind: ErrorKind, other: Self) -> Self;
    // Create an error indicating which character was expected
    fn from_char(input: I, _: char) -> Self {
        Self::from_error_kind(input, ErrorKind::Char)
    }
    // Allow choosing (or accumulating) between errors from various branches in combinators like alt
    fn or(self, other: Self) -> Self {
        other
    }
}

You can also implement the ContextError trait to support the context() combinator, which is how VerboseError<I> accumulates its context strings.

Below is a simple example to introduce its usage, where a debug error type is defined to print additional information each time an error is generated:

use nom::error::{ContextError, ErrorKind, ParseError};

#[derive(Debug)]
struct DebugError {
    message: String,
}

impl ParseError<&str> for DebugError {
    // Print the specific type of parser error
    fn from_error_kind(input: &str, kind: ErrorKind) -> Self {
        let message = format!("【{:?}】:\t{:?}\n", kind, input);
        println!("{}", message);
        DebugError { message }
    }

    // If encountering multiple errors in combinators, print additional context information
    fn append(input: &str, kind: ErrorKind, other: Self) -> Self {
        let message = format!("【{}{:?}】:\t{:?}\n", other.message, kind, input);
        println!("{}", message);
        DebugError { message }
    }

    // Print the specific expected character
    fn from_char(input: &str, c: char) -> Self {
        let message = format!("【{}】:\t{:?}\n", c, input);
        print!("{}", message);
        DebugError { message }
    }

    fn or(self, other: Self) -> Self {
        let message = format!("{}\tOR\n{}\n", self.message, other.message);
        println!("{}", message);
        DebugError { message }
    }
}

impl ContextError<&str> for DebugError {
    fn add_context(_input: &str, _ctx: &'static str, other: Self) -> Self {
        let message = format!("【{}「{}」】:\t{:?}\n", other.message, _ctx, _input);
        print!("{}", message);
        DebugError { message }
    }
}
  3. Debugging Parsers

While writing parsers, if you need to trace their execution, you can wrap a parser with the dbg_dmp function, which prints the error and a hex dump of the input whenever the wrapped parser fails:

use nom::bytes::complete::tag;
use nom::{dbg_dmp, IResult};

fn f(i: &[u8]) -> IResult<&[u8], &[u8]> {
    dbg_dmp(tag("abcd"), "tag")(i)
}

let a = &b"efghijkl"[..];

// Will print the following message:
// tag: Error(Error(Error { input: [101, 102, 103, 104, 105, 106, 107, 108], code: Tag })) at:
// 00000000        65 66 67 68 69 6a 6b 6c         efghijkl
f(a);

Summary#

Through this article, we have gained a basic understanding of the prerequisite concepts behind parsers (Parser, PEG, and Parser Combinator) and of how to implement parsers in Rust with two classic third-party libraries, pest and nom. Whether you choose pest, based on PEG, or nom, based on parser combinators, both cover the common scenarios for custom parsers, so there is no need to write one from scratch, which significantly reduces implementation cost. For more demanding situations (performance, implementation effort, documentation quality, etc.), you should pick the library that best fits your specific scenario.

In the next article, we will use both pest and nom to implement several common parsers to better grasp the concept of parsers.

References#

https://zhuanlan.zhihu.com/p/427767002

https://zh.wikipedia.org/wiki/%E8%A7%A3%E6%9E%90%E8%A1%A8%E8%BE%BE%E6%96%87%E6%B3%95

https://zhuanlan.zhihu.com/p/355364928

https://ohmyweekly.github.io/notes/2021-01-20-pest-grammars/#

https://pest.rs/book/parser_api.html

https://rustmagazine.github.io/rust_magazine_2021/chapter_4/nom_url.html

https://tfpk.github.io/nominomicon/chapter_1.html
