Skip to content

Crafting Interpreters Part Ⅱ

Posted on:2021.08.20

TOC

Open TOC

A TREE-WALK INTERPRETER

Statements and State

状态和语句是相辅相成的。由于无法对语句求值,因此它们需要做一些其他事情才能有用。这种东西叫做副作用

在这一部分里,我们将要扩展我们解释器的功能:

初探语句

Pascal is an outlier. It distinguishes between procedures and functions. Functions return values, but procedures cannot. There is a statement form for calling a procedure, but functions can only be called where an expression is expected. There are no expression statements in Pascal.

为了加入表达式语句和打印语句,我们需要先定义对应的语法规则:

program → statement* EOF ;
statement → exprStmt
| printStmt ;
exprStmt → expression ";" ;
printStmt → "print" expression ";" ;

注意到语句和表达式的语法是分离的,也就是说语法中没有一个地方同时允许出现表达式和语句。因此,我们简单的为语句的语法树新建一个类 Stmt,使用 GenerateAst 生成:

defineAst(outputDir, "Stmt", Arrays.asList(
"Expression : Expr expression",
"Print : Expr expression"
));

接下来的工作是修改 Parser 类,主要内容是修改对外接口方法 parse()(返回值变成语句的集合)和添加几个新的方法(因为语法树的的最顶层从表达式变成了语句)。

语法的工作结束后,剩下的内容便是语义的工作。我们让 Interpreter 类实现接口 Stmt.Visitor<Void>,由于语句没有值的概念,所以这里的的泛型类型为 Voidvoid 类型的包装类)。并在 Interpreter 类中添加几个新的方法,对于表达式语句,我们在求值表达式后简单的丢弃它的值;对于打印语句,我们使用 stringify 方法显示表达式的值。类似的,我们需要修改 Interpreter 类对外的接口方法 interpret()(参数变成语句的集合)。

变量声明

You could make a language that treats variable declarations as expressions that both create a binding and produce a value. The only language I know that does that is Tcl. Scheme seems like a contender, but note that after a let expression is evaluated, the variable it bound is forgotten. The define syntax is not an expression.

简单起见,我们这里的变量指的是全局变量。引入全局变量需要三个东西:

由于变量声明语句并不是能出现在所有语句可以出现的地方,例如在 if 语句的分句中不能出现变量声明语句(我们这样规定),于是某种程度上语句也有所谓优先级之分。

Some places where a statement is allowed - like inside a block or at the top level - allow any kind of statement, including declarations. Others allow only the “higher” precedence statements that don’t declare names.

In this analogy, block statements work sort of like parentheses do for expressions. A block is itself in the “higher” precedence level and can be used anywhere, like in the clauses of an if statement. But the statements it contains can be lower precedence. You’re allowed to declare variables and other names inside the block. The curlies let you escape back into the full statement grammar from a place where only some statements are allowed.

我们调整语句的语法规则:

program → declaration* EOF ;
declaration → varDecl
| statement ;
statement → exprStmt
| printStmt ;
varDecl → "var" IDENTIFIER ( "=" expression )? ";" ;
exprStmt → expression ";" ;
printStmt → "print" expression ";" ;

这就引入了变量声明语句,为了引入变量表达式,我们调整语表达式的语法规则:

primary → "true" | "false" | "nil"
| NUMBER | STRING
| "(" expression ")"
| IDENTIFIER ;

使用 GenerateAst 生成这些变化,在 Stmt 中添加变量声明语句:

"Var : Token name, Expr initializer"

Expr 中添加变量表达式:

"Variable : Token name"

接下来的工作是修改 Parser 类,我们需要添加一个新的方法 declaration

private Stmt declaration() {
try {
if (match(VAR)) return varDeclaration();
return statement();
} catch (ParseError error) {
synchronize();
return null;
}
}

注意这里使用了上上一章中定义的 synchronize() 方法,回顾一下:

private void synchronize() {
advance();
while (!isAtEnd()) {
if (previous().type == SEMICOLON) return;
switch (peek().type) {
case CLASS:
case FUN:
case VAR:
case FOR:
case IF:
case WHILE:
case PRINT:
case RETURN:
return;
}
advance();
}
}

如果输入为:

var a = ;
var b = 2;

使用 synchronize() 方法,我们可以得到第二条语句的语法树,而非在第一条语句解析错误后停止解析。

语法的工作结束后,剩下的内容便是语义的工作。在此之前,我们需要定义环境的概念。

简单来说,环境就是一个从变量名到实际引用对象的映射表。我们新建一个类 Environment,其中有一个 HashMap,它使用字符串作为键,而不是 token,这是为了确保所有同名的 tokens 都映射到相同的对象。

目前,我们要为 Environment 类添加两个方法,一个是 setter,一个是 getter。首先谈谈 setter,这里有一个小小的细节,我们允许重定义变量,而不是报错,也就是说,下面的代码是可以运行的:

var a = "before";
print a; // "before".
var a = "after";
print a; // "after".

这是为了用户在使用 REPL 时不用记住自己所有定义过的变量名。

My rule about variables and scoping is, “When in doubt, do what Scheme does”. The Scheme folks have probably spent more time thinking about variable scope than we ever will - one of the main goals of Scheme was to introduce lexical scoping to the world - so it’s hard to go wrong if you follow in their footsteps.

Scheme allows redefining variables at the top level.

下面是 getter,若访问了未定义的变量,我们以下三种选择:

先放弃第三种选择🤣,所以关键是前两种选择,这里涉及到一个认识:使用变量并不等同于引用变量。如果我们选择让其成为语法错误,那么定义递归函数就会变得很困难:

We could accommodate single recursion - a function that calls itself - by declaring the function’s own name before we examine its body. But that doesn’t help with mutually recursive procedures that call each other. Consider:

fun isOdd(n) {
if (n == 0) return false;
return isEven(n - 1);
}
fun isEven(n) {
if (n == 0) return true;
return isOdd(n - 1);
}

The isEven() function isn’t defined by the time we are looking at the body of isOdd() where it’s called. If we swap the order of the two functions, then isOdd() isn’t defined when we’re looking at isEven()’s body.

Some statically typed languages like Java and C# solve this by specifying that the top level of a program isn’t a sequence of imperative statements. Instead, a program is a set of declarations which all come into being simultaneously. The implementation declares all of the names before looking at the bodies of any of the functions.

Older languages like C and Pascal don’t work like this. Instead, they force you to add explicit forward declarations to declare a name before it’s fully defined. That was a concession to the limited computing power at the time. They wanted to be able to compile a source file in one single pass through the text, so those compilers couldn’t gather up all of the declarations first before processing function bodies.

因此我们最终的选择是让其成为运行时错误,也就是说,只要不使用变量(对变量求值),就可以在定义变量之前引用它。

在得到了环境后,我们就可以 Interpreter 类中定义变量声明语句和变量表达式的语义了。需要注意的有两点:

赋值表达式

赋值表达式使变量可变,改变一个变量是赋值表达式的副作用,而赋值表达式的值便是改变后变量的值。

In some other languages, like Pascal, Python, and Go, assignment is a statement.

下面是赋值表达式的语法规则:

expression → assignment ;
assignment → IDENTIFIER "=" assignment
| equality ;

若考虑三目运算符和逗号表达式,assignment 层介于这两者之间

使用 GenerateAst 生成:

"Assign : Token name, Expr value"

注意这里左操作数的类型为 Token 而非 Expr,这反映了赋值表达式的左操作数为左值。对应在 Parser 类实现如下(考虑了三目运算符和逗号表达式):

The classic terms for these two constructs are l-value and r-value.

private Expr assignment() {
Expr expr = ternary();
if (match(EQUAL)) {
Token equals = previous();
Expr value = assignment();
if (expr instanceof Expr.Variable) {
Token name = ((Expr.Variable)expr).name;
return new Expr.Assign(name, value);
}
error(equals, "Invalid assignment target.");
}
return expr;
}

如果左操作数不是有效的赋值目标,我们会报告错误,但我们不会抛出它。

目前有效的赋值目标只有变量表达式,之后会扩展。

这里的实现还有一些细节问题。与二元运算符的一个细微区别是,我们不会通过循环来构建相同运算符的序列。由于赋值是右结合的,我们改为递归调用 assignment() 来解析右操作数。

The trick is that right before we create the assignment expression node, we look at the left-hand side expression and figure out what kind of assignment target it is. We convert the r-value expression node into an l-value representation.

This conversion works because it turns out that every valid assignment target happens to also be valid syntax as a normal expression.

You can still use this trick even if there are assignment targets that are not valid expressions. Define a cover grammar, a looser grammar that accepts all of the valid expression and assignment target syntaxes. When you hit an =, report an error if the left-hand side isn’t within the valid assignment target grammar. Conversely, if you don’t hit an =, report an error if the left-hand side isn’t a valid expression.

这里的解释有点怪,下面的旁注也看不懂 🤣。

Way back in the parsing chapter, I said we represent parenthesized expressions in the syntax tree because we’ll need them later. This is why. We need to be able to distinguish these cases:

a = 3; // OK.
(a) = 3; // Error.

语法的工作结束后,剩下的内容便是语义的工作。这里的实现与变量声明语句类似,我们在 Environment 类中添加一个新的方法 assign。只不过赋值不允许创建新变量。否则会出现运行时错误。

Unlike Python and Ruby, Lox doesn’t do implicit variable declaration.

词法作用域

词法作用域又被称为静态作用域,其中代码本身显示了作用域的开始和结束位置。与之对应的是动态作用域,对于动态作用域,在执行代码之前无法知道变量名所引用的对象。Lox 的变量是静态的,但对象上的方法和字段是动态的。

对于动态作用域,另见柯里化的前生今世(五):动态作用域

由于 Lox 的语法类似 C,词法作用域也就是块作用域。

为了实现块作用域,我们需要修改我们的 Environment 类,在类中新增一个字段,指向当前环境的父环境,对于全局环境,该字段为 null

The boring name for this is a parent-pointer tree.

对应的,我们修改类中的方法 getterassign,以 getter 为例:

Object get(Token name) {
if (values.containsKey(name.lexeme)) {
return values.get(name.lexeme);
}
if (enclosing != null) return enclosing.get(name);
throw new RuntimeError(name, "Undefined variable '" + name.lexeme + "'.");
}

从当前环境开始,若能够解析变量名,则返回实际引用的对象,否则递归调用父环境的 getter 方法。这样可以正确处理两种情况,一是当前环境使用了父环境中的对象:

var global = "outside";
{
var local = "inside";
print global + local;
}

二是当前环境的变量名遮蔽了父环境中的变量名:

var volume = "outside";
{
var volume = "inside";
print volume;
}

在做好准备后,我们就可以实现块语句的语法和语义了,首先是语句的语法规则:

program → declaration* EOF ;
declaration → varDecl
| statement ;
statement → exprStmt
| printStmt
| block ;
block → "{" declaration* "}" ;
varDecl → "var" IDENTIFIER ( "=" expression )? ";" ;
exprStmt → expression ";" ;
printStmt → "print" expression ";" ;

使用 GenerateAst 生成:

"Block : List<Stmt> statements"

并对应增改 Parser 类中的方法。对于语义的部分,代码如下:

@Override
public Void visitBlockStmt(Stmt.Block stmt) {
executeBlock(stmt.statements, new Environment(environment));
return null;
}
void executeBlock(List<Stmt> statements,
Environment environment) {
Environment previous = this.environment;
try {
this.environment = environment;
for (Stmt statement : statements) {
execute(statement);
}
} finally {
this.environment = previous;
}
}

还记得我们在 Interpreter 类添加了一个环境的字段吗?构造函数中的 environmentprevious 是父环境,在执行了 this.environment = environment; 后,Interpreter 类的环境变化了,并以此来执行块中的各条语句。执行结束后,也就是块语句结束后,我们再恢复原来的环境。

Manually changing and restoring a mutable environment field feels inelegant. Another classic approach is to explicitly pass the environment as a parameter to each visit method. To “change” the environment, you pass a different one as you recurse down the tree. You don’t have to restore the old one, since the new one lives on the Java stack and is implicitly discarded when the interpreter returns from the block’s visit method.

I considered that for jlox, but it’s kind of tedious and verbose adding an environment parameter to every single visit method. To keep the book a little simpler, I went with the mutable field.

隐式变量声明

当相同的语法可以赋值或创建变量时,每种语言都必须决定在不清楚用户想要哪种行为时会发生什么。特别是,每种语言都必须选择隐式声明变量如何与shadowing交互,以及隐式声明变量的作用域。

Python 中,赋值总是在当前函数的作用域内创建一个变量,即使在函数外声明了一个同名的变量。

a = 1
def f():
a = 2
print(a)
f()

更多的内容详见 Python 拾遗 作用域部分

隐式声明的主要优点是简单。需要学习的语法和“声明”概念更少。用户只需开始分配东西,语言就会计算出来。

较旧的静态类型语言(如 C)受益于显式声明,因为它们为用户提供了一个位置来告诉编译器每个变量的类型以及为其分配多少存储空间。在有垃圾回收机制的动态类型语言中,这并不必要,因此可以使用隐式变量声明。

隐式声明也有一些问题。例如,用户可能打算赋值给现有变量,但可能拼错了它。解释器不知道这一点,所以它继续并默默地创建一些新变量,而用户想要赋值给的变量仍然具有其旧值。

随着时间的推移,带有隐式变量声明的语言最终增加了更多的功能和复杂性来处理这些问题。例如 Python 引入了 globalnonlocal 关键词。

鉴于这些,作者认为简单性的论点大部分都站不住脚了。作者的观点是,隐式声明在过去几年中是有意义的,当时大多数脚本语言都是非常命令式的,代码非常扁平。随着程序员对深度嵌套、函数式编程和闭包越来越熟悉,想要访问外部作用域中的变量变得越来越普遍。这使得用户更有可能遇到棘手的情况,即不清楚他们的赋值是要创建一个新变量还是重用周围的变量。

习题

1

The REPL no longer supports entering a single expression and automatically printing its result value. That’s a drag. Add support to the REPL to let users type in both statements and expressions. If they enter a statement, execute it. If they enter an expression, evaluate it and display the result value.

我们只需要保留对表达式的 parse 方法和 interpret 方法:

Expr parse() {
try {
return expression();
} catch (ParseError error) {
return null;
}
}
void interpret(Expr expression) {
try {
Object value = evaluate(expression);
System.out.println(stringify(value));
} catch (RuntimeError error) {
Lox.runtimeError(error);
}
}

然后在 Lox 类的 runPrompt 中添加分支判断输入的是语句还是表达式。

2

Maybe you want Lox to be a little more explicit about variable initialization. Instead of implicitly initializing variables to nil, make it a runtime error to access a variable that has not been initialized or assigned to, as in:

// No initializers.
var a;
var b;
a = "assigned";
print a; // OK, was assigned first.
print b; // Error!

简单修改getter方法即可:

Object get(Token name) {
if (values.containsKey(name.lexeme)) {
Object res = values.get(name.lexeme);
if (res == null) {
throw new RuntimeError(name, "Uninitialized variable '" + name.lexeme + "'.");
}
return res;
}
if (enclosing != null) return enclosing.get(name);
throw new RuntimeError(name, "Undefined variable '" + name.lexeme + "'.");
}

3

What does the following program do?

var a = 1;
{
var a = a + 2;
print a;
}

What did you expect it to do? Is it what you think it should do? What does analogous code in other languages you are familiar with do? What do you think users will expect this to do?

结果为 3,将父环境中变量引用的值作为初始值,并在当前环境创建了一个同名变量遮蔽了父环境中的变量。

对比 C,会报错使用了未初始化的局部变量 a

#include<stdio.h>
int main() {
int a = 1;
{
int a = a + 2;
printf("%d", a);
}
}

Control Flow

这一部分添加的功能有:

分支

下面是 if 语句的语法规则:

statement → exprStmt
| ifStmt
| printStmt
| block ;
ifStmt → "if" "(" expression ")" statement
( "else" statement )? ;

要注意区分两种括号之间的区分,带引号的是 terminal,不带引号的是 metasyntax

其实这里的左括号没有实际的用途,右括号是为了间隔条件表达式和语句,有些语言则使用 if...then...else 的形式。

使用 GenerateAst 生成:

"If : Expr condition, Stmt thenBranch, Stmt elseBranch"

if 语句没有 else 分句,elseBranch 就是 null

对应在 Parser 类实现较为简单:

private Stmt ifStatement() {
consume(LEFT_PAREN, "Expect '(' after 'if'.");
Expr condition = expression();
consume(RIGHT_PAREN, "Expect ')' after if condition.");
Stmt thenBranch = statement();
Stmt elseBranch = null;
if (match(ELSE)) {
elseBranch = statement();
}
return new Stmt.If(condition, thenBranch, elseBranch);
}

需要注意的是,这种实现解决了 dangling else 的问题,因为在 return 之前就已经完成了 else 分句的匹配,也就是最近匹配

语义的部分更为简单,只是做了简单的包装:

public Void visitIfStmt(Stmt.If stmt) {
if (isTruthy(evaluate(stmt.condition))) {
execute(stmt.thenBranch);
} else if (stmt.elseBranch != null) {
execute(stmt.elseBranch);
}
return null;
}

大多数其他语法树总是求值(或执行)它们的子树。在这里,我们可能不去求值(或执行)thenelse 分句

if 语句结束后,下面是逻辑运算符的实现,也就是 andor。先看优先级,从低到高依次为 assignment 层、ternary 层、or 层、and 层和 equality 层,于是语法规则可以写成(不考虑 ternary 层):

assignment → IDENTIFIER "=" assignment
| logic_or ;
logic_or → logic_and ( "or" logic_and )* ;
logic_and → equality ( "and" equality )* ;

因为逻辑运算符有短路求值,我们为其语法树设计一个新的名字:

"Logical : Expr left, Token operator, Expr right"

对应在 Parser 类实现较为简单(二元运算符),从略。

语义的部分实现如下:

@Override
public Object visitLogicalExpr(Expr.Logical expr) {
Object left = evaluate(expr.left);
if (expr.operator.type == TokenType.OR) {
if (isTruthy(left)) return left;
} else {
if (!isTruthy(left)) return left;
}
return evaluate(expr.right);
}

在这里,我们首先求值左操作数,看其是否可以短路。如果没有,则继续求值右操作数,否则,则直接返回实际值,而非 truefalse

由于 Lox 是动态类型的,我们预先定义每个操作数的真值,所以允许接受或返回任何类型的操作数。

循环

下面是 while 语句的语法规则:

statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
whileStmt → "while" "(" expression ")" statement ;

使用 GenerateAst 生成:

"While : Expr condition, Stmt body"

剩余部分较为简单,从略。

最后只剩下了 for 语句,我们换一种思路处理。实际上,for 语句不过是 while 语句的语法糖。

关于语法糖的评论,可见 DESIGN NOTE: SPOONFULS OF SYNTACTIC SUGAR

对于 for 语句:

for (var i = 0; i < 10; i = i + 1) print i;

与下面的 while 语句几乎等价(加上了 continue 语句就不一定了),注意最外层的括号,因为循环变量的作用域问题:

{
var i = 0;
while (i < 10) {
print i;
i = i + 1;
}
}

于是,我们先定义 for 语句的语法规则:

statement → exprStmt
| forStmt
| ifStmt
| printStmt
| whileStmt
| block ;
forStmt → "for" "(" ( varDecl | exprStmt | ";" )
expression? ";"
expression? ")" statement ;

Inside the parentheses, you have three clauses separated by semicolons:

  1. The first clause is the initializer. It is executed exactly once, before anything else. It’s usually an expression (stmt), but for convenience, we also allow a variable declaration. In that case, the variable is scoped to the rest of the for loop - the other two clauses and the body.
  2. Next is the condition. As in a while loop, this expression controls when to exit the loop. It’s evaluated once at the beginning of each iteration, including the first. If the result is truthy, it executes the loop body. Otherwise, it bails.
  3. The last clause is the increment. It’s an arbitrary expression that does some work at the end of each loop iteration. The result of the expression is discarded, so it must have a side effect to be useful. In practice, it usually increments a variable.

In a previous chapter, I said we can split expression and statement syntax trees into two separate class hierarchies because there’s no single place in the grammar that allows both an expression and a statement. That wasn’t entirely true, I guess.

然后,我们不需要使用 GenerateAst 生成对应的语法树,也不需要对特定的语法树进行 interpret,我们只需要在 Parser 类中将其转化为一个 BlockStmt(就像上面的例子一样),然后 The Visitor pattern 就会自动进行 interpret

这一过程也叫 desugaring

所以重点是下面这个方法:

private Stmt forStatement() {
consume(LEFT_PAREN, "Expect '(' after 'for'.");
Stmt initializer;
if (match(SEMICOLON)) {
initializer = null;
} else if (match(VAR)) {
initializer = varDeclaration();
} else {
initializer = expressionStatement();
}
Expr condition = null;
if (!check(SEMICOLON)) {
condition = expression();
}
consume(SEMICOLON, "Expect ';' after loop condition.");
Expr increment = null;
if (!check(RIGHT_PAREN)) {
increment = expression();
}
consume(RIGHT_PAREN, "Expect ')' after for clauses.");
Stmt body = statement();
if (increment != null) {
body = new Stmt.Block(Arrays.asList(body, new Stmt.Expression(increment)));
}
if (condition == null) condition = new Expr.Literal(true);
body = new Stmt.While(condition, body);
if (initializer != null) {
body = new Stmt.Block(Arrays.asList(initializer, body));
}
return body;
}

总结来说,先正向匹配,再反向组合。

Most modern languages have a higher-level looping statement for iterating over arbitrary user-defined sequences. C# has foreach, Java has “enhanced for” , even C++ has range-based for statements now. Those offer cleaner syntax than C’s for statement by implicitly calling into an iteration protocol that the object being looped over supports.

I love those. For Lox, though, we’re limited by building up the interpreter a chapter at a time. We don’t have objects and methods yet, so we have no way of defining an iteration protocol that the for loop could use. So we’ll stick with the old school C for loop. Think of it as “vintage”. The fixie of control flow statements.

习题

  1. A few chapters from now, when Lox supports first-class functions and dynamic dispatch, we technically won’t need branching statements built into the language. Show how conditional execution can be implemented in terms of those. Name a language that uses this technique for its control flow.

  2. Likewise, looping can be implemented using those same tools, provided our interpreter supports an important optimization. What is it, and why is it necessary? Name a language that uses this technique for iteration.

  3. Unlike Lox, most other C-style languages also support break and continue statements inside loops. Add support for break statements.

    The syntax is a break keyword followed by a semicolon. It should be a syntax error to have a break statement appear outside of any enclosing loop. At runtime, a break statement causes execution to jump to the end of the nearest enclosing loop and proceeds from there. Note that the break may be nested inside other blocks and if statements that also need to be exited.

1、2

循环可以用递归表示,如 Scheme

分支不太会 🤣,参考了 Smalltalk,使用 Python

class TrueValue:
def then(self, thenStmt):
return thenStmt
def thenElse(self, thenStmt, elseStmt):
return thenStmt
class FalseValue:
def then(self, thenStmt):
return None
def thenElse(self, thenStmt, elseStmt):
return elseStmt
def testDispatch(cond):
def f():
print("if then -> then")
def g():
print("if then else -> then")
def h():
print("if then else -> else")
thenRes = cond.then(f)
if thenRes != None:
thenRes()
thenElseRes = cond.thenElse(g, h)
if thenElseRes != None:
thenElseRes()
t = TrueValue()
f = FalseValue()
testDispatch(t)
testDispatch(f)

3

首先是 break 语句,考虑的有两个问题:

解决方案分别为:

下面是 continue 语句,似乎有点棘手 🤣。在捕获异常后,由于环境依然存在,我们可以让 visitWhileStmt 继续执行循环体。

@Override
public Void visitWhileStmt(Stmt.While stmt) {
try {
while (isTruthy(evaluate(stmt.condition))) {
execute(stmt.body);
}
} catch (BreakException ex) {
} catch (ContinueException ex) {
visitWhileStmt(stmt);
}
return null;
}

但是这样做会有一个问题,因为我们将 for 语句转换为一个 BlockStmtincrement 子句被视为循环体的一部分而被跳过了,如果我们不单独实现 for 语句的话,应该是无法区分的。

权衡一下的话,我们可以修改 for 语句在 C 中原本的语义,或者显式的在 continue 语句前加上 increment 子句:

for(var a=1;a<10;a=a+1){print a;if(a==5){a=a+1;continue;}}

Functions

这一部分添加的功能有:

函数调用运算符

首先,函数调用运算符拥有最高的优先级,也就是位于 unary 层和 primary 之间,它的语法规则为:

unary → ( "!" | "-" ) unary | call ;
call → primary ( "(" arguments? ")" )* ;
arguments → expression ( "," expression )* ;

也就是说,下面的表达式都符合 call 规则:

fn()
fn(a)
fn(a,b)
fn(a)(b)
...

最后一个函数调用可以理解为多参数柯里化,部分应用嵌套函数。

使用 GenerateAst 生成:

"Call : Expr callee, Token paren, List<Expr> arguments"

还是以 fn(a)(b) 为例,其中 calleefn(a),而 fn(a) 本身又是一个 callee。而 paren 则是右括号 RIGHT_PAREN,用于定位函数调用语句所在的行,以便给出错误信息。

对应在 Parser 类实现如下,expr = finishCall(expr); 是为了处理嵌套 callee

private Expr call() {
Expr expr = primary();
while (true) {
if (match(LEFT_PAREN)) {
expr = finishCall(expr);
} else {
break;
}
}
return expr;
}
private Expr finishCall(Expr callee) {
List<Expr> arguments = new ArrayList<>();
if (!check(RIGHT_PAREN)) {
do {
if (arguments.size() >= 255) {
error(peek(), "Can't have more than 255 arguments.");
}
arguments.add(expression());
} while (match(COMMA));
}
Token paren = consume(RIGHT_PAREN, "Expect ')' after arguments.");
return new Expr.Call(callee, paren, arguments);
}

在这里我们限制了函数调用中的参数数量应小于等于 254,实际上是小于等于 255(还有一个 this),目的是为了简化第二个解释器的实现。若超出了限制,我们只是简单的报告,并不会抛出错误。

下面是语义的部分,先看实现:

@Override
public Object visitCallExpr(Expr.Call expr) {
Object callee = evaluate(expr.callee);
List<Object> arguments = new ArrayList<>();
for (Expr argument : expr.arguments) {
arguments.add(evaluate(argument));
}
if (!(callee instanceof LoxCallable)) {
throw new RuntimeError(expr.paren, "Can only call functions and classes.");
}
LoxCallable function = (LoxCallable)callee;
if (arguments.size() != function.arity()) {
throw new RuntimeError(expr.paren, "Expected " + function.arity() + " arguments but got " + arguments.size() + ".");
}
return function.call(this, arguments);
}

为了处理嵌套 callee,首先对 expr.callee 求值。在这里我们引入了一个接口 LoxCallableLox 中可以被调用的东西都实现了这个接口。这个接口有两个方法,一是返回参数数量的 arity 方法,二是响应调用的 call 方法,call 方法接受一个解释器实例(环境)和实际参数列作为参数。

注意这里对参数求值的顺序。

本地方法

Native Functions, sometimes these are called primitives, external functions, or foreign functions.

有了函数调用运算符,但是现在我们没有函数可以被调用 🤣。在实现函数声明语句之前,我们先引入本地方法的概念:

These are functions that the interpreter exposes to user code but that are implemented in the host language (in our case Java), not the language being implemented (Lox).

In Java, “native” methods are ones implemented in C or C++ and compiled to native machine code.

本地方法的意义是提供了程序对基本服务的访问,如文件操作、输入和输出等。

Many languages also allow users to provide their own native functions. The mechanism for doing so is called a foreign function interface (FFI), native extension, native interface, or something along those lines. These are nice because they free the language implementer from providing access to every single capability the underlying platform supports.

在这里,我们将实现一个本地方法 clock(),实现的方法很简单,在 Interpreter 类的构造函数中加一点东西就行了:

实际上我们可以将之前引入的 print 语句替换为本地方法 print(),就像 Python 那样。

Interpreter() {
globals.define("clock", new LoxCallable() {
@Override
public int arity() { return 0; }
@Override
public Object call(Interpreter interpreter, List<Object> arguments) {
return (double)System.currentTimeMillis() / 1000.0;
}
@Override
public String toString() { return "<native fn>"; }
});
}

注意这里对应环境字段的变化,从:

private Environment environment = new Environment();

变成了:

final Environment globals = new Environment();
private Environment environment = globals;

这样确保了这个函数是在全局范围内定义的。

The environment field in the interpreter changes as we enter and exit local scopes. It tracks the current environment. This new globals field holds a fixed reference to the outermost global environment.

另外注意到函数和变量的环境在一个地方:

In Lox, functions and variables occupy the same namespace. In Common Lisp, the two live in their own worlds. A function and variable with the same name don’t collide. If you call the name, it looks up the function. If you refer to it, it looks up the variable. This does require jumping through some hoops when you do want to refer to a function as a first-class value.

Richard P. Gabriel and Kent Pitman coined the terms “Lisp-1” to refer to languages like Scheme that put functions and variables in the same namespace, and “Lisp-2” for languages like Common Lisp that partition them. Despite being totally opaque, those names have since stuck. Lox is a Lisp-1.

函数声明语句

A named function declaration isn’t really a single primitive operation. It’s syntactic sugar for two distinct steps: (1) creating a new function object, and (2) binding a new variable to it. If Lox had syntax for anonymous functions, we wouldn’t need function declaration statements. You could just do:

var add = fun (a, b) {
print a + b;
};

However, since named functions are the common case, I went ahead and gave Lox nice syntax for them.

作为声明语句,函数声明和变量声明是同优先级的。来看语法规则:

declaration → funDecl
| varDecl
| statement ;
funDecl → "fun" function ;
function → IDENTIFIER "(" parameters? ")" block ;
parameters → IDENTIFIER ( "," IDENTIFIER )* ;

这里将规则拆成两部分是为了之后扩展时复用 function 规则。

使用 GenerateAst 生成:

"Function : Token name, List<Token> params, List<Stmt> body"

对应在 Parser 类实现类似函数调用中的实现,辅助函数 function 有一个参数,便于区分函数和方法声明的错误信息,之后扩展时可以复用:

private Stmt.Function function(String kind)

另外,我们还要处理返回语句,因为函数体是语句的集合,而语句本身并不返回值。

实际上语句返回的是 nil,如下:

fun procedure() {
print "don't return anything";
}
var result = procedure();
print result; // ?

结果为

don't return anything
nil

函数体可以没有返回语句,那么隐式返回 nil,可以有返回语句但其中没有表达式,这样的返回语句用于退出函数体(类似 break)。为此,我们添加语法规则:

statement → exprStmt
| forStmt
| ifStmt
| printStmt
| returnStmt
| whileStmt
| block ;
returnStmt → "return" expression? ";" ;

使用 GenerateAst 生成:

"Return : Token keyword, Expr value"

对应在 Parser 类实现:

private Stmt returnStatement() {
Token keyword = previous();
Expr value = null;
if (!check(SEMICOLON)) {
value = expression();
}
consume(SEMICOLON, "Expect ';' after return value.");
return new Stmt.Return(keyword, value);
}

注意实现中判断是否有表达式的部分,因为表达式有很多种前瞻关键词。我们使用逆向思维,判断 return 后面是否紧接着分号,从而避免了复杂的讨论。

我们顺便实现返回语句的语义,这里使用了自定义非检异常 Return

@Override
public Void visitReturnStmt(Stmt.Return stmt) {
Object value = null;
if (stmt.value != null) value = evaluate(stmt.value);
throw new Return(value);
}

对于函数声明语句的语义,我们在实现函数调用时中引入了函数对象接口 LoxCallable,现在我们要思考如何表示 Lox 中的函数,并实现 LoxCallable 接口。实际上,函数就是由参数列和函数体组成,这基本上就是我们刚才生成的 Stmt.Function,然而我们不希望解释器的运行时阶段渗透到前端的语法树中,所以我们不希望 Stmt.Function 本身实现它。相反,我们将它包装在一个新类 LoxFunction 中。也就是说 LoxFunction 类中有一个字段:

private final Stmt.Function declaration;

下面我们来实现 LoxCallable 接口

@Override
public Object call(Interpreter interpreter, List<Object> arguments) {
Environment environment = new Environment(interpreter.globals);
for (int i = 0; i < declaration.params.size(); i++) {
environment.define(declaration.params.get(i).lexeme, arguments.get(i));
}
try {
interpreter.executeBlock(declaration.body, environment);
} catch (Return returnValue) {
return returnValue.value;
}
return null;
}

注意这里的环境变化,我们以全局环境作为父环境,在当前环境下,将形参与实参对应,并执行函数体。也就是说,每次函数调用(而不是声明)都会创建一个新的环境,该环境封装了形式参数名,并引用每一次实际传入的对象。若碰到返回语句,则捕获异常,并返回对应的值,否则隐式返回 nil

这就让递归函数成为了可能。思考下面的代码:

fun count(n) {
if (n > 1) count(n - 1);
print n;
}
count(3);

我们来分析对应的环境变化:

  1. interpret 方法执行 count(3); 这条语句时,先调用 visitExpressionStmt,再调用 visitCallExpr
  2. visitCallExpr 内,通过一系列 evaluate 方法得到函数对象和实际参数,并调用该函数对象的 call 方法。
  3. call 内,我们以全局环境作为父环境创建环境 1,在该环境中将形参与实参对应,并执行函数体。
  4. executeBlock 内,旧环境为全局环境,我们将环境设置为环境 1,并执行各条语句。第一条语句为 if 语句,条件为真,执行 count(2);
  5. 接下来的步骤与刚才类似,在 call 内以全局环境作为父环境创建一个环境 2。在 executeBlock 内,旧环境为环境 1,我们将环境设置为环境 2,并执行各条语句。第一条语句为 if 语句,条件为真,执行 count(1);
  6. 接下来的步骤与刚才类似,在 call 内以全局环境作为父环境创建一个环境 3。在 executeBlock 内,旧环境为环境 2,我们将环境设置为环境 3,并执行各条语句。第一条语句为 if 语句,条件为假。然后执行 print 1;
  7. 此时 count(1); 函数体已经被执行完了,在 finally 子句中将环境设置为环境 2,然后执行 print 2;
  8. 此时 count(2); 函数体已经被执行完了,在 finally 子句中将环境设置为环境 1,然后执行 print 3;
  9. 此时 count(3); 函数体已经被执行完了,在 finally 子句中将环境设置为全局环境。

回到 Interpreter 类,我们还要实现 visitFunctionStmt 方法:

@Override
public Void visitFunctionStmt(Stmt.Function stmt) {
LoxFunction function = new LoxFunction(stmt);
environment.define(stmt.name.lexeme, function);
return null;
}

类似于 visitVarStmt 方法,函数声明在当前环境将函数对象绑定到函数名。换句话说,我们将函数的编译时表示转换为运行时表示,对应了函数调用中对 callee 的求值:

Object callee = evaluate(expr.callee);

函数闭包

先来思考下面的代码:

fun makeCounter() {
var i = 0;
fun count() {
i = i + 1;
print i;
}
return count;
}
var counter = makeCounter();
counter(); // "1".
counter(); // "2".

我们迫不及待的让解释器去执行,结果却报错了:

Undefined variable 'i'.
[line 4]

让我们来分析一下,当执行到 var counter = makeCounter(); 时,解释器调用 visitVarStmt 方法,其中需要执行 evaluate(stmt.initializer)。这是一个函数调用,和之前一样,当函数体已经被执行的那一刻的环境为:

values
count -> {LoxFunction@1071} "<fn count>"
i -> {Double@1052} 0.0
enclosing
values
clock -> {Interpreter$1@1041} "<native fn>"
makeCounter -> {LoxFunction@987} "<fn makeCounter>"
enclosing
null

在执行完 finally 子句后,环境就变成了

values
clock -> {Interpreter$1@1041} "<native fn>"
makeCounter -> {LoxFunction@987} "<fn makeCounter>"
enclosing
null

我们将返回的 count 函数放入当前环境:

values
clock -> {Interpreter$1@1041} "<native fn>"
makeCounter -> {LoxFunction@987} "<fn makeCounter>"
counter -> {LoxFunction@1071} "<fn count>"
enclosing
null

此时,当我们执行 counter(); 时,由于环境中已经没有 i 了,所以会报错。

为了解决这个问题,我们修改两个地方:

现在我们再来分析上面的代码:

在执行完 makeCounter 函数的声明语句后,由于函数对象 makeCounter 是定义在全局环境中的,makeCounterclosure 与全局环境一样。

当执行到 var counter = makeCounter(); 时,环境的变化与上面一致,只不过我们将环境:

values
count -> {LoxFunction@1071} "<fn count>"
i -> {Double@1052} 0.0
enclosing
values
clock -> {Interpreter$1@1041} "<native fn>"
makeCounter -> {LoxFunction@987} "<fn makeCounter>"
enclosing
null

存入了函数对象 counterclosure 字段,在该语句结束后,其 enclosing 作为全局环境多了一个条目 counter,也就是说。在执行 counter(); 前,函数对象 counterclosure 为:

values
count -> {LoxFunction@1071} "<fn count>"
i -> {Double@1052} 0.0
enclosing
values
clock -> {Interpreter$1@1041} "<native fn>"
makeCounter -> {LoxFunction@987} "<fn makeCounter>"
counter -> {LoxFunction@1082} "<fn count>"
enclosing
null

在执行 counter(); 时,以 counterclosure 为父环境创建一个新的环境,环境图如下:

96bf1794ef314017bdc9397183e52ff7.png

那么对 i 的解析就不成问题啦。

Alas, in our rush to cram closures in, we have let a tiny bit of dynamic scoping leak into the interpreter. In the next chapter, we will explore deeper into lexical scope and close that hole.

习题

  1. Our interpreter carefully checks that the number of arguments passed to a function matches the number of parameters it expects. Since this check is done at runtime on every call, it has a performance cost. Smalltalk implementations don’t have that problem. Why not?

  2. Lox’s function declaration syntax performs two independent operations. It creates a function and also binds it to a name. This improves usability for the common case where you do want to associate a name with the function. But in functional-styled code, you often want to create a function to immediately pass it to some other function or return it. In that case, it doesn’t need a name.

    Languages that encourage a functional style usually support anonymous functions or lambdas - an expression syntax that creates a function without binding it to a name. Add anonymous function syntax to Lox so that this works:

    fun thrice(fn) {
    for (var i = 1; i <= 3; i = i + 1) {
    fn(i);
    }
    }
    thrice(fun (a) {
    print a;
    });
    // "1".
    // "2".
    // "3".

    How do you handle the tricky case of an anonymous function expression occurring in an expression statement:

    fun () {};
  3. Is this program valid?

    fun scope(a) {
    var a = "local";
    }

    In other words, are a function’s parameters in the same scope as its local variables, or in an outer scope? What does Lox do? What about other languages you are familiar with? What do you think a language should do?

1

Smalltalk 不太了解 🤣,这里是参考答案:

Smalltalk has different call syntax for different arities. To define a method that takes multiple arguments, you use keyword selectors. Each argument has a piece of the method name preceding instead of using commas as a separator. For example, a method like:

list.insert("element", 2)

To insert “element” as index 2 would look like this in Smalltalk:

list insert: "element" at: 2

Smalltalk doesn’t use a dot to separate method name from receiver. More interestingly, the “insert:” and “at:” parts both form a single method call whose full name is “insert:at:“. Since the selectors and the colons that separate them form part of the method’s name, there’s no way to call it with the wrong number of arguments. You can’t pass too many or two few arguments to “insert:at:” because there would be no way to write that call while still actually naming that method.

2

这道题的意思是让我们为 Lox 扩展匿名函数功能,由于匿名函数可以作为表达式使用,我们需要使用 GenerateAst 生成:

"Function : List<Token> parameters, List<Stmt> body"

同时修改函数声明语句,重用上面的语法结构:

"Function : Token name, Expr.Function function"

为此,我们需要解耦函数名与函数本身。修改 LoxFunction 类的表示,在 Parser 类中将 function 方法修改为:

private Stmt.Function function(String kind) {
Token name = consume(IDENTIFIER, "Expect " + kind + " name.");
return new Stmt.Function(name, functionBody(kind));
}
private Expr.Function functionBody(String kind) {
consume(LEFT_PAREN, "Expect '(' after " + kind + " name.");
List<Token> parameters = new ArrayList<>();
if (!check(RIGHT_PAREN)) {
do {
if (parameters.size() >= 8) {
error(peek(), "Can't have more than 8 parameters.");
}
parameters.add(consume(IDENTIFIER, "Expect parameter name."));
} while (match(COMMA));
}
consume(RIGHT_PAREN, "Expect ')' after parameters.");
consume(LEFT_BRACE, "Expect '{' before " + kind + " body.");
List<Stmt> body = block();
return new Expr.Function(parameters, body);
}

同时在 primary 层添加对匿名函数的解析:

if (match(FUN)) return functionBody("function");

对于语义的部分,略加修改 visitFunctionStmt 方法,并添加 visitFunctionExpr 方法:

@Override
public Object visitFunctionExpr(Expr.Function expr) {
return new LoxFunction(null, expr, environment);
}

此时我们若写出如下的代码:

fun () {};

会报错:

[line 1] Error at '(': Expect function name.

此时 Parser 会认为这是函数声明语句,却找不到函数名。为此,我们需要两次前瞻,就像 Scanner 类中的 peekpeekNext 一样,我们添加 checkNext

private boolean checkNext(TokenType tokenType) {
if (isAtEnd()) return false;
return peekNext().type == tokenType;
}

其中用到了 peekNext

private Token peekNext() {
// 调用该方法前确认了 isAtEnd() 为 false,所以可以安全的 get
return tokens.get(current + 1);
}

我们修改 declaration

private Stmt declaration() {
try {
if (match(VAR)) return varDeclaration();
if (check(FUN) && checkNext(IDENTIFIER)) {
consume(FUN, null);
return function("function");
}
return statement();
} catch (ParseError error) {
synchronize();
return null;
}
}

大功告成。

3

Lox 中,参数和函数体在一个环境中,因为 Lox 允许重定义变量(无论是全局变量还是局部变量),所以对于下面的代码:

fun scope(a) {
var a = "local";
print a;
}
scope("outer");

结果为 local。对于不允许重定义变量的 CJava,类似语义的代码无法通过编译。

另外,在学习了下一章后,就不允许重定义局部变量了,即上面的代码无法通过 resolve

Resolving and Binding

引子

我们思考如下代码:

var a = "global";
{
fun showA() {
print a;
}
showA();
var a = "block";
showA();
}

通过类似上一章的分析可以得到结果为:

global
block

在执行 var a = "block"; 前的环境图为:

642b8fb99f9e4c67be5d2afc05149082.png

在执行 var a = "block"; 后的环境图为:

9b8330ecf8af4490ac83d52317659b8d.png

简单来说,出现这种现象的原因是 showA() 的闭包在 Java 的表示中是一个对象,在代码执行过程中该闭包发生了变化,因而导致了遮蔽现象。

然而,这样的结果并不满足我们对词法作用域的定义:

A variable usage refers to the preceding declaration with the same name in the innermost scope that encloses the expression where the variable is used.

In JavaScript, variables declared using var are implicitly “hoisted” to the beginning of the block. Any use of that name in the block will refer to that variable, even if the use appears before the declaration. When you write this in JavaScript:

{
console.log(a);
var a = "value";
}

It behaves like:

{
var a; // Hoist.
console.log(a);
a = "value";
}

That means that in some cases you can read a variable before its initializer has run - an annoying source of bugs. The alternate let syntax for declaring variables was added later to address this problem.

Since this rule makes no mention of any runtime behavior, it implies that a variable expression always refers to the same declaration through the entire execution of the program.

关键问题是词法作用域要求打印语句 print a; 中所引用的对象 a 在整个运行期内保持不变。我们再来反思之前出现的问题,罪魁祸首是赋值语句 var a = "block";,在执行完这条语句后,块作用域对应的环境发生了动态变化,然而我们的实现是以块作用域为最小环境单元为前提的,这就导致了上面的问题。

Some languages make this split explicit. In Scheme and ML, when you declare a local variable using let, you also delineate the subsequent code where the new variable is in scope. There is no implicit “rest of the block”.

我们有两种解决方法:

72f1b02e2d61454795999cd99ae2b7c3.png

It’s the classic way to implement environments in Scheme interpreters.

为了最大程度复用已经实现的代码,我们选择第二种做法。这里的关键是:

If we could ensure a variable lookup always walked the same number of links in the environment chain, that would ensure that it found the same variable in the same scope every time.

因为我们要分析程序的静态结构,因此我们应该修改 Parser 类,第二个解释器就是这样的。然而作者却说:

It would work here too, but I want an excuse to show you another technique. We’ll write our resolver as a separate pass.

也就是说,在得到语法树后,我们并不是立刻执行代码,而是遍历一次语法树并对变量进行静态解析与绑定:

Additional passes between parsing and execution are common. If Lox had static types, we could slide a type checker in there. Optimizations are often implemented in separate passes like this too. Basically, any work that doesn’t rely on state that’s only available at runtime can be done in this way.

这类似于执行代码,区别在于:

实现

Resolver

我们需要一个新类 Resolver,其中实现了接口 Expr.Visitor<Void>Stmt.Visitor<Void>。似乎工作量很大,但其实我们的重点只有下面这些:

下面按顺序讲解,首先是块语句:

@Override
public Void visitBlockStmt(Stmt.Block stmt) {
beginScope();
resolve(stmt.statements);
endScope();
return null;
}

这里使用了许多辅助函数:

void resolve(List<Stmt> statements) {
for (Stmt statement : statements) { resolve(statement); }
}
private void resolve(Stmt stmt) { stmt.accept(this); }
private void resolve(Expr expr) { expr.accept(this); }
private void beginScope() { scopes.push(new HashMap<String, Boolean>()); }
private void endScope() { scopes.pop(); }

其中的 scopes 是局部环境链,我们将局部环境链组织成栈的结构:

private final Stack<Map<String, Boolean>> scopes = new Stack<>();

这里 Boolean 的含义下面会解释。

然后是变量声明语句,这里展示一种可能出现的代码:

var a = "outer";
{
var a = a;
}

我们有下面三种处理方法:

这里的选择是第三种。选择不重要 🤣,重要的是这带给我们一个启示,变量声明语句其实分为两个阶段:第一阶段是声明这个变量的存在,第二阶段是声明这个变量的值,这实际上是考虑到了初始化的过程。初始化未完成的变量标记为 false,初始化完成的变量标记为 true,这就是 Boolean 的含义。

因此我们实现如下:

private void declare(Token name) {
if (scopes.isEmpty()) return;
Map<String, Boolean> scope = scopes.peek();
if (scope.containsKey(name.lexeme)) {
Lox.error(name, "Already a variable with this name in this scope.");
}
scope.put(name.lexeme, false);
}
private void define(Token name) {
if (scopes.isEmpty()) return;
scopes.peek().put(name.lexeme, true);
}
@Override
public Void visitVarStmt(Stmt.Var stmt) {
declare(stmt.name);
if (stmt.initializer != null) {
resolve(stmt.initializer);
}
define(stmt.name);
return null;
}

注意这里的实现杜绝了重定义局部变量的现象。之所以是局部变量,下面会解释。

然后是变量表达式,先看实现:

@Override
public Void visitVariableExpr(Expr.Variable expr) {
if (!scopes.isEmpty() && scopes.peek().get(expr.name.lexeme) == Boolean.FALSE) {
Lox.error(expr.name, "Can't read local variable in its own initializer.");
}
resolveLocal(expr, expr.name);
return null;
}
private void resolveLocal(Expr expr, Token name) {
for (int i = scopes.size() - 1; i >= 0; i--) {
if (scopes.get(i).containsKey(name.lexeme)) {
interpreter.resolve(expr, scopes.size() - 1 - i);
return;
}
}
}

这里的实现呼应了上面的第三种选择,另外这里的 resolveLocal 方法是精髓,我们的 Resolver 类中有一个解释器的实例,我们在 Interpreter 类中新增了 resolve 方法:

void resolve(Expr expr, int depth) {
locals.put(expr, depth);
}

这里的 locals 是一个从变量名到环境层数(当前作用域和定义变量的作用域之间的距离)的映射:

private final Map<Expr, Integer> locals = new HashMap<>();

You might think we’d need some sort of nested tree structure to avoid getting confused when there are multiple expressions that reference the same variable, but each expression node is its own Java object with its own unique identity.

resolveLocal 方法中,若我们在当前环境解析到了变量,环境层数就是 0,若我们在当前环境的父环境解析到了变量,环境层数就是 1,以此类推。若没有解析到变量,那么说明该变量不在局部环境链中,只能在全局环境中。

如果我们选择修改 Parser 类来实现上述目的,那么可以将这些信息存储在语法树中。

赋值表达式的实现就简单一些了:

@Override
public Void visitAssignExpr(Expr.Assign expr) {
resolve(expr.value);
resolveLocal(expr, expr.name);
return null;
}

函数声明语句既需要引入新的作用域,又需要绑定函数名(声明、定义),其实现不多赘述。需要注意在上一章中我们实现了匿名函数,对于匿名函数的作用域问题,需要斟酌 🤣。

对于其他的语句和表达式,也不再赘述。

Interpreter

如果上述代码运行良好,我们在 Interpreter 类中就得到了 locals 映射表。

The way the interpreter assumes the variable is in that map feels like flying blind. The interpreter code trusts that the resolver did its job and resolved the variable correctly. This implies a deep coupling between these two classes. In the resolver, each line of code that touches a scope must have its exact match in the interpreter for modifying an environment.

I felt that coupling firsthand because as I wrote the code for the book, I ran into a couple of subtle bugs where the resolver and interpreter code were slightly out of sync. Tracking those down was difficult. One tool to make that easier is to have the interpreter explicitly assert - using Java’s assert statements or some other validation tool - the contract it expects the resolver to have already upheld.

这样我们就能优化变量表达式和赋值表达式了(只有这两个对应的方法对环境进行了 get 或 assign)。首先是变量表达式:

@Override
public Object visitVariableExpr(Expr.Variable expr) {
return lookUpVariable(expr.name, expr);
}
private Object lookUpVariable(Token name, Expr expr) {
Integer distance = locals.get(expr);
if (distance != null) {
return environment.getAt(distance, name.lexeme);
} else {
return globals.get(name);
}
}

然后是赋值表达式:

@Override
public Object visitAssignExpr(Expr.Assign expr) {
Object value = evaluate(expr.value);
Integer distance = locals.get(expr);
if (distance != null) {
environment.assignAt(distance, expr.name, value);
} else {
globals.assign(expr.name, value);
}
return value;
}

这里需要在 Environment 类中添加几个方法,不再赘述。

别忘了在 Lox 类的 run 方法中加上我们新出炉的resolver。现在当我们运行开头的代码时,结果变成了:

global
global

好嘞,这就是最终成果 🤣。

性能

下面是一段示例程序,展示了性能的提升:

var st=clock();
var obj=1;
{{{{{{{{{{for(var i=0;i<10000000;i=i+1){obj=obj+1;}}}}}}}}}}}
var ed=clock();
print (ed-st)+"s";

chapter10 的解释器运行大约需要 2.2 秒,chapter11(本章)的解释器运行大约需要 1.4 秒。

习题

  1. Why is it safe to eagerly define the variable bound to a function’s name when other variables must wait until after they are initialized before they can be used?

  2. How do other languages you know handle local variables that refer to the same name in their initializer, like:

    var a = "outer";
    {
    var a = a;
    }

    Is it a runtime error? Compile error? Allowed? Do they treat global variables differently? Do you agree with their choices? Justify your answer.

  3. Extend the resolver to report an error if a local variable is never used.

  4. Our resolver calculates which environment the variable is found in, but it’s still looked up by name in that map. A more efficient environment representation would store local variables in an array and look them up by index.

    Extend the resolver to associate a unique index for each local variable declared in a scope. When resolving a variable access, look up both the scope the variable is in and its index and store that. In the interpreter, use that to quickly access a variable by its index instead of using a map.

1

首先观察 visitFunctionStmt 方法:

@Override
public Void visitFunctionStmt(Stmt.Function stmt) {
declare(stmt.name);
define(stmt.name);
resolveFunction(stmt.function);
return null;
}
private void resolveFunction(Expr.Function function) {
beginScope();
for (Token param : function.parameters) {
declare(param);
define(param);
}
resolve(function.body);
endScope();
}

这个问题有一点疑惑,下面是一些思考:

目前,declaredefine 方法的唯一作用就是检测下面的情形:

var a = "outer";
{
var a = a;
}

对于递归函数而言,在解析函数体内部时,确实可能在 visitCallExpr 方法中解析 expr.callee,也就是函数名。然而函数体和函数名在不同的环境中,也就是在下面的 if 语句中:

@Override
public Void visitVariableExpr(Expr.Variable expr) {
if (!scopes.isEmpty() && scopes.peek().get(expr.name.lexeme) == Boolean.FALSE) {
Lox.error(expr.name, "Can't read local variable in its own initializer.");
}
resolveLocal(expr, expr.name);
return null;
}

scopes.peek().get(expr.name.lexeme) 返回的结果为 null,所以不会引发错误。

参考 HashMap 的实现:

public V get(Object key) {
Node<K,V> e;
return (e = getNode(hash(key), key)) == null ? null : e.value;
}

如此看来在函数体前 define 函数名似乎没有意义,一种赋予其意义 🤣 的方案是修改 scopes.peek().get(expr.name.lexeme) 的部分,让其可以在整个局部环境链中寻找,这样就必须在函数体前 define 函数名了。

另一方面,考虑下面的代码:

{
fun isOdd(n) {
if (n == 0) return false;
return isEven(n - 1);
}
fun isEven(n) {
if (n == 0) return true;
return isOdd(n - 1);
}
print isOdd(3);
}

注意最外面有一层大括号,这样的代码在执行时会报错:

Undefined variable 'isEven'.
[line 4]

原因是函数 isOddisEven 都位于局部环境中,当解析 isOdd 的函数声明时,会解析到 isEven 的函数名,然而此时局部环境中并没有 isEven,所以在执行时,解释器会认为 isEven 在全局环境中,这就出问题了。对于 isEven 函数声明的解析则可以完成。这种情形下,即使在函数体前 define 函数名,也无法修复问题(这个问题在上一章中并不存在,因为闭包是动态变化的)。

如果想补救一下的话 🤣,可以使用匿名函数,显式地控制函数名绑定:

{
var isEven;
var isOdd = fun (n) {
if (n == 0) return false;
return isEven(n - 1);
};
isEven = fun (n) {
if (n == 0) return true;
return isOdd(n - 1);
};
print isOdd(3);
print isEven(3);
}

2

C 中,对于如下代码:

#include<stdio.h>
int main() {
int i = 1;
{
int i = i;
printf("%d", i);
}
}

int i = i; 的右边的 i 遮蔽了外面的 i,对应 Put the new variable in scope, then run the initializer. 这种情形,而 C 不允许使用未初始化的局部变量,所以会报错。

Java 则要求内部的变量名不能与外部的变量名同名。

3

这里的关键是给变量做标记,上面我们已经有了已声明和已定义的标记,分别对应 falsetrue

需要注意已定义不代表已初始化。

现在我们引入一个标记 已使用,在 Resolver 类中新增:

private static class Variable {
final Token name;
VariableState state;
private Variable(Token name, VariableState state) {
this.name = name;
this.state = state;
}
}
private enum VariableState {
DECLARED,
DEFINED,
READ
}

并修改 scopes

private final Stack<Map<String, Variable>> scopes = new Stack<>();

resolveLocal 中标记已使用:

private void resolveLocal(Expr expr, Token name, boolean isRead) {
for (int i = scopes.size() - 1; i >= 0; i--) {
if (scopes.get(i).containsKey(name.lexeme)) {
interpreter.resolve(expr, scopes.size() - 1 - i);
if (isRead) {
scopes.get(i).get(name.lexeme).state = VariableState.READ;
}
return;
}
}
}

变量表达式使用了变量,赋值表达式则未使用变量:

@Override
public Void visitVariableExpr(Expr.Variable expr) {
if (!scopes.isEmpty() &&
scopes.peek().containsKey(expr.name.lexeme) &&
scopes.peek().get(expr.name.lexeme).state == VariableState.DECLARED) {
Lox.error(expr.name, "Can't read local variable in its own initializer.");
}
resolveLocal(expr, expr.name, true);
return null;
}
@Override
public Void visitAssignExpr(Expr.Assign expr) {
resolve(expr.value);
resolveLocal(expr, expr.name, false);
return null;
}

endScope 中检查是否有局部变量未被使用:

private void endScope() {
Map<String, Variable> scope = scopes.pop();
for (Map.Entry<String, Variable> entry : scope.entrySet()) {
if (entry.getValue().state == VariableState.DEFINED) {
Lox.error(entry.getValue().name, "Local variable is not used.");
}
}
}

4

不感兴趣 🤣

Classes

通向 OOP 的世界有三条道路:

Multimethods are the approach you’re least likely to be familiar with. I’d love to talk more about them - I designed a hobby language around them once and they are super rad- but there are only so many pages I can fit in. If you’d like to learn more, take a look at CLOS (the object system in Common Lisp), Dylan, Julia, or Raku.

DESIGN NOTE: PROTOTYPES AND POWER

对于 Lox 而言,我们选择第一条道路。先瞧瞧地图:

59e991f738b845e2a7a337d8c7ae6a6c.png

类声明语句

类似函数声明语句,语法规则为:

declaration → classDecl
| funDecl
| varDecl
| statement ;
classDecl → "class" IDENTIFIER "{" function* "}" ;
function → IDENTIFIER "(" parameters? ")" block ;
parameters → IDENTIFIER ( "," IDENTIFIER )* ;

使用 GenerateAst 生成:

"Class : Token name, List<Stmt.Function> methods"

对应在 Parser 类中的实现复用了 function 方法。

下面是语义的部分,对于 Resolver 类而言,目前我们只有类名,所以需要对类名进行解析(我们允许在局部环境中声明类)。对于 Interpreter 类而言,我们需要将类的编译时表示转换为运行时表示(类似函数声明),为此我们新建一个类 LoxClass,其中含有一个类名字段,并在 Interpreter 类中实现 visitClassStmt 方法:

@Override
public Void visitClassStmt(Stmt.Class stmt) {
environment.define(stmt.name.lexeme, null);
LoxClass klass = new LoxClass(stmt.name.lexeme);
environment.assign(stmt.name, klass);
return null;
}

That two-stage variable binding process allows references to the class inside its own methods.

类实例

While some syntax and semantics are fairly standard across OOP languages, the way you create new instances isn’t. Ruby, following Smalltalk, creates instances by calling a method on the class object itself, a recursively graceful approach. Some, like C++ and Java, have a new keyword dedicated to birthing a new object. Python has you “call” the class itself like a function. (JavaScript, ever weird, sort of does both.)

In Smalltalk, even classes are created by calling methods on an existing object, usually the desired superclass. It’s sort of a turtles-all-the-way-down thing. It ultimately bottoms out on a few magical classes like Object and Metaclass that the runtime conjures into being ex nihilo.

我们通过将类名作为 callee 的方案产生类的实例(调用运算符),这样的话只要让 LoxClass 实现 LoxCallable 接口就完事了。

It’s as if a class is a factory function that generates instances of itself.

为此,我们需要新建一个类 LoxInstance,其中含有一个 LoxClass 类型的字段。

@Override
public Object call(Interpreter interpreter, List<Object> arguments) {
LoxInstance instance = new LoxInstance(this);
return instance;
}
@Override
public int arity() {
return 0;
}

实例属性

属性(property):字段(field)和方法(method)

对属性的访问是通过句号运算符进行的,句号运算符和调用运算符具有相同的优先级。

Allowing code outside of the class to directly modify an object’s fields goes against the object-oriented credo that a class encapsulates state. Some languages take a more principled stance. In Smalltalk, fields are accessed using simple identifiers - essentially, variables that are only in scope inside a class’s methods. Ruby uses @ followed by a name to access a field in an object. That syntax is only meaningful inside a method and always accesses state on the current object.

Lox, for better or worse, isn’t quite so pious about its OOP faith.

下面是语法规则:

call → primary ( "(" arguments? ")" | "." IDENTIFIER )* ;

After a primary expression, we allow a series of any mixture of parenthesized calls and dotted property accesses.

使用 GenerateAst 生成:

"Get : Expr object, Token name"

对应修改 Parser 类中的 call 方法:

private Expr call() {
Expr expr = primary();
while (true) {
if (match(LEFT_PAREN)) {
expr = finishCall(expr);
} else if (match(DOT)) {
Token name = consume(IDENTIFIER, "Expect property name after '.'.");
expr = new Expr.Get(expr, name);
} else {
break;
}
}
return expr;
}

可参考图示:

04bb4be5023747adb0a4e8a9ce130599.png

语义部分,我们不需要静态解析实例属性(显然),只需要递归解析左子树即可:

@Override
public Void visitGetExpr(Expr.Get expr) {
resolve(expr.object);
return null;
}

You can literally see that property dispatch in Lox is dynamic since we don’t process the property name during the static resolution pass.

真正对实例属性的访问发生在运行时,也就是 Interpreter 类的 visitGetExpr 方法:

@Override
public Object visitGetExpr(Expr.Get expr) {
Object object = evaluate(expr.object);
if (object instanceof LoxInstance) {
return ((LoxInstance) object).get(expr.name);
}
throw new RuntimeError(expr.name, "Only instances have properties.");
}

为此,我们在 LoxInstance 类中添加一个字段,一个从属性(字段)名到引用对象的映射表,并对应添加 get 方法。

Doing a hash table lookup for every field access is fast enough for many language implementations, but not ideal. High performance VMs for languages like JavaScript use sophisticated optimizations like “hidden classes” to avoid that overhead.

Paradoxically, many of the optimizations invented to make dynamic languages fast rest on the observation that - even in those languages - most code is fairly static in terms of the types of objects it works with and their fields.

下面是设置属性的部分,此时逗号运算符出现在赋值表达式的左侧:

assignment → ( call "." )? IDENTIFIER "=" assignment
| logic_or ;

Unlike getters, setters don’t chain. However, the reference to call allows any high-precedence expression before the last dot, including any number of getters, as in:

fb11f2d6728742508c0720b0c1a7f989.png

使用 GenerateAst 生成:

"Set : Expr object, Token name, Expr value"

对应修改 Parser 类中的 assignment 方法:

private Expr assignment() {
Expr expr = ternary();
if (match(EQUAL)) {
Token equals = previous();
Expr value = assignment();
if (expr instanceof Expr.Variable) {
Token name = ((Expr.Variable) expr).name;
return new Expr.Assign(name, value);
} else if (expr instanceof Expr.Get) {
Expr.Get get = (Expr.Get) expr;
return new Expr.Set(get.object, get.name, value);
}
error(equals, "Invalid assignment target.");
}
return expr;
}

对于语义,类似访问属性的部分,不再赘述。

由于实例属性的动态性,我们没有必要在设置属性前检查属性是否存在。

类方法

这里的类方法并不是指我们可以通过类名直接调用方法,这只是为了暗示方法是存储在类中的。

我们在一个类实例上访问并调用方法,如图所示:

5109152cc3f8417f902e3b0329dbe9e5.png

现在一个问题是,如果将访问和调用方法的过程分开,会发生什么,例如考虑下面的代码:

class Person {
sayName() {
print this.name;
}
}
var jane = Person();
jane.name = "Jane";
var method = jane.sayName;
method(); // ?

结果应该是 Jane,那下面的代码呢:

class Person {
sayName() {
print this.name;
}
}
var jane = Person();
jane.name = "Jane";
var bill = Person();
bill.name = "Bill";
bill.sayName = jane.sayName;
bill.sayName(); // ?

结果依然是 Jane。这带给我们一个启示(现在用不上 🤣):

We will have methods “bind” this to the original instance when the method is first grabbed. Python calls these bound methods.

这里涉及了方法(函数)是一等对象及 this 表达式等内容,比较费解 🤣。

A motivating use for this is callbacks. Often, you want to pass a callback whose body simply invokes a method on some object. Being able to look up the method and pass it directly saves you the chore of manually declaring a function to wrap it. Compare this:

fun callback(a, b, c) {
object.method(a, b, c);
}
takeCallback(callback);

With this:

takeCallback(object.method);

在实现类方法之前,我们先来处理一个问题,那就是在函数体之外调用 return 语句,我们需要记录目前代码所处的位置。类似 break 语句和 continue 语句的处理,然而这次,我们并非在 Parser 类中添加类似 loopDepth 的字段,既然我们已经有了 Resolver 类,我们可以在resolver中处理这个问题。在 Resolver 类中添加枚举类 FunctionType 和该类型的字段 currentFunction,并修改对函数体的静态解析方法:

private void resolveFunction(Expr.Function function, FunctionType type) {
FunctionType enclosingFunction = currentFunction;
currentFunction = type;
beginScope();
for (Token param : function.parameters) {
declare(param);
define(param);
}
resolve(function.body);
endScope();
currentFunction = enclosingFunction;
}

这样我们在遇到 return 语句时,就能通过检查 currentFunction 来判断 return 语句是否在函数体内了。

可以用同样的方法处理 break 语句和 continue 语句,将代码转移到 Resolver 类中。

现在我们来实现类方法,我们在刚才的 FunctionType 中添加 METHOD,并修改 Resolver 类中的 visitClassStmt

@Override
public Void visitClassStmt(Stmt.Class stmt) {
declare(stmt.name);
define(stmt.name);
for (Stmt.Function method : stmt.methods) {
FunctionType declaration = FunctionType.METHOD;
resolveFunction(method.function, declaration);
}
return null;
}

修改 Interpreter 类中的 visitClassStmt

@Override
public Void visitClassStmt(Stmt.Class stmt) {
environment.define(stmt.name.lexeme, null);
Map<String, LoxFunction> methods = new HashMap<>();
for (Stmt.Function method : stmt.methods) {
LoxFunction function = new LoxFunction(method.name.lexeme, method.function, environment);
methods.put(method.name.lexeme, function);
}
LoxClass klass = new LoxClass(stmt.name.lexeme, methods);
environment.assign(stmt.name, klass);
return null;
}

注意为了实现匿名函数,LoxFunction 类的构造方法有所变化。

对应的,我们在 LoxClass 类中添加字段,一个从方法名到函数对象的映射表。并修改 LoxInstance 类中的 get 方法:

Object get(Token name) {
if (fields.containsKey(name.lexeme)) {
return fields.get(name.lexeme);
}
LoxFunction method = klass.findMethod(name.lexeme);
if (method != null) return method;
throw new RuntimeError(name, "Undefined property '" + name.lexeme + "'.");
}

方法作为实例属性的一部分。在访问属性的过程中,若字段名与方法名重名,则字段名会遮蔽方法名。对方法的 get 是通过 LoxClass 类的 findMethod 方法实现的。

this

尽管我们已经有了字段和方法,但是它们之间并没有任何关系。目前的方法内部不能访问实例字段,因为方法并没有和实例绑定在一起。所以现在的方法就像类方法,只不过我们是通过实例来调用的。

为此,我们引入 this 表达式,其关键是闭包

先看实现,我们修改语法规则:

primary → "true" | "false" | "nil" | "this"
| NUMBER | STRING
| "(" expression ")"
| IDENTIFIER ;

使用 GenerateAst 生成:

"This : Token keyword"

对应在 Parser 类的实现略去。

下面是语义的部分,在 Resolver 类中添加 visitThisExpr 方法:

@Override
public Void visitThisExpr(Expr.This expr) {
resolveLocal(expr, expr.keyword, true);
return null;
}

注意这里为了识别未使用的局部变量,resolveLocal 参数多了一个,此处 true 表示已使用(后面会看到这里填什么都无所谓)。

我们同时需要修改 Resolver 类中的 visitClassStmt 方法,创建一个新的环境,并在新环境中添加 this

@Override
public Void visitClassStmt(Stmt.Class stmt) {
declare(stmt.name);
define(stmt.name);
beginScope();
scopes.peek().put("this", new Variable(null, VariableState.READ));
for (Stmt.Function method : stmt.methods) {
FunctionType declaration = FunctionType.METHOD;
resolveFunction(method.function, declaration);
}
endScope();
return null;
}

这里的 scopes.peek().put("this", new Variable(null, VariableState.READ)); 也是为了识别未使用的局部变量,之所以不用 VariableState.DEFINED,是因为若类声明中的方法里没有出现 this 的话,Resolver 类的 endScope 方法就会报错。这里使用 null,是因为没有该token的信息。

The resolver has a new scope for this, so the interpreter needs to create a corresponding environment for it. Remember, we always have to keep the resolver’s scope chains and the interpreter’s linked environments in sync with each other.

为此,我们修改 LoxInstance 类中的 get 方法:

Object get(Token name) {
if (fields.containsKey(name.lexeme)) {
return fields.get(name.lexeme);
}
LoxFunction method = klass.findMethod(name.lexeme);
if (method != null) return method.bind(this);
throw new RuntimeError(name, "Undefined property '" + name.lexeme + "'.");
}

其中使用了 LoxFunction 类的 bind 方法

LoxFunction bind(LoxInstance instance) {
Environment environment = new Environment(closure);
environment.define("this", instance);
return new LoxFunction(name, declaration, environment, isInitializer);
}

最后,我们还要在 Interpreter 类中添加 visitThisExpr 方法:

@Override
public Object visitThisExpr(Expr.This expr) {
return lookUpVariable(expr.keyword, expr);
}

这番操作实在让人头大 🤣,来看一个例子,分析其执行过程:

class Cake {
taste() {
var adjective = "delicious";
print "The " + this.flavor + " cake is " + adjective + "!";
}
}
var cake = Cake();
cake.flavor = "German chocolate";
cake.taste(); // Prints "The German chocolate cake is delicious!".

先来看看 resolver 干了些什么:

再来看看 interpreter 干了些什么:

总结来说,静态解析在类声明语句中就准备好了所有的环境变化,而直到实例方法被调用时,这些环境变化才对应地体现出来。

另外,我们还要检测出对 this 表达式的非法使用,方法和检查函数体之外调用 return 语句完全类似,在 Resolver 添加对应的枚举类 ClassType 即可。

init

”Constructing” an object is actually a pair of operations:

  1. The runtime allocates the memory required for a fresh instance. In most languages, this operation is at a fundamental level beneath what user code is able to access.
  2. Then, a user-provided chunk of code is called which initializes the unformed object.

C++‘s “placement new” is a rare example where the bowels of allocation are laid bare for the programmer to prod.

The latter is what we tend to think of when we hear “constructor”, but the language itself has usually done some groundwork for us before we get to that point. In fact, our Lox interpreter already has that covered when it creates a new LoxInstance object.

We’ll do the remaining part - user-defined initialization - now.

我们需要修改 LoxClass 中的 call 方法和 arity 方法:

@Override
public Object call(Interpreter interpreter, List<Object> arguments) {
LoxInstance instance = new LoxInstance(this);
LoxFunction initializer = findMethod("init");
if (initializer != null) {
initializer.bind(instance).call(interpreter, arguments);
}
return instance;
}
@Override
public int arity() {
LoxFunction initializer = findMethod("init");
if (initializer == null) return 0;
return initializer.arity();
}

也就是说,当我们通过类名创建类的实例时,init 方法就会被立即绑定到实例上并且被调用。

下面我们考虑一些细节问题。首先,我们允许显式地调用 init 方法,并且 init 方法的返回值总是 this(我们不允许在 init 方法中返回其他值),考虑下面的代码:

class Foo {
init() {
this.bar = 1;
}
}
var foo = Foo();
foo.bar = 2;
print foo.bar;
print foo.init();
print foo.bar;

结果应该是:

2
Foo instance
1

为了实现这一点,我们需要修改 LoxFunction 中的 call 方法:

@Override
public Object call(Interpreter interpreter, List<Object> arguments) {
Environment environment = new Environment(closure);
for (int i = 0; i < declaration.parameters.size(); i++) {
environment.define(declaration.parameters.get(i).lexeme, arguments.get(i));
}
try {
interpreter.executeBlock(declaration.body, environment);
} catch (Return returnValue) {
if (isInitializer) return closure.getAt(0, "this");
return returnValue.value;
}
if (isInitializer) return closure.getAt(0, "this");
return null;
}

这里添加了两处 if (isInitializer) return closure.getAt(0, "this");,第一处是处理 return; 的情形,第二处是处理没有 return 语句的情形。同时,我们要在 LoxFunction 中添加一个字段 isInitializer,并修改所有使用了 LoxFunction 构造方法的地方。

另外,为了处理在 init 方法中返回其他值的情形,我们在 Resolver 类的 FunctionType 中添加 INITIALIZER 字段,并对应修改 visitClassStmtvisitReturnStmt 方法即可。

习题

  1. We have methods on instances, but there is no way to define “static” methods that can be called directly on the class object itself. Add support for them. Use a class keyword preceding the method to indicate a static method that hangs off the class object.

    class Math {
    class square(n) {
    return n * n;
    }
    }
    print Math.square(3); // Prints "9".

    You can solve this however you like, but the “metaclasses” used by Smalltalk and Ruby are a particularly elegant approach. Hint: Make LoxClass extend LoxInstance and go from there.

  2. Most modern languages support “getters” and “setters” - members on a class that look like field reads and writes but that actually execute user-defined code. Extend Lox to support getter methods. These are declared without a parameter list. The body of the getter is executed when a property with that name is accessed.

    class Circle {
    init(radius) {
    this.radius = radius;
    }
    area {
    return 3.141592653 * this.radius * this.radius;
    }
    }
    var circle = Circle(4);
    print circle.area; // Prints roughly "50.2655".
  3. Python and JavaScript allow you to freely access an object’s fields from outside of its own methods. Ruby and Smalltalk encapsulate instance state. Only methods on the class can access the raw fields, and it is up to the class to decide which state is exposed. Most statically typed languages offer modifiers like private and public to control which parts of a class are externally accessible on a per-member basis.

    What are the trade-offs between these approaches and why might a language prefer one or the other?

1

根据提示,这里的关键是类本身也是一个类的实例。

首先,我们需要在类的语法树表示中添加类方法的字段:

"Class : Token name, List<Stmt.Function> methods, List<Stmt.Function> classMethods"

对应修改 Parser 类中的 classDeclaration 方法:

private Stmt classDeclaration() {
Token name = consume(IDENTIFIER, "Expect class name.");
List<Stmt.Function> methods = new ArrayList<>();
List<Stmt.Function> classMethods = new ArrayList<>();
consume(LEFT_BRACE, "Expect '{' before class body.");
while (!check(RIGHT_BRACE) && !isAtEnd()) {
boolean isClassMethod = match(CLASS);
(isClassMethod ? classMethods : methods).add(function("method"));
}
consume(RIGHT_BRACE, "Expect '}' after class body.");
return new Stmt.Class(name, methods, classMethods);
}

语义部分,对 Resolver 类:

@Override
public Void visitClassStmt(Stmt.Class stmt) {
ClassType enclosingClass = currentClass;
currentClass = ClassType.CLASS;
declare(stmt.name);
define(stmt.name);
beginScope();
scopes.peek().put("this", new Variable(null, VariableState.READ));
for (Stmt.Function method : stmt.methods) {
FunctionType declaration = FunctionType.METHOD;
if (method.name.lexeme.equals("init")) {
declaration = FunctionType.INITIALIZER;
}
resolveFunction(method.function, declaration);
}
for (Stmt.Function method : stmt.classMethods) {
beginScope();
scopes.peek().put("this", new Variable(null, VariableState.READ));
resolveFunction(method.function, FunctionType.METHOD);
endScope();
}
endScope();
currentClass = enclosingClass;
return null;
}

我们在解析类方法时创建了一个新的环境,并在其中添加了 this,这是为了避免与父环境中的 this 冲突。

Interpreter 类:

@Override
public Void visitClassStmt(Stmt.Class stmt) {
environment.define(stmt.name.lexeme, null);
Map<String, LoxFunction> classMethods = new HashMap<>();
for (Stmt.Function method : stmt.classMethods) {
LoxFunction function = new LoxFunction(method.name.lexeme, method.function, environment, false);
classMethods.put(method.name.lexeme, function);
}
LoxClass metaclass = new LoxClass(null, stmt.name.lexeme + " metaclass", classMethods);
Map<String, LoxFunction> methods = new HashMap<>();
for (Stmt.Function method : stmt.methods) {
LoxFunction function = new LoxFunction(method.name.lexeme, method.function,
environment, method.name.lexeme.equals("init"));
methods.put(method.name.lexeme, function);
}
LoxClass klass = new LoxClass(metaclass, stmt.name.lexeme, methods);
environment.assign(stmt.name, klass);
return null;
}

这里为了让 LoxClass 继承 LoxInstance 类,我们将包含类方法的 metaclass 作为类实例所对应的类。如此一来,在全局环境中的 Math 就可以作为实例访问其属性,这里的属性就是 metaclass 中的类方法。

可以尝试分析下面的代码:

class Math {
class square(n) {
print this.foo;
return n * n;
}
init(n) {
this.foo = n;
}
square() {
Math.square(this.foo);
return this.foo * this.foo;
}
}
Math.foo = 1;
print Math.square(3);
var math = Math(2);
print math.square();

需要注意只能通过类调用类方法,无法通过类实例调用类方法,要点:

2

感觉这个特性并没啥用 🤣。

3

The decision to encapsulate at all or not is the classic trade-off between whether you want to make things easier for the class consumer or the class maintainer. By making everything public and freely externally visible and modifier, a downstream user of a class has more freedom to pop the hood open and muck around in the class’s internals.

However, that access tends to increasing coupling between the class and its users. That increased coupling makes the class itself more brittle, similar to the “fragile base class problem”. If users are directly accessing properties that the class author considered implementation details, they lose the freedom to tweak that implementation without breaking those users. The class can end up harder to change. That’s more painful for the maintainer, but also has a knock-on effect to the consumer — if the class evolves more slowly, they get fewer newer features for free from the upstream maintainer.

On the other hand, free external access to class state is a simpler, easier user experience when the class maintainer and consumer are the same person. If you’re banging out a small script, it’s handy to be able to just push stuff around without having to go through a lot of ceremony and boilerplate. At small scales, most language features that build fences in the program are more annoying than they are useful.

As the program scales up, though, those fences become increasingly important since no one person is able to hold the entire program in their head. Boundaries in the code let you make productive changes while only knowing a single region of the program.

Assuming you do want some sort of access control over properties, the next question is how fine-grained. Java has four different access control levels. That’s four concepts the user needs to understand. Every time you add a member to a class, you need to pick one of the four, and need to have the expertise and forethought to choose wisely. This adds to the cognitive load of the language and adds some mental friction when programming.

However, at large scales, each of those access control levels (except maybe package private) has proven to be useful. Having a few options gives class maintainers precise control over what extension points the class user has access to. While the class author has to do the mental work to pick a modifier, the class consumer gets to benefit from that. The modifier chosen for each member clearly communicates to the class user how the class is intended to be used. If you’re subclassing a class and looking at a sea of methods, trying to figure out which one to override, the fact that one is protected while the others are all private or public makes your choice much easier — it’s a clear sign that that method is for the subclass’s use.

Inheritance

继承

在 Lox 中,我们用 < 表示继承,为此,我们修改语法规则:

classDecl → "class" IDENTIFIER ( "<" IDENTIFIER )?
"{" function* "}" ;

我们并没有要求所有类都有一个超类,对 Lox 而言,没有根类的概念。

使用 GenerateAst 生成,这里保留了上一章练习中的类方法:

"Class : Token name, Expr.Variable superclass, List<Stmt.Function> methods, List<Stmt.Function> classMethods

You might be surprised that we store the superclass name as an Expr.Variable, not a Token. The grammar restricts the superclass clause to a single identifier, but at runtime, that identifier is evaluated as a variable access. Wrapping the name in an Expr.Variable early on in the parser gives us an object that the resolver can hang the resolution information off of.

对应在 Parser 类的实现略去。

下面是语义的部分,我们先考虑存储超类的信息,再考虑继承的语义。

修改 Resolver 类的 visitClassStmt 方法,在 define(stmt.name); 后添加如下代码:

if (stmt.superclass != null && stmt.name.lexeme.equals(stmt.superclass.name.lexeme)) {
Lox.error(stmt.superclass.name, "A class can't inherit from itself.");
}
if (stmt.superclass != null) {
currentClass = ClassType.SUBCLASS;
resolve(stmt.superclass);
}

这里考虑了一个类继承自身的错误情形,并且我们在 ClassType 中添加了一个标记,为后来检测 super 表达式的错误用法提供了便利。

再强调一下,这里之所以要 resolve(stmt.superclass);,是因为在 Lox 中我们允许在局部环境中进行类的声明。

接着,修改 Interpreter 类的 visitClassStmt 方法,在 environment.define(stmt.name.lexeme, null); 前添加如下代码:

Object superclass = null;
if (stmt.superclass != null) {
superclass = evaluate(stmt.superclass);
if (!(superclass instanceof LoxClass)) {
throw new RuntimeError(stmt.superclass.name, "Superclass must be a class.");
}
}

并在 LoxClass 添加一个新的字段(超类),修改对应的构造方法。

对于元类的超类,先设置为 null,后面还会修改。

现在我们来考虑继承的语义,由于 Lox 是动态类型语言,我们对继承的要求仅仅是子类可以复用超类的方法(也就是说 Lox 并没有遵循里氏代换原则🤣),于是我们只要修改 LoxClassfindMethod 方法:

LoxFunction findMethod(String name) {
if (methods.containsKey(name)) {
return methods.get(name);
}
if (superclass != null) {
return superclass.findMethod(name);
}
return null;
}

super

super 表达式类似 this 表达式,只不过 super 不能单独使用,下面是语法规则:

primary → "true" | "false" | "nil" | "this"
| NUMBER | STRING | IDENTIFIER | "(" expression ")"
| "super" "." IDENTIFIER ;

Typically, a super expression is used for a method call, but, as with regular methods, the argument list is not part of the expression. Instead, a super call is a super access followed by a function call. Like other method calls, you can get a handle to a superclass method and invoke it separately.

So the super expression itself contains only the token for the super keyword and the name of the method being looked up.

使用 GenerateAst 生成:

"Super : Token keyword, Token method"

对应在 Parser 类的实现略去。

super 表达式的语义需要慎重考虑,思考下面的代码:

class A {
method() {
print "A method";
}
}
class B < A {
method() {
print "B method";
}
test() {
super.method();
}
}
class C < B {}
C().test();

正确的结果应该是 A method,分析见图:

350e72eb6ccf4361a7d40ade326b50a0.png

The execution flow looks something like this:

  1. We call test() on an instance of C.
  2. That enters the test() method inherited from B. That calls super.method().
  3. The superclass of B is A, so that chains to method() on A, and the program prints “A method”.

super 表达式从 this 的超类开始寻找,结果就变成了 B method。所以 super 表达式应该从包含 super 表达式的类的超类开始寻找。

为了实现这个语义,我们采用和 this 表达式类似的方法 - 闭包。

One important difference is that we bound this when the method was accessed. The same method can be called on different instances and each needs its own this. With super expressions, the superclass is a fixed property of the class declaration itself. Every time you evaluate some super expression, the superclass is always the same.

修改 Resolver 类的 visitClassStmt 方法,在 beginScope(); 前添加如下代码:

if (stmt.superclass != null) {
beginScope();
scopes.peek().put("super", new Variable(null, VariableState.READ));
}

endScope(); 后添加如下代码:

if (stmt.superclass != null) endScope();

Resolver 类中添加 visitSuperExpr 方法:

@Override
public Void visitSuperExpr(Expr.Super expr) {
if (currentClass == ClassType.NONE) {
Lox.error(expr.keyword, "Can't use 'super' outside of a class.");
} else if (currentClass != ClassType.SUBCLASS) {
Lox.error(expr.keyword, "Can't use 'super' in a class with no superclass.");
}
resolveLocal(expr, expr.keyword, true);
return null;
}

这同时检测了 super 表达式的两种错误用法。

修改 Interpreter 类的 visitClassStmt 方法,在 environment.define(stmt.name.lexeme, null); 后添加:

if (stmt.superclass != null) {
environment = new Environment(environment);
environment.define("super", superclass);
}

environment.assign(stmt.name, klass); 前添加:

if (superclass != null) {
environment = environment.enclosing;
}

最后,我们在 Interpreter 类中添加 visitSuperExpr 方法:

@Override
public Object visitSuperExpr(Expr.Super expr) {
int distance = locals.get(expr);
LoxClass superclass = (LoxClass)environment.getAt(distance, "super");
LoxInstance object = (LoxInstance)environment.getAt(distance - 1, "this");
LoxFunction method = superclass.findMethod(expr.method.lexeme);
if (method == null) {
throw new RuntimeError(expr.method, "Undefined property '" + expr.method.lexeme + "'.");
}
return method.bind(object);
}

这里的 distance - 1 耐人寻味 🤣。我们来看上面的例子,分析其执行过程。

先来看看 resolver 干了些什么:

再来看看 interpreter 干了些什么:

习题

  1. Lox supports only single inheritance- a class may have a single superclass and that’s the only way to reuse methods across classes. Other languages have explored a variety of ways to more freely reuse and share capabilities across classes: mixins, traits, multiple inheritance, virtual inheritance, extension methods, etc.

    If you were to add some feature along these lines to Lox, which would you pick and why? If you’re feeling courageous (and you should be at this point), go ahead and add it.

  2. In Lox, as in most other object-oriented languages, when looking up a method, we start at the bottom of the class hierarchy and work our way up - a subclass’s method is preferred over a superclass’s. In order to get to the superclass method from within an overriding method, you use super.

    The language BETA takes the opposite approach. When you call a method, it starts at the top of the class hierarchy and works down. A superclass method wins over a subclass method. In order to get to the subclass method, the superclass method can call inner, which is sort of like the inverse of super. It chains to the next method down the hierarchy.

    The superclass method controls when and where the subclass is allowed to refine its behavior. If the superclass method doesn’t call inner at all, then the subclass has no way of overriding or modifying the superclass’s behavior.

    Take out Lox’s current overriding and super behavior and replace it with BETA’s semantics. In short:

    • When calling a method on a class, prefer the method highest on the class’s inheritance chain.
    • Inside the body of a method, a call to inner looks for a method with the same name in the nearest subclass along the inheritance chain between the class containing the inner and the class of this. If there is no matching method, the inner call does nothing.

    For example:

    class Doughnut {
    cook() {
    print "Fry until golden brown.";
    inner();
    print "Place in a nice box.";
    }
    }
    class BostonCream < Doughnut {
    cook() {
    print "Pipe full of custard and coat with chocolate.";
    }
    }
    BostonCream().cook();

    This should print:

    Fry until golden brown.
    Pipe full of custard and coat with chocolate.
    Place in a nice box.
  3. In the chapter where I introduced Lox, I challenged you to come up with a couple of features you think the language is missing. Now that you know how to build an interpreter, implement one of those features.

1

对这几个概念挺感兴趣:

其实多重继承是万恶之源,从不同途径(菱形继承)继承来的同一超类,会在子类中存在多份拷贝。这将存在两个问题:其一,浪费存储空间;第二,存在二义性问题。为此,C++ 引入了虚继承的概念:

#include<iostream>
class A {
protected:
int a = 1;
};
class B : virtual public A {
};
class C : virtual public A {
};
class D : public B, public C {
public:
void get() {
std::cout << a;
}
};
int main() {
D d;
d.get();
return 0;
}

Python 而言,当出现多重继承时,会有一个方法的解析顺序:

class A: attr = 1
class B(A): attr = 2
class C(A): attr = 3
class D(B, C): pass
x = D()
print(x.attr)
print(D.mro())

我们可以在一些类中实现某种功能单元,并使用多重继承的方式,将功能组合到子类中。这便是 mixin 的思想:

不是 is-a,而是 -able

class MappingMixin:
def __getitem__(self, key):
return self.__dict__.get(key)
def __setitem__(self, key, value):
return self.__dict__.set(key, value)
class ReprMixin:
def __repr__(self):
s = self.__class__.__name__ + '('
for k, v in self.__dict__.items():
if not k.startswith('_'):
s += '{}={}, '.format(k, v)
s = s.rstrip(', ') + ')' # 将最后一个逗号和空格换成括号
return s
class Person(ReprMixin, MappingMixin):
def __init__(self, name, age):
self.name = name
self.age = age
p = Person("LLW", 18)
print(p['name'])
print(p)
print(Person.mro())

对于 trait,简单来说就是 interface + default method,可以参考这篇文章:Traits with Java 8 Default Methods。作者给出了 trait 的一个简单实现。

C# 中体现了 extension method

2

需要思考一个超类有多个子类的情形,此时的 inner 方法会有二义性。

与多重继承对偶。如果以类为对象、继承关系为态射,或许对应了范畴论中的积和余积 🤣

3

首先实现方法(不是实例方法)的继承,我们需要让子类成为超类的实例,也就是说,在构造子类的元类时,其超类为父类的元类。

为此,我们在 LoxClass 中存储 metaclass 字段,并修改 Interpreter 类的 visitClassStmt 方法,将

LoxClass metaclass = new LoxClass(null, stmt.name.lexeme + " metaclass", null, classMethods);

修改为:

LoxClass metaclass;
if (stmt.superclass != null) {
metaclass = new LoxClass(null, stmt.name.lexeme + " metaclass", ((LoxClass) superclass).metaclass, classMethods);
} else {
metaclass = new LoxClass(null, stmt.name.lexeme + " metaclass", null, classMethods);
}

这样下面的 Lox 程序就可以运行了:

class A {
class foo() {
print "A.foo()";
}
}
class B < A{
}
class C < B{
}
A.foo();
B.foo();
C.foo();

然后在 Interpreter 类中实现标准输入的本地方法:

globals.define("input", new LoxCallable() {
@Override
public int arity() { return 0; }
@Override
public Object call(Interpreter interpreter, List<Object> arguments) {
Scanner scanner = new Scanner(System.in);
return scanner.nextLine();
}
@Override
public String toString() { return "<native fn>"; }
});

这里的 nextLine 方法会读取一行,尽量与 Pythoninput() 方法的语义一致。