11 April 2017

Lately in Kitten #2: Closures, Boxing, Permissions, and Assumptions

Recent work on Kitten has been focused on design rather than implementation, so those of you who’ve been following the project might have noticed an apparent lull in development activity. Having figured out a few issues, I’ll be getting back to the code this week. Here’s an overview of the stuff I’ve been thinking about.

Closure Conversion

Closure conversion and lambda lifting are two operations central to the implementation of a compiler for a functional programming language. The goal is to convert anonymous functions that implicitly capture variables from the enclosing scope into closures that explicitly indicate which variables they capture, and to lift inline anonymous functions into top-level definitions.
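
For readers more at home in applicative languages, here is the same idea sketched in Haskell; the names are made up for illustration, and this is not how Kitten’s compiler represents terms:

-- Before lifting: the inner lambda implicitly captures x from its scope.
addToAll :: Int -> [Int] -> [Int]
addToAll x xs = map (\y -> x + y) xs

-- After lifting: the capture becomes an explicit extra argument, and the
-- helper can live at the top level like any other definition.
addToAllLifted :: Int -> [Int] -> [Int]
addToAllLifted x xs = map (lifted x) xs

lifted :: Int -> Int -> Int
lifted x y = x + y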

For instance, the expression -> x; { x } takes a value from the stack and wraps it in a function that simply returns that value when called. The quotation { x } captures the local variable x from the enclosing scope. Currently, the compiler transforms this into an explicit “capture” expression, represented in pseudo-Kitten as:

$(local.0){ closure.0 }

When execution reaches this expression, the local variable (local.0) is copied into the closure; when the closure is executed, closure.0 refers to this captured value.

However, the implementation of this process is overly complex, and makes a needless distinction between closure variables and local variables. I’ve figured out a much nicer (and more traditional) implementation that represents closure variables simply as extra parameters passed on top of the stack when a closure is invoked. Again in pseudo-Kitten, this yields:

local.0 { -> closure.0; closure.0 } new.closure.1

The advantage here is that it makes it very simple to produce unboxed closures: the new.closure.n internal instruction simply coerces the top of the stack, from a series of n captured values followed by a function pointer, into a single opaque closure value. When the closure is invoked with call, the coercion is performed in reverse, yielding the captured values as ordinary arguments on the stack beneath a function pointer, which is then called just like any other function.

Furthermore, this implementation of unboxed closures maps very nicely to existential types. Briefly, we can represent a closure as an unboxed pair of the captured values and a function pointer, using existential quantification to hide the actual types of the captured values. With this change, you’ll be able to pass closures around as ordinary stack-allocated values. Because the closure type is abstract, it doesn’t unify with any other type—as in Rust, internally each closure will get a unique anonymous type.
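
As a rough Haskell analogue (with hypothetical names, and not Kitten’s actual representation), a closure is a pair of a captured environment and the code that expects it, with the environment’s type hidden by existential quantification:

{-# LANGUAGE ExistentialQuantification #-}

-- A closure pairs some captured environment with a function expecting that
-- environment; the type variable `env` is invisible from the outside.
data Closure a b = forall env. Closure env (env -> a -> b)

-- Building a closure makes the captured variable explicit.
adder :: Int -> Closure Int Int
adder x = Closure x (\captured y -> captured + y)

-- Calling a closure unpacks the environment and passes it back to the code.
call :: Closure a b -> a -> b
call (Closure env f) x = f env x

This sketch still boxes the environment behind one uniform type; Kitten’s unboxed closures would instead lay the pair out on the stack and give each closure its own anonymous type.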

Boxing

After unboxed closures are implemented, you will only need to pay the cost of boxing a closure when you want to store it in another data structure, and this will be explicit. For example, currently you might write a list of functions of type List<(Int32 -> Int32)> as:

[\(+ x), \(* x), \(/ x)]

Once unboxed closures are implemented, you will need to write:

[\(+ x) box, \(* x) box, \(/ x) box]

This is essentially the same as std::function<int32_t(int32_t)> in C++, which performs boxing internally.

Likewise, the list literal notation [1, 2, 3] will be changed to produce an unboxed array of known size, with a type like Array<Int32, 3>—the same as the type int32_t[3] in C or C++. You will only need to allocate memory when you want this size to be dynamic: [1, 2, 3] box will have the type List<Int32>, equivalent to std::vector<int32_t> in C++.
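
In Haskell terms, the planned distinction looks roughly like this; the types here are hypothetical stand-ins, not Kitten’s:

{-# LANGUAGE DataKinds, KindSignatures #-}

import GHC.TypeLits (Nat)

-- Like Array<Int32, 3>: the length is part of the type, so the compiler can
-- lay the elements out directly, with no length field needed at runtime.
newtype Array (n :: Nat) a = Array [a]   -- element storage elided in this sketch

-- Like List<Int32>: dynamically sized and heap-allocated.
newtype List a = List [a]

-- box forgets the static length and pays for an allocation.
box :: Array n a -> List a
box (Array xs) = List xs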

In the future, I may introduce more concise notations such as {| … |} and [| … |] for boxed closures and lists, respectively. However, the box word has the advantage of being more searchable.

In addition, the box trait will require the +Alloc permission, which will be implicitly granted to all definitions by default. You will be able to waive the permission to allocate by adding -Alloc to a type signature, and this should be very useful for statically guaranteeing that critical code paths allocate no heap memory.

Permissions

While adding the implicit +Alloc permission, I plan to grant other permissions by default as well. For example, +Fail is the permission to abort the program with an assertion failure, and it will be used by default for overflow-checked arithmetic operations. If you waive this permission by adding -Fail to the signature of a definition, then that definition will be statically guaranteed not to raise any assertion failures.
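
As a rough Haskell analogue of that difference (hypothetical functions, purely for illustration): a definition that keeps +Fail may abort on overflow, while one that waives it has to surface the failure in its result type.

import Data.Int (Int32, Int64)

-- Does a widened sum still fit in Int32?
fits :: Int64 -> Bool
fits n = n >= fromIntegral (minBound :: Int32)
      && n <= fromIntegral (maxBound :: Int32)

-- Roughly "+Fail": overflow-checked addition that may abort the whole program.
checkedAdd :: Int32 -> Int32 -> Int32
checkedAdd x y
  | fits wide = fromIntegral wide
  | otherwise = error "integer overflow"
  where wide = fromIntegral x + fromIntegral y :: Int64

-- Roughly "-Fail": the same check, but the possibility of failure is
-- visible in the result type instead.
safeAdd :: Int32 -> Int32 -> Maybe Int32
safeAdd x y
  | fits wide = Just (fromIntegral wide)
  | otherwise = Nothing
  where wide = fromIntegral x + fromIntegral y :: Int64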

Because these implicit permissions will be backward-compatible, we will be able to add additional permissions in the future for easy, fine-grained control over low-level performance and safety details. For example, if we make it possible to give type parameters to a permission, then we might add such permissions as +Read<R> and +Write<R> for reasoning about accesses to some region of memory R; with some metadata, the compiler will be able to tell that reads and writes to different memory regions are commutative, enabling it to safely reorder instructions to produce more instruction-level parallelism. This, in turn, should make it easier to write correct high-performance lock-free code.

Furthermore, I plan to extend the permission system to allow permissions to provide data; this could be used to implement such things as dynamic scoping, typeclasses, and Scala-style implicits. Because all terms denote functions, there’s no material difference between an effect (an action the code is allowed to take on the environment) and a coeffect (a constraint on the environment in which some code is allowed to run). So with this small change, Kitten’s permission system will be able to neatly represent both.

As with the free monad approach to extensible effects, these changes will enable the same permission to be executed with different handlers. So you will be able to write a single definition with a user-defined permission (such as +Database), then interpret the same code in different ways (such as accessing a mock, test, or production database) to enable more code reuse and easier testing.
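
Here is a minimal free-monad sketch of that idea in Haskell; the +Database operations and handler names are hypothetical:

{-# LANGUAGE DeriveFunctor #-}

-- One operation of a hypothetical +Database permission.
data DatabaseF k
  = Fetch String (String -> k)   -- look up a key, continue with its value
  | Store String String k        -- write a key, then continue
  deriving Functor

-- A bare-bones free monad over that operation functor.
data Free f a = Pure a | Op (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a) = Pure (g a)
  fmap g (Op m)   = Op (fmap (fmap g) m)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g <*> m = fmap g m
  Op g   <*> m = Op (fmap (<*> m) g)

instance Functor f => Monad (Free f) where
  Pure a >>= g = g a
  Op m   >>= g = Op (fmap (>>= g) m)

type Database = Free DatabaseF

fetch :: String -> Database String
fetch key = Op (Fetch key Pure)

store :: String -> String -> Database ()
store key value = Op (Store key value (Pure ()))

-- The same program, interpreted against an in-memory mock...
runMock :: [(String, String)] -> Database a -> a
runMock _  (Pure a)            = a
runMock kv (Op (Fetch k cont)) = runMock kv (cont (maybe "" id (lookup k kv)))
runMock kv (Op (Store k v m))  = runMock ((k, v) : kv) m

-- ...or against the real world (stubbed here with IO actions).
runIO :: Database a -> IO a
runIO (Pure a)            = pure a
runIO (Op (Fetch k cont)) = do putStrLn ("query " ++ k); runIO (cont "")
runIO (Op (Store k v m))  = do putStrLn ("write " ++ k ++ "=" ++ v); runIO m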

Assumptions

Finally, I’ve been working on figuring out the details of an additional extension to the permission system, which I term assumptions. Whereas permissions are attached to function types to describe their side effects and constraints, assumptions will be attached to values to describe simple invariants. I haven’t quite figured out the syntax for this, but for example, the type List<Int32> @Sorted @NonEmpty might describe a list of integers that is both sorted and non-empty.

The compiler should be able to use these assumptions to produce better error messages, as well as better compiled code, e.g., by eliminating redundant checks. I envision an unshared mutable reference type Owned<T> and a shared reference-counted variant Shared<T>. Assumptions would let you safely claim ownership of a shared reference with a function such as claim, of type Shared<T> @Unique -> Owned<T>.

I can also see this being used to constrain the ranges of numeric types, such as Int32 @Range<0, 1024>.

Like permissions, assumptions will have a subtyping property. It’s fine to pass a List<Int32> @Sorted to a function expecting any old List<Int32>, but not the other way around: you can’t pass any old List<Int32> to a function expecting a List<Int32> @Sorted.

You will be able to imbue a value with assumptions safely using a static or dynamic check:

assumption Sorted<T> (List<T>):
  dup tail \<= zip_with and

Or unsafely using a coercion such as imbue (@Sorted). Various standard functions will produce values with standard assumptions; for example, sort may have the type List<T> -> List<T> @Sorted.
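
One way to picture this in Haskell (a sketch with made-up names, not a design commitment): the assumption becomes a wrapper type whose values are only produced by the dynamic check, by trusted functions like sort, or by an explicitly unsafe coercion.

import Data.List (sort)

-- List<T> @Sorted: the wrapper records the assumption. Forgetting it is
-- always allowed, which gives the subtyping direction described above.
newtype Sorted a = Sorted { forgetSorted :: [a] }

-- The dynamic check corresponding to `assumption Sorted<T>`.
checkSorted :: Ord a => [a] -> Maybe (Sorted a)
checkSorted xs
  | and (zipWith (<=) xs (drop 1 xs)) = Just (Sorted xs)
  | otherwise                         = Nothing

-- The unsafe coercion corresponding to `imbue (@Sorted)`.
imbueSorted :: [a] -> Sorted a
imbueSorted = Sorted

-- A trusted producer, like `sort` with type List<T> -> List<T> @Sorted.
sortSorted :: Ord a => [a] -> Sorted a
sortSorted = Sorted . sort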

Thinking Forth

As always, contributors are more than welcome—a programming language is a massive undertaking, and I am only one person. Even if you have no experience with compilers, concatenative programming, or Haskell (in which Kitten’s compiler is written), there are plenty of tasks available for a motivated contributor of any skill level. Feel free to look at the current issues, email me directly, or write to the Kitteneers mailing list and I’ll gladly help you help me!

15 February 2017

Lately in Kitten

At the turn of the new year, I resolved to work on Kitten every week of 2017, and so far it’s been going pretty well. With the success of This Week in Rust, I thought I’d make a point to continue documenting my thought processes when designing and implementing language features, rather than leave them in the dark in my notebooks. I hope it’ll be enlightening to fellow language designers, and enjoyable for anyone with an interest in programming languages.

In the News

Someone posted about the language on Hacker News, and it led to some interesting discussions and multiple offers from potential contributors. Matz even tweeted about it.

It was yet another reminder that I need to update the website, since I haven’t done so for about 2 years, and many details of the language have changed significantly in that time. I swear I’ll get around to it eventually. :P

Typechecker Fixes

“Instance checking” refers to verifying an inferred type against a type signature, by checking that the type is at least as polymorphic as the signature. We do this by skolemising the type (replacing all its type variables with fresh unique type constants) and then running a subsumption check. This is largely the same as unification, but it accounts for the fact that a function type is contravariant in its input type: if (a → b) ≤ (c → d), then b ≤ d (covariant), but c ≤ a (contravariant). This is also known as the “generic instance” relation in System F, and is usually spelled ⊑.
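
The function-type case looks roughly like this in Haskell; the Type representation is a toy for illustration, not the compiler’s:

-- Toy type representation, just enough to show the variance rule.
data Type = Con String | Fun Type Type
  deriving Eq

-- lte t u checks t ≤ u, the relation described above.
lte :: Type -> Type -> Bool
lte (Fun a b) (Fun c d) =
     lte c a   -- inputs are contravariant: the arguments flip
  && lte b d   -- outputs are covariant
lte t u = t == u   -- base case: plain equality; the real check also
                   -- unifies type variables and handles quantifiers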

Somewhere along the line, instance checking got broken, which meant that the inferred type of a function no longer had to match its declared type: a total violation of type safety. That sounds dramatic, but a “total violation” is normal when you have a typechecker bug—unsoundness can sneak in at any time. Fixing this required some simple tests and logging to track down the issue, which turned out to be a trivial case of swapping two variables, the declared and inferred types. D’oh!

Syntactic Additions and Fixes

Kitten uses \name as sugar for { name } when pushing a function to the stack. James Sully noticed that this notation can be generalised to accept any term. This resulted in a nice little notational improvement when passing a partially applied operator to a higher-order function:

// Before
[1, 2, 3] { (* 5) } map

// After
[1, 2, 3] \(* 5) map

The new as (Type, Type, …) operator lets you write the identity function specialised to a certain type, which is useful for documenting tricky code and avoiding ambiguity in type inference:

// Documenting types
1 2.0 "three" as (Int32, Float64, List<Char>)

// Error: what type do we read the string as before showing it?
"1" read show

// Unambiguous
"1" read as (Int32) show

// Potential error: not clear what type we cast to
1.0 cast

// Unambiguous
1.0 cast as (Float32)

Finally, I’ve enabled support for Unicode versions of many common Kitten symbols; for example, you can write → instead of ->. In the future, we might put this behind a feature flag, or at least warn about inconsistent mixing of Unicode symbols and their ASCII approximations.

Simplifying Internals

The compiler is essentially a database server. It takes requests in the form of program fragments from source files or interactive input, then tries to tokenise, parse, and typecheck them. Then it reports any errors, or commits the result of successful compilation to the “dictionary”, a key-value store of all the definitions in the program. Generating an executable is just one way to serialise this dictionary—it will also be used to generate documentation, TAGS files, syntax highlighting information, source maps, and so on.

However, the design of the dictionary grew organically as I needed different features, so it’s essentially just a hash table that the compiler reads and updates directly. I’ve been working on replacing it with a simpler design that exposes only two API endpoints: query and update. In the future, this should make it easier to run the compiler as a language server, to ease editor integration.
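
Sketched in Haskell with hypothetical names (the real compiler’s types will differ), the whole public surface would be something like:

import qualified Data.Map as Map

type Name = String          -- stand-in for the compiler's qualified names
data Entry = Entry          -- stand-in for a typechecked dictionary entry

newtype Dictionary = Dictionary (Map.Map Name Entry)

-- Endpoint 1: look a definition up.
query :: Name -> Dictionary -> Maybe Entry
query name (Dictionary entries) = Map.lookup name entries

-- Endpoint 2: commit the result of a successful compilation.
update :: Name -> Entry -> Dictionary -> Dictionary
update name entry (Dictionary entries) =
  Dictionary (Map.insert name entry entries)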

I’ve also been trying to simplify the internal representation of program terms (e.g. #168), to make things easier to compile. Kitten’s compiler is somewhat unusual in that the same term representation is used throughout the compilation process—from parsing and desugaring, through typechecking, to code generation. This is feasible because Kitten is essentially just a high-level stack-based assembly language, with a healthy dose of static typing and syntactic sugar.

Some of these refactorings have exposed bugs that will need to be addressed before they can be landed. For example, when a closure captures a variable with a polymorphic type, it’s lifted into a generic definition; the compiler fails to generate an instantiation of it, leading to what amounts to a linking error later on.

I’m going to try to do more of this development in the open, via GitHub PRs, to give potential contributors more visibility into what I’m up to.

Code Generation

The old compiler generated C99 code; the new compiler is moving away from C as a target language, opting instead to generate assembly. This will give us more control over calling conventions and data representations. There is a half-baked backend for our first target architecture, x86-64, which I intend to flesh out in the coming weeks.

Working on the backend should give me reason to polish some of the lower-level bits of Kitten, such as +Unsafe code, copy constructors and destructors, and unboxed closures. The generated executables will still depend on libc for now, but it’s my long-standing goal to eventually require zero nontrivial runtime support.

Removing the Old Compiler

Kitten has been rewritten several times over the course of its 5-year history, as I’ve developed a vision of what exactly I’m trying to build. I’ve been referring to the latest rewrite as “the new compiler” for over a year now, and it’s high time to remove the “old” one. There are only about a dozen minor features and nice-to-haves present in the old compiler that haven’t yet been ported; soon I’ll have the pleasure of red-diffing over 7000 lines of code. :)

Looking Forward

It’s been a long journey already, and there’s so much left to do, but I think we’re on target for a release late this year. I’m always happy to welcome contributors—there’s room for all kinds of people, including compiler developers, testers, documentation writers, and UI designers and developers. If there’s something you’re interested in working on, I’ll gladly help you get up to speed.