---
title: Tracing Aiken Build
date: 2023-09-02
---
Aims:
> Describe the pipeline and components getting from Aiken to Uplc.
## The Preface
### Motivations
The motivation for writing this came from a desire to add additional features to
Aiken not yet available. One such feature would evaluate an arbitrary function
in Aiken callable from JavaScript. This would help a lot with testing and when
trying to align on and off-chain code.
Another, more pipe-dreamy feature is ad-hoc function extraction: from a span of
code, generate a function. A digression to answer _why would this be at all helpful?!_
Validator logic often needs a broad context throughout. How then to best factor
code? Possible solutions:
1. Introduce types / structs
2. Have functions with lots of arguments
3. Don't
The respective problems are:
1. Requires relentless constructing and deconstructing across the function call.
This adds costs.
2. Becomes tedious aligning the definition and function call.
3. Ends up with very long validators which are hard to unit test.
My current preferred way is to accept that validator functions are long. Ad-hoc
function extraction would allow for sections of code to be tested without
needing to be factored out.
To do either of these, we need to get to grips with the Aiken compilation
pipeline.
### This won't age well
Aiken is undergoing active development. This post started life with Aiken
~v1.14. Aiken v1.15 introduced reasonably significant changes to the compilation
pipeline. The word is that there aren't any more big changes in the near future,
but this article will undoubtedly begin to diverge from the current code-base
even before publishing.
### Limitations of narrating code
Narrating code becomes a compromise between being honest and accurate, and being
readable and digestible. The command `aiken build` covers well in excess of
10,000 LoC. The writing of this post ground to a halt as it reached deeper into
the code-base. To redeem it, some (possibly large) sections remain black boxes.
## Aiken build
Tracing `aiken build`, the pipeline is roughly:
```sample
. -> Project::read_source_files ->
Vec<Source> -> Project::parse_sources ->
ParsedModules -> Project::type_check ->
CheckedModules -> CodeGenerator::build ->
AirTree -> AirTree::to_vec ->
Vec<Air> -> CodeGenerator::uplc_code_gen ->
Program / Term<Name> -> serialize ->
.
```
We'll pick our way through these steps.
At a high level we are trying to do something straightforward: reformulate Aiken
code as Uplc. Some Aiken expressions are relatively easy to handle. For example,
an Aiken `Int` goes to an integer in Uplc. Other Aiken expressions require more
involved handling. For example, an Aiken `if ... else if ... else` must have its
branches "nested" in Uplc. Aiken has lots of nice-to-haves like pattern
matching, modules, and generics; Uplc has none of these.
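To make the "nesting" of branches concrete, here is a minimal sketch. The `Term` type and `nest_branches` function are hypothetical toys invented for illustration; they are not Aiken's AST nor real Uplc terms. The idea is only that a flat chain of branches can be folded, from the last branch backwards, into two-way conditionals.

```rust
// Toy illustration only: these types are hypothetical, not Aiken's real
// AST or Uplc terms.
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Bool(bool),
    Int(i64),
    // A single two-way conditional: the only branching form in this toy target.
    IfThenElse(Box<Term>, Box<Term>, Box<Term>),
}

/// Fold a flat chain of `(condition, body)` branches plus a final `else`
/// into nested two-way conditionals. The last branch ends up innermost.
fn nest_branches(branches: Vec<(Term, Term)>, final_else: Term) -> Term {
    branches
        .into_iter()
        .rev()
        .fold(final_else, |acc, (cond, body)| {
            Term::IfThenElse(Box::new(cond), Box::new(body), Box::new(acc))
        })
}
```

Calling `nest_branches` with two branches and a final `else` yields a conditional whose "else" position holds the second conditional, which is the nested shape described above.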
### The Preamble
#### CLI handling
The CLI enters at `aiken/src/cmd/mod.rs`, which parses the command. After
establishing some context, the program enters `Project::build`
(`crates/aiken-project/src/lib.rs`), which in turn calls `Project::compile`.
#### File crawl
The program looks for Aiken files in both the `./lib` and `./validators`
sub-directories. For each, it walks over all contents (recursively) looking for
the `.ak` extension. It treats these two sets of files a little differently. For
example, only validator files can contain the special validator functions.
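A recursive crawl of this kind can be sketched with the standard library alone. This is not Aiken's implementation; `collect_ak_files` is a hypothetical helper written for illustration, and it ignores details such as symlinks or the different treatment of the two directories.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect every file under `dir` whose extension is `ak`.
/// Hypothetical helper for illustration; not taken from the Aiken code-base.
fn collect_ak_files(dir: &Path, found: &mut Vec<PathBuf>) -> io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            // Descend into sub-directories.
            collect_ak_files(&path, found)?;
        } else if path.extension().map_or(false, |ext| ext == "ak") {
            found.push(path);
        }
    }
    Ok(())
}
```

Running it once per source directory gives two separate lists of files, which leaves room to treat them differently afterwards.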
#### Parse and Type check
`Project::parse_sources` parses the module source code. The heavy lifting is
done by `aiken_lang::parser::module`, which is evaluated on each file. It
produces a `Module` containing a list of parsed definitions of the file:
functions, types _etc_, together with metadata like docstrings and the file
path.
`Project::type_check` inspects the parsed modules and, as the name implies,
checks the types. It flags type level warnings and errors and constructs a hash
map of `CheckedModule`s.
#### Code generator
The code generator `CodeGenerator` (`aiken-lang/src/gen_uplc.rs`) is given the
definitions found from the previous step, together with the plutus builtins. It
has additional fields for things like debugging.
This is handed over to a `Blueprint` (`aiken-project/src/blueprint/mod.rs`). The
blueprint does little more than find the validators on which to run the code
gen. The heavy lifting is done by `CodeGenerator::generate`.
We are now ready to take the source code and create plutus.
### In the air
Things become a bit intimidating at this point in terms of sheer lines of code:
`gen_uplc.rs` and three modules in `gen_uplc/` total > 8500 LoC.
Aiken has its own _intermediate representation_ called `air` (as in Aiken
Intermediate Representation). Intermediate representations are common in
compiled languages. `Air` is defined in `aiken-lang/src/gen_uplc/air.rs`.
Unsurprisingly, it looks a little bit like a language between Aiken and plutus.
In fact, Aiken has another intermediate representation: `AirTree`. This is
constructed between the `TypedExpr` and `Vec<Air>`, i.e. between parsed Aiken
and air.
#### Climbing the AirTree
Within `CodeGenerator::generate`, `CodeGenerator::build` is called on the
function body. This takes a `TypedExpr` and constructs and returns an `AirTree`.
The construction is recursive as it traverses the recursive `TypedExpr` data
structure. More on what an airtree is and its construction below. At the same
time `self` is treated as `mut`, so we need to keep an eye on this too. The
method that relies on this mutability of self is `self.assignment`. It does so
via:
```sample
- self.assignment
  └ self.expect_type_assign
    └ self.code_gen_functions.insert
```
and thus creates a hashmap of all the functions that appear in the definition.
From call to return, `assignment` covers > 600 LoC, so we'll leave this as a
black box. (`self.handle_each_clause` is also called with `mut`; it in turn
calls `self.build`, for which `mut` is needed.)
Validators in Aiken are boolean functions while in Uplc they are unit-valued
(aka void-valued) functions. Thus the air tree is wrapped such that `false`
results in an error (`wrap_validator_condition`). I don't know why there is a
prevailing thought that boolean functions are preferable to functions that error
if anything is wrong - which is what validators are.
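The effect of this wrapping can be sketched with plain closures. This is only a toy model of the idea, not what `wrap_validator_condition` does to the airtree; `wrap_condition` and the use of a single `i64` argument are assumptions made for illustration.

```rust
/// Toy model: turn a boolean validator into one that either succeeds with
/// unit or fails. Hypothetical helper; the real wrapping happens on the
/// airtree, not on Rust closures.
fn wrap_condition(
    validator: impl Fn(i64) -> bool,
) -> impl Fn(i64) -> Result<(), String> {
    move |redeemer| {
        if validator(redeemer) {
            Ok(())
        } else {
            // `false` becomes a failure rather than a returned value.
            Err("validator returned false".to_string())
        }
    }
}
```

The wrapped function never returns `false`: it either returns unit or errors, which is the shape a validator has at the Uplc level.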
`check_validator_args` again extends the airtree from the previous step, and
again calls `self.assignment` mutating self. Something interesting is happening
here. Script context is the final argument of a validator - for any script
purpose. `check_validator_args` treats the script context like it is an unused
argument. The importance of this is not immediately clear, and I have yet to
appreciate why this happens.
Let's take a look at what `AirTree` actually is:
```language-rust
pub enum AirTree {
    Statement {
        statement: AirStatement,
        hoisted_over: Option<Box<AirTree>>,
    },
    Expression(AirExpression),
    UnhoistedSequence(Vec<AirTree>),
}
```
Note that `AirStatement` and `AirExpression` are mutually recursive definitions
with `AirTree`. Otherwise, it would be unclear from first inspection how
tree-like this really is.
`AirExpression` has multiple constructors. These include (non-exhaustive):
- air primitives (including all the ones that appear in plutus)
- constructors `Call` and `Fn` to handle anonymous functions
- binary and unary operators
- handling when and if
- handling error and tracing
`AirStatement` also has multiple constructors. These include:
- let assignments and named function definitions
- handling expect assignments
- pattern matching
- unwrapping data structures
Note that `AirTree` has many methods that are partial functions, as in there are
possible states that are not considered legitimate at different points of its
construction and use. For example `hoist_over` will throw an error if called on
an `Expression`. As `AirTree` is for internal use only, the scope for potential
problems is reasonably contained. It seems likely this is to avoid
similar-yet-different IRs between steps. However, the trade-off is that it
partially obscures which states are valid at which point.
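As an illustration of what "partial" means here, consider a cut-down tree with the same overall shape. The `Tree` type below is a hypothetical simplification (names are plain strings), and its `hoist_over` only mimics the behaviour described above: it is defined for statements and refuses to work on expressions.

```rust
// Hypothetical, heavily simplified stand-in for the real `AirTree`.
#[derive(Debug, PartialEq)]
enum Tree {
    Statement {
        name: String,
        hoisted_over: Option<Box<Tree>>,
    },
    Expression(String),
}

impl Tree {
    /// Place `next` beneath this statement. This is a partial function:
    /// it only makes sense for `Statement` and panics otherwise.
    fn hoist_over(self, next: Tree) -> Tree {
        match self {
            Tree::Statement { name, .. } => Tree::Statement {
                name,
                hoisted_over: Some(Box::new(next)),
            },
            Tree::Expression(_) => panic!("cannot hoist over an expression"),
        }
    }
}
```

The type alone does not rule out the bad call; only the caller's discipline does. An alternative design would split the states into separate types, at the cost of the similar-yet-different IRs mentioned above.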
What is hoisting? Hoisting gives the airtree depth. The motivation is that by
the time we hit Uplc it is "generally better" that:
- function definitions appear once rather than being inlined multiple times
- the definition appears as close to use as possible
Hoisting creates tree paths. The final airtree to airtree step,
`self.hoist_functions_to_validator`, traverses these paths. There is a lot of
mutating of self, making it quite hard to keep a handle on things. Across all
these (several thousand?) LoC, it is essentially ascertaining in which node of
the tree to insert each function definition. In a resource constrained
environment like plutus, this effort is warranted.
At the same time this function deals with:
- monomorphisation - no more generics
- erasing opaque types
Neither generics nor opaque types exist at the Uplc level.
#### Into Air
The step `to_vec : AirTree -> Vec<Air>` is much easier to digest. For one, it is
not evaluated in the context of the code generator, and two, there is no
mutation of the airtree. The function recursively takes nodes of the tree and
maps them to entries in a mutable vector. It flattens the tree to a vec.
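The flattening can be sketched on a toy tree. The `Tree` and `Air` types below are hypothetical and far smaller than the real ones, and the pre-order traversal is an assumption made for illustration rather than a claim about the exact order Aiken uses.

```rust
// Hypothetical toy types; the real `AirTree` and `Air` are much larger.
#[derive(Debug, PartialEq)]
enum Air {
    Let(String),
    Int(i64),
    Var(String),
}

enum Tree {
    Let {
        name: String,
        value: Box<Tree>,
        then: Box<Tree>,
    },
    Int(i64),
    Var(String),
}

/// Visit the tree recursively, pushing one entry per node onto `out`.
fn flatten(tree: &Tree, out: &mut Vec<Air>) {
    match tree {
        Tree::Let { name, value, then } => {
            out.push(Air::Let(name.clone()));
            flatten(value, out);
            flatten(then, out);
        }
        Tree::Int(n) => out.push(Air::Int(*n)),
        Tree::Var(v) => out.push(Air::Var(v.clone())),
    }
}

/// The tree is only read; the sole mutable state is the output vector.
fn to_vec(tree: &Tree) -> Vec<Air> {
    let mut out = Vec::new();
    flatten(tree, &mut out);
    out
}
```

For a tree representing `let x = 1` followed by `x`, this produces the three entries `Let("x")`, `Int(1)`, `Var("x")` in that order.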
### Down to Uplc
Next we go from `Vec<Air> -> Term<Name>`. This step is a little more involved
than the previous. For one, this is executed in the context of the code
generator. Moreover, the code generator is treated as mutable - ouch.
On further inspection we see that the only mutation is setting
`self.needs_field_access = true`. This flag informs the compiler that, if true,
additional terms must be added in one of the final steps (see
`CodeGenerator::finalize`).
As noted above, some of the mappings from air to terms are immediate like
`Air::Bool -> Term::bool`.
Others are less so. Some examples:
- `Air::Var` requires 100 LoC to do case handling on different constructors.
- Lists in air have no immediate analogue in Uplc.
- Builtins, as in built-in functions (standard shorthand), have to be mediated
with some combination of `force` and `delay` in order to behave as they
should.
- User functions must be curried, i.e. treated as a sequence of single
argument functions, and recursion must be handled.
- "Record updates" need some magic in order to be handled efficiently.
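The point about user functions can be illustrated in isolation. Turning a multi-argument function into a sequence of single-argument functions (usually called currying) amounts to nesting one lambda per parameter. The `Term` type and `nest_lambdas` below are hypothetical toys, not the `Term<Name>` from the Aiken code-base, and recursion is not handled.

```rust
// Hypothetical toy term type, not the real Uplc `Term<Name>`.
#[derive(Debug, PartialEq)]
enum Term {
    Var(String),
    Lambda { param: String, body: Box<Term> },
}

/// Wrap `body` in one single-argument lambda per parameter.
/// The first parameter becomes the outermost lambda.
fn nest_lambdas(params: &[&str], body: Term) -> Term {
    params.iter().rev().fold(body, |acc, param| Term::Lambda {
        param: param.to_string(),
        body: Box::new(acc),
    })
}
```

A two-argument function over `a` and `b` becomes a lambda over `a` whose body is a lambda over `b`, so arguments are then supplied one application at a time.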
#### Cranking the Optimizer
There is a sequence of operations performed on the Uplc, mapping
`Term<Name> -> Term<Name>`. This removes inconsequential parts of the logic
which have been generated, including:
- removing application of the identity function
- directly substituting where apply lambda is applied to a constant or builtin
- inline or simplify where apply lambda is applied to a parameter that appears
once or not at all
Each of these optimizing methods has its own relatively narrow focus, and so
although there is a fair number of LoC, it's reasonably straightforward to
follow. Some are applied multiple times.
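To give a flavour of one such narrowly focused pass, here is a sketch of removing applications of the identity function. The `Term` type is a hypothetical toy and `remove_identity` is written for illustration; it is not the optimizer code from Aiken, and it ignores details such as `force` and `delay`.

```rust
// Hypothetical toy term type, not the real Uplc `Term<Name>`.
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Var(String),
    Int(i64),
    Lambda(String, Box<Term>),
    Apply(Box<Term>, Box<Term>),
}

/// True for terms of the shape `\x -> x`.
fn is_identity(term: &Term) -> bool {
    match term {
        Term::Lambda(param, body) => match &**body {
            Term::Var(name) => name == param,
            _ => false,
        },
        _ => false,
    }
}

/// Rewrite `(\x -> x) arg` to `arg`, bottom-up across the whole term.
fn remove_identity(term: Term) -> Term {
    match term {
        Term::Apply(function, argument) => {
            let function = remove_identity(*function);
            let argument = remove_identity(*argument);
            if is_identity(&function) {
                argument
            } else {
                Term::Apply(Box::new(function), Box::new(argument))
            }
        }
        Term::Lambda(param, body) => {
            Term::Lambda(param, Box::new(remove_identity(*body)))
        }
        other => other,
    }
}
```

Because the pass has type `Term -> Term`, it composes freely with other passes and can be run more than once.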
### The End
The generated program can now be serialized and included in the blueprint.
### Plutus Core Signposting
All this fuss is to get us to a point where we can write Uplc - and good Uplc at
that. Note that there are many ways to generate code and most of them are bad.
The various design decisions and compilation steps make more sense when we have
a better understanding of the target language.
Uplc is a lambda calculus. For a comprehensive definition of Uplc, check out the
specification found
[here](https://github.com/input-output-hk/plutus/#specifications-and-design)
from the plutus GitHub repo. (I imagine this link will be maintained longer than
the current actual link.) If you're not at all familiar with lambda calculus, I
recommend [an unpacking](https://crypto.stanford.edu/~blynn/lambda/) by Ben
Lynn.
### What next?
I think it would be helpful to have some examples... Watch this space.