---
title: Tracing Aiken Build
date: 2023-09-02
---

Aims:

> Describe the pipeline and components getting from Aiken to Uplc.

## The Preface

### Motivations

The motivation for writing this came from a desire to add features to Aiken that
are not yet available. One such feature would evaluate an arbitrary Aiken
function, callable from JavaScript. This would help a lot with testing, and when
trying to align on-chain and off-chain code.

Another, more pipe-dreamy feature is ad-hoc function extraction: from a span of
code, generate a function. A digression to answer _why would this be at all
helpful?!_ Validator logic often needs a broad context throughout. How then to
best factor code? Possible solutions:

1. Introduce types / structs
2. Have functions with lots of arguments
3. Don't

The respective problems are:

1. Requires relentless constructing and deconstructing across the function call.
   This adds costs.
2. Becomes tedious aligning the definition and function call.
3. Ends up with very long validators which are hard to unit test.

My current preferred way is to accept that validator functions are long. Ad-hoc
function extraction would allow sections of code to be tested without needing to
be factored out.

To do either of these, we need to get to grips with the Aiken compilation
pipeline.

### This won't age well

Aiken is undergoing active development. This post started life with Aiken
~v1.14. Aiken v1.15 introduced reasonably significant changes to the compilation
pipeline. The word is that there aren't any more big changes in the near future,
but this article will undoubtedly begin to diverge from the current code-base
even before publishing.

### Limitations of narrating code

Narrating code becomes a compromise between being honest and accurate, and being
readable and digestible. The command `aiken build` covers well in excess of
10,000 LoC. The writing of this post ground to a halt as it reached deeper into
the code-base. To redeem it, some (possibly large) sections remain black boxes.

## Aiken build

Tracing `aiken build`, the pipeline is roughly:

```sample
. -> Project::read_source_files ->
Vec<Source> -> Project::parse_sources ->
ParsedModules -> Project::type_check ->
CheckedModules -> CodeGenerator::build ->
AirTree -> AirTree::to_vec ->
Vec<Air> -> CodeGenerator::uplc_code_gen ->
Program / Term<Name> -> serialize ->
.
```

We'll pick our way through these steps.

At a high level we are trying to do something straightforward: reformulate Aiken
code as Uplc. Some Aiken expressions are relatively easy to handle: for example,
an Aiken `Int` goes to an integer in Uplc. Others require more involved
handling: for example, an Aiken `if ... else if ... else` must have its branches
"nested" in Uplc. Aiken has lots of nice-to-haves like pattern matching,
modules, and generics; Uplc has none of these.

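To make the "nesting" of branches concrete, here is a minimal sketch in Rust. It
is not Aiken's code: `Expr` and `nest_branches` are names invented for
illustration, and the toy target has only a two-way conditional. The point is
just that a flat chain of branches can be folded, from the last branch inwards,
into conditionals that live in each other's else position.

```rust
/// A toy expression type; hypothetical, not Aiken's real AST.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Bool(bool),
    Int(i64),
    /// A two-way conditional: condition, then-branch, else-branch.
    IfThenElse(Box<Expr>, Box<Expr>, Box<Expr>),
}

/// Fold a flat chain of `(condition, body)` branches plus a final `else`
/// into nested two-way conditionals, working from the last branch inwards.
pub fn nest_branches(branches: Vec<(Expr, Expr)>, final_else: Expr) -> Expr {
    branches
        .into_iter()
        .rev()
        .fold(final_else, |otherwise, (condition, body)| {
            Expr::IfThenElse(Box::new(condition), Box::new(body), Box::new(otherwise))
        })
}
```

Calling `nest_branches` with two branches and a final else yields one
conditional whose else position holds the second conditional. With no branches
at all, it simply returns the final else.
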
### The Preamble

#### Cli handling

The cli enters at `aiken/src/cmd/mod.rs`, which parses the command. After
establishing some context, the program enters `Project::build`
(`crates/aiken-project/src/lib.rs`), which in turn calls `Project::compile`.

#### File crawl

The program looks for Aiken files in both the `./lib` and `./validators`
sub-directories. For each, it walks over all contents (recursively) looking for
the `.ak` extension. It treats these two sets of files a little differently. For
example, only validator files can contain the special validator functions.

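The crawl itself is a plain recursive directory walk. A minimal sketch using
only the Rust standard library follows. `collect_ak_files` is a hypothetical
helper written for this post, not the function Aiken uses, and the directory to
walk is supplied by the caller.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect every file under `dir` that has the `.ak` extension.
pub fn collect_ak_files(dir: &Path, found: &mut Vec<PathBuf>) -> io::Result<()> {
    if !dir.is_dir() {
        // A project may be missing one of the two source directories.
        return Ok(());
    }
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            // Descend into sub-directories.
            collect_ak_files(&path, found)?;
        } else if path.extension().map_or(false, |ext| ext == "ak") {
            found.push(path);
        }
    }
    Ok(())
}
```

Running it once per source directory, into two separate vectors, keeps the
library files and the validator files apart so that later steps can treat them
differently.
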
#### Parse and Type check

`Project::parse_sources` parses the module source code. The heavy lifting is
done by `aiken_lang::parser::module`, which is run on each file. It produces a
`Module` containing a list of the file's parsed definitions (functions, types
_etc_), together with metadata like docstrings and the file path.

`Project::type_check` inspects the parsed modules and, as the name implies,
checks the types. It flags type-level warnings and errors and constructs a hash
map of `CheckedModule`s.

#### Code generator

The code generator `CodeGenerator` (`aiken-lang/src/gen_uplc.rs`) is given the
definitions found in the previous step, together with the plutus builtins. It
has additional fields for things like debugging.

This is handed over to a `Blueprint` (`aiken-project/src/blueprint/mod.rs`). The
blueprint does little more than find the validators on which to run the code
gen. The heavy lifting is done by `CodeGenerator::generate`.

We are now ready to take the source code and create plutus.

### In the air

Things become a bit intimidating at this point in terms of sheer lines of code:
`gen_uplc.rs` and the three modules in `gen_uplc/` total > 8500 LoC.

Aiken has its own _intermediate representation_ called `air` (as in Aiken
Intermediate Representation). Intermediate representations are common in
compiled languages. `Air` is defined in `aiken-lang/src/gen_uplc/air.rs`.
Unsurprisingly, it looks a little bit like a language between Aiken and plutus.

In fact, Aiken has another intermediate representation: `AirTree`. This is
constructed between the `TypedExpr` and `Vec<Air>`, i.e. between parsed Aiken
and air.

#### Climbing the AirTree

Within `CodeGenerator::generate`, `CodeGenerator::build` is called on the
function body. This takes a `TypedExpr` and constructs and returns an `AirTree`.
The construction is recursive, as it traverses the recursive `TypedExpr` data
structure. More on what an airtree is and its construction below. At the same
time `self` is treated as `mut`, so we need to keep an eye on this too. The
method that is called and uses this mutability of self is `self.assignment`. It
does so by

```sample
- self.assignment
  └ self.expect_type_assign
    └ self.code_gen_functions.insert
```

and thus creates a hash map of all the functions that appear in the definition.
The span from call to return of `assignment` covers > 600 LoC, so we'll leave
this as a black box. (`self.handle_each_clause` is also called with `mut`; it in
turn calls `self.build`, for which `mut` is needed.)

Validators in Aiken are boolean functions, while in Uplc they are unit-valued
(aka void-valued) functions. Thus the air tree is wrapped such that `false`
results in an error (`wrap_validator_condition`). I don't know why there is a
prevailing thought that boolean functions are preferable to functions that error
if anything is wrong, which is what validators are.

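The boolean-to-unit wrapping can be pictured with a toy term language. This is
only a sketch of the idea: `Toy`, `wrap_condition` and `eval` are hypothetical
names, and the real `wrap_validator_condition` operates on an `AirTree`, not on
anything this small. The wrapper puts the boolean body in the condition
position of a conditional that yields unit on `true` and an error on `false`.

```rust
/// A toy term type; hypothetical, not the real `AirTree` or `Term<Name>`.
#[derive(Debug, Clone, PartialEq)]
pub enum Toy {
    Bool(bool),
    Unit,
    Error,
    /// Condition, then-branch, else-branch.
    If(Box<Toy>, Box<Toy>, Box<Toy>),
}

/// Wrap a boolean-valued body so that `true` becomes unit and `false` an error.
pub fn wrap_condition(body: Toy) -> Toy {
    Toy::If(Box::new(body), Box::new(Toy::Unit), Box::new(Toy::Error))
}

/// Evaluate a toy term; `Err` stands in for a failing script.
pub fn eval(term: &Toy) -> Result<Toy, String> {
    match term {
        Toy::Error => Err("script failed".to_string()),
        Toy::If(condition, then, otherwise) => match eval(condition)? {
            Toy::Bool(true) => eval(then),
            Toy::Bool(false) => eval(otherwise),
            other => Err(format!("expected a boolean, got {:?}", other)),
        },
        // Booleans and unit are already values.
        value => Ok(value.clone()),
    }
}
```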
`check_validator_args` again extends the airtree from the previous step, and
again calls `self.assignment`, mutating self. Something interesting is happening
here. Script context is the final argument of a validator, for any script
purpose. `check_validator_args` treats the script context as if it were an
unused argument. The importance of this is not immediately clear, and I've yet
to appreciate why this happens.

Let's take a look at what `AirTree` actually is:

```rust
pub enum AirTree {
    Statement {
        statement: AirStatement,
        hoisted_over: Option<Box<AirTree>>,
    },
    Expression(AirExpression),
    UnhoistedSequence(Vec<AirTree>),
}
```

Note that `AirStatement` and `AirExpression` are defined mutually recursively
with `AirTree`. Without knowing this, it would be unclear on first inspection
how tree-like this really is.

`AirExpression` has multiple constructors. These include (non-exhaustively):

- air primitives (including all the ones that appear in plutus)
- constructors `Call` and `Fn` to handle anonymous functions
- binary and unary operators
- handling when and if
- handling error and tracing

`AirStatement` also has multiple constructors. These include:

- let assignments and named function definitions
- handling expect assignments
- pattern matching
- unwrapping data structures

Note that `AirTree` has many methods that are partial functions: there are
states that are not considered legitimate at different points of its
construction and use. For example, `hoist_over` will throw an error if called on
an `Expression`. As `AirTree` is for internal use only, the scope for potential
problems is reasonably contained. It seems likely this design is to avoid having
similar-yet-different IRs between steps. However, the trade-off is that it
partially obfuscates what is a valid state where.

What is hoisting? Hoisting gives the airtree depth. The motivation is that by
the time we hit Uplc it is "generally better" that

- function definitions appear once rather than being inlined multiple times
- the definition appears as close to use as possible

Hoisting creates tree paths. The final airtree to airtree step,
`self.hoist_functions_to_validator`, traverses these paths. There is a lot of
mutating of self, making it quite hard to keep a handle on things. In all this
(several thousand?) LoC, it is essentially ascertaining in which node of the
tree to insert each function definition. In a resource-constrained environment
like plutus, this effort is warranted.

At the same time this function deals with

- monomorphisation: no more generics
- erasing opaque types

Neither of which exist at the Uplc level.

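One way to picture the placement problem from the hoisting discussion: if every
use of a function is recorded as a path of child indices from the root, then the
deepest node that still covers all uses is the longest common prefix of those
paths. The sketch below assumes exactly that representation. `common_ancestor`
is a name chosen for illustration, and I am not claiming this is how
`hoist_functions_to_validator` computes its answer.

```rust
/// Longest common prefix of the tree paths at which a function is used.
/// Each path lists the child indices followed from the root to a use site.
pub fn common_ancestor(paths: &[Vec<usize>]) -> Vec<usize> {
    let (first, rest) = match paths.split_first() {
        Some(split) => split,
        // No uses at all: fall back to the root.
        None => return Vec::new(),
    };
    let mut prefix = first.clone();
    for path in rest {
        // Count how many leading steps this path shares with the prefix so far.
        let shared = prefix
            .iter()
            .zip(path)
            .take_while(|(a, b)| a == b)
            .count();
        prefix.truncate(shared);
    }
    prefix
}
```

A definition inserted at the returned path appears once and sits as deep in the
tree as it can while remaining in scope for every use, which matches the two
"generally better" criteria listed earlier.
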
#### Into Air

The `to_vec : AirTree -> Vec<Air>` step is much easier to digest. For one, it is
not evaluated in the context of the code generator, and for another, there is no
mutation of the airtree. The function recursively takes nodes of the tree and
maps them to entries in a mutable vector. It flattens the tree to a vec.

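Flattening a tree into a vector looks roughly like the following. The types are
toys invented for this sketch (`Tree`, `Flat`, `push_nodes`), and emitting
parents before children is an assumption of the sketch; the real `AirTree` and
`Air` have far more constructors.

```rust
/// A toy tree; hypothetical, far simpler than the real `AirTree`.
pub enum Tree {
    Int(i64),
    Add(Box<Tree>, Box<Tree>),
    Negate(Box<Tree>),
}

/// The flat counterpart: each entry records only its own node kind.
#[derive(Debug, PartialEq)]
pub enum Flat {
    Int(i64),
    Add,
    Negate,
}

/// Flatten a tree into a vector, parents before children (pre-order).
pub fn to_vec(tree: &Tree) -> Vec<Flat> {
    let mut out = Vec::new();
    push_nodes(tree, &mut out);
    out
}

/// Recursive worker: the tree is only read, the vector is the only mutation.
fn push_nodes(tree: &Tree, out: &mut Vec<Flat>) {
    match tree {
        Tree::Int(n) => out.push(Flat::Int(*n)),
        Tree::Add(left, right) => {
            out.push(Flat::Add);
            push_nodes(left, out);
            push_nodes(right, out);
        }
        Tree::Negate(inner) => {
            out.push(Flat::Negate);
            push_nodes(inner, out);
        }
    }
}
```

Because each flat entry has a fixed number of children, the tree shape can be
recovered from the order of entries alone, which is what lets a later step
consume the vector without the tree.
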
### Down to Uplc

Next we go from `Vec<Air> -> Term<Name>`. This step is a little more involved
than the previous. For one, this is executed in the context of the code
generator. Moreover, the code generator is treated as mutable - ouch.

On further inspection we see that the only mutation is setting
`self.needs_field_access = true`. This flag informs the compiler that, if true,
additional terms must be added in one of the final steps (see
`CodeGenerator::finalize`).

As noted above, some of the mappings from air to terms are immediate, like
`Air::Bool -> Term::bool`. Others are less so. Some examples:

- `Air::Var` requires 100 LoC to do case handling on different constructors.
- Lists in air have no immediate analogue in uplc.
- Builtins, as in built-in functions (standard shorthand), have to be mediated
  with some combination of `force` and `delay` in order to behave as they
  should.
- User functions must be curried, i.e. treated as a sequence of single-argument
  functions, and recursion must be handled.
- Some magic is needed in order to efficiently allow "record updates".

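The single-argument treatment of user functions from the list above can be
sketched on a toy lambda-calculus term. `Term`, `curry` and `apply_all` are
hypothetical names for illustration only, not the `uplc` crate's API, and the
sketch ignores recursion entirely.

```rust
/// A toy lambda-calculus term; hypothetical, not the real `Term<Name>`.
#[derive(Debug, Clone, PartialEq)]
pub enum Term {
    Var(String),
    Lambda(String, Box<Term>),
    Apply(Box<Term>, Box<Term>),
}

/// Turn a multi-parameter function into nested single-parameter lambdas.
pub fn curry(params: &[&str], body: Term) -> Term {
    params.iter().rev().fold(body, |inner, param| {
        Term::Lambda((*param).to_string(), Box::new(inner))
    })
}

/// Apply a curried function to its arguments one at a time, left to right.
pub fn apply_all(function: Term, args: Vec<Term>) -> Term {
    args.into_iter().fold(function, |applied, arg| {
        Term::Apply(Box::new(applied), Box::new(arg))
    })
}
```

A two-parameter function becomes a lambda that returns a lambda, and a call
with two arguments becomes two nested applications.
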
#### Cranking the Optimizer

There is a sequence of operations performed on the Uplc, mapping
`Term<Name> -> Term<Name>`. These remove inconsequential parts of the generated
logic, including:

- removing applications of the identity function
- directly substituting where a lambda is applied to a constant or builtin
- inlining or simplifying where a lambda's parameter appears once or not at all
  in its body

Each of these optimizing methods has its own relatively narrow focus, and so
although there is a fair number of LoC, it's reasonably straightforward to
follow. Some are applied multiple times.

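To give a flavour of how narrow each pass is, here is a sketch of the first one
in the list above, removing applications of the identity function, on a toy
term type. `Term`, `is_identity` and `remove_identity_applications` are
hypothetical names for this post; the real passes work on `Term<Name>` and I
have not copied their logic.

```rust
/// A toy lambda-calculus term; hypothetical, not the real `Term<Name>`.
#[derive(Debug, Clone, PartialEq)]
pub enum Term {
    Var(String),
    Lambda(String, Box<Term>),
    Apply(Box<Term>, Box<Term>),
}

/// True when the term is literally `\x -> x` for some name `x`.
fn is_identity(term: &Term) -> bool {
    matches!(term, Term::Lambda(param, body) if **body == Term::Var(param.clone()))
}

/// Rewrite `(\x -> x) arg` to `arg` everywhere, working bottom-up.
pub fn remove_identity_applications(term: Term) -> Term {
    match term {
        Term::Apply(function, argument) => {
            let function = remove_identity_applications(*function);
            let argument = remove_identity_applications(*argument);
            if is_identity(&function) {
                argument
            } else {
                Term::Apply(Box::new(function), Box::new(argument))
            }
        }
        Term::Lambda(param, body) => {
            Term::Lambda(param, Box::new(remove_identity_applications(*body)))
        }
        variable => variable,
    }
}
```

Working bottom-up means an identity applied to an identity application
collapses in a single traversal. A pass like this leaves everything else
untouched, which is why chaining several of them stays easy to follow.
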
### The End

The generated program can now be serialized and included in the blueprint.

### Plutus Core Signposting

All this fuss is to get us to a point where we can write Uplc, and good Uplc at
that. Note that there are many ways to generate code and most of them are bad.
The various design decisions and compilation steps make more sense when we have
a better understanding of the target language.

Uplc is a lambda calculus. For a comprehensive definition of Uplc, check out the
specification found
[here](https://github.com/input-output-hk/plutus/#specifications-and-design)
from the plutus GitHub repo. (I imagine this link will be maintained longer than
the current actual link.) If you're not at all familiar with lambda calculus I
recommend [an unpacking](https://crypto.stanford.edu/~blynn/lambda/) by Ben
Lynn.

### What next?

I think it would be helpful to have some examples... Watch this space.