title	date
Tracing Aiken Build	2023-09-02

Aims:

Describe the pipeline and components getting from Aiken to Uplc.

The Preface

Motivations

The motivation for writing this came from a desire to add features to Aiken that are not yet available. One such feature would make an arbitrary Aiken function callable, and so evaluable, from JavaScript. This would help a lot with testing, and with keeping on-chain and off-chain code aligned.

Another, more pipe-dreamy, feature is ad-hoc function extraction: from a span of code, generate a function. A digression to answer why this would be at all helpful: validator logic often needs a broad context throughout. How then is it best to factor code? Possible solutions:

Introduce types / structs
Have functions with lots of arguments
Don't

The respective problems are:

Requires relentless constructing and deconstructing across the function call. This adds costs.
Becomes tedious aligning the definition and function call.
Ends up with very long validators which are hard to unit test.

My current preferred way is to accept that validator functions are long. Ad-hoc function extraction would allow for sections of code to be tested without needing to be factored out.

To do either of these, we need to get to grips with the Aiken compilation pipeline.

This won't age well

Aiken is undergoing active development. This post started life with Aiken ~v1.14. Aiken v1.15 introduced reasonably significant changes to the compilation pipeline. The word is that there aren't any more big changes in the near future, but this article will undoubtedly begin to diverge from the current code-base even before publishing.

Limitations of narrating code

Narrating code becomes a compromise between being honest and accurate, and being readable and digestible. The command aiken build covers well in excess of 10,000 LoC. The writing of this post ground to a halt as it reached deeper into the code-base. To redeem it, some (possibly large) sections remain black boxes.

Aiken build

Tracing aiken build, the pipeline is roughly:

  .               -> Project::read_source_files ->
  Vec<Source>     -> Project::parse_sources ->
  ParsedModules   -> Project::type_check ->
  CheckedModules  -> CodeGenerator::build ->
  AirTree         -> AirTree::to_vec ->
  Vec<Air>        -> CodeGenerator::uplc_code_gen ->
  Program / Term<Name> -> serialize ->
  .

We'll pick our way through these steps.

At a high level we are trying to do something straightforward: reformulate Aiken code as Uplc. Some Aiken expressions are relatively easy to handle; for example, an Aiken Int goes to an integer in Uplc. Others require more involved handling; for example, an Aiken if ... else if ... else must have its branches "nested" in Uplc. Aiken has lots of nice-to-haves like pattern matching, modules, and generics; Uplc has none of these.
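To make the "nesting" of branches concrete, here is a minimal sketch in Rust. The `Term` type and `nest_branches` function are hypothetical toys invented for this illustration, not Aiken's actual types; they only assume a target that has a single two-way conditional.

```rust
/// Toy target term: a hypothetical stand-in for a Uplc-like term,
/// not the real `Term` type used by Aiken.
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Int(i64),
    Bool(bool),
    /// A two-way conditional: the only branching form this toy target has.
    IfThenElse(Box<Term>, Box<Term>, Box<Term>),
}

/// Lower `if c1 { b1 } else if c2 { b2 } ... else { final_else }` into
/// nested two-way conditionals, folding from the last branch backwards.
fn nest_branches(branches: Vec<(Term, Term)>, final_else: Term) -> Term {
    branches
        .into_iter()
        .rev()
        .fold(final_else, |otherwise, (cond, then)| {
            Term::IfThenElse(Box::new(cond), Box::new(then), Box::new(otherwise))
        })
}
```

With two branches, the second conditional ends up inside the "else" slot of the first, which is the nesting referred to above.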

The Preamble

Cli handling

The cli enters at aiken/src/cmd/mod.rs which parses the command. With some establishing of context, the program enters Project::build (crates/aiken-project/src/lib.rs), which in turn calls Project::compile.

File crawl

The program looks for Aiken files in both the ./lib and ./validators sub-directories. For each, it walks over all contents (recursively) looking for files with the .ak extension. It treats these two sets of files a little differently. For example, only validator files can contain the special validator functions.
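The crawl itself is nothing exotic. A minimal sketch of a recursive walk using only the Rust standard library might look like the following; `find_ak_files` is a hypothetical helper written for this post, not the function Aiken uses.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Recursively collect every file under `dir` whose extension is `ak`.
/// A missing directory yields an empty list rather than an error.
/// (Hypothetical helper for illustration only.)
fn find_ak_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    if !dir.is_dir() {
        return Ok(found);
    }
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            // Descend into sub-directories.
            found.extend(find_ak_files(&path)?);
        } else if path.extension().and_then(|ext| ext.to_str()) == Some("ak") {
            found.push(path);
        }
    }
    // Sort so the result does not depend on directory iteration order.
    found.sort();
    Ok(found)
}
```

Calling this once per sub-directory keeps the two sets of files separate, so each set can then be treated differently.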

Parse and Type check

Project::parse_sources parses the module source code. The heavy lifting is done by aiken_lang::parser::module, which is evaluated on each file. It produces a Module containing a list of parsed definitions of the file: functions, types etc, together with metadata like docstrings and the file path.

Project::type_check inspects the parsed modules and, as the name implies, checks the types. It flags type level warnings and errors and constructs a hash map of CheckedModules.

Code generator

The code generator CodeGenerator (aiken-lang/src/gen_uplc.rs) is given the definitions found from the previous step, together with the plutus builtins. It has additional fields for things like debugging.

This is handed over to a Blueprint (aiken-project/src/blueprint/mod.rs). The blueprint does little more than find the validators on which to run the code gen. The heavy lifting is done by CodeGenerator::generate.

We are now ready to take the source code and create plutus.

In the air

Things become a bit intimidating at this point in terms of sheer lines of code: gen_uplc.rs and the three modules in gen_uplc/ total more than 8,500 LoC.

Aiken has its own intermediate representation called air (as in Aiken Intermediate Representation). Intermediate representations are common in compiled languages. Air is defined in aiken-lang/src/gen_uplc/air.rs. Unsurprisingly, it looks a little bit like a language between Aiken and plutus.

In fact, Aiken has another intermediate representation: AirTree. This is constructed between the TypedExpr and Vec<Air>, i.e. between type-checked Aiken and air.

Climbing the AirTree

Within CodeGenerator::generate, CodeGenerator::build is called on the function body. This takes a TypedExpr and constructs and returns an AirTree. The construction is recursive as it traverses the recursive TypedExpr data structure. More on what an airtree is and its construction below. At the same time self is treated as mut, so we need to keep an eye on this too. The method that relies on this mutability of self is self.assignment. It does so via the call chain:

- self.assignment
  └ self.expect_type_assign
    └ self.code_gen_functions.insert

and thus creates a hashmap of all the functions that appear in the definition. From call to return, self.assignment covers more than 600 LoC, so we'll leave it as a black box. (self.handle_each_clause also takes self as mut, because it in turn calls self.build, which needs it.)

Validators in Aiken are boolean functions while in Uplc they are unit-valued (aka void-valued) functions. Thus the air tree is wrapped such that false results in an error (wrap_validator_condition). I don't know why there is a prevailing thought that boolean functions are preferable to functions that error if anything is wrong - which is what validators are.
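The wrapping step can be pictured with a toy model. The sketch below is an assumption-laden illustration, not the body of wrap_validator_condition: `Term`, `wrap_condition` and `eval` are hypothetical names, and the toy evaluator only exists to show the effect.

```rust
/// Toy terms: a hypothetical stand-in for Uplc, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Bool(bool),
    Unit,
    Error,
    IfThenElse(Box<Term>, Box<Term>, Box<Term>),
}

/// Wrap a boolean-valued body so that `true` becomes unit and
/// `false` becomes an error.
fn wrap_condition(body: Term) -> Term {
    Term::IfThenElse(Box::new(body), Box::new(Term::Unit), Box::new(Term::Error))
}

/// A tiny evaluator for the toy terms; `Err` models script failure.
fn eval(term: &Term) -> Result<Term, String> {
    match term {
        Term::Error => Err("validator failed".to_string()),
        Term::IfThenElse(cond, then, otherwise) => match eval(cond)? {
            Term::Bool(true) => eval(then),
            Term::Bool(false) => eval(otherwise),
            other => Err(format!("condition is not a boolean: {:?}", other)),
        },
        value => Ok(value.clone()),
    }
}
```

In this model a validator body that evaluates to true yields unit, and one that evaluates to false fails outright, which is the behaviour the target expects.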

check_validator_args again extends the airtree from the previous step, and again calls self.assignment, mutating self. Something interesting is happening here. Script context is the final argument of a validator - for any script purpose. check_validator_args treats the script context as if it were an unused argument. The importance of this is not immediately obvious, and I have yet to appreciate why this happens.

Let's take a look at what AirTree actually is:

pub enum AirTree {
    Statement {
        statement: AirStatement,
        hoisted_over: Option<Box<AirTree>>,
    },
    Expression(AirExpression),
    UnhoistedSequence(Vec<AirTree>),
}

Note that AirStatement and AirExpression are mutually recursive definitions with AirTree. Otherwise, it would be unclear from first inspection how tree-like this really is.

AirExpression has multiple constructors. These include (non-exhaustive)

air primitives (including all the ones that appear in plutus)
constructors Call and Fn to handle anonymous functions
binary and unary operators
handling when and if
handling error and tracing

AirStatement also has multiple constructors. These include

let assignments and named function definitions
handling expect assignments
pattern matching
unwrapping data structures

Note that AirTree has many methods that are partial functions, as in there are possible states that are not considered legitimate at different points of its construction and use. For example hoist_over will throw an error if called on an Expression. As AirTree is for internal use only, the scope for potential problems is reasonably contained. It seems likely this is to avoid similar-yet-different IRs between steps. However, the trade off is that it partially obfuscates what is a valid state where.
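To illustrate the trade off, here is a cut-down sketch. `ToyTree` is a hypothetical simplification of AirTree (strings stand in for statements and expressions), and `try_hoist_over` is an invented total alternative shown only for contrast; neither is taken from the Aiken code-base.

```rust
/// A hypothetical, cut-down stand-in for `AirTree`.
#[derive(Debug, Clone, PartialEq)]
enum ToyTree {
    Statement {
        statement: String,
        hoisted_over: Option<Box<ToyTree>>,
    },
    Expression(String),
}

impl ToyTree {
    /// Partial: only valid on a statement that has nothing hoisted yet.
    /// Any other state is treated as a programming error and panics.
    fn hoist_over(self, next: ToyTree) -> ToyTree {
        match self {
            ToyTree::Statement { statement, hoisted_over: None } => ToyTree::Statement {
                statement,
                hoisted_over: Some(Box::new(next)),
            },
            ToyTree::Statement { .. } => panic!("statement is already hoisted over a tree"),
            ToyTree::Expression(_) => panic!("cannot hoist an expression over a tree"),
        }
    }

    /// Total alternative: the invalid states are surfaced as `Err`.
    fn try_hoist_over(self, next: ToyTree) -> Result<ToyTree, String> {
        if matches!(&self, ToyTree::Statement { hoisted_over: None, .. }) {
            Ok(self.hoist_over(next))
        } else {
            Err(format!("cannot hoist {:?} over another tree", self))
        }
    }
}
```

The partial version keeps call sites short at the cost of hiding which states are valid; the total version makes the invalid states visible in the type, but every caller then has to handle them.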

What is hoisting? Hoisting gives the airtree depth. The motivation is that by the time we hit Uplc it is "generally better" that

function definitions appear once rather than being inlined multiple times
the definition appears as close to use as possible

Hoisting creates tree paths. The final airtree to airtree step, self.hoist_functions_to_validator, traverses these paths. There is a lot of mutating of self, making it quite hard to keep a handle on things. In all this (several thousand?) LoC, it is essentially ascertaining in which node of the tree to insert each function definition. In a resource constrained environment like plutus, this effort is warranted.

At the same time this function deals with

monomorphisation - no more generics
erasing opaque types

Neither generics nor opaque types exist at the Uplc level.

Into Air

The to_vec : AirTree -> Vec<Air> is much easier to digest. For one, it is not evaluated in the context of the code generator, and two, there is no mutation of the airtree. The function recursively takes nodes of the tree and maps them to entries in a mutable vector. It flattens the tree to a vec.
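A flattening of this shape can be sketched in a few lines. The `Tree` and `Flat` types below are hypothetical toys standing in for AirTree and Air, and the prefix ordering (node first, then children) is an assumption of this sketch.

```rust
/// Hypothetical toy tree, standing in for `AirTree`.
#[derive(Debug, Clone, PartialEq)]
enum Tree {
    Let { name: String, value: Box<Tree>, then: Box<Tree> },
    Add(Box<Tree>, Box<Tree>),
    Int(i64),
    Var(String),
}

/// Hypothetical flat instruction, standing in for `Air`.
#[derive(Debug, Clone, PartialEq)]
enum Flat {
    Let { name: String },
    Add,
    Int(i64),
    Var(String),
}

impl Tree {
    /// Flatten the tree into a vector without mutating the tree itself.
    fn to_vec(&self) -> Vec<Flat> {
        let mut out = Vec::new();
        self.push_into(&mut out);
        out
    }

    /// Push this node, then its children, onto the shared mutable vector.
    fn push_into(&self, out: &mut Vec<Flat>) {
        match self {
            Tree::Let { name, value, then } => {
                out.push(Flat::Let { name: name.clone() });
                value.push_into(out);
                then.push_into(out);
            }
            Tree::Add(left, right) => {
                out.push(Flat::Add);
                left.push_into(out);
                right.push_into(out);
            }
            Tree::Int(n) => out.push(Flat::Int(*n)),
            Tree::Var(name) => out.push(Flat::Var(name.clone())),
        }
    }
}
```

Because each flat entry has a known number of children, the tree structure can be recovered later by consuming the vector in order.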

Down to Uplc

Next we go from Vec<Air> -> Term<Name>. This step is a little more involved than the previous. For one, this is executed in the context of the code generator. Moreover, the code generator is treated as mutable - ouch.

On further inspection we see that the only mutation is setting self.needs_field_access = true. This flag informs the compiler that, if true, additional terms must be added in one of the final steps (see CodeGenerator::finalize).

As noted above, some of the mappings from air to terms are immediate like Air::Bool -> Term::bool.
Others are less so. Some examples:

Air::Var requires around 100 LoC of case handling on different constructors.
Lists in air have no immediate analogue in Uplc.
Builtins, as in built-in functions (standard shorthand), have to be mediated with some combination of force and delay in order to behave as they should.
User functions must be "curried", i.e. treated as a sequence of single-argument functions, and recursion must be handled.
"Record updates" need some magic in order to be handled efficiently.
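Treating a multi-argument function as a sequence of single-argument functions (usually called currying) can be sketched as follows. The `Term` type and both helpers are hypothetical toys, not Aiken's code; recursion, force and delay are deliberately left out.

```rust
/// Toy lambda-calculus terms: a hypothetical stand-in for `Term<Name>`.
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Var(String),
    Lambda { param: String, body: Box<Term> },
    Apply { function: Box<Term>, argument: Box<Term> },
}

/// Turn `fn(a, b, c) { body }` into `\a -> \b -> \c -> body`.
fn curry_lambda(params: &[&str], body: Term) -> Term {
    params.iter().rev().fold(body, |inner, param| Term::Lambda {
        param: param.to_string(),
        body: Box::new(inner),
    })
}

/// Turn the call `f(x, y, z)` into `((f x) y) z`.
fn apply_all(function: Term, args: Vec<Term>) -> Term {
    args.into_iter().fold(function, |f, arg| Term::Apply {
        function: Box::new(f),
        argument: Box::new(arg),
    })
}
```

Definitions and call sites have to agree: the outermost lambda binds the first parameter, so the innermost application supplies the first argument.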

Cranking the Optimizer

There is a sequence of operations performed on the Uplc, mapping Term<Name> -> Term<Name>. This removes inconsequential parts of the logic which have been generated, including:

removing application of the identity function
directly substituting where apply lambda is applied to a constant or builtin
inline or simplify where apply lambda is applied to a parameter that appears once or not at all

Each of these optimizing methods has its own relatively narrow focus, and so although there is a fair number of LoC, it's reasonably straightforward to follow. Some are applied multiple times.
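As a flavour of what one such narrow pass looks like, here is a sketch of removing applications of the identity function. It works on a hypothetical toy `Term`, not Aiken's `Term<Name>`, and the function names are made up for this example.

```rust
/// Toy lambda-calculus terms: a hypothetical stand-in for `Term<Name>`.
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Var(String),
    Int(i64),
    Lambda { param: String, body: Box<Term> },
    Apply { function: Box<Term>, argument: Box<Term> },
}

/// True for terms of the exact shape `\x -> x`.
fn is_identity(term: &Term) -> bool {
    match term {
        Term::Lambda { param, body } => matches!(&**body, Term::Var(name) if name == param),
        _ => false,
    }
}

/// Rewrite `(\x -> x) arg` to `arg` everywhere in the term, bottom-up.
fn remove_identity_application(term: Term) -> Term {
    match term {
        Term::Apply { function, argument } => {
            let function = remove_identity_application(*function);
            let argument = remove_identity_application(*argument);
            if is_identity(&function) {
                argument
            } else {
                Term::Apply {
                    function: Box::new(function),
                    argument: Box::new(argument),
                }
            }
        }
        Term::Lambda { param, body } => Term::Lambda {
            param,
            body: Box::new(remove_identity_application(*body)),
        },
        other => other,
    }
}
```

The pass is Term to Term, so it composes freely with other passes and can be re-run after another pass exposes new opportunities.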

The End

The generated program can now be serialized and included in the blueprint.

Plutus Core Signposting

All this fuss is to get us to a point where we can write Uplc - and good Uplc at that. Note that there are many ways to generate code and most of them are bad. The various design decisions and compilation steps make more sense when we have a better understanding of the target language.

Uplc is a lambda calculus. For a comprehensive definition of Uplc, check out the specification found here, from the plutus GitHub repo. (I imagine this link will be maintained longer than the current actual link.) If you're not at all familiar with lambda calculus, I recommend an unpacking by Ben Lynn.

What next?

I think it would be helpful to have some examples... Watch this space.
