edits to tracing aiken

2023-09-01 13:30:49 +00:00 · 2023-09-01 13:30:49 +00:00 · 38e68b5316
parent f3b88b8446
commit 38e68b5316
2 changed files with 275 additions and 228 deletions
--- a/content/drafts/tracing-aiken-build.md
+++ b/content/drafts/tracing-aiken-build.md
@ -0,0 +1,275 @@
 Aims: 
 > Describe the pipeline and components getting from aiken to uplc. 
 ## The Preface
 ### Motivations
 The motivation for writing this came from a desire to add features to aiken that are not yet available.
 One such feature would make an arbitrary aiken function callable, and so evaluable, from javascript.
 This would help a lot with testing, particularly when trying to align on-chain and off-chain code.
 Another, more of a pipe dream, is ad hoc function extraction: from a span of code, generate a function.
 A digression to answer _why would this be at all helpful?!_
 Validator logic often needs a broad context throughout.
 How then to best factor code?
 Possible solutions: 
 1. Introduce types / structs 
 2. Have functions with lots of arguments
 3. Don't
 The problems are:
 1. Requires relentless constructing and deconstructing across the function call.
 And this adds costs in aiken.
 2. Becomes tedious aligning the definition and function call.  
 3. End up with very long validators which are hard to unit test. 
 My current preferred way is to accept that validator functions are long.
 Ad hoc function extraction would allow sections of code to be tested without needing to be factored out.
 To do either of these, we need to get to grips with the aiken compilation pipeline.
 ### This won't age well 
 Aiken is undergoing active development. 
 This post started life with Aiken ~v1.14.
 With Aiken v1.15, there were already reasonably significant changes to the compilation pipeline. 
 The word is that no changes as big are expected in the near future,
 but this article will undoubtedly begin to diverge from the current codebase even before publishing.
 ### Limitations of narrating code
 Narrating code becomes a compromise between being honest and accurate, and being readable and digestible.
 Following the command `aiken build` covers well in excess of 10,000 LoC.
 The writing of this post ground slowly to a halt as it progressed deeper into the code 
 with the details seeming to increase in importance. 
 At some point I had to draw a line and resign myself to the fact that some parts will remain black boxes for now.
 ## Aiken build
 Tracing `aiken build`, the pipeline is roughly: 
 ```
  .               -> Project::read_source_files -> 
  Vec<Source>     -> Project::parse_sources ->
  ParsedModules   -> Project::type_check ->
  CheckedModules  -> CodeGenerator::build ->  
  AirTree         -> AirTree::to_vec -> 
  Vec<Air>        -> CodeGenerator::uplc_code_gen -> 
  Program / Term<Name> -> serialize -> 
  .
 ```
 We'll pick our way through these steps.
 At a high level we are trying to do something straightforward: reformulate aiken code as uplc.
 Some aiken expressions are relatively easy to handle, for example an aiken `Int` goes to an `Int` in uplc.
 Some aiken expressions require more involved handling, for example an aiken `If... If Else... Else `
 must have the branches "nested" in uplc.
 Aiken also has lots of nice-to-haves like pattern matching, modules, and generics.
 Uplc has none of these.
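To make the "nesting" of branches concrete, here is a minimal sketch in rust. The `Expr` and `Term` types and the `lower` function are toys invented for this illustration; they are not Aiken's real AST or uplc terms. The assumption is only that the target has a two-way conditional, so a flat chain of branches has to be folded into nested conditionals.

```rust
// Toy types for illustration only; not Aiken's real AST or uplc terms.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Int(i64),
    Bool(bool),
    // A flat chain: `if c1 { b1 } else if c2 { b2 } ... else { final_else }`.
    If {
        branches: Vec<(Expr, Expr)>,
        final_else: Box<Expr>,
    },
}

#[derive(Debug, Clone, PartialEq)]
enum Term {
    Int(i64),
    Bool(bool),
    // The only conditional the toy target has: condition, then, else.
    IfThenElse(Box<Term>, Box<Term>, Box<Term>),
}

fn lower(expr: &Expr) -> Term {
    match expr {
        Expr::Int(n) => Term::Int(*n),
        Expr::Bool(b) => Term::Bool(*b),
        Expr::If { branches, final_else } => {
            // Fold from the last branch backwards, so that each earlier
            // branch wraps the rest of the chain in its else position.
            branches
                .iter()
                .rev()
                .fold(lower(final_else), |rest, (cond, body)| {
                    Term::IfThenElse(
                        Box::new(lower(cond)),
                        Box::new(lower(body)),
                        Box::new(rest),
                    )
                })
        }
    }
}
```

A chain with two branches and a final else comes out as one conditional whose else position holds a second conditional.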
 ### The Preamble 
 #### Cli handling
 The cli enters at `aiken/src/cmd/mod.rs` which parses the command. 
 With some establishing of context, the program enters `Project::build` (`crates/aiken-project/src/lib.rs`),
 which in turn calls `Project::compile`. 
 #### File crawl
 The program looks for aiken files in both `./lib` and `./validators` subdirs.
 For each it walks over all contents (recursively) looking for `.ak` extensions. 
 It treats these two sets of files a little differently. 
 For example, only validator files can contain the special validator functions.
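The crawl itself is simple enough to sketch with the standard library alone. This is not the code Aiken uses: `collect_ak_files` and `crawl_project` are hypothetical names, and the directory names `lib` and `validators` are hard-coded here as an assumption. It only shows the shape of the step: walk each root recursively, keep the `.ak` files, and keep the two sets apart.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Recursively collect every file under `dir` with an `.ak` extension.
// Hypothetical helper, not the function Aiken itself uses.
fn collect_ak_files(dir: &Path, found: &mut Vec<PathBuf>) -> io::Result<()> {
    if !dir.is_dir() {
        // A project may lack one of the two directories.
        return Ok(());
    }
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            collect_ak_files(&path, found)?;
        } else if path.extension().map_or(false, |ext| ext == "ak") {
            found.push(path);
        }
    }
    Ok(())
}

// Crawl the two roots separately, since they are treated differently later.
// Returns (library files, validator files), each sorted for determinism.
fn crawl_project(root: &Path) -> io::Result<(Vec<PathBuf>, Vec<PathBuf>)> {
    let mut lib = Vec::new();
    let mut validators = Vec::new();
    collect_ak_files(&root.join("lib"), &mut lib)?;
    collect_ak_files(&root.join("validators"), &mut validators)?;
    lib.sort();
    validators.sort();
    Ok((lib, validators))
}
```

Calling `crawl_project(Path::new("."))` from a project root would return the two lists of paths.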
 #### Parse and Type check
 `Project::parse_sources` parses the module source code.
 The heavy lifting is done by `aiken_lang::parser::module`, which is evaluated on each file. 
 It produces a `Module` containing a list of parsed definitions of the file: functions, types _etc_,
 together with metadata like docstrings and the file path. 
 `Project::type_check` inspects the parsed modules and, as the name implies, checks the types. 
 It flags type level warnings and errors and constructs a hash map of `CheckedModule`s.
 #### Code generator
 The code generator `CodeGenerator` (`aiken-lang/src/gen_uplc.rs`) is given 
 the definitions found from the previous step, 
 together with the plutus builtins. 
 It has additional fields for things like debugging. 
 This is handed over to a `Blueprint` (`aiken-project/src/blueprint/mod.rs`).
 The blueprint does little more than find the validators on which to run the code gen. 
 The heavy lifting is done by `CodeGenerator::generate`.
 We are now ready to take the source code and create plutus. 
 ### In the air
 Things become a bit intimidating at this point in terms of sheer lines of code:
 `gen_uplc.rs` and three modules in `gen_uplc/` total > 8500 LoC.
 Aiken has its own _intermediate representation_ called `air` (as in Aiken Intermediate Representation). 
 These are common in compiled languages.
 `Air` is defined in `aiken-lang/src/gen_uplc/air.rs`. 
 Unsurprisingly, it looks a little bit like a language between aiken and plutus.
 In fact, Aiken has another intermediate representation: `AirTree`. 
 This is constructed between the `TypedExpr` and `Vec<Air>` ie between parsed aiken and air. 
 #### Climbing the AirTree 
 Within `CodeGenerator::generate`, `CodeGenerator::build` is called on the function body. 
 This takes a `TypedExpr` and constructs and returns an `AirTree`.
 The construction is recursive as it traverses the recursive `TypedExpr` data structure.
 More on what an airtree is and its construction below.
 At the same time `self` is treated as `mut`, so we need to keep an eye on this too.
 The method which is called and uses this mutability of self is `self.assignment`. 
 It does so by
 ```sample 
  self.assignment >> self.expect_type_assign >> self.code_gen_functions.insert
 ```
 and thus is creating a hashmap of all the functions that appear in the definition.
 From call to return, `assignment` covers > 600 LoC, so we'll otherwise leave it as a black box.
 (`self.handle_each_clause` is also called with `mut`, which in turn calls `self.build`, for which `mut` is needed.)
 Validators in aiken are boolean functions while in uplc they are unit-valued (aka void-valued) functions.
 Thus the airtree is wrapped such that `false` results in an error (`wrap_validator_condition`). 
 I don't know why there is a prevailing thought that boolean functions are preferable to functions
 that error if anything is wrong - which is what validators are.
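The wrapping idea can be sketched by analogy in plain rust. Nothing here is Aiken's code: `Context`, `is_signed` and the rust `wrap_validator_condition` below are hypothetical stand-ins, and `Result<(), String>` plays the part of "return unit or error".

```rust
// Hypothetical stand-in for whatever a validator inspects.
struct Context {
    signed: bool,
}

// The style written in aiken source: a boolean-valued validator.
fn is_signed(ctx: &Context) -> bool {
    ctx.signed
}

// Wrap a boolean validator so that `true` becomes unit and `false` becomes
// an error. This mirrors the idea of the wrapping step, not its implementation.
fn wrap_validator_condition<F>(validator: F) -> impl Fn(&Context) -> Result<(), String>
where
    F: Fn(&Context) -> bool,
{
    move |ctx: &Context| {
        if validator(ctx) {
            Ok(())
        } else {
            Err("validator returned false".to_string())
        }
    }
}
```

With `let wrapped = wrap_validator_condition(is_signed);`, calling `wrapped(&Context { signed: false })` yields an `Err`, while a signed context yields `Ok(())`.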
 `check_validator_args` again extends the airtree from the previous step, 
 and again calls `self.assignment` mutating self.
 Something interesting is happening here. 
 Script context is the final argument of a validator - for any script purpose.
 `check_validator_args` treats the script context like it is an unused argument. 
 The importance of this is not immediate, and I've still yet to appreciate why this happens.
 Let's take a look at what AirTree actually is:
 ```rust
 pub enum AirTree {
    Statement {
        statement: AirStatement,
        hoisted_over: Option<Box<AirTree>>,
    },
    Expression(AirExpression),
    UnhoistedSequence(Vec<AirTree>),
 }
 ```
 Note that `AirStatement` and `AirExpression` are mutually recursive definitions with `AirTree`.
 Otherwise, it would be unclear from first inspection how tree-like this really is. 
 `AirExpression` has multiple constructors. These include (non-exhaustive)
 - air primitives (including all the ones that appear in plutus)
 - constructors `Call` and `Fn` to handle anonymous functions
 - binary and unary operators
 - handling when and if
 - handling error and tracing
 `AirStatement` also has multiple constructors. These include 
 - let assignments and named function definitions
 - handling expect assignments 
 - pattern matching 
 - unwrapping datastructures
 Note that `AirTree` has many methods that are partial functions, 
 as in there are possible states that are not considered legitimate 
 at different points of its construction and use.
 For example `hoist_over` will throw an error if called on an `Expression`.
 As `AirTree` is for internal use only, the scope for potential problems is reasonably contained.
 It seems likely this is to avoid similar-yet-different IRs between steps.
 However, the trade off is that it partially obfuscates what is a valid state where.
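A toy version shows what a partial method on such a tree looks like, next to a total alternative. `ToyTree` is invented for this sketch, with a `String` standing in for the real statement and expression types. Rejecting an already hoisted statement is my own choice for the toy, not a claim about what Aiken does.

```rust
// Toy stand-in for AirTree: strings replace AirStatement and AirExpression.
#[derive(Debug, Clone, PartialEq)]
enum ToyTree {
    Statement {
        statement: String,
        hoisted_over: Option<Box<ToyTree>>,
    },
    Expression(String),
}

impl ToyTree {
    // Total version: states that make no sense are reported as an error.
    fn try_hoist_over(self, next: ToyTree) -> Result<ToyTree, String> {
        match self {
            ToyTree::Statement { statement, hoisted_over: None } => Ok(ToyTree::Statement {
                statement,
                hoisted_over: Some(Box::new(next)),
            }),
            other => Err(format!("cannot hoist {:?} over another tree", other)),
        }
    }

    // Partial version: same logic, but an invalid state panics.
    // The caller is trusted to only use it where it is valid.
    fn hoist_over(self, next: ToyTree) -> ToyTree {
        self.try_hoist_over(next)
            .unwrap_or_else(|message| panic!("{}", message))
    }
}
```

The partial version keeps call sites short, at the price that the type no longer says where it may be called.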
 What is hoisting? Hoisting gives the airtree depth.
 The motivation is that by the time we hit uplc it is "generally better" that
 - function definitions appear once rather than being inlined multiple times
 - the definition appears as close to use as possible
 Hoisting creates tree paths. 
 The final airtree-to-airtree step, `self.hoist_functions_to_validator`, traverses the paths.
 There is a lot of mutating of self, making it quite hard to keep a handle on things. 
 In all these (several thousand?) LoC, it is essentially ascertaining in which node of the tree
 to insert each function definition.
 In a resource constrained environment like plutus, this effort is warranted.
 At the same time this function deals with 
 - monomorphisation - no more generics
 - erasing opaque types
 Neither of which exist at the uplc level. 
 #### Into Air
 The `to_vec : AirTree -> Vec<Air>` is much easier to digest. 
 For one, it is not evaluated in the context of the CodeGenerator,
 and two, there is no mutation of the airtree. 
 The function recursively takes nodes of the tree and maps them to entries in a mutable vector.
 It flattens the tree to a vec.
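A toy flattening makes the shape of this step clearer. `Tree` and `Flat` are invented for this sketch and are far smaller than `AirTree` and `Air`. The pre-order layout (a node first, then its children) is an assumption of the sketch, not something I've confirmed against the real `to_vec`.

```rust
// Toy tree, standing in for AirTree.
#[derive(Debug, Clone, PartialEq)]
enum Tree {
    Let { name: String, value: Box<Tree>, then: Box<Tree> },
    Add(Box<Tree>, Box<Tree>),
    Int(i64),
    Var(String),
}

// Toy flat instruction, standing in for Air: no node holds its children.
#[derive(Debug, Clone, PartialEq)]
enum Flat {
    Let { name: String },
    Add,
    Int(i64),
    Var(String),
}

impl Tree {
    // Flatten the tree into a vector without mutating the tree itself.
    fn to_vec(&self) -> Vec<Flat> {
        let mut out = Vec::new();
        self.push_into(&mut out);
        out
    }

    // Push this node, then recurse into its children in order.
    fn push_into(&self, out: &mut Vec<Flat>) {
        match self {
            Tree::Let { name, value, then } => {
                out.push(Flat::Let { name: name.clone() });
                value.push_into(out);
                then.push_into(out);
            }
            Tree::Add(left, right) => {
                out.push(Flat::Add);
                left.push_into(out);
                right.push_into(out);
            }
            Tree::Int(n) => out.push(Flat::Int(*n)),
            Tree::Var(name) => out.push(Flat::Var(name.clone())),
        }
    }
}
```

Because each flat entry has a known number of children, a later step can rebuild the structure by reading the vector in order.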
 ### Down to uplc 
 Next we go from `Vec<Air> -> Term<Name>`.
 This step is a little more involved than the previous. 
 For one, this is executed in the context of the code generator. 
 Moreover, the code generator is treated as mutable - ouch.
 On further inspection we see that the only mutation is setting `self.needs_field_access = true`.
 This flag informs the compiler that, if true, additional terms must be added in one of the final steps
 (see `CodeGenerator::finalize`).
 As noted above, some of the mappings from air to terms are immediate like `Air::Bool -> Term::bool`.  
 Others are less so.
 Some examples:
 - `Air::Var` requires 100 LoC to do case handling on different constructors.
 - Lists in air have no immediate analogue in uplc.
 - Builtins, as in built-in functions (standard shorthand), have to be mediated
 with some combination of `force` and `delay` in order to behave as they should.
 - User functions must be "curried", ie treated as a sequence of single argument functions,
 and recursion must be handled.
 - Some magic is needed in order to efficiently allow "record updates".
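Of the examples above, turning a multi-argument function into a sequence of single argument functions is the easiest to sketch. The `Term` type and the two helpers below are toys invented for this illustration, not the real `Term<Name>`. The sketch ignores recursion, builtins, `force` and `delay` entirely.

```rust
// Toy lambda-calculus term; the real uplc term has more constructors.
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Var(String),
    Lambda { param: String, body: Box<Term> },
    Apply { function: Box<Term>, argument: Box<Term> },
}

// `fn(a, b, c) { body }` becomes `\a -> \b -> \c -> body`.
fn nest_lambdas(params: &[&str], body: Term) -> Term {
    // Wrap from the last parameter outwards so the first ends up outermost.
    params.iter().rev().fold(body, |inner, param| Term::Lambda {
        param: param.to_string(),
        body: Box::new(inner),
    })
}

// `f(x, y, z)` becomes `((f x) y) z`.
fn nest_applications(function: Term, args: Vec<Term>) -> Term {
    args.into_iter().fold(function, |applied, arg| Term::Apply {
        function: Box::new(applied),
        argument: Box::new(arg),
    })
}
```

The definition and the call site have to agree on argument order, which is why both helpers walk the arguments left to right.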
 #### Cranking the Optimizer
 There is a sequence of operations performed on the uplc, each mapping `Term<Name> -> Term<Name>`.
 These remove inconsequential parts of the logic that appear in the generated term.
 These include: 
 - removing application of the identity function
 - directly substituting where apply lambda is applied to a constant or builtin
 - inline or simplify where apply lambda is applied to a param that appears once or not at all
 Each of these optimizing methods has its own relatively narrow focus,
 and so although there is a fair number of LoC, it's reasonably straightforward to follow.
 Some are applied multiple times. 
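As a flavour of how narrow one such pass can be, here is a toy version of the first one in the list: removing applications of the identity function. The `Term` type and `remove_identity_applications` are invented for this sketch and are not Aiken's optimizer. It works on named variables and does not deal with anything subtle.

```rust
// Toy term type for the sketch; not the real `Term<Name>`.
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Var(String),
    Int(i64),
    Lambda { param: String, body: Box<Term> },
    Apply { function: Box<Term>, argument: Box<Term> },
}

// True for terms of the exact shape `\x -> x`.
fn is_identity(term: &Term) -> bool {
    match term {
        Term::Lambda { param, body } => {
            matches!(&**body, Term::Var(name) if name == param)
        }
        _ => false,
    }
}

// Rewrite `(\x -> x) arg` to `arg`, bottom-up over the whole term.
fn remove_identity_applications(term: Term) -> Term {
    match term {
        Term::Apply { function, argument } => {
            // Simplify the children first so nested cases collapse too.
            let function = remove_identity_applications(*function);
            let argument = remove_identity_applications(*argument);
            if is_identity(&function) {
                argument
            } else {
                Term::Apply {
                    function: Box::new(function),
                    argument: Box::new(argument),
                }
            }
        }
        Term::Lambda { param, body } => Term::Lambda {
            param,
            body: Box::new(remove_identity_applications(*body)),
        },
        leaf => leaf,
    }
}
```

A pass like this has the same input and output type, so passes can be chained or repeated in any order.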
 ### The End 
 The generated program can now be serialized and included in the blueprint.
 ### Plutus Core Signposting
 All this fuss is to get us to a point where we can write uplc - and good uplc at that. 
 Note that there are many ways to generate code and most of them are bad.
 The various design decisions and compilation steps make more sense 
 when we have a better understanding of the target language. 
 Uplc is a lambda calculus. 
 For a comprehensive definition of uplc, check out the specification found
 [here](https://github.com/input-output-hk/plutus/#specifications-and-design) from the plutus github repo. 
 (I imagine this link will be maintained longer than the current actual link.)
 If you're not at all familiar with lambda calculus I recommend 
 [an unpacking](https://crypto.stanford.edu/~blynn/lambda/) by Ben Lynn.
 ### What next?
 I think it would be helpful to have some examples... Watch this space.
--- a/content/drafts/unpicking-aiken-air.md
+++ b/content/drafts/unpicking-aiken-air.md
@ -1,228 +0,0 @@
 Aims: 
 - Describe the pipeline, and components getting from aiken to uplc. 
 ## Preface
 Aiken is undergoing active development. 
 This post was started with Aiken ~v1.14.
 With Aiken v1.15, there were already reasonably significant changes to the compilation pipeline. 
 The word is that no changes as big are expected in the near future, but
 this article will undoubtedly begin to diverge from the current codebase even before publishing.
 ## Aiken build
 Tracing `aiken build`, the pipeline is roughly something like: 
 ```
  .               -> Project::read_source_files -> 
  Vec<Source>     -> Project::parse_sources ->
  ParsedModules   -> Project::type_check ->
  CheckedModules  -> CodeGenerator::build ->  
  AirTree         -> AirTree::to_vec -> 
  Vec<Air>        -> CodeGenerator::uplc_code_gen -> 
  Program / Term<Name> -> serialize -> 
  .
 ```
 We'll pick our way through these steps
 At a high level we are trying to do something straightforward: reformulate aiken code as uplc.
 Some aiken expressions are relatively easy to handle for example an aiken `Int` goes to an `Int` in uplc. 
 Some aiken expressions require more involved handling, for example an aiken `If... If Else... Else ` 
 must have the branches "nested" in uplc.
 ### The Preamble 
 #### cli handling
 The cli enters at `aiken/src/cmd/mod.rs` which parses the command. 
 With some establishing of context, the program enters `Project::build` (`crates/aiken-project/src/lib.rs`),
 which in turn calls `Project::compile`. 
 #### File crawl
 The program looks for aiken files in both `./lib` and `./validator` subdirs. 
 For each it walks over all contents (recursively) looking for `.ak` extensions. 
 It treats these two sets of files a little differently. 
 Only validator files can contain the special validator functions.
 #### Parse and Type check
 `Project::parse_sources` parses the module source code.
 The heavy lifting is done by `aiken_lang::parser::module`, which is evaluated on each file. 
 It produces a `Module` containing a list of parsed definitions of the file: functions, types _etc_,
 together with "metadata" like docstrings and the file path. 
 `Project::type_check` inspects the parsed modules and, as the name implies, checks the types. 
 It flags type level warnings and errors. 
 It constructs a hash map of `CheckedModule`s.
 #### Code generator
 The code generator `CodeGenerator` (`aiken-lang/src/gen_uplc.rs`) is given 
 the definitions found from the previous step, 
 together with the plutus builtins. 
 It has additional fields for things like debugging. 
 This is handed over to a `Blueprint` (`aiken-project/src/blueprint/mod.rs`).
 A blueprint does little more than find the validators on which to run the code gen. 
 The heavy lifting is done by `CodeGenerator::generate`.
 We are now ready to take the source code and create plutus. 
 ### Up in the air
 Things become a bit intimidating at this point in terms of sheer lines of code:
 `gen_uplc.rs` and three modules in `gen_uplc/` totals > 8500 LoC.  
 Aiken has its own _intermediate representation_ called `air` (as in Aiken Intermediate Representation). 
 These are common in compiled languages.
 `Air` is defined in `aiken-lang/src/gen_uplc/air.rs`. 
 Unsurprisingly, it looks little bit like a language between aiken and plutus. 
 In fact, Aiken has another intermediate representation: `AirTree`. 
 This is constructed between the `TypedExpr` and `Vec<Air>` ie between parsed aiken and air. 
 #### AirTree 
 Within `CodeGenerator::generate`, `CodeGenerator::build` is called on the function body. 
 This constructs and returns an `AirTree`.
 More on what an airtree is and its construction below.
 At the same time `self` is treated as `mut`, so we need to keep an eye on this too.
 The method which is called and uses this mutability of self is `self.assignment`. 
 It does so by
 ```sample 
  self.assignment >> self.expect_type_assign >> self.code_gen_functions.insert
 ```
 and thus is creating a hashmap of all the functions that appear in the definition.
 (`self.handle_each_clause` is also called with `mut` which in turn calls `self.build` for which `mut` it is needed.
 `self.clause_pattern` is called with `mut` but it isn't used.) 
 ###### Codegen assignment 
 ~200 LoC 
 ###### Codegen expect type assign
 ~400 LoC 
 ###### ... Back to build 
 Validators in aiken are boolean functions while in uplc they are unit-valued (aka void-valued) functions.
 Thus the airtree is wrapped such that `false` results in an error (`wrap_validator_condition`). 
 (Ed: I don't know why there is a prevailing thought that boolean functions are preferable to functions
 that simply error if anything is wrong.)
 `check_validator_args` again extends the airtree from the previous step, 
 and again calls `self.assignment` mutating self.
 Something interesting is happening here. 
 Script context is the final argument of a validator - for any script purpose.
 `check_validator_args` treats the script context like it is an unused argument. 
 We'll circle back to how this works later on.
 Next we encounter 
 ```rust
  AirTree::no_op().hoist_over(validator_args_tree);
 ```
 It's not very apparent why we need to do this. Let's look ahead and consider this later.
 The final airtree to step(s) are in `self.hoist_functions_to_validator`.
 TODO: What happens here?!
 Note that `AirTree` and its methods aren't fully typesafe.
 For example `hoist_over` will throw an error if called on an `Expression`.
 As `AirTree` is for internal use only, the scope for potential problems is reasonably contained.
 The AirTree has the following definition
 ```rust
 pub enum AirTree {
    Statement {
        statement: AirStatement,
        hoisted_over: Option<Box<AirTree>>,
    },
    Expression(AirExpression),
    UnhoistedSequence(Vec<AirTree>),
 }
 ```
 We can see it has a tree-like structure, as the name suggests. 
 `AirExpression` has multiple constructors. These include (non-exhaustive)
 - air primitives (including all the ones that appear in plutus)
 - constructors `Call` and `Fn` to handle functions
 - binary and unary operators
 - handling when and if
 - error and tracing
 `AirStatement` also has multiple constructors. 
 for handling functions, `plutus primitives, along with 
 An `AirStatement` 
 ## Down to uplc 
 ## Air 
 Aiken compiles aiken code to uplc via _air_: 
 Aiken Intermediate Representation. 
 ## Trace
 Running  `aiken build`...
 The cli (See `aiken/src/cmd/mod.rs`) parses the command, 
 finds the context and calls `Project::build` (`crates/aiken-project/src/lib.rs`),
 which in turn calls `Project::compile`. 
 #### `Project::compile`
 1. Check dependencies are available _eg_ aiken stdlib. 
 2. Read source files.  
  1. Walk over `./lib` and `./validators` and push aiken modules onto `Project.sources`.
 3. Parse each source in sources: 
  1. Generate a `ParsedModule` containing the `ast`, `docs`, _etc_.
  The `ast` here is an `UntypedModule`, which contains untyped definitions.
 4. Type check each parsed module.
  1. For each untyped module, create a `CheckedModule`. 
  This includes typed definitions. 
 5. `compile` forks into two depending on whether it's been called with `build` or `check`. 
 6. From `CheckModules` construct a `CodeGenerator`
 7. Pass the generator to construct a new `Blueprints`.
  1. Blueprints finds validators from checked modules. 
  2. From each it constructs a `Validator` with the constructor `Validator::from_checked_module` (which returns a vector of validators)
      1. Its here that the magic happens: The method `generator.generate(def)` is called, 
        where `def` is the typed validator(s). 
        This method outputs a `Program<Name>` which contains the UPLC.
      2. These are collected together.
  3. The rest is collecting and handling the errors and warnings and writing the blueprint.
 #### `CodeGenerator::generate`
 1. Create a new `AirStack`.
 #### `AirStack`
 Consists of:
 1. An Id
 2. A `Scope`
 3. A vector of `Air` 
 The Scope keeps track of ... [TODO]
 #### Air 
 Air is a typed language... [TODO]