Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: Added initial architecture decision records #1

Merged
merged 8 commits into from
Jun 7, 2024
35 changes: 35 additions & 0 deletions docs/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Algorand TypeScript

Algorand TypeScript is a partial implementation of the TypeScript programming language that runs on the Algorand Virtual Machine (AVM). It includes a statically typed framework for development of Algorand smart contracts and logic signatures, with TypeScript interfaces to underlying AVM functionality that works with standard TypeScript tooling.

It maintains the syntax and semantics of TypeScript such that a developer who knows TypeScript can make safe assumptions
about the behaviour of the compiled code when running on the AVM. Algorand TypeScript is also executable TypeScript that can be run
and debugged on a Node.js virtual machine with transpilation to EcmaScript and run from automated tests.
robdmoore marked this conversation as resolved.
Show resolved Hide resolved

# Guiding Principals

## Familiarity

Where the base language (TypeScript/EcmaScript) doesn't support a given feature natively (eg. unsigned fixed size integers),
prior art should be used to inspire an API that is familiar to a user of the base language and transpilation can be used to
ensure this code executes correctly.

## Leveraging TypeScript type system

TypeScript's type system should be used where ever possible to ensure code is type safe before compilation to create a fast
feedback loop and nudge users into the [pit of success](https://blog.codinghorror.com/falling-into-the-pit-of-success/).

## TEALScript compatibility

[TEALScript](https://github.com/algorandfoundation/tealscript/) is an existing TypeScript-like language to TEAL compiler however the source code is not executable TypeScript, and it does not prioritise semantic compatibility. Wherever possible, Algorand TypeScript should endeavour to be compatible with existing TEALScript contracts and where not possible migratable with minimal changes.

## Algorand Python

[Algorand Python](https://algorandfoundation.github.io/puya/) is the Python equivalent of Algorand TypeScript. Whilst there is a primary goal to produce an API which makes sense in the TypeScript ecosystem, a secondary goal is to minimise the disparity between the two APIs such that users who choose to, or are required to develop on both platforms are not facing a completely unfamiliar API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
## ABI Abstraction
When possible, Algorand TypeScript should avoid putting the cognitive overhead of ABI encoding/decoding on the developer. For example, there should be no different between AVM byteslices and ABI encoded strings and they should be directly comparable and compatible until the point of encoding (returning, putting in state, array encoding, logging, etc.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a slightly more nuanced view of this.

There should be default typing options that meet this principle. Like how in Algorand Python if you use a UInt64 or String then Puya takes care of encoding/decoding for you.

But, I think it's important to also have control over that stuff when you want/need it, or there isn't a direct translation between AVM primitives and the ABI type. The way this was achieved in Algorand Python was to have primitive types (e.g. UInt64, String) that automatically get decoded and encoded on their way in/out and represented with the equivalent ABI type in ARC-32/4, but then to expose specific ABI encoded types in a separate (arc4) namespace, so when you want to work with the encoded data you can do that too (and those types all have a .native property to easily decode to the relevant underlying AVM type. Reference: https://algorandfoundation.github.io/puya/lg-types.html

Regardless, this is not relevant to these ADRs because they are talking about AVM primitive types. We plan on having a separate ADR to discuss how ABI types are handled.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the scenarios where a developer wants to perform operations on the encoded bytes? Seems to be very rare. For edge cases TEALScript provides a rawBytes function that will give you the encoded bytes of the value.

Regardless, this is not relevant to these ADRs because they are talking about AVM primitive types. We plan on having a separate ADR to discuss how ABI types are handled.

Ah gotcha. I do think they are a bit linked though because compatibility between native types and ABI types is important. It's something develoeprs have tripped up on with Beaker and Algorand Python

# Architecture decisions

As part of developing Algorand TypeScript we are documenting key architecture decisions using [Architecture Decision Records (ADRs)](https://adr.github.io/). The following are the key decisions that have been made thus far:

- [2024-05-21: Primitive integer types](./architecture-decisions/2024-05-21_primitive-integer-types.md)
- [2024-05-21: Primitive byte and string types](./architecture-decisions/2024-05-21_primitive-bytes-and-strings.md)
196 changes: 196 additions & 0 deletions docs/architecture-decisions/2024-05-21_primitive-bytes-and-strings.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,196 @@
# Architecture Decision Record - Primitive bytes and strings

- **Status**: Draft
- **Owner:** Tristan Menzel
- **Deciders**: Alessandro Cappellato (Algorand Foundation), Joe Polny (Algorand Foundation), Rob Moore (MakerX)
- **Date created**: 2024-05-21
- **Date decided**: N/A
- **Date updated**: 2024-05-31

## Context

See [Architecture Decision Record - Primitive integer types](./2024-05-21_primitive-integer-types.md) for related decision and context.

The AVM's only non-integer type is a variable length byte array. When *not* being interpreted as a `biguint`, leading zeros are significant and length is constant unless explicitly manipulated. Strings can only be represented in the AVM if they are encoded as bytes. The AVM supports byte literals in the form of base16, base64, and UTF-8 encoded strings. Once a literal has been parsed, the AVM has no concept of the original encoding or of UTF-8 characters. As a result, whilst a byte array can be indexed to receive a single byte (or a slice of bytes); it cannot be indexed to return a single UTF-8 *character* - unless one assumes all characters in the original string were ASCII (i.e. single byte) characters.

Algorand Python has specific [Bytes and String types](https://algorandfoundation.github.io/puya/lg-types.html#avm-types) that have semantics that exactly match the AVM semantics. Python allows for operator overloading so these types also use native operators (where they align to functionality in the underlying AVM).


## Requirements

- Support bytes AVM type and a string type that supports ASCII UTF-8 strings
- Use idiomatic TypeScript expressions for string expressions
- Semantic compatibility between AVM execution and TypeScript execution (e.g. in unit tests)

## Principles

- **[AlgoKit Guiding Principles](https://github.com/algorandfoundation/algokit-cli/blob/main/docs/algokit.md#guiding-principles)** - specifically Seamless onramp, Leverage existing ecosystem, Meet devs where they are
- **[Algorand Python Principles](https://algorandfoundation.github.io/puya/principles.html#principles)**
- **[Algorand TypeScript Guiding Principles](../README.md#guiding-principals)**

## Options


### Option 1 - Direct use of native EcmaScript types


EcmaScript provides two relevant types for bytes and strings.

- **string**: The native string type. Supports arbitrary length, concatenation, indexation/slicing of characters plus many utility methods (upper/lower/startswith/endswith/charcodeat/trim etc). Supports concat with binary `+` operator.
- **Uint8Array**: A variable length mutable array of 8-bit numbers. Supports indexing/slicing of 'bytes'.


```ts
const b1 = "somebytes"

const b2 = new Uint8Array([1, 2, 3, 4])

const b3 = b1 + b1
```

Whilst binary data is often a representation of a utf-8 string, it is not always - so direct use of the string type is not a natural fit. It doesn't allow us to represent alternative encodings (b16/b64) and the existing api surface is very 'string' centric. Much of the api would also be expensive to implement on the AVM leading to a bunch of 'dead' methods hanging off the type (or a significant amount of work implementing all the methods). The signatures of these methods also use `number` which is [not a semantically relevant type](./2024-05-21_primitive-integer-types.md).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main problem with having non-implemented functions is them still showing IDE, but we can solve this with plugins

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also name clashes. e.g. if we want a length method that returns uint64 then you can't do it with extending string prototype.


Achieving semantic compatability with EcmaScript's `String` type would also be very expensive as it uses utf-16 encoding underneath whilst an ABI string is utf-8 encoded. A significant number of ops (and program size) would be required to convert between the two. If we were to ignore this and use utf-8 at runtime, apis such as `.length` would return different results. For example `"😄".length` in ES returns `2` whilst utf-8 encoding would yield `1` codepoint or `4` bytes, similarly indexing and slicing would yield different results.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I thought we were after semantic compatibility with TypeScript, not EcmaScript?

Copy link
Contributor

@tristanmenzel tristanmenzel Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TypeScript being a superset of EcmaScript takes its string semantics from EcmaScript so it made more sense to me to reference the latter. This will probably apply in most cases. Maybe it's not valuable to distinguish between the two and we should just use 'TypeScript' everywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the distinction is semantic compatibility with the runtime vs semantic compatibility with the type system. We know we'll have a transformer to adjust runtime behavior, thus is still seems reasonable to me to use string to represent utf-8 strings (much like how we can use number to represent uint64).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semantic compatibility means given some typescript code - it either runs the same in our execution environment (test execution and AVM execution), or doesn't run at all (if we don't support a given syntax).

AlgoTS will need to introduce several types that don't exist in any existing javascript execution environment so there is no expectation that you will be able to lift AlgoTS code and run it in a browser but the syntax and semantics should be the same.

For the reasons listed above I think it is impractical to have a string type on the avm behave the same as the typescript/ecmascript string so we would be better off introducing a new type that doesn't carry any semantic expectations.

As another example, given this function

function test(x: string) {
   return x[0]
}

When you pass in the string "a", you would get "a" back in any javascript execution environment - and the same in TealScript; however if you pass in the string "ā", you would get "ā" back in any javascript execution environment - but you'll get 0xC4 in TealScript (an invalid utf8 string)


With regards to number we deliberately don't expose the type as number and force implicit and explicit casts when dealing with literals and operations. In other words we pretend to introduce a new type uint64 without semantic expectations but are forced (by limitations of typescript/ecmascript) to depend on number underneath.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When you pass in the string "a", you would get "a" back in any javascript execution environment - and the same in TealScript; however if you pass in the string "ā", you would get "ā" back in any javascript execution environment - but you'll get 0xC4 in TealScript (an invalid utf8 string)

Ah ok thanks for the example. That makes sense.

With regards to number we deliberately don't expose the type as number and force implicit and explicit casts when dealing with literals and operations. In other words we pretend to introduce a new type uint64 without semantic expectations but are forced (by limitations of typescript/ecmascript) to depend on number underneath.

Perhaps a viable option here is also a branded string with the upside being string literal support

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A branded string is definitely an option - the major con is the crowded prototype which we can (probably??) clean up with a plugin, but that's additional complication (and development effort). It feels like a lot to pay to get plain literals "hello" over tagged literals Str`hello`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah true, branded strings still would't address the incorrect intuition of str[x] accessing byte x rather than character x.


The Uint8Array type is fit for purpose as an encoding mechanism but the API is not as friendly as it could be for writing declarative contracts. The `new` keyword feels unnatural for something that is ostensibly a primitive type. The fact that it is mutable also complicates the implementation the compiler produces for the AVM.
robdmoore marked this conversation as resolved.
Show resolved Hide resolved



### Option 2 - Branded strings (TEALScript approach)


TEALScript uses a branded `string` to represent `bytes` and native `string` to represent UTF-8 bytes. Base64/Base16 encoding/decoding is performed with specific methods.

```typescript
const someString = "foo"
const someHexValue = hex("0xdeadbeef") // branded "bytes"
```

Bytes and UTF-8 strings are typed via branded `string` types. UTF-8 strings are the most common use case for strings, thus have the JavaScript `String` prototype functions when working with byteslice, which provides a familiar set of function signatures. This option also enables the usage of `+` for concatenation.

To differentiate between ABI `string` and AVM `byteslice`, a branded type, `bytes`, can be used to represent non-encoded byteslices that may or may not be UTF-8 strings.

Additional functions can be used when wanting to have string literals of a specific encoding represent a string or byteslice.


The downsides of using `string` are listed in Option 1.


### Option 3 - Define a class to represent Bytes

A `Bytes` class and `Str` (Name TBD) class are defined with a very specific API tailored to operations which are available on the AVM:

```ts
class Bytes {
constructor(v: string) {
this.v = v
}

concat(other: Bytes): Bytes {
return new Bytes(this.v + other.v)
}

at(x: uint64): Bytes {
return new Bytes(this.v[x])
}

/* etc */
}

class Str {
/* implementation */
}

```

This solution provides great type safety and requires no transpilation to run _correctly_ on Node.js. However, non-primitive types in Node.js have equality checked by reference. Again the `new` keyword feels unnatural. Due to lack of overloading, `+` will not work as expected however concatenations do not require the same understanding of "order of operations" and nesting as numeric operations, so a `concat` method isn't as unwieldy (but still isn't idiomatic).

```ts
const a = new Bytes("Hello")
const b = new Bytes("World")
const c = new Str("Example string")
const ab = a.concat(b)

function testValue(x: Bytes) {
// No compile error, but will work on reference not value
switch(x) {
case a:
return b
case b:
return a
}
return new Bytes("default")
}
```

To have equality checks behave as expected we would need a transpilation step to replace bytes values in certain expressions with a primitive type.

### Option 4 - Implement bytes as a class but define it as a type + factory

We can iron out some of the rough edges of using a class by only exposing a factory method for `Bytes`/`Str` and a resulting type `bytes`/`str`. This removes the need for the `new` keyword and lets us use a 'primitive looking' type alias (`bytes` versus `Bytes`, `str` versus `Str` - much like `string` and `String`). We can use [tagged templates](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals#tagged_templates) to improve the user experience of multipart concat expressions in lieu of having the `+` operator.

```ts

export type bytes = {
readonly length: uint64

at(i: Uint64Compat): bytes

concat(other: BytesCompat): bytes
} & symbol

export function Bytes(value: TemplateStringsArray, ...replacements: BytesCompat[]): bytes
export function Bytes(value: BytesCompat): bytes
export function Bytes(value: BytesCompat | TemplateStringsArray, ...replacements: BytesCompat[]): bytes {
/* implementation */
}

const a = Bytes("Hello")
const b = Bytes.fromHex("ABFF")
const c = Bytes.fromBase64("...")
const d = Bytes.fromInts(255, 123, 28, 20)
const e = Bytes`${a} World!`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor detail, but I would propose we take inspiration from Python and use b

b`${a} World!`

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did explore this but found (at least in contrived example code) the likelihood of a name clash with a local variable to be quite high. eg. const [a, b, c] = someArray. This resulted in an unfriendly error. We could use B if we prefer to be terse at the expense of clarity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right yeah you did mention that. I don't feel strongly either way so we should just see what existing developers would prefer.



function testValue(x: bytes, y: bytes): bytes {
return Bytes`${x} and ${y}`
}

const f = Str`Example string`

```

robdmoore marked this conversation as resolved.
Show resolved Hide resolved
Whilst we still can't accept string literals on their own, the tagged template is almost as concise.

Having `bytes` and `str` behave like a primitive value type (value equality) whilst not _actually_ being a primitive is not strictly semantically compatible with EcmaScript however the lowercase type names (plus factory with no `new` keyword) communicates the intention of it being a primitive value type and there is an existing precedence of introducing new value types to the language in a similar pattern (`bigint` and `BigInt`). Essentially - if EcmaScript were to have a primitive bytes type, this is most likely what it would look like.

## Preferred option

Option 3 can be excluded because the requirement for a `new` keyword feels unnatural for representing a primitive value type.

Option 1 and 2 are not preferred as they make maintaining semantic compatability with EcmaScript impractical.

Option 4 gives us the most natural feeling api whilst still giving us full control over the api surface. It doesn't support the `+` operator, but supports interpolation and `.concat` which gives us most of what `+` provides other than augmented assignment (ie. `+=`).

We should select an appropriate name for the type representing an AVM string. It should not conflict with the semantically incompatible EcmaScript type `string`.
- `str`/`Str`:
- ✅ Short
- ✅ obvious what it is
- ✅ obvious equivalent in ABI types
- ❌ NOT obvious how it differs from EcmaScript `string`
- `utf8`/`Utf8`:
- ✅ Short
- ✅ reasonably obvious what it is
- 🤔 less obvious equivalent in ABI types
- ✅ obvious how it differs to `string`
- `utf8string`/`Utf8String`
- ❌ Verbose
- ✅ obvious equivalent in ABI types
- ✅ very obvious what it is
- ✅ obvious how it differs to `string`



## Selected option

Option 4 has been selected as the best option
Loading