Proposal: bootscrapping from byte and bytes

Can we create all the other datatypes from byte and bytes
and algebraic data types?

Encode arbitrary data in scrapscript using Base64. This is helpful for human manipulation and debugging, but don’t worry – we can send it over the wire as raw bytes using flat scraps.

I disagree: base64 is harder to work with than hexadecimal for binary data. While including base64 encoding is obviously a good idea, I don’t think base64 is useful as a debugging or transport format.
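
To make the comparison concrete, here is a small sketch in Python (used only for illustration; none of this is scrapscript API) that renders the same four bytes both ways. In hex every byte maps to exactly two digits, while in base64 the byte boundaries are not visible in the text.

```python
import base64

# Four arbitrary example bytes (an assumption for illustration).
raw = bytes([0xDE, 0xAD, 0xBE, 0xEF])

# Hex: two digits per byte, so byte boundaries stay visible.
hex_view = raw.hex(" ")

# Base64: every 3 bytes become 4 characters, plus "=" padding.
b64_view = base64.b64encode(raw).decode("ascii")

print(hex_view)  # de ad be ef
print(b64_view)  # 3q2+7w==
```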

Can we create all the other datatypes from byte and bytes
and algebraic data types?

I mean, you could. Things fall apart when you have to start thinking about things like endianness though.
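
As a rough illustration of the endianness problem, the Python sketch below (Python is just a stand-in here, not a claim about scrapscript) decodes the same two bytes as an integer in both byte orders. Any integer type built on top of bytes has to pick one of these conventions.

```python
# The same two bytes, read as an unsigned 16-bit integer in each byte order.
raw = bytes([0x01, 0x02])

big = int.from_bytes(raw, byteorder="big")        # 0x0102
little = int.from_bytes(raw, byteorder="little")  # 0x0201

print(big)     # 258
print(little)  # 513
```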

Hmm, I think I agree.

It might be useful to include literal syntax for both hex and base64. Maybe backtick and triple-backtick? I’m open to suggestions.

Literal syntax for hex is typically numeric with a 0x prefix. I don’t see why we should do anything else.

Base64 is “just text” that goes through an encode/decode step, so it should be literally represented as a string.

Being conservative with the use of tokens and delimiters is pretty important. I could see backticks being useful for, say, templating or macros or really anything else.

The difference is that we want syntax to make that encode step happen at compile time. You shouldn’t have to run your program just to discover that your base64 literal was missing a character:

`aGVsbG8gd29ybGQ=`
-- returns: `aGVsbG8gd29ybGQ=`
bytes/to-utf8-text "aGVsbG8gd29ybGQ="
-- returns: ok "hello world"

The former example seems much easier to work with, especially when considering a scrap could be an image or arbitrary pdf or whatever.

I have a few problems with 0x:

  1. It creates ambiguity for the parser, and I want to keep the parser really simple. When the parser hits a zero, it should only have to worry about ints and floats.

  2. 0x seems arbitrary to me, and doesn’t look like anything else in scrapscript. I like “rhyming” syntax. For example, I would probably be happy to use double quotes for bytes and single quotes for byte (which means text would have to move to backtick, which would be pretty weird haha).

The difference is that we want syntax to make that encode step happen at compile time. You shouldn’t have to run your program just to discover that your base64 literal was missing a character.

I agree that’s something nice to have. It would be really nice to have a general-purpose compile-time immediate syntax, similar to Zig’s comptime, FORTH’s immediate, or Lisp macros, to make any function execute at compile time when it’s used in a declaration. I’m actually really partial to FORTH’s immediate: it’s elegant and lets you extend the compiler as a user with basically no overhead.

IMO this isn’t a huge tradeoff in complexity. If you were going for asceticism, why have the Elm applicator sugar?

Regardless, if you do compile-time immediates as above, you can get the best of both: some functions would work by transforming the syntax tree at compile time, so compile-time and tooling-level checks could turn base64/decode "aGVsbG8gd29ybGQ=" directly into "hello world" in the calling function, as if it were a literal. I’m not sure what that syntax would look like; I’d probably have a special “where” clause for functions that matches on syntax elements or literals.
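
A minimal sketch of that idea, using Python’s own `ast` module as a stand-in for a scrapscript compiler pass. The function name `base64_decode` and the `fold` helper are hypothetical, invented for this example. The pass rewrites a decode call on a string literal into the decoded literal, so a malformed literal fails when the source is processed rather than when the program runs.

```python
import ast
import base64


class FoldBase64(ast.NodeTransformer):
    """Replace base64_decode("<literal>") calls with the decoded text."""

    def visit_Call(self, node):
        self.generic_visit(node)
        # "base64_decode" is a hypothetical name chosen for this sketch.
        is_target = isinstance(node.func, ast.Name) and node.func.id == "base64_decode"
        if (
            is_target
            and len(node.args) == 1
            and not node.keywords
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            # validate=True rejects characters outside the base64 alphabet;
            # bad padding also raises (binascii.Error, a ValueError subclass).
            decoded = base64.b64decode(node.args[0].value, validate=True)
            folded = ast.Constant(value=decoded.decode("utf-8"))
            return ast.copy_location(folded, node)
        return node


def fold(source: str) -> str:
    """Parse source, fold base64 literals, and return the rewritten source."""
    tree = FoldBase64().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+


print(fold('greeting = base64_decode("aGVsbG8gd29ybGQ=")'))
```

Calling `fold` on a literal with a missing padding character raises an error immediately, which is the "fail before running" behaviour being asked for.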

The 0x prefix has been the standard since before either of us were alive. Almost every mainstream language supports that syntax (even JS-family languages accept 0x literals, despite JS numbers all being IEEE 754 floats rather than integers).

I don’t mind this, but I’d swap backtick to be for bytes, single quote for byte (this is pretty much exactly the same as Java’s char anyway), and double quotes for text strings. There’s always the option of prefixing double quotes with single ASCII characters like Python to change how the string is parsed. It’s pretty nice, IMO.

Aren’t there already other operators that require the parser to look ahead one character? Iirc it’s even the case for +
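
For what it’s worth, here is a rough sketch of what one character of lookahead for 0x could look like in a hand-written lexer. It is written in Python purely as an illustration; `lex_number` and its token kinds are made up for this example and say nothing about how the real scrapscript parser is structured.

```python
import string


def lex_number(src: str, pos: int = 0):
    """Lex a numeric token starting at src[pos] (assumed to be a digit).

    Returns (kind, text, next_pos), where kind is "hex", "int" or "float".
    """
    # One character of lookahead: a zero followed by x/X starts a hex literal.
    if src[pos] == "0" and pos + 1 < len(src) and src[pos + 1] in "xX":
        end = pos + 2
        while end < len(src) and src[end] in string.hexdigits:
            end += 1
        if end == pos + 2:
            raise ValueError("hex literal needs at least one digit")
        return ("hex", src[pos:end], end)

    # Otherwise fall back to the usual int/float handling.
    end = pos
    while end < len(src) and src[end].isdigit():
        end += 1
    kind = "int"
    if end + 1 < len(src) and src[end] == "." and src[end + 1].isdigit():
        kind = "float"
        end += 1
        while end < len(src) and src[end].isdigit():
            end += 1
    return (kind, src[pos:end], end)


print(lex_number("0x1F + 2"))  # ('hex', '0x1F', 4)
print(lex_number("0.5"))       # ('float', '0.5', 3)
```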

Definitely agree that backticks should be the ones to go to bytes, though. I think double for string / single for “char” or “byte” makes a lot more sense. Backticks are generally harder to reach on a keyboard, and, ime, bytes are used way less frequently than strings

Another option might be to have some sort of syntactical way of embedding an “included” file or string as base64, and just have the compiler deal with the encoding etc, so the user never even sees the string? (At least as an option?)

e.g.

bytes/embed "scraps/local_png.png"
-- returns: "aGVsbG8gd29ybGQ="
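
A sketch of what the compiler would do behind the scenes for something like this, in Python for illustration only. The helper name `embed_as_base64` and the temporary file in `demo` are assumptions for the example; the real `bytes/embed` semantics are not specified anywhere in this thread.

```python
import base64
import os
import tempfile


def embed_as_base64(path: str) -> str:
    """Read a file as raw bytes and return its base64 text form."""
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("ascii")


def demo() -> str:
    """Write a small stand-in file, then embed it."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "local.bin")
        with open(path, "wb") as handle:
            handle.write(b"hello world")
        return embed_as_base64(path)


print(demo())  # aGVsbG8gd29ybGQ=
```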

probably not easily

several languages (zig, c, swift, etc) feature extended precision floating point numbers, which are 80-bit types. with only byte-wide operations, manipulating f80s (if desired) would be impossible

furthermore: this will likely impact performance unless you special case many types in the compiler to be detected and promoted to native (theorem provers detect peano numbers and promote them to native integers, for instance)

plus, if you only had byte and bytes, you’d need to implement addition across 16bit, 32bit, etc in terms of add-and-carry (which would either be a mess of pattern matching and slow, or just slow if implemented as a rock)
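
to give a feel for the add-and-carry approach, here is a minimal python sketch (python only as a stand-in; `add_bytes` is a made-up helper, and little-endian order with wrap-on-overflow are assumptions of this example). it adds two equal-width unsigned integers one byte at a time:

```python
def add_bytes(a: bytes, b: bytes) -> bytes:
    """Add two equal-width little-endian unsigned integers.

    The result has the same width; a carry out of the top byte is dropped,
    so the addition wraps on overflow.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same width")
    out = bytearray()
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry    # at most 255 + 255 + 1 = 511
        out.append(total & 0xFF) # low 8 bits stay in this position
        carry = total >> 8       # 0 or 1 moves to the next byte
    return bytes(out)


a = (0x01FF).to_bytes(2, "little")
b = (0x0001).to_bytes(2, "little")
print(int.from_bytes(add_bytes(a, b), "little"))  # 512
```

every wider integer type would need a loop like this (or the equivalent pattern match) for each arithmetic operation, which is where the performance concern comes from.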

a couple of languages support other notations for arbitrary-base integers. for instance, Ada features base#literal# syntax (e.g., 16#FF#), and some dialects of Smalltalk feature the syntax <base>r<literal> (e.g., 16rFF). this likely requires some grammar complications, though. however, having arbitrary bases is not that much more complicated to implement than one or two bases in terms of reading literals, and it allows for some niche but still-used bases (unix file modes are in octal, if you decide to implement file APIs)

the nice thing is, that if you adopt an arbitrary base syntax, you can have base64, hex literals, binary literals, and anything you desire, in a single syntactical feature: just use 64 as the base to input base64 literals

Oh, like you would have 64x:aGVsbG8= to have a base64 literal? I like that

correct, and 16x for hex (base-16) literals, and so forth.

this would be pretty simple implementation-wise, because you can reuse the same logic for decoding every literal: just take the modulo by a different number in the logic, and special-case the suffix code for any particular subsets
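
a rough sketch of how such a reader could look, in python for illustration. the `<base>x:<digits>` shape follows the 64x:aGVsbG8= example above; everything else is an assumption of this sketch: the function name `parse_based_literal`, the 0-9a-z digit set, the base range 2 to 36, and the choice to return an integer for small bases but bytes for base 64.

```python
import base64
import string

# Digit alphabet for bases 2..36 (an assumption; case-insensitive).
DIGITS = string.digits + string.ascii_lowercase


def parse_based_literal(source: str):
    """Parse '<base>x:<digits>' into an int, or into bytes for base 64."""
    prefix, sep, body = source.partition("x:")
    if not sep or not prefix.isdigit():
        raise ValueError("expected '<base>x:<digits>'")
    base = int(prefix)

    # Special case: base 64 uses its own alphabet and '=' padding.
    if base == 64:
        return base64.b64decode(body, validate=True)

    if not 2 <= base <= 36:
        raise ValueError(f"unsupported base {base}")
    if not body:
        raise ValueError("literal has no digits")

    # Shared positional logic: only the base changes between literals.
    value = 0
    for ch in body.lower():
        digit = DIGITS.find(ch)
        if digit < 0 or digit >= base:
            raise ValueError(f"invalid digit {ch!r} for base {base}")
        value = value * base + digit
    return value


print(parse_based_literal("16x:ff"))        # 255
print(parse_based_literal("8x:755"))        # 493
print(parse_based_literal("64x:aGVsbG8="))  # b'hello'
```

since all of this runs while reading the literal, a bad digit or bad padding is reported at parse time instead of at run time.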

Ooh that literal syntax is quite nice.

With byte strings, you’re generally either (1) copying them from somewhere or (2) writing them by hand. In both cases, it’s better to work natively with the literal format that suits the situation.

Great suggestion!
