Proposal: Syntax change for variant tags

I have been trying to write out the EBNF grammar for scrapscript as a learning exercise, and came across an inconsistency in the examples found on the website.

In a type definition the variant tags are ATOMs (prefixed with #), for example:

scoop : #vanilla #chocolate #strawberry

On the other hand, in the variant construction, the tag appears without any “#”.

This is one of the examples from the website:

point::3d 1.0 "A" ~2B ; point : x => y => z => #2d { x : x, y : y } #3d { x : x, y : y, z : z }

This makes the behavior inconsistent and makes the example above harder to lex. “3d” is neither an atom (it does not start with #) nor a regular identifier (it starts with “3”, which would be lexed as an integer).

If we instead used Atom tags in both places, like this:

point::#3d 1.0 "A" ~2B

the job for both the lexer and parser should be much simpler, and it feels more consistent to use atom tags (prefixed by #) in both places.
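
To make the lexing problem concrete, here is a minimal sketch of a regex-driven lexer. The token names and rules are my own assumptions for illustration, not scrapscript's actual lexer. With these rules, a bare 3d splits into an integer followed by an identifier, while #3d comes out as a single tag token.

```python
import re

# Hypothetical token rules for illustration only; not scrapscript's real lexer.
TOKEN_RULES = [
    ("FLOAT", r"[0-9]+\.[0-9]+"),
    ("INT", r"[0-9]+"),
    ("TAG", r"#[a-zA-Z0-9]+"),
    ("IDENT", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("OP", r"::"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rule})" for name, rule in TOKEN_RULES))


def tokenize(source):
    """Return (kind, text) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


print(tokenize("point::3d"))   # 3d splits into INT "3" and IDENT "d"
print(tokenize("point::#3d"))  # #3d stays together as one TAG
```

With the bare form, the parser would have to glue INT and IDENT back together after seeing ::, which is the context dependence described above.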

Is this interpretation correct? Or is the context-dependent syntax (#3d in one place and 3d in another) an intentional feature of the language?

Here’s the way I currently think about it:

  • # is a unary operator that takes an identifier like 3d or vanilla
  • :: is a binary operator that takes any value on the left (e.g. point, { x : x, y : y }) and an identifier on the right (e.g. 3d or vanilla)

One change we’ve discussed would be to build tags using pipe operators for more consistency:

()

; f =
  | #vanilla -> "tasty"
  | _ -> "yuck"

; scoop :
  | #vanilla
  | #chocolate
  | #strawberry

Thanks for taking the time to answer! :slight_smile:

If # is a unary operator, what does it do? Convert an identifier to a tag? Where the tags are later used (for example in variants, point::3d) they are used as regular identifiers again.

Another question that I could not find the answer to on the website:

What is the full specification for what an identifier can contain?

Languages like Python use rules such as
"an identifier starts with [a-zA-Z_] and can then contain [a-zA-Z0-9_]*",
but in the examples both 3d (an identifier starting with a digit) and example-identifier (an identifier containing “-”) are supported.

I feel like this choice makes parsing harder than it needs to be, and it also leads to unnecessary ambiguity about what happens in some cases.

Scrapscript is supposed to be whitespace-agnostic (“Scrapscript doesn’t care about whitespace”), but here the current design actively forces us to put whitespace around some of the operators.

For example, x-y is an identifier, while x - y is a binary expression. At the same time, - in a unary expression is written without whitespace (-x). Example from the website: spaceq/is-planet "pluto".

It is not obvious whether 3-4 is an identifier or is supposed to be parsed as a binary expression.

Also, should 1+1 lead to a syntax error because it is missing whitespace, and can all operators (except maybe ::) be used inside an identifier (for example x-1*3/-5+y!z)?
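
As a rough illustration of the ambiguity, here is a small sketch of a lexer whose identifiers may contain hyphens. The rules are hypothetical (identifiers start with a letter and may contain - afterwards) and only meant to show how a greedy match changes meaning depending on whitespace.

```python
import re

# Hypothetical token rules, for illustration only: identifiers start with a
# letter and may contain '-' afterwards, as in is-planet.
TOKEN = re.compile(
    r"(?P<IDENT>[a-zA-Z][a-zA-Z0-9-]*)"
    r"|(?P<INT>[0-9]+)"
    r"|(?P<OP>[-+*/])"
    r"|(?P<WS>\s+)"
)


def lex(source):
    """Return (kind, text) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(source):
        match = TOKEN.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for text in ["x-y", "x - y", "x-1", "3-4", "1+1", "-x"]:
    print(f"{text!r:8} -> {lex(text)}")
```

Under these assumed rules, x-y and x-1 each come out as a single identifier, while x - y, 3-4, 1+1 and -x are split into operands and operators. If identifiers were also allowed to start with a digit, 3-4 would become a single identifier too.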

While being able to use operators inside identifiers (is-planet-9, for example) can feel neat in some cases, I feel like it would be more obvious to a reader if operators could not be mixed into identifiers.

If # is a unary operator, what does it do? Convert an identifier to a tag? Where the tags are later used (for example in variants, point::3d) they are used as regular identifiers again.

Yeah, that’s how I model it in my head haha. Maybe something akin to quote in lisp, which operates on syntax rather than semantics? That’s probably too complicated though. It really should be as simple as “see # and parse identifier characters until next whitespace” haha

I can post a rough EBNF later this week if it helps!

What is the full specification for what an identifier can contain?

I don’t think it’s set in stone yet. My intuition says it should be the same as variables, which is something like this:

[a-zA-Z][a-zA-Z0-9-]+

Which would mean that 3d is actually an invalid identifier haha

Also, should 1+1 lead to a syntax error because it is missing whitespace, and can all operators (except maybe ::) be used inside an identifier (for example x-1*3/-5+y!z)?

I think 1+1 should be equal to 2, but I’m conflicted about x-1. My intuition says that we might want to do this for identifiers instead:

[a-zA-Z][a-zA-Z0-9]+(-[a-zA-Z][a-zA-Z0-9]+)*

This would unambiguously force is-planet-9 to be is-planet - 9 and is-planet-v9 to be a variable.
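
Here is a quick sketch of how a lexer could use this rule, taking the longest identifier at the start of the input. It assumes the hyphenated group is meant to repeat (a trailing *), since is-planet-v9 has two hyphenated parts; the function name is made up for the example.

```python
import re

# Pattern from the post, assuming a trailing * so the hyphenated group can
# repeat (needed for names with more than one hyphen).
KEBAB = re.compile(r"[a-zA-Z][a-zA-Z0-9]+(?:-[a-zA-Z][a-zA-Z0-9]+)*")


def longest_identifier_prefix(source):
    """Return the identifier at the start of source, or None if there is none."""
    match = KEBAB.match(source)
    return match.group() if match else None


for text in ["is-planet-9", "is-planet-v9"]:
    prefix = longest_identifier_prefix(text)
    print(f"{text!r}: identifier {prefix!r}, rest {text[len(prefix):]!r}")
```

For is-planet-9 the identifier stops at is-planet and leaves -9 for the expression parser, while is-planet-v9 is consumed whole. One side effect worth noting: because every segment uses +, a one-letter name such as x does not match at all under this pattern.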

What do you think?

I think it makes sense to have the concept of atoms; all other interpretations I have tried become very messy.

Regarding identifiers (which are the same as variable names, right?):

I think it would make sense to be a bit more restrictive and maybe use Python-style identifiers or similar.

Currently my implementation looks like this, and thank god LLMs are becoming quite good at writing regexes, because I don’t know if I would have been able to do it myself :wink: This is my try at unifying all the examples in the guide.

      # Identifier Rules:
      # Allowed: letters (a-z, A-Z), digits (0-9), underscore (_), dash (-), forward slash (/)
      # Cannot start with: dash (-abc) or slash (/abc)
      # Cannot be only: single underscore (_) or digits only (123)
      # Cannot end with: slash (abc/)
      # Cannot contain: double slashes (abc//def)
      # Valid: Hello, 3d, _var, abc-123, my_var, 3_, connie2036/echo, bytes/to-utf8-text
      # Invalid: _, 123, 1.0, -abc, *var, abc/, /abc, abc//def
      {:identifier,
       ~r/^(?!(?:_|[0-9]+|-)(?![a-zA-Z0-9_-]))[a-zA-Z0-9_][a-zA-Z0-9_-]*(?:\/[a-zA-Z0-9_][a-zA-Z0-9_-]*)*(?<!\/)/},

Having more restrictive rules like “identifiers cannot contain ‘-’” would make some function names a little less pretty (is_even instead of is-even), but it would remove the ambiguity of is-even being either an identifier or the expression “is minus even”, and would probably be more intuitive for most users. 1+1 and 1-1 would feel more logical to me if both were math expressions.

How do you feel about this one?

^[a-z]([a-z0-9/-]*[a-z0-9]+)?$

https://regex101.com/r/IqdpIF/2

It wouldn’t allow 3d and would allow abc//def, but I think that’s perfectly okay.
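
For anyone who wants to check the pattern outside regex101, here is a small Python sketch. The helper name and the sample names are just for illustration.

```python
import re

# The pattern proposed above, copied unchanged.
PROPOSED = re.compile(r"^[a-z]([a-z0-9/-]*[a-z0-9]+)?$")


def is_identifier(name):
    """True if name matches the proposed identifier pattern."""
    return PROPOSED.match(name) is not None


samples = ["x", "is-even", "bytes/to-utf8-text", "abc//def", "3d", "abc/", "abc-", "Hello"]
for name in samples:
    print(f"{name:20} {is_identifier(name)}")
```

The first four samples are accepted and the last four are rejected. Hello fails only because the pattern is lowercase-only.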

I know kebab case a-b is a lot harder to parse than snake case a_b, but I think it’s worth the effort to make it work :slight_smile:

My main concern with kebab case is the lack of symmetry between different math operators: if we allow “a+b”, I think we should allow “a-b” to work in the same way (i.e. as an operator, not an identifier).

In practice the difficulty (within reason) of writing the parser should probably not be reflected in the language design; that is just a sign of a leaky implementation. But with that said, most people (or at least me :wink:) looking at “x-y” and “x/y” would probably expect them to be parsed as “x minus y” or “x divided by y”, and not as a single variable name, especially if “x+y” and “x*y” are valid math operations. My take is that we should either require whitespace around operators (which I think would be ugly) or not allow operators inside identifiers.

Hmm, you make some really good points here!

I’m thinking that this might be the simplest option for now:

/^[a-z][a-z0-9]+$/i

No hyphens or underscores :slight_smile: But I think there are multiple different scrapscript dialects/variations/flavors floating around, so feel free to do what you like, and I’ll try to implement this feedback in future updates. We definitely need to formalize the syntax at some point, so your help here is super appreciated.
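
A minimal sketch of the simplified rule in Python, with re.IGNORECASE standing in for the /i flag. The helper name is made up for the example.

```python
import re

# /^[a-z][a-z0-9]+$/i written as a Python pattern.
SIMPLE = re.compile(r"^[a-z][a-z0-9]+$", re.IGNORECASE)


def is_simple_identifier(name):
    """True if name is a letter followed by one or more letters or digits."""
    return SIMPLE.match(name) is not None


for name in ["vanilla", "Hello", "v9", "is-even", "is_even", "3d", "x"]:
    print(f"{name:10} {is_simple_identifier(name)}")
```

vanilla, Hello and v9 pass; is-even, is_even and 3d are rejected as intended. The + also rejects one-letter names such as x, which the examples earlier in the thread use as variables. If those should stay valid, replacing + with * would allow them; that is my own assumption rather than something decided here.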