BABLR: Scratch-like visual editing for all programming languages

conartist6 · February 12, 2024, 1:59pm

I just wanted to see if anyone here has an interest in this idea. My belief is that creating a fully semantic code editor for every programming language (at once) would essentially solve software literacy across the globe, solving most problems with software sustainability at the same time.

My core hypothesis is that a virtual machine for parsers is necessary to attain this goal. Just like we have many web frameworks that are interoperable because they sit on top of HTML and the DOM, my goal is to make all or most developer tooling interoperable by defining equivalent technologies: An HTML-like streaming tree serialization format called CSTML, and a DOM-like object tree format called agAST.

I haven’t done extensive write-ups on this because the idea itself has relatively little value without a spec for these shared formats which can gain political consensus. All my energy essentially goes into the development of that “spec”, which is in the form of a reference implementation. Contributions to or feedback on this reference implementation would be most appreciated, and of course I’m always happy to discuss the philosophy or political mechanics behind these ideas.

abitrolly · February 12, 2024, 3:43pm

Have you looked at LLVM IR? I would ask those guys about your idea.

conartist6 · February 12, 2024, 3:54pm

I don’t know that much about LLVM IR. My impression of it is that it is quite a bit lower level than my CSTs and more focused on execution. CSTML trees are guaranteed to contain every single space and comment. Any IR that doesn’t contain spaces and comments isn’t going to be useful as backing state for a code editor, which is my goal.

abitrolly · February 12, 2024, 7:42pm

Maybe then SCIP (sourcegraph/scip, the SCIP Code Intelligence Protocol) can help? Not sure about spaces and comments though.

conartist6 · February 12, 2024, 8:16pm

SCIP also isn’t built to be an editor core, and its format’s assertion that languages can be enumerated is wrong, a category in which it joins SrcML.

Tree-sitter is the closest thing that exists, and quite a few people have been able to patch the holes in tree-sitter and use it to build functionality like code formatting that requires access to concrete syntax but can have implementations for many languages, notably Topiary.

As close as tree-sitter is to being an editor core, though, it still isn’t quite one. For one thing, it requires you to keep a separate text document that’s the real source of truth for the shared state of the system’s frontend and backend, and that means keeping history in the text, which means storing text in something like a piece tree or rope.

BABLR is different because it gives an editor all the resources it needs to define a user interface that’s fully integrated with the semantic structure of the language being written. In doing so it allows coding interfaces to flourish that go well beyond typing out the program’s syntax character by character, including such possibilities as gestural coding, or even widespread voice coding of the kind being made possible by Cursorless.

RichardLitt · February 13, 2024, 4:06am

I would ask julienmalard (Julien Malard-Adam) and wooorm (Titus) on GitHub for advice. They may be able to help.

abitrolly · February 13, 2024, 9:01am

"Holes" is security jargon. So what kind of holes are in tree-sitter? Can you give a specific example?

conartist6 · February 13, 2024, 1:21pm

I mean holes in the functionality. Tree-sitter doesn’t by default store syntactic tokens like parens, curly braces, nor does it store whitespace. People who built code transformers on top of it have to insert extra custom captures that get all the parens and curly braces, and then they have to use source ranges matched up to a source text to understand whitespace. Thus there is a significant amount of nontrivial setup required before you can do something as simple as reprinting a parsed program with tree-sitter.
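To make the reprinting problem concrete, here is a minimal Python sketch. It does not use tree-sitter's API: `reprint`, `tokens_only`, and the `leaves` list of (start, end) ranges are hypothetical stand-ins for a tree that kept only its named tokens, and the sample source string is made up. The point is only that everything between the recorded ranges has to be recovered from the original text.

```python
def tokens_only(source, leaves):
    """Join just the tokens the tree recorded; punctuation and spacing are lost."""
    return "".join(source[start:end] for start, end in leaves)


def reprint(source, leaves):
    """Reprint the program by filling the gaps between recorded ranges
    from the original source text (hypothetical helper)."""
    out = []
    cursor = 0
    for start, end in leaves:
        out.append(source[cursor:start])  # parens, commas, whitespace, comments
        out.append(source[start:end])     # the token the tree actually stored
        cursor = end
    out.append(source[cursor:])           # trailing material after the last token
    return "".join(out)


source = "f( a,  b ) // call"
leaves = [(0, 1), (3, 4), (7, 8)]  # ranges for f, a, b only

print(repr(tokens_only(source, leaves)))
print(repr(reprint(source, leaves)))
```

Without `source` in hand, only the first result is available, which is why a tree like this cannot stand alone as an editor's backing state.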

conartist6 · February 13, 2024, 1:33pm

Oh I know of unist/esast, and thus I guess wooorm’s work. I’ve always had one problem with unist though, which is this piece from the documentation:

“esast extends unist, a format for syntax trees, to benefit from its ecosystem of utilities. There is one important difference with other implementations of unist: children are added at fields other than the children array and the children field is not used.”

I’ve always considered this to be the single decision that prevented that project from having a bright future.

conartist6 · February 13, 2024, 1:35pm

By contrast, in agAST the JavaScript tree format is not special compared to the agAST trees for any and every other language. agAST trees always preserve both name and ordering for children.

conartist6 · February 13, 2024, 1:48pm

Ah, and there’s one other fatal problem with unist, which is that it isn’t an editable structure. It uses ESTree as its JS implementation despite the obvious incompatibility because it grew quite closely out of ESTree, including the practice of tracking start and end for every node, these being numeric indices which select subranges of the program’s source text.

Here’s an example structure, cribbed from their docs:

position: {
  start: {line: 1, column: 1, offset: 0},
  end: {line: 1, column: 75, offset: 74}
}

The problem here is that since every single node has a position, if you had a large document and you wanted to insert a single space at the beginning of it, to update the document you will have to go through and change every single node’s position to be offset by one (the space you just inserted) or else all the source text mappings will be wrong. The fact that you can’t type a single space without revisiting the entire document is problematic for editor responsiveness, to say the least, and is a limitation shared by almost every existing parser (though not mine).
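A minimal Python sketch of that cost difference follows. The `AbsNode` and `RelNode` classes and the helper functions are hypothetical, written only for illustration. Storing lengths instead of absolute offsets is one common alternative, assumed here for contrast; it is not a description of how BABLR itself is implemented.

```python
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AbsNode:
    """Node holding absolute offsets into the source text (hypothetical)."""
    start: int
    end: int
    children: list[AbsNode] = field(default_factory=list)


@dataclass
class RelNode:
    """Node holding only its own length (hypothetical alternative)."""
    length: int
    children: list[RelNode] = field(default_factory=list)
    parent: Optional[RelNode] = field(default=None, repr=False)


def insert_text(root, at, length):
    """Shift every offset at or after `at`; return how many nodes changed."""
    touched = 0
    stack = [root]
    while stack:
        node = stack.pop()
        moved = False
        if node.start >= at and node is not root:
            node.start += length
            moved = True
        if node.end >= at:
            node.end += length
            moved = True
        touched += moved
        stack.extend(node.children)
    return touched


def grow(leaf, length):
    """Lengthen one leaf and its ancestors; return how many nodes changed."""
    touched = 0
    node = leaf
    while node is not None:
        node.length += length
        touched += 1
        node = node.parent
    return touched


def build_abs(n):
    """A root covering n one-character leaves, with absolute offsets."""
    return AbsNode(0, n, [AbsNode(i, i + 1) for i in range(n)])


def build_rel(n):
    """The same shape, but each node records only its length."""
    root = RelNode(length=n)
    root.children = [RelNode(length=1, parent=root) for _ in range(n)]
    return root
```

With 1000 leaves, `insert_text(build_abs(1000), at=0, length=1)` visits and changes all 1001 nodes, while `grow(build_rel(1000).children[0], 1)` changes only the leaf and the root.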

conartist6 · February 13, 2024, 1:56pm

Sorry, I really don’t mean to come here to hate on other people’s work. It’s just that in OSS I experience a lot of gatekeeping, and never more than when it seems like some idea has already been done. It would seem to be a universal tenet among computer scientists that “parsers have already been done,” in other words that there must not be anything seriously interesting left to learn about building parsers.

I contend that this is not true, that parsing as we know it is still far from fulfilling its full potential to serve humanity, and to make that point I have to point out how all existing systems have inherited certain flaws which are so prevalent as to not be considered flaws at all.

abitrolly · February 13, 2024, 2:11pm

Do you have these points about other projects documented somewhere in dedicated space? Like a BABLR page called “Why not ______ ?”

Because for the points to become valid, they should have a reference and discussion, which is impossible to achieve in this forum, both because of plain text format and because of topic (sustainability).

conartist6 · February 13, 2024, 2:14pm

That’s a great idea – I’ll adapt this into something like that and share it here.

conartist6 · February 13, 2024, 2:26pm

I just had the idea that I could describe the difference as being “space parsers” versus “time parsers”. Does that make sense?

The idea is that a space parser is what we have had, a way of creating tree-hierarchies on top of some one or two dimensional “space”, which is the in-memory text of the program.

What I’m proposing are time parsers, which describe trees by a sequence of actions like “start node”, “end node”, and “emit literal”, which when done in order will constitute a traversal of the particular encoded tree. It’s a well known idea, and I stole it brazenly from HTML, which stole it from something even earlier. What’s especially powerful about it is that it allows you to have many different spatial layouts, so long as you know how to map time-structured traversals onto them. There are many different implementations of the DOM, but they all interoperate through HTML!
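As a rough illustration of the idea, here is a minimal Python sketch. The action names paraphrase the ones above; the `(type, children)` tuple layout, the function names, and the sample tree are assumptions made for this example and are not CSTML syntax or the agAST format.

```python
def events(node):
    """Yield a traversal of `node` as (action, value) pairs.

    A node is either a string (literal text) or a (type, children) tuple.
    """
    if isinstance(node, str):
        yield ("literal", node)
        return
    type_, children = node
    yield ("start", type_)
    for child in children:
        yield from events(child)
    yield ("end", type_)


def build(stream):
    """Rebuild a (type, children) tree from an event stream."""
    stack = [("root", [])]  # sentinel parent that collects the top-level node
    for action, value in stream:
        if action == "start":
            stack.append((value, []))
        elif action == "literal":
            stack[-1][1].append(value)
        elif action == "end":
            finished = stack.pop()
            stack[-1][1].append(finished)
    return stack[0][1][0]


def text(stream):
    """Print the program: the literals, in the order they are emitted."""
    return "".join(value for action, value in stream if action == "literal")


tree = ("Call", [("Identifier", ["f"]), "(", ("Number", ["1"]),
                 ",", " ", ("Number", ["2"]), ")"])
```

Here `text(events(tree))` gives back the source text `f(1, 2)`, and `build(events(tree))` reconstructs an equal tree. Any other in-memory layout could consume the same stream, which is the interoperability argument being made.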

abitrolly · February 13, 2024, 9:50pm

“Space parser” doesn’t make sense to me. You still need to parse start, end, and emit events from 1D input data to build a 2D tree hierarchy.

conartist6 · February 14, 2024, 1:53pm

If you have numeric indexes used for random access, then you need some block of allocated memory to point into. Then any two in-range indexes define a subset of the space, which is what nodes do in what I’m calling a “space parser”. It doesn’t require you to emit events, because space parsers are almost universally not streaming parsers. You emit the entire tree, once you’ve processed the entire text.

abitrolly · February 14, 2024, 4:50pm

This looks like source maps to me. With non-overlapping nesting. Which still requires some nesting parsing to “color” regions. I used the wikify library (on PyPI) to build non-conflicting recursive parsers for markup processing.

conartist6 · February 14, 2024, 6:06pm

I guess what you’re getting at is that if you don’t think of the textual tree as the most important asset, what is left would just be a text transformer. Indeed that is true. This is also a way of building non-conflicting recursive parsers, and it will likely need some (native) format for source maps (which I have not yet created).

abitrolly · February 14, 2024, 10:56pm

So how does that help Open Source Sustainability?
