I just wanted to see if anyone here has an interest in this idea. My belief is that creating a fully semantic code editor for every programming language (at once) would essentially solve software literacy across the globe, solving most problems with software sustainability at the same time.
My core hypothesis is that a virtual machine for parsers is necessary to attain this goal. Just like we have many web frameworks that are interoperable because they sit on top of HTML and the DOM, my goal is to make all or most developer tooling interoperable by defining equivalent technologies: an HTML-like streaming tree serialization format called CSTML, and a DOM-like object tree format called agAST.
I haven’t done extensive write-ups on this because the idea itself has relatively little value without a spec for these shared formats which can gain political consensus. All my energy essentially goes into the development of that “spec”, which is in the form of a reference implementation. Contributions to or feedback on this reference implementation would be most appreciated, and of course I’m always happy to discuss the philosophy or political mechanics behind these ideas.
I don’t know that much about LLVM IR. My impression of it is that it’s quite a bit lower level than my CSTs and more focused on execution. CSTML trees are guaranteed to contain every single space and comment. Any IR that doesn’t contain spaces and comments isn’t going to be useful as backing state for a code editor, which is my goal.
SCIP also isn’t built to be an editor core, and its format wrongly assumes that languages can be enumerated, a flaw it shares with SrcML.
Tree-sitter is the closest thing that exists, and quite a few people have been able to patch the holes in tree-sitter and use it to build functionality that requires access to concrete syntax but can have implementations for many languages, such as code formatting; Topiary is a notable example.
As close as tree-sitter is to being an editor core though, it still isn’t quite one. For one thing, it requires you to keep a separate text document that’s the real source of truth for the shared state of the system’s frontend and backend. That means keeping history in the text, which in turn means storing the text in something like a piece tree or rope.
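To make that cost concrete, here is a minimal sketch of the kind of rope an editor ends up maintaining when the text is the source of truth. The names and structure here are mine for illustration, not any particular editor’s implementation, and a real rope would also rebalance itself:

```javascript
// A minimal rope: a binary tree of string fragments. Each branch caches
// the length of its left subtree ("weight") so index lookups and inserts
// are O(depth) rather than O(text length).

class Leaf {
  constructor(text) {
    this.text = text;
  }
  get length() {
    return this.text.length;
  }
  insert(index, text) {
    // A real rope would split large leaves and rebalance here.
    return new Leaf(this.text.slice(0, index) + text + this.text.slice(index));
  }
  toString() {
    return this.text;
  }
}

class Branch {
  constructor(left, right) {
    this.left = left;
    this.right = right;
    this.weight = left.length;
  }
  get length() {
    return this.weight + this.right.length;
  }
  insert(index, text) {
    // Only the subtree containing the edit point is rebuilt.
    return index < this.weight
      ? new Branch(this.left.insert(index, text), this.right)
      : new Branch(this.left, this.right.insert(index - this.weight, text));
  }
  toString() {
    return this.left.toString() + this.right.toString();
  }
}

const doc = new Branch(new Leaf('const x '), new Leaf('= 1;'));
const doc2 = doc.insert(8, '  '); // touches one leaf, not the whole text
console.log(doc2.toString()); // 'const x   = 1;'
```

The point is that all of this machinery exists only to keep the *text* editable; none of it knows anything about the syntax tree sitting beside it.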
BABLR is different because it gives an editor all the resources it needs to define a user interface that’s fully integrated with semantic structure and an understanding of the language being written. In doing so, it allows interfaces for coding to flourish that go well beyond typing the syntax of a program out character by character, including such possibilities as gestural coding, or even widespread voice coding of the kind Cursorless is making possible.
I mean holes in the functionality. By default tree-sitter doesn’t store syntactic tokens like parens and curly braces, nor does it store whitespace. People who build code transformers on top of it have to insert extra custom captures to collect all the parens and curly braces, and then they have to match source ranges up against the source text to recover the whitespace. Thus there is a significant amount of nontrivial setup required before you can do something as simple as reprinting a parsed program with tree-sitter.
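Here’s a sketch of what that workaround looks like, using a made-up node shape (not tree-sitter’s actual API): nodes carry only their ranges into the source text, so reprinting anything, including the parens and whitespace the tree omits, means slicing the original text back out:

```javascript
// Hypothetical range-based tree, loosely modeled on the situation
// described above. Named nodes only -- no paren or whitespace tokens.

const source = 'f( a,  b )';

const call = {
  type: 'call',
  startIndex: 0,
  endIndex: 10,
  children: [
    { type: 'identifier', startIndex: 0, endIndex: 1, children: [] },
    { type: 'identifier', startIndex: 3, endIndex: 4, children: [] },
    { type: 'identifier', startIndex: 7, endIndex: 8, children: [] },
  ],
};

// Reprinting is trivial -- but only while you still have `source`.
// Throw the text away and the tree alone cannot reproduce 'f( a,  b )'.
const print = (node) => source.slice(node.startIndex, node.endIndex);

console.log(print(call)); // 'f( a,  b )'
console.log(print(call.children[2])); // 'b'
```

Notice that the tree by itself knows nothing about the `(`, the `,`, or the irregular spacing; all of that lives only in the text.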
Oh I know of unist/esast, and thus I guess wooorm’s work. I’ve always had one problem with unist though, which is this piece from the documentation:
“esast extends unist, a format for syntax trees, to benefit from its ecosystem of utilities. There is one important difference with other implementations of unist: children are added at fields other than the children array and the children field is not used.”
I’ve always considered this to be the single decision that prevented that project from having a bright future.
By contrast, in agAST the JavaScript tree format is not special compared to the agAST trees for any and every other language. agAST trees always preserve both name and ordering for children.
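To illustrate the principle (this is a hypothetical shape I made up for this post, not the actual agAST spec), imagine a node whose children stay in one ordered array, where each named child carries a reference to its field, so that neither ordering nor naming is ever lost:

```javascript
// Hypothetical node shape -- NOT the real agAST format, just a sketch of
// the "preserve both name and ordering" principle. Children stay in one
// ordered array; named fields are reached through references.

const node = {
  type: 'BinaryExpression',
  children: [
    { ref: 'left' },                      // named child, kept in order
    { token: 'Whitespace', value: ' ' },  // trivia kept in place
    { token: 'Punctuator', value: '+' },
    { token: 'Whitespace', value: ' ' },
    { ref: 'right' },
  ],
  properties: {
    left: { type: 'Identifier', children: [{ token: 'Word', value: 'a' }] },
    right: { type: 'Identifier', children: [{ token: 'Word', value: 'b' }] },
  },
};

// Reprinting is a plain in-order walk -- no source text required.
const print = (n) =>
  n.children
    .map((c) => (c.ref ? print(n.properties[c.ref]) : c.value))
    .join('');

console.log(print(node)); // 'a + b'
```

The same walk works for any language’s trees, because nothing about the shape is JavaScript-specific.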
Ah, and there’s one other fatal problem with unist, which is that it isn’t an editable structure. It uses ESTree as its JS implementation despite the obvious incompatibility, because it grew quite closely out of ESTree, inheriting the practice of tracking a start and end for every node: numeric indices which select subranges of the program’s source text.
Here’s an example structure, cribbed from their docs.
The problem here is that every single node has a position. If you had a large document and you wanted to insert a single space at the beginning of it, then to update the document you would have to go through and offset every single node’s position by one (for the space you just inserted), or else all the source text mappings would be wrong. The fact that you can’t type a single space without revisiting the entire document is problematic for editor responsiveness, to say the least, and it’s a limitation shared by almost every existing parser (though not mine).
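You can see the cost directly in a little sketch (the node shape here is a stripped-down estree-like stand-in of my own, not the real format):

```javascript
// Demonstrating the cost of absolute positions: inserting one character
// at offset 0 forces a visit to every node in the whole tree.

const tree = {
  type: 'Program',
  start: 0,
  end: 10,
  body: [
    { type: 'Identifier', start: 0, end: 4, body: [] },
    { type: 'Identifier', start: 6, end: 10, body: [] },
  ],
};

let touched = 0;

// After inserting `length` characters at `offset`, every node whose range
// reaches the offset must be rewritten: O(document size), not O(edit size).
const shift = (node, offset, length) => {
  if (node.end >= offset) {
    if (node.start >= offset) node.start += length;
    node.end += length;
    touched++;
  }
  for (const child of node.body) shift(child, offset, length);
};

shift(tree, 0, 1); // type one space at the very beginning...
console.log(touched); // 3 -- every node in the tree was rewritten
```

Scale `body` up to the tens of thousands of nodes in a real file and that single keystroke becomes a full-document traversal.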
Sorry, I really don’t mean to come here to hate on other people’s work. It’s just that in OSS I experience a lot of gatekeeping, and never more than when it seems like some idea has already been done. It would seem a universal tenet among computer scientists that “parsers have already been done”, in other words, that there must not be anything seriously interesting left to learn about building parsers.
I contend that this is not true, that parsing as we know it is still far from fulfilling its full potential to serve humanity, and to make that point I have to point out how all existing systems have inherited certain flaws which are so prevalent as to not be considered flaws at all.
Do you have these points about other projects documented somewhere in dedicated space? Like a BABLR page called “Why not ______ ?”
Because for the points to carry weight, they should have a reference and room for discussion, which is hard to achieve in this forum, both because of the plain-text format and because of the topic (sustainability).
I just had the idea that I could describe the difference as being “space parsers” versus “time parsers”. Does that make sense?
The idea is that a space parser is what we have had: a way of creating tree hierarchies on top of some one- or two-dimensional “space”, which is the in-memory text of the program.
What I’m proposing are time parsers, which describe trees by a sequence of actions like “start node”, “end node”, and “emit literal”, which when done in order will constitute a traversal of the particular encoded tree. It’s a well known idea, and I stole it brazenly from HTML, which stole it from something even earlier. What’s especially powerful about it is that it allows you to have many different spatial layouts, so long as you know how to map time-structured traversals onto them. There are many different implementations of the DOM, but they all interoperate through HTML!
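Here’s a small sketch of what I mean, with event names loosely modeled on the description above (they’re mine for this post, not a spec): the tree is described as a stream of events, and replaying the stream in order yields whatever spatial layout you want.

```javascript
// A "time parser" output: the tree as a sequence of traversal events.

const events = [
  { type: 'startNode', name: 'BinaryExpression' },
  { type: 'startNode', name: 'Identifier' },
  { type: 'literal', value: 'a' },
  { type: 'endNode' },
  { type: 'literal', value: ' + ' },
  { type: 'startNode', name: 'Identifier' },
  { type: 'literal', value: 'b' },
  { type: 'endNode' },
  { type: 'endNode' },
];

// One spatial layout: a plain object tree built from the stream.
const buildTree = (stream) => {
  const root = { name: '(root)', children: [] };
  const stack = [root];
  for (const event of stream) {
    const top = stack[stack.length - 1];
    if (event.type === 'startNode') {
      const node = { name: event.name, children: [] };
      top.children.push(node);
      stack.push(node);
    } else if (event.type === 'endNode') {
      stack.pop();
    } else {
      top.children.push(event.value);
    }
  }
  return root.children[0];
};

// Another layout: plain text, recovered from the very same stream.
const buildText = (stream) =>
  stream.filter((e) => e.type === 'literal').map((e) => e.value).join('');

console.log(buildText(events)); // 'a + b'
console.log(buildTree(events).name); // 'BinaryExpression'
```

Two completely different consumers, one shared time-structured description: that’s the same interoperability trick HTML pulls off for the DOM.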
If you have numeric indexes used for random access, then you need some block of allocated memory to point into. Any two in-range indexes then define a subset of the space, which is what nodes do in what I’m calling a “space parser”. It doesn’t require you to emit events, because space parsers are almost universally not streaming parsers: you emit the entire tree once you’ve processed the entire text.
This looks like source maps to me, with non-overlapping nesting, which still requires some nested parsing to “color” regions. I used the wikify lib (on PyPI) to build non-conflicting recursive parsers for markup processing.
I guess what you’re getting at is that if you don’t think of the textual tree as the most important asset, what’s left is just a text transformer. Indeed that is true. This is also a way of building non-conflicting recursive parsers, and it will likely need some (native) format for source maps (which I have not yet created).