Dependency Tree Node Weights

Recently, we’ve been discussing how best to weight each package that shows up in a dependency tree of an Open Source project. Some discussion can be found here.

Background

Each time an open source project is published, installed, or maintained, a list of “dependencies” is installed. Dependencies were revolutionary: instead of having to bundle my code along with everyone else’s code that I reference, I can simply denote that my project “imports package X”. This “dependency” is noted in a package manifest file. The benefits were manifold. First, it meant my package code was smaller: all runtime dependencies would be installed on the fly, and the code I wrote stayed isolated. Second, it meant I was now referencing someone else’s package, so if they improve their package and bump the version number, I can accept the updated version of the dependency and benefit from the improved code at no extra work to me.

Now that we’re attempting to fund Open Source, the question arises as to “what percent of a donation does the core package deserve” vs “what percent of a donation does a dependency package deserve”. We refer to this relationship as “weighting” i.e. how much “weight” does a dependency have on the entire “mass” of a project.

In other words, if I own project A, and project A depends on project B and project C, and project A receives a donation of 10 dollars, how much does project A deserve, and how much should be shared with project B and project C?

In practice

At Flossbank, we’ve settled on an initial solution where, if project A received 10 dollars and depends on projects B and C as top-level dependencies, it splits the donation evenly with them, i.e. projects A, B, and C would each receive 3.33 dollars.

Now this gets slightly confusing when you introduce a real world example of nested dependencies.

Sometimes this means not only nested dependencies, but cycles as well. Package managers don’t care about cycles because they only care about “did we install every package referenced”; they won’t get into recursive loops. In our case, however, we want to divvy up donations in the most equitable way, so even if we’ve seen a package before, we continue going down the tree until a set epsilon. You can see our code here.

In order to distribute donations, we first craft a “weight map”. At each level of the dependency tree, Flossbank splits a package’s weight with that package’s top-level dependencies.

For example, if we give package “none” a weight of 100, we would then split the 100 equally with its top-level deps (see circled top-level deps). This means “none” would receive 1/7 of the 100 weight (splitting the 100 equally with its 6 top-level dependencies).

On the next traversal, we would see that “commonwealth” received 1/7 of the 100 weight (14 weight) and would split commonwealth’s 14 weight equally with its three dependencies (in, the, and pennsylvania). Meaning commonwealth would then keep 1/4 of the 14 weight it received.

We then continue down the tree, at each level splitting the mass of that level equally with its dependencies.
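For concreteness, here’s a minimal Python sketch of that traversal. This is illustrative pseudocode rather than our production implementation (which is linked above); the sibling packages d2 through d6 are hypothetical placeholders standing in for the other five circled top-level deps of “none”.

```python
from collections import defaultdict

def build_weight_map(pkg, deps_of, weight=100.0, epsilon=0.0001, weight_map=None):
    """Split `weight` equally between `pkg` and its top-level deps,
    then recurse into each dep with its share."""
    if weight_map is None:
        weight_map = defaultdict(float)
    deps = deps_of.get(pkg, [])
    share = weight / (len(deps) + 1)  # +1: the package keeps one share itself
    weight_map[pkg] += share          # a package accumulates weight at each occurrence
    if share > epsilon:               # stop descending once shares are negligible
        for dep in deps:
            build_weight_map(dep, deps_of, share, epsilon, weight_map)
    return weight_map

# The "none" example from the post: six top-level deps, so "none" keeps 100/7,
# and "commonwealth" keeps 1/4 of its 100/7 share.
deps_of = {
    "none": ["commonwealth", "d2", "d3", "d4", "d5", "d6"],
    "commonwealth": ["in", "the", "pennsylvania"],
}
wm = build_weight_map("none", deps_of)
```

Because each package splits whatever weight flows into it with its own deps, the weights at any level always sum back to the weight that entered that level.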

Why

We started with this solution for a few reasons:

  • Simplicity: There is ample opportunity to make a confusing weight structure, where leaf nodes are heavier, nodes with more code are heavier, and various other heuristics. We would love to dive into those, but we wanted to start with something simple.
  • Equality: Who is to say how important a dependency is to the package pulling it in? After much discussion, we’ve found that if a package is going to be pulled in, it deserves its share of the weight. After all, it must be contributing value in order to have been pulled in, otherwise the original author would have written and maintained the code themselves.
  • Frequency vs. depth: We tried to find a solution that accounts for both how frequently a dependency appears in a tree and how deep in the tree it appears. In our solution, a package can appear many times at many different depths, accumulating “weight” at each occurrence. These occurrences are summed into a total “weight” for that package.

Discussion

We’d love to start a discussion around various other algorithms for weight distribution. This is just one example, but we’d love to explore models that amplify leaf nodes, amplify parent nodes, or any others. Please share your distribution algorithm ideas with some possible execution notes as well as reasons why you think it’d be good for the Open Source ecosystem.

Also if you have any questions or comments on our initial solution, please share!

Thanks


Hi! This is very interesting. Have you considered learning from the process of economic planning in socialist countries? I feel like jumping right in and crafting an algorithm to distribute funds among FOSS maintainers could very easily lead to large inequalities between certain types of projects. Given many countries around the world (Cuba, Vietnam, China, etc.) have used economic planning fairly heavily, I bet there’s a lot to learn from their experiences.

Thanks for sharing this. Sort of late to this conversation, but are there published ways to systematically map dependencies in an ecosystem? As people ask different questions against such a map, I haven’t seen clearly defined ways by which maps are built. Any pointers would be much appreciated.

I definitely think it’s worth checking out. There are many algorithms that can definitely lead to inequality if they’re based on incorrect heuristics.

We settled on our simple “divide equally” solution primarily for the following reason:

If someone pulls in a package as a dep, they are implicitly saying “I value this package enough to pull its code in”. If that’s the case, then donations should be split with them. If that’s not the case, the author need not pull the package in and split the donation with them. If I need my kitchen remodeled, I can hire a contractor to do the work and maintain it, but I then owe them money. Or, I can choose to do the work myself, but that means I’m on the hook for the labor + the maintenance. That’s kind of how we see this distribution model.

What do you think? Are there any economic models from the countries you mentioned that might translate well to package weighting?

Hey Matt, yea we didn’t find many either, primarily I think because no one wants to dive into cycles. Package managers stop as soon as they see a dupe, since they’ve already fetched all the deps of dupes. This doesn’t help us because we actually want to dive into cycles, and I’ll explain why below.

Here is a little example of a cycle, and how we treat it.

A
├── B
└── C
    └── A   (cycle back to A)

now, if the above dep tree was given a weight of 10, and we didn’t dive into cycles, the result would be A: (1/3 * 10) + (1/2 * 3.33 (from A being a dep of C)) = 3.33 + 1.67 = 5

The result for B would be 3.33 and the result for C would be 1.67. And we would stop when we see the first cycle of A.

Result: A: 5, B: 3.33, C: 1.67

Now let’s see how it looks if we dive into cycles.

On the first pass we’d get the result above: A: 3.33 + 1.67, B: 3.33, C: 1.67. But we just revisited A, so now let’s go into A’s deps again to split the 1.67 that was given to A from C. We’d split 1.67 three ways again, so A would keep .56 of it, B would get .56, and C would get .56.

At this point A would have 3.33 + .56 and not 3.33 + 1.67; already we’re seeing the impact of going through cycles.

Let’s keep playing this out. C splits its .56 with A, keeping .28 and passing .28 back: A: 3.33 + .56, B: 3.33 + .56, C: 1.67 + .28

Now we reach the root node again, with .28 to distribute: A: 3.33 + .56 + .09, B: 3.33 + .56 + .09, C: 1.67 + .28 + .05

Total at this point: A: 3.98, B: 3.98, C: 2.00. Each pass hands A back one sixth of what it redistributed before, so the totals converge to A: 4, B: 4, C: 2.

Just on a quick glance, you can see the difference that diving into cycles makes. In our algorithm, we choose an “epsilon”, which is the smallest fraction we’ll divide by; that’s how we stop the recursive cycle. Our epsilon is very tiny, something like .00001, and can be modified. We make it as small as runtimes allow in order to get the most accurate weighting distribution.
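The passes above can be collapsed into one recursive function. This is a hypothetical Python sketch (our real implementation, linked earlier, differs), where any amount smaller than epsilon is simply kept by the package holding it rather than split further, so the full donation is always conserved. With a tiny epsilon, the A/B/C cycle settles near A: 4, B: 4, C: 2.

```python
def distribute(pkg, deps_of, amount, totals, epsilon=1e-6):
    """Recursively split `amount` equally between `pkg` and its deps,
    diving into cycles until the shares drop below epsilon."""
    deps = deps_of.get(pkg, [])
    share = amount / (len(deps) + 1)
    if share <= epsilon:
        # too small to keep splitting: the current package keeps the remainder
        totals[pkg] = totals.get(pkg, 0.0) + amount
        return
    totals[pkg] = totals.get(pkg, 0.0) + share
    for dep in deps:
        distribute(dep, deps_of, share, totals, epsilon)

# The cycle from the tree above: A -> B, C and C -> A
deps_of = {"A": ["B", "C"], "C": ["A"]}
totals = {}
distribute("A", deps_of, 10.0, totals)
```

Because every call keeps one share and passes one share to each dep, the totals always sum to the original 10, no matter where epsilon cuts the recursion off.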

If you can find any other weighting algorithms or articles, that would be awesome! I know this one is definitely flawed at various levels, so optimizing would be great.

This needs a visualization and annotation framework that actually animates how balances travel down the graph, to outline and compare different weight strategies. Without one, the exposure for these pretty awesome support mechanics may stay pretty limited.


Hi Everyone!

First of all, I commend you for taking on a vexing problem like dependencies. Please take my comments here as food for thought, based on the premise that I have considered similar phenomena before, and also recognize software dependencies are a different beast than those phenomena. First, early in my research career I recognized the significance of weighting relationships on all forms of network analysis if the goals include understanding the most significant nodes … in this case other software, often referred to as packages.

This post follows this flow:

  1. The kinds of “things” and their “relationships” need to be weighted if we are to use math to process data and reach conclusions that are grounded in what is actually happening
  2. To accomplish that, there are particular, specific ways you need to build the computational models to weight things usefully, after you name all the most meaningful things.
  3. WHAT we name the “things” and the “relationships” betrays our biases and influences our analysis. What we call something is a reflection of how we think about it, and if we don’t critically think about all these “things” and “relationships”, the likelihood that we build something useful goes WAY down.

(Background: I was a software engineer for 10 years, and have been an academic for 11. As I have learned in therapy with my 14-year-old daughter [one of a set of twins], saying “and” he’s an academic is better than “but” he’s an academic. May you never have 14-year-old twins yourself during a pandemic.)

In my analysis of “things like dependencies” I realized early that weighting requires us to identify “the different types of relationships”, and to treat them differently. Dependencies in software, I might argue, include your software, the software you depend on, and software that depends on you. Those are three different categories of “artifact”. I also have strong empirical evidence, and a general belief based on studying this sort of thing for a decade in open source software, social media, online health forums, and online learning (none of that research is currently being applied by any virtual schools I have seen, btw), that another “noun” enters the fray: “The People”. The result is that all connections are between “artifact - artifact”, “artifact - person”, and “person - person”.

Dependencies are connections, but unlike straight-up code, they come in several categories, so this rather complex image is simple in comparison to the dependency problem:

If you have trouble sleeping, here’s the entire paper: https://www.seangoggins.net/wp-content/plugins/zotpress/lib/request/request.dl.php?api_user_id=655145&dlkey=H9SR6RHZ&content_type=application/pdf


We then built computational models around this idea of enumerating specifically what the things in a particular network “are” in this paper (again, insomniacs, this is a GOLD MINE), and ran a few models using data from my collaborators on the Mylyn project. Here’s a short outtake:

From this paper: http://pubs.seangoggins.net/sites/default/files/USER735.pdf

To my third point, what we name things betrays the limits of our world view. If we do not question how we think about dependencies in open source software, and what we call its parts, we are likely to be missing something BIG. Academics say things like “your ontology (words) constrains your epistemology (what you believe about the thing you are studying, who your influences are, etc.)”.


Platform, tools, and practice also complicate things, because the introduction of dependencies through different platforms for git can follow different paths. Here’s a now oldish and brief summary of how basic relational concepts found expression online, circa 2014: [image]

Full insomnia cure: http://pubs.seangoggins.net/sites/default/files/GogginsPetakovic.pdf

In short, I applaud what you are doing, and think you are absolutely on the right track. My intention here is to provide a sort of catalog so we can choose what to choose, and what not to choose. Unlike my prior work understanding various types of networks, this one has a lot of money, AND a lot of human passion on the table. I would like to contribute to us building something that respects open source, its contributors, and the technology we use to pull this all together.


Thanks to Neil Chue Hong for pointing this thread out to Arfon Smith and me. It is very much like some work I proposed in 2013 and then worked with Arfon on how to implement in 2014 and 2015 - see https://doi.org/10.6084/m9.figshare.791606.v3 and https://doi.org/10.5334/jors.be for the original idea, which I named “transitive credit”, and https://doi.org/10.6084/m9.figshare.1234063.v2 and https://doi.org/10.5334/jors.by for work with Arfon on how to implement it. Note that this was inspired in part by software dependencies, but also by dependencies of research on other research more generally. In particular, the poster with Arfon (https://doi.org/10.6084/m9.figshare.1234063.v2) has a figure that is quite similar to the figure in this discussion thread.


Hi Dan/All,

I’m going to read those papers (again) right away! Transitive credit and weighted credit are, I think, possibly distinct perspectives on the discussion, but in my experience the best models are those that integrate several different models!

Best,

Sean P. Goggins, Professor, Computer Science, University of Missouri. URL: http://www.seangoggins.net

Founding Board Member: http://chaoss.community (for open source health metrics) [Sloan Foundation]; http://www.augurlabs.io (Research Lab) [Sloan Foundation, National Science Foundation]; http://mhs.missouri.edu (for Mission Hydro Sci!) [i3 & IES]

“Design is what you do when you don’t [yet] know what you are doing.” – George Stiny, Professor of Architecture, Massachusetts Institute of Technology

“The power of Open Source is the power of the people. The people rule.” — Philippe Kahn

I am looking for the practical outcome from this thread, and as an active proponent of https://portal.paperswithcode.com/ I am obliged to ask: did your paper work produce any open source demos, like a Jupyter Notebook, that can immediately demonstrate the ideas to those who are not used to the poetry of scientific language? )

I think your model is a great starting point to begin playing with the concept. There are lots of questions about how do you value the labor of individual contractors, does demand determine the value of that labor (even when redistribution is minimal), what about other things such as social good (so who are the people using this resource), “quality”, maintainability, transferability, etc.

In terms of economic models and resources, the first thing that comes to mind is the work that Stafford Beer did around the Viable System Model. The socialist government of Chile (prior to being overthrown in a fascist coup) had a program called Cybersyn which tried to use Beer’s work to help with economic planning and modeling aspects of their economy. This work is all a part of a field of study called Cybernetics, which has its roots pretty firmly here in the UK. A more recent work which covers a lot of the history on this subject would be The Cybernetic Brain by Pickering.

I’ve also seen a fella by the name of Patricio Julian Gerpe who has done a bunch of work in this area with co-operatives and might be worth reaching out to.

@abitrolly what is a practical outcome for you on this? I’m sincerely asking to determine if you are looking for papers, software, metrics. The angle that is happening in the CHAOSS project right now is the development of metrics first. Thinking through what dependencies can mean for different people.

Here are some thoughts, inline, based on the dev and test branches of the Augur software that is part of CHAOSS

Sean

Nolski (Mike Nolan), March 5:

I think your model is a great starting point to begin playing with the concept. There are lots of questions about how do you value the labor of individual contractors, does demand determine the value of that labor (even when redistribution is minimal), what about other things such as social good (so who are the people using this resource), “quality”, maintainability, transferability, etc.

  1. Value of labor: We use the https://github.com/boyter/scc project to calculate labor value. It uses a COCOMO algorithm. It’s not perfectly accurate, but it is consistent, as you likely already know.
  2. Social good: obviously there are many ways to measure that, some harder to gather than others. Dependencies seem like the most easily obtained, as clone data is hard to get, and downloads are not readily available. We are taking ideas here.
  3. “Quality”: CHAOSS has a number of process metrics implemented, and test coverage is on the roadmap. What other suggestions might you have?
  4. Maintainability: what do you think? Code complexity as measured in https://github.com/boyter/scc is one indicator. I think active development on a repository may be another.
  5. Transferability : Can you clarify?

In terms of economic models and resources, the first thing that comes to mind is the work that Stafford Beer did around the Viable System Model. The socialist government of Chile (prior to being overthrown in a fascist coup) had a program called Cybersyn which tried to use Beer’s work to help with economic planning and modeling aspects of their economy. This work is all a part of a field of study called Cybernetics, which has its roots pretty firmly here in the UK. A more recent work which covers a lot of the history on this subject would be The Cybernetic Brain by Pickering.

I’ve also seen a fella by the name of Patricio Julian Gerpe who has done a bunch of work in this area with co-operatives and might be worth reaching out to.

This part I am following and do not have any questions or answers for; I think these are harder questions, but the more transparent the underlying assumptions and models are, the more utility the solutions will have.


Hi @abitrolly :

The work we have been doing on CHAOSS, which is more recent and less published (We have our first paper clearly connecting the metrics and tooling work of CHAOSS at SoHEAL 2021) applies the Augur software that is part of CHAOSS. In the earlier work I reference, I was less attentive to open sourcing the underlying code, though it all exists in GitHub repositories. I think what could be useful given the imperfection of that arc is to try to illustrate network (dependency) based node weights using ideas from the earlier code (written in R), implemented in Augur or GitHub - chaoss/augur-community-reports: A set of Jupyter Lab Notebooks and Other Implementations of Community Reports in Standard Form (which is written in Python).

Candidly I need a few weeks to start on that, though.


The practical outcome is coded demonstrations of weight distribution and assignment algorithms. Brainstorming what weights could be would be interesting for me after the MVP for weight assignment is already done. Then everybody could independently iterate on their own preferred scheme and share the outcomes, rather than asking what the outcomes can be. I don’t mind discussing weight ontology in a separate thread, though, but here the practical result would be testing a few algorithms for distributing the initial donation, as a coded function + visualization.

Anatoli, could you help me with visualizations if I showed you our initial breakdown of distributions for various projects? I’m not great with visualizations

Just for others’ googling ease, I believe this is what you were referring to, correct? GitHub - chaoss/wg-value: CHAOSS Value Working Group

I was thinking something akin to bus factor, how easy would it be for someone else to take over maintenance if the current maintainer can no longer do so (perhaps this overlaps with quality).

Yeah, it’s definitely the more complex part, but I think looking at how large-scale planning happens is more likely to succeed than modelling and automating something we don’t understand. CITATIONS NEEDED, but I believe when Chile used cybernetic modelling, it was used more as a method of informing the planning department. Whereas when the Soviet Union tried to employ similar technology in GOSPLAN (the Soviet planning department), it played a much more central role and led to lots of gaming of the system.

While I think this is a fantastic project, I also think it’s important to understand these things more deeply before allowing people’s livelihoods to be dependent on it, you know?