Eva Maxfield Brown & Boris Veytsman on OSS Dependencies in the Sciences

Tina_Arboleda · June 7, 2024, 2:04pm

Eva Maxfield Brown & Boris Veytsman dive into their paper “Biomedical Open Source Software”, ‘Nebraska’ packages and broader software sustainability.

Listen at Sustain Episode 236: Eva Maxfield Brown & Boris Veytsman on OSS Dependencies in the Sciences

RichardLitt · June 7, 2024, 2:43pm

This paper was good. Came out of the CZI hackathon, with @andrew.

abitrolly · June 20, 2024, 1:24pm

We found a dense core of popular packages that receive many mentions (e.g. ggplot2 in CRAN, tophat in PyPI and limma in Bioconductor), some of which have many dependencies themselves (e.g. ggplot2).

tophat in PyPI, really? Abandoned tar.gz file from 2012 with no repository and no references. Who was reviewing this paper?

abitrolly · June 20, 2024, 1:39pm

There are several different Python package managers with greatly overlapping dependency graphs. We used PyPI for this study. Some competitors like pip and Vonda are also worth investigating.

OMG. PyPI is not a package manager - it is a package repository. pip is a package manager that uses PyPI. There is conda, not Vonda.

abitrolly · June 20, 2024, 1:40pm

To bring it back to the sustainability track: how much does this work cost?

RichardLitt · June 20, 2024, 3:09pm

Are the issues you’re seeing indicative that the work itself should be redone, or that the conclusions they draw are inaccurate?

What do you mean by how much it cost? How much did the research cost? Quite a bit - it was funded by CZI through an in-person workshop.

abitrolly · June 20, 2024, 7:59pm

The issues I am seeing indicate that the work is not reviewed, and people don’t know what they write about. The conclusions cannot be trusted.

Yes, I want to know exactly. Maybe next time CZI should allocate some funds for an independent review.

RichardLitt · June 20, 2024, 8:13pm

It isn’t reviewed. It’s an arXiv preprint. It seems like you’ve pointed out some small issues - it’s not clear to me that this negates the work as a whole.

Did you listen to the podcast? Both Eva and Boris, and @andrew, know what they’re talking about.

abitrolly · June 21, 2024, 3:28am

I hope this forum exists to solve sustainability problems, and not as a platform to advertise a podcast that fails to address real problems like the core-js situation.

To discuss sustainability it is important to have real data about how much money is available for open source folks and how it is distributed. If it is only for people doing research, only for people with specific connections - fine, but how much do they actually get, how many people are backed up, and who is left behind?

Saying that arXiv is for unreviewed low quality papers is the same as saying that GitHub contains a lot of bad code. It surely does, but we are not gathered to promote the bad. Have you read the paper yourself? What did it find in your opinion?

RichardLitt · June 21, 2024, 1:12pm

I also hope it exists to solve sustainability problems. I was asking because I was the host for that podcast – I had a whole discussion with Boris and Eva about it. Yes, I’ve read it.

The Chan Zuckerberg Initiative isn’t set up like the Sovereign Tech Fund - it’s not a slush fund for open source maintainers. Their mandate is to look into how to support the ecosystem as a whole, particularly for open science for healthcare. This is my take on it, in any event. I was excited that they made a workshop looking into dependency issues. This paper was one of the things that came out of that workshop.

Personally, I don’t particularly like that this work has to be privately funded, because our governments haven’t figured out how to fund open source, and because private allocation of wealth leads to the super rich choosing where to allocate funds. On the other hand, I’m overjoyed that some of these people have started setting up philanthropic programs looking into issues that affect the public, like open source sustainability. I don’t think it’s productive to say that their money isn’t allocated correctly when it comes to whether it goes to maintainers or to research - the source of the money itself is already an allocation problem. Maybe that’s just me, though. If you have tangible suggestions for how to better allocate funds coming out of large philanthropies, while also only having the power of a member of the public, I’d be interested in hearing them.

I think the paper is interesting, because I think it is one of the first looks at a database of open source code at scale. I learned about centrality metrics, particularly Katz, from this paper. I agree with the authors that it is only a beginning. “It would be interesting analyze common workflows for different disciplines, perhaps using co-occurrences of mentions, and map them into the dependence graph. This might help to discover packages important for specific sub-fields of biomedical sciences. Adding temporal dependencies to our graph my help to discover and predict the development trends.” they write, in their conclusion. I think this would be interesting. This won’t solve all of the problems in open source - but it would help inform how we think about the ecosystem as a whole.
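For readers who have not met Katz centrality before, here is a minimal sketch of the idea on a toy dependency graph. It is not the authors' code or their data: the package names, the `alpha` and `beta` values, and the edge direction (dependent points to dependency) are all assumptions chosen for illustration.

```python
def katz_centrality(edges, alpha=0.1, beta=1.0, iterations=100, tol=1e-9):
    """Iteratively compute Katz centrality for a directed graph.

    edges: (dependent, dependency) pairs, read as "dependent depends on dependency".
    Every node starts from a base score `beta`; each dependent then passes
    `alpha` times its own score on to the package it depends on, so indirect
    dependents also count, but with geometrically less weight.
    """
    edges = list(edges)
    nodes = sorted({name for edge in edges for name in edge})
    scores = {name: 0.0 for name in nodes}
    for _ in range(iterations):
        updated = {name: beta for name in nodes}
        for dependent, dependency in edges:
            updated[dependency] += alpha * scores[dependent]
        delta = max(abs(updated[name] - scores[name]) for name in nodes)
        scores = updated
        if delta < tol:
            break
    return scores


# Hypothetical packages, invented for this example.
toy_edges = [
    ("app_a", "plot_lib"),
    ("app_b", "plot_lib"),
    ("plot_lib", "core_lib"),
]
scores = katz_centrality(toy_edges)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

On an acyclic graph like this one the iteration settles after a few passes. On graphs with cycles, `alpha` has to be small enough (below the reciprocal of the adjacency matrix's largest eigenvalue) for the scores to converge.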

I didn’t say that arXiv is for unreviewed low quality papers. I said that it was added into arXiv, and that it hasn’t been published elsewhere yet - it is unreviewed. I don’t think your points about Vonda, pip, and PyPI invalidate their findings, and I think the authors knew what you mentioned - it looks like a scribal error, to me. Perhaps @andrew or one of the authors could weigh in on that.

abitrolly · March 3, 2025, 6:34am

Yes, wealth allocation after taxes is the problem. And when the government pays for research, scientists are not allowed to spend these funds on software maintenance. When working on COVID folding, folks at https://foldingathome.org/ paid their sole software developer out of their own pockets. Even after big names like NVIDIA jumped in, there is no evidence that the whales gave money. The software is not completely open source, but there are open source parts of it that were needlessly outdated for 2020. You can dig into the story to make a reference if you want.

There is a recent book specifically about Capital Allocation https://allobook.gitcoin.co/ but it is too high-level for me to digest. Maybe I should use NotebookLM to convert it into an ELI5 podcast…

For metrics, there is a recent talk by the Homebrew creator about OSS that mentions CHAI - a DB of dependencies that also ranks them - https://www.youtube.com/watch?v=JmECGDrbTxU

A dependency graph is a necessary component for a “mk shredder” - a machine that you throw money at, and it shreds the money, converting it into mana that could be used to power employment.

Giving everybody a sub-divisible token to help build their own value (and “I know this guy”) graph also helps to make the dependency story personal. But here is what science folks excel at - they can actually create a methodology/gameplay to organize deps, much like they organize paper references.

Just throwing in my two cents.
