Eva Maxfield Brown & Boris Veytsman on OSS Dependencies in the Sciences

Eva Maxfield Brown & Boris Veytsman dive into their paper “Biomedical Open Source Software”, ‘Nebraska’ packages and broader software sustainability.

Listen at 🎙 Sustain Episode 236: Eva Maxfield Brown & Boris Veytsman on OSS Dependencies in the Sciences


This paper was good. Came out of the CZI hackathon, with @andrew.

We found a dense core of popular packages that receive many mentions (e.g. ggplot2 in CRAN, tophat in PyPI and limma in Bioconductor), some of which have many dependencies themselves (e.g. ggplot2).

tophat in PyPI, really? Abandoned tar.gz file from 2012 with no repository and no references. Who was reviewing this paper?

There are several different Python package managers with greatly overlapping dependency graphs. We used PyPI for this study. Some competitors like pip and Vonda are also worth investigating.

OMG. PyPI is not a package manager - it is a package repository. pip is a package manager that uses PyPI. And there is conda, not Vonda.


To bring it back to the sustainability track: how much does this work cost?

Are the issues you’re seeing indicative that the work itself should be redone, or that the conclusions they draw are inaccurate?

What do you mean by how much it cost? How much did the research cost? Quite a bit - it was funded by CZI through an in-person workshop.

The issues I am seeing are indicative that the work was not reviewed, and that people don’t know what they write about. The conclusions cannot be trusted.

Yes, I want to know exactly. Maybe next time CZI should allocate some funds for an independent review.

It isn’t reviewed. It’s an arXiv preprint. It seems like you’ve pointed out some small issues - it’s not clear to me that they negate the work as a whole.

Did you listen to the podcast? Both Eva and Boris, and @andrew, know what they’re talking about.

I hope this forum exists to solve sustainability problems, and not as a platform to advertise a podcast that fails to address real problems like the core-js situation.

To discuss sustainability it is important to have real data about how much money is available for open source folks and how it is distributed. If it is only for people doing research, or only for people with specific connections - fine, but how much do they actually get, how many people are supported, and who is left behind?

Saying that arXiv is for unreviewed low quality papers is the same as saying that GitHub contains a lot of bad code. It surely does, but we are not gathered to promote the bad. Have you read the paper yourself? What did it find in your opinion?

I also hope it exists to solve sustainability problems. I was asking because I was the host for that podcast – I had a whole discussion with Boris and Eva about it. Yes, I’ve read it.

The Chan Zuckerberg Initiative isn’t set up like the Sovereign Tech Fund - it’s not a slush fund for open source maintainers. Their mandate is to look into how to support the ecosystem as a whole, particularly for open science for healthcare. This is my take on it, in any event. I was excited that they made a workshop looking into dependency issues. This paper was one of the things that came out of that workshop.

Personally, I don’t particularly like that this work has to be privately funded, because our governments haven’t figured out how to fund open source, and because private allocation of wealth leads to the super rich choosing where to allocate funds. On the other hand, I’m overjoyed that some of these people have started setting up philanthropic programs looking into issues that affect the public, like open source sustainability. I don’t think it’s productive to say that their money isn’t allocated correctly when it comes to whether it goes to maintainers or to research - the source of the money itself is already an allocation problem. Maybe that’s just me, though. If you have tangible suggestions for how to better allocate funds coming out of large philanthropies, while also only having the power of a member of the public, I’d be interested in hearing them.

I think the paper is interesting, because I think it is one of the first looks at a database of open source code at scale. I learned about centrality metrics, particularly Katz, from this paper. I agree with the authors that it is only a beginning. “It would be interesting to analyze common workflows for different disciplines, perhaps using co-occurrences of mentions, and map them into the dependence graph. This might help to discover packages important for specific sub-fields of biomedical sciences. Adding temporal dependencies to our graph may help to discover and predict the development trends,” they write in their conclusion. I think this would be interesting. This won’t solve all of the problems in open source - but it would help inform how we think about the ecosystem as a whole.
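For anyone curious what the Katz metric actually measures on a dependency graph, here is a minimal sketch - the toy graph and the alpha/beta values are mine, purely illustrative, not taken from the paper:

```python
# Toy illustration of Katz centrality on a tiny "dependency" graph.
# Edge (a, b) means "a depends on b"; a package scores higher the more
# direct and (damped) indirect dependents it has.

def katz_centrality(nodes, edges, alpha=0.1, beta=1.0, iters=50):
    """Fixed-point iteration of x_i = beta + alpha * sum of x_j over
    packages j that depend directly on i (converges for small alpha)."""
    score = {n: beta for n in nodes}
    for _ in range(iters):
        score = {
            n: beta + alpha * sum(score[src] for src, dst in edges if dst == n)
            for n in nodes
        }
    return score

# app1 and app2 depend on core; core depends on base.
nodes = ["app1", "app2", "core", "base"]
edges = [("app1", "core"), ("app2", "core"), ("core", "base")]
scores = katz_centrality(nodes, edges)
ranked = sorted(nodes, key=scores.get, reverse=True)
# core ranks first: two direct dependents outweigh base's one direct
# dependent plus the damped indirect contribution from app1/app2.
```

The damping factor alpha is what makes indirect dependents count for less than direct ones, which is why a hub like ggplot2 with many direct reverse-dependencies scores so highly in this kind of analysis.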

I didn’t say that arXiv is for unreviewed low quality papers. I said that it was posted to arXiv, and that it hasn’t been published elsewhere yet - it is unreviewed. I don’t think your points about Vonda, pip, and PyPI invalidate their findings, and I think the authors knew what you mentioned - it looks like a scribal error to me. Perhaps @andrew or one of the authors could weigh in on that.