Open Development of Scientific Software
Revolutionizing research with collaborative software infrastructure
The Revolution of Open Source
Have you ever stopped to think about the fact that the technology we use every day is primarily built on the work of volunteers? The open-source movement has revolutionized the tech industry by allowing individuals and companies to freely access, modify, and distribute source code. But it’s not just about saving a few bucks on software licenses. The open-source movement has fostered a sense of community and collaboration, leading to some of the industry’s most widely used projects.
It all started back in the early days of computing, when software was shared freely among academics and researchers. Fast forward to the 1980s, when the free software movement began to gain traction (the term ‘open source’ itself was only coined in 1998). Prominent characters from that time include Richard Stallman, founder of the Free Software Foundation and the GNU Project (which includes the popular compiler GCC). He argued for a rather strict definition of the movement: anyone may use open source code and even sell it, but developers who distribute software built on it need to make their software open source as well (he authored the GPL license, usually called “restrictive” or copyleft).
It wasn’t until the late 1990s and early 2000s that open-source really started to shake things up. The success of the Linux operating system and the Apache web server proved that open source could be not only viable, but also highly successful. Major companies, such as IBM and Red Hat, saw the potential and began investing in and supporting open-source projects.
Today, open source is everywhere. It’s hard to find a piece of technology that hasn’t been influenced by open source in some way. From mobile operating systems and cloud infrastructure to machine learning and data analysis, open source has become an integral part of the tech industry.
Adoption in Life Science Industries
In this article, I want to give a perspective on open-source development, with a particular emphasis on the view of a software & service provider in the life science industry. While there are many articles about the general adoption of open-source software (and books to evaluate projects for use) and relatively healthy market projections for the open-source service market (e.g. this report), there isn’t as much content about actual contributions in verticals other than software (e.g. biotech or manufacturing).
In my experience, compared to the “tech industry” (which somehow came to mean only electronics & software technology), the adoption of open source principles in other industries is still lagging behind. While getting funding in tech is also difficult and much discussed (good article by James Turner), there are large foundations such as the Apache Software Foundation or the Linux Foundation. The number of people using e.g. a web server utility might make sustainable funding by donations (e.g. through GitHub Sponsors or Open Collective) at least conceivable.
The situation is rather dire in more niche areas, such as open-source software in scientific work. Even though it’s clear that the most recent breakthroughs wouldn’t be possible without it (see this analysis by the Chan Zuckerberg Initiative), most public funding organizations don’t acknowledge its importance.
They’ll fund 50 different groups to make 50 different algorithms, but they won’t pay for one software engineer. — Anne Carpenter
Only recently have the voices become louder about the funding problem of software infrastructure for scientific endeavors. As described, for example, by Adam Siepel or Anna Nowogrodzki, researchers usually have to program new tools as part of their research without getting any recognition or training. As the maintenance work of an open source project increases (even more so if it becomes successful), the scientists might have no other choice than to drop their efforts entirely, as only “pure scientific work” is being recognized and funded.
Luckily, at least in the scientific domain, this is slowly changing. Large private US foundations are creating specific grants, such as the “Essential Open Source Software for Science” grant by the Chan Zuckerberg Initiative. In addition, the National Science Foundation (NSF) created a fitting program called “Pathways to Enable Open-Source Ecosystems (POSE)” (first proposals 2022).
In contrast, in the commercial biotech sector, there is very little activity in funding and contributions to joint open-source efforts. This has its roots in two main factors: a missing focus on software development and a missing sharing culture.
- Missing software development focus: Traditionally, software and IT infrastructure efforts have been seen as a pure cost center rather than expertise that builds a competitive edge. New companies in the space are slowly changing this dynamic with a fresh breed of “techbio” startups, defining software development as an explicit core business activity.
- Missing sharing culture: Intellectual property protection is nowhere as strong as in the biotech sector, especially in drug development. While patents in software are relatively worthless, their value for pharmaceutical companies is enormous, and their creation could be seen as these companies’ “raison d’être”. There are pre-competitive efforts such as the Pistoia Alliance. Still, I would argue that much more is needed to foster the sense that sharing efforts on tools & infrastructure benefits all.
The “vibe” in the field is slowly changing as younger companies such as Colossal build software spin-outs such as Form Bio and members of big pharma such as Roche advocate their use of open source tools, for example Arvados or Camunda.
Strategies for Success
A project is usually started by one or a small number of individual contributors. Examples in the biotech industry include MultiQC by Phil Ewels, PyLabRobot by Stefan Golas, or Poly by Timothy Stiles. The initial push for a new project usually comes from a need arising in different contexts:
- Context of scientific work (indirectly funded by a research grant)
- Context of a project for a client (possibly directly funded)
- Context of product development (usually funded by employer)
The initial work is usually inspiring, and not much thought has to be put into the long-term sustainability of the effort. Unfortunately, after a while, the “cute puppy” phase, as Jacob Thornton beautifully described in his talk, comes to an end, and the project needs to be properly maintained — even more so if it’s successful and an increasing number of users adopt it. You will want to be paid for your efforts at some point, especially if you need to hire more hands to meet demand.
After some years of starting and maintaining SiLA 2 (an open connectivity standard & tooling for scientific instruments & software), as well as observing the “open source in life sciences” space, I can only agree with Aaron Stannard that you can’t rely on donations for a sustainable continuation of your project, but need to have a commercial offering around it, either as an independent consultant or as a company. In his article, Aaron lays out different funding models, which I summarize here:
- Services: Training, consulting, and support sales to users of your software. The Hyve is a cool example of a consulting company built around open-source tools for biology.
- Open core: A free & open core offering with proprietary “enterprise” features for paying customers. Popular features are, for example, access control and audit readiness. GitLab, a software management (git) platform, is a great example.
- Licensing: Setting a free license for non-commercial open-source use and a proprietary license for commercial use of the software. A popular example is the graphical user interface tool Qt.
- Managed Services: Building a managed service using open-source software that can be sold as a Platform or Software as a Service (PaaS or SaaS). This model often gets combined with certain proprietary features to make management easier. A good example in the bioinformatics space is Nextflow Tower.
- Reputation: Instead of selling something directly, a company or individual can use the prestige of contributing to a community project to attract customers or talent to your company. This is currently the operating model for the company I work for: Wega, where the SiLA project has resulted in new customers and new hires (more on this below).
Many developers (including myself) would think that if a piece of code becomes critically important to a company’s success, the company will analyze its dependencies and make sure that the code is sustainably maintained. Unfortunately, it has been proven countless times that this notion couldn’t be further from the truth¹. The security toolkit OpenSSL is a case in point: Only after the publication of the famous “Heartbleed” vulnerability (whose cost has been estimated at USD 500 million) did donations increase from USD 2000 a year to USD 9000, which is obviously nowhere near enough to feed even one developer, let alone provide the resources such a project would need (Source). As always, XKCD illustrates the situation perfectly:
More recently, a vulnerability in the popular Java logging library log4j created a worldwide uproar, as it affected nearly every system while the project had one maintainer with three GitHub sponsors. In the aftermath, Filippo Valsorda wrote about the need to directly pay maintainers for their work in a contractual fashion (Article), and I agree. Still, I fear many procurement departments will not understand this case for a while.
There are counter-examples where communities around projects were created that became so deeply entrenched in the value chain of the companies using them that they received sustainable funding, but it’s usually a long and lonely road. Apart from the famous Linux and Apache projects, Jupyter seems to be doing reasonably well, with significant donations already from Microsoft to its predecessor IPython and later on from public organizations. Another interesting example is the Robot Operating System (ROS) which started as a research project, then became part of a private company, and is now part of the Open Source Robotics Foundation, which receives significant funding from the likes of Amazon, Bosch, or Nvidia.
Open Source and Standardization
Many industries profited from standardizing formats to enable the same solution to be applied in different contexts. Often-cited examples include shipping containers, USB, or, in the laboratory, the SBS microplate specification. These stories motivate the creation of new standards in unexplored areas, often leading to several competing definitions at the start. This situation is commonly mocked with another XKCD comic:
When thinking about what a standard is, I came to appreciate that what we commonly refer to as a standard is some versioned documentation approved by an “independent” committee such as ISO. In contrast, I would argue that, in reality, a standard is just the set of definitions most commonly used in a particular context/use case. Examples include the Amazon S3 API used by competing services, Docker image descriptions being compatible with other container services (e.g. Singularity), or the aforementioned ROS and its messaging interfaces.
The “winning standards” are simply the ones most often used. Although it would be nicer for a user to always have one definition to rule them all, in reality, at least a handful are needed to ensure healthy competition that incentivizes continuous improvement.
Given this insight, any piece of software could be seen as a standard if sufficient adoption is reached. To ensure the continued accessibility and maintenance of these standard pieces of infrastructure, an independent not-for-profit organization is usually created and sustained by membership fees and donations. Examples in the life science industry include Open Microscopy Environment, SiLA, or LADS.
Many standardization efforts fail for several reasons. Some of them can be easily alleviated with an open culture and open-source software as the backbone. Some rules in this regard:
- Membership only for decision-making, not access: Some foundations focus too much on creating incentives for membership by only giving (paying) members access to the definitions and source files. This hinders adoption drastically.
- Implementation code from the beginning: Even before a first definition is released, there need to be (at least) open access (better open source) applications that use the definitions in industry-relevant scenarios to battle-test the concepts.
- Community building: It needs to be easy for interested parties to see and join the latest discussions — easy as in a few clicks, no intro call, and no sign-up forms! Platforms such as in-person events and online forums need to be created for regular exchange between members.
These kinds of rules should be established as early as possible, as profit-seeking companies (without open source principles) will inevitably seek to gain an advantage from an upcoming standard by excluding newcomers.
Benefits of Open Development
I talked about open-source as if it were a given that it’s advantageous, assuming that the impact of past projects speaks for itself. Still, of course, closed development has advantages, such as tighter control² and easier monetization. So what are the concrete advantages of opening up part of your intellectual property?
On the statistical front, you will find many articles e.g. from top consulting firms, such as this one from McKinsey, which show increased innovation in companies that adopt open source. Concretely, I think the main advantages are:
- Community contributions: Users of your software will come up with a variety of features and extensions a single company can’t compete with. A great example is the difference between contributions to the open-source Stable Diffusion versus the closed alternative DALL-E — as shown by Yannic Kilcher.
- Better feedback: When users can analyze the code themselves and tinker with its inner workings, you profit from deeper and quicker feedback on the quality — this can lead to dramatic improvements, especially on the security aspects. An alternative to open-sourcing key components can also be to at least have open access (such as the recent phenomenon ChatGPT, see also the paper on user-centered innovation by Eric von Hippel).
- Reputation: Already mentioned as a funding mechanism, contributing to open source projects can help maintain or create the reputation of being a thought leader in a field. For Wega, for example, its SiLA contributions reinforced its image as an expert for lab digitalization. It also makes the company a much more attractive employer for prospective engineers.
Many business leaders fear making anything open source because they think it just means giving away work for free. This thinking ignores the most valuable asset you retain — expertise. Of course, this asset can also be stolen (i.e. “talent poaching”), but when it comes to talent, you have to live with this risk anyway.
The most crucial aspect, though, is summarized well in this article: “When we share our resources, our work, and our expertise in open source, everyone benefits. But the companies which make the best of it are the ones which actively participate in open-source projects.”
It is best if you embark on an open-source journey without direct profit in mind, but it is good to be clear that it will also help your company’s bottom line in the long run. I encourage you to think about where open-source initiatives could make sense in your business :-)
Acknowledgments
The writing of this article was supported by my employer, wega Informatik AG, and inspired by inputs and discussions in the Bits In Bio community.
Footnotes
[1] The dynamic reminds me of the tragedy of the commons
[2] The somewhat loose development model of open source is well characterized in this seminal article “The Cathedral and the Bazaar”