Image: DALL-E

Is Open Source AI a sham?

March 11, 2024

In the world of tech, regulations and definitions often lag behind emerging realities. The advent of content moderation on social media platforms necessitated a reliance on the decades-old Section 230 policy. Similarly, the General Data Protection Regulation (GDPR) modernized the Data Protection Directive of 1995, which dated from a time when internet cookies were not yet a significant concern on the infant web. Today, in the age of generative AI, the concept of 'open-source AI' presents new challenges. The traditional definition of open source, tailored for software from the 1990s and 2000s, no longer suffices for today's AI systems. Generative AI encompasses more than just software code: developing and deploying large AI models requires extensive resources in terms of computing power, data, and human effort. The Open Source Initiative (OSI), which sets the standards for open-source definitions, acknowledged last summer the urgent need for a unified understanding of what 'open' means in the context of AI systems. Since then, it has initiated a vibrant debate on the definition of open-source AI and has begun drafting a preliminary version. By the year's end, it aims to publish the first official version of this definition. Let's dive into these struggles.

Why is this important?

While the open-source community and commercial enterprises were once sworn enemies, today open source stands on the shoulders of tech giants. This shift has led to skepticism among some observers, as the advocacy for open source by big tech companies could be seen as a form of openwashing: a strategy to appear committed to open-source principles without fully embracing them. Historically, open source has been praised for its ability to stimulate competition and innovation, uphold community values, and reduce production costs. Much of the software on our laptops is, in whole or in part, the product of open-source efforts, and it's widely acknowledged that the foundation of the internet is largely built on open source, albeit today with proprietary software layered on top. The critical issue now is whether leading AI firms genuinely support open-source principles and aims, or whether they leverage this narrative for their own benefit. While a simplistic answer might suggest it's a mix of both, a deeper exploration of this topic is warranted.

What are the characteristics of openwashing?

Let's begin by highlighting the main concerns raised by skeptics. Firstly, adopting open-source strategies may be a tactical move to attract the open-source community and create a dependency among users and developers. Given the scarcity of talent, drawing developers into a specific ecosystem is crucial. Secondly, gaining control over platforms and setting standards can be economically advantageous. Large AI corporations might foster innovation by releasing smaller, open-source models, only to later incorporate the most successful projects or refined applications into their proprietary, high-cost flagship models. Thirdly, cultivating an image of openness and branding AI models as open source could potentially lead to regulatory exemptions, presenting an appealing facade. There has been intense lobbying by technology companies in the debates around the European AI Act, with the argument that open models should be exempted because they contribute to a more democratic, innovative, and competitive AI landscape in Europe.

Who is claiming to be open in AI?

The revelation that ‘OpenAI’, the creator of ChatGPT, is not an open AI company may not come as a shock anymore. What began as a not-for-profit organization committed to open-source ideals has since transitioned into a for-profit enterprise backed by Microsoft, and GPT-4, the model behind its flagship product ChatGPT, is far from being open source. However, in contrast to OpenAI's closed business model, Meta has explicitly declared its Large Language Model (LLM) ‘Llama 2’ open source, offering it for public download and limited commercial use. Google has in turn released the ‘Gemma’ model as its smaller open-source alternative to the closed flagship model ‘Gemini’. Aware of the current discussion, Google is careful about calling it open but does not refrain from doing so. In the EU, the European competitor Mistral has launched the open-source model 'Mixtral'. Initially, the company branded itself very explicitly as an open company. However, it appears to be reevaluating this stance. In this context, it's important to recall that over the past decade, all major machine learning innovators such as DeepMind, OpenAI, and Anthropic were either ‘assimilated into’ or ‘became partners’ of (I leave that choice to the reader) big tech companies: Google, Microsoft, and Amazon, respectively.

Then how do we evaluate their open-source AI models?

Let’s see how the OSI defines (traditional) open source. It uses ten criteria, which boil down to things such as free redistribution, transparency, integrity of the source code, reusability of derived works, and no discrimination against persons or groups or restrictions on fields of use. Accordingly, the OSI swiftly communicated its stance following Meta's launch of Llama 2 last year: the model does not qualify as open source. While the OSI acknowledges Meta's effort in reducing barriers to access, the model falls short of the stringent criteria necessary for the “open source” title. Meta's license for Llama 2 imposes usage restrictions on commercial entities with very large user bases (exceeding 700 million monthly users), which keeps its large tech rivals from gaining an advantage, and it precludes certain applications. The OSI's concerns also extend to the lack of transparency, particularly since Meta does not furnish the datasets or disclose detailed information about the development. Licensees receive the code but remain in the dark about the intricate training processes behind the model. Because the traditional definition was not written with such aspects in mind, the OSI is now in the process of drafting a new definition for AI.

In the meantime, researchers and experts have suggested different approaches to evaluating Open Source AI. A study from last year (Solaiman, 2023) proposed a gradient for assessing openness in generative AI, rather than a binary open-or-closed classification. According to this spectrum, Llama 2 ranks toward the more open end because the model is available for download. However, it does not achieve complete openness because of the license restrictions and limited transparency described above. Accordingly, we could call most of the models listed above open access instead of open source. This gradient approach is one of the options for evaluating Open Source AI and, as we can see, it assesses current models quite positively. Others have adopted a more critical viewpoint.

Source: The Gradient of Generative AI Release, Solaiman, 2023

Towards a broader approach of Open Source AI

By embracing new criteria like granting deeper insights into the workings of the model (beyond just the source code) and removing usage restrictions, leading AI companies could improve their standing in terms of openness. For instance, they could earn their ‘open source’ credentials by offering smaller models that might meet all the renewed open-source AI criteria. But does this automatically mean they are fostering the open-source community? To properly assess the openness of modern AI models, we must consider the entire AI system ecosystem. A more holistic approach is therefore to view generative AI models not just as a mathematical model or piece of software, but as a very complex ecosystem. An open-source ecosystem transcends merely licensing open software.

In an informative study, Widder, Whittaker, and West (2023) summarize what this perspective on open source implies for the openness of generative AI systems. They highlight an important point: regardless of the AI model's openness, its deployment invariably ties users and developers to the mainly closed ecosystems of big tech corporations, which reinforces their dominance and lets them benefit from the users' dependency. In today’s AI world, ‘openness’ and concentration of power can easily go hand in hand. The crux of the issue lies in the control and accessibility of critical resources for the training and deployment of AI models, categorized by the researchers into computational power and resources, software development tools, data practices, and labor.

Access to computational resources is perhaps the biggest problem that undermines the democratization or openness of generative AI. Securing computational resources is a crucial aspect of AI development, which unfolds in two stages. The initial phase involves a significant financial outlay for developing and training a generative AI model. Beyond this initial investment, the demand for computational power persists in the second phase, as running a foundation model for its user base is very costly as well. Moreover, even if you had deep pockets, there is currently a notable shortage of specialized hardware, such as Nvidia chips. This is leading to intense competition among major tech firms to procure the computing hardware needed to train and run AI models. Access to such computational capacity is largely controlled by a few dominant industry players. Even if you have succeeded in developing or fine-tuning an AI model independently of the leading big tech AI firms, you will likely end up using their cloud services. Recognizing this bottleneck, there is a growing emphasis on creating and managing LLMs that are more efficient in terms of computational usage or make use of edge computing, yet still maintain high (enough) performance, which we will discuss below.

Next, the authors discuss the software development tools. Just as with conventional software, the construction of AI models relies on open-source tools and frameworks. TensorFlow and PyTorch are the open-source industry standards in machine learning. However, TensorFlow is a platform initially developed internally by Google, and PyTorch, while now governed by the Linux Foundation, continues to receive financial support from its creator Meta. Such affiliations provide these tech behemoths with strategic benefits, allowing them to direct AI development towards norms and standards that seamlessly integrate into their comprehensive suite of AI products, much like fitting together modular building blocks.

When it comes to data, similar dependencies and asymmetries pop up, according to the researchers. Despite the existence of openly available datasets like the Pile and Common Crawl, transforming them into a usable dataset for AI deployment requires extensive labor. Foundation models are trained on vast, meticulously curated datasets, demanding a laborious data governance process to filter out irrelevant information such as noise, unwanted content and computer code. Additionally, the intensive labor of data labeling, which adds context and meaning to training and evaluation datasets, is primarily conducted on for-profit platforms like Amazon Mechanical Turk and Sama. In the generative AI world, there are no off-the-shelf AI datasets.

Moreover, the researchers argue that labor in AI extends beyond data curation and labeling. It also includes the nuanced work of model calibration through reinforcement learning, which involves human feedback to guide the model towards outputs that align with human preferences. Finally, the post-deployment phase involves content moderation to ensure trust and safety, done by cheap outsourced labor, as well as maintaining and further developing the large foundation models, which requires highly talented and skilled engineers.

This comprehensive view uncovers the layers of complexity behind the notion of open-source AI models, revealing how the intricate web of resources, labor, and corporate interests challenges the genuine spirit of openness in the AI domain. If you want to make the current wave of generative AI truly open, you need a full-stack approach, democratizing all layers and also proactively supporting open alternatives in their capacity to develop and run AI models. Yet the current generative AI ‘race to the bottom’ points in the opposite direction, leading to a concentration of computing power, resources, talent, and critical software within a few dominant tech giants. Because all the breakthroughs of the past years essentially boiled down to scale and resources, the democratization of AI through openness seems a farce.

Conversely, what should our realistic expectations be? Are we perhaps clinging to an overly idealistic notion of openness? This leads us to a complex discussion, as the interpretation of 'openness' in open source varies widely—from a strict economic perspective to a more expansive view that includes social and democratic ideals such as data sovereignty, data solidarity, autonomy, and social equality, values we usually cherish when we say ‘open society’. Particularly concerning the wider interpretation, the generative AI boom and its potential to further centralize power within big tech represent a considerable threat.

Bigger isn’t always better

The above issue is especially acute with the colossal generative AI models that can have as many as one trillion parameters. Therefore, a promising direction is the creation of smaller models, let’s say smaller than ten billion parameters (this is an arbitrary number of course, but an important criterion could be that it can run locally on CPUs instead of specialized GPUs).
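To make this size threshold a bit more tangible, here is a rough back-of-the-envelope sketch of how much memory is needed just to hold a model's weights. It is a simplification under stated assumptions, not a benchmark: the function name `weight_memory_gb` is made up for illustration, the bit widths (16, 8 and 4 bits per parameter) are common storage formats chosen as examples, and the estimate ignores everything else a running model needs, such as activations and caches.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes).

    Assumption: every parameter is stored with the same number of bits.
    Runtime overhead (activations, caches, framework buffers) is ignored.
    """
    total_bytes = num_parameters * bits_per_parameter / 8
    return total_bytes / 1e9


# Illustrative sizes taken from the text: ten billion and one trillion parameters.
models = [("10 billion parameters", 10e9), ("1 trillion parameters", 1e12)]

for label, params in models:
    for bits in (16, 8, 4):  # assumed storage formats, for illustration only
        gb = weight_memory_gb(params, bits)
        print(f"{label} at {bits} bits per parameter: about {gb:,.0f} GB")
```

Under these assumptions, a ten-billion-parameter model needs roughly 20 GB at 16 bits per parameter and about 5 GB at 4 bits, which is within reach of a well-equipped laptop or desktop. A one-trillion-parameter model needs about 2,000 GB at 16 bits, which only a cluster of specialized hardware can hold.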

In the past years, the major breakthroughs in large language models have come from bigger size and more data, but many now argue that the practical future of generative AI might be smaller. Bigger isn’t always better. Smaller models are somewhat easier to understand (very relative of course) and, more importantly, they are cheaper to deploy and potentially safer if they run on local servers or devices instead of the cloud. They may not be groundbreaking, but they suffice for modest tasks. This is crucial, because using colossal models like GPT-4 for simple prompts is overkill: every time we ask GPT-4 to perform a simple task, such as writing a summary or a different title for an article, we use a sledgehammer to crack a nut. AI is intended to boost efficiency, but we now burden the system with inefficiencies by using big models for small tasks. This argument also highlights critical aspects of the googling versus gen AI debate. If Google moved all Google searches to its power-demanding premier model, Gemini, this would cause an enormous (and unnecessary) increase in energy use.

Overall, smaller models might reduce reliance on Big Tech for development and deployment, and they are more energy efficient. Smaller, miniaturized or fine-tuned models could be a moderate answer to some of the open-source issues posed by large, energy-intensive, and expensive models. But to all of them?

A minor philosophical reflection on openwashing and closedwashing

As a layman, I cannot fully judge the potential of smaller models with regard to Open Source AI. It sounds promising, but it is important to remain skeptical as well. The strategy of combining proprietary large models with freely available smaller open-source ones could be a powerful move for big tech in the coming years. To put it another way, a smart approach to dominating AI development, as nobody in the industry would dispute, always involves a significant degree of openness. Today, there seems to be a very beneficial synergy between being open and closed, tapping into the strengths of each.

On one side, leading AI firms can validate their proprietary models by playing into the prevalent fears about AI risks. They position themselves as guardians, suggesting that keeping AI under wraps is a wise and responsible act. The argument that unrestricted access could lead to misuse by malicious actors holds some truth. Yet, this safety-centric narrative can sometimes lead to an overly simplistic or trivialized concept of what constitutes responsible or safe AI.

On the other side, these companies may release smaller open models to the public, claiming to encourage 'innovation and competition.' Yet, similar to how responsibility is often narrowed down to mere safeguards, this approach frequently interprets the benefits of openness strictly through an economic lens. This involves focusing on the development of happy consumers and happy businesses that gain from cost-effective, efficient AI services to maximize their own utility. This perspective is overly simplistic as well. The economic, cultural and political implications of Open Source AI extend far beyond just facilitating happy consumers.

Consequently, this may result in a cunning approach that combines aspects of closedwashing ('look at us, we are so safe and responsible and would never let this AI fall into the wrong hands') and openwashing ('look at us, together with ‘the community,’ we have contributed to all these innovative, affordable, and cutting-edge AI tools for everybody'). Regrettably, I view this as a likely direction for many leading AI companies in the coming years.

To wrap it up, the trend of Open Source AI needs to be examined both from internal and external perspectives. Internally, the effect of open-source AI models on the ethos and goals of the open-source community is ultimately for its members to evaluate. Externally, the close link between openness principles and Big Tech's AI strategies deserves a broader audience's attention. This issue extends beyond just safety and the economics of open source; it begins with a reflection on AI's essential role in future society—a conversation we all play a vital part in.