I had a discussion with someone at the Linux Foundation Member Summit after Richard Fontana of Red Hat's talk, and they didn't buy the argument that a model could still be considered open even if its training data wasn't.
But I agree with the pragmatism angle. I'm something of an absolutist about the open source definition as it applies to code. There are clear limits, however, on the degree to which you can open training data in many cases because of privacy and other concerns. This 2023 article I wrote for Red Hat Research Quarterly goes into some of the ways even supposedly anonymized data can be less anonymized than you may think.
Thus, while open training data sets are certainly a good goal, an absolutist position that open models (model weights and code) don't have significant value in the absence of open training data isn't a helpful one, given that we know certain types of data, such as healthcare records, are going to be challenging to open up.
It strikes me that more measured approaches to AI openness that embody principles from the open source definition, such as not restricting how an open AI model can be used, are more practical and therefore more useful than insisting it's all or nothing.
Red Hat CTO Chris Wright recently shared the company's perspective. I encourage you to read the whole piece, which goes into more detail than I will here. But here are a couple of salient excerpts.
The majority of improvements and enhancements to AI models now taking place in the community do not involve access to or manipulation of the original training data. Rather, they are the result of modifications to model weights or a process of fine tuning which can also serve to adjust model performance. Freedom to make those model improvements requires that the weights be released with all the permissions users receive under open source licenses.
This strikes me as an important point. While there are underlying principles as they relate to openness and open source, the actual outcomes usually matter more than philosophical rightness or open source as marketing.
While model weights are not software, in some respects they serve a similar function to code. It’s easy to draw the comparison that data is, or is analogous to, the source code of the model. In open source, the source code is commonly defined as the “preferred form” for making modifications to the software. Training data alone does not fit this role, given its typically vast size and the complicated pre-training process that results in a tenuous and indirect connection any one item of training data has to the trained weights and the resulting behavior of the model.
Model weights are a much closer analog to open source software than the training data is. That's of course not to say that training data shouldn't be opened up where practical. But for the many cases where it can't be, the perfect shouldn't be made the enemy of the good.
To be fair, we could make similar arguments about some of the newer "business open source" licenses that skirt the open source definition, but it's about drawing lines. After all, many large commercial open source products don't release all of their build system code, test suites, and other information that would be needed to reliably reproduce the delivered binary. Nonetheless, very few people object to calling the final product open source so long as its source code is under an approved open source license.
Steven Vaughan-Nichols also tackles this topic over at ZDNET in Red Hat's take on open-source AI: Pragmatism over utopian dreams.