Public big data infrastructure

Distillation of Who Owns Big Data? (2014) by Michael Nielsen:

  • big data is useful
  • requires big infrastructure, different algorithms than small data
  • “open data” is publication of small data; complements but not the same thing as open big data (which includes infrastructure; Nielsen uses the term public data infrastructure)
  • only a handful of closed entities (e.g., NSA, Google) have access to really big data infrastructure, used for their own purposes
  • public data infrastructure would be good both so above entities are not so hugely advantaged and so that benefits can be obtained from publicly interested, collaborative really big data research and applications
  • prefers public data infrastructure be provided by non-profits, arguing that for-profit provision leads to co-option and lock-in, and government provision is bad at experimenting and failing
  • Wikimedia and OpenStreetMap are hopeful examples of non-profit infrastructure, but they aren’t big data: their datasets can easily be processed on a single machine; Google datasets are a million times larger
  • both non-profits and non-profit funders need to encourage more risk and tolerance of failure in order for public data infrastructure efforts to have a chance

Complement with the (inaccurately titled) Socialize the Data Centers! (2015) interview with Evgeny Morozov:

You think the fundamental choice is between two different kinds of ‘big data’ world—one run by private companies such as Google and Facebook, the other by something like the state?

I’m not saying that the system should be run by the state. But you would have at least to pass some sort of legislation to change the status of data, and you would need the state to enforce it. Certainly, the less the state is involved otherwise, the better. I’m not saying that there should be a Stasi-like operation soaking up everyone’s data. The radical left notion of the commons probably has something to contribute here. There are ways you can spell out a structure for this data storage, data ownership, data sharing, that will not just default to a centrally planned and run repository. When it’s owned by citizens, it doesn’t necessarily have to be run by the state.

It seems to me that “ownership” of data is not a helpful frame. Intellectual property regimes (both copyright and sui generis) only increase co-option, lock-in, and commodification. That kind of ownership should be abolished. Privacy, data collection, and data sharing policies and practices seem worth discussing without being clouded by the concept of data ownership. As does control of and practices concerning running big data infrastructure, up to ownership of the relevant tangible property such as data centers.

Nielsen makes a fairly compelling case for public big data infrastructure, though urging the non-profit sector to take more risks seems like a weak action plan.

First, the problem is more subtle than aversion to failure. In some ways, the non-profit sector is extremely tolerant of failure, is even built for it: a project that does something and then winds down after a grant or few is considered acceptable. But that’s failure when it comes to provision of public infrastructure. The simple answer to that would seem to be requiring projects to work towards financial sustainability from the beginning, but that is complicated too, frequently pushing projects toward acting like for-profits, co-opting their supposed public infrastructure, etc. Improving the capacity of the non-profit sector to provide infrastructure is certainly valuable, but I wouldn’t count on it.

Second, lack of capability to experiment and fail is the only reason Nielsen gives against government-run public big data infrastructure. If the non-profit sector has that same problem, why prefer it to government? Is the non-profit sector more reformable? Probably by some small increment, but to this observer it seems there is vastly more energy directed at government IT innovation and reform (even if that often means small open data) than at all sorts of reform of the non-profit sector. There might be other reasons to favor non-profit provision over government provision, e.g., fear of “a Stasi-like operation soaking up everyone’s data” mentioned by Morozov (except of course that such operations exist already).

That said, I’d be most comfortable with a project intending to serve as public big data infrastructure operating as a non-profit, both because plausible forerunners exist and because it is relatively straightforward to commit a non-profit organization to that mission. If starting such a thing with risk-tolerant and pivot-allowing funding is what Nielsen has planned (surely a plausible reading of his article), I hope he is wildly successful at it.

While a dedicated attempt to build public big data infrastructure would be very exciting, such may emerge from a publicly-interested project that gets huge for some other purpose. Google and Facebook didn’t start with the intention to build private big data infrastructure; that was an outcome of providing wildly successful web search, advertising, and social networking products. I suggest a very broad approach will be needed to ensure a public big data infrastructure emerges, is competitive, and beneficent:

  • Enumerate a set of requirements for public big data infrastructure, probably including: 100% open source software, all public data is open data (covering both permission and access), private data can be exported or expunged by persons concerned, perhaps some security best practices, mechanisms for non-public and in some cases public sharing of private data (e.g., with researchers upon consent), limitations on the collection of unneeded private data, mechanisms to encourage reproducibility, mechanisms to obtain data that ought to be public but is not (FOIA-like), some form of open governance (doubtless I’m missing several things and lots of nuance, this is off-the-cuff; I am sure of the first two though; added: among other similar things see Principles for Open Scholarly Infrastructures (previously) and constitutionally open services)
  • Government and non-profit funders require all IT projects funded or procured to comply with aforementioned requirements
  • Ensure that public big data infrastructure is free to operate without being subject to third party intellectual property constraints, incrementally by requiring open source/open data, commons-favoring reform exempting commons-based production from such constraints, and finally abolition of copyright and the like for data.
  • Encourage and allow without restriction private measures to restrict data collection, e.g., encryption, ad blocking, and decentralized messaging
  • Consider for what purpose, when appropriate, and how to regulate remaining closed big data infrastructure (big proprietary platforms and military agencies as mentioned, but also financial institutions, perhaps others?)

In order for any of this to happen in a significant way, I suspect that:

  • some people committed to existing free/libre/open movements will have to be convinced of the value of public big data infrastructure; presently each has reasons to ignore, e.g., free/open source software generally concerned with creating software, not with running services; open data often does not demand open source infrastructure and does not confront privacy issues that public big data infrastructure would need to
  • open projects of various sorts need to be more ambitious (perhaps that’s what Nielsen had in mind regarding taking risks) such that it is possible they become huge and thus play some public big data infrastructure role; there’s nothing wrong with itch-scratching, but let’s have a rash of world liberation; to start with consider how Wikimedia and OpenStreetMap could be more ambitious
  • people with some kind of existing “open” commitment are not enough; policy makers and wonks as well as the general public need to be convinced that public big data infrastructure is something worth demanding, both as customers and citizens

Other issues not mentioned above:

  • What about data centers and physical hardware? I suspect it is fine and most efficient for most of them to be run by for-profits, but much more like the Open Compute Project is needed to keep barriers to entry low; government and other publicly interested data center users could demand open hardware and operational transparency from data centers; one paper suggests close linking to demands for environmental sustainability
  • What about data collection devices, i.e., soon every object?
  • Does massive data collection by individuals and individual devices blur the line between small and big data, and what are the implications for big data public infrastructure? Nielsen notes that a driverless car generates about 1 gigabyte per second of data about its environment.
  • What about universities and libraries as public big data infrastructure? Haven’t billions been spent on cyberinfrastructure? Shouldn’t these institutions have the needed orientation (caring for creation and dissemination of knowledge as well as privacy, and intellectual freedom generally)? I’m surprised Nielsen didn’t mention these actors/projects, and wonder why.
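
To get a feel for the driverless car figure mentioned above, here is a rough back-of-the-envelope sketch. Only the 1 gigabyte per second rate comes from Nielsen; the function names, the decimal unit conversion (1 TB = 1000 GB), and the choice of one hour and one day of continuous operation are illustrative assumptions.

```python
# Rough data volumes for a source emitting about 1 GB/s,
# the rate Nielsen cites for a driverless car.
GB_PER_SECOND = 1        # from the article
SECONDS_PER_HOUR = 3600


def gigabytes_generated(hours, gb_per_second=GB_PER_SECOND):
    """Gigabytes produced over `hours` of continuous operation (hypothetical helper)."""
    return hours * SECONDS_PER_HOUR * gb_per_second


def to_terabytes(gigabytes):
    """Convert gigabytes to terabytes using decimal units (assumption: 1 TB = 1000 GB)."""
    return gigabytes / 1000


if __name__ == "__main__":
    for hours in (1, 24):  # illustrative durations, not from the article
        tb = to_terabytes(gigabytes_generated(hours))
        print(f"{hours} h of operation: about {tb:.1f} TB")
```

At that rate a single device produces terabytes within an hour, which is one way to see how individual devices could blur the line between small and big data.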

I find these topics interesting for a number of reasons, but two base ones:

  • The more useful/the more of a competitive advantage big data is, the more public big data infrastructure is a corrective to WPIO
  • Huge organizations committed to and reliant upon commons-based production are necessary to obtain a potent constituency that favors commons-based production and is allergic to intellectual freedom restricting regimes; public big data infrastructure requires just such organizations

Addendum: As mentioned above (and briefly by Morozov, but not by Nielsen) Google and Facebook’s private big data infrastructures are in part the result of (certainly sustained by) those entities being huge advertising platforms. It seems a direct way to try to create and sustain public big data infrastructure and common advertising is to create a commons-based and commons-favoring advertising platform.