What is important, and what is not, for bioscience data handling


There is an ongoing discussion among the main bureaucratic players of Swedish science about data storage and data handling for the biosciences in Sweden. The question they are discussing is "who should pay for what, and how should the money be channelled?"

It is exactly the wrong question.

Since all of these actors (except for non-governmental organisations, e.g. KAW) are financed by taxpayers' money, how they decide to finance things is a technical budget issue. There seems to be a strange idea floating around that if one fiddles with the financing paths, the whole problem of data storage will become more manageable. This, to my mind, misses the point entirely.

Instead, we should focus on the primary issues for science:

  • How is science going to be done in the best way? Data handling and storage must be geared to the researcher's needs.
  • What makes life easy for the researcher? What should the basic design of the system be?
  • How do we design the system so that it fits with Open Science (at least at some point in the future)?
  • How do we give scientists the right incentives to store important data, to scrap temporary and unimportant data, and to make data searchable?
  • How do we make data available to the world, in keeping with Open Science principles?
  • How do we ensure that researchers keep only actively used data on expensive, high-performance media, while moving data that is not currently relevant to slow, cheap media?

The big players involved are: Vetenskapsrådet (VR, the Swedish Research Council), the Swedish National Infrastructure for Computing (SNIC), the universities, the various national research infrastructures (e.g. the National Genomics Infrastructure, NGI, where I work) and also the non-governmental funding bodies such as the Knut and Alice Wallenberg Foundation (KAW). They are currently discussing these issues but, it appears, only from the bureaucratic perspective of "who should pay". Fine, they need to do that.

But who discusses what the researchers need? I have so far seen very little discussion of that essential question. And I know that there are unmet needs. We at NGI have received many desperate questions about how to handle the big data sets we produce. We also get questions about how to make certain data sets generally available to the public, data sets that do not fit into the generally accepted international databases. I have first-hand experience with researchers who buy storage from Dropbox and other commercial services. I have no idea if this is in keeping with university policies. A policy is only as good as reality allows it to be. If a university does not provide data storage for its researchers, they will get it elsewhere. I have seen attempts at promoting figshare for Stockholm University, and also, if I remember correctly, box.com, but these are not well known among scientists, their status is unclear, and the strategy is even more opaque.

We need to think about how to fulfil the legal requirement to store data for 10 years. There is the legal aspect, of course, but mainly there is the fundamental issue that taxpayers have a legitimate reason to demand that data produced using their money becomes accessible to them. Today, the responsibility for this lies with the universities, which delegate it to the research groups, which proceed to do what they find practical.

Of course, this leads, in the best case, to many different fragmented solutions. In practice, many groups will be hard pressed to dig out 10-year-old data sets. Computers do not last 10 years, and what is on old hard drives tends not to be brought along to the new computers, especially since the graduate students and postdocs who generated the data are no longer around.

Let's focus on the main problem here: the researcher's problem. And let the bureaucracy find solutions that are appropriate to solve that problem. Do not let the bureaucracy concentrate on finding solutions that are convenient for itself. Only by pure luck would such a solution be the best one for the scientists.