15

Data from a published study was deposited on the Harvard Dataverse in accordance with the replication policy of a scientific journal. I have no connection with the scholars who collected the data and published the study, but I am conducting related work, so I downloaded the "replication" data and also used it to conduct some new analyses. I have now found some interesting results, and since my new tests are not part of the original publication, I would like to publish them in new articles (I will also use some original data that I have collected).

Question: is it appropriate to use data collected by other researchers and made public for replication purposes in order to provide new scientific insights? Should I ask permission to use the replication data for a new publication? In particular, should I contact these authors and, possibly, agree to data restrictions or co-authorship requests?

Existing questions cover related cases: the same author(s) re-using old data, scanning data from an existing publication, replication (apparently without extensions) used to connect with the author(s), and a database that becomes available upon publication.

000andy8484
  • 3
    One requirement of federally funded research is that the researchers must have a plan to make the data available. That doesn't always mean publishing the data or providing a PUF (public use file) that can be easily downloaded, but doing so meets the requirement for a plan. The data you found is likely a consequence of that requirement. The purpose, of course, is to let anyone use the data. – David Smith Oct 27 '21 at 17:53

2 Answers

39

Yes, that's not only appropriate, but -- next to replication -- a main purpose of publishing data. Not being able to build on published data would greatly limit the accumulation of knowledge and lead to wasteful duplication of data collection efforts.

Of course you must cite the data source. Perhaps you should even acknowledge its authors beyond this citation, especially if you have solicited any useful feedback on your work from them, which is something you should try. They know the limitations and "peculiarities" of their dataset better than anyone, and there might even be potential for cooperation.

henning
  • 12
    Beware, of course, of the problem of forming a hypothesis only after you know the outcome of a particular sample and then using that sample to claim you have results valid for the population from which the sample was drawn. – Buffy Oct 26 '21 at 13:04
  • 2
    @Buffy I'm not sure how that relates to using or not using existing datasets, to be honest. Surely you could but shouldn't hypothesize ex-post in both cases? – henning Oct 26 '21 at 13:51
  • Just a warning to the OP here about the proper use of samples. – Buffy Oct 26 '21 at 13:54
  • 1
    @henning To me the relation is obvious. If I collect some data, I will need funding (in advance) to pay for it. I am going to try to convince my sponsor that my hypothesis is worth funding. Before I collect any data, I have committed to my hypotheses. There may be some interesting non-hypothesized outcomes in my data. Other than the hypotheses I pitched to my sponsors, who is to say what is ex-post and what is not? The temptation is strong to pretend one formed one's hypotheses before and not after analyzing the data. – emory Oct 26 '21 at 21:26
  • 1
    @emory So then the temptation for ex post hypothesizing is greater if I don't reuse an existing dataset. Or do I misunderstand you? Perhaps you are saying, to the contrary, that pitching a hypothesis to the funder is similar to registering a hypothesis before the analysis? (And this almost-registration is lacking if I reuse existing data?) – henning Oct 26 '21 at 21:29
  • @henning Exactly. Since there is no hypothesis registration before the analysis, it is entirely on your honor to not hypothesize ex-post. – emory Oct 26 '21 at 22:10
  • @emory Okay, but you could of course also preregister before analysing existing data, and I'm not sure most grant applications really commit you to particular hypotheses. So while what Buffy says is true, it really applies regardless of reusing existing data. – henning Oct 27 '21 at 07:07
  • 3
    @emory That’s a purely theoretical argument. In practice, just as henning hypothesises, the opposite seems to be the case: people are reluctant to just throw away data they generated at great expense if it doesn’t support their stated hypothesis, so they start trawling it for insight. — Besides, hypothesis-driven research isn’t the only valid approach. Exploratory analysis is a common, and valid, alternative. – Konrad Rudolph Oct 27 '21 at 09:11
  • 2
    @KonradRudolph Exploratory Data Analysis is good. Exploratory analysis to form hypotheses which are then tested against the same data set (while pretending to have not done the exploratory analysis) is bad. – emory Oct 27 '21 at 11:38
  • 1
    A partial solution to the issue @Buffy and emory raise (and a necessary step) is that it should be very clear that the data used in the subsequent papers are the same dataset used for the original one. That's true whether the paper is written by the original author(s) or new ones, as this answer already states. – Bryan Krause Oct 27 '21 at 22:44
3

The data supporting scientific research is by default public. Hiding or withholding it must not be considered the default. So, you're really asking us:

is it appropriate to use data collected by other researchers and not withheld from the public (which would be strange and inappropriate) in order to provide new scientific insights?

Yes, of course it is, why wouldn't it be?

Just like when I cite a paper, I don't care whether the author published it to help humanity or to brag about their achievements - I also don't care what the official excuse is for doing the default, obvious and necessary thing, which is publishing the data.

einpoklum