Reason number 1: reproducibility helps to avoid disaster
Reason number 2: reproducibility makes it easier to write papers
Reason number 3: reproducibility helps reviewers see it your way
Reason number 4: reproducibility enables continuity of your work
Reason number 5: reproducibility helps to build your reputation
Who?
Whom do we need to share with?
collaborators
peer reviewers & journal editors
broad scientific community
the general public
For research to be reproducible, the research products (data, code) need to be publicly available in a form that people can find and understand.
What?
Catalog the artifacts you produced this morning
What needs to be published?
What does not need to be published?
Anything that cannot be published?
Activity outcomes
share? YES!
starting data set (raw data)
metadata
data cleaning steps
analysis scripts
source code
readme
share? maybe?
processed / cleaned data
intermediate results
share? NO!
confidential (e.g., patient) data
material already published
pre-existing restrictive license
passwords, private keys
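Passwords and private keys are easy to overlook when they sit inside scripts or config files. The following is a minimal sketch, not a complete scanner, of a pre-publication check. The function name `find_possible_secrets` and the two patterns are assumptions made for illustration; a real project would need its own list, and a clean result does not prove the directory is safe to publish.

```python
import re
from pathlib import Path

# Illustrative patterns only (assumed for this sketch); extend for your project.
SECRET_PATTERNS = [
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    re.compile(r"(?i)\b(password|passwd|secret|api_key)\b\s*[:=]\s*\S+"),
]


def find_possible_secrets(root):
    """Return (relative path, line number) pairs for lines matching a pattern."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if any(pattern.search(line) for pattern in SECRET_PATTERNS):
                hits.append((path.relative_to(root).as_posix(), number))
    return hits
```

Running `find_possible_secrets("my_project")` (a hypothetical directory name) before uploading gives a list of lines to inspect by hand.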
Advice: One way to determine what you need to publish is to go through and redo the analyses in your paper. Make note of the data, code, and notes you needed to do that analysis. Make sure all of that is available. This might seem time-consuming, but it ensures that what you think you did is what you actually did.
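One way to keep track of what a re-run of the analysis actually needed is a manifest of files with checksums. The sketch below is an illustration under stated assumptions: the helper names `build_manifest` and `missing_or_changed` are hypothetical, and SHA-256 is simply one reasonable choice of checksum.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file under root (relative POSIX path) to its SHA-256 hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def missing_or_changed(manifest, root):
    """List manifest entries that are absent from root or whose contents differ."""
    current = build_manifest(root)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

Building the manifest when the analysis is redone, and checking it against the directory you are about to publish, shows whether anything the analysis relied on is missing or has changed since.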
You can make your code and data public at any point of the research process.
However, at the point of paper submission, the results in your paper should be reproducible, and therefore the data and code used in the paper should be published.
Journals now often require it
Lets the editor and the reviewers accurately review the paper
There are options for publishing that keep things private, visible only to reviewers, until the paper is published
Only some of these options are archival, meaning they commit to retaining your data and products for long periods of time. This is an important consideration, depending on your funder’s requirements.
how to choose?
is there a domain specific repository?
what are the backup & replication policies?
is there a plan for long-term preservation?
can people find your materials?
is it citable? (does it provide DOIs)
is your purpose archival, sharing or publication?
what goes where when?
You will likely have different artifacts:
Rmarkdown
source code
other documentation
raw data
derived data
Possible workflow:
develop data & code on GitHub
upon publication
share markdown on RPubs
archive a snapshot of data in Dryad
code snapshot to Zenodo
University libraries try to help
Libraries often have good resources for data management plans, as well as information about and access to repositories. They are particularly good at thinking about data archives.
Librarians are very helpful and super awesome! They’re a great resource.
how to share, publish: file formats
Do’s
Open source file formats
Text file formats (.csv, .tsv, .txt)
Don’ts
proprietary file formats (.xls)
data as PDFs or images
data in Word documents
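As an illustration of the plain-text option, here is a minimal sketch that writes tabular records to a .csv file using only the Python standard library. The helper names `write_csv` and `read_csv`, and the column names in the usage note, are assumptions made for this example.

```python
import csv


def write_csv(path, rows, fieldnames):
    """Write a list of dicts to a UTF-8 CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    """Read the file back as a list of dicts; every value comes back as a string."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `write_csv("measurements.csv", [{"sample_id": "s1", "mass_g": 1.5}], ["sample_id", "mass_g"])` produces a file that any text editor, spreadsheet, or statistics package can open. Note that CSV stores no type information, so units and data types belong in the metadata or README.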
how to share, publish: standard data formats
Using standard data formats is sometimes required, but even when it’s not, conforming to standards greatly increases opportunities for re-use and understanding.
how to share, publish: checklist
top-level README that describes the data or software package
GitHub will automatically link to a CONTRIBUTING file when someone opens a new issue or pull request
Documenting your research (in pairs)
collect all of the to-be-archived artifacts from the preceding lesson into a directory
write a README file that describes the contents of the directory
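A README is easier to start from a skeleton that already lists every file. The sketch below is one possible approach, not a required format: the function name `readme_skeleton`, the file name `README.txt`, and the "TODO" placeholder wording are all assumptions made for illustration.

```python
from pathlib import Path


def readme_skeleton(root, title):
    """Return README text with a title and one line per file, awaiting descriptions."""
    root = Path(root)
    lines = [title, "", "Contents:"]
    for path in sorted(root.rglob("*")):
        # Skip the README itself so re-running does not list it.
        if path.is_file() and path.name != "README.txt":
            relative = path.relative_to(root).as_posix()
            lines.append(f"- {relative}: TODO describe this file")
    return "\n".join(lines) + "\n"
```

The returned text can be saved as `README.txt` in the directory, with each placeholder then replaced by a sentence on what the file contains and how it was produced.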
The Open Definition sets out principles that define “openness” in relation to data and content. It makes precise the meaning of “open” in the terms open data, open content, and open source:
“Open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness).”
“Open data and content can be freely used, modified, and shared by anyone for any purpose”
Waiving copyright
CC0 enables scientists, educators, artists and other creators and owners of copyright- or database-protected content to waive those interests in their works and thereby place them as completely as possible in the public domain, so that others may freely build upon, enhance and reuse the works for any purposes without restriction under copyright or database law.
Dryad requires CC0
Dryad’s use of CC0 to make the terms of reuse explicit has some important advantages:
Interoperability: Since CC0 is both human and machine-readable, other people and indexing services will automatically be able to determine the terms of use.
Universality: CC0 is a single mechanism that is both global and universal, covering all data and all countries. It is also widely recognized.
Simplicity: There is no need for humans to make, or respond to, individual data requests, and no need for click-through agreements. This allows more scientists to spend their time doing science.
[…] in the scholarly research community the act of citation is a commonly held community norm when reusing another community member’s work.
Community norms can be a much more effective way of encouraging positive behaviour, such as citation, than applying licenses. A well functioning community supports its members in their application of norms, whereas licences can only be enforced through court action and thus invite people to ignore them when they are confident that this is unlikely.
licenses are legal instruments
Licenses, copyright, and terms of use are complicated issues.
There are legal implications to your choices.
Citation is a professional norm in science.
We have good systems for ensuring proper citation.
Would you try to sue someone in court who fails to cite you properly?
Keep it simple by choosing the least restrictive license possible
Let scientists do science without having to talk to lawyers.
Challenges and concerns about publishing data and code
Discussion
What are some of the challenges of publishing research products? What are some of the concerns that people have?