My path to becoming part of a data science community was happenstance. I started programming as a graduate student because I wanted to increase the spatial and temporal scales of my analyses: from one location over a few months to the entire coast of western North America over many years. For larger-scale analyses, I needed satellite data, which I downloaded from a data repository. Because clicking through 5 to 10 internet links to manually download hundreds of files was tedious, I learned to use shell scripts to download the data automatically. The satellite data files were large and stored in specialized formats, so I learned how to access and analyze them using R.
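The idea behind those download scripts can be sketched in a few lines of shell (a minimal sketch, not my actual script: the repository URL and file-naming pattern are hypothetical, and the real download command is commented out so the loop only prints what it would fetch):

```shell
# Hypothetical repository URL and file-naming pattern.
BASE_URL="https://example.org/satellite-data"

for year in 2003 2004 2005; do
  for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
    file="sst_${year}_${month}.nc"
    # Print the target instead of downloading it in this sketch.
    echo "would fetch ${BASE_URL}/${file}"
    # curl -s -O "${BASE_URL}/${file}"   # the actual download step
  done
done
```

One loop like this replaces hundreds of manual clicks, which is what made the tedium of the repository's web interface disappear.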
When I first started programming, I would spend hours (sometimes days) trying to figure out a single R command. For example, I would want to find monthly means for my data, but the dates would need to be converted to a different format before I could calculate the monthly means. The process of figuring out the R command for converting date formats would go like this:
I would type keywords such as “date convert R” into a Google search. Then the frustrating sequence would begin: click a search result, read the text, click the back arrow, click the next search result, read the text, click the back arrow, and so on. I would add and remove keywords. I would stare at my screen and type a few of the R commands I had discovered into my R console. In response, R would print an incomprehensible error message. I would desperately try the same commands again, as though by some coding miracle they would convert the dates the next time. I would update R, restart the program, and try again. Then I would go back to Google and eventually find the answer or ask more advanced R coders for help.
Persistence is critical when learning to code. At the most frustrating moments, I thought I was going to be a student forever because my data would never be analyzed. The frustration would quickly give way to euphoria when I finally typed the R command that converted the dates to the correct format for my analysis. The euphoria would then turn into a “the sky’s the limit” feeling of power, because I could now convert dates for any number of data files. Then I would try to write the next line of code, and I would be back to typing keywords into a Google search.
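For anyone facing the same task today, the date conversion and monthly-mean calculation can be sketched in a few lines of base R (a hedged sketch: the data frame and its column names are hypothetical, not my actual dataset):

```r
# Hypothetical data frame with dates stored as character strings.
df <- data.frame(
  date = c("2004-01-05", "2004-01-20", "2004-02-10"),
  sst  = c(10.2, 11.0, 9.5)  # sea surface temperature, degrees C
)

# Convert the character dates to R's Date class,
# then extract a year-month label for grouping.
df$date  <- as.Date(df$date, format = "%Y-%m-%d")
df$month <- format(df$date, "%Y-%m")

# Monthly means via the formula interface of aggregate().
monthly_means <- aggregate(sst ~ month, data = df, FUN = mean)
monthly_means  # two rows: 2004-01 -> 10.6, 2004-02 -> 9.5
```

The `as.Date()` call with a `format` string is the piece I spent so many search sessions hunting for; once the dates are proper `Date` objects, grouping and averaging fall out naturally.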
Through my graduate school experience, I recognized the power of using programming to more efficiently and effectively analyze scientific data. I was talking with faculty in the School of Oceanography at the University of Washington about a postdoctoral position when I saw the announcement for the data science postdoctoral fellowship in the eScience Institute on a mailing list. I was doing a lot of programming, but I was not sure if it counted as “data science” or “big data.” However, I was intrigued by the opportunity, and the faculty in the School of Oceanography encouraged me to apply.
I started my postdoc in the eScience Institute, which was rapidly expanding from a small group of faculty to a much larger organization with graduate students, postdocs, data scientists, and research scientists. I transitioned from observer to participant in the data science community when I started asking questions about how to publish the code from my research projects. I thought publishing my code was important, but publishing it on my personal website did not seem worthwhile because my website was not a long-term archive. A data scientist told me about GitHub (a platform designed for version control and sharing code) and Zenodo (a long-term archive connected to GitHub). We also discussed documentation and software licensing. Within a short period, I revised my code and archived a version on Zenodo as the scientific paper based on that code was published. The publication felt more complete because the code was available.
I learned about the reproducibility working group while asking those questions about publishing my code, and I started attending its meetings. The goal of the group is for researchers to share the entire research process, including the code that produces the results; greater transparency in how results are calculated will increase confidence in scientific discoveries. We are working on initiatives to promote reproducibility at the University of Washington. Through the working group, I learned about Software/Data Carpentry. I started as a helper at a Software Carpentry workshop and eventually became a certified instructor.
Joining the reproducibility working group and becoming an instructor for Software/Data Carpentry helped me become an active member of the data science community. I have met researchers who work on very different subjects but use similar methods, and I enjoy chatting with other instructors about techniques for teaching programming. For someone looking to join a data science community, my recommendation is simple: get involved.
I spend most of my time doing oceanographic research. When I talk with graduate students, postdocs, and faculty in oceanography, I make a point of mentioning the resources and opportunities available through the eScience Institute. If someone asks me about learning Python or R, I steer them toward a Software/Data Carpentry workshop. When an oceanography graduate student was interested in publishing code alongside a publication, I walked the student through the process I use for publishing my own code. Thus far, my conversations have been informal, but I am organizing a more formal presentation that will cover both reproducibility and data science educational opportunities in the environmental sciences. By reaching a wider audience, I hope to grow the connections between the data science and environmental science communities.
Who is in your data science community? Is your community growing? How do you attract new members to your community?
Comment below, and tweet us your thoughts @datacarpentry and @eco2logy.