Materials by: Jeffrey Hollister, Alexander Duryee, Jennifer Bryan, Daisie Huang, Ben Marwick, Christie Bahlai, Owen Jones, Aleksandra Pawlik
In the previous lesson we discussed how to export Excel file formats into
*.csv. Whilst Comma Separated Value files are indeed very useful allowing for easily exchanging and sharing data.
However, there are some significant problems with this particular format. Quite often the data values themselves may include commas (,). In that case, the software which you use (including Excel) will most likely incorrectly display the data in columns. It is because the commas which are a part of the data values will be interpreted as a delimiter.
For example, our data could look like this:
species_id,genus,species,taxa AB,Amphispiza,bilineata,Bird AH,Ammospermophilus,harrisi,Rodent-not,censused AS,Ammodramus,savannarum,Bird
AH,Ammospermophilus,harrisi,Rodent-not,censused the value for taxa includes a comma (
If we try to read the above into Excel (or other spreadsheet programme), we will get something like this:
The value for ‘taxa’ was split into two columns (instead of being put in one column
D). This can propagate to a number of further errors. For example, the “extra” column will be interpreted as a column with many missing values (and without a proper header!). In addition to that, the value in column
D for the record in row 3 (so the one where the value for ‘taxa’ contained the comma) is now incorrect.
If you want to store your data in
*.csv and expect that your data may contain commas in their values, you can avoid the problem discussed above by putting the values in quotes (“”). This example data file applies this rule so the actual data looks like:
species_id,genus,species,taxa "AB","Amphispiza","bilineata","Bird" "AH","Ammospermophilus","harrisi","Rodent-not censused" "AS","Ammodramus","savannarum","Bird" "BA","Baiomys","taylori","Rodent" "CB","Campylorhynchus","brunneicapillus","Bird" "CM","Calamospiza","melanocorys","Bird" "CQ","Callipepla","squamata","Bird" "CS","Crotalus","scutalatus","Reptile" "CT","Cnemidophorus","tigris","Reptile" "CU","Cnemidophorus","uniparens","Reptile"
This original file does not contain commas in the values.
But let’s see what would happen if we introduced a comma into
"Rodent-not censused" - so that it looks like this:
csvformat) and reopen it in Excel.
However, if you are working with already existing dataset in which the data values are not included in “” and but which have commas as both delimiters and parts of data values, you are potentially facing a major problem with data cleaning.
If the dataset you’re dealing with contains hundreds or thousands of records, cleaning them up manually (by either removing commas from the data values or putting the values into quotes - “”) is not only going to take hours and hours but may potentially end up with you accidentally introducing many errors.
Cleaning up datasets is one of major problems in many scientific disciplines. The approach almost always depends on the particular context. However, it is a good practice to clean the data in an automated fashion, for example by writing and running a script. Other lessons in Data Carpentry covering shell, Python and R will give you the basis for developing skills to build relevant scripts.
When working with tabular data on the command line, using tabs as delimiters can make spreadsheets much easier to read. You may also choose to use tabs if you expect your data to contain commas. Here is an example of the same data in comma and then tab separated format:
Comma Separated (.csv)
species_id,genus,species,taxa AB,Amphispiza,bilineata,Bird AS,Ammodramus,savannarum,Bird CM,Calamospiza,melanocorys,Bird CQ,Callipepla,squamata,Bird CS,Crotalus,scutalatus,Reptile
Tab separated (.tsv)
species_id genus species taxa AB Amphispiza bilineata Bird AS Ammodramus savannarum Bird CM Calamospiza melanocorys Bird CQ Callipepla squamata Bird CS Crotalus scutalatus Reptile