This lesson is being piloted (Beta version)

Foundations of Astronomical Data Science

Key Points

Basic queries
  • If you can’t download an entire dataset, or downloading it is not practical, use queries to select the data you need.

  • Read the metadata and the documentation to make sure you understand the tables, their columns, and what they mean.

  • Develop queries incrementally: start with something simple, test it, and add a little bit at a time.

  • Use ADQL features like TOP and COUNT to test before you run a query that might return a lot of data.

  • If you know your query will return fewer than 3000 rows, you can run it synchronously. If it might return more than 3000 rows, you should run it asynchronously.

  • ADQL and SQL keywords are not case-sensitive. You don’t have to capitalize them, but doing so makes your code more readable.

  • ADQL and SQL don’t require you to break a query into multiple lines, but doing so makes your code more readable.

  • Make each section of the notebook self-contained. Try not to use the same variable name in more than one section.

  • Keep notebooks short. Look for places where you can break your analysis into phases with one notebook per phase.
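As a minimal sketch of the incremental approach described above, the snippet below writes a small TOP query, then a COUNT query over the same selection, and finally picks a run mode from the expected row count. The table name gaiadr2.gaia_source, the column names, and the parallax condition are assumptions chosen for illustration, and choose_mode is a hypothetical helper, not part of any library.

```python
# Table, columns, and the WHERE condition are illustrative assumptions.

# Step 1: a small test query. TOP caps the number of rows returned.
test_query = """SELECT TOP 10
source_id, ra, dec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1
"""

# Step 2: the same selection wrapped in COUNT, which returns a single
# number instead of the matching rows.
count_query = """SELECT COUNT(source_id)
FROM gaiadr2.gaia_source
WHERE parallax < 1
"""

SYNC_ROW_LIMIT = 3000  # threshold quoted in the key points above


def choose_mode(expected_rows):
    """Return 'sync' for small results and 'async' otherwise.

    Treating exactly 3000 rows as 'async' is an assumption: the key
    point only covers "fewer than" and "more than" 3000.
    """
    return "sync" if expected_rows < SYNC_ROW_LIMIT else "async"


print(test_query)
print(count_query)
print(choose_mode(10))
print(choose_mode(50_000))
```

The number passed to choose_mode would normally come from running the COUNT query first, so the decision is based on a measured row count instead of a guess.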

Coordinate Transformations
  • For measurements with units, use Quantity objects that represent units explicitly and check for errors.

  • Use the format function to compose queries; it is often faster and less error-prone than assembling query strings by hand.

  • Develop queries incrementally: start with something simple, test it, and add a little bit at a time.

  • Once you have a query working, save the data in a local file. If you shut down the notebook and come back to it later, you can reload the file; you don’t have to run the query again.
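The point about composing queries with format can be sketched as follows. The template, the table name, the column list, and the cone-search centre and radius are all assumptions made up for illustration. Plain floats are used to keep the sketch free of dependencies; in a real notebook the radius is the kind of value that a Quantity object with explicit units would protect.

```python
# Template with named placeholders. The table, columns, and search
# region are illustrative assumptions.
query_template = """SELECT TOP 10
{columns}
FROM gaiadr2.gaia_source
WHERE 1 = CONTAINS(
    POINT(ra, dec),
    CIRCLE({ra}, {dec}, {radius_deg}))
"""

columns = "source_id, ra, dec, pmra, pmdec"

# The radius is chosen in arcminutes but the query needs degrees, so the
# conversion is done once, in one visible place.
radius_arcmin = 5
radius_deg = round(radius_arcmin / 60, 6)

query = query_template.format(
    columns=columns,
    ra=88.8,
    dec=7.4,
    radius_deg=radius_deg,
)
print(query)
```

Because each value is filled in by name, changing the column list or the search radius means editing one variable and calling format again, with no need to retype the query.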

Plotting and Tabular Data
  • When you make a scatter plot, adjust the size of the markers and their transparency so the figure is not overplotted; otherwise it can misrepresent the data badly.

  • For simple scatter plots in Matplotlib, plot is faster than scatter.

  • An Astropy Table and a Pandas DataFrame are similar in many ways and they provide many of the same functions. They have pros and cons, but for many projects, either one would be a reasonable choice.

  • To store data from a Pandas DataFrame, a good option is an HDF5 file, which can contain multiple Datasets (we’ll dig in more in the Join lesson).
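To illustrate the overplotting point above, here is a minimal sketch that draws the same points twice with plot: once with default markers and once with small, mostly transparent markers. The data are synthetic random points standing in for a real catalogue, and the specific markersize and alpha values are assumptions that would need tuning for any real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic points standing in for real catalogue data (an assumption).
rng = np.random.default_rng(42)
x = rng.normal(size=50_000)
y = rng.normal(size=50_000)

fig, (ax_dense, ax_tuned) = plt.subplots(1, 2, figsize=(8, 4))

# Default markers: the dense core saturates into a solid blob.
ax_dense.plot(x, y, "ko")
ax_dense.set_title("default markers")

# Small, mostly transparent markers let the density structure show.
ax_tuned.plot(x, y, "ko", markersize=0.5, alpha=0.1)
ax_tuned.set_title("markersize=0.5, alpha=0.1")

fig.tight_layout()
```

In a notebook the figure displays on its own; in a script, fig.savefig with a file name of your choice writes it to disk.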

Plotting and Pandas
  • A workflow is often prototyped on a small set of data, which can be explored more easily and used to identify ways to limit a dataset to exactly the data you want.

  • To store data from a Pandas DataFrame, a good option is an HDF5 file, which can contain multiple Datasets.
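The prototyping workflow described above might look like the following sketch: explore a small random sample, decide on selection bounds, then apply them to the full table. The DataFrame holds synthetic values, and the column names and bounds are hypothetical, chosen only to make the example run.

```python
import numpy as np
import pandas as pd

# Synthetic proper motions standing in for a real query result
# (an assumption made for illustration).
rng = np.random.default_rng(0)
full = pd.DataFrame({
    "pmra": rng.normal(0, 5, size=100_000),
    "pmdec": rng.normal(0, 5, size=100_000),
})

# Prototype on a small random sample that is quick to plot and inspect.
sample = full.sample(n=1_000, random_state=1)
print(sample.describe())

# Hypothetical bounds, as if chosen after exploring the sample.
pmra_min, pmra_max = -2.0, 2.0
pmdec_min, pmdec_max = -1.0, 1.0

# Apply the same selection to the full dataset.
mask = (
    full["pmra"].between(pmra_min, pmra_max)
    & full["pmdec"].between(pmdec_min, pmdec_max)
)
selected = full[mask]
print(len(selected), "of", len(full), "rows selected")
```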

Transform and Select
  • When possible, ‘move the computation to the data’; that is, do as much of the work as possible on the database server before downloading the data.

  • Use JOIN operations to combine data from multiple tables in a database, using some kind of identifier to match up records from one table with records from another. This is another example of a practice we saw in the previous notebook: moving the computation to the data.

  • For most applications, saving data in FITS or HDF5 is better than CSV. FITS and HDF5 are binary formats, so the files are usually smaller, and they store metadata, so you don’t lose anything when you read the file back.

  • On the other hand, CSV is a ‘least common denominator’ format; that is, it can be read by practically any application that works with data.

  • Matplotlib provides operations for working with points, polygons, and other geometric entities, so it is not just for making figures.

  • Use Matplotlib options to control the size and aspect ratio of figures to make them easier to interpret.

  • Record every element of the data analysis pipeline that would be needed to replicate the results.

  • Effective figures focus on telling a single story clearly and authentically. The major decisions needed to create an effective summary figure can be made away from a computer, and the figure can then be built up from low fidelity (hand drawn) to high fidelity (tweaking rcParams, etc.).

  • Consider using annotations to guide the reader’s attention to the most important elements of a figure, while keeping in mind accessibility issues that such detail may introduce.

  • The default Matplotlib style generates good quality figures, but there are several ways you can override the defaults.

  • If you find yourself making the same customizations on several projects, you might want to create your own style sheet.
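The earlier point about CSV not preserving everything can be illustrated with a small round trip using only the Python standard library. The two-row table and its column names are hypothetical. Libraries such as Pandas infer types when reading CSV, but that inference is a guess, and units or other metadata are still not stored in the file.

```python
import csv
import io

# Hypothetical two-row table with an integer and a float column.
rows = [
    {"source_id": 1001, "parallax": 0.25},
    {"source_id": 1002, "parallax": 0.75},
]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["source_id", "parallax"])
writer.writeheader()
writer.writerows(rows)

# Read them back: every value is now a string, because CSV stores text
# only and carries no type information.
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored[0])

# The reader has to re-apply the types by hand.
typed = [
    {"source_id": int(r["source_id"]), "parallax": float(r["parallax"])}
    for r in restored
]
print(typed[0])
```

Formats such as FITS and HDF5 avoid this manual step because column types are stored in the file alongside the values.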