
Foundations of Astronomical Data Science

Basic queries

Overview

Teaching: 65 min
Exercises: 25 min
Questions
  • How can we select and download the data we want from the Gaia server?

Objectives
  • Compose a basic query in ADQL/SQL.

  • Use queries to explore a database and its tables.

  • Use queries to download data.

  • Develop, test, and debug a query incrementally.

As a running example, we will replicate parts of the analysis in the paper, “Off the beaten path: Gaia reveals GD-1 stars outside of the main stream” by Adrian Price-Whelan and Ana Bonaca.

Outline

This episode demonstrates the steps for selecting and downloading data from the Gaia Database:

  1. First we will make a connection to the Gaia server,

  2. We will explore information about the database and the tables it contains,

  3. We will write a query and send it to the server, and finally

  4. We will download the response from the server.

Query Language

In order to select data from a database, you need to compose a query, which is a program written in a “query language”. The query language we will use is ADQL, which stands for “Astronomical Data Query Language”.

ADQL is a dialect of SQL (Structured Query Language), which is by far the most commonly used query language. Almost everything you will learn about ADQL also works in SQL.

The reference manual for ADQL is here. But you might find it easier to learn from this ADQL Cookbook.

Using Jupyter

If you have not worked with Jupyter notebooks before, you might start with the tutorial from Jupyter.org called “Try Classic Notebook”, or this tutorial from DataQuest.

There are two environments you can use to write and run notebooks:

  • Jupyter Notebook, which is the original, and

  • JupyterLab, which is a newer environment with more features.

For this lesson, you can use either one.

Here are the most important things to know about running a Jupyter notebook:

  1. Notebooks are made up of code cells and text cells (and a few other less common kinds). Code cells contain code; text cells contain explanatory text written in Markdown.

  2. To run a code cell, click the cell to select it and press Shift-Enter. The output of the code should appear below the cell.

  3. In general, notebooks only run correctly if you run every code cell in order from top to bottom. If you run cells out of order, you are likely to get errors.

  4. You can modify existing cells, but then you have to run them again to see the effect.

  5. You can add new cells, but you need to be careful about the order you run them in.

  6. If you have added or modified cells, and the behavior of the notebook seems strange, you can restart the “kernel”, which clears all of the variables and functions you have defined, and run the cells again from the beginning.

Before you continue with this lesson, you might want to explore the other menus and the toolbar to see what else you can do.

Connecting to Gaia

The library we will use to get Gaia data is Astroquery. Astroquery provides Gaia, which is an object that represents a connection to the Gaia database.

We can connect to the Gaia database like this:

from astroquery.gaia import Gaia
Created TAP+ (v1.2.1) - Connection:
	Host: gea.esac.esa.int
	Use HTTPS: True
	Port: 443
	SSL Port: 443
Created TAP+ (v1.2.1) - Connection:
	Host: geadata.esac.esa.int
	Use HTTPS: True
	Port: 443
	SSL Port: 443

This import statement creates a TAP+ connection; TAP stands for “Table Access Protocol”, which is a network protocol for sending queries to the database and getting back the results.

Databases and Tables

What is a database? Most generally, it can be any collection of data, but when we are talking about ADQL or SQL:

  • A database is a collection of one or more named tables.

  • Each table is a 2-D array with one or more named columns of data.

We can use Gaia.load_tables to get the names of the tables in the Gaia database. With the option only_names=True, it loads information about the tables, called “metadata”, but not the data itself.

tables = Gaia.load_tables(only_names=True)
INFO: Retrieving tables... [astroquery.utils.tap.core]
INFO: Parsing tables... [astroquery.utils.tap.core]
INFO: Done. [astroquery.utils.tap.core]

The following for loop prints the names of the tables.

for table in tables:
    print(table.name)
external.apassdr9
external.gaiadr2_geometric_distance
external.gaiaedr3_distance
external.galex_ais
external.ravedr5_com
external.ravedr5_dr5
external.ravedr5_gra
external.ravedr5_on
external.sdssdr13_photoprimary
external.skymapperdr1_master
external.skymapperdr2_master
[Output truncated]

So that is a lot of tables. The ones we will use in this lesson are:

  • gaiadr2.gaia_source, which contains the Gaia data from data release 2, and

  • gaiadr2.panstarrs1_original_valid, which contains photometry from the Pan-STARRS survey.

We can use load_table (not load_tables) to get the metadata for a single table. The name of this function is misleading, because it only downloads metadata, not the contents of the table.

table_metadata = Gaia.load_table('gaiadr2.gaia_source')
table_metadata
Retrieving table 'gaiadr2.gaia_source'
Parsing table 'gaiadr2.gaia_source'...
Done.

<astroquery.utils.tap.model.taptable.TapTableMeta at 0x7f50edd2aeb0>

Jupyter shows that the result is an object of type TapTableMeta, but it does not display the contents.

To see the metadata, we have to print the object.

print(table_metadata)
TAP Table name: gaiadr2.gaiadr2.gaia_source
Description: This table has an entry for every Gaia observed source as listed in the
Main Database accumulating catalogue version from which the catalogue
release has been generated. It contains the basic source parameters,
that is only final data (no epoch data) and no spectra (neither final
nor epoch).
Num. columns: 96

Columns

The following loop prints the names of the columns in the table.

for column in table_metadata.columns:
    print(column.name)
solution_id
designation
source_id
random_index
ref_epoch
ra
ra_error
dec
dec_error
parallax
parallax_error
[Output truncated]

You can probably infer what many of these columns are by looking at the names, but you should resist the temptation to guess. To find out what the columns mean, read the documentation.

Exercise (2 minutes)

One of the other tables we will use is gaiadr2.panstarrs1_original_valid. Use load_table to get the metadata for this table. How many columns are there and what are their names?

Solution

panstarrs_metadata = Gaia.load_table('gaiadr2.panstarrs1_original_valid')
print(panstarrs_metadata)

for column in panstarrs_metadata.columns:
    print(column.name)

Writing queries

You might be wondering how we download these tables. With tables this big, you generally don’t. Instead, you use queries to select only the data you want.

A query is a program written in a query language like SQL. For the Gaia database, the query language is a dialect of SQL called ADQL.

Here’s an example of an ADQL query.

query1 = """SELECT 
TOP 10
source_id, ra, dec, parallax 
FROM gaiadr2.gaia_source
"""

Triple-quoted strings

We use a triple-quoted string here so we can include line breaks in the query, which makes it easier to read.

The words in uppercase are ADQL keywords:

  • SELECT indicates that we are selecting data (as opposed to adding or modifying data).

  • TOP indicates that we only want the first 10 rows of the table, which is useful for testing a query before asking for all of the data.

  • FROM specifies which table we want data from.

The third line is a list of column names, indicating which columns we want.

In this example, the keywords are capitalized and the column names are lowercase. This is a common style, but it is not required. ADQL and SQL are not case-sensitive.

Also, the query is broken into multiple lines to make it more readable. This is a common style, but not required. Line breaks don’t affect the behavior of the query.
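
As a quick illustration (not a query we will use), here is a lowercase, single-line version that the server treats exactly the same as query1:

query1_lowercase = 'select top 10 source_id, ra, dec, parallax from gaiadr2.gaia_source'

Sticking to the capitalized, multi-line style makes queries much easier to read, so we will use it throughout.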

To run this query, we use the Gaia object, which represents our connection to the Gaia database, and invoke launch_job:

job1 = Gaia.launch_job(query1)
job1
<astroquery.utils.tap.model.job.Job at 0x7f50edd2adc0>

The result is an object that represents the job running on a Gaia server.

If you print it, it displays metadata for the forthcoming results.

print(job1)
<Table length=10>
   name    dtype  unit                            description                             n_bad
--------- ------- ---- ------------------------------------------------------------------ -----
source_id   int64      Unique source identifier (unique within a particular Data Release)     0
       ra float64  deg                                                    Right ascension     0
      dec float64  deg                                                        Declination     0
 parallax float64  mas                                                           Parallax     2
Jobid: None
Phase: COMPLETED
Owner: None
Output file: sync_20210315090602.xml.gz
[Output truncated]

Don’t worry about Results: None. That does not actually mean there are no results.

However, Phase: COMPLETED indicates that the job is complete, so we can get the results like this:

results1 = job1.get_results()
type(results1)
astropy.table.table.Table

The type function indicates that the result is an Astropy Table.

Repetition

Why is table repeated three times? The first is the name of the module, the second is the name of the submodule, and the third is the name of the class. Most of the time we only care about the last one. It’s like the Linnean name for the Western lowland gorilla, which is Gorilla gorilla gorilla.

An Astropy Table is similar to a table in an SQL database except:

  • SQL databases are stored on disk, so they are persistent; that is, they “survive” even if you turn off the computer. An Astropy Table is stored in memory; it disappears when you shut down the notebook.

  • SQL databases are designed to process queries. An Astropy Table can perform some query-like operations, like selecting columns and rows, but these operations use Python syntax, not SQL.

Jupyter knows how to display the contents of a Table.

results1
<Table length=10>
     source_id              ra                 dec               parallax      
                           deg                 deg                 mas         
       int64             float64             float64             float64       
------------------- ------------------ ------------------- --------------------
5887983246081387776   227.978818386372  -53.64996962450103   1.0493172163332998
5887971250213117952 228.32280834041364  -53.66270726203726  0.29455652682279093
5887991866047288704  228.1582047014091 -53.454724911639794  -0.5789179941669236
5887968673232040832 228.07420888099884   -53.8064612895961  0.41030970779603076
5887979844465854720 228.42547805195946  -53.48882284470035 -0.23379683441525864
5887978607515442688 228.23831627636855  -53.56328249482688  -0.9252161956789068
[Output truncated]

Each column has a name, units, and a data type.

For example, the units of ra and dec are degrees, and their data type is float64, which is a 64-bit floating-point number, used to store measurements with a fraction part.

This information comes from the Gaia database, and has been stored in the Astropy Table by Astroquery.
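
Because this metadata travels with the table, we can inspect it directly. Here is a quick check on the ra column, using standard Astropy Column attributes:

results1['ra'].unit         # Unit("deg")
results1['ra'].dtype        # dtype('float64')
results1['ra'].description  # 'Right ascension'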

Exercise (3 minutes)

Read the documentation of this table and choose a column that looks interesting to you. Add the column name to the query and run it again. What are the units of the column you selected? What is its data type?

Solution

For example, we can add radial_velocity: “Spectroscopic radial velocity in the solar barycentric reference frame” (double, Velocity[km/s]). The radial velocity provided is the median value of the radial velocity measurements at all epochs.

query1_with_rv = """SELECT 
TOP 10
source_id, ra, dec, parallax, radial_velocity
FROM gaiadr2.gaia_source
"""

Asynchronous queries

launch_job asks the server to run the job “synchronously”, which normally means it runs immediately. But synchronous jobs are limited to 2000 rows. For queries that return more rows, you should run them “asynchronously”, which means they might take longer to get started.

If you are not sure how many rows a query will return, you can use the SQL command COUNT to find out how many rows are in the result without actually returning them. We will see an example in the next lesson.

The results of an asynchronous query are stored in a file on the server, so you can start a query and come back later to get the results. For anonymous users, files are kept for three days.
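
Here is a minimal sketch of how you might take advantage of that; the query variable is a placeholder for any ADQL string:

from astroquery.gaia import Gaia

# Submit an asynchronous job and record its ID
job = Gaia.launch_job_async(query)
print(job.jobid)

# Later, possibly in a new session, retrieve the job by ID
# and download its results
job_again = Gaia.load_async_job(jobid=job.jobid)
results = job_again.get_results()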

As an example, let us try a query that is similar to query1, with these changes:

  • It selects the first 3000 rows, which is more than we should run synchronously.

  • It selects two additional columns, pmra and pmdec, the proper motions along the axes of ra and dec.

  • It uses a new keyword, WHERE.

query2 = """SELECT 
TOP 3000
source_id, ra, dec, pmra, pmdec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1
"""

A WHERE clause indicates which rows we want; in this case, the query selects only rows “where” parallax is less than 1. This has the effect of selecting stars with relatively low parallax, which are farther away. We’ll use this clause to exclude nearby stars that are unlikely to be part of GD-1.

WHERE is one of the most common clauses in ADQL/SQL, and one of the most useful, because it allows us to download only the rows we need from the database.

We use launch_job_async to submit an asynchronous query.

job2 = Gaia.launch_job_async(query2)
job2
INFO: Query finished. [astroquery.utils.tap.core]

<astroquery.utils.tap.model.job.Job at 0x7f50edd40f40>

And here are the results.

results2 = job2.get_results()
results2
<Table length=3000>
     source_id              ra         ...       parallax       radial_velocity
                           deg         ...         mas               km / s    
       int64             float64       ...       float64            float64    
------------------- ------------------ ... -------------------- ---------------
5895270396817359872 213.08433715252883 ...    2.041947005434917              --
5895272561481374080  213.2606587905109 ...  0.15693467895110133              --
5895247410183786368 213.38479712976664 ... -0.19017525742552605              --
5895249226912448000 213.41587389088238 ...                   --              --
5895261875598904576  213.5508930114549 ... -0.29471722363529257              --
5895258302187834624 213.87631129557286 ...   0.6468437015289753              --
[Output truncated]

You might notice that some values of parallax are negative. As this FAQ explains, “Negative parallaxes are caused by errors in the observations.” They have “no physical meaning,” but they can be a “useful diagnostic on the quality of the astrometric solution.”

Exercise (5 minutes)

The clauses in a query have to be in the right order. Go back and change the order of the clauses in query2 and run it again. The modified query should fail, but notice that you don’t get much useful debugging information.

For this reason, developing and debugging ADQL queries can be really hard. A few suggestions that might help:

  • Whenever possible, start with a working query, either an example you find online or a query you have used in the past.

  • Make small changes and test each change before you continue.

  • While you are debugging, use TOP to limit the number of rows in the result. That will make each test run faster, which reduces your development time.

  • Launching test queries synchronously might make them start faster, too.

Solution

In this example, the WHERE clause is in the wrong place.

query2_erroneous = """SELECT 
TOP 3000
WHERE parallax < 1
source_id, ref_epoch, ra, dec, parallax
FROM gaiadr2.gaia_source
"""

Operators

In a WHERE clause, you can use any of the SQL comparison operators; here are the most common ones:

Symbol Operation
> greater than
< less than
>= greater than or equal
<= less than or equal
= equal
!= or <> not equal

Most of these are the same as Python, but some are not. In particular, notice that the equality operator is =, not ==. Be careful to keep your Python out of your ADQL!

You can combine comparisons using the logical operators:

  • AND: true if both comparisons are true

  • OR: true if either or both comparisons are true

Finally, you can use NOT to invert the result of a comparison.
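
For example, this query (shown only for illustration) uses NOT to select rows where parallax is not less than 1, which is equivalent to parallax >= 1:

query_not_example = """SELECT
TOP 10
source_id, ra, dec, parallax
FROM gaiadr2.gaia_source
WHERE NOT (parallax < 1)
"""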

Exercise (5 minutes)

Read about SQL operators here and then modify the previous query to select rows where bp_rp is between -0.75 and 2.

Solution

Here is a solution using > and < operators:

query2_sol1 = """SELECT 
TOP 10
source_id, ref_epoch, ra, dec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1 
  AND bp_rp > -0.75 AND bp_rp < 2
"""

And here is a solution using the BETWEEN operator:

query2_sol2 = """SELECT 
TOP 10
source_id, ref_epoch, ra, dec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1 
  AND bp_rp BETWEEN -0.75 AND 2
"""

bp_rp contains BP-RP color, which is the difference between two other columns, phot_bp_mean_mag and phot_rp_mean_mag. You can read about this variable here.

This Hertzsprung-Russell diagram shows the BP-RP color and luminosity of stars in the Gaia catalog (Copyright: ESA/Gaia/DPAC, CC BY-SA 3.0 IGO).

Hertzsprung-Russell diagram of BP-RP color versus luminosity.

Selecting stars with bp_rp less than 2 excludes many class M dwarf stars, which have low temperature and low luminosity. A star like that at GD-1’s distance would be hard to detect, so if it is detected, it is more likely to be in the foreground.

Formatting queries

The queries we have written so far are string “literals”, meaning that the entire string is part of the program. But writing queries yourself can be slow, repetitive, and error-prone.

It is often better to write Python code that assembles a query for you. One useful tool for that is the string format method.

As an example, we will divide the previous query into two parts: a list of column names and a “base” for the query that contains everything except the column names.

Here is the list of columns we will select.

columns = 'source_id, ra, dec, pmra, pmdec, parallax'

And here is the base. It is a string that contains at least one format specifier in curly brackets (braces).

query3_base = """SELECT 
TOP 10 
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2
"""

This base query contains one format specifier, {columns}, which is a placeholder for the list of column names we will provide.

To assemble the query, we invoke format on the base string and provide a keyword argument that assigns a value to columns.

query3 = query3_base.format(columns=columns)

In this example, the variable that contains the column names and the variable in the format specifier have the same name. That is not required, but it is a common style.

The result is a string with line breaks. If you display it, the line breaks appear as \n.

query3
'SELECT \nTOP 10 \nsource_id, ra, dec, pmra, pmdec, parallax\nFROM gaiadr2.gaia_source\nWHERE parallax < 1\n  AND bp_rp BETWEEN -0.75 AND 2\n'

But if you print it, the line breaks appear as line breaks.

print(query3)
SELECT 
TOP 10 
source_id, ra, dec, pmra, pmdec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2

Notice that the format specifier has been replaced with the value of columns.

Let’s run it and see if it works:

job3 = Gaia.launch_job(query3)
print(job3)
<Table length=10>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
 parallax float64      mas                                                           Parallax
Jobid: None
Phase: COMPLETED
Owner: None
[Output truncated]
results3 = job3.get_results()
results3
<Table length=10>
     source_id              ra         ...        pmdec        
                           deg         ...       mas / yr      
       int64             float64       ...       float64       
------------------- ------------------ ... --------------------
5895272561481374080  213.2606587905109 ...   1.2299266281737415
5895261875598904576  213.5508930114549 ...   -4.672602679543312
5895247444506644992 213.33215109206796 ...   -3.538080792097856
5895259470417635968 213.78815034206346 ...  -0.8163762113468646
5895265925746051584 213.17082359534547 ...  -4.8585444120179595
5895260913525974528 213.66936020541976 ...  -1.5566420086447643
[Output truncated]

Exercise (10 minutes)

This query always selects sources with parallax less than 1. But suppose you want to take that upper bound as an input.

Modify query3_base to replace 1 with a format specifier like {max_parallax}. Now, when you call format, add a keyword argument that assigns a value to max_parallax, and confirm that the format specifier gets replaced with the value you provide.

Solution

query_base_sol = """SELECT 
TOP 10
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < {max_parallax} AND 
bp_rp BETWEEN -0.75 AND 2
"""

query_sol = query_base_sol.format(columns=columns,
                          max_parallax=0.5)
print(query_sol)

Summary

This episode has demonstrated the following steps:

  1. Making a connection to the Gaia server,

  2. Exploring information about the database and the tables it contains,

  3. Writing a query and sending it to the server, and finally

  4. Downloading the response from the server as an Astropy Table.

In the next episode we will extend these queries to select a particular region of the sky.

Key Points

  • If you can’t download an entire dataset (or it is not practical), use queries to select the data you need.

  • Read the metadata and the documentation to make sure you understand the tables, their columns, and what they mean.

  • Develop queries incrementally: start with something simple, test it, and add a little bit at a time.

  • Use ADQL features like TOP and COUNT to test before you run a query that might return a lot of data.

  • If you know your query will return fewer than 2000 rows, you can run it synchronously. If it might return more rows, you should run it asynchronously.

  • ADQL and SQL are not case-sensitive. You don’t have to capitalize the keywords, but it will make your code more readable.

  • ADQL and SQL don’t require you to break a query into multiple lines, but it will make your code more readable.

  • Make each section of the notebook self-contained. Try not to use the same variable name in more than one section.

  • Keep notebooks short. Look for places where you can break your analysis into phases with one notebook per phase.


Coordinate Transformations

Overview

Teaching: 75 min
Exercises: 18 min
Questions
  • How do we transform celestial coordinates from one frame to another and save results in files?

Objectives
  • Use Python string formatting to compose more complex ADQL queries.

  • Work with coordinates and other quantities that have units.

  • Download the results of a query and store them in a file.

In the previous episode, we wrote ADQL queries and used them to select and download data from the Gaia server. In this episode, we will write a query to select stars from a particular region of the sky.

Outline

We’ll start with an example that does a “cone search”; that is, it selects stars that appear in a circular region of the sky.

Then, to select stars in the vicinity of GD-1, we will:

  • Use Quantity objects to represent measurements with units.

  • Use Astropy to convert coordinates from one frame to another.

  • Use the ADQL keywords POLYGON, CONTAINS, and POINT to select stars that fall within a polygonal region.

  • Submit a query and download the results.

  • Store the results in a FITS file.

Working with Units

The measurements we will work with are physical quantities, which means that they have two parts, a value and a unit. For example, the coordinate 30° has value 30 and its units are degrees.

Until recently, most scientific computation was done with values only; units were left out of the program altogether, sometimes with catastrophic results.

Astropy provides tools for including units explicitly in computations, which makes it possible to detect errors before they cause disasters.

To use Astropy units, we import them like this:

import astropy.units as u

u is an object that contains most common units and all SI units.

You can use dir to list them, but you should also read the documentation.

dir(u)
['A',
 'AA',
 'AB',
 'ABflux',
 'ABmag',
 'AU',
 'Angstrom',
 'B',
 'Ba',
 'Barye',
 'Bi',
[Output truncated]

To create a quantity, we multiply a value by a unit:

angle = 10 * u.degree
type(angle)
astropy.units.quantity.Quantity

The result is a Quantity object. Jupyter knows how to display Quantities like this:

angle
<Quantity 10. deg>


Quantity objects provide a method called to that converts a quantity to other units. For example, we can compute the number of arcminutes in angle:

angle_arcmin = angle.to(u.arcmin)
angle_arcmin
<Quantity 600. arcmin>

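A Quantity keeps its two parts separate, and you can access them through the value and unit attributes; a quick check with the quantities defined above:

angle.value           # 10.0
angle.unit            # Unit("deg")
angle_arcmin.value    # 600.0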

If you add quantities, Astropy converts them to compatible units, if possible:

angle + 30 * u.arcmin
<Quantity 10.5 deg>


If the units are not compatible, you get an error. For example:

angle + 5 * u.second

causes a UnitConversionError.
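
If you want to handle this case in a program rather than letting it stop the notebook, you can catch the exception; a minimal sketch:

try:
    angle + 5 * u.second
except u.UnitConversionError as e:
    print(e)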

Exercise (3 minutes)

Create a quantity that represents 5 arcminutes and assign it to a variable called radius.

Then convert it to degrees.

Solution

radius = 5 * u.arcmin
print(radius)

radius.to(u.degree)
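
If you need the bare number rather than a Quantity, for example to insert a radius into a query string, the value attribute strips off the unit; a small sketch:

radius.to(u.degree).value
0.08333333333333333

This is where the radius of 0.08333333 degrees in the next section comes from.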

Selecting a Region

One of the most common ways to restrict a query is to select stars in a particular region of the sky. For example, here is a query from the Gaia archive documentation that selects objects in a circular region centered at (88.8, 7.4) with a search radius of 5 arcmin (0.08333 deg).

cone_query = """SELECT 
TOP 10 
source_id
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

This query uses three keywords that are specific to ADQL (not SQL):

  • POINT: a location in ICRS coordinates, specified in degrees of right ascension and declination,

  • CIRCLE: a circle where the first two values are the coordinates of the center and the third is the radius in degrees, and

  • CONTAINS: a function that returns 1 if a POINT is contained within a shape and 0 otherwise.

A query like this is called a cone search because it selects stars in a cone. Here is how we run it:

from astroquery.gaia import Gaia

cone_job = Gaia.launch_job(cone_query)
cone_job
Created TAP+ (v1.2.1) - Connection:
	Host: gea.esac.esa.int
	Use HTTPS: True
	Port: 443
	SSL Port: 443
Created TAP+ (v1.2.1) - Connection:
	Host: geadata.esac.esa.int
	Use HTTPS: True
	Port: 443
	SSL Port: 443

<astroquery.utils.tap.model.job.Job at 0x7f277785fa30>
cone_results = cone_job.get_results()
cone_results
<Table length=10>
     source_id     
       int64       
-------------------
3322773965056065536
3322773758899157120
3322774068134271104
3322773930696320512
3322774377374425728
3322773724537891456
3322773724537891328
[Output truncated]

Exercise (5 minutes)

When you are debugging queries like this, you can use TOP to limit the size of the results, but then you still don’t know how big the results will be.

An alternative is to use COUNT, which asks for the number of rows that would be selected, but it does not return them.

In the previous query, replace TOP 10 source_id with COUNT(source_id) and run the query again. How many stars has Gaia identified in the cone we searched?

Solution

count_cone_query = """SELECT 
COUNT(source_id)
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

count_cone_job = Gaia.launch_job(count_cone_query)
count_cone_results = count_cone_job.get_results()
count_cone_results
<Table length=1>
count
int64
-----
594

Getting GD-1 Data

From the Price-Whelan and Bonaca paper, we will try to reproduce Figure 1, which includes this representation of stars likely to belong to GD-1:

On-sky positions of likely GD-1 members in the GD-1 coordinate system, where selection by proper motion and photometry reveals the stream in great detail.

The axes of this figure are defined so the x-axis is aligned with the stars in GD-1, and the y-axis is perpendicular.

Ideally, we would select all stars from this rectangle, but there are more than 10 million of them. This would be difficult to work with, and as anonymous Gaia users, we are limited to 3 million rows in a single query. While we are developing and testing code, it will be faster to work with a smaller dataset.

So we will start by selecting stars in a smaller rectangle near the center of GD-1, from -55 to -45 degrees φ1 and -8 to 4 degrees φ2. First we will learn how to represent these coordinates with Astropy.

Transforming coordinates

Astronomy makes use of many different coordinate systems. Transforming between coordinate systems is a common task in observational astronomy, and thankfully, Astropy has abstracted the required spherical trigonometry for us. Below we show the steps to go from Equatorial coordinates (sky coordinates) to Galactic coordinates and finally to a reference frame defined to more easily study GD-1.

Astropy provides a SkyCoord object that represents sky coordinates relative to a specified reference frame.

The following example creates a SkyCoord object that represents the approximate coordinates of Betelgeuse (alf Ori) in the ICRS frame.

ICRS is the “International Celestial Reference System”, adopted in 1997 by the International Astronomical Union.

from astropy.coordinates import SkyCoord

ra = 88.8 * u.degree
dec = 7.4 * u.degree
coord_icrs = SkyCoord(ra=ra, dec=dec, frame='icrs')

coord_icrs
<SkyCoord (ICRS): (ra, dec) in deg
    (88.8, 7.4)>

SkyCoord objects require units in order to understand the context. There are a number of ways to define SkyCoord objects; in our example, we explicitly specified the coordinates and units and provided a reference frame.

SkyCoord provides transform_to, which transforms coordinates from one reference frame to another. For example, we can transform coord_icrs to Galactic coordinates like this:

coord_galactic = coord_icrs.transform_to('galactic')
coord_galactic
<SkyCoord (Galactic): (l, b) in deg
    (199.79693102, -8.95591653)>

Coordinate Variables

Notice that in the Galactic frame, the coordinates are called l and b, the names conventionally used for Galactic longitude and latitude, not ra and dec. Most reference frames have their own names for coordinates, and the SkyCoord object uses those names.
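
For example, using the object we just created, we can access the Galactic coordinates by these names:

coord_galactic.l.degree   # approximately 199.797
coord_galactic.b.degree   # approximately -8.956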

To transform to and from GD-1 coordinates, we will use a frame defined by Gala, which is an Astropy-affiliated library that provides tools for galactic dynamics.

Gala provides GD1Koposov10, which is “a Heliocentric spherical coordinate system defined by the orbit of the GD-1 stream”. In this coordinate system, one axis is defined along the direction of the stream (the x-axis in Figure 1) and one axis is defined perpendicular to the direction of the stream (the y-axis in Figure 1). These are called the φ1 and φ2 coordinates, respectively.

from gala.coordinates import GD1Koposov10

gd1_frame = GD1Koposov10()
gd1_frame
<GD1Koposov10 Frame>

We can use it to find the coordinates of Betelgeuse in the GD-1 frame, like this:

coord_gd1 = coord_icrs.transform_to(gd1_frame)
coord_gd1
<SkyCoord (GD1Koposov10): (phi1, phi2) in deg
    (-94.97222038, 34.5813813)>

Exercise (10 minutes)

Find the location of GD-1 in ICRS coordinates.

  1. Create a SkyCoord object at 0°, 0° in the GD-1 frame.

  2. Transform it to the ICRS frame.

Hint: Because ICRS is a standard frame, it is built into Astropy. You can specify it by name, icrs (as we did with galactic).

Solution

origin_gd1 = SkyCoord(0*u.degree, 0*u.degree, frame=gd1_frame)

origin_gd1.transform_to('icrs')

Notice that the origin of the GD-1 frame maps to ra=200, exactly, in ICRS. That is by design.

Selecting a rectangle

Now that we know how to define coordinate transformations, we are going to use them to get a list of stars that are in GD-1. As we mentioned before, this is a lot of stars, so we are going to start by defining a rectangle that encompasses a small part of GD-1. This is easiest to define in GD-1 coordinates.

The following variables define the boundaries of the rectangle in φ1 and φ2.

phi1_min = -55 * u.degree 
phi1_max = -45 * u.degree
phi2_min = -8 * u.degree
phi2_max = 4 * u.degree

Throughout this lesson we will define rectangles often. Rather than copy and paste multiple lines of code, we will write a function to build the rectangle for us. With the code in a single location, we can easily fix bugs or update the implementation as needed. Choosing an explicit function name also makes the code self-documenting: it is easy to see that we are building a rectangle when we call this function.

To create a rectangle, we will use the following function, which takes the lower and upper bounds as parameters and returns a list of x and y coordinates of the corners of a rectangle starting with the lower left corner and working clockwise.

def make_rectangle(x1, x2, y1, y2):
    """Return the corners of a rectangle."""
    xs = [x1, x1, x2, x2, x1]
    ys = [y1, y2, y2, y1, y1]
    return xs, ys

The return value is a tuple containing a list of coordinates in φ1 followed by a list of coordinates in φ2.

phi1_rect, phi2_rect = make_rectangle(
    phi1_min, phi1_max, phi2_min, phi2_max)

phi1_rect and phi2_rect contain the coordinates of the corners of a rectangle in the GD-1 frame.

While it is easier to visualize the regions we want to define in the GD-1 frame, the coordinates in the Gaia catalog are in the ICRS frame. In order to use the rectangle we defined, we need to convert the coordinates from the GD-1 frame to the ICRS frame. We will do this using the SkyCoord object. Fortunately SkyCoord objects can take lists of coordinates, in addition to single values.

corners = SkyCoord(phi1=phi1_rect, phi2=phi2_rect, frame=gd1_frame)
corners
<SkyCoord (GD1Koposov10): (phi1, phi2) in deg
    [(-55., -8.), (-55.,  4.), (-45.,  4.), (-45., -8.), (-55., -8.)]>

Now we can use transform_to to convert to ICRS coordinates.

corners_icrs = corners.transform_to('icrs')
corners_icrs
<SkyCoord (ICRS): (ra, dec) in deg
    [(146.27533314, 19.26190982), (135.42163944, 25.87738723),
     (141.60264825, 34.3048303 ), (152.81671045, 27.13611254),
     (146.27533314, 19.26190982)]>

Notice that a rectangle in one coordinate system is not necessarily a rectangle in another. In this example, the result is a (non-rectangular) polygon. This is why we defined our rectangle in the GD-1 frame.

Defining a polygon

In order to use this polygon as part of an ADQL query, we have to convert it to a string with a comma-separated list of coordinates, as in this example:

"""
POLYGON(143.65, 20.98, 
        134.46, 26.39, 
        140.58, 34.85, 
        150.16, 29.01)
"""

SkyCoord provides to_string, which produces a list of strings.

corners_list_str = corners_icrs.to_string()
corners_list_str
['146.275 19.2619',
 '135.422 25.8774',
 '141.603 34.3048',
 '152.817 27.1361',
 '146.275 19.2619']

We can use the Python string method join to combine corners_list_str into a single string (with spaces between the pairs):

corners_single_str = ' '.join(corners_list_str)
corners_single_str
'146.275 19.2619 135.422 25.8774 141.603 34.3048 152.817 27.1361 146.275 19.2619'

This is almost what we need, but we have to replace the spaces with commas.

corners_single_str.replace(' ', ', ')
'146.275, 19.2619, 135.422, 25.8774, 141.603, 34.3048, 152.817, 27.1361, 146.275, 19.2619'

This is something we will need to do multiple times. We will write a function to do it for us so we don’t have to copy and paste every time. The following function combines these steps.

def skycoord_to_string(skycoord):
    """Convert SkyCoord to string."""
    corners_list_str = skycoord.to_string()
    corners_single_str = ' '.join(corners_list_str)
    return corners_single_str.replace(' ', ', ')

Here is how we use this function:

sky_point_list = skycoord_to_string(corners_icrs)
sky_point_list
'146.275, 19.2619, 135.422, 25.8774, 141.603, 34.3048, 152.817, 27.1361, 146.275, 19.2619'

Assembling the query

Now we are ready to assemble our query to get all of the stars in the Gaia catalog that are in the small rectangle we defined and are likely to be part of GD-1 with the criteria we previously defined.

We need columns again (as we saw in the previous episode).

columns = 'source_id, ra, dec, pmra, pmdec, parallax'

And here is the query base we used in the previous lesson:

query3_base = """SELECT 
TOP 10 
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2
"""

Now we will add a WHERE clause to select stars in the polygon we defined and start using more descriptive variables for our queries.

polygon_top10query_base = """SELECT
TOP 10
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON({sky_point_list}))
"""

The query base contains format specifiers for columns and sky_point_list.

We will use format to fill in these values.

polygon_top10query = polygon_top10query_base.format(columns=columns, 
                          sky_point_list=sky_point_list)
print(polygon_top10query)
SELECT
TOP 10
source_id, ra, dec, pmra, pmdec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON(146.275, 19.2619, 135.422, 25.8774, 141.603, 34.3048, 152.817, 27.1361, 146.275, 19.2619))


As always, we should take a minute to proof-read the query before we launch it.

polygon_top10query_job = Gaia.launch_job_async(polygon_top10query)
print(polygon_top10query_job)
INFO: Query finished. [astroquery.utils.tap.core]
<Table length=10>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
 parallax float64      mas                                                           Parallax
Jobid: 1615815873808O
Phase: COMPLETED
[Output truncated]

Here are the results.

polygon_top10query_results = polygon_top10query_job.get_results()
polygon_top10query_results
<Table length=10>
    source_id              ra         ...        pmdec       
                          deg         ...       mas / yr     
      int64             float64       ...       float64      
------------------ ------------------ ... -------------------
637987125186749568 142.48301935991023 ...   2.941813096629439
638285195917112960 142.25452941346344 ... -12.165984395577347
638073505568978688 142.64528557468074 ...  -7.950659620550862
638086386175786752 142.57739430926034 ...  -2.584105480335548
638049655615392384 142.58913564478618 ...  -4.941079187010136
638267565075964032 141.81762228999614 ...  1.8838892877285924
[Output truncated]

Finally, we can remove TOP 10 and run the query again.

The result is bigger than our previous queries, so it will take a little longer.

polygon_query_base = """SELECT
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON({sky_point_list}))
"""
polygon_query = polygon_query_base.format(columns=columns, 
                          sky_point_list=sky_point_list)
print(polygon_query)
SELECT
source_id, ra, dec, pmra, pmdec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON(146.275, 19.2619, 135.422, 25.8774, 141.603, 34.3048, 152.817, 27.1361, 146.275, 19.2619))
polygon_job = Gaia.launch_job_async(polygon_query)
print(polygon_job)
INFO: Query finished. [astroquery.utils.tap.core]
<Table length=140339>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
 parallax float64      mas                                                           Parallax
Jobid: 1615815886707O
Phase: COMPLETED
[Output truncated]
polygon_results = polygon_job.get_results()
len(polygon_results)
140339

There are more than 100,000 stars in this polygon, but that’s a manageable size to work with.

Saving results

This is the set of stars we will work with in the next step. Since we have a substantial dataset now, this is a good time to save it.

Storing the data in a file means we can shut down our notebook and pick up where we left off without running the previous query again.

Astropy Table objects provide write, which writes the table to disk.

filename = 'gd1_results.fits'
polygon_results.write(filename, overwrite=True)

Because the filename ends with .fits, the table is written in the FITS format, which preserves the metadata associated with the table.

If the file already exists, the overwrite argument causes it to be overwritten.

We can use getsize to confirm that the file exists and check the size:

from os.path import getsize

MB = 1024 * 1024
getsize(filename) / MB
5.36407470703125

Summary

In this notebook, we composed more complex queries to select stars within a polygonal region of the sky. Then we downloaded the results and saved them in a FITS file.

In the next notebook, we’ll reload the data from this file and replicate the next step in the analysis, using proper motion to identify stars likely to be in GD-1.

Key Points

  • For measurements with units, use Quantity objects that represent units explicitly and check for errors.

  • Use the string format method to compose queries; it is often faster and less error-prone.

  • Develop queries incrementally: start with something simple, test it, and add a little bit at a time.

  • Once you have a query working, save the data in a local file. If you shut down the notebook and come back to it later, you can reload the file; you don’t have to run the query again.


Plotting and Tabular Data

Overview

Teaching: 65 min
Exercises: 20 min
Questions
  • How do we manipulate Astropy Tables? How do we make scatter plots in Matplotlib? How do we store data in a Pandas DataFrame? How do we save a workflow into a reusable function?

Objectives
  • Select rows and columns from an Astropy Table.

  • Use Matplotlib to make a scatter plot.

  • Use Gala to transform coordinates.

  • Make a Pandas DataFrame and use a Boolean Series to select rows.

  • Save a DataFrame in an HDF5 file.

In the previous episode, we wrote a query to select stars from the region of the sky where we expect GD-1 to be, and saved the results in a FITS file.

Now we will read that data back in and implement the next step in the analysis, identifying stars with the proper motion we expect for GD-1.

Outline

  1. We will read back the results from the previous lesson, which we saved in a FITS file.

  2. Then we will transform the coordinates and proper motion data from ICRS back to the coordinate frame of GD-1.

  3. We will put those results into a Pandas DataFrame.

Starting from this episode

If you are starting a new notebook for this episode, expand this section for information you will need to get started.

Read me

In the previous episode, we ran a query on the Gaia server, downloaded data for roughly 140,000 stars, and saved the data in a FITS file. We will use that data for this episode. Whether you are working from a new notebook or coming back from a checkpoint, reloading the data will save you from having to run the query again.

If you are starting this episode here or in a new notebook, you will need to run the following lines of code.

These lines repeat the imports from the previous episodes:

import astropy.units as u
from astropy.coordinates import SkyCoord
from gala.coordinates import GD1Koposov10
from astropy.table import Table

from episode_functions import *

The following code loads in the data (instructions for downloading the data can be found in the setup instructions). You may need to add the path to the filename variable below (e.g. filename = 'student_download/data/gd1_results.fits').

filename = 'gd1_results.fits'
polygon_results = Table.read(filename)

gd1_frame = GD1Koposov10()

Selecting rows and columns

In the previous episode, we selected spatial and proper motion information from the Gaia catalog for stars around a small part of GD-1. The output was returned as an Astropy Table. We can use info to check the contents.

polygon_results.info()
<Table length=140339>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
 parallax float64      mas                                                           Parallax

In this episode, we will see operations for selecting columns and rows from an Astropy Table. You can find more information about these operations in the Astropy documentation.

We can get the names of the columns like this:

polygon_results.colnames
['source_id', 'ra', 'dec', 'pmra', 'pmdec', 'parallax']

And select an individual column like this:

polygon_results['ra']
<Column name='ra' dtype='float64' unit='deg' description='Right ascension' length=140339>
142.48301935991023
142.25452941346344
142.64528557468074
142.57739430926034
142.58913564478618
141.81762228999614
143.18339801317677
 142.9347319464589
142.26769745823267
142.89551292869012
[Output truncated]

The result is a Column object that contains the data, and also the data type, units, and name of the column.

type(polygon_results['ra'])
astropy.table.column.Column

The rows in the Table are numbered from 0 to n-1, where n is the number of rows. We can select the first row like this:

polygon_results[0]
<Row index=0>
    source_id              ra                dec                pmra              pmdec             parallax     
                          deg                deg              mas / yr           mas / yr             mas        
      int64             float64            float64            float64            float64            float64      
------------------ ------------------ ----------------- ------------------- ----------------- -------------------
637987125186749568 142.48301935991023 21.75771616932985 -2.5168384683875766 2.941813096629439 -0.2573448962333354

The result is a Row object.

type(polygon_results[0])
astropy.table.row.Row

Notice that the bracket operator can be used to select both columns and rows. You might wonder how it knows which to select. If the expression in brackets is a string, it selects a column; if the expression is an integer, it selects a row.

If you apply the bracket operator twice, you can select a column and then an element from the column.

polygon_results['ra'][0]
142.48301935991023

Or you can select a row and then an element from the row.

polygon_results[0]['ra']
142.48301935991023

You get the same result either way.

Scatter plot

To see what the results look like, we will use a scatter plot. The library we will use is Matplotlib, which is the most widely-used plotting library for Python. The Matplotlib interface is based on MATLAB (hence the name), so if you know MATLAB, some of it will be familiar.

We will import like this:

import matplotlib.pyplot as plt

Pyplot is part of the Matplotlib library. It is conventional to import it using the shortened name plt.

Keeping plots in the notebook

In recent versions of Jupyter, plots appear “inline”; that is, they are part of the notebook. In some older versions, plots appear in a new window. If your plots appear in a new window, you might want to run the following Jupyter magic command in a notebook cell:

%matplotlib inline

Pyplot provides two functions that can make scatter plots, plt.scatter and plt.plot.

  • scatter is more versatile; for example, you can control the color and size of each point individually.

  • plot is more limited, but for large numbers of points it is substantially faster.

Jake Vanderplas explains these differences in The Python Data Science Handbook.

Since we are plotting more than 100,000 points and they are all the same size and color, we will use plot.

Here is a scatter plot of the stars we selected in the GD-1 region with right ascension on the x-axis and declination on the y-axis, both ICRS coordinates in degrees.

x = polygon_results['ra']
y = polygon_results['dec']
plt.plot(x, y, 'ko')

plt.xlabel('ra (degree ICRS)')
plt.ylabel('dec (degree ICRS)')
<Figure size 432x288 with 1 Axes>

Scatter plot of right ascension and declination in ICRS coordinates, demonstrating overplotting.

The arguments to plt.plot are x, y, and a string that specifies the style. In this case, the letters ko indicate that we want a black, round marker (k is for black because b is for blue). The functions xlabel and ylabel put labels on the axes.

Looking at this plot, we can see that the region we selected, which is a rectangle in GD-1 coordinates, is a non-rectangular region in ICRS coordinates.

However, this scatter plot has a problem. It is “overplotted”, which means that there are so many overlapping points, we cannot distinguish between high and low density areas.

To fix this, we can provide optional arguments to control the size and transparency of the points.

Exercise (5 minutes)

In the call to plt.plot, use the keyword argument markersize to make the markers smaller.

Then add the keyword argument alpha to make the markers partly transparent.

Adjust these arguments until you think the figure shows the data most clearly.

Note: Once you have made these changes, you might notice that the figure shows stripes with lower density of stars. These stripes are caused by the way Gaia scans the sky, which you can read about here. The dataset we are using, Gaia Data Release 2, covers 22 months of observations; during this time, some parts of the sky were scanned more than others.

Solution

x = polygon_results['ra']
y = polygon_results['dec']
plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)

plt.xlabel('ra (degree ICRS)')
plt.ylabel('dec (degree ICRS)')

Transform back

Remember that we selected data from a rectangle of coordinates in the GD-1 frame, then transformed them to ICRS when we constructed the query. The coordinates in the query results are in ICRS.

To plot them, we will transform them back to the GD-1 frame; that way, the axes of the figure are aligned with the orbit of GD-1, which is useful for two reasons:

  • The stream appears as a roughly horizontal feature, which makes it easier to see, and

  • The rectangles we use to select stars are aligned with the axes, which makes them easier to define and interpret.

To do the transformation, we will put the results into a SkyCoord object. In a previous episode, we created a SkyCoord object like this:

skycoord = SkyCoord(ra=polygon_results['ra'], dec=polygon_results['dec'])

Notice that we did not specify the reference frame. That is because when using ra and dec in SkyCoord, the ICRS frame is assumed by default.

The SkyCoord object can keep track not just of location, but also proper motions. This means that we can initialize a SkyCoord object with location and proper motions, then use all of these quantities together to transform into the GD-1 frame.

Now we are going to do something similar, but this time we will take advantage of the SkyCoord object’s capacity to include and track space motion information in addition to ra and dec. We will now also include:

distance = 8 * u.kpc
radial_velocity= 0 * u.km/u.s

skycoord = SkyCoord(ra=polygon_results['ra'], 
                    dec=polygon_results['dec'],
                    pm_ra_cosdec=polygon_results['pmra'],
                    pm_dec=polygon_results['pmdec'], 
                    distance=distance, 
                    radial_velocity=radial_velocity)

For the first four arguments, we use columns from polygon_results.

For distance and radial_velocity we use constants, which we explain in the section on reflex correction.

The result is an Astropy SkyCoord object, which we can transform to the GD-1 frame.

transformed = skycoord.transform_to(gd1_frame)

The result is another SkyCoord object, now in the GD-1 frame.

Reflex Correction

The next step is to correct the proper motion measurements for the effect of the motion of our solar system around the Galactic center.

When we created skycoord, we provided constant values for distance and radial_velocity rather than measurements from Gaia.

That might seem like a strange thing to do, but here is the motivation:

  • The stars in GD-1 are so far away that their measured parallaxes are small and noisy, so distances computed from them would be unreliable; instead we use a constant distance of 8 kpc, which is approximately the distance of GD-1.

  • Most of the stars in our sample do not have radial velocity measurements, so we use 0 as a placeholder; the corrected proper motions do not depend on it.

With this preparation, we can use reflex_correct from Gala (documentation here) to correct for the motion of the solar system.

from gala.coordinates import reflex_correct

skycoord_gd1 = reflex_correct(transformed)

The result is a SkyCoord object that contains:

  • phi1 and phi2, the coordinates in the GD-1 frame, and

  • pm_phi1_cosphi2 and pm_phi2, the proper motions in the GD-1 frame, corrected for the motion of the solar system.

We can select the coordinates and plot them like this:

x = skycoord_gd1.phi1
y = skycoord_gd1.phi2
plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)

plt.xlabel('phi1 (degree GD1)')
plt.ylabel('phi2 (degree GD1)')
<Figure size 432x288 with 1 Axes>

Scatter plot of phi1 versus phi2 in GD-1 coordinates, showing selected region is rectangular.

We started with a rectangle in the GD-1 frame. When transformed to the ICRS frame, it is a non-rectangular region. Now, transformed back to the GD-1 frame, it is a rectangle again.

Pandas DataFrame

At this point we have two objects containing different subsets of the data we need to identify stars in GD-1. polygon_results is the Astropy Table we downloaded from Gaia.

type(polygon_results)
astropy.table.table.Table

And skycoord_gd1 is a SkyCoord object that contains the transformed coordinates and proper motions.

type(skycoord_gd1)
astropy.coordinates.sky_coordinate.SkyCoord

On one hand, this division of labor makes sense because each object provides different capabilities. But working with multiple object types can be awkward. It will be more convenient to choose one object and get all of the data into it.

Now we can extract the columns we want from skycoord_gd1 and add them as columns in the Astropy Table polygon_results. phi1 and phi2 contain the transformed coordinates.

polygon_results['phi1'] = skycoord_gd1.phi1
polygon_results['phi2'] = skycoord_gd1.phi2
polygon_results.info()
<Table length=140339>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
 parallax float64      mas                                                           Parallax
     phi1 float64      deg                                                                   
     phi2 float64      deg                                                                   

pm_phi1_cosphi2 and pm_phi2 contain the components of proper motion in the transformed frame.

polygon_results['pm_phi1'] = skycoord_gd1.pm_phi1_cosphi2
polygon_results['pm_phi2'] = skycoord_gd1.pm_phi2
polygon_results.info()
<Table length=140339>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
 parallax float64      mas                                                           Parallax
     phi1 float64      deg                                                                   
     phi2 float64      deg                                                                   
  pm_phi1 float64 mas / yr                                                                   
  pm_phi2 float64 mas / yr     

Detail

If you notice that SkyCoord has an attribute called proper_motion, you might wonder why we are not using it.

We could have: proper_motion contains the same data as pm_phi1_cosphi2 and pm_phi2, but in a different format.
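
For example (a quick sketch using the object defined above), proper_motion bundles both components into a single Quantity whose first axis has length 2:

pm = skycoord_gd1.proper_motion
pm.shape   # (2, 140339)

Keeping the two components as separate named columns is more convenient for the DataFrame we are about to build.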

Pandas DataFrames versus Astropy Tables

Two common choices are the Pandas DataFrame and the Astropy Table. They share many characteristics, and most of the manipulations we do can be done with either. As you become more familiar with each, you will develop a sense of which one you prefer for different tasks. For instance, you may choose Astropy Tables to read in data, especially astronomy-specific data formats, but Pandas DataFrames to inspect the data. Fortunately, Astropy makes it easy to convert between the two. We will use the Pandas DataFrame, for two reasons:

  1. It provides capabilities that are (almost) a superset of the other data structures, so it is the all-in-one solution.

  2. Pandas is a general-purpose tool that is useful in many domains, especially data science. If you are going to develop expertise in one tool, Pandas is a good choice.

However, compared to an Astropy Table, Pandas has one big drawback: it does not keep the metadata associated with the table, including the units for the columns. Nevertheless, we think it is a useful data type to be familiar with.

It is straightforward to convert an Astropy Table to a Pandas DataFrame.

import pandas as pd

results_df = polygon_results.to_pandas()

DataFrame provides shape, which shows the number of rows and columns.

results_df.shape
(140339, 10)

It also provides head, which displays the first few rows. head is useful for spot-checking large results as you go along.

results_df.head()
            source_id          ra        dec       pmra       pmdec   parallax        phi1       phi2    pm_phi1     pm_phi2
0  637987125186749568  142.483019  21.757716  -2.516838    2.941813  -0.257345  -54.975623  -3.659349   6.429945    6.518157
1  638285195917112960  142.254529  22.476168   2.662702  -12.165984   0.422728  -54.498247  -3.081524  -3.168637   -6.206795
2  638073505568978688  142.645286  22.166932  18.306747   -7.950660   0.103640  -54.551634  -3.554229   9.129447  -16.819570
3  638086386175786752  142.577394  22.227920   0.987786   -2.584105  -0.857327  -54.536457  -3.467966   3.837120    0.526461
4  638049655615392384  142.589136  22.110783   0.244439   -4.941079   0.099625  -54.627448  -3.542738   1.466103   -0.185292

Attributes vs functions

shape is an attribute, so we display its value without calling it as a function.

head is a function, so we need the parentheses.

Before we go any further, we will consolidate the steps we have done so far into a single function. It takes the coordinates and proper motions we get as an Astropy Table from our Gaia query, adds columns containing the reflex-corrected GD-1 coordinates and proper motions, and converts the result into a Pandas DataFrame. This is a general function that we will use multiple times as we build different queries, so we want to write it once and then call it, rather than copying and pasting the code over and over.

def make_dataframe(table):
    """Transform coordinates from ICRS to GD-1 frame.
    
    table: Astropy Table
    
    returns: Pandas DataFrame
    """
    # Create a SkyCoord object with the coordinates and proper motions
    # in the input table
    skycoord = SkyCoord(
               ra=table['ra'], 
               dec=table['dec'],
               pm_ra_cosdec=table['pmra'],
               pm_dec=table['pmdec'], 
               distance=8*u.kpc, 
               radial_velocity=0*u.km/u.s)

    # Define the GD-1 reference frame
    gd1_frame = GD1Koposov10()

    # Transform input coordinates to the GD-1 reference frame
    transformed = skycoord.transform_to(gd1_frame)

    # Correct GD-1 coordinates for solar system motion around galactic center
    skycoord_gd1 = reflex_correct(transformed)

    # Add GD-1 reference frame columns for coordinates and proper motions
    table['phi1'] = skycoord_gd1.phi1
    table['phi2'] = skycoord_gd1.phi2
    table['pm_phi1'] = skycoord_gd1.pm_phi1_cosphi2
    table['pm_phi2'] = skycoord_gd1.pm_phi2

    # Create DataFrame
    df = table.to_pandas()

    return df

Here is how we use the function:

results_df = make_dataframe(polygon_results)

Saving the DataFrame

At this point we have run a successful query and combined the results into a single DataFrame. This is a good time to save the data.

To save a Pandas DataFrame, one option is to convert it to an Astropy Table, like this:

from astropy.table import Table

results_table = Table.from_pandas(results_df)
type(results_table)
astropy.table.table.Table

Then we could write the Table to a FITS file, as we did in the previous lesson.
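
For example (with a hypothetical file name), that would look like this:

results_table.write('gd1_results.fits', overwrite=True)  # hypothetical file name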

But, like Astropy, Pandas provides functions to write DataFrames in other formats; to see what they are, find the functions here that begin with to_.

One of the best options is HDF5, which is Version 5 of Hierarchical Data Format.

HDF5 is a binary format, so files are small and fast to read and write (like FITS, but unlike XML).

An HDF5 file is similar to an SQL database in the sense that it can contain more than one table, although in HDF5 vocabulary, a table is called a Dataset. (Multi-extension FITS files can also contain more than one table.)

And HDF5 stores the metadata associated with the table, including column names, row labels, and data types (like FITS).

Finally, HDF5 is a cross-language standard, so if you write an HDF5 file with Pandas, you can read it back with many other software tools (more than FITS).

We can write a Pandas DataFrame to an HDF5 file like this:

filename = 'gd1_data.hdf'

results_df.to_hdf(filename, 'results_df', mode='w')

Because an HDF5 file can contain more than one Dataset, we have to provide a name, or “key”, that identifies the Dataset in the file.

We could use any string as the key, but it is generally a good practice to use a descriptive name (just like your DataFrame variable name) so we will give the Dataset in the file the same name (key) as the DataFrame.

By default, writing a DataFrame appends a new dataset to an existing HDF5 file. We will use the argument mode='w' to overwrite the file if it already exists rather than append another dataset to it.
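
As a sketch of the difference (using a hypothetical file name), mode='w' starts a new file, while the default mode appends another Dataset to an existing one:

# Hypothetical file name, for illustration only
results_df.to_hdf('demo.hdf', 'results_df', mode='w')   # overwrite: start a new file
results_df.to_hdf('demo.hdf', 'results_df_copy')        # default mode='a': add a second Dataset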

Summary

In this episode, we re-loaded the Gaia data we saved from a previous query.

We transformed the coordinates and proper motion from ICRS to a frame aligned with the orbit of GD-1, stored the results in a Pandas DataFrame, and visualized them.

We combined all of these steps into a single function that we can reuse in the future to go straight from the output of a query with object coordinates in the ICRS reference frame directly to a Pandas DataFrame that includes object coordinates in the GD-1 reference frame.

We saved our results to an HDF5 file which we can use to restart the analysis from this stage or verify our results at some future time.

Key Points

  • When you make a scatter plot, adjust the size of the markers and their transparency so the figure is not overplotted; otherwise it can misrepresent the data badly.

  • For simple scatter plots in Matplotlib, plot is faster than scatter.

  • An Astropy Table and a Pandas DataFrame are similar in many ways and they provide many of the same functions. They have pros and cons, but for many projects, either one would be a reasonable choice.

  • To store data from a Pandas DataFrame, a good option is an HDF5 file.


Plotting and Pandas

Overview

Teaching: 65 min
Exercises: 20 min
Questions
  • How do we efficiently explore our data and identify appropriate filters to produce a clean sample (in this case, of GD-1 stars)?

Objectives
  • Use a Boolean Pandas Series to select rows in a DataFrame.

  • Save multiple DataFrames in an HDF5 file.

In the previous episode, we wrote a query to select stars from the region of the sky where we expect GD-1 to be, and saved the results in FITS and HDF5 files.

Now we will read that data back in and implement the next step in the analysis, identifying stars with the proper motion we expect for GD-1.

Outline

  1. We will put those results into a Pandas DataFrame, which we will use to select stars near the centerline of GD-1.

  2. Plotting the proper motion of those stars, we will identify a region of proper motion for stars that are likely to be in GD-1.

  3. Finally, we will select and plot the stars whose proper motion is in that region.

Starting from this episode

If you are starting a new notebook for this episode, expand this section for information you will need to get started.

Read me

Previously, we ran a query on the Gaia server, downloaded data for roughly 140,000 stars, transformed the coordinates to the GD-1 reference frame, and saved the results in an HDF5 file (Dataset name results_df). We will use that data for this episode. Whether you are working from a new notebook or coming back from a checkpoint, reloading the data will save you from having to run the query again.

If you are starting this episode here or starting this episode in a new notebook, you will need to run the following lines of code.

This imports previously imported functions:

import astropy.units as u
import matplotlib.pyplot as plt
import pandas as pd

from episode_functions import *

The following code loads in the data (instructions for downloading data can be found in the setup instructions). You may need to add the path to the filename variable below (e.g. filename = 'student_download/data/gd1_data.hdf')

filename = 'gd1_data.hdf'
results_df = pd.read_hdf(filename, 'results_df')

Exploring data

One benefit of using Pandas is that it provides functions for exploring the data and checking for problems. One of the most useful of these functions is describe, which computes summary statistics for each column.

results_df.describe()
          source_id             ra            dec           pmra  \
count  1.403390e+05  140339.000000  140339.000000  140339.000000   
mean   6.792399e+17     143.823122      26.780285      -2.484404   
std    3.792177e+16       3.697850       3.052592       5.913939   
min    6.214900e+17     135.425699      19.286617    -106.755260   
25%    6.443517e+17     140.967966      24.592490      -5.038789   
50%    6.888060e+17     143.734409      26.746261      -1.834943   
75%    6.976579e+17     146.607350      28.990500       0.452893   
max    7.974418e+17     152.777393      34.285481     104.319923   

               pmdec       parallax           phi1           phi2  \
[Output truncated]

Exercise (10 minutes)

Review the summary statistics in this table.

  • Do the values make sense based on what you know about the context?

  • Do you see any values that seem problematic, or evidence of other data issues?

Solution

The most noticeable issue is that some of the parallax values are negative, which seems non-physical.

Negative parallaxes in the Gaia database can arise from a number of causes, such as source confusion (large negative values) and systematic errors in the parallax zero point (small negative values).

Fortunately, we do not use the parallax measurements in the analysis (one of the reasons we used constant distance for reflex correction).
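
If you are curious how common negative parallaxes are in this sample, a quick count using the results_df defined above looks like this:

(results_df['parallax'] < 0).sum()   # number of stars with negative parallax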

Plot proper motion

Now we are ready to replicate one of the panels in Figure 1 of the Price-Whelan and Bonaca paper, the one that shows components of proper motion as a scatter plot:

Scatter of proper motion phi1 versus phi2 showing overdensity in negative proper motions of GD-1 stars.

In this figure, the shaded area identifies stars that are likely to be in GD-1, because their proper motions cluster in a region that is clearly separated from the bulk of the field stars.

By plotting proper motion in the GD-1 frame, we hope to find this cluster. Then we will use the bounds of the cluster to select stars that are more likely to be in GD-1.

The following figure is a scatter plot of proper motion, in the GD-1 frame, for the stars in results_df.

x = results_df['pm_phi1']
y = results_df['pm_phi2']
plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)
    
plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')
plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)')
<Figure size 432x288 with 1 Axes>

Scatter plot of proper motion in GD-1 frame of selected stars showing most are near the origin.

Most of the proper motions are near the origin, but there are a few extreme values. Following the example in the paper, we will use xlim and ylim to zoom in on the region near the origin.

x = results_df['pm_phi1']
y = results_df['pm_phi2']
plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)
    
plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')
plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)')

plt.xlim(-12, 8)
plt.ylim(-10, 10)
<Figure size 432x288 with 1 Axes>

Zoomed in view of previous scatter plot showing overdense region.

There is a hint of an overdense region near (-7.5, 0), but if you did not know where to look, you would miss it.

To see the cluster more clearly, we need a sample that contains a higher proportion of stars in GD-1. We will do that by selecting stars close to the centerline.

Selecting the centerline

As we can see in the following figure, many stars in GD-1 are less than 1 degree from the line phi2=0.

Scatter plot with selection on proper motion and photometry showing many stars in GD-1 are within 1 degree of phi2 = 0.

Stars near this line have the highest probability of being in GD-1.

To select them, we will use a “Boolean mask”. We will start by selecting the phi2 column from the DataFrame:

phi2 = results_df['phi2']
type(phi2)
pandas.core.series.Series

The result is a Series, which is the structure Pandas uses to represent columns.

We can use a comparison operator, >, to compare the values in a Series to a constant.

phi2_min = -1.0 * u.degree
phi2_max = 1.0 * u.degree

mask = (phi2 > phi2_min)
type(mask)
pandas.core.series.Series

The result is a Series of Boolean values, that is, True and False.

mask.head()
0    False
1    False
2    False
3    False
4    False
Name: phi2, dtype: bool

To select values that fall between phi2_min and phi2_max, we will use the & operator, which computes “logical AND”. The result is true where elements from both Boolean Series are true.

mask = (phi2 > phi2_min) & (phi2 < phi2_max)

Logical operators

Python’s logical operators (and, or, and not) do not work elementwise on NumPy arrays or Pandas Series. Instead, both libraries use the bitwise operators (&, |, and ~) for elementwise logical operations (explanation here).

Also, we need the parentheses around the conditions; otherwise the order of operations is incorrect.
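
Here is a small self-contained example of elementwise logical operations, using a toy NumPy array:

import numpy as np

a = np.array([1, 2, 3, 4])
mask = (a > 1) & (a < 4)   # parentheses required: & binds more tightly than >
print(mask)                # [False  True  True False]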

The sum of a Boolean Series is the number of True values, so we can use sum to see how many stars are in the selected region.

mask.sum()
25084

A Boolean Series is sometimes called a “mask” because we can use it to mask out some of the rows in a DataFrame and select the rest, like this:

centerline_df = results_df[mask]
type(centerline_df)
pandas.core.frame.DataFrame

centerline_df is a DataFrame that contains only the rows from results_df that correspond to True values in mask. So it contains the stars near the centerline of GD-1.

We can use len to see how many rows are in centerline_df:

len(centerline_df)
25084

And compute what fraction of the rows we have selected:

len(centerline_df) / len(results_df)
0.1787386257562046

There are about 25,000 stars in this region, about 18% of the total.

Plotting proper motion

This is the second time we are plotting proper motion, and we can imagine we might do it a few more times. Instead of copying and pasting the previous code, we will write a function that we can reuse on any dataframe.

def plot_proper_motion(df):
    """Plot proper motion.
    
    df: DataFrame with `pm_phi1` and `pm_phi2`
    """
    x = df['pm_phi1']
    y = df['pm_phi2']
    plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

    plt.xlabel('Proper motion phi1 (mas/yr)')
    plt.ylabel('Proper motion phi2 (mas/yr)')

    plt.xlim(-12, 8)
    plt.ylim(-10, 10)

And we can call it like this:

plot_proper_motion(centerline_df)
<Figure size 432x288 with 1 Axes>

Scatter plot of proper motion of selected stars showing cluster near (-7.5, 0).

Now we can see more clearly that there is a cluster near (-7.5, 0).

You might notice that our figure is less dense than the one in the paper. That is because we started with a set of stars from a relatively small region. The figure in the paper is based on a region about 10 times bigger.

In the next episode we will go back and select stars from a larger region. But first we will use the proper motion data to identify stars likely to be in GD-1.

Filtering based on proper motion

The next step is to select stars in the “overdense” region of proper motion, which are candidates to be in GD-1.

In the original paper, Price-Whelan and Bonaca used a polygon to cover this region, as shown in this figure.

Scatter plot of proper motion with overlaid polygon showing overdense region selected for analysis in Price-Whelan and Bonaca paper.

We will use a simple rectangle for now, but in a later lesson we will see how to select a polygonal region as well.

Here are bounds on proper motion we chose by eye:

pm1_min = -8.9
pm1_max = -6.9
pm2_min = -2.2
pm2_max =  1.0

To draw these bounds, we will use the make_rectangle function we wrote in episode 2 to make two lists containing the coordinates of the corners of the rectangle.

pm1_rect, pm2_rect = make_rectangle(
    pm1_min, pm1_max, pm2_min, pm2_max)
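
If make_rectangle is not already defined in your notebook, here is a minimal sketch consistent with how it is used in this lesson (the original was written in episode 2):

def make_rectangle(x1, x2, y1, y2):
    """Return the corners of a rectangle as two lists,
    repeating the first corner so a plotted line closes the loop."""
    xs = [x1, x1, x2, x2, x1]
    ys = [y1, y2, y2, y1, y1]
    return xs, ys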

Here is what the plot looks like with the bounds we chose.

plot_proper_motion(centerline_df)
plt.plot(pm1_rect, pm2_rect, '-')
<Figure size 432x288 with 1 Axes>

Scatter plot of proper motion with blue box showing overdense region selected for our analysis.

Now that we have identified the bounds of the cluster in proper motion, we will use them to select rows from results_df.

We will use the following function, which uses Pandas operators to make a mask that selects rows where series falls between low and high.

def between(series, low, high):
    """Check whether values are between `low` and `high`."""
    return (series > low) & (series < high)

The following mask selects stars with proper motion in the region we chose.

pm1 = results_df['pm_phi1']
pm2 = results_df['pm_phi2']

pm_mask = (between(pm1, pm1_min, pm1_max) & 
           between(pm2, pm2_min, pm2_max))

Again, the sum of a Boolean Series is the number of True values.

pm_mask.sum()
1049

Now we can use this mask to select rows from results_df.

selected_df = results_df[pm_mask]
len(selected_df)
1049

These are the stars we think are likely to be in GD-1. We can inspect these stars, plotting their coordinates (not their proper motion).

x = selected_df['phi1']
y = selected_df['phi2']
plt.plot(x, y, 'ko', markersize=1, alpha=1)

plt.xlabel('phi1 (degree GD1)')
plt.ylabel('phi2 (degree GD1)')
<Figure size 432x288 with 1 Axes>

Scatter plot of coordinates of stars in selected region, showing tidal stream.

Now that is starting to look like a tidal stream!

To clean up the plot a little bit we can add two new Matplotlib commands: plt.title, which puts a title above the figure, and plt.axis('equal'), which makes the scales of the two axes equal.

In an example like this, where x and y represent coordinates in space, equal axes ensures that the distance between points is represented accurately. Since we are now constraining the relative proportions of our axes, the data may not fill the entire figure.

x = selected_df['phi1']
y = selected_df['phi2']

plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

plt.xlabel('phi1 [deg]')
plt.ylabel('phi2 [deg]')
plt.title('Proper motion selection', fontsize='medium')

plt.axis('equal')
<Figure size 432x288 with 1 Axes>

Scatter plot of coordinates of stars in selected region, showing tidal stream with equally proportioned axes.

Before we go any further, we will put the code we wrote to make one of the panel figures into a function that we will use in future episodes to recreate this entire plot with a single line of code.

def plot_pm_selection(df):
    """Plot in GD-1 spatial coordinates the location of the stars
    selected by proper motion
    """
    x = df['phi1']
    y = df['phi2']

    plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

    plt.xlabel('phi1 [deg]')
    plt.ylabel('phi2 [deg]')
    plt.title('Proper motion selection', fontsize='medium')

    plt.axis('equal')

Now our one line plot command is:

plot_pm_selection(selected_df)

Saving the DataFrame

At this point we have run a successful query and cleaned up the results. This is a good time to save the data. We have already started a results file called gd1_data.hdf, to which we wrote results_df.

Recall that we chose HDF5 because it is a binary format, so it produces small files that are fast to read and write, and it is a cross-language standard. Additionally, HDF5 files can contain more than one Dataset and can store metadata associated with each Dataset (such as column names or observatory information, like a FITS header).

We can add this DataFrame to the existing HDF5 file by omitting the mode='w' keyword, like this:

filename = 'gd1_data.hdf'

selected_df.to_hdf(filename, 'selected_df')

As before, because an HDF5 file can contain more than one Dataset, we provide a key that identifies the Dataset in the file, and we give it the same descriptive name as the DataFrame.

Exercise (5 minutes)

We are going to need centerline_df later as well. Write a line of code to add it as a second Dataset in the HDF5 file.

Hint: Since the file already exists, you should not use mode='w'.

Solution

centerline_df.to_hdf(filename, 'centerline_df')

We can use getsize to confirm that the file exists and to check its size. getsize returns a value in bytes. For the file sizes we are looking at, it is more convenient to view the size in megabytes (MB), so we will divide by 1024*1024.

from os.path import getsize

MB = 1024 * 1024
getsize(filename) / MB
2.2084197998046875

If you forget what the names of the Datasets in the file are, you can read them back like this:

with pd.HDFStore(filename) as hdf:
    print(hdf.keys())
['/centerline_df', '/results_df', '/selected_df']

Context Managers

We use a with statement here to open the file before the print statement and (automatically) close it after. Read more about context managers.
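
For comparison, the with statement above is roughly equivalent to this explicit version (a sketch):

hdf = pd.HDFStore(filename)
try:
    print(hdf.keys())
finally:
    hdf.close()   # runs even if the print raises an exception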

The keys are the names of the Datasets, which makes it easy to remember which DataFrame is in which Dataset.

Summary

In this episode, we re-loaded the transformed Gaia data we saved from a previous query.

Then we prototyped the selection process from the Price-Whelan and Bonaca paper locally, using data that we had already downloaded: we selected stars near the centerline of GD-1, plotted their proper motions to identify an overdense region associated with GD-1, and selected the stars whose proper motion falls in that region.

So far, we have used data from a relatively small region of the sky so that our local dataset was not too big. In the next lesson, we will write a query that selects stars based on the proper motion limits we identified in this lesson, which will allow us to explore a larger region.

Key Points

  • A workflow is often prototyped on a small set of data which can be explored more easily and used to identify ways to limit a dataset to exactly the data you want.

  • To store data from a Pandas DataFrame, a good option is an HDF5 file, which can contain multiple Datasets.


Transform and Select

Overview

Teaching: 60 min
Exercises: 15 min
Questions
  • How do we move the computation to the data?

Objectives
  • Transform proper motions from one frame to another.

  • Compute the convex hull of a set of points.

  • Write an ADQL query that selects based on proper motion.

In the previous episode, we identified stars with the proper motion we expect for GD-1.

Now we will do the same selection in an ADQL query, which will make it possible to work with a larger region of the sky and still download less data.

Outline

  1. Using data from the previous episode, we will identify the values of proper motion for stars likely to be in GD-1.

  2. Then we will compose an ADQL query that selects stars based on proper motion, so we can download only the data we need.

That will make it possible to search a bigger region of the sky in a single query.

Starting from this episode

If you are starting a new notebook for this episode, expand this section for information you will need to get started.

Read me

Previously, we ran a query on the Gaia server, downloaded data for roughly 140,000 stars, and saved the data in a FITS file. We then selected just the stars with the same proper motion as GD-1 and saved the results to an HDF5 file. We will use that data for this episode. Whether you are working from a new notebook or coming back from a checkpoint, reloading the data will save you from having to run the query again.

If you are starting this episode here or starting this episode in a new notebook, you will need to run the following lines of code.

This imports previously imported functions:

import astropy.units as u
from astropy.coordinates import SkyCoord
from astroquery.gaia import Gaia
from gala.coordinates import GD1Koposov10, GD1, reflex_correct
import matplotlib.pyplot as plt
import pandas as pd

from episode_functions import *

The following code loads in the data (instructions for downloading data can be found in the setup instructions). You may need to add the path to the filename variable below (e.g. filename = 'student_download/data/gd1_data.hdf')

filename = 'gd1_data.hdf'
centerline_df = pd.read_hdf(filename, 'centerline_df')
selected_df = pd.read_hdf(filename, 'selected_df')

This defines previously defined quantities:

pm1_min = -8.9
pm1_max = -6.9
pm2_min = -2.2
pm2_max =  1.0

pm1_rect, pm2_rect = make_rectangle(
    pm1_min, pm1_max, pm2_min, pm2_max)
gd1_frame = GD1Koposov10()

Selection by proper motion

Let us review how we got to this point.

  1. We made an ADQL query to the Gaia server to get data for stars in the vicinity of a small part of GD-1.

  2. We transformed the coordinates to the GD-1 frame (GD1Koposov10) so we could select stars along the centerline of GD-1.

  3. We plotted the proper motion of stars along the centerline of GD-1 to identify the bounds of an anomalous overdense region associated with the proper motion of stars in GD-1.

  4. We made a mask that selects stars whose proper motion is in this overdense region and which are therefore likely to be part of the GD-1 stream.

At this point we have downloaded data for a relatively large number of stars (more than 100,000) and selected a relatively small number (around 1000).

It would be more efficient to use ADQL to select only the stars we need. That would also make it possible to download data covering a larger region of the sky.

However, the selection we did was based on proper motion in the GD-1 frame. In order to do the same selection on the Gaia catalog in ADQL, we have to work with proper motions in the ICRS frame as this is the frame that the Gaia catalog uses.

First, we will verify that our proper motion selection was correct, starting with the plot_proper_motion function that we defined in episode 3. The following figure shows the proper motion of the stars near the centerline of GD-1, the rectangle we selected, and the stars whose proper motion falls inside the rectangle (plotted in green).

plot_proper_motion(centerline_df)

plt.plot(pm1_rect, pm2_rect)

x = selected_df['pm_phi1']
y = selected_df['pm_phi2']
plt.plot(x, y, 'gx', markersize=0.3, alpha=0.3);
<Figure size 432x288 with 1 Axes>

Proper motion of stars in GD-1, showing selected region as blue box and stars within selection as green points.

Now we will make the same plot using proper motions in the ICRS frame, which are stored in columns named pmra and pmdec.

x = centerline_df['pmra']
y = centerline_df['pmdec']
plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

x = selected_df['pmra']
y = selected_df['pmdec']
plt.plot(x, y, 'gx', markersize=1, alpha=0.3)
    
plt.xlabel('Proper motion ra (ICRS frame)')
plt.ylabel('Proper motion dec (ICRS frame)')

plt.xlim([-10, 5])
plt.ylim([-20, 5]);
<Figure size 432x288 with 1 Axes>

Proper motion in ICRS frame, showing selected stars are more spread out in this frame.

The proper motions of the selected stars are more spread out in this frame, which is why it was preferable to do the selection in the GD-1 frame.

But now we can define a polygon that encloses the proper motions of these stars in ICRS, and use that polygon as a selection criterion in an ADQL query.

Convex Hull

SciPy provides a function that computes the convex hull of a set of points, which is the smallest convex polygon that contains all of the points.

To use this function, we will select the columns pmra and pmdec and convert them to a NumPy array.

import numpy as np

points = selected_df[['pmra','pmdec']].to_numpy()
points.shape
(1049, 2)

Older versions of Pandas

If you are using an older version of Pandas, you might not have to_numpy(). You can use values instead, like this:

points = selected_df[['pmra','pmdec']].values

We will pass the points to ConvexHull, which returns an object that contains the results.

from scipy.spatial import ConvexHull

hull = ConvexHull(points)
hull
<scipy.spatial.qhull.ConvexHull at 0x7ff6207866a0>

hull.vertices contains the indices of the points that fall on the perimeter of the hull.

hull.vertices
array([ 692,  873,  141,  303,   42,  622,   45,   83,  127,  182, 1006,
        971,  967, 1001,  969,  940], dtype=int32)

We can use them as an index into the original array to select the corresponding ICRS frame proper motion data points.

pm_vertices = points[hull.vertices]
pm_vertices
array([[ -4.05037121, -14.75623261],
       [ -3.41981085, -14.72365546],
       [ -3.03521988, -14.44357135],
       [ -2.26847919, -13.7140236 ],
       [ -2.61172203, -13.24797471],
       [ -2.73471401, -13.09054471],
       [ -3.19923146, -12.5942653 ],
       [ -3.34082546, -12.47611926],
       [ -5.67489413, -11.16083338],
       [ -5.95159272, -11.10547884],
       [ -6.42394023, -11.05981295],
[Output truncated]

To plot the resulting polygon, we have to pull out the x and y coordinates.

pmra_poly, pmdec_poly = np.transpose(pm_vertices)

Note

This use of transpose is a useful NumPy idiom to turn data that is listed as rows of (x,y) pairs into an array of x values and an array of y values. Because pm_vertices has two columns, its matrix transpose has two rows, which are assigned to the two variables pmra_poly and pmdec_poly.
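
Here is a self-contained toy example of the idiom:

import numpy as np

pairs = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])
xs, ys = np.transpose(pairs)   # xs -> [1. 2. 3.], ys -> [10. 20. 30.]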

The following figure shows proper motion in ICRS again, along with the convex hull we just computed.

x = centerline_df['pmra']
y = centerline_df['pmdec']
plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

x = selected_df['pmra']
y = selected_df['pmdec']
plt.plot(x, y, 'gx', markersize=0.3, alpha=0.3)

plt.plot(pmra_poly, pmdec_poly)
    
plt.xlabel('Proper motion ra (ICRS frame)')
plt.ylabel('Proper motion dec (ICRS frame)')

plt.xlim([-10, 5])
plt.ylim([-20, 5]);
<Figure size 432x288 with 1 Axes>

Proper motion in ICRS, with convex hull shown as blue boundary and selected stars as green points.

So pm_vertices represents the polygon we want to select. The next step is to use this polygon as part of an ADQL query.

Assembling the query

In episode 2 we used the following query to select stars in a polygonal region around a small part of GD-1 with a few filters on color and distance (parallax):

candidate_coord_query_base = """SELECT
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON({sky_point_list}))
"""

In this episode we will make two changes:

  1. We will select stars with coordinates in a larger region to include more of GD-1.

  2. We will add another clause to select stars whose proper motion is in the polygon we just computed, pm_vertices.

The fact that we remove most contaminating stars with the proper motion filter is what allows us to expand our query to include most of GD-1 without returning too many results. As we did in episode 2, we will define the physical region we want to select in the GD-1 frame and transform it to the ICRS frame to query the Gaia catalog which is in the ICRS frame.

Here are the coordinates of the larger rectangle in the GD-1 frame.

phi1_min = -70 * u.degree
phi1_max = -20 * u.degree
phi2_min = -5 * u.degree
phi2_max = 5 * u.degree

We selected these bounds by trial and error, defining the largest region we can process in a single query.

phi1_rect, phi2_rect = make_rectangle(
    phi1_min, phi1_max, phi2_min, phi2_max)

Here is how we transform it to ICRS, as we saw in episode 2.

corners = SkyCoord(phi1=phi1_rect, 
                   phi2=phi2_rect, 
                   frame=gd1_frame)

corners_icrs = corners.transform_to('icrs')

To use corners_icrs as part of an ADQL query, we have to convert it to a string. Fortunately, in episode 2 we wrote a function, skycoord_to_string, to do exactly this, which we will call now.

sky_point_list = skycoord_to_string(corners_icrs)
sky_point_list
'135.306, 8.39862, 126.51, 13.4449, 163.017, 54.2424, 172.933, 46.4726, 135.306, 8.39862'
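
If skycoord_to_string is not defined in your notebook, here is a sketch consistent with the output above (it relies on SkyCoord.to_string returning one space-separated 'ra dec' string per point):

def skycoord_to_string(skycoord):
    """Convert an array of SkyCoord to a comma-separated string of
    alternating ra and dec values, as ADQL's POLYGON expects."""
    corners_list_str = skycoord.to_string()
    corners_single_str = ' '.join(corners_list_str)
    return corners_single_str.replace(' ', ', ')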

Here are the columns we want to select.

columns = 'source_id, ra, dec, pmra, pmdec'

Now we have everything we need to assemble the query, but DO NOT try to run it yet. Because it selects a larger region, there are too many stars to handle in a single query, at least until we also select by proper motion.

candidate_coord_query = candidate_coord_query_base.format(columns=columns, 
                            sky_point_list=sky_point_list)
print(candidate_coord_query)
SELECT
source_id, ra, dec, pmra, pmdec
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON(135.306, 8.39862, 126.51, 13.4449, 163.017, 54.2424, 172.933, 46.4726, 135.306, 8.39862))

Selecting proper motion

Now we are ready to add a WHERE clause to select stars whose proper motion falls in the polygon defined by pm_vertices.

To use pm_vertices as part of an ADQL query, we have to convert it to a string. Using flatten to convert from a 2D array to a 1D array and array2string to convert the result from an array to a string, we can almost get the format we need.

s = np.array2string(pm_vertices.flatten(), 
                    max_line_width=1000,
                    separator=',')
s
'[ -4.05037121,-14.75623261, -3.41981085,-14.72365546, -3.03521988,-14.44357135, -2.26847919,-13.7140236 , -2.61172203,-13.24797471, -2.73471401,-13.09054471, -3.19923146,-12.5942653 , -3.34082546,-12.47611926, -5.67489413,-11.16083338, -5.95159272,-11.10547884, -6.42394023,-11.05981295, -7.09631023,-11.95187806, -7.30641519,-12.24559977, -7.04016696,-12.88580702, -6.00347705,-13.75912098, -4.42442296,-14.74641176]'

But we need to remove the brackets:

pm_point_list = s.strip('[]')
pm_point_list
' -4.05037121,-14.75623261, -3.41981085,-14.72365546, -3.03521988,-14.44357135, -2.26847919,-13.7140236 , -2.61172203,-13.24797471, -2.73471401,-13.09054471, -3.19923146,-12.5942653 , -3.34082546,-12.47611926, -5.67489413,-11.16083338, -5.95159272,-11.10547884, -6.42394023,-11.05981295, -7.09631023,-11.95187806, -7.30641519,-12.24559977, -7.04016696,-12.88580702, -6.00347705,-13.75912098, -4.42442296,-14.74641176'
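
Alternatively, you could build the same comma-separated string directly, without array2string (a sketch; the numeric formatting may differ slightly from the output above):

pm_point_list = ', '.join(str(v) for v in pm_vertices.flatten())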

Exercise (10 minutes)

Define candidate_coord_pm_query_base, starting with candidate_coord_query_base and adding a new clause to select stars whose coordinates of proper motion, pmra and pmdec, fall within the polygon defined by pm_point_list.

Solution

candidate_coord_pm_query_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON({sky_point_list}))
  AND 1 = CONTAINS(POINT(pmra, pmdec),
                   POLYGON({pm_point_list}))
"""

Exercise (5 minutes)

Use format to format candidate_coord_pm_query_base and define candidate_coord_pm_query, filling in the values of columns, sky_point_list, and pm_point_list.

Solution

candidate_coord_pm_query = candidate_coord_pm_query_base.format(columns=columns, 
                            sky_point_list=sky_point_list,
                            pm_point_list=pm_point_list)
print(candidate_coord_pm_query)

Now we can run the query like this:

candidate_coord_pm_job = Gaia.launch_job_async(candidate_coord_pm_query)
print(candidate_coord_pm_job)
INFO: Query finished. [astroquery.utils.tap.core]
<Table length=7345>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
Jobid: 1616771462206O
Phase: COMPLETED
[Output truncated]

And get the results.

candidate_gaia_table = candidate_coord_pm_job.get_results()
len(candidate_gaia_table)
7345

We call the results candidate_gaia_table because it contains information from the Gaia table for stars that are good candidates for GD-1.

sky_point_list and pm_point_list are selection criteria that we derived from data downloaded from the Gaia Database. To make sure we can repeat our analysis at a later date, we should save both lists to a file. There are several ways we could do that, but since we are already storing data in an HDF5 file, we will do the same with these variables.

To save them to an HDF5 file we first need to put them in a Pandas object. We have seen how to create a Series from a column in a DataFrame. Now we will build a Series from scratch. We do not need the full DataFrame format with multiple rows and columns because we are only storing two strings (sky_point_list and pm_point_list). We can store each string as a row in the Series and save it. One aspect that is nice about Series is that we can label each row. To do this we need an object that can define both the name of each row and the data to go in that row. We can use a Python Dictionary for this, defining the row names with the dictionary keys and the row data with the dictionary values.

d = dict(sky_point_list=sky_point_list, pm_point_list=pm_point_list)
d
{'sky_point_list': '135.306, 8.39862, 126.51, 13.4449, 163.017, 54.2424, 172.933, 46.4726, 135.306, 8.39862',
 'pm_point_list': ' -4.05037121,-14.75623261, -3.41981085,-14.72365546, -3.03521988,-14.44357135, -2.26847919,-13.7140236 , -2.61172203,-13.24797471, -2.73471401,-13.09054471, -3.19923146,-12.5942653 , -3.34082546,-12.47611926, -5.67489413,-11.16083338, -5.95159272,-11.10547884, -6.42394023,-11.05981295, -7.09631023,-11.95187806, -7.30641519,-12.24559977, -7.04016696,-12.88580702, -6.00347705,-13.75912098, -4.42442296,-14.74641176'}

And use this Dictionary to initialize a Series.

point_series = pd.Series(d)
point_series
sky_point_list       135.306, 8.39862, 126.51, 13.4449, 163.017, 54...
pm_point_list     -4.05037121,-14.75623261, -3.41981085,-14.723...
dtype: object

Now we can save our Series using to_hdf().

filename = 'gd1_data.hdf'
point_series.to_hdf(filename, 'point_series')
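
To confirm that we can recover the lists later, here is a quick round-trip check (a sketch):

restored = pd.read_hdf(filename, 'point_series')
restored['sky_point_list']   # should match the string we stored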

Plotting one more time

Now we can examine the results:

x = candidate_gaia_table['ra']
y = candidate_gaia_table['dec']
plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

plt.xlabel('ra (degree ICRS)')
plt.ylabel('dec (degree ICRS)');
<Figure size 432x288 with 1 Axes>

Scatter plot of right ascension and declination of selected stars in ICRS frame.

This plot shows why it was useful to transform these coordinates to the GD-1 frame. In ICRS, it is more difficult to identify the stars near the centerline of GD-1.

We can use our make_dataframe function from episode 3 to transform the results back to the GD-1 frame. In addition to doing the coordinate transformation and reflex correction for us, this function also compiles everything into a single object (a DataFrame) to make it easier to use. Note that because we put this code into a function, we can do all of this with a single line of code!

candidate_gaia_df = make_dataframe(candidate_gaia_table)

We can check the results using the plot_pm_selection function we wrote in episode 3.

plot_pm_selection(candidate_gaia_df)
<Figure size 432x288 with 1 Axes>

Scatter plot of phi1 versus phi2 in GD-1 frame after selecting on proper motion.

We are starting to see GD-1 more clearly. We can compare this figure with this panel from Figure 1 from the original paper:

Figure from Price-Whelan and Bonaca paper showing phi1 vs phi2 in GD-1 after selecting on proper motion.

This panel shows stars selected based on proper motion only, so it is comparable to our figure (although notice that the original figure covers a wider region).

In the next episode, we will use photometry data from Pan-STARRS to do a second round of filtering, and see if we can replicate this panel.

Figure from Price-Whelan and Bonaca paper showing phi1 vs phi2 in GD-1 after selecting on proper motion and photometry.

Later we will learn how to add annotations like the ones in the figure and customize the style of the figure to present the results clearly and compellingly.

Summary

In the previous episode we downloaded data for a large number of stars and then selected a small fraction of them based on proper motion.

In this episode, we improved this process by writing a more complex query that uses the database to select stars based on proper motion. This process requires more computation on the Gaia server, but then we are able to either:

  1. Search the same region and download less data, or

  2. Search a larger region while still downloading a manageable amount of data.

In the next episode, we will learn about the database JOIN operation, which we will use in later episodes to join our Gaia data with photometry data from Pan-STARRS.

Key Points

  • When possible, ‘move the computation to the data’; that is, do as much of the work as possible on the database server before downloading the data.


Join

Overview

Teaching: 50 min
Exercises: 35 min
Questions
  • How do we use JOIN to combine information from multiple tables?

Objectives
  • Write ADQL queries involving JOIN operations.

  • Save data in CSV format.

The next step in our analysis is to select candidate stars based on photometry data. The following figure from the Price-Whelan and Bonaca paper is a color-magnitude diagram for the stars selected based on proper motion:

Color-magnitude diagram for the stars selected based on proper motion, from Price-Whelan and Bonaca paper.

In red is a stellar isochrone, showing where we expect the stars in GD-1 to fall based on the metallicity and age of their original globular cluster.

By selecting stars in the shaded area, we can further distinguish the main sequence of GD-1 from younger background stars.

Outline

  1. We will reload the candidate stars we identified in the previous episode.

  2. Then we will run a query on the Gaia server that uploads the table of candidates and uses a JOIN operation to select photometry data for the candidate stars.

  3. We will write the results to a file for use in the next episode.

Starting from this episode

If you are starting a new notebook for this episode, expand this section for information you will need to get started.

Read me

In the previous episode, we defined rectangles around the stars in GD-1 in both spatial coordinates and proper motion, transformed them into ICRS coordinates, and created point lists of the polygon vertices. We will use that data for this episode. Whether you are working from a new notebook or coming back from a checkpoint, reloading the data will save you from having to run the query again.

If you are starting this episode here or starting this episode in a new notebook, you will need to run the following lines of code.

This imports previously imported functions:

from astroquery.gaia import Gaia
import pandas as pd

from episode_functions import *

The following code loads in the data (instructions for downloading data can be found in the setup instructions). You may need to add the path to the filename variable below (e.g. filename = 'student_download/data/gd1_data.hdf')

filename = 'gd1_data.hdf'
point_series = pd.read_hdf(filename, 'point_series')
point_series

Getting photometry data

The Gaia dataset contains some photometry data, including the variable bp_rp, which contains BP-RP color (the difference in mean flux between the BP and RP bands). We use this variable to select stars with bp_rp between -0.75 and 2, which excludes many class M dwarf stars.

But we can do better than that. Assuming GD-1 is a globular cluster, all of the stars formed at the same time from the same material, so the stars’ photometric properties should be consistent with a single isochrone in a color magnitude diagram. Therefore, to select stars with the age and metal richness we expect in GD-1, we can use g-i color and apparent g-band magnitude, which are available from the Pan-STARRS survey.

Conveniently, the Gaia server provides data from Pan-STARRS as a table in the same database we have been using, so we can access it by making ADQL queries.

A caveat about matching stars between catalogs

In general, choosing a star from the Gaia catalog and finding the corresponding star in the Pan-STARRS catalog is not easy. This kind of cross matching is not always possible, because a star might appear in one catalog and not the other. And even when both stars are present, there might not be a clear one-to-one relationship between stars in the two catalogs. Additional catalog matching tools are available from the Astropy coordinates package.

Fortunately, people have worked on this problem, and the Gaia database includes cross-matching tables that suggest a best neighbor in the Pan-STARRS catalog for many stars in the Gaia catalog.

This document describes the cross matching process. Briefly, it uses a cone search to find possible matches in approximately the right position, then uses attributes like color and magnitude to choose pairs of observations most likely to be the same star.

The best neighbor table

So the hard part of cross-matching has been done for us. Using the results is a little tricky, but it gives us a chance to learn about one of the most important tools for working with databases: “joining” tables.

A “join” is an operation where you match up records from one table with records from another table using as a “key” a piece of information that is common to both tables, usually some kind of ID code.

In this example, for each candidate star we have selected so far, we have the Gaia source_id; the goal is to find the obj_id for the same star in the Pan-STARRS catalog.

To do that we will:

  1. Use the JOIN operator to look up the Pan-STARRS obj_id for each of the stars we are interested in, using the source_ids we have already identified to search the panstarrs1_best_neighbour table.

  2. Use the JOIN operator again to look up the Pan-STARRS photometry for these stars in the panstarrs1_original_valid table using the obj_ids we just identified.

Before we get to the JOIN operation, we will explore these tables. Here is the metadata for panstarrs1_best_neighbour.

ps_best_neighbor_meta = Gaia.load_table('gaiadr2.panstarrs1_best_neighbour')
Retrieving table 'gaiadr2.panstarrs1_best_neighbour'
Parsing table 'gaiadr2.panstarrs1_best_neighbour'...
Done.
print(ps_best_neighbor_meta)
TAP Table name: gaiadr2.gaiadr2.panstarrs1_best_neighbour
Description: Pan-STARRS1 BestNeighbour table lists each matched Gaia object with its
best neighbour in the external catalogue.
There are 1 327 157 objects in the filtered version of Pan-STARRS1 used
to compute this cross-match that have too early epochMean.
Num. columns: 7

And here are the columns.

for column in ps_best_neighbor_meta.columns:
    print(column.name)
source_id
original_ext_source_id
angular_distance
number_of_neighbours
number_of_mates
best_neighbour_multiplicity
gaia_astrometric_params

Here is the documentation for these variables.

The ones we will use are:

  • source_id, which we will match up with the source_id column in the Gaia table,

  • original_ext_source_id, which we will match up with the obj_id column in the Pan-STARRS table,

  • best_neighbour_multiplicity, which indicates how many sources in Pan-STARRS are matched with this source in Gaia, and

  • number_of_mates, which indicates how many other sources in Gaia are matched with the same source in Pan-STARRS.

Ideally, best_neighbour_multiplicity should be 1 and number_of_mates should be 0; in that case, there is a one-to-one match between the source in Gaia and the corresponding source in Pan-STARRS.

Here is a query that selects these columns and returns the first 5 rows.

ps_best_neighbor_query = """SELECT 
TOP 5
source_id, best_neighbour_multiplicity, number_of_mates, original_ext_source_id
FROM gaiadr2.panstarrs1_best_neighbour
"""
ps_best_neighbor_job = Gaia.launch_job_async(query=ps_best_neighbor_query)
INFO: Query finished. [astroquery.utils.tap.core]
ps_best_neighbor_results = ps_best_neighbor_job.get_results()
ps_best_neighbor_results
<Table length=5>
     source_id      best_neighbour_multiplicity number_of_mates original_ext_source_id
       int64                  int32                  int16              int64         
------------------- --------------------------- --------------- ----------------------
6745938972433480704                           1               0      69742925668851205
6030466788955954048                           1               0      69742509325691172
6756488099308169600                           1               0      69742879438541228
6700154994715046016                           1               0      69743055581721207
6757061941303252736                           1               0      69742856540241198

The Pan-STARRS table

Now that we know the Pan-STARRS obj_id, we are ready to match this to the photometry in the panstarrs1_original_valid table. Here is the metadata for the table that contains the Pan-STARRS data.

ps_valid_meta = Gaia.load_table('gaiadr2.panstarrs1_original_valid')
Retrieving table 'gaiadr2.panstarrs1_original_valid'
Parsing table 'gaiadr2.panstarrs1_original_valid'...
Done.
print(ps_valid_meta)
TAP Table name: gaiadr2.gaiadr2.panstarrs1_original_valid
Description: The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) is
a system for wide-field astronomical imaging developed and operated by
the Institute for Astronomy at the University of Hawaii. Pan-STARRS1
(PS1) is the first part of Pan-STARRS to be completed and is the basis
for Data Release 1 (DR1). The PS1 survey used a 1.8 meter telescope and
its 1.4 Gigapixel camera to image the sky in five broadband filters (g,
r, i, z, y).

The current table contains a filtered subsample of the 10 723 304 629
entries listed in the original ObjectThin table.
[Output truncated]

And here are the columns.

for column in ps_valid_meta.columns:
    print(column.name)
obj_name
obj_id
ra
dec
ra_error
dec_error
epoch_mean
g_mean_psf_mag
g_mean_psf_mag_error
g_flags
r_mean_psf_mag
[Output truncated]

Here is the documentation for these variables.

The ones we will use are obj_id, which we will match up with original_ext_source_id from the best neighbor table, and g_mean_psf_mag and i_mean_psf_mag, which contain the mean g-band and i-band magnitudes we need for the color-magnitude diagram.

Here is a query that selects these variables and returns the first 5 rows.

ps_valid_query = """SELECT 
TOP 5
obj_id, g_mean_psf_mag, i_mean_psf_mag 
FROM gaiadr2.panstarrs1_original_valid
"""
ps_valid_job = Gaia.launch_job_async(query=ps_valid_query)
INFO: Query finished. [astroquery.utils.tap.core]
ps_valid_results = ps_valid_job.get_results()
ps_valid_results
<Table length=5>
      obj_id      g_mean_psf_mag  i_mean_psf_mag 
                                       mag       
      int64          float64         float64     
----------------- -------------- ----------------
67130655389101425             -- 20.3516006469727
67553305590067819             --  19.779899597168
67551423248967849             -- 19.8889007568359
67132026238911331             -- 20.9062995910645
67553513677687787             -- 21.2831001281738

Joining tables

The following figure shows how these tables are related.

Diagram showing relationship between the gaia_source, panstarrs1_best_neighbor, and panstarrs1_original_valid tables and result table.

There is no guarantee that the corresponding rows of these tables are in the same order, so the JOIN operation involves some searching. However, ADQL/SQL databases are implemented in a way that makes this kind of search efficient. If you are curious, you can read more about it.

Now we will get to the details of performing a JOIN operation.

We are about to build a complex query using software that doesn’t provide us with any helpful information for debugging. For this reason we are going to start with a simplified version of what we want to do until we are sure we are joining the tables correctly, then we will slowly add more layers of complexity, checking at each stage that our query still works. As a starting place, we will go all the way back to the cone search from episode 2.

test_cone_query = """SELECT 
TOP 10 
source_id
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

And we will run it, to make sure we have a working query to build on.

test_cone_job = Gaia.launch_job(query=test_cone_query)
INFO: Query finished. [astroquery.utils.tap.core]
test_cone_results = test_cone_job.get_results()
test_cone_results
<Table length=10>
     source_id     
       int64       
-------------------
3322773965056065536
3322773758899157120
3322774068134271104
3322773930696320512
3322774377374425728
3322773724537891456
3322773724537891328
[Output truncated]

Now we can start adding features. First, we will replace source_id with the format specifier columns so that we can alter what columns we want to return without having to modify our base query:

cone_base_query = """SELECT 
{columns}
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

As a reminder, here are the columns we want from the Gaia table:

columns = 'source_id, ra, dec, pmra, pmdec'

cone_query = cone_base_query.format(columns=columns)
print(cone_query)
SELECT 
source_id, ra, dec, pmra, pmdec
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))

We run the query again.

cone_job = Gaia.launch_job_async(query=cone_query)
INFO: Query finished. [astroquery.utils.tap.core]
cone_results = cone_job.get_results()
cone_results
<Table length=594>
     source_id              ra        ...        pmdec       
                           deg        ...       mas / yr     
       int64             float64      ...       float64      
------------------- ----------------- ... -------------------
3322773965056065536 88.78178020183375 ... -2.5057036964736907
3322773758899157120 88.83227057144585 ...                  --
3322774068134271104  88.8206092188033 ... -1.5260889445858044
3322773930696320512 88.80843339290348 ... -0.9292104395445717
3322774377374425728 88.86806108182265 ... -3.8676624830902435
3322773724537891456 88.81308602813434 ... -33.078133430952086
[Output truncated]

Adding the best neighbor table

Now we are ready for the first join. The join operation requires two clauses: a JOIN clause, which names the second table and gives it an abbreviated name, and an ON clause, which specifies how to match up rows between the two tables.

In this example, we join with gaiadr2.panstarrs1_best_neighbour AS best, which means we can refer to the best neighbor table with the abbreviated name best, which will save us a lot of typing. Similarly, we will be referring to the gaiadr2.gaia_source table by the abbreviated name gaia.

The ON clause indicates that we will match up the source_id column from the Gaia table with the source_id column from the best neighbor table.

neighbors_base_query = """SELECT 
{columns}
FROM gaiadr2.gaia_source AS gaia
JOIN gaiadr2.panstarrs1_best_neighbour AS best
  ON gaia.source_id = best.source_id
WHERE 1=CONTAINS(
  POINT(gaia.ra, gaia.dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

SQL detail

In this example, the ON column has the same name in both tables, so we could replace the ON clause with a simpler USING clause:

USING(source_id)
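
So an equivalent version of the query above could look like this (a sketch; nothing else changes):

neighbors_using_query_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source AS gaia
JOIN gaiadr2.panstarrs1_best_neighbour AS best
  USING(source_id)
WHERE 1=CONTAINS(
  POINT(gaia.ra, gaia.dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""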

Now that there is more than one table involved, we cannot use simple column names any more; we have to use qualified column names. In other words, we have to specify which table each column is in. The column names do not have to be the same and, in fact, in the next join they will not be. That is one of the reasons we specify them explicitly.

Here is the complete query, including the columns we want from the Gaia and best neighbor tables. You can start to see that using the abbreviated names makes the query easier to read and requires less typing. In addition to the spatial coordinates and proper motion, we will return the best_neighbour_multiplicity and number_of_mates columns from the panstarrs1_best_neighbour table, so we can evaluate the quality of the cross-match by counting the one-to-one matches between the catalogs. Recall that best_neighbour_multiplicity tells us the number of Pan-STARRS objects that match a Gaia object, and number_of_mates tells us the number of Gaia objects that match a Pan-STARRS object.

column_list_neighbours = ['gaia.source_id',
               'gaia.ra',
               'gaia.dec',
               'gaia.pmra',
               'gaia.pmdec',
               'best.best_neighbour_multiplicity',
               'best.number_of_mates',
              ]
columns = ', '.join(column_list_neighbours)

neighbors_query = neighbors_base_query.format(columns=columns)
print(neighbors_query)
SELECT 
gaia.source_id, gaia.ra, gaia.dec, gaia.pmra, gaia.pmdec, best.best_neighbour_multiplicity, best.number_of_mates
FROM gaiadr2.gaia_source AS gaia
JOIN gaiadr2.panstarrs1_best_neighbour AS best
  ON gaia.source_id = best.source_id
WHERE 1=CONTAINS(
  POINT(gaia.ra, gaia.dec),
  CIRCLE(88.8, 7.4, 0.08333333))
neighbors_job = Gaia.launch_job_async(neighbors_query)
INFO: Query finished. [astroquery.utils.tap.core]
neighbors_results = neighbors_job.get_results()
neighbors_results
<Table length=490>
     source_id              ra        ... number_of_mates
                           deg        ...                
       int64             float64      ...      int16     
------------------- ----------------- ... ---------------
3322773965056065536 88.78178020183375 ...               0
3322774068134271104  88.8206092188033 ...               0
3322773930696320512 88.80843339290348 ...               0
3322774377374425728 88.86806108182265 ...               0
3322773724537891456 88.81308602813434 ...               0
3322773724537891328 88.81570329208743 ...               0
[Output truncated]

This result has fewer rows than the previous result. That is because there are sources in the Gaia table with no corresponding source in the Pan-STARRS table.

By default, the result of the join only includes rows where the same source_id appears in both tables. This default is called an “inner” join because the results include only the intersection of the two tables. You can read about the other kinds of join here.
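For example, if we wanted to keep every Gaia source in the region, including those with no Pan-STARRS match, we could sketch it with a LEFT OUTER JOIN, which ADQL also supports; the unmatched rows would come back with null values in the best neighbor columns. This is a sketch for illustration, not a query we will use in this lesson:

left_join_query_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source AS gaia
LEFT OUTER JOIN gaiadr2.panstarrs1_best_neighbour AS best
  ON gaia.source_id = best.source_id
WHERE 1=CONTAINS(
  POINT(gaia.ra, gaia.dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""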

Adding the Pan-STARRS table

Exercise (10 minutes)

Now we are ready to bring in the Pan-STARRS table. Starting with the previous query, add a second JOIN clause that joins with gaiadr2.panstarrs1_original_valid, gives it the abbreviated name ps, and matches original_ext_source_id from the best neighbor table with obj_id from the Pan-STARRS table.

Add g_mean_psf_mag and i_mean_psf_mag to the column list, and run the query. The result should contain 490 rows and 9 columns.

Solution

join_solution_query_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source as gaia
JOIN gaiadr2.panstarrs1_best_neighbour as best
  ON gaia.source_id = best.source_id
JOIN gaiadr2.panstarrs1_original_valid as ps
  ON best.original_ext_source_id = ps.obj_id
WHERE 1=CONTAINS(
  POINT(gaia.ra, gaia.dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

column_list = ['gaia.source_id',
               'gaia.ra',
               'gaia.dec',
               'gaia.pmra',
               'gaia.pmdec',
               'best.best_neighbour_multiplicity',
               'best.number_of_mates',
               'ps.g_mean_psf_mag',
               'ps.i_mean_psf_mag']

columns = ', '.join(column_list)

join_solution_query = join_solution_query_base.format(columns=columns)
print(join_solution_query)

join_solution_job = Gaia.launch_job_async(join_solution_query)
join_solution_results = join_solution_job.get_results()
join_solution_results
<Table length=490>
     source_id              ra        ...  g_mean_psf_mag   i_mean_psf_mag 
                           deg        ...                        mag       
       int64             float64      ...     float64          float64     
------------------- ----------------- ... ---------------- ----------------
3322773965056065536 88.78178020183375 ... 19.9431991577148 17.4221992492676
3322774068134271104  88.8206092188033 ... 18.6212005615234 16.6007995605469
3322773930696320512 88.80843339290348 ...               -- 20.2203998565674
3322774377374425728 88.86806108182265 ... 18.0676002502441 16.9762001037598
3322773724537891456 88.81308602813434 ... 20.1907005310059 17.8700008392334
3322773724537891328 88.81570329208743 ... 22.6308002471924 19.6004009246826
[Output truncated]

Selecting by coordinates and proper motion

We are now going to replace the cone search with the GD-1 selection that we built in previous episodes. We will start by making sure our previous query still works, and then add in the JOIN clauses. First we bring back the WHERE clause from the previous episode, which selects sources based on parallax, BP-RP color, sky coordinates, and proper motion.

Here is candidate_coord_pm_query_base from the previous episode.

candidate_coord_pm_query_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON({sky_point_list}))
  AND 1 = CONTAINS(POINT(pmra, pmdec),
                   POLYGON({pm_point_list}))
"""

Now we can assemble the query using the sky and proper motion point lists we compiled in episode 4.

columns = 'source_id, ra, dec, pmra, pmdec'

candidate_coord_pm_query = candidate_coord_pm_query_base.format(columns=columns,
                            sky_point_list=point_series['sky_point_list'],
                            pm_point_list=point_series['pm_point_list'])

print(candidate_coord_pm_query)
SELECT 
source_id, ra, dec, pmra, pmdec
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON(135.306, 8.39862, 126.51, 13.4449, 163.017, 54.2424, 172.933, 46.4726, 135.306, 8.39862))
  AND 1 = CONTAINS(POINT(pmra, pmdec),
                   POLYGON( -4.05037121,-14.75623261, -3.41981085,-14.72365546, -3.03521988,-14.44357135, -2.26847919,-13.7140236 , -2.61172203,-13.24797471, -2.73471401,-13.09054471, -3.19923146,-12.5942653 , -3.34082546,-12.47611926, -5.67489413,-11.16083338, -5.95159272,-11.10547884, -6.42394023,-11.05981295, -7.09631023,-11.95187806, -7.30641519,-12.24559977, -7.04016696,-12.88580702, -6.00347705,-13.75912098, -4.42442296,-14.74641176))

We run it to make sure we are starting with a working query.

candidate_coord_pm_job = Gaia.launch_job_async(query=candidate_coord_pm_query)
INFO: Query finished. [astroquery.utils.tap.core]
candidate_coord_pm_results = candidate_coord_pm_job.get_results()
candidate_coord_pm_results
<Table length=7345>
    source_id              ra         ...        pmdec       
                          deg         ...       mas / yr     
      int64             float64       ...       float64      
------------------ ------------------ ... -------------------
635559124339440000 137.58671691646745 ... -12.490481778113859
635860218726658176  138.5187065217173 ... -11.346409129876392
635674126383965568  138.8428741026386 ... -12.702779525389634
635535454774983040  137.8377518255436 ... -14.492308604905652
635497276810313600  138.0445160213759 ... -12.291499169815987
635614168640132864 139.59219748145836 ... -13.708904908478631
[Output truncated]

Exercise (15 minutes)

Create a new query base called candidate_join_query_base that combines the WHERE clauses from the previous query with the JOIN clauses for the best neighbor and Pan-STARRS tables. Format the query base using the column names in column_list, and call the result candidate_join_query.

Hint: Make sure you use qualified column names everywhere!

Run your query and download the results. The table you get should have 3725 rows and 9 columns.

Solution

candidate_join_query_base = """
SELECT 
{columns}
FROM gaiadr2.gaia_source as gaia
JOIN gaiadr2.panstarrs1_best_neighbour as best
  ON gaia.source_id = best.source_id
JOIN gaiadr2.panstarrs1_original_valid as ps
  ON best.original_ext_source_id = ps.obj_id
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(gaia.ra, gaia.dec), 
                   POLYGON({sky_point_list}))
  AND 1 = CONTAINS(POINT(gaia.pmra, gaia.pmdec),
                   POLYGON({pm_point_list}))
"""

columns = ', '.join(column_list)

candidate_join_query = candidate_join_query_base.format(columns=columns,
                            sky_point_list=point_series['sky_point_list'],
                            pm_point_list=point_series['pm_point_list'])
print(candidate_join_query)


candidate_join_job = Gaia.launch_job_async(query=candidate_join_query)
candidate_table = candidate_join_job.get_results()
candidate_table

Checking the match

To get more information about the matching process, we can inspect best_neighbour_multiplicity, which indicates for each star in Gaia how many stars in Pan-STARRS are equally likely matches.

candidate_table['best_neighbour_multiplicity']
<MaskedColumn name='best_neighbour_multiplicity' dtype='int16' description='Number of neighbours with same probability as best neighbour' length=3725>
  1
  1
  1
  1
  1
  1
  1
  1
  1
  1
[Output truncated]

Most of the values are 1, which is good; that means that for each candidate star we have identified exactly one source in Pan-STARRS that is likely to be the same star.

To check whether there are any values other than 1, we can convert this column to a Pandas Series and use describe, which we saw in episode 3.

multiplicity = pd.Series(candidate_table['best_neighbour_multiplicity'])
multiplicity.describe()
count    3725.0
mean        1.0
std         0.0
min         1.0
25%         1.0
50%         1.0
75%         1.0
max         1.0
dtype: float64

In fact, 1 is the only value in the Series, so every candidate star has a single best match.

Similarly, number_of_mates indicates the number of other stars in Gaia that match with the same star in Pan-STARRS.

mates = pd.Series(candidate_table['number_of_mates'])
mates.describe()
count    3725.0
mean        0.0
std         0.0
min         0.0
25%         0.0
50%         0.0
75%         0.0
max         0.0
dtype: float64

All values in this column are 0, which means that for each match we found in Pan-STARRS, there are no other stars in Gaia that also match.
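If we only want a quick yes-or-no check rather than the full summary, we can compare the columns directly; here is a minimal sketch using NumPy:

import numpy as np

# Both should print True, matching the describe() summaries above
print(np.all(candidate_table['best_neighbour_multiplicity'] == 1))
print(np.all(candidate_table['number_of_mates'] == 0))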

Number of neighbors

The table also contains number_of_neighbors, which is the number of stars in Pan-STARRS that match in terms of position, before using other criteria to choose the most likely match. But we are more interested in the final match, using both criteria.

Saving the DataFrame

We can make a DataFrame from our Astropy Table and save our results so we can pick up where we left off without running this query again. Once again, we will make use of our make_dataframe function.

candidate_df = make_dataframe(candidate_table)

The HDF5 file should already exist, so we’ll add candidate_df to it.

filename = 'gd1_data.hdf'

candidate_df.to_hdf(filename, 'candidate_df')

We can use getsize to confirm that the file exists and check the size:

from os.path import getsize

MB = 1024 * 1024
getsize(filename) / MB
3.5835609436035156

Another file format - CSV

Pandas can write a variety of other formats, which you can read about here. We won’t cover all of them, but one other important one is CSV, which stands for “comma-separated values”.

CSV is a plain-text format that can be read and written by pretty much any tool that works with data. In that sense, it is the “least common denominator” of data formats.

However, it has an important limitation: some information about the data gets lost in translation, notably the data types. If you read a CSV file from someone else, you might need some additional information to make sure you are getting it right.

Also, CSV files tend to be big, and slow to read and write.

With those caveats, here is how to write one:

candidate_df.to_csv('gd1_data.csv')

We can check the file size like this:

getsize('gd1_data.csv') / MB
0.7606849670410156

We can see the first few lines like this:

head('gd1_data.csv')
,source_id,ra,dec,pmra,pmdec,best_neighbour_multiplicity,number_of_mates,g_mean_psf_mag,i_mean_psf_mag,phi1,phi2,pm_phi1,pm_phi2
0,635860218726658176,138.5187065217173,19.09233926905897,-5.941679495793577,-11.346409129876392,1,0,17.8978004455566,17.5174007415771,-59.247329893833296,-2.016078400820631,-7.527126084640531,1.7487794924176672
1,635674126383965568,138.8428741026386,19.031798198627634,-3.8970011609340207,-12.702779525389634,1,0,19.2873001098633,17.6781005859375,-59.13339098769217,-2.306900745179831,-7.560607655557415,-0.7417999555980248
The CSV file contains the names of the columns, but not the data types.

We can read the CSV file back like this:

read_back_csv = pd.read_csv('gd1_data.csv')
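Because the types are not stored, read_csv infers them, and the inferred types may not match the originals. If we know what the types should be, we can pass them explicitly; this is a sketch, and the dtype mapping is an assumption based on the original DataFrame:

typed_csv = pd.read_csv('gd1_data.csv',
                        dtype={'source_id': 'int64',
                               'best_neighbour_multiplicity': 'int16',
                               'number_of_mates': 'int16'})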

We will compare the first few rows of candidate_df and read_back_csv:

candidate_df.head(3)
            source_id          ra        dec      pmra      pmdec  \
0  635860218726658176  138.518707  19.092339 -5.941679 -11.346409   
1  635674126383965568  138.842874  19.031798 -3.897001 -12.702780   
2  635535454774983040  137.837752  18.864007 -4.335041 -14.492309   
  
   best_neighbour_multiplicity  number_of_mates  g_mean_psf_mag  \
0                            1                0         17.8978   
1                            1                0         19.2873   
2                            1                0         16.9238   

   i_mean_psf_mag       phi1      phi2   pm_phi1   pm_phi2  
[Output truncated]
read_back_csv.head(3)
   Unnamed: 0           source_id          ra        dec      pmra      pmdec  \
0           0  635860218726658176  138.518707  19.092339 -5.941679 -11.346409   
1           1  635674126383965568  138.842874  19.031798 -3.897001 -12.702780   
2           2  635535454774983040  137.837752  18.864007 -4.335041 -14.492309   

   best_neighbour_multiplicity  number_of_mates  g_mean_psf_mag  \
0                            1                0         17.8978   
1                            1                0         19.2873   
2                            1                0         16.9238   

   i_mean_psf_mag       phi1      phi2   pm_phi1   pm_phi2  
[Output truncated]

Notice that the index in candidate_df has become an unnamed column in read_back_csv and a new index has been created. The Pandas functions for writing and reading CSV files provide options to avoid that problem, but this is an example of the kind of thing that can go wrong with CSV files.
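For example, here are two ways to avoid the problem, both sketches using standard Pandas options:

# Option 1: don't write the index at all
candidate_df.to_csv('gd1_data.csv', index=False)
clean_csv = pd.read_csv('gd1_data.csv')

# Option 2: write the index and tell read_csv which column it is
candidate_df.to_csv('gd1_data.csv')
restored_csv = pd.read_csv('gd1_data.csv', index_col=0)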

Summary

In this episode, we used database JOIN operations to select photometry data for the stars we’ve identified as candidates to be in GD-1.

In the next episode, we will use this data for a second round of selection, identifying stars that have photometry data consistent with GD-1.

Key Points

  • Use JOIN operations to combine data from multiple tables in a database, using some kind of identifier to match up records from one table with records from another. This is another example of a practice we saw in the previous notebook, moving the computation to the data.

  • For most applications, saving data in FITS or HDF5 is better than CSV. FITS and HDF5 are binary formats, so the files are usually smaller, and they store metadata, so you don’t lose anything when you read the file back.

  • On the other hand, CSV is a ‘least common denominator’ format; that is, it can be read by practically any application that works with data.


Photometry

Overview

Teaching: 55 min
Exercises: 0 min
Questions
  • How do we use Matplotlib to define a polygon and select points that fall inside it?

Objectives
  • Use isochrone data to specify a polygon and determine which points fall inside it.

  • Use Matplotlib features to customize the appearance of figures.

In the previous episode we downloaded photometry data from Pan-STARRS, which is available from the same server we have been using to get Gaia data.

The next step in the analysis is to select candidate stars based on the photometry data. The following figure from the paper is a color-magnitude diagram showing the stars we previously selected based on proper motion:

Color-magnitude diagram for the stars selected based on proper motion, from Price-Whelan and Bonaca paper.

In red is a theoretical isochrone, showing where we expect the stars in GD-1 to fall based on the metallicity and age of their original globular cluster.

By selecting stars in the shaded area, we can further distinguish the main sequence of GD-1 from mostly younger background stars.

Outline

  1. We will reload the data from the previous episode and make a color-magnitude diagram.

  2. We will use an isochrone computed by MIST to specify a polygonal region in the color-magnitude diagram and select the stars inside of it.

Starting from this episode

If you are starting a new notebook for this episode, expand this section for information you will need to get started.

Read me

In the previous episode, we selected stars in GD-1 based on proper motion and downloaded the spatial, proper motion, and photometry information by joining the Gaia and Pan-STARRS datasets. We will use that data for this episode. Whether you are working from a new notebook or coming back from a checkpoint, reloading the data will save you from having to run the query again.

If you are starting this episode here or starting this episode in a new notebook, you will need to run the following lines of code.

This imports previously imported functions:

from os.path import getsize

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt

from episode_functions import *

The following code loads in the data (instructions for downloading data can be found in the setup instructions). You may need to add the path to the filename variable below (e.g. filename = 'student_download/data/gd1_data.hdf')

filename = 'gd1_data.hdf'
candidate_df = pd.read_hdf(filename, 'candidate_df')

Plotting photometry data

Now that we have photometry data from Pan-STARRS, we can produce a color-magnitude diagram to replicate the diagram from the original paper:

Color-magnitude diagram for the stars selected based on proper motion, from Price-Whelan and Bonaca paper.

The y-axis shows the apparent magnitude of each source with the g filter.

The x-axis shows the difference in apparent magnitude between the g and i filters, which indicates color.

Stars with lower values of (g-i) are “bluer” since they are brighter in the g-band than in the i-band, compared to other stars. To a first order approximation, the color of a star is related to the star’s temperature, with bluer stars indicating higher temperatures and redder stars indicating lower temperatures. An important second order effect involves the metallicity, or the amount of metals (elements heavier than helium, in this case) that are present in a star’s atmosphere. Higher metallicity leads to redder stars and lower metallicity leads to bluer stars.

Stars in the lower-left quadrant of this diagram are the faintest and bluest, suggesting they have lower metallicity than the other stars, which means they are likely to be older.

Since we expect the stars in GD-1 to be older than the foreground and background stars, and farther away, the stars in the lower-left are more likely to be in GD-1.

With the photometry we downloaded from the PanSTARRS table into candidate_df we can now recreate this plot.

x = candidate_df['g_mean_psf_mag'] - candidate_df['i_mean_psf_mag']
y = candidate_df['g_mean_psf_mag']
plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

plt.ylabel('Magnitude (g)')
plt.xlabel('Color (g-i)')

Color magnitude diagram of our selected stars showing all of the stars selected

We have assigned the color and magnitude to the variables x and y, respectively, for convenience and to keep the code readable, since the table and column names are long and x involves an operation between two columns.

We can zoom in on the region of interest by setting the range of x and y values displayed with the xlim and ylim functions. If we put the higher value first in the ylim call, this will invert the y-axis, putting fainter magnitudes at the bottom.

x = candidate_df['g_mean_psf_mag'] - candidate_df['i_mean_psf_mag']
y = candidate_df['g_mean_psf_mag']
plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

plt.ylabel('Magnitude (g)')
plt.xlabel('Color (g-i)')

plt.xlim([0, 1.5])
plt.ylim([22, 14])

Color magnitude diagram of our selected stars showing overdense region in lower left.

Our figure does not look exactly like the one in the paper because we are working with a smaller region of the sky, so we have fewer stars. But the main sequence of GD-1 appears as an overdense region in the lower left.

We want to be able to make this plot again, with any selection of Pan-STARRS photometry, so this is a natural time to put it into a function that accepts as input an Astropy Table or Pandas DataFrame, as long as it has columns named g_mean_psf_mag and i_mean_psf_mag. To do this we will change our variable name from candidate_df to the more generic table.

def plot_cmd(table):
    """Plot a color magnitude diagram.
    
    table: Table or DataFrame with photometry data
    """
    y = table['g_mean_psf_mag']
    x = table['g_mean_psf_mag'] - table['i_mean_psf_mag']

    plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

    plt.xlim([0, 1.5])
    plt.ylim([22, 14])

    plt.ylabel('Magnitude (g)')
    plt.xlabel('Color (g-i)')

Here are the results:

plot_cmd(candidate_df)
<Figure size 432x288 with 1 Axes>

Color magnitude diagram of our selected stars showing overdense region in lower left.

In the next section we will use an isochrone to specify a polygon that contains this overdense region.

Isochrone

Given our understanding of the age, metallicity, and distance of GD-1, we can overlay a theoretical isochrone for GD-1 from the MESA Isochrones and Stellar Tracks model and better identify the main sequence of GD-1.

Calculating Isochrone

In fact, we can use MESA Isochrones & Stellar Tracks (MIST) to compute it for us. Using the MIST Version 1.2 web interface, we computed an isochrone with the following parameters:

  • Rotation initial v/v_crit = 0.4
  • Single age, linear scale = 12e9
  • Composition [Fe/H] = -1.35
  • Synthetic Photometry, PanStarrs
  • Extinction av = 0

Making a polygon

The MIST isochrone files available on the website above cannot be plotted directly over our data. We have selected the relevant part of the isochrone and the filters we are interested in, and scaled the photometry to the distance of GD-1 (details here). Scaling to a distance d means converting the absolute magnitudes in the isochrone to apparent magnitudes using the distance modulus, m = M + 5 log10(d / 10 pc). Now we can read in the results, which you downloaded as part of the setup instructions:

filename = 'gd1_isochrone.hdf5'
iso_df = pd.read_hdf(filename, 'iso_df')
iso_df.head()
       mag_g  color_g_i
0  28.294743   2.195021
1  28.189718   2.166076
2  28.051761   2.129312
3  27.916194   2.093721
4  27.780024   2.058585

Here is what the isochrone looks like on the color-magnitude diagram.

plot_cmd(candidate_df)
plt.plot(iso_df['color_g_i'], iso_df['mag_g']);
<Figure size 432x288 with 1 Axes>

Color magnitude diagram of our selected stars with theoretical isochrone overlaid as blue curve.

In the bottom half of the figure, the isochrone passes through the overdense region where the stars are likely to belong to GD-1.

Although some stars near the top half of the isochrone likely belong to GD-1, these are stars that have evolved off the main sequence. The density of GD-1 stars in this region is therefore much lower, and the contamination from other stars much greater. So to get the purest sample of GD-1 stars we will select only stars on the main sequence.

So we will select the part of the isochrone that lies in the overdense region.

g_mask is a Boolean Series that is True where g_all is between 18.0 and 21.5.

g_all = iso_df['mag_g']

g_mask = (g_all > 18.0) & (g_all < 21.5)
g_mask.sum()
117

We can use it to select the corresponding rows in iso_df:

iso_masked = iso_df[g_mask]
iso_masked.head()
        mag_g  color_g_i
94  21.411746   0.692171
95  21.322466   0.670238
96  21.233380   0.648449
97  21.144427   0.626924
98  21.054549   0.605461

Now, to select the stars in the overdense region, we have to define a polygon that includes stars near the isochrone.

g = iso_masked['mag_g']
left_color = iso_masked['color_g_i'] - 0.06
right_color = iso_masked['color_g_i'] + 0.12

Here is our plot with these boundaries:

plot_cmd(candidate_df)

plt.plot(left_color, g, label='left color')
plt.plot(right_color, g, label='right color')

plt.legend();
<Figure size 432x288 with 1 Axes>

Color magnitude diagram of our selected stars showing left boundary as blue curve and right boundary as orange curve.

Which points are in the polygon?

Matplotlib provides a Polygon object that we can use to check which points fall in the polygon we just constructed.

To make a Polygon, we need to assemble g, left_color, and right_color into a loop, so the points in left_color are connected to the points of right_color in reverse order.

We will use a “slice index” to reverse the elements of right_color. As explained in the NumPy documentation, a slice index has three parts separated by colons:

  • start: the index of the element where the slice starts,

  • stop: the index of the element where the slice ends, and

  • step: the step size between selected elements.

reverse_right_color = right_color[::-1]

In this example, start and stop are omitted, which means all elements are selected. And step is -1, which means the elements are in reverse order.
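As a quick demonstration on a small array with hypothetical values:

a = np.array([10, 20, 30, 40, 50])

a[1:4]    # start=1, stop=4: array([20, 30, 40])
a[::2]    # step=2 selects every other element: array([10, 30, 50])
a[::-1]   # step=-1 reverses the order: array([50, 40, 30, 20, 10])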

To combine the left_color and right_color arrays we will use the NumPy append function, which takes two arrays as input and returns them combined into a single array. By wrapping these steps in a function, we can use the same code to create the x-values for the loop (colors) and the y-values for the loop (g-band magnitudes).

combined_array = np.append(left_color, reverse_right_color)

We can combine these steps into the following function, which takes two arrays and joins them front-to-back:

def front_to_back(first, second):
    """Join two arrays front to back."""
    return np.append(first, second[::-1])

We can use front_to_back to make a loop that includes the elements of left_color and right_color:

color_loop = front_to_back(left_color, right_color)
color_loop.shape
(234,)

And a corresponding loop with the elements of g in forward and reverse order.

mag_loop = front_to_back(g, g)
mag_loop.shape
(234,)

Here is the loop on our plot:

plot_cmd(candidate_df)
plt.plot(color_loop, mag_loop);
<Figure size 432x288 with 1 Axes>

Color magnitude diagram of our selected stars showing polygon defined by boundaries as blue curve.

To make a Polygon, it will be useful to put color_loop and mag_loop into a DataFrame. This is convenient for two reasons. First, Polygon expects an Nx2 array, and a two-column DataFrame is an easy and descriptive way to provide one. Second, for reproducibility we may want to save the region we use to select stars, and as we have already seen, a DataFrame can be saved in a variety of formats.

loop_df = pd.DataFrame()
loop_df['color_loop'] = color_loop
loop_df['mag_loop'] = mag_loop
loop_df.head()
   color_loop   mag_loop
0    0.632171  21.411746
1    0.610238  21.322466
2    0.588449  21.233380
3    0.566924  21.144427
4    0.545461  21.054549

Now we can pass loop_df to Polygon:

from matplotlib.patches import Polygon

polygon = Polygon(loop_df)
polygon
<matplotlib.patches.Polygon at 0x7f439d33fdf0>

The result is a Polygon object, which provides contains_points, which figures out which points are inside the polygon.

When we encounter a new object, it is good to create a toy example to test that it does what we think it does. We will create two points, one that we expect to be inside the polygon and one that we expect to be outside the polygon, and check that we get the results we expect from contains_points.

test_points = [(0.4, 20),
               (0.4, 16)]

Now we can make sure contains_points does what we expect.

test_inside_mask = polygon.contains_points(test_points)
test_inside_mask
array([ True, False])

The result is an array of Boolean values.

We are almost ready to select stars whose photometry data falls in this polygon. But first we need to do some data cleaning.

Save the polygon

Reproducible research is “the idea that … the full computational environment used to produce the results in the paper such as the code, data, etc. can be used to reproduce the results and create new work based on the research.”

This lesson is an example of reproducible research because it contains all of the code needed to reproduce the results, including the database queries that download the data and the analysis.

In this episode, we used an isochrone to derive a polygon, which we used to select stars based on photometry. So it is important to record the polygon as part of the data analysis pipeline.

Here is how we can save it in an HDF5 file.

filename = 'gd1_data.hdf'
loop_df.to_hdf(filename, 'loop_df')

Selecting based on photometry

Now we will check how many of the candidate stars are inside the polygon we chose. contains_points expects a list of (x, y) pairs. As with creating the Polygon, a DataFrame is a convenient way to pass the colors and magnitudes of all the stars in candidate_df to our Polygon to see which candidates are inside it. We will start by putting color and magnitude data from candidate_df into a new DataFrame.

cmd_df = pd.DataFrame()

cmd_df['color'] = candidate_df['g_mean_psf_mag'] - candidate_df['i_mean_psf_mag']
cmd_df['mag'] = candidate_df['g_mean_psf_mag']

cmd_df.head()
    color      mag
0  0.3804  17.8978
1  1.6092  19.2873
2  0.4457  16.9238
3  1.5902  19.9242
4  1.4853  16.1516

We can pass this DataFrame to contains_points:

inside_mask = polygon.contains_points(cmd_df)
inside_mask
array([False, False, False, ..., False, False, False])

The result is a Boolean array. We can use sum to see how many stars fall inside the polygon.

inside_mask.sum()
454

Now we can use inside_mask as a mask to select stars that fall inside the polygon.

winner_df = candidate_df[inside_mask]

We will make a color-magnitude plot one more time, highlighting the selected stars with green markers.

plot_cmd(candidate_df)
plt.plot(iso_df['color_g_i'], iso_df['mag_g'])
plt.plot(color_loop, mag_loop)

x = winner_df['g_mean_psf_mag'] - winner_df['i_mean_psf_mag']
y = winner_df['g_mean_psf_mag']
plt.plot(x, y, 'go', markersize=0.5, alpha=0.5);
<Figure size 432x288 with 1 Axes>

Color magnitude diagram showing our selected stars in green, inside our polygon.

The selected stars are, in fact, inside the polygon, which means they have photometry data consistent with GD-1.

Finally, we can plot the coordinates of the selected stars:

fig = plt.figure(figsize=(10,2.5))

x = winner_df['phi1']
y = winner_df['phi2']
plt.plot(x, y, 'ko', markersize=0.7, alpha=0.9)

plt.xlabel(r'$\phi_1$ [deg]')
plt.ylabel(r'$\phi_2$ [deg]')
plt.title('Proper motion + photometry selection', fontsize='medium')

plt.axis('equal');
<Figure size 720x180 with 1 Axes>

Right ascension and declination of selected stars in GD-1 frame after selecting for both proper motion and photometry.

This example includes the new Matplotlib command figure, which creates the larger canvas that the subplots are placed on. In previous examples, we did not have to use this function; the figure was created automatically. But when we call it explicitly, we can provide arguments like figsize, which sets the size of the figure. It also returns a figure object, which we will use to further customize our plotting in the next episode.
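Having the figure object also means we can save the finished figure to a file; a minimal sketch, with a hypothetical filename:

fig.savefig('gd1_selected.png', dpi=300)   # hypothetical filename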

In the example above we also used TeX markup in our axis labels so that they render as the Greek letter $\phi$ with subscripts for 1 and 2. Matplotlib allows us to write basic TeX markup by wrapping the text we want rendered as TeX in $ and then using TeX commands inside. Because backslash sequences like \phi have special meaning in ordinary Python strings, it is safest to write these labels as raw strings (r'...'). This basic rendering is performed with mathtext; more advanced rendering with LaTeX can be done with the usetex option in rcParams, which we will discuss in Episode 7.

In the next episode we are going to make this plot several more times, so it makes sense to make a function. As we have done with previous functions we can copy and paste what we just wrote, replacing the specific variable winner_df with the more generic df.

def plot_cmd_selection(df):
    x = df['phi1']
    y = df['phi2']

    plt.plot(x, y, 'ko', markersize=0.7, alpha=0.9)

    plt.xlabel(r'$\phi_1$ [deg]')
    plt.ylabel(r'$\phi_2$ [deg]')
    plt.title('Proper motion + photometry selection', fontsize='medium')

    plt.axis('equal')

And here is how we use the function.

fig = plt.figure(figsize=(10,2.5))
plot_cmd_selection(winner_df)
<Figure size 1000x250 with 1 Axes>

Spatial coordinates of the selected stars in the GD-1 frame, plotted with plot_cmd_selection.

Write the data

Finally, we will write the selected stars to a file.

filename = 'gd1_data.hdf'
winner_df.to_hdf(filename, 'winner_df')
MB = 1024 * 1024
getsize(filename) / MB
3.6441001892089844

Summary

In this episode, we used photometry data from Pan-STARRS to draw a color-magnitude diagram. We used an isochrone to define a polygon and select stars we think are likely to be in GD-1. Plotting the results, we have a clearer picture of GD-1, similar to Figure 1 in the original paper.

Key Points

  • Matplotlib provides operations for working with points, polygons, and other geometric entities, so it is not just for making figures.

  • Use Matplotlib options to control the size and aspect ratio of figures to make them easier to interpret.

  • Record every element of the data analysis pipeline that would be needed to replicate the results.


Visualization

Overview

Teaching: 55 min
Exercises: 30 min
Questions
  • How do we make a compelling visualization that tells a story?

Objectives
  • Design a figure that tells a compelling story.

  • Use Matplotlib features to customize the appearance of figures.

  • Generate a figure with multiple subplots.

In the previous episode, we selected photometry data from Pan-STARRS and used it to identify stars we think are likely to be in GD-1.

In this episode, we will take the results from previous episodes and use them to make a figure that tells a compelling scientific story.

Outline

  1. Starting with the figure from the previous episode, we will add annotations to present the results more clearly.

  2. Then we will learn several ways to customize figures to make them more appealing and effective.

  3. Finally, we will learn how to make a figure with multiple panels.

Starting from this episode

If you are starting a new notebook for this episode, expand this section for information you will need to get started.

Read me

In the previous episode, we selected stars in GD-1 based on proper motion and downloaded the spatial, proper motion, and photometry information by joining the Gaia and Pan-STARRS datasets. We will use that data for this episode. Whether you are working from a new notebook or coming back from a checkpoint, reloading the data will save you from having to run the query again.

If you are starting this episode here or starting this episode in a new notebook, you will need to run the following lines of code.

This imports previously imported functions:

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.patches import Polygon

from episode_functions import *

The following code loads in the data (instructions for downloading data can be found in the setup instructions). You may need to add the path to the filename variable below (e.g. filename = 'student_download/data/gd1_data.hdf')

filename = 'gd1_data.hdf'
winner_df = pd.read_hdf(filename, 'winner_df')

centerline_df = pd.read_hdf(filename, 'centerline_df')
candidate_df = pd.read_hdf(filename, 'candidate_df')
loop_df = pd.read_hdf(filename, 'loop_df')

This defines previously defined quantities:

pm1_min = -8.9
pm1_max = -6.9
pm2_min = -2.2
pm2_max =  1.0

pm1_rect, pm2_rect = make_rectangle(
    pm1_min, pm1_max, pm2_min, pm2_max)

Making Figures That Tell a Story

The figures we have made so far have been “quick and dirty”. Mostly we have used Matplotlib’s default style, although we have adjusted a few parameters, like markersize and alpha, to improve legibility.

Now that the analysis is done, it is time to think more about:

  1. Making professional-looking figures that are ready for publication.

  2. Making figures that communicate a scientific result clearly and compellingly.

Not necessarily in that order.

We will start by reviewing Figure 1 from the original paper. We have seen the individual panels, but now we will look at the whole figure, along with the caption:

Figure 1 from Price-Whelan and Bonaca paper with four panels and caption. Caption reads: On-sky positions of likely GD-1 members in the GD-1 coordinate system. GD-1 is apparent as an overdensity in negative proper motions (top-right panel, orange box), so selecting on proper motion already reveals the stream in positions of individual stars (top-left panel). The stream also stands out in the color–magnitude diagram (bottom-right panel) as older and more metal-poor than the background. Selecting the main sequence of GD-1 (orange, shaded region in the bottom-right panel) along with proper motion cuts unveils the stream in unprecedented detail (bottom-left panel).

Exercise (5 minutes)

Think about the following questions:

  1. What is the primary scientific result of this work?

  2. What story is this figure telling?

  3. In the design of this figure, can you identify 1 or 2 choices the authors made that you think are effective? Think about big-picture elements, like the number of panels and how they are arranged, as well as details like the choice of typeface.

  4. Can you identify 1 or 2 elements that could be improved, or that you might have done differently?

Solution

Some topics that might come up in this discussion:

  1. The primary result is that the multiple stages of selection make it possible to separate likely candidates from the background more effectively than in previous work, which makes it possible to see the structure of GD-1 in “unprecedented detail”.

  2. The figure documents the selection process as a sequence of steps. Reading right-to-left, top-to-bottom, we see selection based on proper motion, the results of the first selection, selection based on color and magnitude, and the results of the second selection. So this figure documents the methodology and presents the primary result.

  3. The figure is mostly black and white, with minimal use of color, so it will work well in print. The annotations in the bottom left panel guide the reader to the most important results.
    It contains enough technical detail for a professional audience, but most of it is also comprehensible to a more general audience.
    The two left panels have the same dimensions and their axes are aligned.

  4. Since the panels represent a sequence, it might be better to arrange them left-to-right. The placement and size of the axis labels could be tweaked. The entire figure could be a little bigger to match the width and proportion of the caption.
    The top left panel has unused white space (but that leaves space for the annotations in the bottom left).

Plotting GD-1 with Annotations

The lower left panel in the paper uses three other features to present the results more clearly and compellingly:

  • a dashed vertical line that separates the previously undetected region of GD-1,

  • a text label for that region, and

  • annotations with arrows that point out the spur and the gap.

Exercise (15 minutes)

Plot the selected stars in winner_df using the plot_cmd_selection function and then choose any or all of these features and add them to the figure:

Here is some additional information about text and arrows.

Solution

fig = plt.figure(figsize=(10,2.5))
plot_cmd_selection(winner_df)
plt.axvline(-55, ls='--', color='gray', 
            alpha=0.4, dashes=(6,4), lw=2)
plt.text(-60, 5.5, 'Previously\nundetected', 
         fontsize='small', ha='right', va='top');

arrowprops=dict(color='gray', shrink=0.05, width=1.5, 
                headwidth=6, headlength=8, alpha=0.4)

plt.annotate('Spur', xy=(-33, 2), xytext=(-35, 5.5),
             arrowprops=arrowprops,
             fontsize='small')

plt.annotate('Gap', xy=(-22, -1), xytext=(-25, -5.5),
             arrowprops=arrowprops,
             fontsize='small')

Customization

Matplotlib provides a default style that determines things like the colors of lines, the placement of labels and ticks on the axes, and many other properties.

There are several ways to override these defaults and customize your figures:

  • to customize a single figure, call methods on the figure and axes objects, as we do with tick_params below,

  • to customize every figure in a notebook, change the defaults stored in rcParams, and

  • to switch between whole sets of defaults, use style sheets.

As a simple example, notice that Matplotlib puts ticks on the outside of the figures by default, and only on the left and bottom sides of the axes.

So far, everything we have wanted to do we could call directly from the pyplot module using the plt. prefix. As you do more and more customization, you may need to call methods on the plotting objects themselves. To use the method that changes the direction of the ticks, we need an axes object. So far, Matplotlib has implicitly created our axes object when we called plt.plot. To explicitly create an axes object, we can first create our figure object and then add an axes object to it.

fig = plt.figure(figsize=(10,2.5))
ax = fig.add_subplot(1,1,1)

subplot and axes

Confusingly, in Matplotlib the objects subplot and axes are often used interchangeably. This is because a subplot is an axes object with additional methods and attributes.

You can use the add_subplot method to add more than one axes object to a figure. For this reason you have to specify the total number of rows, the total number of columns, and which plot number you are creating (fig.add_subplot(nrows, ncols, pltnum)). The plot number starts in the upper left corner and goes left to right and then top to bottom. In the example above we have one row, one column, and we are plotting into the first plot space.
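For example, here is a sketch of a grid with one row and two columns; the plot number counts across the row:

fig = plt.figure()
ax_left = fig.add_subplot(1, 2, 1)    # 1 row, 2 columns, first plot space (left)
ax_right = fig.add_subplot(1, 2, 2)   # 1 row, 2 columns, second plot space (right)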

Now we are ready to change the direction of the ticks to the inside of the axes using our new axes object.

fig = plt.figure(figsize=(10,2.5))
ax = fig.add_subplot(1,1,1)
ax.tick_params(direction='in')

Exercise (5 minutes)

Read the documentation of tick_params and use it to put ticks on the top and right sides of the axes.

Solution

fig = plt.figure(figsize=(10,2.5))
ax = fig.add_subplot(1,1,1)
ax.tick_params(top=True, right=True)

rcParams

If you want to make a customization that applies to all figures in a notebook, you can use rcParams. When you import Matplotlib, a dictionary is created with default values for everything you can change about your plots. Calling tick_params above overrides these defaults for a single axes object; changing rcParams overrides them for every figure that follows.

Here is an example that reads the current font size from rcParams:

plt.rcParams['font.size']
10.0

And sets it to a new value:

plt.rcParams['font.size'] = 14

Exercise

Plot the previous figure again, and see what font sizes have changed. Look up any other element of rcParams, change its value, and check the effect on the figure.

When you import Matplotlib, plt.rcParams is populated from a matplotlibrc file. If you want to permanently change a setting for every plot you make, you can set it in your matplotlibrc file. To find out where your matplotlibrc file lives, type:

import matplotlib as mpl
mpl.matplotlib_fname()

If the file doesn’t exist, you can download a sample matplotlibrc file to modify.

Style sheets

It is possible that you would like multiple sets of defaults, for example, one default for plots for scientific papers and another for talks or posters. Because the matplotlibrc file is read when you import Matplotlib, it is not easy to switch from one set of options to another.

The solution to this problem is style sheets, which you can read about here.

Matplotlib provides a set of predefined style sheets, or you can make your own. The style sheets reference shows a gallery of plots generated by common style sheets.

You can display a list of style sheets installed on your system.

plt.style.available
['Solarize_Light2',
 '_classic_test_patch',
 'bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn',
 'seaborn-bright',
[Output truncated]

Note that seaborn-paper, seaborn-talk and seaborn-poster are particularly intended to prepare versions of a figure with text sizes and other features that work well in papers, talks, and posters.

To use any of these style sheets, run plt.style.use like this:

plt.style.use('fivethirtyeight')

The style sheet you choose will affect the appearance of all figures you plot after calling use, unless you override any of the options or call use again.
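If you only want a style to apply to a single figure, you can use plt.style.context as a context manager; the previous settings are restored when the block ends. Here is a sketch with hypothetical data:

with plt.style.context('ggplot'):
    plt.plot([1, 2, 3], [1, 4, 9])   # drawn with the ggplot style

# Figures created after the block use the previous style again.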

Exercise

Choose one of the styles on the list and select it by calling use. Then go back and plot one of the figures above and see what changes in the figure’s appearance.

If you can’t find a style sheet that is exactly what you want, you can make your own. This repository includes a style sheet called az-paper-twocol.mplstyle, with customizations chosen by Azalee Bostroem for publication in astronomy journals.

You can use it like this:

plt.style.use('./az-paper-twocol.mplstyle')

The prefix ./ tells Matplotlib to look for the file in the current directory.

As an alternative, you can install a style sheet for your own use by putting it into a directory named stylelib/ in your configuration directory.
To find out where the Matplotlib configuration directory is, you can run the following command:

mpl.get_configdir()

Multiple panels

So far we have been working with one figure at a time, but the figure we are replicating contains multiple panels. We will create each of these panels as a different subplot. Matplotlib has multiple functions for making figures with multiple panels. We have already used add_subplot; however, it creates equal-sized panels. For this reason, we will use subplot2grid, which allows us to control the relative sizes of the panels.

Since we have already written functions that generate each panel of this figure, we can now create the full multi-panel figure by creating each subplot and then running the corresponding plotting function.

Like add_subplot, subplot2grid requires us to specify the total number of columns and rows in the grid (this time as a tuple called shape), and the location of the subplot (loc) - a tuple identifying the location in the grid we are about to fill.

In this example, shape is (2, 2) to create two rows and two columns.

For the first panel, loc is (0, 0), which indicates row 0 and column 0, which is the upper-left panel.

Here is how we use this function to draw the four panels.

fig = plt.figure()
shape = (2, 2)
ax1 = plt.subplot2grid(shape, (0, 0))
plot_pm_selection(candidate_df)

ax2 = plt.subplot2grid(shape, (0, 1))
plot_proper_motion(centerline_df)

ax3 = plt.subplot2grid(shape, (1, 0))
plot_cmd_selection(winner_df)

ax4 = plt.subplot2grid(shape, (1, 1))
plot_cmd(candidate_df)

plt.tight_layout()
<Figure size 640x480 with 4 Axes>

Four paneled plot showing our first recreation of figure 1 from the Price-Whelan and Bonaca paper.

We use plt.tight_layout at the end, which adjusts the sizes of the panels to make sure the titles and axis labels don’t overlap. Notice how convenient it is that we have written functions to plot each panel. This code is concise and readable: we can tell what is being plotted in each panel thanks to our explicit function names and we know what function to investigate if we want to see the mechanics of exactly how the plotting is done.

Exercise

What happens if you leave out tight_layout?

Solution

Without tight_layout the space between the panels is too small. In this situation, the titles from the lower plots overlap with the x-axis labels from the upper panels and the axis labels from the right-hand panels overlap with the plots in the left-hand panels.

Adjusting proportions

In the previous figure, the panels are all the same size. To get a better view of GD-1, we would like to stretch the panels on the left and compress the ones on the right.

To do that, we will use the colspan argument to make a panel that spans multiple columns in the grid. To do this we will need more columns so we will change the shape from (2,2) to (2,4).

The panels on the left span three columns, so they are three times wider than the panels on the right.

At the same time, we use figsize to adjust the aspect ratio of the whole figure.

plt.figure(figsize=(9, 4.5))

shape = (2, 4)
ax1 = plt.subplot2grid(shape, (0, 0), colspan=3)
plot_pm_selection(candidate_df)

ax2 = plt.subplot2grid(shape, (0, 3))
plot_proper_motion(centerline_df)

ax3 = plt.subplot2grid(shape, (1, 0), colspan=3)
plot_cmd_selection(winner_df)

ax4 = plt.subplot2grid(shape, (1, 3))
plot_cmd(candidate_df)

plt.tight_layout()
<Figure size 900x450 with 4 Axes>

Four paneled plot we created above with two left-hand panels increased in width.

This is looking more and more like the figure in the paper.

Exercise (5 minutes)

In this example, the ratio of the widths of the panels is 3:1. How would you adjust it if you wanted the ratio to be 3:2?

Solution


plt.figure(figsize=(9, 4.5))

shape = (2, 5)                                   # CHANGED
ax1 = plt.subplot2grid(shape, (0, 0), colspan=3)
plot_pm_selection(candidate_df)

ax2 = plt.subplot2grid(shape, (0, 3), colspan=2)       # CHANGED
plot_proper_motion(centerline_df)

ax3 = plt.subplot2grid(shape, (1, 0), colspan=3)
plot_cmd_selection(winner_df)

ax4 = plt.subplot2grid(shape, (1, 3), colspan=2)       # CHANGED
plot_cmd(candidate_df)

plt.tight_layout()

Adding the shaded regions

The one thing our figure is missing is the shaded regions showing the stars selected by proper motion and around the isochrone in the color magnitude diagram.

In episode 4 we defined a rectangle in proper motion space around the stars in GD-1. We stored the x-values of the vertices of this rectangle in pm1_rect and the y-values as pm2_rect.

To plot this rectangle, we will use the Matplotlib Polygon object, which we used in episode 6 to check which points were inside the polygon we drew around the isochrone. This time, however, we will be plotting the Polygon itself.

To create a Polygon, we have to put the coordinates of the rectangle in an array with x values in the first column and y values in the second column.

vertices = np.transpose([pm1_rect, pm2_rect])
vertices
array([[-8.9, -2.2],
       [-8.9,  1. ],
       [-6.9,  1. ],
       [-6.9, -2.2]])

We will now create the Polygon, specifying its display properties, which will be used when it is plotted. We will specify closed=True to make sure the shape is closed, facecolor='orange' to color the inside of the Polygon orange, and alpha=0.4 to make the Polygon semi-transparent.

poly = Polygon(vertices, closed=True, 
                   facecolor='orange', alpha=0.4)

Then to plot the Polygon we call the add_patch method. Like tick_params, add_patch must be called on an axes or subplot object, so we will create a subplot and then add the Patch to it.

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
poly = Polygon(vertices, closed=True, 
                   facecolor='orange', alpha=0.4)
ax.add_patch(poly)
ax.set_xlim(-10, 7.5)
ax.set_ylim(-10, 10)
(-10.0, 10.0)

An orange rectangle at the coordinates used to select stars based on proper motion.

We can now call our plot_proper_motion function to plot the proper motion for each star, and then add a shaded Polygon to show the region we selected.

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
plot_proper_motion(centerline_df)
poly = Polygon(vertices, closed=True, 
               facecolor='C1', alpha=0.4)
ax.add_patch(poly)

Proper motion with overlaid polygon showing our selected stars.

Exercise (5 minutes)

Add a few lines to be run after the plot_cmd function to show the polygon we selected as a shaded area.

Hint: pass loop_df as an argument to Polygon as we did in episode 6 and then plot it using add_patch.

Solution

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
poly_cmd = Polygon(loop_df, closed=True, 
              facecolor='C1', alpha=0.4)
ax.add_patch(poly_cmd)

Exercise

Add the Polygon patches you just created to the right panels of the four panel figure.

Solution

fig = plt.figure(figsize=(9, 4.5))

shape = (2, 4)
ax1 = plt.subplot2grid(shape, (0, 0), colspan=3)
plot_pm_selection(candidate_df)

ax2 = plt.subplot2grid(shape, (0, 3))
plot_proper_motion(centerline_df)
poly = Polygon(vertices, closed=True,
               facecolor='orange', alpha=0.4)
ax2.add_patch(poly)

ax3 = plt.subplot2grid(shape, (1, 0), colspan=3)
plot_cmd_selection(winner_df)

ax4 = plt.subplot2grid(shape, (1, 3))
plot_cmd(candidate_df)
poly_cmd = Polygon(loop_df, closed=True, 
               facecolor='orange', alpha=0.4)
ax4.add_patch(poly_cmd)

plt.tight_layout()
<Figure size 900x450 with 4 Axes>

Four paneled plot we created above, now with the shaded selection regions added to the two right-hand panels.

Summary

In this episode, we reverse-engineered the figure we have been replicating, identifying elements that seem effective and others that could be improved.

We explored features Matplotlib provides for adding annotations to figures – including text, lines, arrows, and polygons – and several ways to customize the appearance of figures. And we learned how to create figures that contain multiple panels.

Key Points

  • The most effective figures focus on telling a single story clearly.

  • Consider using annotations to guide the reader’s attention to the most important elements of a figure.

  • The default Matplotlib style generates good quality figures, but there are several ways you can override the defaults.

  • If you find yourself making the same customizations on several projects, you might want to create your own style sheet.