This lesson is being piloted (Beta version)

Foundations of Astronomical Data Science

Basic queries

Overview

Teaching: 65 min
Exercises: 25 min
Questions
  • How can we select and download the data we want from the Gaia server?

Objectives
  • Compose a basic query in ADQL/SQL.

  • Use queries to explore a database and its tables.

  • Use queries to download data.

  • Develop, test, and debug a query incrementally.

1. Queries

This is the first in a series of lessons about working with astronomical data.

As a running example, we will replicate parts of the analysis in a recent paper, “Off the beaten path: Gaia reveals GD-1 stars outside of the main stream” by Adrian Price-Whelan and Ana Bonaca.

Outline

This lesson demonstrates the steps for selecting and downloading data from the Gaia Database:

  1. First we’ll make a connection to the Gaia server,

  2. We will explore information about the database and the tables it contains,

  3. We will write a query and send it to the server, and finally

  4. We will download the response from the server.

Query Language

In order to select data from a database, you have to compose a query, which is a program written in a “query language”. The query language we’ll use is ADQL, which stands for “Astronomical Data Query Language”.

ADQL is a dialect of SQL (Structured Query Language), which is by far the most commonly used query language. Almost everything you will learn about ADQL also works in SQL.

The reference manual for ADQL is here. But you might find it easier to learn from this ADQL Cookbook.

Using Jupyter

If you have not worked with Jupyter notebooks before, you might start with the tutorial on from Jupyter.org called “Try Classic Notebook”, or this tutorial from DataQuest.

There are two environments you can use to write and run notebooks:

For these lessons, you can use either one.

If you are too impatient for the tutorials, here are the most important things to know:

  1. Notebooks are made up of code cells and text cells (and a few other less common kinds). Code cells contain code; text cells, like this one, contain explanatory text written in Markdown.

  2. To run a code cell, click the cell to select it and press Shift-Enter. The output of the code should appear below the cell.

  3. In general, notebooks only run correctly if you run every code cell in order from top to bottom. If you run cells out of order, you are likely to get errors.

  4. You can modify existing cells, but then you have to run them again to see the effect.

  5. You can add new cells, but again, you have to be careful about the order you run them in.

  6. If you have added or modified cells and the behavior of the notebook seems strange, you can restart the “kernel”, which clears all of the variables and functions you have defined, and run the cells again from the beginning.

Before you go on, you might want to explore the other menus and the toolbar to see what else you can do.

Connecting to Gaia

The library we’ll use to get Gaia data is Astroquery. Astroquery provides Gaia, which is an object that represents a connection to the Gaia database.

We can connect to the Gaia database like this:

from astroquery.gaia import Gaia
Created TAP+ (v1.2.1) - Connection:
	Host: gea.esac.esa.int
	Use HTTPS: True
	Port: 443
	SSL Port: 443
Created TAP+ (v1.2.1) - Connection:
	Host: geadata.esac.esa.int
	Use HTTPS: True
	Port: 443
	SSL Port: 443

This import statement creates a TAP+ connection; TAP stands for “Table Access Protocol”, which is a network protocol for sending queries to the database and getting back the results.

Databases and Tables

What is a database, anyway? Most generally, it can be any collection of data, but when we are talking about ADQL or SQL:

We can use Gaia.load_tables to get the names of the tables in the Gaia database. With the option only_names=True, it loads information about the tables, called “metadata”, not the data itself.

tables = Gaia.load_tables(only_names=True)
INFO: Retrieving tables... [astroquery.utils.tap.core]
INFO: Parsing tables... [astroquery.utils.tap.core]
INFO: Done. [astroquery.utils.tap.core]

The following for loop prints the names of the tables.

for table in tables:
    print(table.name)
external.apassdr9
external.gaiadr2_geometric_distance
external.gaiaedr3_distance
external.galex_ais
external.ravedr5_com
external.ravedr5_dr5
external.ravedr5_gra
external.ravedr5_on
external.sdssdr13_photoprimary
external.skymapperdr1_master
external.skymapperdr2_master
[Output truncated]

So that’s a lot of tables. The ones we’ll use are:

We can use load_table (not load_tables) to get the metadata for a single table. The name of this function is misleading, because it only downloads metadata, not the contents of the table.

meta = Gaia.load_table('gaiadr2.gaia_source')
meta
Retrieving table 'gaiadr2.gaia_source'
Parsing table 'gaiadr2.gaia_source'...
Done.

<astroquery.utils.tap.model.taptable.TapTableMeta at 0x7f50edd2aeb0>

Jupyter shows that the result is an object of type TapTableMeta, but it does not display the contents.

To see the metadata, we have to print the object.

print(meta)
TAP Table name: gaiadr2.gaiadr2.gaia_source
Description: This table has an entry for every Gaia observed source as listed in the
Main Database accumulating catalogue version from which the catalogue
release has been generated. It contains the basic source parameters,
that is only final data (no epoch data) and no spectra (neither final
nor epoch).
Num. columns: 96

Columns

The following loop prints the names of the columns in the table.

for column in meta.columns:
    print(column.name)
solution_id
designation
source_id
random_index
ref_epoch
ra
ra_error
dec
dec_error
parallax
parallax_error
[Output truncated]

You can probably infer what many of these columns are by looking at the names, but you should resist the temptation to guess. To find out what the columns mean, read the documentation.

If you want to know what can go wrong when you don’t read the documentation, you might like this article.

Exercise (2 minutes)

One of the other tables we’ll use is gaiadr2.panstarrs1_original_valid. Use load_table to get the metadata for this table. How many columns are there and what are their names?

Solution

meta2 = Gaia.load_table('gaiadr2.panstarrs1_original_valid')
print(meta2)

for column in meta2.columns:
    print(column.name)

Writing queries

By now you might be wondering how we download these tables. With tables this big, you generally don’t. Instead, you use queries to select only the data you want.

A query is a string written in a query language like SQL; for the Gaia database, the query language is a dialect of SQL called ADQL.

Here’s an example of an ADQL query.

query1 = """SELECT 
TOP 10
source_id, ra, dec, parallax 
FROM gaiadr2.gaia_source
"""

Python note

We use a triple-quoted string here so we can include line breaks in the query, which makes it easier to read.

The words in uppercase are ADQL keywords:

The third line is a list of column names, indicating which columns we want.

In this example, the keywords are capitalized and the column names are lowercase. This is a common style, but it is not required. ADQL and SQL are not case-sensitive.

Also, the query is broken into multiple lines to make it more readable. This is a common style, but not required. Line breaks don’t affect the behavior of the query.

To run this query, we use the Gaia object, which represents our connection to the Gaia database, and invoke launch_job:

job = Gaia.launch_job(query1)
job
<astroquery.utils.tap.model.job.Job at 0x7f50edd2adc0>

The result is an object that represents the job running on a Gaia server.

If you print it, it displays metadata for the forthcoming results.

print(job)
<Table length=10>
   name    dtype  unit                            description                             n_bad
--------- ------- ---- ------------------------------------------------------------------ -----
source_id   int64      Unique source identifier (unique within a particular Data Release)     0
       ra float64  deg                                                    Right ascension     0
      dec float64  deg                                                        Declination     0
 parallax float64  mas                                                           Parallax     2
Jobid: None
Phase: COMPLETED
Owner: None
Output file: sync_20210315090602.xml.gz
[Output truncated]

Don’t worry about Results: None. That does not actually mean there are no results.

However, Phase: COMPLETED indicates that the job is complete, so we can get the results like this:

results = job.get_results()
type(results)
astropy.table.table.Table

The type function indicates that the result is an Astropy Table.

Repetition

Why is table repeated three times? The first is the name of the module, the second is the name of the submodule, and the third is the name of the class. Most of the time we only care about the last one. It’s like the Linnean name for gorilla, which is Gorilla gorilla gorilla.

An Astropy Table is similar to a table in an SQL database except:

Jupyter knows how to display the contents of a Table.

results
<Table length=10>
     source_id              ra                 dec               parallax      
                           deg                 deg                 mas         
       int64             float64             float64             float64       
------------------- ------------------ ------------------- --------------------
5887983246081387776   227.978818386372  -53.64996962450103   1.0493172163332998
5887971250213117952 228.32280834041364  -53.66270726203726  0.29455652682279093
5887991866047288704  228.1582047014091 -53.454724911639794  -0.5789179941669236
5887968673232040832 228.07420888099884   -53.8064612895961  0.41030970779603076
5887979844465854720 228.42547805195946  -53.48882284470035 -0.23379683441525864
5887978607515442688 228.23831627636855  -53.56328249482688  -0.9252161956789068
[Output truncated]

Each column has a name, units, and a data type.

For example, the units of ra and dec are degrees, and their data type is float64, which is a 64-bit floating-point number, used to store measurements with a fraction part.

This information comes from the Gaia database, and has been stored in the Astropy Table by Astroquery.

Exercise (3 minutes)

Read the documentation of this table and choose a column that looks interesting to you. Add the column name to the query and run it again. What are the units of the column you selected? What is its data type?

Solution

Let’s add radial_velocity : Radial velocity (double, Velocity[km/s] )

Spectroscopic radial velocity in the solar barycentric reference frame.

The radial velocity provided is the median value of the radial velocity measurements at all epochs.

query = """SELECT 
TOP 10
source_id, ra, dec, parallax, radial_velocity
FROM gaiadr2.gaia_source
"""

Asynchronous queries

launch_job asks the server to run the job “synchronously”, which normally means it runs immediately. But synchronous jobs are limited to 2000 rows. For queries that return more rows, you should run “asynchronously”, which mean they might take longer to get started.

If you are not sure how many rows a query will return, you can use the SQL command COUNT to find out how many rows are in the result without actually returning them. We’ll see an example in the next lesson.

The results of an asynchronous query are stored in a file on the server, so you can start a query and come back later to get the results. For anonymous users, files are kept for three days.

As an example, let’s try a query that’s similar to query1, with these changes:

query2 = """SELECT 
TOP 3000
source_id, ra, dec, pmra, pmdec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1
"""

A WHERE clause indicates which rows we want; in this case, the query selects only rows “where” parallax is less than 1. This has the effect of selecting stars with relatively low parallax, which are farther away. We’ll use this clause to exclude nearby stars that are unlikely to be part of GD-1.

WHERE is one of the most common clauses in ADQL/SQL, and one of the most useful, because it allows us to download only the rows we need from the database.

We use launch_job_async to submit an asynchronous query.

job = Gaia.launch_job_async(query2)
job
INFO: Query finished. [astroquery.utils.tap.core]

<astroquery.utils.tap.model.job.Job at 0x7f50edd40f40>

And here are the results.

results = job.get_results()
results
<Table length=10>
     source_id              ra         ...       parallax       radial_velocity
                           deg         ...         mas               km / s    
       int64             float64       ...       float64            float64    
------------------- ------------------ ... -------------------- ---------------
5895270396817359872 213.08433715252883 ...    2.041947005434917              --
5895272561481374080  213.2606587905109 ...  0.15693467895110133              --
5895247410183786368 213.38479712976664 ... -0.19017525742552605              --
5895249226912448000 213.41587389088238 ...                   --              --
5895261875598904576  213.5508930114549 ... -0.29471722363529257              --
5895258302187834624 213.87631129557286 ...   0.6468437015289753              --
[Output truncated]

You might notice that some values of parallax are negative. As this FAQ explains, “Negative parallaxes are caused by errors in the observations.” They have “no physical meaning,” but they can be a “useful diagnostic on the quality of the astrometric solution.”

Exercise (5 minutes)

The clauses in a query have to be in the right order. Go back and change the order of the clauses in query2 and run it again. The modified query should fail, but notice that you don’t get much useful debugging information.

For this reason, developing and debugging ADQL queries can be really hard. A few suggestions that might help:

  • Whenever possible, start with a working query, either an example you find online or a query you have used in the past.

  • Make small changes and test each change before you continue.

  • While you are debugging, use TOP to limit the number of rows in the result. That will make each test run faster, which reduces your development time.

  • Launching test queries synchronously might make them start faster, too.

Solution

In this example, the WHERE clause is in the wrong place.

query = """SELECT 
TOP 3000
WHERE parallax < 1
source_id, ref_epoch, ra, dec, parallax
FROM gaiadr2.gaia_source
"""

Operators

In a WHERE clause, you can use any of the SQL comparison operators; here are the most common ones:

Symbol Operation
> greater than
< less than
>= greater than or equal
<= less than or equal
= equal
!= or <> not equal

Most of these are the same as Python, but some are not. In particular, notice that the equality operator is =, not ==. Be careful to keep your Python out of your ADQL!

You can combine comparisons using the logical operators:

Finally, you can use NOT to invert the result of a comparison.

Exercise (5 minutes)

Read about SQL operators here and then modify the previous query to select rows where bp_rp is between -0.75 and 2.

Solution

Here’s a solution using > and < operators

query = """SELECT 
TOP 10
source_id, ref_epoch, ra, dec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1 
  AND bp_rp > -0.75 AND bp_rp < 2
"""

And here’s a solution using the BETWEEN operator

query = """SELECT 
TOP 10
source_id, ref_epoch, ra, dec, parallax
FROM gaiadr2.gaia_source
WHERE parallax < 1 
  AND bp_rp BETWEEN -0.75 AND 2
"""

bp_rp contains BP-RP color, which is the difference between two other columns, phot_bp_mean_mag and phot_rp_mean_mag. You can read about this variable here.

This Hertzsprung-Russell diagram shows the BP-RP color and luminosity of stars in the Gaia catalog (Copyright: ESA/Gaia/DPAC, CC BY-SA 3.0 IGO).

Selecting stars with bp-rp less than 2 excludes many class M dwarf stars, which are low temperature, low luminosity. A star like that at GD-1’s distance would be hard to detect, so if it is detected, it it more likely to be in the foreground.

Formatting queries

The queries we have written so far are string “literals”, meaning that the entire string is part of the program. But writing queries yourself can be slow, repetitive, and error-prone.

It is often better to write Python code that assembles a query for you. One useful tool for that is the string format method.

As an example, we’ll divide the previous query into two parts; a list of column names and a “base” for the query that contains everything except the column names.

Here’s the list of columns we’ll select.

columns = 'source_id, ra, dec, pmra, pmdec, parallax'

And here’s the base; it’s a string that contains at least one format specifier in curly brackets (braces).

query3_base = """SELECT 
TOP 10 
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2
"""

This base query contains one format specifier, {columns}, which is a placeholder for the list of column names we will provide.

To assemble the query, we invoke format on the base string and provide a keyword argument that assigns a value to columns.

query3 = query3_base.format(columns=columns)

In this example, the variable that contains the column names and the variable in the format specifier have the same name. That’s not required, but it is a common style.

The result is a string with line breaks. If you display it, the line breaks appear as \n.

query3
'SELECT \nTOP 10 \nsource_id, ra, dec, pmra, pmdec\nFROM gaiadr2.gaia_source\nWHERE parallax < 1\n  AND bp_rp BETWEEN -0.75 AND 2\n'

But if you print it, the line breaks appear as… line breaks.

print(query3)
SELECT 
TOP 10 
source_id, ra, dec, pmra, pmdec
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2

Notice that the format specifier has been replaced with the value of columns.

Let’s run it and see if it works:

job = Gaia.launch_job(query3)
print(job)
<Table length=10>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
Jobid: None
Phase: COMPLETED
Owner: None
[Output truncated]
results = job.get_results()
results
<Table length=10>
     source_id              ra         ...        pmdec        
                           deg         ...       mas / yr      
       int64             float64       ...       float64       
------------------- ------------------ ... --------------------
5895272561481374080  213.2606587905109 ...   1.2299266281737415
5895261875598904576  213.5508930114549 ...   -4.672602679543312
5895247444506644992 213.33215109206796 ...   -3.538080792097856
5895259470417635968 213.78815034206346 ...  -0.8163762113468646
5895265925746051584 213.17082359534547 ...  -4.8585444120179595
5895260913525974528 213.66936020541976 ...  -1.5566420086447643
[Output truncated]

Good so far.

Exercise (10 minutes)

This query always selects sources with parallax less than 1. But suppose you want to take that upper bound as an input.

Modify query3_base to replace 1 with a format specifier like {max_parallax}. Now, when you call format, add a keyword argument that assigns a value to max_parallax, and confirm that the format specifier gets replaced with the value you provide.

Solution

query_base = """SELECT 
TOP 10
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < {max_parallax} AND 
bp_rp BETWEEN -0.75 AND 2
"""

query = query_base.format(columns=columns,
                          max_parallax=0.5)
print(query)

Summary

This notebook demonstrates the following steps:

  1. Making a connection to the Gaia server,

  2. Exploring information about the database and the tables it contains,

  3. Writing a query and sending it to the server, and finally

  4. Downloading the response from the server as an Astropy Table.

In the next lesson we will extend these queries to select a particular region of the sky.

Key Points

  • If you can’t download an entire dataset (or it’s not practical) use queries to select the data you need.

  • Read the metadata and the documentation to make sure you understand the tables, their columns, and what they mean.

  • Develop queries incrementally: start with something simple, test it, and add a little bit at a time.

  • Use ADQL features like TOP and COUNT to test before you run a query that might return a lot of data.

  • If you know your query will return fewer than 3000 rows, you can run it synchronously, which might complete faster (but it doesn’t seem to make much difference). If it might return more than 3000 rows, you should run it asynchronously.

  • ADQL and SQL are not case-sensitive, so you don’t have to capitalize the keywords, but you should.

  • ADQL and SQL don’t require you to break a query into multiple lines, but you should.

  • Make each section of the notebook self-contained. Try not to use the same variable name in more than one section.

  • Keep notebooks short. Look for places where you can break your analysis into phases with one notebook per phase.


Coordinate Transformations

Overview

Teaching: 75 min
Exercises: 18 min
Questions
  • How do we transform celestial coordinates from one frame to another and save results in files?

Objectives
  • Use Python string formatting to compose more complex ADQL queries.

  • Work with coordinates and other quantities that have units.

  • Download the results of a query and store them in a file.

2. Coordinates and Units

In the previous lesson, we wrote ADQL queries and used them to select and download data from the Gaia server.

In this lesson, we’ll pick up where we left off and write a query to select stars from a particular region of the sky.

Outline

We’ll start with an example that does a “cone search”; that is, it selects stars that appear in a circular region of the sky.

Then, to select stars in the vicinity of GD-1, we’ll:

  • Use Quantity objects to represent measurements with units.

  • Use Astropy to convert coordinates from one frame to another.

  • Use the ADQL keywords POLYGON, CONTAINS, and POINT to select stars that fall within a polygonal region.

  • Submit a query and download the results.

  • Store the results in a FITS file.

Working with Units

The measurements we will work with are physical quantities, which means that they have two parts, a value and a unit. For example, the coordinate $30^{\circ}$ has value 30 and its units are degrees.

Until recently, most scientific computation was done with values only; units were left out of the program altogether, often with catastrophic results.

Astropy provides tools for including units explicitly in computations, which makes it possible to detect errors before they cause disasters.

To use Astropy units, we import them like this:

import astropy.units as u

u is an object that contains most common units and all SI units.

You can use dir to list them, but you should also read the documentation.

dir(u)
['A',
 'AA',
 'AB',
 'ABflux',
 'ABmag',
 'AU',
 'Angstrom',
 'B',
 'Ba',
 'Barye',
 'Bi',
[Output truncated]

To create a quantity, we multiply a value by a unit.

angle = 10 * u.degree
type(angle)
astropy.units.quantity.Quantity

The result is a Quantity object. Jupyter knows how to display Quantities like this:

angle
<Quantity 10. deg>

$10 \; \mathrm{^{\circ}}$

Quantities provide a method called to that converts to other units. For example, we can compute the number of arcminutes in angle:

angle_arcmin = angle.to(u.arcmin)
angle_arcmin
<Quantity 600. arcmin>

$600 \; \mathrm{^{\prime}}$

If you add quantities, Astropy converts them to compatible units, if possible:

angle + 30 * u.arcmin
<Quantity 10.5 deg>

$10.5 \; \mathrm{^{\circ}}$

If the units are not compatible, you get an error. For example:

angle + 5 * u.second

causes a UnitConversionError.

Exercise (3 minutes)

Create a quantity that represents 5 arcminutes and assign it to a variable called radius.

Then convert it to degrees.

Solution

radius = 5 * u.arcmin
print(radius)

radius.to(u.degree)

Selecting a Region

One of the most common ways to restrict a query is to select stars in a particular region of the sky. For example, here’s a query from the Gaia archive documentation that selects objects in a circular region centered at (88.8, 7.4) with a search radius of 5 arcmin (0.08333 deg).

query_cone = """SELECT 
TOP 10 
source_id
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

This query uses three keywords that are specific to ADQL (not SQL):

Here is the documentation of CONTAINS.

A query like this is called a cone search because it selects stars in a cone. Here’s how we run it.

from astroquery.gaia import Gaia

job = Gaia.launch_job(query_cone)
job
Created TAP+ (v1.2.1) - Connection:
	Host: gea.esac.esa.int
	Use HTTPS: True
	Port: 443
	SSL Port: 443
Created TAP+ (v1.2.1) - Connection:
	Host: geadata.esac.esa.int
	Use HTTPS: True
	Port: 443
	SSL Port: 443

<astroquery.utils.tap.model.job.Job at 0x7f277785fa30>
results = job.get_results()
results
<Table length=10>
     source_id     
       int64       
-------------------
3322773965056065536
3322773758899157120
3322774068134271104
3322773930696320512
3322774377374425728
3322773724537891456
3322773724537891328
[Output truncated]

Exercise (5 minutes)

When you are debugging queries like this, you can use TOP to limit the size of the results, but then you still don’t know how big the results will be.

An alternative is to use COUNT, which asks for the number of rows that would be selected, but it does not return them.

In the previous query, replace TOP 10 source_id with COUNT(source_id) and run the query again. How many stars has Gaia identified in the cone we searched?

Solution

query = """SELECT 
COUNT(source_id)
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

job = Gaia.launch_job(query)
results = job.get_results()
results

Getting GD-1 Data

From the Price-Whelan and Bonaca paper, we will try to reproduce Figure 1, which includes this representation of stars likely to belong to GD-1:

The axes of this figure are defined so the x-axis is aligned with the stars in GD-1, and the y-axis is perpendicular.

Ideally, we would select all stars from this rectangle, but there are more than 10 million of them, so

So we’ll start by selecting stars in a smaller rectangle near the center of GD-1, from -55 to -45 degrees $\phi_1$ and -8 to 4 degrees $\phi_2$.

But first we let’s see how to represent these coordinates with Astropy.

Transforming coordinates

Astropy provides a SkyCoord object that represents sky coordinates relative to a specified frame.

The following example creates a SkyCoord object that represents the approximate coordinates of Betelgeuse (alf Ori) in the ICRS frame.

ICRS is the “International Celestial Reference System”, adopted in 1997 by the International Astronomical Union.

from astropy.coordinates import SkyCoord

ra = 88.8 * u.degree
dec = 7.4 * u.degree
coord_icrs = SkyCoord(ra=ra, dec=dec, frame='icrs')

coord_icrs
<SkyCoord (ICRS): (ra, dec) in deg
    (88.8, 7.4)>

SkyCoord provides a function that transforms to other frames. For example, we can transform coords_icrs to Galactic coordinates like this:

coord_galactic = coord_icrs.transform_to('galactic')
coord_galactic
<SkyCoord (Galactic): (l, b) in deg
    (199.79693102, -8.95591653)>

Notice that in the Galactic frame, the coordinates are called l and b, not ra and dec.

To transform to and from GD-1 coordinates, we’ll use a frame defined by Gala, which is an Astropy-affiliated library that provides tools for galactic dynamics.

Gala provides GD1Koposov10, which is “a Heliocentric spherical coordinate system defined by the orbit of the GD-1 stream”.

from gala.coordinates import GD1Koposov10

gd1_frame = GD1Koposov10()
gd1_frame
<GD1Koposov10 Frame>

We can use it to find the coordinates of Betelgeuse in the GD-1 frame, like this:

coord_gd1 = coord_icrs.transform_to(gd1_frame)
coord_gd1
<SkyCoord (GD1Koposov10): (phi1, phi2) in deg
    (-94.97222038, 34.5813813)>

Notice that the coordinates are called phi1 and phi2. These are the coordinates shown in the figure from the paper, above.

Exercise (10 minutes)

Let’s find the location of GD-1 in ICRS coordinates.

  1. Create a SkyCoord object at 0°, 0° in the GD-1 frame.

  2. Transform it to the ICRS frame.

Hint: Because ICRS is built into Astropy, you can specify it by name, icrs (as we did with galactic).

Solution

origin_gd1 = SkyCoord(0*u.degree, 0*u.degree, frame=gd1_frame)

# OR

origin_gd1 = SkyCoord(phi1=0*u.degree, 
                      phi2=0*u.degree, 
                      frame=gd1_frame)

# Note: because ICRS is built into Astropy, 
# we can identify it by string name
origin_gd1.transform_to('icrs')

# More formally, we could instantiate it
from astropy.coordinates import ICRS
icrs_frame = ICRS()
origin_gd1.transform_to(icrs_frame)

Notice that the origin of the GD-1 frame maps to ra=200, exactly, in ICRS. That’s by design.

Selecting a rectangle

Now we’ll use these coordinate transformations to define a rectangle in the GD-1 frame and transform it to ICRS.

The following variables define the boundaries of the rectangle in $\phi_1$ and $\phi_2$.

phi1_min = -55 * u.degree
phi1_max = -45 * u.degree
phi2_min = -8 * u.degree
phi2_max = 4 * u.degree

To create a rectangle, we’ll use the following function, which takes the lower and upper bounds as parameters.

def make_rectangle(x1, x2, y1, y2):
    """Return the corners of a rectangle."""
    xs = [x1, x1, x2, x2, x1]
    ys = [y1, y2, y2, y1, y1]
    return xs, ys

The return value is a tuple containing a list of coordinates in phi1 followed by a list of coordinates in phi2.

phi1_rect, phi2_rect = make_rectangle(
    phi1_min, phi1_max, phi2_min, phi2_max)

phi1_rect and phi2_rect contains the coordinates of the corners of a rectangle in the GD-1 frame.

In order to use them in a Gaia query, we have to convert them to ICRS. First we’ll put them into a SkyCoord object.

corners = SkyCoord(phi1=phi1_rect, phi2=phi2_rect, frame=gd1_frame)
corners
<SkyCoord (GD1Koposov10): (phi1, phi2) in deg
    [(-55., -8.), (-55.,  4.), (-45.,  4.), (-45., -8.), (-55., -8.)]>

Now we can use transform_to to convert to ICRS coordinates.

corners_icrs = corners.transform_to('icrs')
corners_icrs
<SkyCoord (ICRS): (ra, dec) in deg
    [(146.27533314, 19.26190982), (135.42163944, 25.87738723),
     (141.60264825, 34.3048303 ), (152.81671045, 27.13611254),
     (146.27533314, 19.26190982)]>

Notice that a rectangle in one coordinate system is not necessarily a rectangle in another. In this example, the result is a (non-rectangular) polygon.

Defining a polygon

In order to use this polygon as part of an ADQL query, we have to convert it to a string with a comma-separated list of coordinates, as in this example:

"""
POLYGON(143.65, 20.98, 
        134.46, 26.39, 
        140.58, 34.85, 
        150.16, 29.01)
"""

SkyCoord provides to_string, which produces a list of strings.

t = corners_icrs.to_string()
t
['146.275 19.2619',
 '135.422 25.8774',
 '141.603 34.3048',
 '152.817 27.1361',
 '146.275 19.2619']

We can use the Python string function join to join t into a single string (with spaces between the pairs):

s = ' '.join(t)
s
'146.275 19.2619 135.422 25.8774 141.603 34.3048 152.817 27.1361 146.275 19.2619'

That’s almost what we need, but we have to replace the spaces with commas.

s.replace(' ', ', ')
'146.275, 19.2619, 135.422, 25.8774, 141.603, 34.3048, 152.817, 27.1361, 146.275, 19.2619'

The following function combines these steps.

def skycoord_to_string(skycoord):
    """Convert SkyCoord to string."""
    t = skycoord.to_string()
    s = ' '.join(t)
    return s.replace(' ', ', ')

Here’s how we use it.

point_list = skycoord_to_string(corners_icrs)
point_list
'146.275, 19.2619, 135.422, 25.8774, 141.603, 34.3048, 152.817, 27.1361, 146.275, 19.2619'

Assembling the query

Now we’re ready to assemble the query. We need columns again (as we saw in the previous lesson).

columns = 'source_id, ra, dec, pmra, pmdec, parallax'

And here’s the query base we used in the previous lesson:

query3_base = """SELECT 
TOP 10 
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2
"""

Now we’ll add a WHERE clause to select stars in the polygon we defined.

query4_base = """SELECT
TOP 10
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON({point_list}))
"""

The query base contains format specifiers for columns and point_list.

We’ll use format to fill in these values.

query4 = query4_base.format(columns=columns, 
                          point_list=point_list)
print(query4)
SELECT
TOP 10
source_id, ra, dec, pmra, pmdec
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON(146.275, 19.2619, 135.422, 25.8774, 141.603, 34.3048, 152.817, 27.1361, 146.275, 19.2619))


As always, we should take a minute to proof-read the query before we launch it.

job = Gaia.launch_job_async(query4)
print(job)
INFO: Query finished. [astroquery.utils.tap.core]
<Table length=10>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
Jobid: 1615815873808O
Phase: COMPLETED
[Output truncated]

Here are the results.

results = job.get_results()
results
<Table length=10>
    source_id              ra         ...        pmdec       
                          deg         ...       mas / yr     
      int64             float64       ...       float64      
------------------ ------------------ ... -------------------
637987125186749568 142.48301935991023 ...   2.941813096629439
638285195917112960 142.25452941346344 ... -12.165984395577347
638073505568978688 142.64528557468074 ...  -7.950659620550862
638086386175786752 142.57739430926034 ...  -2.584105480335548
638049655615392384 142.58913564478618 ...  -4.941079187010136
638267565075964032 141.81762228999614 ...  1.8838892877285924
[Output truncated]

Finally, we can remove TOP 10 run the query again.

The result is bigger than our previous queries, so it will take a little longer.

query5_base = """SELECT
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON({point_list}))
"""
query5 = query5_base.format(columns=columns, 
                          point_list=point_list)
print(query5)
SELECT
source_id, ra, dec, pmra, pmdec
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON(146.275, 19.2619, 135.422, 25.8774, 141.603, 34.3048, 152.817, 27.1361, 146.275, 19.2619))
job = Gaia.launch_job_async(query5)
print(job)
INFO: Query finished. [astroquery.utils.tap.core]
<Table length=140339>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
Jobid: 1615815886707O
Phase: COMPLETED
[Output truncated]
results = job.get_results()
len(results)
140339

There are more than 100,000 stars in this polygon, but that’s a manageable size to work with.

Saving results

This is the set of stars we’ll work with in the next step. But since we have a substantial dataset now, this is a good time to save it.

Storing the data in a file means we can shut down this notebook and pick up where we left off without running the previous query again.

Astropy Table objects provide write, which writes the table to disk.

filename = 'gd1_results.fits'
results.write(filename, overwrite=True)

Because the filename ends with fits, the table is written in the FITS format, which preserves the metadata associated with the table.

If the file already exists, the overwrite argument causes it to be overwritten.

We can use getsize to confirm that the file exists and check the size:

from os.path import getsize

MB = 1024 * 1024
getsize(filename) / MB
5.36407470703125

Summary

In this notebook, we composed more complex queries to select stars within a polygonal region of the sky. Then we downloaded the results and saved them in a FITS file.

In the next notebook, we’ll reload the data from this file and replicate the next step in the analysis, using proper motion to identify stars likely to be in GD-1.

Key Points

  • For measurements with units, use Quantity objects that represent units explicitly and check for errors.

  • Use the format function to compose queries; it is often faster and less error-prone.

  • Develop queries incrementally: start with something simple, test it, and add a little bit at a time.

  • Once you have a query working, save the data in a local file. If you shut down the notebook and come back to it later, you can reload the file; you don’t have to run the query again.


Plotting and Pandas

Overview

Teaching: 65 min
Exercises: 20 min
Questions
  • How do we make scatter plots in Matplotlib? How do we store data in a Pandas DataFrame?

Objectives
  • Select rows and columns from an Astropy Table.

  • Use Matplotlib to make a scatter plot.

  • Use Gala to transform coordinates.

  • Make a Pandas DataFrame and use a Boolean Series to select rows.

  • Save a DataFrame in an HDF5 file.

3. Proper Motion

In the previous lesson, we wrote a query to select stars from the region of the sky where we expect GD-1 to be, and saved the results in a FITS file.

Now we’ll read that data back and implement the next step in the analysis, identifying stars with the proper motion we expect for GD-1.

Outline

  1. We’ll read back the results from the previous lesson, which we saved in a FITS file.

  2. Then we’ll transform the coordinates and proper motion data from ICRS back to the coordinate frame of GD-1.

  3. We’ll put those results into a Pandas DataFrame, which we’ll use to select stars near the centerline of GD-1.

  4. Plotting the proper motion of those stars, we’ll identify a region of proper motion for stars that are likely to be in GD-1.

  5. Finally, we’ll select and plot the stars whose proper motion is in that region.

Reload the data

In the previous lesson, we ran a query on the Gaia server and downloaded data for roughly 140,000 stars. We saved the data in a FITS file so that now, picking up where we left off, we can read the data from a local file rather than running the query again.

If you ran the previous lesson successfully, you should already have a file called gd1_results.fits that contains the data we downloaded.

If not, you can download the file or run the following cell.

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/AstronomicalData/raw/main/' +
         'data/gd1_results.fits')

Now here’s how we can read the data from the file back into an Astropy Table:

from astropy.table import Table

filename = 'gd1_results.fits'
results = Table.read(filename)

The result is an Astropy Table.

We can use info to refresh our memory of the contents.

results.info
<Table length=140339>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
 parallax float64      mas                                                           Parallax

Selecting rows and columns

In this section we’ll see operations for selecting columns and rows from an Astropy Table. You can find more information about these operations in the Astropy documentation.

We can get the names of the columns like this:

results.colnames
['source_id', 'ra', 'dec', 'pmra', 'pmdec', 'parallax']

And select an individual column like this:

results['ra']
<Column name='ra' dtype='float64' unit='deg' description='Right ascension' length=140339>
142.48301935991023
142.25452941346344
142.64528557468074
142.57739430926034
142.58913564478618
141.81762228999614
143.18339801317677
 142.9347319464589
142.26769745823267
142.89551292869012
[Output truncated]

The result is a Column object that contains the data, and also the data type, units, and name of the column.

type(results['ra'])
astropy.table.column.Column

The rows in the Table are numbered from 0 to n-1, where n is the number of rows. We can select the first row like this:

results[0]
<Row index=0>
    source_id              ra                dec                pmra              pmdec             parallax     
                          deg                deg              mas / yr           mas / yr             mas        
      int64             float64            float64            float64            float64            float64      
------------------ ------------------ ----------------- ------------------- ----------------- -------------------
637987125186749568 142.48301935991023 21.75771616932985 -2.5168384683875766 2.941813096629439 -0.2573448962333354

As you might have guessed, the result is a Row object.

type(results[0])
astropy.table.row.Row

Notice that the bracket operator selects both columns and rows. You might wonder how it knows which to select. If the expression in brackets is a string, it selects a column; if the expression is an integer, it selects a row.

If you apply the bracket operator twice, you can select a column and then an element from the column.

results['ra'][0]
142.48301935991023

Or you can select a row and then an element from the row.

results[0]['ra']
142.48301935991023

You get the same result either way.

Scatter plot

To see what the results look like, we’ll use a scatter plot. The library we’ll use is Matplotlib, which is the most widely-used plotting library for Python. The Matplotlib interface is based on MATLAB (hence the name), so if you know MATLAB, some of it will be familiar.

We’ll import like this.

import matplotlib.pyplot as plt

Pyplot is part of the Matplotlib library. It is conventional to import it using the shortened name plt.

Warning

In recent versions of Jupyter, plots appear “inline”; that is, they are part of the notebook. In some older versions, plots appear in a new window. In that case, you might want to run the following Jupyter magic command in a notebook cell:

%matplotlib inline

Pyplot provides two functions that can make scatterplots, plt.scatter and plt.plot.

Jake Vanderplas explains these differences in The Python Data Science Handbook.

Since we are plotting more than 100,000 points and they are all the same size and color, we’ll use plot.

Here’s a scatter plot with right ascension on the x-axis and declination on the y-axis, both ICRS coordinates in degrees.

x = results['ra']
y = results['dec']
plt.plot(x, y, 'ko')

plt.xlabel('ra (degree ICRS)')
plt.ylabel('dec (degree ICRS)');
<Figure size 432x288 with 1 Axes>

png

The arguments to plt.plot are x, y, and a string that specifies the style. In this case, the letters ko indicate that we want a black, round marker (k is for black because b is for blue). The functions xlabel and ylabel put labels on the axes.

Looking at this plot, we can see that the region we selected, which is a rectangle in GD-1 coordinates, is a non-rectanglar region in ICRS coordinates.

However, this scatter plot has a problem. It is “overplotted”, which means that there are so many overlapping points, we can’t distinguish between high and low density areas.

To fix this, we can provide optional arguments to control the size and transparency of the points.

Exercise (5 minutes)

In the call to plt.plot, use the keyword argument markersize to make the markers smaller.

Then add the keyword argument alpha to make the markers partly transparent.

Adjust these arguments until you think the figure shows the data most clearly.

Note: Once you have made these changes, you might notice that the figure shows stripes with lower density of stars. These stripes are caused by the way Gaia scans the sky, which you can read about here. The dataset we are using, Gaia Data Release 2, covers 22 months of observations; during this time, some parts of the sky were scanned more than others.

Solution

x = results['ra']
y = results['dec']
plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)

plt.xlabel('ra (degree ICRS)')
plt.ylabel('dec (degree ICRS)');

Transform back

Remember that we selected data from a rectangle of coordinates in the GD1Koposov10 frame, then transformed them to ICRS when we constructed the query. The coordinates in results are in ICRS.

To plot them, we will transform them back to the GD1Koposov10 frame; that way, the axes of the figure are aligned with the orbit of GD-1, which is useful for two reasons:

To do the transformation, we’ll put the results into a SkyCoord object. In a previous lesson we created a SkyCoord object like this:

from astropy.coordinates import SkyCoord

skycoord = SkyCoord(ra=results['ra'], dec=results['dec'])

Now we’re going to do something similar, but in addition to ra and dec, we’ll also include:

import astropy.units as u

distance = 8 * u.kpc
radial_velocity= 0 * u.km/u.s

skycoord = SkyCoord(ra=results['ra'], 
                    dec=results['dec'],
                    pm_ra_cosdec=results['pmra'],
                    pm_dec=results['pmdec'], 
                    distance=distance, 
                    radial_velocity=radial_velocity)

For the first four arguments, we use columns from results.

For distance and radial_velocity we use constants, which we explain below.

The result is an Astropy SkyCoord object, which we can transform to the GD-1 frame.

from gala.coordinates import GD1Koposov10

gd1_frame = GD1Koposov10()
transformed = skycoord.transform_to(gd1_frame)

The result is another SkyCoord object, now in the GD1Koposov10 frame.

Reflex Correction

The next step is to correct the proper motion measurements for the effect of the motion of our solar system around the Galactic center.

When we created skycoord, we provided constant values for distance and radial_velocity rather than measurements from Gaia.

That might seem like a strange thing to do, but here’s the motivation:

With this preparation, we can use reflex_correct from Gala (documentation here) to correct for the motion of the solar system.

from gala.coordinates import reflex_correct

skycoord_gd1 = reflex_correct(transformed)

The result is a SkyCoord object that contains

We can select the coordinates and plot them like this:

x = skycoord_gd1.phi1
y = skycoord_gd1.phi2
plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)

plt.xlabel('phi1 (degree GD1)')
plt.ylabel('phi2 (degree GD1)');
<Figure size 432x288 with 1 Axes>

png

Remember that we started with a rectangle in the GD-1 frame. When transformed to the ICRS frame, it’s a non-rectangular region. Now, transformed back to the GD-1 frame, it’s a rectangle again.

Pandas DataFrame

At this point we have two objects containing different subsets of the data. results is the Astropy Table we downloaded from Gaia.

type(results)
astropy.table.table.Table

And skycoord_gd1 is a SkyCoord object that contains the transformed coordinates and proper motions.

type(skycoord_gd1)
astropy.coordinates.sky_coordinate.SkyCoord

On one hand, this division of labor makes sense because each object provides different capabilities. But working with multiple object types can be awkward.

It will be more convenient to choose one object and get all of the data into it. We’ll choose a Pandas DataFrame, for two reasons:

  1. It provides capabilities that (almost) a superset of the other data structures, so it’s the all-in-one solution.

  2. Pandas is a general-purpose tool that is useful in many domains, especially data science. If you are going to develop expertise in one tool, Pandas is a good choice.

However, compared to an Astropy Table, Pandas has one big drawback: it does not keep the metadata associated with the table, including the units for the columns.

It’s easy to convert a Table to a Pandas DataFrame.

import pandas as pd

results_df = results.to_pandas()

DataFrame provides shape, which shows the number of rows and columns.

results_df.shape
(140339, 6)

It also provides head, which displays the first few rows. head is useful for spot-checking large results as you go along.

results_df.head()
            source_id          ra        dec       pmra      pmdec  parallax
0  637987125186749568  142.483019  21.757716  -2.516838   2.941813 -0.257345
1  638285195917112960  142.254529  22.476168   2.662702 -12.165984  0.422728
2  638073505568978688  142.645286  22.166932  18.306747  -7.950660  0.103640
3  638086386175786752  142.577394  22.227920   0.987786  -2.584105 -0.857327
4  638049655615392384  142.589136  22.110783   0.244439  -4.941079  0.099625

Python detail: shape is an attribute, so we display its value without calling it as a function; head is a function, so we need the parentheses.

Now we can extract the columns we want from skycoord_gd1 and add them as columns in the DataFrame. phi1 and phi2 contain the transformed coordinates.

results_df['phi1'] = skycoord_gd1.phi1
results_df['phi2'] = skycoord_gd1.phi2
results_df.shape
(140339, 8)

pm_phi1_cosphi2 and pm_phi2 contain the components of proper motion in the transformed frame.

results_df['pm_phi1'] = skycoord_gd1.pm_phi1_cosphi2
results_df['pm_phi2'] = skycoord_gd1.pm_phi2
results_df.shape
(140339, 10)

Detail: If you notice that SkyCoord has an attribute called proper_motion, you might wonder why we are not using it.

We could have: proper_motion contains the same data as pm_phi1_cosphi2 and pm_phi2, but in a different format.

Exploring data

One benefit of using Pandas is that it provides functions for exploring the data and checking for problems. One of the most useful of these functions is describe, which computes summary statistics for each column.

results_df.describe()
          source_id             ra            dec           pmra  \
count  1.403390e+05  140339.000000  140339.000000  140339.000000   
mean   6.792399e+17     143.823122      26.780285      -2.484404   
std    3.792177e+16       3.697850       3.052592       5.913939   
min    6.214900e+17     135.425699      19.286617    -106.755260   
25%    6.443517e+17     140.967966      24.592490      -5.038789   
50%    6.888060e+17     143.734409      26.746261      -1.834943   
75%    6.976579e+17     146.607350      28.990500       0.452893   
max    7.974418e+17     152.777393      34.285481     104.319923   

               pmdec       parallax           phi1           phi2  \
[Output truncated]

Exercise (10 minutes)

Review the summary statistics in this table.

  • Do the values make sense based on what you know about the context?

  • Do you see any values that seem problematic, or evidence of other data issues?

Solution

The most noticeable issue is that some of the parallax values are negative, which is non-physical.

The reason is that parallax measurements are less accurate for stars that are far away.

Fortunately, we don’t use the parallax measurements in the analysis (one of the reasons we used constant distance for reflex correction).

Plot proper motion

Now we are ready to replicate one of the panels in Figure 1 of the Price-Whelan and Bonaca paper, the one that shows components of proper motion as a scatter plot:

In this figure, the shaded area identifies stars that are likely to be in GD-1 because:

By plotting proper motion in the GD-1 frame, we hope to find this cluster. Then we will use the bounds of the cluster to select stars that are more likely to be in GD-1.

The following figure is a scatter plot of proper motion, in the GD-1 frame, for the stars in results_df.

x = results_df['pm_phi1']
y = results_df['pm_phi2']
plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)
    
plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')
plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)');
<Figure size 432x288 with 1 Axes>

png

Most of the proper motions are near the origin, but there are a few extreme values. Following the example in the paper, we’ll use xlim and ylim to zoom in on the region near the origin.

x = results_df['pm_phi1']
y = results_df['pm_phi2']
plt.plot(x, y, 'ko', markersize=0.1, alpha=0.1)
    
plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')
plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)')

plt.xlim(-12, 8)
plt.ylim(-10, 10);
<Figure size 432x288 with 1 Axes>

png

There is a hint of an overdense region near (-7.5, 0), but if you didn’t know where to look, you would miss it.

To see the cluster more clearly, we need a sample that contains a higher proportion of stars in GD-1. We’ll do that by selecting stars close to the centerline.

Selecting the centerline

As we can see in the following figure, many stars in GD-1 are less than 1 degree from the line phi2=0.

Stars near this line have the highest probability of being in GD-1.

To select them, we will use a “Boolean mask”. We’ll start by selecting the phi2 column from the DataFrame:

phi2 = results_df['phi2']
type(phi2)
pandas.core.series.Series

The result is a Series, which is the structure Pandas uses to represent columns.

We can use a comparison operator, >, to compare the values in a Series to a constant.

phi2_min = -1.0 * u.degree
phi2_max = 1.0 * u.degree

mask = (phi2 > phi2_min)
type(mask)
pandas.core.series.Series

The result is a Series of Boolean values, that is, True and False.

mask.head()
0    False
1    False
2    False
3    False
4    False
Name: phi2, dtype: bool

To select values that fall between phi2_min and phi2_max, we’ll use the & operator, which computes “logical AND”. The result is true where elements from both Boolean Series are true.

mask = (phi2 > phi2_min) & (phi2 < phi2_max)

Logical operators

Python’s logical operators (and, or, and not) don’t work with NumPy or Pandas. Both libraries use the bitwise operators (&, |, and ~) to do elementwise logical operations (explanation here).

Also, we need the parentheses around the conditions; otherwise the order of operations is incorrect.

The sum of a Boolean Series is the number of True values, so we can use sum to see how many stars are in the selected region.

mask.sum()
25084

A Boolean Series is sometimes called a “mask” because we can use it to mask out some of the rows in a DataFrame and select the rest, like this:

centerline_df = results_df[mask]
type(centerline_df)
pandas.core.frame.DataFrame

centerline_df is a DataFrame that contains only the rows from results_df that correspond to True values in mask. So it contains the stars near the centerline of GD-1.

We can use len to see how many rows are in centerline_df:

len(centerline_df)
25084

And what fraction of the rows we’ve selected.

len(centerline_df) / len(results_df)
0.1787386257562046

There are about 25,000 stars in this region, about 18% of the total.

Plotting proper motion

Since we’ve plotted proper motion several times, let’s put that code in a function.

def plot_proper_motion(df):
    """Plot proper motion.
    
    df: DataFrame with `pm_phi1` and `pm_phi2`
    """
    x = df['pm_phi1']
    y = df['pm_phi2']
    plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

    plt.xlabel('Proper motion phi1 (mas/yr GD1 frame)')
    plt.ylabel('Proper motion phi2 (mas/yr GD1 frame)')

    plt.xlim(-12, 8)
    plt.ylim(-10, 10)

And we can call it like this:

plot_proper_motion(centerline_df)
<Figure size 432x288 with 1 Axes>

png

Now we can see more clearly that there is a cluster near (-7.5, 0).

You might notice that our figure is less dense than the one in the paper. That’s because we started with a set of stars from a relatively small region. The figure in the paper is based on a region about 10 times bigger.

In the next lesson we’ll go back and select stars from a larger region. But first we’ll use the proper motion data to identify stars likely to be in GD-1.

Filtering based on proper motion

The next step is to select stars in the “overdense” region of proper motion, which are candidates to be in GD-1.

In the original paper, Price-Whelan and Bonaca used a polygon to cover this region, as shown in this figure.

We’ll use a simple rectangle for now, but in a later lesson we’ll see how to select a polygonal region as well.

Here are bounds on proper motion we chose by eye:

pm1_min = -8.9
pm1_max = -6.9
pm2_min = -2.2
pm2_max =  1.0

To draw these bounds, we’ll use make_rectangle to make two lists containing the coordinates of the corners of the rectangle.

def make_rectangle(x1, x2, y1, y2):
    """Return the corners of a rectangle."""
    xs = [x1, x1, x2, x2, x1]
    ys = [y1, y2, y2, y1, y1]
    return xs, ys
pm1_rect, pm2_rect = make_rectangle(
    pm1_min, pm1_max, pm2_min, pm2_max)

Here’s what the plot looks like with the bounds we chose.

plot_proper_motion(centerline_df)
plt.plot(pm1_rect, pm2_rect, '-');
<Figure size 432x288 with 1 Axes>

png

Now that we’ve identified the bounds of the cluster in proper motion, we’ll use it to select rows from results_df.

We’ll use the following function, which uses Pandas operators to make a mask that selects rows where series falls between low and high.

def between(series, low, high):
    """Check whether values are between `low` and `high`."""
    return (series > low) & (series < high)

The following mask selects stars with proper motion in the region we chose.

pm1 = results_df['pm_phi1']
pm2 = results_df['pm_phi2']

pm_mask = (between(pm1, pm1_min, pm1_max) & 
           between(pm2, pm2_min, pm2_max))

Again, the sum of a Boolean series is the number of True values.

pm_mask.sum()
1049

Now we can use this mask to select rows from results_df.

selected_df = results_df[pm_mask]
len(selected_df)
1049

These are the stars we think are likely to be in GD-1. Let’s see what they look like, plotting their coordinates (not their proper motion).

x = selected_df['phi1']
y = selected_df['phi2']
plt.plot(x, y, 'ko', markersize=1, alpha=1)

plt.xlabel('phi1 (degree GD1)')
plt.ylabel('phi2 (degree GD1)');
<Figure size 432x288 with 1 Axes>

png

Now that’s starting to look like a tidal stream!

Saving the DataFrame

At this point we have run a successful query and cleaned up the results; this is a good time to save the data.

To save a Pandas DataFrame, one option is to convert it to an Astropy Table, like this:

selected_table = Table.from_pandas(selected_df)
type(selected_table)
astropy.table.table.Table

Then we could write the Table to a FITS file, as we did in the previous lesson.

But Pandas provides functions to write DataFrames in other formats; to see what they are find the functions here that begin with to_.

One of the best options is HDF5, which is Version 5 of Hierarchical Data Format.

HDF5 is a binary format, so files are small and fast to read and write (like FITS, but unlike XML).

An HDF5 file is similar to an SQL database in the sense that it can contain more than one table, although in HDF5 vocabulary, a table is called a Dataset. (Multi-extension FITS files can also contain more than one table.)

And HDF5 stores the metadata associated with the table, including column names, row labels, and data types (like FITS).

Finally, HDF5 is a cross-language standard, so if you write an HDF5 file with Pandas, you can read it back with many other software tools (more than FITS).

We can write a Pandas DataFrame to an HDF5 file like this:

filename = 'gd1_data.hdf'

selected_df.to_hdf(filename, 'selected_df', mode='w')

Because an HDF5 file can contain more than one Dataset, we have to provide a name, or “key”, that identifies the Dataset in the file.

We could use any string as the key, but it will be convenient to give the Dataset in the file the same name as the DataFrame.

The argument mode='w' means that if the file already exists, we should overwrite it.

Exercise (5 minutes)

We’re going to need centerline_df later as well. Write a line of code to add it as a second Dataset in the HDF5 file.

Hint: Since the file already exists, you should not use mode='w'.

Solution

centerline_df.to_hdf(filename, 'centerline_df')

We can use getsize to confirm that the file exists and check the size:

from os.path import getsize

MB = 1024 * 1024
getsize(filename) / MB
2.2084197998046875

If you forget what the names of the Datasets in the file are, you can read them back like this:

with pd.HDFStore(filename) as hdf:
    print(hdf.keys())
['/centerline_df', '/selected_df']

Context Managers

We use a with statement here to open the file before the print statement and (automatically) close it after. Read more about context managers.

The keys are the names of the Datasets. Notice that they start with /, which indicates that they are at the top level of the Dataset hierarchy, and not in a named “group”.

In future lessons we will add a few more Datasets to this file, but not so many that we need to organize them into groups.

Summary

In this lesson, we re-loaded the Gaia data we saved from a previous query.

We transformed the coordinates and proper motion from ICRS to a frame aligned with the orbit of GD-1, and stored the results in a Pandas DataFrame.

Then we replicated the selection process from the Price-Whelan and Bonaca paper:

So far, we have used data from a relatively small region of the sky. In the next lesson, we’ll write a query that selects stars based on proper motion, which will allow us to explore a larger region.

Key Points

  • When you make a scatter plot, adjust the size of the markers and their transparency so the figure is not overplotted; otherwise it can misrepresent the data badly.

  • For simple scatter plots in Matplotlib, plot is faster than scatter.

  • An Astropy Table and a Pandas DataFrame are similar in many ways and they provide many of the same functions. They have pros and cons, but for many projects, either one would be a reasonable choice.

  • To store data from a Pandas DataFrame, a good option is an HDF file, which can contain multiple Datasets.


Transform and Select

Overview

Teaching: 60 min
Exercises: 15 min
Questions
  • How do we move the computation to the data?

Objectives
  • Transform proper motions from one frame to another.

  • Compute the convex hull of a set of points.

  • Write an ADQL query that selects based on proper motion.

  • Save data in CSV format.

4. Transformation and Selection

In the previous lesson, we identified stars with the proper motion we expect for GD-1.

Now we’ll do the same selection in an ADQL query, which will make it possible to work with a larger region of the sky and still download less data.

Outline

  1. Using data from the previous lesson, we’ll identify the values of proper motion for stars likely to be in GD-1.

  2. Then we’ll compose an ADQL query that selects stars based on proper motion, so we can download only the data we need.

That will make it possible to search a bigger region of the sky in a single query. We’ll also see how to write the results to a CSV file.

Reload the data

You can download the data from the previous lesson or run the following cell, which downloads it if necessary.

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/AstronomicalData/raw/main/' +
         'data/gd1_data.hdf')

Now we can reload centerline_df and selected_df.

import pandas as pd

filename = 'gd1_data.hdf'
centerline_df = pd.read_hdf(filename, 'centerline_df')
selected_df = pd.read_hdf(filename, 'selected_df')

Selection by proper motion

Let’s review how we got to this point.

  1. We made an ADQL query to the Gaia server to get data for stars in the vicinity of GD-1.

  2. We transformed the coordinates to the GD1Koposov10 frame so we could select stars along the centerline of GD-1.

  3. We plotted the proper motion of the centerline stars to identify the bounds of the overdense region.

  4. We made a mask that selects stars whose proper motion is in the overdense region.

At this point we have downloaded data for a relatively large number of stars (more than 100,000) and selected a relatively small number (around 1000).

It would be more efficient to use ADQL to select only the stars we need. That would also make it possible to download data covering a larger region of the sky.

However, the selection we did was based on proper motion in the GD1Koposov10 frame. In order to do the same selection in ADQL, we have to work with proper motions in ICRS.

As a reminder, here’s the rectangle we selected based on proper motion in the GD1Koposov10 frame.

pm1_min = -8.9
pm1_max = -6.9
pm2_min = -2.2
pm2_max =  1.0
def make_rectangle(x1, x2, y1, y2):
    """Return the corners of a rectangle."""
    xs = [x1, x1, x2, x2, x1]
    ys = [y1, y2, y2, y1, y1]
    return xs, ys
pm1_rect, pm2_rect = make_rectangle(
    pm1_min, pm1_max, pm2_min, pm2_max)

Since we’ll need to plot proper motion several times, we’ll use the following function.

import matplotlib.pyplot as plt

def plot_proper_motion(df):
    """Plot proper motion.
    
    df: DataFrame with `pm_phi1` and `pm_phi2`
    """
    x = df['pm_phi1']
    y = df['pm_phi2']
    plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

    plt.xlabel('Proper motion phi1 (GD1 frame)')
    plt.ylabel('Proper motion phi2 (GD1 frame)')

    plt.xlim(-12, 8)
    plt.ylim(-10, 10)

The following figure shows:

plot_proper_motion(centerline_df)

plt.plot(pm1_rect, pm2_rect)

x = selected_df['pm_phi1']
y = selected_df['pm_phi2']
plt.plot(x, y, 'gx', markersize=0.3, alpha=0.3);
<Figure size 432x288 with 1 Axes>

png

Now we’ll make the same plot using proper motions in the ICRS frame, which are stored in columns pmra and pmdec.

x = centerline_df['pmra']
y = centerline_df['pmdec']
plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

x = selected_df['pmra']
y = selected_df['pmdec']
plt.plot(x, y, 'gx', markersize=1, alpha=0.3)
    
plt.xlabel('Proper motion ra (ICRS frame)')
plt.ylabel('Proper motion dec (ICRS frame)')

plt.xlim([-10, 5])
plt.ylim([-20, 5]);
<Figure size 432x288 with 1 Axes>

png

The proper motions of the selected stars are more spread out in this frame, which is why it was preferable to do the selection in the GD-1 frame.

But now we can define a polygon that encloses the proper motions of these stars in ICRS, and use that polygon as a selection criterion in an ADQL query.

Convex Hull

SciPy provides a function that computes the convex hull of a set of points, which is the smallest convex polygon that contains all of the points.

To use it, we’ll select columns pmra and pmdec and convert them to a NumPy array.

import numpy as np

points = selected_df[['pmra','pmdec']].to_numpy()
points.shape
(1049, 2)

Note

If you are using an older version of Pandas, you might not have to_numpy(); you can use values instead, like this:

points = selected_df[['pmra','pmdec']].values

We’ll pass the points to ConvexHull, which returns an object that contains the results.

from scipy.spatial import ConvexHull

hull = ConvexHull(points)
hull
<scipy.spatial.qhull.ConvexHull at 0x7ff6207866a0>

hull.vertices contains the indices of the points that fall on the perimeter of the hull.

hull.vertices
array([ 692,  873,  141,  303,   42,  622,   45,   83,  127,  182, 1006,
        971,  967, 1001,  969,  940], dtype=int32)

We can use them as an index into the original array to select the corresponding rows.

pm_vertices = points[hull.vertices]
pm_vertices
array([[ -4.05037121, -14.75623261],
       [ -3.41981085, -14.72365546],
       [ -3.03521988, -14.44357135],
       [ -2.26847919, -13.7140236 ],
       [ -2.61172203, -13.24797471],
       [ -2.73471401, -13.09054471],
       [ -3.19923146, -12.5942653 ],
       [ -3.34082546, -12.47611926],
       [ -5.67489413, -11.16083338],
       [ -5.95159272, -11.10547884],
       [ -6.42394023, -11.05981295],
[Output truncated]

To plot the resulting polygon, we have to pull out the x and y coordinates.

pmra_poly, pmdec_poly = np.transpose(pm_vertices)

This use of transpose is a useful NumPy idiom. Because pm_vertices has two columns, its matrix transpose has two rows, which are assigned to the two variables pmra_poly and pmdec_poly.

The following figure shows proper motion in ICRS again, along with the convex hull we just computed.

x = centerline_df['pmra']
y = centerline_df['pmdec']
plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

x = selected_df['pmra']
y = selected_df['pmdec']
plt.plot(x, y, 'gx', markersize=0.3, alpha=0.3)

plt.plot(pmra_poly, pmdec_poly)
    
plt.xlabel('Proper motion phi1 (ICRS frame)')
plt.ylabel('Proper motion phi2 (ICRS frame)')

plt.xlim([-10, 5])
plt.ylim([-20, 5]);
<Figure size 432x288 with 1 Axes>

png

So pm_vertices represents the polygon we want to select. The next step is to use it as part of an ADQL query.

Assembling the query

In Lesson 2 we used the following query to select stars in a polygonal region.

query5_base = """SELECT
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON({point_list}))
"""

In this lesson we’ll make two changes:

  1. We’ll select stars with coordinates in a larger region.

  2. We’ll add another clause to select stars whose proper motion is in the polygon we just computed, pm_vertices.

Here are the coordinates of the larger rectangle in the GD-1 frame.

import astropy.units as u

phi1_min = -70 * u.degree
phi1_max = -20 * u.degree
phi2_min = -5 * u.degree
phi2_max = 5 * u.degree

We selected these bounds by trial and error, defining the largest region we can process in a single query.

phi1_rect, phi2_rect = make_rectangle(
    phi1_min, phi1_max, phi2_min, phi2_max)

Here’s how we transform it to ICRS, as we saw in Lesson 2.

from gala.coordinates import GD1Koposov10
from astropy.coordinates import SkyCoord

gd1_frame = GD1Koposov10()
corners = SkyCoord(phi1=phi1_rect, 
                   phi2=phi2_rect, 
                   frame=gd1_frame)

corners_icrs = corners.transform_to('icrs')

To use corners_icrs as part of an ADQL query, we have to convert it to a string. Here’s the function from Lesson 2 we used to do that.

def skycoord_to_string(skycoord):
    """Convert SkyCoord to string."""
    t = skycoord.to_string()
    s = ' '.join(t)
    return s.replace(' ', ', ')
point_list = skycoord_to_string(corners_icrs)
point_list
'135.306, 8.39862, 126.51, 13.4449, 163.017, 54.2424, 172.933, 46.4726, 135.306, 8.39862'

Here are the columns we want to select.

columns = 'source_id, ra, dec, pmra, pmdec'

Now we have everything we need to assemble the query.

query5 = query5_base.format(columns=columns, 
                            point_list=point_list)
print(query5)
SELECT
source_id, ra, dec, pmra, pmdec
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON(135.306, 8.39862, 126.51, 13.4449, 163.017, 54.2424, 172.933, 46.4726, 135.306, 8.39862))

But don’t try to run that query. Because it selects a larger region, there are too many stars to handle in a single query. Until we select by proper motion, that is.

Selecting proper motion

Now we’re ready to add a WHERE clause to select stars whose proper motion falls in the polygon defined by pm_vertices.

To use pm_vertices as part of an ADQL query, we have to convert it to a string. Using flatten and array2string, we can almost get the format we need.

s = np.array2string(pm_vertices.flatten(), 
                    max_line_width=1000,
                    separator=',')
s
'[ -4.05037121,-14.75623261, -3.41981085,-14.72365546, -3.03521988,-14.44357135, -2.26847919,-13.7140236 , -2.61172203,-13.24797471, -2.73471401,-13.09054471, -3.19923146,-12.5942653 , -3.34082546,-12.47611926, -5.67489413,-11.16083338, -5.95159272,-11.10547884, -6.42394023,-11.05981295, -7.09631023,-11.95187806, -7.30641519,-12.24559977, -7.04016696,-12.88580702, -6.00347705,-13.75912098, -4.42442296,-14.74641176]'

We just have to remove the brackets.

pm_point_list = s.strip('[]')
pm_point_list
' -4.05037121,-14.75623261, -3.41981085,-14.72365546, -3.03521988,-14.44357135, -2.26847919,-13.7140236 , -2.61172203,-13.24797471, -2.73471401,-13.09054471, -3.19923146,-12.5942653 , -3.34082546,-12.47611926, -5.67489413,-11.16083338, -5.95159272,-11.10547884, -6.42394023,-11.05981295, -7.09631023,-11.95187806, -7.30641519,-12.24559977, -7.04016696,-12.88580702, -6.00347705,-13.75912098, -4.42442296,-14.74641176'

Exercise (10 minutes)

Define query6_base, starting with query5_base and adding a new clause to select stars whose coordinates of proper motion, pmra and pmdec, fall within the polygon defined by pm_point_list.

Solution

query6_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON({point_list}))
  AND 1 = CONTAINS(POINT(pmra, pmdec),
                   POLYGON({pm_point_list}))
"""

Exercise (5 minutes)

Use format to format query6_base and define query6, filling in the values of columns, point_list, and pm_point_list.

Solution

query6 = query6_base.format(columns=columns, 
                            point_list=point_list,
                            pm_point_list=pm_point_list)
print(query6)

Now we can run the query like this:

from astroquery.gaia import Gaia

job = Gaia.launch_job_async(query6)
print(job)
INFO: Query finished. [astroquery.utils.tap.core]
<Table length=7345>
   name    dtype    unit                              description                            
--------- ------- -------- ------------------------------------------------------------------
source_id   int64          Unique source identifier (unique within a particular Data Release)
       ra float64      deg                                                    Right ascension
      dec float64      deg                                                        Declination
     pmra float64 mas / yr                         Proper motion in right ascension direction
    pmdec float64 mas / yr                             Proper motion in declination direction
Jobid: 1616771462206O
Phase: COMPLETED
[Output truncated]

And get the results.

candidate_table = job.get_results()
len(candidate_table)
7345

We call the results candidate_table because it contains stars that are good candidates for GD-1.

For the next lesson, we’ll need point_list and pm_point_list again, so we should save them in a file. There are several ways we could do that, but since we are already storing data in an HDF file, let’s do the same with these variables.

We’ve seen how to save a DataFrame in an HDF file. We can do the same thing with a Pandas Series. To make one, we’ll start with a dictionary:

d = dict(point_list=point_list, pm_point_list=pm_point_list)
d
{'point_list': '135.306, 8.39862, 126.51, 13.4449, 163.017, 54.2424, 172.933, 46.4726, 135.306, 8.39862',
 'pm_point_list': ' -4.05037121,-14.75623261, -3.41981085,-14.72365546, -3.03521988,-14.44357135, -2.26847919,-13.7140236 , -2.61172203,-13.24797471, -2.73471401,-13.09054471, -3.19923146,-12.5942653 , -3.34082546,-12.47611926, -5.67489413,-11.16083338, -5.95159272,-11.10547884, -6.42394023,-11.05981295, -7.09631023,-11.95187806, -7.30641519,-12.24559977, -7.04016696,-12.88580702, -6.00347705,-13.75912098, -4.42442296,-14.74641176'}

And use it to initialize a Series.

point_series = pd.Series(d)
point_series
point_list       135.306, 8.39862, 126.51, 13.4449, 163.017, 54...
pm_point_list     -4.05037121,-14.75623261, -3.41981085,-14.723...
dtype: object

Now we can save it in the usual way.

filename = 'gd1_data.hdf'
point_series.to_hdf(filename, 'point_series')

Plotting one more time

Let’s see what the results look like.

x = candidate_table['ra']
y = candidate_table['dec']
plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

plt.xlabel('ra (degree ICRS)')
plt.ylabel('dec (degree ICRS)');
<Figure size 432x288 with 1 Axes>

png

Here we can see why it was useful to transform these coordinates. In ICRS, it is more difficult to identity the stars near the centerline of GD-1.

So let’s transform the results back to the GD-1 frame. Here’s the code we used to transform the coordinates and make a Pandas DataFrame, wrapped in a function.

from gala.coordinates import reflex_correct

def make_dataframe(table):
    """Transform coordinates from ICRS to GD-1 frame.
    
    table: Astropy Table
    
    returns: Pandas DataFrame
    """
    skycoord = SkyCoord(
               ra=table['ra'], 
               dec=table['dec'],
               pm_ra_cosdec=table['pmra'],
               pm_dec=table['pmdec'], 
               distance=8*u.kpc, 
               radial_velocity=0*u.km/u.s)

    gd1_frame = GD1Koposov10()
    transformed = skycoord.transform_to(gd1_frame)
    skycoord_gd1 = reflex_correct(transformed)

    df = table.to_pandas()
    df['phi1'] = skycoord_gd1.phi1
    df['phi2'] = skycoord_gd1.phi2
    df['pm_phi1'] = skycoord_gd1.pm_phi1_cosphi2
    df['pm_phi2'] = skycoord_gd1.pm_phi2
    return df

Here’s how we use it:

candidate_df = make_dataframe(candidate_table)

And let’s see the results.

x = candidate_df['phi1']
y = candidate_df['phi2']
plt.plot(x, y, 'ko', markersize=0.5, alpha=0.5)

plt.xlabel('phi1 (degree GD1)')
plt.ylabel('phi2 (degree GD1)');
<Figure size 432x288 with 1 Axes>

png

We’re starting to see GD-1 more clearly. We can compare this figure with this panel from Figure 1 from the original paper:

This panel shows stars selected based on proper motion only, so it is comparable to our figure (although notice that it covers a wider region).

In the next lesson, we will use photometry data from Pan-STARRS to do a second round of filtering, and see if we can replicate this panel.

Later we’ll see how to add annotations like the ones in the figure and customize the style of the figure to present the results clearly and compellingly.

Summary

In the previous lesson we downloaded data for a large number of stars and then selected a small fraction of them based on proper motion.

In this lesson, we improved this process by writing a more complex query that uses the database to select stars based on proper motion. This process requires more computation on the Gaia server, but then we’re able to either:

  1. Search the same region and download less data, or

  2. Search a larger region while still downloading a manageable amount of data.

In the next lesson, we’ll learn about the database JOIN operation and use it to download photometry data from Pan-STARRS.

Key Points

  • When possible, ‘move the computation to the data’; that is, do as much of the work as possible on the database server before downloading the data.

  • For most applications, saving data in FITS or HDF5 is better than CSV. FITS and HDF5 are binary formats, so the files are usually smaller, and they store metadata, so you don’t lose anything when you read the file back.

  • On the other hand, CSV is a ‘least common denominator’ format; that is, it can be read by practically any application that works with data.


Join

Overview

Teaching: 50 min
Exercises: 35 min
Questions
  • How do we use JOIN to combine information from multiple tables?

Objectives
  • Write ADQL queries involving JOIN operations.

5. Joining Tables

This is the fifth in a series of notebooks related to astronomy data.

As a continuing example, we will replicate part of the analysis in a recent paper, “Off the beaten path: Gaia reveals GD-1 stars outside of the main stream” by Adrian M. Price-Whelan and Ana Bonaca.

Picking up where we left off, the next step in the analysis is to select candidate stars based on photometry data. The following figure from the paper is a color-magnitude diagram for the stars selected based on proper motion:

In red is a stellar isochrone, showing where we expect the stars in GD-1 to fall based on the metallicity and age of their original globular cluster.

By selecting stars in the shaded area, we can further distinguish the main sequence of GD-1 from younger background stars.

Outline

  1. We’ll reload the candidate stars we identified in the previous notebook.

  2. Then we’ll run a query on the Gaia server that uploads the table of candidates and uses a JOIN operation to select photometry data for the candidate stars.

  3. We’ll write the results to a file for use in the next notebook.

Getting photometry data

The Gaia dataset contains some photometry data, including the variable bp_rp, which contains BP-RP color (the difference in mean flux between the BP and RP bands). We use this variable to select stars with bp_rp between -0.75 and 2, which excludes many class M dwarf stars.

Now, to select stars with the age and metal richness we expect in GD-1, we will use g-i color and apparent g-band magnitude, which are available from the Pan-STARRS survey.

Conveniently, the Gaia server provides data from Pan-STARRS as a table in the same database we have been using, so we can access it by making ADQL queries.

In general, choosing a star from the Gaia catalog and finding the corresponding star in the Pan-STARRS catalog is not easy. This kind of cross matching is not always possible, because a star might appear in one catalog and not the other. And even when both stars are present, there might not be a clear one-to-one relationship between stars in the two catalogs.

Fortunately, smart people have worked on this problem, and the Gaia database includes cross-matching tables that suggest a best neighbor in the Pan-STARRS catalog for many stars in the Gaia catalog.

This document describes the cross matching process. Briefly, it uses a cone search to find possible matches in approximately the right position, then uses attributes like color and magnitude to choose pairs of observations most likely to be the same star.

The best neighbor table

So the hard part of cross-matching has been done for us. Using the results is a little tricky, but it gives us a chance to learn about one of the most important tools for working with databases: “joining” tables.

In general, a “join” is an operation where you match up records from one table with records from another table using as a “key” a piece of information that is common to both tables, usually some kind of ID code.

In this example:

For each candidate star we have selected so far, we have the source_id; the goal is to find the obj_id for the same star (we hope) in the Pan-STARRS catalog.

To do that we will:

  1. Use the JOIN operator to look up each source_id in the panstarrs1_best_neighbour table, which contains the obj_id of the best match for each star in the Gaia catalog; then

  2. Use the JOIN operator again to look up each obj_id in the panstarrs1_original_valid table, which contains the Pan-STARRS photometry data we want.

Before we get to the JOIN operation, let’s explore these tables. Here’s the metadata for panstarrs1_best_neighbour.

from astroquery.gaia import Gaia

meta = Gaia.load_table('gaiadr2.panstarrs1_best_neighbour')
Retrieving table 'gaiadr2.panstarrs1_best_neighbour'
Parsing table 'gaiadr2.panstarrs1_best_neighbour'...
Done.
print(meta)
TAP Table name: gaiadr2.gaiadr2.panstarrs1_best_neighbour
Description: Pan-STARRS1 BestNeighbour table lists each matched Gaia object with its
best neighbour in the external catalogue.
There are 1 327 157 objects in the filtered version of Pan-STARRS1 used
to compute this cross-match that have too early epochMean.
Num. columns: 7

And here are the columns.

for column in meta.columns:
    print(column.name)
source_id
original_ext_source_id
angular_distance
number_of_neighbours
number_of_mates
best_neighbour_multiplicity
gaia_astrometric_params

Here’s the documentation for these variables.

The ones we’ll use are:

Ideally, number_of_neighbours should be 1 and number_of_mates should be 0; in that case, there is a one-to-one match between the source in Gaia and the corresponding source in Pan-STARRS.

Here’s a query that selects these columns and returns the first 5 rows.

query = """SELECT 
TOP 5
source_id, number_of_neighbours, number_of_mates, original_ext_source_id
FROM gaiadr2.panstarrs1_best_neighbour
"""
job = Gaia.launch_job_async(query=query)
INFO: Query finished. [astroquery.utils.tap.core]
results = job.get_results()
results
<Table length=5>
     source_id      number_of_neighbours number_of_mates original_ext_source_id
       int64               int32              int16              int64         
------------------- -------------------- --------------- ----------------------
6745938972433480704                    1               0      69742925668851205
6030466788955954048                    1               0      69742509325691172
6756488099308169600                    1               0      69742879438541228
6700154994715046016                    1               0      69743055581721207
6757061941303252736                    1               0      69742856540241198

The Pan-STARRS table

Here’s the metadata for the table that contains the Pan-STARRS data.

meta = Gaia.load_table('gaiadr2.panstarrs1_original_valid')
Retrieving table 'gaiadr2.panstarrs1_original_valid'
Parsing table 'gaiadr2.panstarrs1_original_valid'...
Done.
print(meta)
TAP Table name: gaiadr2.gaiadr2.panstarrs1_original_valid
Description: The Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) is
a system for wide-field astronomical imaging developed and operated by
the Institute for Astronomy at the University of Hawaii. Pan-STARRS1
(PS1) is the first part of Pan-STARRS to be completed and is the basis
for Data Release 1 (DR1). The PS1 survey used a 1.8 meter telescope and
its 1.4 Gigapixel camera to image the sky in five broadband filters (g,
r, i, z, y).

The current table contains a filtered subsample of the 10 723 304 629
entries listed in the original ObjectThin table.
[Output truncated]

And here are the columns.

for column in meta.columns:
    print(column.name)
obj_name
obj_id
ra
dec
ra_error
dec_error
epoch_mean
g_mean_psf_mag
g_mean_psf_mag_error
g_flags
r_mean_psf_mag
[Output truncated]

Here’s the documentation for these variables .

The ones we’ll use are:

Here’s a query that selects these variables and returns the first 5 rows.

query = """SELECT 
TOP 5
obj_id, g_mean_psf_mag, i_mean_psf_mag 
FROM gaiadr2.panstarrs1_original_valid
"""
job = Gaia.launch_job_async(query=query)
INFO: Query finished. [astroquery.utils.tap.core]
results = job.get_results()
results
<Table length=5>
      obj_id      g_mean_psf_mag  i_mean_psf_mag 
                                       mag       
      int64          float64         float64     
----------------- -------------- ----------------
67130655389101425             -- 20.3516006469727
67553305590067819             --  19.779899597168
67551423248967849             -- 19.8889007568359
67132026238911331             -- 20.9062995910645
67553513677687787             -- 21.2831001281738

The following figure shows how these tables are related.

There’s no guarantee that the corresponding rows of these tables are in the same order, so the JOIN operation involves some searching. However, ADQL/SQL databases are implemented in a way that makes this kind of source efficient. If you are curious, you can read more about it.

Joining tables

Now let’s get to the details of performing a JOIN operation. As a starting place, let’s go all the way back to the cone search from Lesson 2.

query_cone = """SELECT 
TOP 10 
source_id
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

And let’s run it, to make sure we have a working query to build on.

from astroquery.gaia import Gaia

job = Gaia.launch_job_async(query=query_cone)
INFO: Query finished. [astroquery.utils.tap.core]
results = job.get_results()
results
<Table length=10>
     source_id     
       int64       
-------------------
3322773965056065536
3322773758899157120
3322774068134271104
3322773930696320512
3322774377374425728
3322773724537891456
3322773724537891328
[Output truncated]

Now we can start adding features. First, let’s replace source_id with a format specifier, columns:

query_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

Here are the columns we want from the Gaia table, again.

columns = 'source_id, ra, dec, pmra, pmdec'

query = query_base.format(columns=columns)
print(query)
SELECT 
source_id, ra, dec, pmra, pmdec
FROM gaiadr2.gaia_source
WHERE 1=CONTAINS(
  POINT(ra, dec),
  CIRCLE(88.8, 7.4, 0.08333333))

And let’s run the query again.

job = Gaia.launch_job_async(query=query)
INFO: Query finished. [astroquery.utils.tap.core]
results = job.get_results()
results
<Table length=594>
     source_id              ra        ...        pmdec       
                           deg        ...       mas / yr     
       int64             float64      ...       float64      
------------------- ----------------- ... -------------------
3322773965056065536 88.78178020183375 ... -2.5057036964736907
3322773758899157120 88.83227057144585 ...                  --
3322774068134271104  88.8206092188033 ... -1.5260889445858044
3322773930696320512 88.80843339290348 ... -0.9292104395445717
3322774377374425728 88.86806108182265 ... -3.8676624830902435
3322773724537891456 88.81308602813434 ... -33.078133430952086
[Output truncated]

Adding the best neighbor table

Now we’re ready for the first join. The join operation requires two clauses:

In this example, we join with gaiadr2.panstarrs1_best_neighbour AS best, which means we can refer to the best neighbor table with the abbreviated name best.

And the ON clause indicates that we’ll match up the source_id column from the Gaia table with the source_id column from the best neighbor table.

query_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source AS gaia
JOIN gaiadr2.panstarrs1_best_neighbour AS best
  ON gaia.source_id = best.source_id
WHERE 1=CONTAINS(
  POINT(gaia.ra, gaia.dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

SQL detail

In this example, the ON column has the same name in both tables, so we could replace the ON clause with a simpler USINGclause:

USING(source_id)

Now that there’s more than one table involved, we can’t use simple column names any more; we have to use qualified column names. In other words, we have to specify which table each column is in. Here’s the complete query, including the columns we want from the Gaia and best neighbor tables.

column_list = ['gaia.source_id',
               'gaia.ra',
               'gaia.dec',
               'gaia.pmra',
               'gaia.pmdec',
               'best.best_neighbour_multiplicity',
               'best.number_of_mates',
              ]
columns = ', '.join(column_list)

query = query_base.format(columns=columns)
print(query)
SELECT 
gaia.source_id, gaia.ra, gaia.dec, gaia.pmra, gaia.pmdec, best.best_neighbour_multiplicity, best.number_of_mates
FROM gaiadr2.gaia_source AS gaia
JOIN gaiadr2.panstarrs1_best_neighbour AS best
  ON gaia.source_id = best.source_id
WHERE 1=CONTAINS(
  POINT(gaia.ra, gaia.dec),
  CIRCLE(88.8, 7.4, 0.08333333))
job = Gaia.launch_job_async(query=query)
INFO: Query finished. [astroquery.utils.tap.core]
results = job.get_results()
results
<Table length=490>
     source_id              ra        ... number_of_mates
                           deg        ...                
       int64             float64      ...      int16     
------------------- ----------------- ... ---------------
3322773965056065536 88.78178020183375 ...               0
3322774068134271104  88.8206092188033 ...               0
3322773930696320512 88.80843339290348 ...               0
3322774377374425728 88.86806108182265 ...               0
3322773724537891456 88.81308602813434 ...               0
3322773724537891328 88.81570329208743 ...               0
[Output truncated]

Notice that this result has fewer rows than the previous result. That’s because there are sources in the Gaia table with no corresponding source in the Pan-STARRS table.

By default, the result of the join only includes rows where the same source_id appears in both tables. This default is called an “inner” join because the results include only the intersection of the two tables. You can read about the other kinds of join here.

Adding the Pan-STARRS table

Exercise (10 minutes)

Now we’re ready to bring in the Pan-STARRS table. Starting with the previous query, add a second JOIN clause that joins with gaiadr2.panstarrs1_original_valid, gives it the abbreviated name ps, and matches original_ext_source_id from the best neighbor table with obj_id from the Pan-STARRS table.

Add g_mean_psf_mag and i_mean_psf_mag to the column list, and run the query. The result should contain 490 rows and 9 columns.

Solution

query_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source as gaia
JOIN gaiadr2.panstarrs1_best_neighbour as best
  ON gaia.source_id = best.source_id
JOIN gaiadr2.panstarrs1_original_valid as ps
  ON best.original_ext_source_id = ps.obj_id
WHERE 1=CONTAINS(
  POINT(gaia.ra, gaia.dec),
  CIRCLE(88.8, 7.4, 0.08333333))
"""

column_list = ['gaia.source_id',
               'gaia.ra',
               'gaia.dec',
               'gaia.pmra',
               'gaia.pmdec',
               'best.best_neighbour_multiplicity',
               'best.number_of_mates',
               'ps.g_mean_psf_mag',
               'ps.i_mean_psf_mag']

columns = ', '.join(column_list)

query = query_base.format(columns=columns)
print(query)

job = Gaia.launch_job_async(query=query)
results = job.get_results()
results

Selecting by coordinates and proper motion

Now let’s bring in the WHERE clause from the previous lesson, which selects sources based on parallax, BP-RP color, sky coordinates, and proper motion.

Here’s query6_base from the previous lesson.

query6_base = """SELECT 
{columns}
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON({point_list}))
  AND 1 = CONTAINS(POINT(pmra, pmdec),
                   POLYGON({pm_point_list}))
"""

Let’s reload the Pandas Series that contains point_list and pm_point_list.

import pandas as pd

filename = 'gd1_data.hdf'
point_series = pd.read_hdf(filename, 'point_series')
point_series
point_list       135.306, 8.39862, 126.51, 13.4449, 163.017, 54...
pm_point_list     -4.05037121,-14.75623261, -3.41981085,-14.723...
dtype: object

Now we can assemble the query.

columns = 'source_id, ra, dec, pmra, pmdec'

query6 = query6_base.format(columns=columns,
                            point_list=point_series['point_list'],
                            pm_point_list=point_series['pm_point_list'])

print(query6)
SELECT 
source_id, ra, dec, pmra, pmdec
FROM gaiadr2.gaia_source
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(ra, dec), 
                   POLYGON(135.306, 8.39862, 126.51, 13.4449, 163.017, 54.2424, 172.933, 46.4726, 135.306, 8.39862))
  AND 1 = CONTAINS(POINT(pmra, pmdec),
                   POLYGON( -4.05037121,-14.75623261, -3.41981085,-14.72365546, -3.03521988,-14.44357135, -2.26847919,-13.7140236 , -2.61172203,-13.24797471, -2.73471401,-13.09054471, -3.19923146,-12.5942653 , -3.34082546,-12.47611926, -5.67489413,-11.16083338, -5.95159272,-11.10547884, -6.42394023,-11.05981295, -7.09631023,-11.95187806, -7.30641519,-12.24559977, -7.04016696,-12.88580702, -6.00347705,-13.75912098, -4.42442296,-14.74641176))

Again, let’s run it to make sure we are starting with a working query.

job = Gaia.launch_job_async(query=query6)
INFO: Query finished. [astroquery.utils.tap.core]
results = job.get_results()
results
<Table length=7345>
    source_id              ra         ...        pmdec       
                          deg         ...       mas / yr     
      int64             float64       ...       float64      
------------------ ------------------ ... -------------------
635559124339440000 137.58671691646745 ... -12.490481778113859
635860218726658176  138.5187065217173 ... -11.346409129876392
635674126383965568  138.8428741026386 ... -12.702779525389634
635535454774983040  137.8377518255436 ... -14.492308604905652
635497276810313600  138.0445160213759 ... -12.291499169815987
635614168640132864 139.59219748145836 ... -13.708904908478631
[Output truncated]

Exercise (15 minutes)

Create a new query base called query7_base that combines the WHERE clauses from the previous query with the JOIN clauses for the best neighbor and Pan-STARRS tables. Format the query base using the column names in column_list, and call the result query7.

Hint: Make sure you use qualified column names everywhere!

Run your query and download the results. The table you get should have 3725 rows and 9 columns.

Solution

query7_base = """
SELECT 
{columns}
FROM gaiadr2.gaia_source as gaia
JOIN gaiadr2.panstarrs1_best_neighbour as best
  ON gaia.source_id = best.source_id
JOIN gaiadr2.panstarrs1_original_valid as ps
  ON best.original_ext_source_id = ps.obj_id
WHERE parallax < 1
  AND bp_rp BETWEEN -0.75 AND 2 
  AND 1 = CONTAINS(POINT(gaia.ra, gaia.dec), 
                   POLYGON({point_list}))
  AND 1 = CONTAINS(POINT(gaia.pmra, gaia.pmdec),
                   POLYGON({pm_point_list}))
"""

columns = ', '.join(column_list)

query7 = query7_base.format(columns=columns,
                            point_list=point_series['point_list'],
                            pm_point_list=point_series['pm_point_list'])
print(query7)


job = Gaia.launch_job_async(query=query7)
results = job.get_results()
results

Checking the match

To get more information about the matching process, we can inspect best_neighbour_multiplicity, which indicates for each star in Gaia how many stars in Pan-STARRS are equally likely matches.

results['best_neighbour_multiplicity']
<MaskedColumn name='best_neighbour_multiplicity' dtype='int16' description='Number of neighbours with same probability as best neighbour' length=3725>
  1
  1
  1
  1
  1
  1
  1
  1
  1
  1
[Output truncated]

It looks like most of the values are 1, which is good; that means that for each candidate star we have identified exactly one source in Pan-STARRS that is likely to be the same star.

To check whether there are any values other than 1, we can convert this column to a Pandas Series and use describe, which we saw in in Lesson 3.

import pandas as pd

multiplicity = pd.Series(results['best_neighbour_multiplicity'])
multiplicity.describe()
count    3725.0
mean        1.0
std         0.0
min         1.0
25%         1.0
50%         1.0
75%         1.0
max         1.0
dtype: float64

In fact, 1 is the only value in the Series, so every candidate star has a single best match.

Similarly, number_of_mates indicates the number of other stars in Gaia that match with the same star in Pan-STARRS.

mates = pd.Series(results['number_of_mates'])
mates.describe()
count    3725.0
mean        0.0
std         0.0
min         0.0
25%         0.0
50%         0.0
75%         0.0
max         0.0
dtype: float64

All values in this column are 0, which means that for each match we found in Pan-STARRS, there are no other stars in Gaia that also match.

Detail: The table also contains number_of_neighbors which is the number of stars in Pan-STARRS that match in terms of position, before using other criteria to choose the most likely match. But we are more interested in the final match, using both criteria.

Transforming coordinates

Here’s the function we’ve used to transform the results from ICRS to GD-1 coordinates.

import astropy.units as u
from astropy.coordinates import SkyCoord
from gala.coordinates import GD1Koposov10
from gala.coordinates import reflex_correct

def make_dataframe(table):
    """Transform coordinates from ICRS to GD-1 frame.
    
    table: Astropy Table
    
    returns: Pandas DataFrame
    """
    skycoord = SkyCoord(
               ra=table['ra'], 
               dec=table['dec'],
               pm_ra_cosdec=table['pmra'],
               pm_dec=table['pmdec'], 
               distance=8*u.kpc, 
               radial_velocity=0*u.km/u.s)

    gd1_frame = GD1Koposov10()
    transformed = skycoord.transform_to(gd1_frame)
    skycoord_gd1 = reflex_correct(transformed)

    df = table.to_pandas()
    df['phi1'] = skycoord_gd1.phi1
    df['phi2'] = skycoord_gd1.phi2
    df['pm_phi1'] = skycoord_gd1.pm_phi1_cosphi2
    df['pm_phi2'] = skycoord_gd1.pm_phi2
    return df

Now can transform the result from the last query.

candidate_df = make_dataframe(results)

And see how it looks.

import matplotlib.pyplot as plt

x = candidate_df['phi1']
y = candidate_df['phi2']
plt.plot(x, y, 'ko', markersize=0.5, alpha=0.5)

plt.xlabel('phi1 (degree GD1)')
plt.ylabel('phi2 (degree GD1)');
<Figure size 432x288 with 1 Axes>

png

The result is similar to what we saw in the previous lesson, except that have fewer stars now, because we did not find photometry data for all of the candidate sources.

Saving the DataFrame

Let’s save this DataFrame so we can pick up where we left off without running this query again. The HDF file should already exist, so we’ll add candidate_df to it.

filename = 'gd1_data.hdf'

candidate_df.to_hdf(filename, 'candidate_df')

We can use getsize to confirm that the file exists and check the size:

from os.path import getsize

MB = 1024 * 1024
getsize(filename) / MB
3.5835609436035156

Summary

In this notebook, we used database JOIN operations to select photometry data for the stars we’ve identified as candidates to be in GD-1.

In the next notebook, we’ll use this data for a second round of selection, identifying stars that have photometry data consistent with GD-1.

But before you go on, you might be interested in another file format, CSV.

CSV

Pandas can write a variety of other formats, which you can read about here. We won’t cover all of them, but one other important one is CSV, which stands for “comma-separated values”.

CSV is a plain-text format that can be read and written by pretty much any tool that works with data. In that sense, it is the “least common denominator” of data formats.

However, it has an important limitation: some information about the data gets lost in translation, notably the data types. If you read a CSV file from someone else, you might need some additional information to make sure you are getting it right.

Also, CSV files tend to be big, and slow to read and write.

With those caveats, here’s how to write one:

candidate_df.to_csv('gd1_data.csv')

We can check the file size like this:

getsize('gd1_data.csv') / MB
0.7606849670410156

We can see the first few lines like this:

def head(filename, n=3):
    """Print the first `n` lines of a file."""
    with open(filename) as fp:
        for i in range(n):
            print(next(fp))
head('gd1_data.csv')
,source_id,ra,dec,pmra,pmdec,best_neighbour_multiplicity,number_of_mates,g_mean_psf_mag,i_mean_psf_mag,phi1,phi2,pm_phi1,pm_phi2

0,635860218726658176,138.5187065217173,19.09233926905897,-5.941679495793577,-11.346409129876392,1,0,17.8978004455566,17.5174007415771,-59.247329893833296,-2.016078400820631,-7.527126084640531,1.7487794924176672

1,635674126383965568,138.8428741026386,19.031798198627634,-3.8970011609340207,-12.702779525389634,1,0,19.2873001098633,17.6781005859375,-59.13339098769217,-2.306900745179831,-7.560607655557415,-0.7417999555980248

The CSV file contains the names of the columns, but not the data types.

We can read the CSV file back like this:

read_back_csv = pd.read_csv('gd1_data.csv')

Let’s compare the first few rows of candidate_df and read_back_csv

candidate_df.head(3)
            source_id          ra        dec      pmra      pmdec  \
0  635860218726658176  138.518707  19.092339 -5.941679 -11.346409   
1  635674126383965568  138.842874  19.031798 -3.897001 -12.702780   
2  635535454774983040  137.837752  18.864007 -4.335041 -14.492309   

   best_neighbour_multiplicity  number_of_mates  g_mean_psf_mag  \
0                            1                0         17.8978   
1                            1                0         19.2873   
2                            1                0         16.9238   

   i_mean_psf_mag       phi1      phi2   pm_phi1   pm_phi2  
[Output truncated]
read_back_csv.head(3)
   Unnamed: 0           source_id          ra        dec      pmra      pmdec  \
0           0  635860218726658176  138.518707  19.092339 -5.941679 -11.346409   
1           1  635674126383965568  138.842874  19.031798 -3.897001 -12.702780   
2           2  635535454774983040  137.837752  18.864007 -4.335041 -14.492309   

   best_neighbour_multiplicity  number_of_mates  g_mean_psf_mag  \
0                            1                0         17.8978   
1                            1                0         19.2873   
2                            1                0         16.9238   

   i_mean_psf_mag       phi1      phi2   pm_phi1   pm_phi2  
[Output truncated]

Notice that the index in candidate_df has become an unnamed column in read_back_csv. The Pandas functions for writing and reading CSV files provide options to avoid that problem, but this is an example of the kind of thing that can go wrong with CSV files.

Key Points

  • Use JOIN operations to combine data from multiple tables in a database, using some kind of identifier to match up records from one table with records from another. This is another example of a practice we saw in the previous notebook, moving the computation to the data.

  • For most applications, saving data in FITS or HDF5 is better than CSV. FITS and HDF5 are binary formats, so the files are usually smaller, and they store metadata, so you don’t lose anything when you read the file back.

  • On the other hand, CSV is a ‘least common denominator’ format; that is, it can be read by practically any application that works with data.


Photometry

Overview

Teaching: 55 min
Exercises: 0 min
Questions
  • How do we use Matplotlib to define a polygon and select points that fall inside it?

Objectives
  • Use isochrone data to specify a polygon and determine which points fall inside it.

  • Use Matplotlib features to customize the appearance of figures.

6. Photometry

This is the sixth in a series of notebooks related to astronomy data.

As a continuing example, we will replicate part of the analysis in a recent paper, “Off the beaten path: Gaia reveals GD-1 stars outside of the main stream” by Adrian M. Price-Whelan and Ana Bonaca.

In the previous lesson we downloaded photometry data from Pan-STARRS, which is available from the same server we’ve been using to get Gaia data.

The next step in the analysis is to select candidate stars based on the photometry data. The following figure from the paper is a color-magnitude diagram showing the stars we previously selected based on proper motion:

In red is a theoretical isochrone, showing where we expect the stars in GD-1 to fall based on the metallicity and age of their original globular cluster.

By selecting stars in the shaded area, we can further distinguish the main sequence of GD-1 from mostly younger background stars.

Outline

  1. We’ll reload the data from the previous notebook and make a color-magnitude diagram.

  2. We’ll use an isochrone computed by MIST to specify a polygonal region in the color-magnitude diagram and select the stars inside it.

Reload the data

You can download the data from the previous lesson or run the following cell, which downloads it if necessary.

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/AstronomicalData/raw/main/' +
         'data/gd1_data.hdf')

Now we can reload candidate_df.

import pandas as pd

filename = 'gd1_data.hdf'
candidate_df = pd.read_hdf(filename, 'candidate_df')

Plotting photometry data

Now that we have photometry data from Pan-STARRS, we can replicate the color-magnitude diagram from the original paper:

The y-axis shows the apparent magnitude of each source with the g filter.

The x-axis shows the difference in apparent magnitude between the g and i filters, which indicates color.

Stars with lower values of (g-i) are brighter in g-band than in i-band, compared to other stars, which means they are bluer.

Stars in the lower-left quadrant of this diagram are less bright than the others, and have lower metallicity, which means they are likely to be older.

Since we expect the stars in GD-1 to be older than the background stars, the stars in the lower-left are more likely to be in GD-1.

The following function takes a table containing photometry data and draws a color-magnitude diagram. The input can be an Astropy Table or Pandas DataFrame, as long as it has columns named g_mean_psf_mag and i_mean_psf_mag.

import matplotlib.pyplot as plt

def plot_cmd(table):
    """Plot a color magnitude diagram.
    
    table: Table or DataFrame with photometry data
    """
    y = table['g_mean_psf_mag']
    x = table['g_mean_psf_mag'] - table['i_mean_psf_mag']

    plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

    plt.xlim([0, 1.5])
    plt.ylim([14, 22])
    plt.gca().invert_yaxis()

    plt.ylabel('$Magnitude (g)$')
    plt.xlabel('$Color (g-i)$')

plot_cmd uses a new function, invert_yaxis, to invert the y axis, which is conventional when plotting magnitudes, since lower magnitude indicates higher brightness.

plt.gca().invert_yaxis()

gca stands for “get current axis”. It returns an object that represents the axes of the current figure, and that object provides invert_yaxis.

Warning

invert_yaxis is a little different from the other functions we’ve used. You can’t call it like this:

plt.invert_yaxis()

The most likely reason for this inconsistency in the interface is that invert_yaxis is a lesser-used function, so it’s not made available at the top level of the interface.

Here’s what the results look like.

plot_cmd(candidate_df)
<Figure size 432x288 with 1 Axes>

png

Our figure does not look exactly like the one in the paper because we are working with a smaller region of the sky, so we don’t have as many stars. But we can see an overdense region in the lower left that contains stars with the photometry we expect for GD-1.

In the next section we’ll use an isochrone to specify a polygon that contains this overdense regioin.

Isochrone

Based on our best estimates for the ages of the stars in GD-1 and their metallicity, we can compute a stellar isochrone that predicts the relationship between their magnitude and color.

In fact, we can use MESA Isochrones & Stellar Tracks (MIST) to compute it for us.

Using the MIST Version 1.2 web interface, we computed an isochrone with the following parameters:

The following cell downloads the results:

download('https://github.com/AllenDowney/AstronomicalData/raw/main/' +
         'data/MIST_iso_5fd2532653c27.iso.cmd')

To read this file we’ll download a Python module from this repository.

download('https://github.com/jieunchoi/MIST_codes/raw/master/scripts/' +
         'read_mist_models.py')

Now we can read the file:

import read_mist_models

filename = 'MIST_iso_5fd2532653c27.iso.cmd'
iso = read_mist_models.ISOCMD(filename)
Reading in: MIST_iso_5fd2532653c27.iso.cmd

The result is an ISOCMD object.

type(iso)
read_mist_models.ISOCMD

It contains a list of arrays, one for each isochrone.

type(iso.isocmds)
list

We only got one isochrone.

len(iso.isocmds)
1

So we can select it like this:

iso_array = iso.isocmds[0]

It’s a NumPy array:

type(iso_array)
numpy.ndarray

But it’s an unusual NumPy array, because it contains names for the columns.

iso_array.dtype
dtype([('EEP', '<i4'), ('isochrone_age_yr', '<f8'), ('initial_mass', '<f8'), ('star_mass', '<f8'), ('log_Teff', '<f8'), ('log_g', '<f8'), ('log_L', '<f8'), ('[Fe/H]_init', '<f8'), ('[Fe/H]', '<f8'), ('PS_g', '<f8'), ('PS_r', '<f8'), ('PS_i', '<f8'), ('PS_z', '<f8'), ('PS_y', '<f8'), ('PS_w', '<f8'), ('PS_open', '<f8'), ('phase', '<f8')])

Which means we can select columns using the bracket operator:

iso_array['phase']
array([0., 0., 0., ..., 6., 6., 6.])

We can use phase to select the part of the isochrone for stars in the main sequence and red giant phases.

phase_mask = (iso_array['phase'] >= 0) & (iso_array['phase'] < 3)
phase_mask.sum()
354
main_sequence = iso_array[phase_mask]
len(main_sequence)
354

The other two columns we’ll use are PS_g and PS_i, which contain simulated photometry data for stars with the given age and metallicity, based on a model of the Pan-STARRS sensors.

We’ll use these columns to superimpose the isochrone on the color-magnitude diagram, but first we have to use a distance modulus to scale the isochrone based on the estimated distance of GD-1.

We can use the Distance object from Astropy to compute the distance modulus.

import astropy.coordinates as coord
import astropy.units as u

distance = 7.8 * u.kpc
distmod = coord.Distance(distance).distmod.value
distmod
14.4604730134524

Now we can compute the scaled magnitude and color of the isochrone.

mag_g = main_sequence['PS_g'] + distmod
color_g_i = main_sequence['PS_g'] - main_sequence['PS_i']

Now we can plot it on the color-magnitude diagram like this.

plot_cmd(candidate_df)
plt.plot(color_g_i, mag_g);
<Figure size 432x288 with 1 Axes>

png

The theoretical isochrone passes through the overdense region where we expect to find stars in GD-1.

Let’s save this result so we can reload it later without repeating the steps in this section.

So we can save the data in an HDF5 file, we’ll put it in a Pandas DataFrame first:

import pandas as pd

iso_df = pd.DataFrame()
iso_df['mag_g'] = mag_g
iso_df['color_g_i'] = color_g_i

iso_df.head()
       mag_g  color_g_i
0  28.294743   2.195021
1  28.189718   2.166076
2  28.051761   2.129312
3  27.916194   2.093721
4  27.780024   2.058585

And then save it.

filename = 'gd1_isochrone.hdf5'
iso_df.to_hdf(filename, 'iso_df')

Making a polygon

The following cell downloads the isochrone we made in the previous section, if necessary.

download('https://github.com/AllenDowney/AstronomicalData/raw/main/data/' +
         'gd1_isochrone.hdf5')

Now we can read it back in.

filename = 'gd1_isochrone.hdf5'
iso_df = pd.read_hdf(filename, 'iso_df')
iso_df.head()
       mag_g  color_g_i
0  28.294743   2.195021
1  28.189718   2.166076
2  28.051761   2.129312
3  27.916194   2.093721
4  27.780024   2.058585

Here’s what the isochrone looks like on the color-magnitude diagram.

plot_cmd(candidate_df)
plt.plot(iso_df['color_g_i'], iso_df['mag_g']);
<Figure size 432x288 with 1 Axes>

png

In the bottom half of the figure, the isochrone passes through the overdense region where the stars are likely to belong to GD-1.

In the top half, the isochrone passes through other regions where the stars have higher magnitude and metallicity than we expect for stars in GD-1.

So we’ll select the part of the isochrone that lies in the overdense region.

g_mask is a Boolean Series that is True where g is between 18.0 and 21.5.

g = iso_df['mag_g']

g_mask = (g > 18.0) & (g < 21.5)
g_mask.sum()
117

We can use it to select the corresponding rows in iso_df:

iso_masked = iso_df[g_mask]
iso_masked.head()
        mag_g  color_g_i
94  21.411746   0.692171
95  21.322466   0.670238
96  21.233380   0.648449
97  21.144427   0.626924
98  21.054549   0.605461

Now, to select the stars in the overdense region, we have to define a polygon that includes stars near the isochrone.

The original paper uses the following formulas to define the left and right boundaries.

g = iso_masked['mag_g']
left_color = iso_masked['color_g_i'] - 0.4 * (g/28)**5
right_color = iso_masked['color_g_i'] + 0.8 * (g/28)**5

The intention is to define a polygon that gets wider as g increases, to reflect increasing uncertainty.

But we can do about as well with a simpler formula:

g = iso_masked['mag_g']
left_color = iso_masked['color_g_i'] - 0.06
right_color = iso_masked['color_g_i'] + 0.12

Here’s what these boundaries look like:

plot_cmd(candidate_df)

plt.plot(left_color, g, label='left color')
plt.plot(right_color, g, label='right color')

plt.legend();
<Figure size 432x288 with 1 Axes>

png

Which points are in the polygon?

Matplotlib provides a Polygon object that we can use to check which points fall in the polygon we just constructed.

To make a Polygon, we need to assemble g, left_color, and right_color into a loop, so the points in left_color are connected to the points of right_color in reverse order.

We’ll use the following function, which takes two arrays and joins them front-to-back:

import numpy as np

def front_to_back(first, second):
    """Join two arrays front to back."""
    return np.append(first, second[::-1])

front_to_back uses a “slice index” to reverse the elements of second.

As explained in the NumPy documentation, a slice index has three parts separated by colons:

In this example, start and stop are omitted, which means all elements are selected.

And step is -1, which means the elements are in reverse order.

We can use front_to_back to make a loop that includes the elements of left_color and right_color:

color_loop = front_to_back(left_color, right_color)
color_loop.shape
(234,)

And a corresponding loop with the elements of g in forward and reverse order.

mag_loop = front_to_back(g, g)
mag_loop.shape
(234,)

Here’s what the loop looks like.

plot_cmd(candidate_df)
plt.plot(color_loop, mag_loop);
<Figure size 432x288 with 1 Axes>

png

To make a Polygon, it will be convenient to put color_loop and mag_loop into a DataFrame:

loop_df = pd.DataFrame()
loop_df['color_loop'] = color_loop
loop_df['mag_loop'] = mag_loop
loop_df.head()
   color_loop   mag_loop
0    0.632171  21.411746
1    0.610238  21.322466
2    0.588449  21.233380
3    0.566924  21.144427
4    0.545461  21.054549

Now we can pass loop_df to Polygon:

from matplotlib.patches import Polygon

polygon = Polygon(loop_df)
polygon
<matplotlib.patches.Polygon at 0x7f439d33fdf0>

The result is a Polygon object , which provides contains_points, which figures out which points are inside the polygon.

To test it, we’ll create a list with two points, one inside the polygon and one outside.

points = [(0.4, 20), 
          (0.4, 16)]

Now we can make sure contains_points does what we expect.

inside = polygon.contains_points(points)
inside
array([ True, False])

The result is an array of Boolean values.

We are almost ready to select stars whose photometry data falls in this polygon. But first we need to do some data cleaning.

Save the polygon

Reproducibile research is “the idea that … the full computational environment used to produce the results in the paper such as the code, data, etc. can be used to reproduce the results and create new work based on the research.”

This Jupyter notebook is an example of reproducible research because it contains all of the code needed to reproduce the results, including the database queries that download the data and and analysis.

In this lesson we used an isochrone to derive a polygon, which we used to select stars based on photometry. So it is important to record the polygon as part of the data analysis pipeline.

Here’s how we can save it in an HDF file.

filename = 'gd1_data.hdf'
loop_df.to_hdf(filename, 'loop_df')

Selecting based on photometry

Now let’s see how many of the candidate stars are inside the polygon we chose. We’ll put color and magnitude data from candidate_df into a new DataFrame:

points = pd.DataFrame()

points['color'] = candidate_df['g_mean_psf_mag'] - candidate_df['i_mean_psf_mag']
points['mag'] = candidate_df['g_mean_psf_mag']

points.head()
    color      mag
0  0.3804  17.8978
1  1.6092  19.2873
2  0.4457  16.9238
3  1.5902  19.9242
4  1.4853  16.1516

Which we can pass to contains_points:

inside = polygon.contains_points(points)
inside
array([False, False, False, ..., False, False, False])

The result is a Boolean array. We can use sum to see how many stars fall in the polygon.

inside.sum()
454

Now we can use inside as a mask to select stars that fall inside the polygon.

winner_df = candidate_df[inside]

Let’s make a color-magnitude plot one more time, highlighting the selected stars with green markers.

plot_cmd(candidate_df)
plt.plot(color_g_i, mag_g)
plt.plot(color_loop, mag_loop)

x = winner_df['g_mean_psf_mag'] - winner_df['i_mean_psf_mag']
y = winner_df['g_mean_psf_mag']
plt.plot(x, y, 'go', markersize=0.5, alpha=0.5);
<Figure size 432x288 with 1 Axes>

png

It looks like the selected stars are, in fact, inside the polygon, which means they have photometry data consistent with GD-1.

Finally, we can plot the coordinates of the selected stars:

plt.figure(figsize=(10,2.5))

x = winner_df['phi1']
y = winner_df['phi2']
plt.plot(x, y, 'ko', markersize=0.7, alpha=0.9)

plt.xlabel('ra (degree GD1)')
plt.ylabel('dec (degree GD1)')

plt.axis('equal');
<Figure size 720x180 with 1 Axes>

png

This example includes two new Matplotlib commands:

In an example like this, where x and y represent coordinates in space, equal axes ensures that the distance between points is represented accurately.

Write the data

Finally, let’s write the selected stars to a file.

filename = 'gd1_data.hdf'
winner_df.to_hdf(filename, 'winner_df')
from os.path import getsize

MB = 1024 * 1024
getsize(filename) / MB
3.6441001892089844

Summary

In this lesson, we used photometry data from Pan-STARRS to draw a color-magnitude diagram. We used an isochrone to define a polygon and select stars we think are likely to be in GD-1. Plotting the results, we have a clearer picture of GD-1, similar to Figure 1 in the original paper.

Key Points

  • Matplotlib provides operations for working with points, polygons, and other geometric entities, so it’s not just for making figures.

  • Use Matplotlib options to control the size and aspect ratio of figures to make them easier to interpret.

  • Record every element of the data analysis pipeline that would be needed to replicate the results.


Visualization

Overview

Teaching: 55 min
Exercises: 30 min
Questions
  • How do we make a compelling visualization that tells a story?

Objectives
  • Design a figure that tells a compelling story.

  • Use Matplotlib features to customize the appearance of figures.

  • Generate a figure with multiple subplots.

7. Visualization

This is the seventh in a series of notebooks related to astronomy data.

As a continuing example, we will replicate part of the analysis in a recent paper, “Off the beaten path: Gaia reveals GD-1 stars outside of the main stream” by Adrian M. Price-Whelan and Ana Bonaca.

In the previous notebook we selected photometry data from Pan-STARRS and used it to identify stars we think are likely to be in GD-1

In this notebook, we’ll take the results from previous lessons and use them to make a figure that tells a compelling scientific story.

Outline

  1. Starting with the figure from the previous notebook, we’ll add annotations to present the results more clearly.

  2. The we’ll see several ways to customize figures to make them more appealing and effective.

  3. Finally, we’ll see how to make a figure with multiple panels or subplots.

Making Figures That Tell a Story

So far the figure we’ve made have been “quick and dirty”. Mostly we have used Matplotlib’s default style, although we have adjusted a few parameters, like markersize and alpha, to improve legibility.

Now that the analysis is done, it’s time to think more about:

  1. Making professional-looking figures that are ready for publication, and

  2. Making figures that communicate a scientific result clearly and compellingly.

Not necessarily in that order.

Let’s start by reviewing Figure 1 from the original paper. We’ve seen the individual panels, but now let’s look at the whole thing, along with the caption:

Exercise (5 minutes)

Think about the following questions:

  1. What is the primary scientific result of this work?

  2. What story is this figure telling?

  3. In the design of this figure, can you identify 1-2 choices the authors made that you think are effective? Think about big-picture elements, like the number of panels and how they are arranged, as well as details like the choice of typeface.

  4. Can you identify 1-2 elements that could be improved, or that you might have done differently?

Solution

Some topics that might come up in this discussion:

  1. The primary result is that the multiple stages of selection make it possible to separate likely candidates from the background more effectively than in previous work, which makes it possible to see the structure of GD-1 in “unprecedented detail”.

  2. The figure documents the selection process as a sequence of steps. Reading right-to-left, top-to-bottom, we see selection based on proper motion, the results of the first selection, selection based on color and magnitude, and the results of the second selection. So this figure documents the methodology and presents the primary result.

  3. It’s mostly black and white, with minimal use of color, so it will work well in print. The annotations in the bottom left panel guide the reader to the most important results.
    It contains enough technical detail for a professional audience, but most of it is also comprehensible to a more general audience.
    The two left panels have the same dimensions and their axes are aligned.

  4. Since the panels represent a sequence, it might be better to arrange them left-to-right. The placement and size of the axis labels could be tweaked. The entire figure could be a little bigger to match the width and proportion of the caption.
    The top left panel has unnused white space (but that leaves space for the annotations in the bottom left).

Plotting GD-1

Let’s start with the panel in the lower left. You can download the data from the previous lesson or run the following cell, which downloads it if necessary.

from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/AstronomicalData/raw/main/' +
         'data/gd1_data.hdf')

Now we can reload winner_df

import pandas as pd

filename = 'gd1_data.hdf'
winner_df = pd.read_hdf(filename, 'winner_df')
import matplotlib.pyplot as plt

def plot_second_selection(df):
    x = df['phi1']
    y = df['phi2']

    plt.plot(x, y, 'ko', markersize=0.7, alpha=0.9)

    plt.xlabel('$\phi_1$ [deg]')
    plt.ylabel('$\phi_2$ [deg]')
    plt.title('Proper motion + photometry selection', fontsize='medium')

    plt.axis('equal')

And here’s what it looks like.

plt.figure(figsize=(10,2.5))
plot_second_selection(winner_df)
<Figure size 1000x250 with 1 Axes>

png

Annotations

The figure in the paper uses three other features to present the results more clearly and compellingly:

Exercise (15 minutes)

Choose any or all of these features and add them to the figure:

And here is some additional information about text and arrows.

Solution

plt.axvline(-55, ls='--', color='gray', 
            alpha=0.4, dashes=(6,4), lw=2)
plt.text(-60, 5.5, 'Previously\nundetected', 
         fontsize='small', ha='right', va='top');

arrowprops=dict(color='gray', shrink=0.05, width=1.5, 
                headwidth=6, headlength=8, alpha=0.4)

plt.annotate('Spur', xy=(-33, 2), xytext=(-35, 5.5),
             arrowprops=arrowprops,
             fontsize='small')

plt.annotate('Gap', xy=(-22, -1), xytext=(-25, -5.5),
             arrowprops=arrowprops,
             fontsize='small')

Customization

Matplotlib provides a default style that determines things like the colors of lines, the placement of labels and ticks on the axes, and many other properties.

There are several ways to override these defaults and customize your figures:

As a simple example, notice that Matplotlib puts ticks on the outside of the figures by default, and only on the left and bottom sides of the axes.

To change this behavior, you can use gca() to get the current axes and tick_params to change the settings.

Here’s how you can put the ticks on the inside of the figure:

plt.gca().tick_params(direction='in')

Exercise (5 minutes)

Read the documentation of tick_params and use it to put ticks on the top and right sides of the axes.

Solution

plt.gca().tick_params(top=True, right=True)

rcParams

If you want to make a customization that applies to all figures in a notebook, you can use rcParams.

Here’s an example that reads the current font size from rcParams:

plt.rcParams['font.size']
10.0

And sets it to a new value:

plt.rcParams['font.size'] = 14

As an exercise, plot the previous figure again, and see what font sizes have changed. Look up any other element of rcParams, change its value, and check the effect on the figure.

If you find yourself making the same customizations in several notebooks, you can put changes to rcParams in a matplotlibrc file, which you can read about here.

Style sheets

The matplotlibrc file is read when you import Matplotlib, so it is not easy to switch from one set of options to another.

The solution to this problem is style sheets, which you can read about here.

Matplotlib provides a set of predefined style sheets, or you can make your own.

The following cell displays a list of style sheets installed on your system.

plt.style.available
['Solarize_Light2',
 '_classic_test_patch',
 'bmh',
 'classic',
 'dark_background',
 'fast',
 'fivethirtyeight',
 'ggplot',
 'grayscale',
 'seaborn',
 'seaborn-bright',
[Output truncated]

Note that seaborn-paper, seaborn-talk and seaborn-poster are particularly intended to prepare versions of a figure with text sizes and other features that work well in papers, talks, and posters.

To use any of these style sheets, run plt.style.use like this:

plt.style.use('fivethirtyeight')

The style sheet you choose will affect the appearance of all figures you plot after calling use, unless you override any of the options or call use again.

As an exercise, choose one of the styles on the list and select it by calling use. Then go back and plot one of the figures above and see what effect it has.

If you can’t find a style sheet that’s exactly what you want, you can make your own. This repository includes a style sheet called az-paper-twocol.mplstyle, with customizations chosen by Azalee Bostroem for publication in astronomy journals.

You can download the style sheet or run the following cell, which downloads it if necessary.

download('https://github.com/AllenDowney/AstronomicalData/raw/main/' +
         'az-paper-twocol.mplstyle')

You can use it like this:

plt.style.use('./az-paper-twocol.mplstyle')

The prefix ./ tells Matplotlib to look for the file in the current directory.

As an alternative, you can install a style sheet for your own use by putting it in your configuration directory. To find out where that is, you can run the following command:

import matplotlib as mpl

mpl.get_configdir()

LaTeX fonts

When you include mathematical expressions in titles, labels, and annotations, Matplotlib uses mathtext to typeset them. mathtext uses the same syntax as LaTeX, but it provides only a subset of its features.

If you need features that are not provided by mathtext, or you prefer the way LaTeX typesets mathematical expressions, you can customize Matplotlib to use LaTeX.

In matplotlibrc or in a style sheet, you can add the following line:

text.usetex        : true

Or in a notebook you can run the following code.

plt.rcParams['text.usetex'] = True
plt.rcParams['text.usetex'] = True

If you go back and draw the figure again, you should see the difference.

Warning

If you get an error message like

LaTeX Error: File `type1cm.sty' not found.

You might have to install a package that contains the fonts LaTeX needs. On some systems, the packages texlive-latex-extra or cm-super might be what you need. See here for more help with this.

In case you are curious, cm stands for Computer Modern, the font LaTeX uses to typeset math.

Before we go on, let’s put things back where we found them.

plt.rcParams['text.usetex'] = False
plt.style.use('default')

Multiple panels

So far we’ve been working with one figure at a time, but the figure we are replicating contains multiple panels, also known as “subplots”.

Confusingly, Matplotlib provides three functions for making figures like this: subplot, subplots, and subplot2grid.

* subplot is simple and similar to MATLAB, so if you are familiar with that interface, you might like subplot

* subplots is more object-oriented, which some people prefer.

* subplot2grid is most convenient if you want to control the relative sizes of the subplots.

So we’ll use subplot2grid.

All of these functions are easier to use if we put the code that generates each panel in a function.

Upper right

To make the panel in the upper right, we have to reload centerline_df.

filename = 'gd1_data.hdf'
centerline_df = pd.read_hdf(filename, 'centerline_df')

And define the coordinates of the rectangle we selected.

pm1_min = -8.9
pm1_max = -6.9
pm2_min = -2.2
pm2_max =  1.0

pm1_rect = [pm1_min, pm1_min, pm1_max, pm1_max]
pm2_rect = [pm2_min, pm2_max, pm2_max, pm2_min]

To plot this rectangle, we’ll use a feature we have not seen before: Polygon, which is provided by Matplotlib.

To create a Polygon, we have to put the coordinates in an array with x values in the first column and y values in the second column.

import numpy as np

vertices = np.transpose([pm1_rect, pm2_rect])
vertices
array([[-8.9, -2.2],
       [-8.9,  1. ],
       [-6.9,  1. ],
       [-6.9, -2.2]])

The following function takes a DataFrame as a parameter, plots the proper motion for each star, and adds a shaded Polygon to show the region we selected.

from matplotlib.patches import Polygon

def plot_proper_motion(df):
    pm1 = df['pm_phi1']
    pm2 = df['pm_phi2']

    plt.plot(pm1, pm2, 'ko', markersize=0.3, alpha=0.3)
    
    poly = Polygon(vertices, closed=True, 
                   facecolor='C1', alpha=0.4)
    plt.gca().add_patch(poly)
    
    plt.xlabel('$\mu_{\phi_1} [\mathrm{mas~yr}^{-1}]$')
    plt.ylabel('$\mu_{\phi_2} [\mathrm{mas~yr}^{-1}]$')

    plt.xlim(-12, 8)
    plt.ylim(-10, 10)

Notice that add_patch is like invert_yaxis; in order to call it, we have to use gca to get the current axes.

Here’s what the new version of the figure looks like. We’ve changed the labels on the axes to be consistent with the paper.

plot_proper_motion(centerline_df)
<Figure size 640x480 with 1 Axes>

png

Upper left

Now let’s work on the panel in the upper left. We have to reload candidates.

filename = 'gd1_data.hdf'
candidate_df = pd.read_hdf(filename, 'candidate_df')

Here’s a function that takes a DataFrame of candidate stars and plots their positions in GD-1 coordindates.

def plot_first_selection(df):
    x = df['phi1']
    y = df['phi2']

    plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

    plt.xlabel('$\phi_1$ [deg]')
    plt.ylabel('$\phi_2$ [deg]')
    plt.title('Proper motion selection', fontsize='medium')

    plt.axis('equal')

And here’s what it looks like.

plot_first_selection(candidate_df)
<Figure size 640x480 with 1 Axes>

png

Lower right

For the figure in the lower right, we’ll use this function to plots the color-magnitude diagram.

import matplotlib.pyplot as plt

def plot_cmd(table):
    """Plot a color magnitude diagram.
    
    table: Table or DataFrame with photometry data
    """
    y = table['g_mean_psf_mag']
    x = table['g_mean_psf_mag'] - table['i_mean_psf_mag']

    plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3)

    plt.xlim([0, 1.5])
    plt.ylim([14, 22])
    plt.gca().invert_yaxis()

    plt.ylabel('$Magnitude (g)$')
    plt.xlabel('$Color (g-i)$')

Here’s what it looks like.

plot_cmd(candidate_df)
<Figure size 640x480 with 1 Axes>

png

And here’s how we read it back.

filename = 'gd1_data.hdf'
loop_df = pd.read_hdf(filename, 'loop_df')
loop_df.head()
   color_loop   mag_loop
0    0.632171  21.411746
1    0.610238  21.322466
2    0.588449  21.233380
3    0.566924  21.144427
4    0.545461  21.054549
color_loop mag_loop
0 0.632171 21.411746
1 0.610238 21.322466
2 0.588449 21.233380
3 0.566924 21.144427
4 0.545461 21.054549

Exercise (5 minutes)

Add a few lines to plot_cmd to show the polygon we selected as a shaded area.

Hint: pass coords as an argument to Polygon and plot it using add_patch.

Solution

poly = Polygon(loop_df, closed=True, 
              facecolor='C1', alpha=0.4)
plt.gca().add_patch(poly)

Subplots

Now we’re ready to put it all together. To make a figure with four subplots, we’ll use subplot2grid, which requires two arguments:

In this example, shape is (2, 2) to create two rows and two columns.

For the first panel, loc is (0, 0), which indicates row 0 and column 0, which is the upper-left panel.

Here’s how we use it to draw the four panels.

shape = (2, 2)
plt.subplot2grid(shape, (0, 0))
plot_first_selection(candidate_df)

plt.subplot2grid(shape, (0, 1))
plot_proper_motion(centerline_df)

plt.subplot2grid(shape, (1, 0))
plot_second_selection(winner_df)

plt.subplot2grid(shape, (1, 1))
plot_cmd(candidate_df)
poly = Polygon(loop_df, closed=True, 
               facecolor='C1', alpha=0.4)
plt.gca().add_patch(poly)

plt.tight_layout()
<Figure size 640x480 with 4 Axes>

png

We use plt.tight_layout at the end, which adjusts the sizes of the panels to make sure the titles and axis labels don’t overlap.

As an exercise, see what happens if you leave out tight_layout.

Adjusting proportions

In the previous figure, the panels are all the same size. To get a better view of GD-1, we’d like to stretch the panels on the left and compress the ones on the right.

To do that, we’ll use the colspan argument to make a panel that spans multiple columns in the grid.

In the following example, shape is (2, 4), which means 2 rows and 4 columns.

The panels on the left span three columns, so they are three times wider than the panels on the right.

At the same time, we use figsize to adjust the aspect ratio of the whole figure.

plt.figure(figsize=(9, 4.5))

shape = (2, 4)
plt.subplot2grid(shape, (0, 0), colspan=3)
plot_first_selection(candidate_df)

plt.subplot2grid(shape, (0, 3))
plot_proper_motion(centerline_df)

plt.subplot2grid(shape, (1, 0), colspan=3)
plot_second_selection(winner_df)

plt.subplot2grid(shape, (1, 3))
plot_cmd(candidate_df)
poly = Polygon(loop_df, closed=True, 
               facecolor='C1', alpha=0.4)
plt.gca().add_patch(poly)

plt.tight_layout()
<Figure size 900x450 with 4 Axes>

png

This is looking more and more like the figure in the paper.

Exercise (5 minutes)

In this example, the ratio of the widths of the panels is 3:1. How would you adjust it if you wanted the ratio to be 3:2?

Solution


plt.figure(figsize=(9, 4.5))

shape = (2, 5)                                   # CHANGED
plt.subplot2grid(shape, (0, 0), colspan=3)
plot_first_selection(candidate_df)

plt.subplot2grid(shape, (0, 3), colspan=2)       # CHANGED
plot_proper_motion(centerline_df)

plt.subplot2grid(shape, (1, 0), colspan=3)
plot_second_selection(winner_df)

plt.subplot2grid(shape, (1, 3), colspan=2)       # CHANGED
plot_cmd(candidate_df)
poly = Polygon(coords, closed=True, 
               facecolor='C1', alpha=0.4)
plt.gca().add_patch(poly)

plt.tight_layout()

Summary

In this notebook, we reverse-engineered the figure we’ve been replicating, identifying elements that seem effective and others that could be improved.

We explored features Matplotlib provides for adding annotations to figures – including text, lines, arrows, and polygons – and several ways to customize the appearance of figures. And we learned how to create figures that contain multiple panels.

Key Points

  • The most effective figures focus on telling a single story clearly.

  • Consider using annotations to guide the reader’s attention to the most important elements of a figure.

  • The default Matplotlib style generates good quality figures, but there are several ways you can override the defaults.

  • If you find yourself making the same customizations on several projects, you might want to create your own style sheet.