OverviewTeaching: 30 min
Exercises: 10 minQuestions
How do we use Matplotlib to define a polygon and select points that fall inside it?Objectives
Use isochrone data to specify a polygon and determine which points fall inside it.
Use Matplotlib features to customize the appearance of figures.
In the previous episode we downloaded photometry data from Pan-STARRS, which is available from the same server we have been using to get Gaia data.
The next step in the analysis is to select candidate stars based on the photometry data. The following figure from the paper is a color-magnitude diagram showing the stars we previously selected based on proper motion:
In red is a theoretical isochrone, showing where we expect the stars in GD-1 to fall based on the metallicity and age of their original globular cluster.
By selecting stars in the shaded area, we can further distinguish the main sequence of GD-1 from mostly younger background stars.
We will reload the data from the previous episode and make a color-magnitude diagram.
We will use an isochrone computed by MIST to specify a polygonal region in the color-magnitude diagram and select the stars inside of it.
Starting from this episode
If you are starting a new notebook for this episode, expand this section for information you will need to get started.
In the previous episode, we selected stars in GD-1 based on proper motion and downloaded the spatial, proper motion, and photometry information by joining the Gaia and PanSTARRs datasets. We will use that data for this episode. Whether you are working from a new notebook or coming back from a checkpoint, reloading the data will save you from having to run the query again.
If you are starting this episode here or starting this episode in a new notebook, you will need run the following lines of code.
This imports previously imported functions:
from os.path import getsize import pandas as pd import numpy as np from matplotlib import pyplot as plt from episode_functions import *
The following code loads in the data (instructions for downloading data can be found in the setup instructions). You may need to add a the path to the filename variable below (e.g.
filename = 'student_download/data/gd1_data.hdf')
filename = 'gd1_data.hdf' candidate_df = pd.read_hdf(filename, 'candidate_df')
Plotting photometry data
Now that we have photometry data from Pan-STARRS, we can produce a color-magnitude diagram to replicate the diagram from the original paper:
The y-axis shows the apparent magnitude of each source with the g filter.
The x-axis shows the difference in apparent magnitude between the g and i filters, which indicates color.
A slight detour for non-domain experts
If you are wondering (as a non-domain expert) how to interpret this color-magnitude diagram, please expand the section below.
The color magnitude diagram of GD-1 and foreground stars
As a pathologist can easily point to tumor on a biopsy slide, so too can astronomers who study stellar populations see two stellar main groups of stars in this color magnitude diagram, one from an old star cluster (presumably, GD-1), and the other, stars much closer, but at every distance between the Earth and the cluster (“foreground”). The color magnitude diagram is a technique developed to separate these features out just as pathologists have techniques to contrast human tissue. The color of a star is primarily related to the star’s surface temperature, with bluer stars indicating higher temperatures and redder stars indicating lower temperatures. This logic is not too far off from the color at the bottom of a match flame compared to the top.
Foreground Stars: To know the temperature of a star, you first need to know its distance and to account for the dust between us and the star. You can guess the effect of distance. A star farther away will be fainter (lower y-axis value) than the same star closer (think of car headlights as they approach). Dust will remove light from the star’s path to our telescopes, which makes the star seem like it has less energy than it otherwise would have, which makes it do two things on this diagram: 1) look fainter (lower y-axis value) and 2) look cooler (higher x-axis value). The stars spread throughout the diagram are all stars bright enough to be detected in our Milky Way between GD-1 and us but made fainter and redder (spread to the lower-right) by their spread in distance from us and the amount dust in the line of sight.
GD-1: The pile up of stars in the lower-left quadrant of this diagram are interesting because it suggests something is at the same distance with the same amount of dust in the way. When we use our knowledge of theoretical astrophysics (independently calculated outside this work) to estimate how bright a population of old stars would be if it were at the distance of GD-1, we get that solid red line. The exact values of age and metallicity ([Fe/H] value) is a variable needed to reproduce the theoretical isochrone, but frankly, the choice could vary a lot and still would fit the data well.
With the photometry we downloaded from the PanSTARRS table into
candidate_df we can now recreate this plot.
x = candidate_df['g_mean_psf_mag'] - candidate_df['i_mean_psf_mag'] y = candidate_df['g_mean_psf_mag'] plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3) plt.ylabel('Magnitude (g)') plt.xlabel('Color (g-i)')
We have assigned the color and magnitude to variables
We have done this out of convenience and to keep the code readable since the table variables and column names are long and
x includes an operation
between two columns.
We can zoom in on the region of interest by setting the range of
x and y values displayed with the
If we put the higher value first in the
ylim call, this will invert
the y-axis, putting fainter magnitudes at the bottom.
x = candidate_df['g_mean_psf_mag'] - candidate_df['i_mean_psf_mag'] y = candidate_df['g_mean_psf_mag'] plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3) plt.ylabel('Magnitude (g)') plt.xlabel('Color (g-i)') plt.xlim([0, 1.5]) plt.ylim([22, 14])
Our figure does not look exactly like the one in the paper because we are working with a smaller region of the sky, so we have fewer stars. But the main sequence of GD-1 appears as an overdense region in the lower left.
We want to be able to make this plot again, with any selection of PanSTARRs photometry,
so this is a natural time to put it into a function that accepts as input
Table or Pandas
DataFrame, as long as
it has columns named
i_mean_psf_mag. To do this we will change
our variable name from
candidate_df to the more generic
def plot_cmd(table): """Plot a color magnitude diagram. table: Table or DataFrame with photometry data """ y = table['g_mean_psf_mag'] x = table['g_mean_psf_mag'] - table['i_mean_psf_mag'] plt.plot(x, y, 'ko', markersize=0.3, alpha=0.3) plt.xlim([0, 1.5]) plt.ylim([22, 14]) plt.ylabel('Magnitude (g)') plt.xlabel('Color (g-i)')
Here are the results:
<Figure size 432x288 with 1 Axes>
In the next section we will use an isochrone to specify a polygon that contains this overdense region.
Given our understanding of the age, metallicity, and distance to GD-1 we can overlay a theoretical isochrone for GD-1 from the MESA Isochrones and Stellar Tracks and better identify the main sequence of GD-1.
- Rotation initial v/v_crit = 0.4
- Single age, linear scale = 12e9
- Composition [Fe/H] = -1.35
- Synthetic Photometry, PanStarrs
- Extinction av = 0
Making a polygon
The MIST isochrone files available on the website above can not be directly plotted over our data. We have selected the relevant part of the isochrone, the filters we are interested in, and scaled the photometry to the distance of GD-1 (details here). Now we can read in the results which you downloaded as part of the setup instructions:
filename = 'gd1_isochrone.hdf5' iso_df = pd.read_hdf(filename, 'iso_df') iso_df.head()
mag_g color_g_i 0 28.294743 2.195021 1 28.189718 2.166076 2 28.051761 2.129312 3 27.916194 2.093721 4 27.780024 2.058585
Here is what the isochrone looks like on the color-magnitude diagram.
plot_cmd(candidate_df) plt.plot(iso_df['color_g_i'], iso_df['mag_g']);
<Figure size 432x288 with 1 Axes>
In the bottom half of the figure, the isochrone passes through the overdense region where the stars are likely to belong to GD-1.
Although some stars near the top half of the isochrone likely belong to GD-1, these represent stars that have evolved off the main sequence. The density of GD-1 stars in this region is therefore much less and the contamination with other stars much greater. So to get the purest sample of GD-1 stars we will select only stars on the main sequence.
So we will select the part of the isochrone that lies in the overdense region.
g_mask is a Boolean Series that is
g is between 18.0 and 21.5.
g_all = iso_df['mag_g'] g_mask = (g_all > 18.0) & (g_all < 21.5) g_mask.sum()
We can use it to select the corresponding rows in
iso_masked = iso_df[g_mask] iso_masked.head()
mag_g color_g_i 94 21.411746 0.692171 95 21.322466 0.670238 96 21.233380 0.648449 97 21.144427 0.626924 98 21.054549 0.605461
Now, to select the stars in the overdense region, we have to define a polygon that includes stars near the isochrone.
g = iso_masked['mag_g'] left_color = iso_masked['color_g_i'] - 0.06 right_color = iso_masked['color_g_i'] + 0.12
Here is our plot with these boundaries:
plot_cmd(candidate_df) plt.plot(left_color, g, label='left color') plt.plot(right_color, g, label='right color') plt.legend();
<Figure size 432x288 with 1 Axes>
Which points are in the polygon?
Matplotlib provides a
Polygon object that we can use to check which
points fall in the polygon we just constructed.
To make a
Polygon, we need to assemble
right_color into a loop, so the points in
left_color are connected
to the points of
right_color in reverse order.
We will use a “slice index” to reverse the elements of
As explained in the NumPy
a slice index has three parts separated by colons:
start: The index of the element where the slice starts.
stop: The index of the element where the slice ends.
step: The step size between elements.
reverse_right_color = right_color[::-1]
In this example,
stop are omitted, which means all
elements are selected.
-1, which means the elements are in reverse order.
To combine the
right_color arrays we will use the NumPy
which takes two arrays as input, and outputs them combined into a single array. By writing the
function we can use the same code to create the x-values for the loop (colors) and the y-values
for the loop (g-band magnitudes).
combined_array = np.append(left_color, reverse_right_color)
We can combine these steps into the following function, which takes two arrays and joins them front-to-back:
def front_to_back(first, second): """Join two arrays front to back.""" return np.append(first, second[::-1])
We can use
front_to_back to make a loop that includes the elements
color_loop = front_to_back(left_color, right_color) color_loop.shape
And a corresponding loop with the elements of
g in forward and reverse order.
mag_loop = front_to_back(g, g) mag_loop.shape
Here is the loop on our plot:
plot_cmd(candidate_df) plt.plot(color_loop, mag_loop);
<Figure size 432x288 with 1 Axes>
To make a
Polygon, it will be useful to put
mag_loop into a
DataFrame. This is convenient for two reasons - first,
is expecting an Nx2 array and the
DataFrame provides an easy way for us to pass that
in that is also descriptive for us. Secondly, for reproducibility of our work, we may want
to save the region we use to select stars, and the
DataFrame, as we have already seen, allows us to save into a variety of formats.
loop_df = pd.DataFrame() loop_df['color_loop'] = color_loop loop_df['mag_loop'] = mag_loop loop_df.head()
color_loop mag_loop 0 0.632171 21.411746 1 0.610238 21.322466 2 0.588449 21.233380 3 0.566924 21.144427 4 0.545461 21.054549
Now we can pass
from matplotlib.patches import Polygon polygon = Polygon(loop_df) polygon
<matplotlib.patches.Polygon at 0x7f439d33fdf0>
The result is a
Polygon object , which provides
which figures out which points are inside the polygon.
Exercise (5 minutes)
When we encounter a new object, it is good to create a toy example to test that it does what we think it does. Define an array of two points, one that should be inside the polygon and one that should be outside the polygon. Use
contains_pointsto verify that the results are as expected.
test_points = [(0.4, 20), (0.4, 16)]
test_inside_mask = polygon.contains_points(test_points) test_inside_mask
array([ True, False])
The result is an array of Boolean values, and is as expected.
We are almost ready to select stars whose photometry data falls in this polygon. But first we need to do some data cleaning.
Save the polygon
Reproducibile research is “the idea that … the full computational environment used to produce the results in the paper such as the code, data, etc. can be used to reproduce the results and create new work based on the research.”
This lesson is an example of reproducible research because it contains all of the code needed to reproduce the results, including the database queries that download the data and the analysis.
In this episode, we used an isochrone to derive a polygon, which we used to select stars based on photometry. So it is important to record the polygon as part of the data analysis pipeline.
Here is how we can save it in an HDF5 file.
filename = 'gd1_data.hdf' loop_df.to_hdf(filename, 'loop_df')
Selecting based on photometry
Now we will check how many of the candidate stars are inside the polygon we chose.
contains_points expects a list of (x,y) pairs. As with creating the
a convenient way to pass the colors and magnitudes for all of our stars in
candidates_df to our
Polygon to see
which candidates are inside the polygon. We will start by putting color and magnitude data from
candidate_df into a new
cmd_df = pd.DataFrame() cmd_df['color'] = candidate_df['g_mean_psf_mag'] - candidate_df['i_mean_psf_mag'] cmd_df['mag'] = candidate_df['g_mean_psf_mag'] cmd_df.head()
color mag 0 0.3804 17.8978 1 1.6092 19.2873 2 0.4457 16.9238 3 1.5902 19.9242 4 1.4853 16.1516
Which we can pass to
inside_mask = polygon.contains_points(cmd_df) inside_mask
array([False, False, False, ..., False, False, False])
The result is a Boolean array.
Exercise (5 minutes)
Boolean values are stored as 0s and 1s.
FALSE= 0 and
TRUE= 1. Use this information to determine the number of stars that fall inside the polygon.
Now we can use
inside_mask as a mask to select stars that fall inside the polygon.
winner_df = candidate_df[inside_mask]
We will make a color-magnitude plot one more time, highlighting the selected stars with green markers.
plot_cmd(candidate_df) plt.plot(iso_df['color_g_i'], iso_df['mag_g']) plt.plot(color_loop, mag_loop) x = winner_df['g_mean_psf_mag'] - winner_df['i_mean_psf_mag'] y = winner_df['g_mean_psf_mag'] plt.plot(x, y, 'go', markersize=0.5, alpha=0.5);
<Figure size 432x288 with 1 Axes>
The selected stars are, in fact, inside the polygon, which means they have photometry data consistent with GD-1.
Finally, we can plot the coordinates of the selected stars:
fig = plt.figure(figsize=(10,2.5)) x = winner_df['phi1'] y = winner_df['phi2'] plt.plot(x, y, 'ko', markersize=0.7, alpha=0.9) plt.xlabel('$\phi_1$ [deg]]') plt.ylabel('$\phi_2$ [deg]') plt.title('Proper motion + photometry selection', fontsize='medium') plt.axis('equal');
<Figure size 720x180 with 1 Axes>
This example includes the new Matplotlib command
figure, which creates the larger canvas that the subplots are placed on. In previous examples, we didn’t have
to use this function; the figure was created automatically. But when
we call it explicitly, we can provide arguments like
sets the size of the figure. It also returns a figure object which we will
use to further customize our plotting in the next episode.
In the example above we also used TeX markup in our axis labels so that they render as the
$\phi$ with subscripts for
Matplotlib also allows us to write basic TeX markup by wrapping the text we want
rendered as TeX with
$ and then using TeX commands inside. This basic rendering
is performed with mathtext;
more advanced rendering with LaTex can be done with the
usetex option in
which we will discuss in Episode 7.
In the next episode we are going to make this plot several more times, so it makes sense to
make a function. As we have done with previous functions we can copy and paste what we just wrote,
replacing the specific variable
winner_df with the more generic
def plot_cmd_selection(df): x = df['phi1'] y = df['phi2'] plt.plot(x, y, 'ko', markersize=0.7, alpha=0.9) plt.xlabel('$\phi_1$ [deg]') plt.ylabel('$\phi_2$ [deg]') plt.title('Proper motion + photometry selection', fontsize='medium') plt.axis('equal')
And here is how we use the function.
fig = plt.figure(figsize=(10,2.5)) plot_cmd_selection(winner_df)
<Figure size 1000x250 with 1 Axes>
Write the data
Finally, we will write the selected stars to a file.
filename = 'gd1_data.hdf' winner_df.to_hdf(filename, 'winner_df')
MB = 1024 * 1024 getsize(filename) / MB
In this episode, we used photometry data from Pan-STARRS to draw a color-magnitude diagram. We used an isochrone to define a polygon and select stars we think are likely to be in GD-1. Plotting the results, we have a clearer picture of GD-1, similar to Figure 1 in the original paper.
Matplotlib provides operations for working with points, polygons, and other geometric entities, so it is not just for making figures.
Use Matplotlib options to control the size and aspect ratio of figures to make them easier to interpret.
Record every element of the data analysis pipeline that would be needed to replicate the results.