Extracting row and columns
OverviewTeaching: 15 min
Exercises: 15 minQuestions
How can I extract specific rows and columns from a Dataframe?
How can I add or delete columns from a Dataframe?
How can I find and change missing values in a Dataframe?Objectives
Define indexing as it relates to data structures
Select specific columns from a data frame
Select specific rows from a data frame based on conditional expressions
Using indexes to access rows and columns
Copy a data frame
Add columns to a data frame
Analyse datasets having missing/null values
We will continue this episode from where we left off in the last episode. If you have restarted Jupyter or you want to use a new notebook make sure that you import pandas and have read the SN7577.tab dataset into a Dataframe.
import pandas as pd df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
Selecting rows and columns from a pandas Dataframe
If we know which columns we want before we read the data from the file we can tell
read_csv() to only import those columns by specifying columns either by their index number (starting at 0) as a list to the
usecols parameter. Alternatively we can also provide a list of column names.
df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols= [0,1,2,173,174,175]) print(df_SN7577_some_cols.shape) print(df_SN7577_some_cols.columns) df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols= ['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups']) print(df_SN7577_some_cols.columns)
(1286, 6) Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object') Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
Let us assume for now that we read in the complete file which is now in the Dataframe
df_SN7577, how can we now refer to specific columns?
There are two ways of doing this using the column names (or labels):
# Both of these statements are the same print(df_SN7577['Q1']) # and print(df_SN7577.Q1)
0 1 1 3 2 10 3 9 ...
If we are interested in more than one column, the 2nd method above cannot be used. However in the first, although we used a string with the value of
'Q1' we could also have provided a list of strings. Remember that lists are enclosed in
print(df_SN7577[['Q1', 'Q2', 'Q3']])
Q1 Q2 Q3 0 1 -1 1 1 3 -1 1 2 10 3 2 3 9 -1 10 ...
What happens if you:
- List the columns you want out of order from the way they appear in the file?
- Put the same column name in twice?
- Put in a non-existing column name? (a.k.a Typo)
print(df_SN7577[['Q3', 'Q2']]) print(df_SN7577[['Q3', 'Q2', 'Q3']]) print(df_SN7577[['Q33', 'Q2']])
Filtering by Rows
You can filter the Dataframe by rows by specifying a range in the form of
a is the first row and
b is one beyond the last row required.
# select row with index of 1, 2 and 3 (rows 2, 3 and 4 in the Dataframe) df_SN7577_some_rows = df_SN7577[1:4] df_SN7577_some_rows
What happens if we ask for a single row instead of a range?
You get an error if you only specify
1. You need to use
0:1to get the first row returned. The
:is always required. You can use
:by itself to return all of the rows.
Using criteria to filter rows
It is more likely that you will want to select rows from the Dataframe based on some criteria, such as “all rows where the value for Q2 is -1”.
df_SN7577_some_rows = df_SN7577[(df_SN7577.Q2 == -1)] df_SN7577_some_rows
The criteria can be more complex and isn’t limited to a single column’s values:
df_SN7577_some_rows = df_SN7577[ (df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)] df_SN7577_some_rows
We can combine the row selection with column selection:
df_SN7577_some_rows = df_SN7577[ (df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)][['Q1', 'Q2','numage']] df_SN7577_some_rows
Selecting rows on the row index is of limited use unless you need to select a contiguous range of rows.
There is however another way of selecting rows using the row index:
df_SN7577_some_rows = df_SN7577.iloc[1:4] df_SN7577_some_rows
iloc method gives the same results as our previous example.
However, now we can specify a single value and more importantly we can use the
range() function to indicate the records that we want. This can be useful for making pseudo-random selections of rows from across the Dataframe.
# Select the first row from the Dataframe df_SN7577_some_rows = df_SN7577.iloc df_SN7577_some_rows # select every 100th record from the Dataframe. df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100)] df_SN7577_some_rows
You can also specify column ranges using the
iloc method again using the column index numbers:
# columns 0,1,2 and 3 df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100),0:4] df_SN7577_some_rows # columns 0,1,2,78 and 95 df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100),[0,1,2,78,95]] df_SN7577_some_rows
There is also a
loc method which allows you to use the column names.
# columns 0,1,2,78 and 95 using the column names and changing 'iloc' to 'loc' df_SN7577_some_rows = df_SN7577.loc[range(0, len(df_SN7577), 100),['Q1', 'Q2', 'Q3', 'Q18bii', 'access6' ]] df_SN7577_some_rows
Pandas does have a
sample method which allows you to extract a sample of the records from the Dataframe.
df_SN7577.sample(10, replace=False) # ten records, do not select same record twice (this is the default) df_SN7577.sample(frac=0.05, random_state=1) # 5% of records , same records if run again
Import specific columns when reading in a .csv with the
We easily can chain boolean conditions when filtering rows of a pandas dataframe
ilocmethods allow us to get rows with particular labels and at particular integer locations respectively
pandas has a handy
samplemethod which allows us to extract a sample of rows from a dataframe