Introduction to Python
Overview
Teaching: 15 min
Exercises: 0 min
Questions
Why learn Python?
What are Jupyter notebooks?
Objectives
Examine the Python interpreter
Recognize the advantage of using the Python programming language
Understand the concept and benefits of using notebooks for coding
Introducing the Python programming language
Python is a general-purpose programming language. It is an interpreted language, which makes it suitable for rapid development and prototyping of program segments or complete small programs.
Python’s main advantages:
- Open source software, supported by the Python Software Foundation
- Available on all major platforms (Windows, macOS, Linux)
- It is a good language for new programmers to learn due to its straightforward, object-oriented style
- It is well-structured, which aids readability
- It is extensible (i.e. modifiable) and is supported by a large community who provide a comprehensive range of 3rd party packages
Interpreted vs. compiled languages
In any programming language, the code must be translated into “machine code” before running it. It is the machine code which is executed and produces results. In a language like C++ your code is translated into machine code and stored in a separate file, in a process referred to as compiling the code. You then execute the machine code from the file as a separate step. This is efficient if you intend to run the same machine code many times as you only have to compile it once and it is very fast to run the compiled machine code.
On the other hand, if you are experimenting, then your code will change often and you would have to compile it again every time before the machine can execute it. This is where interpreted languages have the advantage. You don’t need a complete compiled program to “run” what has been written so far and see the results. This rapid prototyping is helped further by use of a system called REPL.
REPL
REPL is an acronym which stands for Read, Evaluate, Print and Loop.
REPL allows you to write single statements of code, have them executed, and if there are any results to show, they are displayed and then the interpreter loops back to the beginning and waits for the next program statement.
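A short interactive session of the kind described below might look like this (a sketch; the >>> prompt is printed by the interpreter, and you type what follows it):
>>> a = 2
>>> b = 3
>>> a * b
6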
In the example above, two variables a and b have been created, assigned the values 2 and 3, and then multiplied together.
Every time you press Return, the line is interpreted. The assignment statements don’t produce any output so you get only the standard >>> prompt.
For the a*b statement (it is more of an expression than a statement), because the result is not being assigned to a variable, the REPL displays the result of the calculation on screen and then waits for the next input.
The REPL system makes it very easy to try out small chunks of code.
You are not restricted to single line statements. If the Python interpreter decides that what you have written on a line cannot be a complete statement it will give you a continuation prompt of ... until you complete the statement.
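For example, an opening parenthesis that has not yet been closed leaves the statement incomplete (a minimal sketch of such a session):
>>> a = (2 +
... 3)
>>> print(a)
5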
Introducing Jupyter notebooks
Jupyter originates from IPython, an effort to make Python development more interactive. Since its inception, the scope of the project has expanded to include Julia, Python, and R, so the name was changed to “Jupyter” as a reference to these core languages. Today, Jupyter supports even more languages, but we will be using it only for Python code. Specifically, we will be using Jupyter notebooks, which allow us to easily take notes about our analysis and view plots within the same document where we code. This facilitates sharing and reproducibility of analyses, and the notebook interface is easily accessible through any web browser. Jupyter notebooks are started from the terminal using
$ jupyter notebook
Your browser should start automatically and look something like this:
When you create a notebook from the New option, the new notebook will be displayed in a new browser tab and look like this.
Initially the notebook has no name other than ‘Untitled’. If you click on ‘Untitled’ you will be given the option of changing the name to whatever you want.
The notebook is divided into cells. Initially there will be a single input cell marked by In [ ]:.
You can type Python code directly into the cell. You can split the code across several lines as needed. Unlike the REPL we looked at before, the code is not interpreted line by line. To run the code in a cell, you can click the Run button on the toolbar, choose Run Cells from the Cell menu, or use a keyboard shortcut such as Shift+Return. All of the code in that cell will then be executed.
The results are shown in a separate Out [1]: cell immediately below. A new input cell (In [ ]:) is created for you automatically.
When a cell is run, it is given a number along with the corresponding output cell. If you have a notebook with many cells in it you can run the cells in any order and also run the same cell many times. The number on the left hand side of the input cells increments, so you can always tell the order in which they were run. For example, a cell marked In [5]: was the fifth cell run in the sequence.
Although there is an option to do so on the toolbar, you do not have to manually save the notebook. This is done automatically by the Jupyter system.
Not only are the contents of the In [ ]: cells saved, but so are the Out [ ]: cells.
This allows you to create complete documents with both your code and the output
of the code in a single place. You can also change the cell type from
Python code to Markdown using the Cell > Cell Type option. Markdown is
a simple formatting system which allows you to create documentation for your
code, again all within the same notebook structure.
The notebook itself is stored as a specially formatted text file with an .ipynb
extension. These files can be opened and run by others with Jupyter installed. This allows you to
share your code inputs, outputs, and
Markdown documentation with others. You can also export the notebook to HTML, PDF, and
many other formats to make sharing even easier.
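For example, one way of exporting from the command line is the jupyter nbconvert tool that is installed alongside Jupyter (the notebook name here is made up for illustration, and PDF export additionally requires a LaTeX installation):
$ jupyter nbconvert --to html my_notebook.ipynb
$ jupyter nbconvert --to pdf my_notebook.ipynb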
Key Points
Python is an interpreted language
The REPL (Read-Eval-Print loop) allows rapid development and testing of code segments
Jupyter notebooks build on the REPL concept and allow code, results and documentation to be maintained together and shared
Jupyter notebooks provide a complete IDE (Integrated Development Environment)
Python basics
Overview
Teaching: 25 min
Exercises: 30 min
Questions
How do I assign values to variables?
How do I do arithmetic?
What is a built-in function?
How do I see results?
What data types are supported in Python?
Objectives
Create different cell types and show/hide output in Jupyter
Create variables and assign values to them
Check the type of a variable
Perform simple arithmetic operations
Specify parameters when using built-in functions
Get help for built-in functions and other aspects of Python
Define native data types in Python
Convert from one data type to another
Using the Jupyter environment
New cells
From the insert menu item you can insert a new cell anywhere in the notebook, either above or below the current cell. You can also use the + button on the toolbar to insert a new cell below.
Change cell type
By default new cells are created as code cells. From the cell menu item you can change the type of a cell from code to Markdown. Markdown is a markup language for formatting text; it has much of the power of HTML, but is specifically designed to be human-readable as well. You can use Markdown cells to insert formatted textual explanation and analysis into your notebook. For more information about Markdown, check out these resources:
- Jupyter Notebook Markdown Docs
- Markdown - a Visual Guide
- Mastering Markdown from Github
- Markdown official open source project
Hiding output
When you run cells of code the output is displayed immediately below the cell. In general this is convenient. The output is associated with the cell that produced it and remains a part of the notebook. So if you copy or move the notebook the output stays with the code.
However, lots of output can make the notebook look cluttered and more difficult to move around. So there is an option available from the cell menu item to ‘toggle’ or ‘clear’ the output associated either with an individual cell or all cells in the notebook.
Creating variables and assigning values
Variables and Types
In Python variables are created when you first assign values to them.
a = 2
b = 3.142
All variables have a data type associated with them.
The data type is an indication of the type of data contained in a variable.
If you want to know the type of a variable you can use the built-in type() function.
print(type(a))
print(type(b))
s = "Hello World"
print(type(s))
<class 'int'>
<class 'float'>
<class 'str'>
There are many more data types available; a full list is given in the Python documentation. We will be looking at a few of them later on.
Arithmetic operations
For now we will stick with the numeric types and do some arithmetic.
All of the usual arithmetic operators are available.
In the examples below we also introduce the Python comment symbol #. Anything to the right of the # symbol is treated as a comment. To a large extent, using Markdown cells in a notebook reduces the need for comments in the code, but occasionally they can be useful.
We also make use of the built-in print() function, which displays formatted text.
print("a =", a, "and b =" , b)
print(a + b) # addition
print(a * b) # multiplication
print(a - b) # subtraction
print(a / b) # division
print(b ** a) # exponentiation
print(a % b) # modulus - returns the remainder
print(2 * a % b) # multiplication first, then modulus - operators of equal precedence are evaluated left to right
a = 2 and b = 3.142
5.1419999999999995
6.284
-1.142
0.6365372374283896
9.872164
2.0
0.8580000000000001
We need to use the print() function because by default only the last output from a cell is displayed in the output cell.
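You can see this behaviour by putting two bare expressions into a single cell (a small sketch using the a and b defined earlier):
a + b
a * b
6.284
Only the value of the last expression, a * b, is displayed; the result of a + b is silently discarded.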
In our example above, we pass four different parameters to the first call of print(), each separated by a comma: the string "a =", followed by the variable a, followed by the string "and b =", and then the variable b.
The output is what you would probably have guessed at.
All of the other calls to print() are only passed a single parameter. Although it may look like two or three, the expressions are evaluated first and it is only the single result which is seen as the parameter value and printed.
In the last expression a is multiplied by 2 and then the modulus of the result is taken. Had we wanted to calculate a % b and then multiply the result by two, we could have done so by using parentheses to make the order of calculation clear.
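For example, using the same a and b as above:
print((a % b) * 2) # a % b is evaluated first, giving 2.0, which is then multiplied by 2
4.0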
When we have more complex arithmetic expressions, we can use parentheses to be explicit about the order of evaluation:
print("a =", a, "and b =" , b)
print(a + 2*b) # add a to two times b
print(a + (2*b)) # same thing but explicit about order of evaluation
print((a + b)*2) # add a and b and then multiply by two
a = 2 and b = 3.142
8.283999999999999
8.283999999999999
10.283999999999999
Arithmetic expressions can be arbitrarily complex, but remember people have to read and understand them as well.
Exercise
Create a new cell and paste into it the assignments to the variables a and b and the contents of the code cell above with all of the print statements. Remove all of the calls to the print function so you only have the expressions that were to be printed and run the code. What is returned?
Now remove all but the first line (with the 4 items in it) and run the cell again. How does this output differ from when we used the print function?
Practice assigning values to variables using as many different operators as you can think of.
Create some expressions to be evaluated using parentheses to enforce the order of mathematical operations that you require.
Solution
- Only the last result is printed.
- The 4 ‘items’ are printed by the REPL, but not in the same way as the print statement. The items in quotes are treated as separate strings; for the variables a and b the values are printed. All four items are treated as a ‘tuple’, which is shown in parentheses. A tuple is another data type in Python that allows you to group things together and treat them as a unit. We can tell that it is a tuple because of the ().
A complete set of Python operators can be found in the official documentation. The documentation may appear a bit confusing as it initially talks about operators as functions, whereas we generally use them as ‘in place’ operators. Section 10.3.1 provides a table which lists all of the available operators, not all of which are relevant to basic arithmetic.
Using built-in functions
Python has a reasonable number of built-in functions. You can find a complete list in the official documentation.
Additional functions are provided by 3rd party packages which we will look at later on.
For any function, a common question to ask is: what parameters does this function take? In order to answer this from Jupyter, you can type the function name and then press Shift+Tab; a pop-up window will provide you with various details about the function, including the parameters.
Exercise
For the print() function find out what parameters can be provided.
Solution
Type ‘print’ into a code cell and then press Shift+Tab. The following pop-up should appear.
Getting Help for Python
You can get help on any Python function by using the help() function. It takes a single parameter: the name of the function for which you want help.
help(print)
Help on built-in function print in module builtins:
print(...)
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file: a file-like object (stream); defaults to the current sys.stdout.
sep: string inserted between values, default a space.
end: string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
There is a great deal of Python help and information, as well as code examples, available on the internet. One popular site is Stack Overflow, which specialises in providing programming help. It has dedicated forums not only for Python but also for many of the popular 3rd party Python packages, and answers almost always include code examples.
You can also get answers to your queries by simply inputting your question (or selected keywords) into any search engine.
A couple of things you may need to be wary of: there are currently two versions of Python in use (Python 2 and Python 3), and while in most cases code examples will run in either, there are some exceptions. Secondly, some replies may assume a knowledge of Python beyond your own, making the answers difficult to follow. But for any given question there will be a whole range of suggested solutions, so you can always move on to the next.
Data types and how Python uses them
Changing data types
The data type of a variable is assigned when you give a variable a value as we did above. If you re-assign the value of a variable, you can change the data type.
You can also explicitly change the type of a variable by casting it using an appropriate Python built-in function. In this example we have changed a string to a float.
a = "3.142"
print(type(a))
a = float(a)
print(type(a))
<class 'str'>
<class 'float'>
Although you can always change an integer to a float, if you change a float to an integer you can lose part of the value of the variable, and you won’t get an error message.
a = 3.142
print(type(a))
a = 3
print(type(a))
a = a*1.0
print(type(a))
a = int(a)
print(type(a))
a = 3.142
a = int(a)
print(type(a))
print(a)
<class 'float'>
<class 'int'>
<class 'float'>
<class 'int'>
<class 'int'>
3
In some circumstances explicitly converting a data type makes no sense; you cannot change a string with alphabetic characters into a number.
b = "Hello World"
print(type(b))
b = int(b)
print(type(b))
<class 'str'>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-8-9f5f81a470f9> in <module>()
2 print(type(b))
3
----> 4 b = int(b)
5 print(type(b))
ValueError: invalid literal for int() with base 10: 'Hello World'
Strings
A string is a simple data type which holds a sequence of characters.
Strings are placed in quotes when they are being assigned, but the quotes don’t count as part of the string value.
If you need to use quotes as part of your string, you can use the other kind (single or double) to indicate the start and end of the string.
mystring = "Hello World"
print(mystring)
name = "Peter"
mystring = 'Hello ' + name + ' How are you?'
print(mystring)
name = "Peter"
mystring = 'Hello this is ' + name + "'s code"
print(mystring)
Hello World
Hello Peter How are you?
Hello this is Peter's code
String functions
There are a variety of Python functions available for use with strings. In Python a string is an object. An object, put simply, is something which has data (in the case of our string, the contents of the string) and methods. Methods is just another way of saying functions.
Although methods and functions are very similar in practice, there is a difference in the way you call them.
One typical bit of information you might want to know about a string is its length; for this we use the len() function. For almost anything else you might want to do with strings, there is a method.
mystring = "Hello World"
print(len(mystring))
11
The official documentation says, ‘A method is a function that “belongs to” an object. In Python, the term method is not unique to class instances: other object types can have methods as well. For example, list objects have methods called append, insert, remove, sort, and so on.’.
If you want to see a list of all of the available methods for a string (or any other object) you can use the dir() function.
print(dir(mystring))
['__add__', '__class__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', 'capitalize', 'casefold', 'center', 'count', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'format_map', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'isidentifier', 'islower', 'isnumeric', 'isprintable', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'maketrans', 'partition', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill']
The methods starting with __ are special or magic methods which you do not normally call directly.
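Although you rarely call them yourself, the magic methods are what make built-in functions and operators work; for example, len() obtains its answer by calling the string's __len__ method behind the scenes:
print(len(mystring)) # 11
print(mystring.__len__()) # 11 - the same result; len() invokes this method for us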
Some examples of the methods are given below. We will use others when we start reading files.
myString = "The quick brown fox"
print(myString.startswith("The"))
print(myString.find("The")) # notice that string positions start with 0 like all indexing in Python
print(myString.upper()) # The contents of myString is not changed, if you wanted an uppercase version
print(myString) # you would have to assign it to a new variable
True
0
THE QUICK BROWN FOX
The quick brown fox
The methods starting with ‘is…’ return a boolean value of either True or False.
print(myString.isalpha())
False
The example above returns False because the space character is not considered to be an alphabetic character.
In the example below, we use the replace() method to remove the spaces and then check whether the result isalpha(). Chaining methods in this way is quite common; the actions take place in a left to right manner. You can always avoid chaining by using intermediate variables.
print(myString.replace(" ","").isalpha())
True
For example, the following is equivalent to the above
mystring_clean = myString.replace(" ","")
print(mystring_clean.isalpha())
True
If you need to refer to a specific element (character) in a string, you can do so by specifying the index of the character in []. You can also use indexing to select a substring of the string. In Python, indexes begin with 0 (for a visual, please see Strings and Character Data in Python: String Indexing or 9.4. Index Operator: Working with the Characters of a String).
myString = "The quick brown fox"
print(myString[0])
print(myString[12])
print(myString[18])
print(myString[0:3])
print(myString[0:]) # from index 0 to the end
print(myString[:9]) # from the beginning to one before index 9
print(myString[4:9])
T
o
x
The
The quick brown fox
The quick
quick
Basic Python data types
So far we have seen three basic Python data types: Integer, Float and String. There is another basic data type: Boolean. Boolean variables can only have the values True or False. (Remember, Python is case-sensitive, so be careful of your spelling.)
We can define variables to be of type boolean by setting their value accordingly. Boolean variables are a good way of coding anything that has a binary range (e.g. yes/no), because it’s a type that computers know how to work with, as we will see soon.
print(True)
print(False)
bool_val_t = True
print(type(bool_val_t))
print(bool_val_t)
bool_val_f = False
print(type(bool_val_f))
print(bool_val_f)
True
False
<class 'bool'>
True
<class 'bool'>
False
The following two lines of code will generate an error because Python is case-sensitive. We need to use ‘True’ instead of ‘true’ and ‘False’ instead of ‘false’.
print(true)
print(false)
NameError Traceback (most recent call last)
<ipython-input-115-b5911eeae48b> in <module>
----> 1 print(true)
2 print(false)
NameError: name 'true' is not defined
We can also get values of Boolean type using comparison operators; basic ones in Python are == for “equal to”, != for “not equal to”, and >, <, >=, <=.
print('hello' == 'HELLO')
print('hello' is 'hello') # 'is' tests identity (the same object), not equality; prefer == for comparing values
print(3 != 77)
print(1 < 2)
print('four' > 'three')
False
True
True
True
False
Exercise
Imagine you are considering different ways of representing a boolean value in your data set and you need to see how Python will behave based on the different choices. Fill in the blanks, using the built-in functions we’ve seen so far, in the following code excerpt to test how Python interprets text. Write some notes for your research team on how to code True and False as they record the variable.
bool_val1 = 'TRUE'
print('read as type ', ___(bool_val1))
print('value when cast to bool ', ___(bool_val1))
bool_val2 = 'FALSE'
print('read as type ', ___(bool_val2))
print('value when cast to bool ', ___(bool_val2))
bool_val3 = 1
print('read as type ', ___(bool_val3))
print('value when cast to bool ', ___(bool_val3))
bool_val4 = 0
print('read as type ', ___(bool_val4))
print('value when cast to bool ', ___(bool_val4))
bool_val5 = -1
print('read as type ', ___(bool_val5))
print('value when cast to bool ', ___(bool_val5))
print(bool(bool_val5))
Solution
0 is interpreted as False; everything else, whether a number or a string, is interpreted as True.
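For reference, the blanks are the built-in type() and bool() functions; the first pair of checks, filled in, looks like this:
bool_val1 = 'TRUE'
print('read as type ', type(bool_val1)) # read as type <class 'str'>
print('value when cast to bool ', bool(bool_val1)) # True - any non-empty string casts to True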
Structured data types
A structured data type is a data type which is made up of some combination of the base data types in a well defined but potentially arbitrarily complex way.
The list
A list is an ordered sequence of values, of any type, separated by commas and delimited by ‘[’ and ‘]’
list1 = [6, 54, 89 ]
print(list1)
print(type(list1))
list2 = [3.142, 2.71828, 9.8 ]
print(list2)
print(type(list2))
myname = "Peter"
list3 = ["Hello", 'to', myname ]
print(list3)
myname = "Fred"
print(list3)
print(type(list3))
list4 = [6, 5.4, "numbers", True ]
print(list4)
print(type(list4))
[6, 54, 89]
<class 'list'>
[3.142, 2.71828, 9.8]
<class 'list'>
['Hello', 'to', 'Peter']
['Hello', 'to', 'Peter']
<class 'list'>
[6, 5.4, 'numbers', True]
<class 'list'>
Exercise
We can index lists the same way we indexed strings before. Complete the code below and display the value of last_num_in_list, which is 11, and the values of odd_from_list, which are 5 and 11, to check your work.
num_list = [4,5,6,11]
last_num_in_list = num_list[____]
print(last_num_in_list)
odd_from_list = [num_list[_____], ______]
print(odd_from_list)
Solution
# Solution 1: Basic way of solving this exercise using the core Python language
num_list = [4,5,6,11]
last_num_in_list = num_list[-1]
print(last_num_in_list)
odd_from_list = [num_list[1], num_list[3]]
print(odd_from_list)

# Solutions 2 and 3: Usually there are multiple ways of doing the same work.
# Once we learn more advanced Python, we will be able to write more varied code like the following to print the odd numbers:
import numpy as np
num_list = [4,5,6,11]
# Convert the `num_list` list to an advanced data structure: a numpy array
num_list_np_array = np.array(num_list)
# Filter the elements which produce a remainder of 1 after dividing by 2
odd_from_list = num_list_np_array[num_list_np_array % 2 == 1]
print(odd_from_list)

# or, using a concept called `masking`:
# create a boolean list `is_odd` of the same length as `num_list` with `True` at the positions of the odd values
is_odd = [False, True, False, True]
# Mask the array - only the values at the positions of `True` remain
odd_from_list = num_list_np_array[is_odd]
print(odd_from_list)
The range function
In addition to explicitly creating lists as we have above, it is very common to create and populate them automatically using the range() function in combination with the list() function.
list5 = list(range(5))
print(list5)
[0, 1, 2, 3, 4]
Unless told otherwise, range() returns a sequence which starts at 0, counts up by 1 and ends 1 before the value of the provided parameter.
This can be a cause of confusion. range(5) above does indeed have 5 values, but rather than being 1,2,3,4,5, which you might naturally think, they are in fact 0,1,2,3,4. The range starts at 0 and stops one before the value of the single parameter we specified.
If you want different sequences, then you can modify the behavior of the range() function by using additional parameters.
list6 = list(range(1, 9))
print(list6)
list7 = list(range(2, 11, 2))
print(list7)
[1, 2, 3, 4, 5, 6, 7, 8]
[2, 4, 6, 8, 10]
When you specify 3 parameters, as we have for list7, the first is the start value, the second is one past the last value and the third is a step value by which to count. The step value can be negative.
list7 contains the even numbers from 2 to 10.
Exercise
- What is produced if you change the step value in list7 to -2? Is this what you expected?
- Create a list using the range() function which contains the even numbers between 1 and 10 in reverse order ([10,8,6,4,2])
Solution
list7 = list(range(2, 11, -2))
print(list7)
list8 = list(range(10, 1, -2))
print(list8)
list7 will be an empty list because starting at 2 and stepping by -2 counts away from the stop value of 11, so no values are produced.
The other main structured data type is the Dictionary. We will introduce this in a later episode when we look at JSON.
Key Points
The Jupyter environment can be used to write code segments and display results
Data types in Python are implicit based on variable values
Basic data types are Integer, Float, String and Boolean
Lists and Dictionaries are structured data types
Arithmetic uses standard arithmetic operators, precedence can be changed using brackets
Help is available for built-in functions using the help() function; further help and code examples are available online
In Jupyter you can get help on function parameters using Shift+Tab
Many functions are in fact methods associated with specific object types
Python control structures
Overview
Teaching: 20 min
Exercises: 25 min
Questions
What constructs are available for changing the flow of a program?
How can I repeat an action many times?
How can I perform the same task(s) on a set of items?
Objectives
Change program flow using available language constructs
Demonstrate how to execute a section of code a fixed number of times
Demonstrate how to conditionally execute a section of code
Demonstrate how to execute a section of code on a list of items
Programs are rarely linear
Most programs do not work by executing a simple sequential set of statements. The code is constructed so that decisions and different paths through the program can be taken based on changes in variable values.
To make this possible, all programming languages have a set of control structures which allow this to happen.
In this episode we are going to look at how we can create loops and branches in our Python code. Specifically we will look at three control structures, namely:
- if..else..
- while…
- for …
The if statement and variants
The simple if statement allows the program to branch based on the evaluation of an expression.
The basic format of the if statement is:
if expression :
    statement 1
    statement 2
    ...
    statement n
statement always executed
If the expression evaluates to True then statements 1 to n will be executed, followed by statement always executed. If the expression is False, only statement always executed is executed. Python knows which lines of code are related to the if statement by the indentation; no extra syntax is necessary.
Below are some examples:
print("\nExample 1\n")
value = 5
threshold = 4
print("value is", value, "threshold is ", threshold)
if value > threshold :
    print(value, "is bigger than ", threshold)

print("\nExample 2\n")
high_threshold = 6
print("value is", value, "new threshold is ", high_threshold)
if value > high_threshold :
    print(value, "is above ", high_threshold, "threshold")

print("\nExample 3\n")
mid_threshold = 5
print("value is", value, "final threshold is ", mid_threshold)
if value == mid_threshold :
    print("value, ", value, " and threshold,", mid_threshold, ", are equal")
Example 1
value is 5 threshold is 4
5 is bigger than 4
Example 2
value is 5 new threshold is 6
Example 3
value is 5 final threshold is 5
value, 5, and threshold, 5, are equal
In the examples above there are three things to notice:
- The colon : at the end of the if line. Leaving this out is a common error.
- The indentation of the print statement. If you remembered the : on the line before, Jupyter (or any other Python IDE) will automatically do the indentation for you. All of the statements indented at this level are considered to be part of the if statement. This is a feature fairly unique to Python: it cares about indentation. If there is too much, or too little, indentation you will get an error.
- The if statement is ended by removing the indent. There is no explicit end to the if statement as there is in many other programming languages.
In the last example, notice that in Python the operator used to check equality is ==.
Exercise
Add another if statement to example 2 that will check if b is greater than or equal to a
Solution
print("\nExample 2a\n")
a = 3
b = 4
print("a is", a, "b is", b)
if a > b :
    print(a, "is bigger than ", b)
if a <= b :
    print(b, "is bigger than or equal to ", a)
Instead of using two separate if statements to decide which is larger we can use the if ... else ... construct.
# if ... else ...
value = 4
threshold = 5
print("value = ", value, "and threshold = ", threshold)
if value > threshold :
    print("above threshold")
else :
    print("below threshold")
value = 4 and threshold = 5
below threshold
Exercise
Repeat the above with different operators: ‘<’, ‘==’
A further extension of the if statement is the if ... elif ... else version.
The example below allows you to be more specific about the comparison of a and b.
# if ... elif ... else ...
a = 5
b = 4
print("a = ", a, "and b = ", b)
if a > b :
    print(a, " is greater than ", b)
elif a == b :
    print(a, " equals ", b)
else :
    print(a, " is less than ", b)
a = 5 and b = 4
5 is greater than 4
The overall structure is similar to the if ... else statement. There are three additional things to notice:
- Each elif clause has its own test expression.
- You can have as many elif clauses as you need.
- Execution of the whole statement stops after an elif expression is found to be True. Therefore the ordering of the elif clauses can be significant, as the sketch below shows.
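As a small sketch of why the ordering matters (score is a made-up variable for illustration):
score = 95
if score > 50 :
    print("pass") # this runs because 95 > 50 ...
elif score > 90 :
    print("distinction") # ... so this more specific test is never reached
Putting the score > 90 test first would allow the distinction branch to be reached.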
The while loop
The while loop is used to repeatedly execute lines of code until some condition becomes False.
For the loop to terminate, there has to be something in the code which will potentially change the condition.
# while loop
n = 10
cur_sum = 0
# sum of n numbers
i = 1
while i <= n :
    cur_sum = cur_sum + i
    i = i + 1
print("The sum of the numbers from 1 to", n, "is ", cur_sum)
The sum of the numbers from 1 to 10 is 55
Points to note:
- The condition clause (i <= n) in the while statement can be anything which, when evaluated, would return a Boolean value of either True or False. Initially i has been set to 1 (before the start of the loop) and therefore the condition is True.
- The clause can be made more complex by use of parentheses and the and and or operators amongst others.
- The statements after the while clause are only executed if the condition evaluates as True.
- Within the statements after the while clause there should be something which potentially will make the condition evaluate as False next time around. If not, the loop will never end.
- In this case the last statement in the loop changes the value of i, which is part of the condition clause, so hopefully the loop will end.
- We called our variable cur_sum and not sum because sum is a built-in function (try typing it in; notice the editor changes it to green). If we define sum = 0 we can no longer use the function sum in this Python session.
Exercise - Things that can go wrong with while loops
In the examples below, without running them try to decide why we will not get the required answer. Run each, one at a time, and then correct them. Remember that when the input next to a notebook cell is [*] your Python interpreter is still working.
# while loop - summing the numbers 1 to 10
n = 10
cur_sum = 0
# sum of n numbers
i = 0
while i <= n :
    i = i + 1
    cur_sum = cur_sum + i
print("The sum of the numbers from 1 to", n, "is ", cur_sum)
# while loop - summing the numbers 1 to 10
n = 10
cur_sum = 0
boolvalue = False
# sum of n numbers
i = 0
while i <= n and boolvalue:
    cur_sum = cur_sum + i
    i = i + 1
print("The sum of the numbers from 1 to", n, "is ", cur_sum)
# while loop - summing the numbers 1 to 10
n = 10
cur_sum = 0
# sum of n numbers
i = 0
while i != n :
    cur_sum = cur_sum + i
    i = i + 1
print("The sum of the numbers from 1 to", n, "is ", cur_sum)
# while loop - summing the numbers 1.1 to 9.9 in steps of 1.1
n = 9.9
cur_sum = 0
# sum of n numbers
i = 0
while i != n :
    cur_sum = cur_sum + i
    i = i + 1.1
    print(i)
print("The sum of the numbers from 1.1 to", n, "is ", cur_sum)
Solution
- Because i is incremented before the sum, you are summing 1 to 11.
- Because the Boolean value is set to False, the loop will never be executed.
- When i does equal 10 the expression is False and the loop does not execute, so we have only summed 1 to 9.
- Because you cannot guarantee the internal representation of a float, you should never try to compare floats for equality. In this particular case i never ‘equals’ n and so the loop never ends. If you did try running this, you can stop it using Ctrl+C in a terminal or by going to the Kernel menu of the notebook and choosing Interrupt.
The for loop
The for loop, like the while loop, repeatedly executes a set of statements. The difference is that in the for loop we know at the outset how often the statements in the loop will be executed. We don’t have to rely on a variable being changed within the looping statements.
The basic format of the for statement is:
for variable_name in some_sequence :
    statement1
    statement2
    ...
    statementn
The key part of this is the some_sequence. The phrase used in the documentation is that it must be ‘iterable’. That means you can count through the sequence, starting at the beginning and stopping at the end.
There are many examples of things which are iterable some of which we have already come across.
- Lists are iterable - they don’t have to contain numbers; you iterate over the elements in the list.
- The range() function
- The characters in a string
print("\nExample 1\n")
for i in [1,2,3] :
    print(i)

print("\nExample 2\n")
for name in ["Tom", "Dick", "Harry"] :
    print(name)

print("\nExample 3\n")
for name in ["Tom", 42, 3.142] :
    print(name)

print("\nExample 4\n")
for i in range(3) :
    print(i)

print("\nExample 5\n")
for i in range(1,4) :
    print(i)

print("\nExample 6\n")
for i in range(2, 11, 2) :
    print(i)

print("\nExample 7\n")
for i in "ABCDE" :
    print(i)

print("\nExample 8\n")
longString = "The quick brown fox jumped over the lazy sleeping dog"
for word in longString.split() :
    print(word)
Example 1
1
2
3
Example 2
Tom
Dick
Harry
Example 3
Tom
42
3.142
Example 4
0
1
2
Example 5
1
2
3
Example 6
2
4
6
8
10
Example 7
A
B
C
D
E
Example 8
The
quick
brown
fox
jumped
over
the
lazy
sleeping
dog
Exercise
Suppose that we have a string containing a set of 4 different types of values separated by , like this:
variablelist = "01/01/2010,34.5,Yellow,True"
Research the split() method and then rewrite example 8 from the for loop section above so that it prints the 4 components of variablelist.
Solution
# From the for loop section above
variablelist = "01/01/2010,34.5,Yellow,True"
for word in variablelist.split(",") :
    print(word)
The format of variablelist is very much like that of a record in a csv file. In later episodes we will see how we can extract these values and assign them to variables for further processing rather than printing them out.
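As a small taste of what that will look like, the values can be unpacked into separate variables (the variable names here are made up for illustration):
variablelist = "01/01/2010,34.5,Yellow,True"
date, measurement, colour, flag = variablelist.split(",")
print(date) # 01/01/2010
print(measurement) # 34.5 - note that this is still a string, not a float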
Key Points
Most programs will require ‘Loops’ and ‘Branching’ constructs.
The if, elif, else statements allow for branching in code.
The for and while statements allow for looping through sections of code.
The programmer must provide a condition to end a while loop.
Creating re-usable code
Overview
Teaching: 25 min
Exercises: 15 min
Questions
What are user defined functions?
How can I automate my code for re-use?
Objectives
Describe the syntax for a user defined function
Create and use simple functions
Explain the advantages of using functions
Defining a function
We have already made use of several Python built-in functions like print, list and range.
In addition to the functions provided by Python, you can write your own functions.
Functions are used when a section of code needs to be repeated at various different points in a program; they save you re-writing it all. In reality you rarely need to repeat the exact same code. Usually there will be some variation in variable values needed. Because of this, when you create a function you are allowed to specify a set of parameters which represent variables in the function.
In our use of the print function, we have provided whatever we want to print as a parameter. Typically, whenever we use the print function, we pass a different parameter value.
The ability to specify parameters makes functions very flexible.
def get_item_count(items_str, sep):
    '''
    This function takes a string with a list of items and the character that they're separated by and returns the number of items
    '''
    items_list = items_str.split(sep)
    num_items = len(items_list)
    return num_items
items_owned = "bicycle;television;solar_panel;table"
print(get_item_count(items_owned,';'))
4
Points to note:
- The definition of a function (or procedure) starts with the def keyword and is followed by the name of the function, with any parameters used by the function in parentheses.
- The definition clause is terminated with a :, which causes indentation on the next and subsequent lines. All of these lines form the statements which make up the function. The function ends after the indentation is removed.
- Within the function, the parameters behave as variables whose initial values will be those that they were given when the function was called.
- Functions have a return statement which specifies the value to be returned. This is the value assigned to the variable on the left-hand side of the call to the function, or, as in our example above, the value passed directly to print.
- You call (run the code of) a function simply by providing its name and values for its parameters, the same way you would for any built-in function.
- Once the definition of the function has been executed, it becomes part of Python for the current session and can be used anywhere.
- Like any other built-in function, you can use Shift+Tab in Jupyter to see the parameters.
- At the beginning of the function code we have a multiline comment denoted by the ''' at the beginning and end. This kind of comment is known as a docstring and can be used anywhere in Python code as a documentation aid. It is particularly common, and indeed best practice, to use one to give a brief description of the function at the beginning of a function definition in this way, because this description will be displayed along with the parameters when you use the help() function or Shift+Tab in Jupyter.
- The variables items_list and num_items defined within the function only exist within the function; they cannot be used outside in the main program.
In our get_item_count function we have two parameters which must be provided every time the function is used. You need to provide the parameters in the right order, or explicitly name the parameter you are referring to and use the = sign to give it a value.
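For example, when the parameters are explicitly named, their order no longer matters (using the function and variable defined above):
print(get_item_count(sep=';', items_str=items_owned)) # prints 4, the same as before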
In many cases we want to provide default values for parameters so the user doesn’t have to supply them all. We can do this in the following way:
def get_item_count(items_str, sep=';'):
    '''
    This function takes a string with a list of items and the character that they're separated by and returns the number of items
    '''
    items_list = items_str.split(sep)
    num_items = len(items_list)
    return num_items

print(get_item_count(items_owned))
4
The only change we have made is to provide a default value for the sep parameter. Now if the user does not provide a value, the default of ';' will be used. Because items_str is the first parameter, we can specify its value by position. We could, however, have explicitly named the parameters we were referring to.
print(get_item_count(items_owned, sep = ','))
print(get_item_count(items_str = items_owned, sep=';'))
1
4
Volume of a cuboid
Write a function definition to calculate the volume of a cuboid. The function will use three parameters, h, w and l, and return the volume.
Supposing that in addition to the volume I also wanted to calculate the surface area and the sum of all of the edges. Would I (or should I) have three separate functions or could I write a single function to provide all three values together?
Solution
- A function to calculate the volume of a cuboid could be:
def calculate_vol_cuboid(h, w, len):
    """
    Calculates the volume of a cuboid.
    Takes in h, w, len, that represent the height, width, and length of the cuboid.
    Returns the volume.
    """
    volume = h * w * len
    return volume
- It depends. As a rule-of-thumb, we want our function to do one thing and one thing only, and to do it well. If we always have to calculate these three pieces of information, the ‘one thing’ could be ‘calculate the volume, surface area, and sum of all edges of a cube’. Our function would look like this:
# Method 1 - single function
def calculate_cuboid(h, w, len):
    """
    Calculates information about a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
    Returns the volume, surface area, and sum of edges of the cuboid.
    """
    volume = h * w * len
    surface_area = 2 * (h * w + h * len + len * w)
    edges = 4 * (h + w + len)
    return volume, surface_area, edges
It may be better, however, to break down our function into separate ones - one for each piece of information we are calculating. Our functions would look like this:
# Method 2 - separate functions
def calc_volume_of_cuboid(h, w, len):
    """
    Calculates the volume of a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
    """
    volume = h * w * len
    return volume

def calc_surface_area_of_cuboid(h, w, len):
    """
    Calculates the surface area of a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
    """
    surface_area = 2 * (h * w + h * len + len * w)
    return surface_area

def calc_sum_of_edges_of_cuboid(h, w, len):
    """
    Calculates the sum of edges of a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
    """
    sum_of_edges = 4 * (h + w + len)
    return sum_of_edges
We could then rewrite our first solution:
def calculate_cuboid(h, w, len):
    """
    Calculates information about a cuboid defined by the dimensions h(eight), w(idth), and len(gth).
    Returns the volume, surface area, and sum of edges of the cuboid.
    """
    volume = calc_volume_of_cuboid(h, w, len)
    surface_area = calc_surface_area_of_cuboid(h, w, len)
    edges = calc_sum_of_edges_of_cuboid(h, w, len)
    return volume, surface_area, edges
Using libraries
The functions we have created above only exist for the duration of the session in which they have been defined. If you start a new Jupyter notebook you will have to run the code to define them again.
If all of your code is in a single file or notebook this isn’t really a problem.
There are however many (thousands) of useful functions which other people have written and have made available to all Python users by creating libraries (also referred to as packages or modules) of functions.
You can find out what all of these libraries are and their contents by visiting the main (python.org) site.
We need to go through a 2-step process before we can use them in our own programs.
Step 1. Use the pip command from the command line. pip is installed as part of the Python install and is used to fetch the package from the Internet and install it in your Python configuration.
$ pip install <package name>
pip stands for Python install package and is a command-line tool. Because we are using the Anaconda distribution of Python, all of the packages that we will be using in this lesson are already installed for us, so we can move straight on to step 2.
Step 2. In your Python code include an import package-name statement. Once this is done, you can use all of the functions contained within the package.
As all of these packages are produced by 3rd parties independently of each other, there is the strong possibility that there may be clashes in function names. To allow for this, when you are calling a function from a package that you have imported, you do so by prefixing the function name with the package name. This can make for long-winded function names, so the import statement allows you to specify an alias for the package name which you must then use instead of the package name.
In future episodes, we will be importing the csv, json, pandas, numpy and matplotlib modules. We will describe their use as we use them.
The code that we will use is shown below
import csv
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
The first two we don’t alias as they have short names. The last three we do. Matplotlib is a very large library broken up into what can be thought of as sub-libraries. As we will only be using the functions contained in the pyplot sub-library, we can specify that explicitly when we import. This saves time and space. It does not affect how we call the functions in our code.
The alias we use (specified after the as keyword) is entirely up to us. However, those shown here for pandas, numpy and matplotlib are nearly universally adopted conventions for these popular libraries. If you are searching for code examples for these libraries on the Internet, these aliases will appear most of the time.
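As a quick sketch of calling a function through an alias (np.array is the same numpy function used in an earlier solution):
import numpy as np

squares = np.array([1, 4, 9, 16]) # numpy's array function called via the np alias
print(squares)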
Key Points
Functions are used to create re-usable sections of code
Using parameters with functions makes them more flexible
You can use functions written by others by importing the libraries containing them into your code
Processing data from a file
Overview
Teaching: 45 min
Exercises: 25 min
Questions
How can I read and write files?
What kind of data files can I read?
Objectives
Describe a file handle
Use with open() as to open files for reading and auto-close them
Create and open files for writing or appending and auto-close them
Explain what is meant by a record
Reading and Writing datasets
In all of our examples so far, we have directly allocated values to variables in the code we have written before using the variables.
Python has an input() function which will ask for input from the user, but for any large amounts of data this will be an impractical way of collecting data.
The reality is that most of the data that your program uses will be read from a file. Additionally, apart from when you are developing, most of your program output will be written to a file.
In this episode we will look at how to read and write files of data in Python.
There are in fact many different approaches to reading data files and which one you choose will depend on such things as the size of the file and the data format of the file.
In this episode we will:
- read a file which is in .csv (Comma Separated Values) format
- use standard core Python functions to do this
- read the file one line at a time (line = record = row of a table)
- perform simple processing of the file data and print the output
- split the file into smaller files based on some processing
The file we will be using is only a small file (131 data records), but the approach we are using will work for any size of file; imagine 131M records. This is because we only process one record at a time, so the memory requirements of the program will be very small. The larger the file, the more processing time is required.
Other approaches to reading files will typically expect to read the whole file in one go. This can be efficient when you subsequently process the data, but it requires large amounts of memory to hold the entire file. We will look at this approach later when we look at the Pandas package.
For our examples in this episode we are going to use the SAFI_results.csv file. This is available for download here and the description of the file is available here.
The code assumes that the file is in the same directory as your notebook.
We will build up our programs in simple steps.
Step 1 - Open the file, read through it and close the file
with open("SAFI_results.csv") as f:    # Open the file and assign it to a new variable which we call 'f'.
                                       # The file will be read-only by default.
                                       # As long as the following code is indented, the file 'f' will be open.
    for line in f:                     # We use a for loop to iterate through the file one line at a time.
        print(line)                    # We simply print the line.

print("I'm on to something else now.")    # When we are finished with this file, we stop indenting the code and the file is closed automatically.
Column1,A01_interview_date,A03_quest_no,A04_start,A05_end,A06_province,A07_district,A08_ward,A09_village,A11_years_farm,A12_agr_assoc,B11_remittance_money,B16_years_liv,B17_parents_liv,B18_sp_parents_liv,B19_grand_liv,B20_sp_grand_liv,B_no_membrs,C01_respondent_roof_type,C02_respondent_wall_type,C02_respondent_wall_type_other,C03_respondent_floor_type,C04_window_type,C05_buildings_in_compound,C06_rooms,C07_other_buildings,D_plots_count,E01_water_use,E17_no_enough_water,E19_period_use,E20_exper_other,E21_other_meth,E23_memb_assoc,E24_resp_assoc,E25_fees_water,E26_affect_conflicts,E_no_group_count,E_yes_group_count,F04_need_money,F05_money_source_other,F06_crops_contr,F08_emply_lab,F09_du_labour,F10_liv_owned_other,F12_poultry,F13_du_look_aftr_cows,F_liv_count,G01_no_meals,_members_count,_note,gps:Accuracy,gps:Altitude,gps:Latitude,gps:Longitude,instanceID
0,17/11/2016,1,2017-03-23T09:49:57.000Z,2017-04-02T17:29:08.000Z,Province1,District1,Ward2,Village2,11,no,no,4,no,yes,no,yes,3,grass,muddaub,,earth,no,1,1,no,2,no,,,,,,,,,2,,,,,no,no,,yes,no,1,2,3,,14,698,-19.11225943,33.48345609,uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
...
You can think of the file as being a list of strings. Each string in the list is one complete line from the file.
If you look at the output, you can see that the first record in the file is a header record containing column names. When we read a file in this way, the column headings have no significance; the computer sees them as just another record in the file.
Step 2 - Select a specific ‘column’ from the records in the file
We know that the first record in the file is a header record and we want to ignore it. To do this we call the readline() method of the file handle f. We don’t need to assign the line that it will return to a variable as we are not going to use it.
As we read the file, the line variable is a string containing a complete record. The fields or columns of the record are separated from each other by “,” as it is a csv file.
As line is a string, we can use the split() method to convert it to a list of column values. We are specifically going to select the column which is the 19th entry in the list (remember the list index starts at 0). This refers to the C01_respondent_roof_type column. We are going to examine the different roof types.
with open("SAFI_results.csv") as f:    # Open the file and assign it to a variable called 'f'.
                                       # Indent the code to keep the file open. Stop indenting to close.
    f.readline()                       # First line is a header so ignore it.
    for line in f:
        print(line.split(",")[18])     # Index 18, the 19th column, is C01_respondent_roof_type.
grass
grass
mabatisloping
mabatisloping
grass
grass
grass
mabatisloping
...
Having a list of the roof types from all of the records is one thing, but it is more likely that we would want a count of each type. By scanning up and down the previous output, there appear to be 3 different types, but we will play safe and assume there may be more.
Step 3 - How many of each different roof types are there?
# 1
with open("SAFI_results.csv") as f:
    # 2
    f.readline()
    # 3
    grass_roof = 0
    mabatisloping_roof = 0
    mabatipitched_roof = 0
    roof_type_other = 0
    for line in f:
        # 4
        roof_type = line.split(",")[18]
        # 5
        if roof_type == 'grass' :
            grass_roof += 1
        elif roof_type == 'mabatisloping' :
            mabatisloping_roof += 1
        elif roof_type == 'mabatipitched' :
            mabatipitched_roof += 1
        else :
            roof_type_other += 1

# 6
print("There are ", grass_roof, " grass roofs")
print("There are ", mabatisloping_roof, " mabatisloping roofs")
print("There are ", mabatipitched_roof, " mabatipitched roofs")
print("There are ", roof_type_other, " other roof types")
There are 73 grass roofs
There are 48 mabatisloping roofs
There are 10 mabatipitched roofs
There are 0 other roof types
What are we doing here?
- Open the file
- Ignore the headerline
- Initialise roof type variables to 0
- Extract the C01_respondent_roof_type information from each record
- Increment the appropriate variable
- Print out the results (we have stopped indenting so the file will be closed)
Instead of printing out the counts of the roof types, you may want to extract all of one particular roof type to a separate file. Let us assume we want all of the grass roof records to be written to a file.
# 1
with open("SAFI_results.csv") as fr:    # Note how we have used a new variable name, 'fr'.
                                        # The file is read-only by default.
    with open("SAFI_grass_roof.csv", "w") as fw:    # We are keeping 'fr' open so we indent.
                                                    # We specify a second parameter, "w", to make this file writeable.
                                                    # We use a different variable, 'fw'.
        for line in fr:
            # 2
            if line.split(",")[18] == 'grass' :
                fw.write(line)
What are we doing here?
- Open the files. Because there are now two files, each has its own file handle: fr for the file we read and fw for the file we are going to write. (They are just variable names so you can use anything you like.) For the file we are going to write to, we use w for the second parameter. If the file does not exist it will be created. If it does exist, then the contents will be overwritten. If we want to append to an existing file we can use a as the second parameter.
- Because we are just testing a specific field from the record to have a certain value, we don’t need to put it into a variable first. If the expression is True, then we use the write() method to write the complete line, just as we read it, to the output file.
In this example we didn’t bother skipping the header line as it would fail the test in the if statement. If we did want to include it we could have added the line
fw.write(fr.readline())
before the for loop.
Exercise
From the SAFI_results.csv file extract all of the records where the C01_respondent_roof_type (index 18) has a value of 'grass' and the C02_respondent_wall_type (index 19) has a value of 'muddaub' and write them to a file. Within the same program write all of the records where C01_respondent_roof_type (index 18) has a value of 'grass' and the C02_respondent_wall_type (index 19) has a value of 'burntbricks' and write them to a separate file. In both files include the header record.
Solution
with open("SAFI_results.csv") as fr:
    with open("SAFI_grass_roof_muddaub.csv", "w") as fw1:
        with open("SAFI_grass_roof_burntbricks.csv", "w") as fw2:
            headerline = fr.readline()
            fw1.write(headerline)
            fw2.write(headerline)
            for line in fr:
                if line.split(",")[18] == 'grass' :
                    if line.split(",")[19] == 'muddaub' :
                        fw1.write(line)
                    if line.split(",")[19] == 'burntbricks' :
                        fw2.write(line)
In our example of printing the counts for the roof types, we assumed that we knew what the likely roof types were. Although we did have an 'other' option to catch anything we missed, had there been any we would still be none the wiser as to what they represented. We were able to decide on the specific roof types by manually scanning the list of C01_respondent_roof_type values. This was only practical because of the small file size. For a multi-million record file we could not have done this.
We would like a way of creating a list of the different roof types and at the same time counting them. We can do this by using not a Python list structure, but a Python dictionary.
The Python dictionary structure
In Python a dictionary object maps keys to values. A dictionary can hold any number of keys and values but a key cannot be duplicated.
The following code shows examples of creating a dictionary object and manipulating keys and values.
# an empty dictionary
myDict = {}
# A dictionary with a single Key-value pair
personDict = {'Name' : 'Peter'}
# I can add more about 'Peter' to the dictionary
personDict['Location'] = 'Manchester'
# I can print all of the keys and values from the dictionary
print(personDict.items())
# I can print all of the keys and values from the dictionary - and make it look a bit nicer
for item in personDict:
print(item, "=", personDict[item])
# or all of the keys
print(personDict.keys())
# or all of the values
print(personDict.values())
# I can access the value for a given key
x = personDict['Name']
print(x)
# I can change value for a given key
personDict['Name'] = "Fred"
print(personDict['Name'])
# I can check if a key exists
key = 'Name'
if key in personDict :
print("already exists")
else :
personDict[key] = "New value"
dict_items([('Location', 'Manchester'), ('Name', 'Peter')])
Location = Manchester
Name = Peter
dict_keys(['Location', 'Name'])
dict_values(['Manchester', 'Peter'])
Peter
Fred
already exists
Exercise
- Create a dictionary called dict_roof_types with initial keys of type1 and type2 and give them values of 1 and 3.
- Add a third key type3 with a value of 6.
- Add code to check if a key of type4 exists. If it does not, add it to the dictionary with a value of 1; if it does, increment its value by 1.
- Add code to check if a key of type2 exists. If it does not, add it to the dictionary with a value of 1; if it does, increment its value by 1.
- Print out all of the keys and values from the dictionary.
Solution
# 1
dict_roof_types = {'type1' : 1 , 'type2' : 3}
# 2
dict_roof_types['type3'] = 6
# 3
key = 'type4'
if key in dict_roof_types :
    dict_roof_types[key] += 1
else :
    dict_roof_types[key] = 1
# 4
key = 'type2'
if key in dict_roof_types :
    dict_roof_types[key] += 1
else :
    dict_roof_types[key] = 1
# 5
for item in dict_roof_types:
    print(item, "=", dict_roof_types[item])
We are now in a position to rewrite our count of roof types example without knowing in advance what any of the roof types are.
# 1
with open ("SAFI_results.csv") as f:
# 2
f.readline()
# 3
dict_roof_types = {}
for line in f:
# 4
roof_type = line.split(",")[18]
# 5
if roof_type in dict_roof_types :
dict_roof_types[roof_type] += 1
else :
dict_roof_types[roof_type] = 1
# 6
for item in dict_roof_types:
print(item, "=", dict_roof_types[item])
grass = 73
mabatisloping = 48
mabatipitched = 10
What are we doing here?
- Open the file
- Ignore the header line
- Create an empty dictionary
- Extract the C01_respondent_roof_type information from each record
- Either add to the dictionary with a value of 1 or increment the current value for the key by 1
- Print out the contents of the dictionary (we have stopped indenting so the file is closed)
You can apply the same approach to count values in any of the fields/columns of the file.
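For example, one way you might wrap this pattern in a reusable function is sketched below; the function name count_values and the choice of index 19 (the C02_respondent_wall_type column) are only for illustration.
def count_values(filename, index):
    # count the occurrences of each distinct value in column 'index'
    counts = {}
    with open(filename) as f:
        f.readline()                                 # skip the header line
        for line in f:
            key = line.split(",")[index]
            counts[key] = counts.get(key, 0) + 1     # add the key or increment its count
    return counts

print(count_values("SAFI_results.csv", 19))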
Key Points
Reading data from files is far more common than program ‘input’ requests or hard coding values
Python provides simple means of reading from a text file and writing to a text file
Tabular data is commonly recorded in a ‘csv’ file
Text files like csv files can be thought of as being a list of strings. Each string is a complete record
You can read and write a file one record at a time
Python has built-in functions to parse (split up) records into individual tokens
Dates and Time
Overview
Teaching: 15 min
Exercises: 10 min
Questions
How are dates and time represented in Python?
How can I manipulate dates and times?
Objectives
Describe some of the datetime functions available in Python
Describe the use of format strings to describe the layout of a date and/or time string
Make use of date arithmetic
Dates and Times in Python
Python can be very flexible in how it interprets 'strings' which you want to be considered as a date, time, or date and time, but you have to tell Python how the various parts of the date and/or time are represented in your 'string'. You can do this by creating a format. In a format, different case sensitive characters preceded by the % character act as placeholders for parts of the date/time; for example %Y represents a year formatted as a 4 digit number, such as 2014.
A full list of the characters used and what they represent can be found towards the end of the datetime section of the official Python documentation.
There is a today() method which allows you to get the current date and time.
By default it is displayed in a format similar to the ISO 8601 standard format.
To use the date and time functions you need to import the datetime module.
from datetime import datetime
today = datetime.today()
print('ISO :', today)
ISO : 2018-04-12 16:19:17.177441
We can use our own formatting instead. For example, if we wanted words instead of numbers and the 4 digit year at the end we could use the following.
format = "%a %b %d %H:%M:%S %Y"
today_str = today.strftime(format)
print('strftime:', today_str)
print(type(today_str))
today_date = datetime.strptime(today_str, format)
print('strptime:', today_date.strftime(format))
print(type(today_date))
strftime: Thu Apr 12 16:19:17 2018
<class 'str'>
strptime: Thu Apr 12 16:19:17 2018
<class 'datetime.datetime'>
strftime converts a datetime object to a string and strptime creates a datetime object from a string.
When you print them using the same format string, they look the same.
The format of the date fields in the SAFI_results.csv file has been generated automatically to conform to the ISO 8601 standard.
When we read the file and extract the date fields, they are of type string. Before we can use them as dates, we need to convert them into Python date objects.
In the format string we use below, the -, :, T and Z characters are just that, characters in the string representing the date/time.
Only the characters preceded by % have special meanings.
Having converted the strings to datetime objects, there are a variety of methods that we can use to extract different components of the date/time.
from datetime import datetime
format = "%Y-%m-%dT%H:%M:%S.%fZ"
f = open('SAFI_results.csv', 'r')
#skip the header line
line = f.readline()
# next line has data
line = f.readline()
strdate_start = line.split(',')[3] # A04_start
strdate_end = line.split(',')[4] # A05_end
print(type(strdate_start), strdate_start)
print(type(strdate_end), strdate_end)
# the full date and time
datetime_start = datetime.strptime(strdate_start, format)
print(type(datetime_start))
datetime_end = datetime.strptime(strdate_end, format)
print('formatted date and time', datetime_start)
print('formatted date and time', datetime_end)
# the date component
date_start = datetime.strptime(strdate_start, format).date()
print(type(date_start))
date_end = datetime.strptime(strdate_end, format).date()
print('formatted start date', date_start)
print('formatted end date', date_end)
# the time component
time_start = datetime.strptime(strdate_start, format).time()
print(type(time_start))
time_end = datetime.strptime(strdate_end, format).time()
print('formatted start time', time_start)
print('formatted end time', time_end)
f.close()
<class 'str'> 2017-03-23T09:49:57.000Z
<class 'str'> 2017-04-02T17:29:08.000Z
<class 'datetime.datetime'>
formatted date and time 2017-03-23 09:49:57
formatted date and time 2017-04-02 17:29:08
<class 'datetime.date'>
formatted start date 2017-03-23
formatted end date 2017-04-02
<class 'datetime.time'>
formatted start time 09:49:57
formatted end time 17:29:08
Components of dates and times
For a date or time we can also extract individual components. They are held internally in the datetime data structure.
# date parts.
print('formatted end date', date_end)
print(' end date year', date_end.year)
print(' end date month', date_end.month)
print(' end date day', date_end.day)
print (type(date_end.day))
# time parts.
print('formatted end time', time_end)
print(' end time hour', time_end.hour)
print(' end time minutes', time_end.minute)
print(' end time seconds', time_end.second)
print(type(time_end.second))
formatted end date 2017-04-02
end date year 2017
end date month 4
end date day 2
<class 'int'>
formatted end time 17:29:08
end time hour 17
end time minutes 29
end time seconds 8
<class 'int'>
Date arithmetic
We can also do arithmetic with the dates.
date_diff = datetime_end - datetime_start
date_diff
print(type(datetime_start))
print(type(date_diff))
print(date_diff)
date_diff = datetime_start - datetime_end
print(type(date_diff))
print(date_diff)
<class 'datetime.datetime'>
<class 'datetime.timedelta'>
10 days, 7:39:11
<class 'datetime.timedelta'>
-11 days, 16:20:49
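A timedelta is stored in a normalised form: only the days attribute can be negative, while the seconds and microseconds attributes are always non-negative. A small sketch, reusing the date_diff from the last calculation, makes the components visible:
print(date_diff.days)             # -11
print(date_diff.seconds)          # 58849, i.e. 16:20:49 expressed in seconds
print(date_diff.total_seconds())  # -891551.0, the whole difference in seconds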
Exercise
How do you interpret the last result?
The code below calculates the time difference between supposedly starting the survey and ending the survey (for each respondent).
from datetime import datetime
format = "%Y-%m-%dT%H:%M:%S.%fZ"
f = open('SAFI_results.csv', 'r')
line = f.readline()
for line in f:
#print(line)
strdate_start = line.split(',')[3]
strdate_end = line.split(',')[4]
datetime_start = datetime.strptime(strdate_start, format)
datetime_end = datetime.strptime(strdate_end, format)
date_diff = datetime_end - datetime_start
print(datetime_start, datetime_end, date_diff )
f.close()
2017-03-23 09:49:57 2017-04-02 17:29:08 10 days, 7:39:11
2017-04-02 09:48:16 2017-04-02 17:26:19 7:38:03
2017-04-02 14:35:26 2017-04-02 17:26:53 2:51:27
2017-04-02 14:55:18 2017-04-02 17:27:16 2:31:58
2017-04-02 15:10:35 2017-04-02 17:27:35 2:17:00
2017-04-02 15:27:25 2017-04-02 17:28:02 2:00:37
2017-04-02 15:38:01 2017-04-02 17:28:19 1:50:18
2017-04-02 15:59:52 2017-04-02 17:28:39 1:28:47
2017-04-02 16:23:36 2017-04-02 16:42:08 0:18:32
...
Exercise
In the SAFI_results.csv file the A01_interview_date field (index 1) contains a date in the form of 'dd/mm/yyyy'. Read the file and calculate the differences in days (because the interview date is only given to the day) between the A01_interview_date values and the A04_start values. You will need to create a format string for the A01_interview_date field.
Looking at the results here and from the previous section of code, do you think the smartphone data entry system for the survey was being used in real time?
Solution
from datetime import datetime

format1 = "%Y-%m-%dT%H:%M:%S.%fZ"
format2 = "%d/%m/%Y"

f = open('SAFI_results.csv', 'r')
line = f.readline()
for line in f:
    A01 = line.split(',')[1]
    A04 = line.split(',')[3]
    datetime_A04 = datetime.strptime(A04, format1)
    datetime_A01 = datetime.strptime(A01, format2)
    date_diff = datetime_A04 - datetime_A01
    print(datetime_A04, datetime_A01, date_diff.days )
f.close()
Key Points
Date and Time functions in Python come from the datetime library, which needs to be imported
You can use format strings to have dates/times displayed in any representation you like
Internally dates and times are stored in special data structures which allow you to access the component parts of dates and times
Processing JSON data
Overview
Teaching: 30 min
Exercises: 15 min
Questions
What is JSON format?
How can I extract specific data items from a JSON record?
How can I convert an array of JSON records into a table?
Objectives
Describe the JSON data format
Understand where JSON is typically used
Appreciate some advantages of using JSON over tabular data
Appreciate some disadvantages of processing JSON documents
Compare JSON to the Python Dict data type
Use the JSON package to read a JSON file
Display formatted JSON
Select and display specific fields from a JSON document
Write tabular data from selected elements from a JSON document to a csv file
More on Dictionaries
In the Processing data from file episode we introduced the dictionary object.
We created dictionaries and we added key : value pairs to the dictionary.
In all of the examples that we used, the value was always a simple data type like an integer or a string.
The value associated with a key in a dictionary can be of any type, including a list or even another dictionary.
We created a simple dictionary object with the following code:
personDict = {'Name' : 'Peter'}
personDict['Location'] = 'Manchester'
print(personDict)
{'Name': 'Peter', 'Location': 'Manchester'}
So far the keys in the dictionary each relate to a single piece of information about the person. What if we wanted to add a list of items?
personDict['Children'] = ['John', 'Jane', 'Jack']
personDict['Children_count'] = 3
print(personDict)
{'Name': 'Peter', 'Children': ['John', 'Jane', 'Jack'], 'Children_count': 3, 'Location': 'Manchester'}
Not only can I have a key where the value is a list, the value could also be another dictionary object. Suppose I want to add some telephone numbers:
personDict['phones'] = {'home' : '0102345678', 'mobile' : '07770123456'}
print(personDict.values())
# adding another phone
personDict['phones']['business'] = '0161234234546'
print(personDict)
dict_values(['Peter', ['John', 'Jane', 'Jack'], {'home': '0102345678', 'mobile': '07770123456'}, 3, 'Manchester'])
{'Name': 'Peter', 'Children': ['John', 'Jane', 'Jack'], 'phones': {'home': '0102345678', 'mobile': '07770123456', 'business': '0161234234546'}, 'Children_count': 3, 'Location': 'Manchester'}
Exercise
- Using the personDict as a base, add information relating to the person's home and work addresses, including postcodes.
- Print out the postcode for the work address.
- Print out the names of the children on separate lines (i.e. not as a list).
Solution
personDict['Addresses'] = {'Home' : {'Addressline1' : '23 acacia ave.', 'Addressline2' : 'Romford', 'PostCode' : 'RO6 5WR'},
                           'Work' : {'Addressline1' : '19 Orford Road.', 'Addressline2' : 'London', 'PostCode' : 'EC4J 3XY'} }

print(personDict['Addresses']['Work']['PostCode'])

for child in personDict['Children']:
    print(child)
The ability to create dictionaries containing lists and other dictionaries makes the dictionary object very versatile; you can create an arbitrarily complex data structure of dictionaries within dictionaries.
In practice you will not be doing this manually; instead, like most data, you will read it in from a file.
The JSON data format
The JSON data format was designed as a way of allowing different machines or processes within machines to communicate with each other by sending messages constructed in a well defined format. JSON is now the preferred data format used by APIs (Application Programming Interfaces).
The JSON format, although somewhat verbose, is not only human readable but it can also be mapped very easily to a Python dictionary object.
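As a small illustration of this mapping (the string below is made up and is not from the SAFI data), the json module can parse a JSON string directly into a dictionary:
import json
json_text = '{"Name": "Peter", "Children": ["John", "Jane", "Jack"]}'
record = json.loads(json_text)   # parse the JSON string into a Python dict
print(type(record))              # <class 'dict'>
print(record['Children'][0])     # John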
We are going to read a file of data formatted as JSON, convert it into a dictionary object in Python then selectively extract Key-Value pairs and create a csv file from the extracted data.
The JSON file we are going to use is the SAFI.json file. This is the output file from an electronic survey system called ODK. The JSON represents the answers to a series of survey questions. The questions themselves have been replaced with unique Keys, the values are the answers.
Because detailed surveys are by nature nested structures, making it possible to record different levels of detail or to selectively ask a set of specific questions based on the answer given to a previous question, the structure of the answers for the survey can not only be complex and convoluted, it could easily differ from one survey respondent's set of answers to another.
Advantages of JSON
- Very popular data format for APIs (e.g. results from an Internet search)
- Human readable
- Each record (or document as they are called) is self contained. The equivalent of the column name and column values are in every record.
- Documents do not all have to have the same structure within the same file
- Document structures can be complex and nested
Disadvantages of JSON
- It is more verbose than the equivalent data in csv format
- Can be more difficult to process and display than csv formatted data
Viewing JSON data
- JSON can often be viewed directly in web browsers.
- Browser extensions are often available for ease of viewing JSON data e.g., Chrome, Firefox, Safari, etc.
- JSON files can be opened using text editors e.g., notepad, nano, textedit, etc.
- There are many websites which allow viewing or beautifying JSON data with ease. For example, http://jsonviewer.stack.hu/. Try pasting or loading JSON data in the second tab called “text” on that website and then clicking on the “format” button. Clicking on the first tab, “viewer” also gives another view of the data.
Use the JSON package to read a JSON file
import json
with open('SAFI.json') as json_data:
d = json.load(json_data)
print(type(d))
print(type(d[0]))
print(json.dumps(d[0], indent=2))
<class 'list'>
<class 'dict'>
{
"G02_months_lack_food": [
"Jan"
],
"G01_no_meals": 2,
"E_no_group_count": "2",
"A03_quest_no": "01",
...
Points to note:
- We import the json package with an import statement.
- We have chosen to use the with statement to open the SAFI.json file. Notice the : at the end of the line and the subsequent indentation. The with statement is in effect until we un-indent, at which time the file will automatically be closed, so we don't need to do so explicitly. 'json_data' is the file handle.
- The json.load method is passed the file handle and reads the complete file.
- The variable d is a list of dictionaries. (When we read the csv file we considered it to be a list of strings.)
- The json.dumps method can be used to print either the entire file or a specific dictionary from the list in a formatted manner by using the indent parameter.
By default the order in which the keys of the dictionary are printed is not guaranteed. If we want them in sorted order we can have them sorted by using the sort_keys parameter.
print(json.dumps(d[0], indent=2, sort_keys=True))
{
"A01_interview_date": "2016-11-17",
"A03_quest_no": "01",
"A04_start": "2017-03-23T09:49:57.000Z",
"A05_end": "2017-04-02T17:29:08.000Z",
"A06_province": "province1",
...
}
Extracting specific fields from a JSON document
If we want to extract fields from a JSON document, the first step is to convert the JSON document into a Python dictionary. We have in fact already done this with the d = json.load(json_data) line. d is a list object and each entry in the list is a dictionary object.
Extract the fields we want into a flat format
Despite the arbitrary complexity of a JSON document or a Python dictionary object, we can adopt a very systematic approach to extracting individual fields from the structure.
The story so far: our JSON file has been read into a variable d. We know that d is a list of dictionaries. Each dictionary represents a JSON document (a record).
We can print the contents of the first dictionary in the list with
print(json.dumps(d[0], indent=2, sort_keys=True))
Exercise
In the output from the code above there is a key with the name of D_curr_crop. Find it, and by looking at the indentation and the [ (lists) and { (dictionaries), describe in English how you could find the first occurrence of D_curr_crop starting with d.
Use a print statement to find out what it is.
Solution
- d is a list of dictionaries
- d[0] is the first dictionary
- within d[0] there is a key D_plots whose value is a list and contains dictionaries
- d[0]['D_plots'][0] is the first dictionary in the list
- within d[0]['D_plots'][0] there is a key D_crops which is also a list of dictionaries
- d[0]['D_plots'][0]['D_crops'][0] is the first dictionary in the list
- within this dictionary there is a key D_curr_crop
Being able to start at the outermost level and work your way in is very important when you need to extract specific items.
print(d[0]['D_plots'][0]['D_crops'][0]['D_curr_crop'])
Being able to drill down in this way is very useful in helping you get a feel for the JSON data structure. In practice it is more likely that instead of returning the first occurrence of D_curr_crop you will want to return all of them. This requires a little more programming, and you need to be aware of two potential problems:
- D_curr_crop may not exist in any particular dictionary within D_crops
- any of the lists D_plots or D_crops could be missing or just empty lists ([])
In our first attempt we will ignore these problems.
for farms in d:
plot = farms['D_plots']
for crops in plot:
crop = crops['D_crops']
for curr_crops in crop:
print(curr_crops['D_curr_crop'])
maize
maize
maize
tomatoes
vegetable
maize
maize
maize
sorghum
...
In this version we test if all of the keys exist. This could be extended to check that the lists are not empty; a possible extension is sketched after the code below.
for farms in d:
if 'D_plots' in farms :
plot = farms['D_plots']
for crops in plot:
if 'D_crops' in crops :
crop = crops['D_crops']
for curr_crops in crop:
if 'D_curr_crop' in curr_crops:
print(curr_crops['D_curr_crop'])
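One possible extension, as mentioned above, also checks that the lists are not empty. An empty list is 'falsy' in Python, so we can test the value directly:
for farms in d:
    if 'D_plots' in farms and farms['D_plots'] :
        plot = farms['D_plots']
        for crops in plot:
            if 'D_crops' in crops and crops['D_crops'] :
                crop = crops['D_crops']
                for curr_crops in crop:
                    if 'D_curr_crop' in curr_crops:
                        print(curr_crops['D_curr_crop'])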
We can now produce a list of all of the crops in all of the plots in all of the farms.
We can also create a unique set of all of the crops grown using the Python set data structure, as shown in the code below. A set is like a list but does not allow duplicate values (but doesn't raise an error if you try to add a duplicate).
unique_crops = set()
for farms in d:
if 'D_plots' in farms :
plot = farms['D_plots']
for crops in plot:
if 'D_crops' in crops :
crop = crops['D_crops']
for curr_crops in crop:
if 'D_curr_crop' in curr_crops:
#print(curr_crops['D_curr_crop'])
unique_crops.add(curr_crops['D_curr_crop'])
print(unique_crops)
{'peanut', 'potatoes', 'tomatoes', 'other', 'vegetable', 'amendoim', 'sunflower', 'bananas', 'sesame', None, 'cucumber', 'onion', 'sorghum', 'piri_piri', 'baby_corn', 'cabbage', 'ngogwe', 'maize', 'pigeonpeas', 'beans'}
Simply having a list of all of the crops is unlikely to be enough. What you are really interested in is which farm grows which crops in which plot.
We can accumulate this information as we move through the list of dictionary objects. At the top level, farm, there is a unique identifier A03_quest_no which we can use. For the plot and the crop within the plot we will create our own simple indexing system (plot_no and crop_no). At the end, instead of just printing the crop name, we also print the details of where this crop is being grown.
for farms in d:
plot_no = 0
id = farms['A03_quest_no']
if 'D_plots' in farms :
plot = farms['D_plots']
for crops in plot:
crop_no = 0
plot_no += 1
if 'D_crops' in crops :
crop = crops['D_crops']
for curr_crops in crop:
crop_no += 1
if 'D_curr_crop' in curr_crops:
print("Farm no ", id," grows ", curr_crops['D_curr_crop']," in plot", plot_no , " and it is crop number ", crop_no)
Farm no 01 grows maize in plot 1 and it is crop number 1
Farm no 01 grows maize in plot 2 and it is crop number 1
Farm no 01 grows maize in plot 1 and it is crop number 1
Farm no 01 grows tomatoes in plot 2 and it is crop number 1
Farm no 01 grows vegetable in plot 3 and it is crop number 1
...
The final stage of this data extraction process is to save the extracted data to a file for subsequent use.
Rather than manually appending all of the information items into a string with , separating each, we can use the csv module.
To do this we need to create a csv.writer object and use it to write complete rows of data at a time. csv.writer expects the data to be provided as a list of items.
For the header row we provide a list of strings containing the column names we want, and at the end we provide the data items in a list as well.
import csv
filename = "SAFI_crops.csv"
fw = open(filename, 'w')
cf = csv.writer(fw, lineterminator='\n')
# write the header
cf.writerow(["Farm","plot_no","plot_area","crop_no","crop_name"])
for farms in d:
plot_no = 0
id = farms['A03_quest_no']
if 'D_plots' in farms :
plot = farms['D_plots']
for crops in plot:
crop_no = 0
plot_no += 1
if 'D_crops' in crops :
plot_area = crops['D02_total_plot']
crop = crops['D_crops']
for curr_crops in crop:
crop_no += 1
if 'D_curr_crop' in curr_crops:
#print( id, plot_no , plot_area , crop_no, curr_crops['D_curr_crop'])
cf.writerow([id, plot_no , plot_area , crop_no, curr_crops['D_curr_crop']])
fw.close()
Key Points
JSON is a popular data format for transferring data used by a great many Web based APIs
The JSON data format is very similar to the Python Dictionary structure.
The complex structure of a JSON document means that it cannot easily be ‘flattened’ into tabular data
We can use Python code to extract values of interest and place them in a csv file
Reading data from a file using Pandas
Overview
Teaching: 15 min
Exercises: 5 min
Questions
What is Pandas?
How do I read files using Pandas?
What is the difference between reading files using Pandas and other methods of reading files?
Objectives
Explain what a module is and how they are used in Python
Describe what the Python Data Analysis Library (pandas) is
Load the Python Data Analysis Library (pandas)
Use read_csv to read tabular data into Python
What is Pandas?
pandas is a Python library containing a set of functions and specialised data structures that have been designed to help Python programmers to perform data analysis tasks in a structured way.
Most of the things that pandas can do can be done with basic Python, but the collected set of pandas functions and data structures makes the data analysis tasks more consistent in terms of syntax and therefore aids readability.
Particular features of pandas that we will be looking at over this and the next couple of episodes include:
- Reading data stored in CSV files (other file formats can be read as well)
- Slicing and subsetting data in Dataframes (tables!)
- Dealing with missing data
- Reshaping data (long -> wide, wide -> long)
- Inserting and deleting columns from data structures
- Aggregating data using data grouping facilities using the split-apply-combine paradigm
- Joining of datasets (after they have been loaded into Dataframes)
If you are wondering why I write pandas with a lower case ‘p’ it is because it is the name of the package and Python is case sensitive.
Importing the pandas library
Importing the pandas library is done in exactly the same way as for any other library. In almost all examples of Python code using the pandas library, it will have been imported and given an alias of pd. We will follow the same convention.
import pandas as pd
Pandas data structures
There are two main data structures used by pandas: the Series and the Dataframe. The Series equates in general to a vector or a list. The Dataframe is equivalent to a table. Each column in a pandas Dataframe is a pandas Series data structure.
We will mainly be looking at the Dataframe.
We can easily create a Pandas Dataframe by reading a .csv file.
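Before we do, a small made-up sketch may help to show the relationship between the two structures (the names and values here are purely illustrative):
s = pd.Series([1, 2, 3])                 # a Series is a single column of values
df = pd.DataFrame({'name' : ['Peter', 'Fred'],
                   'age' : [52, 44]})    # a Dataframe is a table of named columns
print(type(df))                          # <class 'pandas.core.frame.DataFrame'>
print(type(df['name']))                  # each column is a pandas Series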
Reading a csv file
When we read a csv dataset in base Python we did so by opening the dataset, reading and processing a record at a time and then closing the dataset after we had read the last record. Reading datasets in this way is slow and places all of the responsibility for extracting individual data items of information from the records on the programmer.
The main advantage of this approach, however, is that you only have to store one dataset record in memory at a time. This means that if you have the time, you can process datasets of any size.
In Pandas, csv files are read as complete datasets. You do not have to explicitly open and close the dataset. All of the dataset records are assembled into a Dataframe. If your dataset has column headers in the first record then these can be used as the Dataframe column names. You can explicitly state this in the parameters to the call, but pandas is usually able to infer that there is a header row and use it automatically.
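If you do want to be explicit, read_csv has a header parameter. A minimal sketch, using the SAFI file from earlier episodes, might be:
# header=0 states that the first record holds the column names
# (this is what pandas usually infers anyway);
# header=None would instead treat the first record as data
df_SAFI = pd.read_csv("SAFI_results.csv", header=0)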
For our examples in this episode we are going to use the SN7577.tab file. This is available for download here, and the description of the file is available here.
We are going to read in our SN7577.tab file. Although this is a tab delimited file, we will still use the pandas read_csv method, but we will explicitly tell the method that the separator is the tab character and not a comma, which is the default.
df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
Exercise
What happens if you forget to specify sep='\t' when reading a tab delimited dataset?
Solution
df_SN7577_oops = pd.read_csv("SN7577.tab")
print(df_SN7577_oops.shape)
print(df_SN7577_oops)
If you allow pandas to assume that your columns are separated by commas (the default) and there aren't any, then each record will be treated as a single column. So the shape is given as 1286 rows (correct) but only one column. When the contents are displayed, the only column name is the complete first record. Notice the \t used to represent the tab characters in the output. This is the same format we used to specify the tab separator when we correctly read in the file.
Getting information about a Dataframe
You can find out the type of the variable df_SN7577 by using the type function.
print(type(df_SN7577))
<class 'pandas.core.frame.DataFrame'>
You can see the contents by simply entering the variable name. You can see from the output that it is a tabular format. The column names have been taken from the first record of the file. On the left hand side is a column with no name. The entries here have been provided by pandas and act as an index to reference the individual rows of the Dataframe.
The read_csv() function has an index_col parameter which you can use to indicate which of the columns in the file you wish to use as the index instead. As the SN7577 dataset doesn't have a column which would uniquely identify each row, we cannot do that.
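As a sketch of how index_col could be used on a dataset which does have such a column, and assuming that A03_quest_no uniquely identifies each row of the SAFI data:
df_SAFI = pd.read_csv("SAFI_results.csv", index_col='A03_quest_no')
print(df_SAFI.index)   # the questionnaire numbers now act as the row index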
Another thing to notice about the display is that it is truncated. By default you will see the first and last 30 rows. For the columns you will always get the first few columns and typically the last few depending on display space.
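If you want the display to show more (or fewer) rows before truncating, pandas lets you change the limit with set_option; for example:
# show up to 100 rows before the display is truncated
pd.set_option('display.max_rows', 100)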
df_SN7577
Similar information can be obtained with df_SN7577.head(), but here you are only returned the first 5 rows by default.
df_SN7577.head()
Exercise
- As well as the head() method there is a tail() method. What do you think it does? Try it.
- Both methods accept a single numeric parameter. What do you think it does? Try it.
You can obtain other basic information about your Dataframe of data with:
# How many rows?
print(len(df_SN7577))
# How many rows and columns - returned as a tuple
print(df_SN7577.shape)
#How many 'cells' in the table
print(df_SN7577.size)
# What are the column names
print(df_SN7577.columns)
# what are the data types of the columns?
print(df_SN7577.dtypes)
1286
(1286, 202)
259772
Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av',
'Q5avi',
...
'numhhd', 'numkid', 'numkid2', 'numkid31', 'numkid32', 'numkid33',
'numkid34', 'numkid35', 'numkid36', 'wts'],
dtype='object', length=202)
Q1 int64
Q2 int64
Q3 int64
...
Length: 202, dtype: object
Exercise
When we asked for the column names and their data types, the output was abridged, i.e. we didn't get the values for all of the columns. Can you write a small piece of code which will return all of the values?
Solution
for name in df_SN7577.columns:
    print(name)
Key Points
pandas is a Python library containing functions and data structures to assist in data analysis
pandas data structures are the Series (like a vector) and the Dataframe (like a table)
the pandas read_csv function allows you to read an entire csv file into a Dataframe
Extracting rows and columns
Overview
Teaching: 15 min
Exercises: 15 min
Questions
How can I extract specific rows and columns from a Dataframe?
How can I add or delete columns from a Dataframe?
How can I find and change missing values in a Dataframe?
Objectives
Define indexing as it relates to data structures
Select specific columns from a data frame
Select specific rows from a data frame based on conditional expressions
Using indexes to access rows and columns
Copy a data frame
Add columns to a data frame
Analyse datasets having missing/null values
We will continue this episode from where we left off in the last episode. If you have restarted Jupyter or you want to use a new notebook make sure that you import pandas and have read the SN7577.tab dataset into a Dataframe.
import pandas as pd
df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
Selecting rows and columns from a pandas Dataframe
If we know which columns we want before we read the data from the file, we can tell read_csv() to only import those columns by specifying them by their index number (starting at 0) in a list passed to the usecols parameter. Alternatively we can also provide a list of column names.
df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols= [0,1,2,173,174,175])
print(df_SN7577_some_cols.shape)
print(df_SN7577_some_cols.columns)
df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols= ['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'])
print(df_SN7577_some_cols.columns)
(1286, 6)
Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
Let us assume for now that we read in the complete file, which is now in the Dataframe df_SN7577. How can we now refer to specific columns?
There are two ways of doing this using the column names (or labels):
# Both of these statements are the same
print(df_SN7577['Q1'])
# and
print(df_SN7577.Q1)
0 1
1 3
2 10
3 9
...
If we are interested in more than one column, the 2nd method above cannot be used. However in the first, although we used a string with the value of 'Q1', we could also have provided a list of strings. Remember that lists are enclosed in [].
print(df_SN7577[['Q1', 'Q2', 'Q3']])
Q1 Q2 Q3
0 1 -1 1
1 3 -1 1
2 10 3 2
3 9 -1 10
...
Exercise
What happens if you:
- List the columns you want out of order from the way they appear in the file?
- Put the same column name in twice?
- Put in a non-existing column name? (a.k.a Typo)
Solution
print(df_SN7577[['Q3', 'Q2']]) print(df_SN7577[['Q3', 'Q2', 'Q3']]) print(df_SN7577[['Q33', 'Q2']])
Filtering by Rows
You can filter the Dataframe by rows by specifying a range in the form of a:b. a is the first row and b is one beyond the last row required.
# select row with index of 1, 2 and 3 (rows 2, 3 and 4 in the Dataframe)
df_SN7577_some_rows = df_SN7577[1:4]
df_SN7577_some_rows
Exercise
What happens if we ask for a single row instead of a range?
Solution
df_SN7577[1]
You get an error if you only specify 1. You need to use :1 or 0:1 to get the first row returned. The : is always required. You can use : by itself to return all of the rows.
Using criteria to filter rows
It is more likely that you will want to select rows from the Dataframe based on some criteria, such as “all rows where the value for Q2 is -1”.
df_SN7577_some_rows = df_SN7577[(df_SN7577.Q2 == -1)]
df_SN7577_some_rows
The criteria can be more complex and isn’t limited to a single column’s values:
df_SN7577_some_rows = df_SN7577[ (df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)]
df_SN7577_some_rows
We can combine the row selection with column selection:
df_SN7577_some_rows = df_SN7577[ (df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)][['Q1', 'Q2','numage']]
df_SN7577_some_rows
Selecting rows on the row index is of limited use unless you need to select a contiguous range of rows.
There is however another way of selecting rows using the row index:
df_SN7577_some_rows = df_SN7577.iloc[1:4]
df_SN7577_some_rows
Using the iloc method gives the same results as our previous example.
However, now we can specify a single value and, more importantly, we can use the range() function to indicate the records that we want. This can be useful for making pseudo-random selections of rows from across the Dataframe.
# Select the first row from the Dataframe
df_SN7577_some_rows = df_SN7577.iloc[0]
df_SN7577_some_rows
# select every 100th record from the Dataframe.
df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100)]
df_SN7577_some_rows
You can also specify column ranges using the iloc method, again using the column index numbers:
# columns 0,1,2 and 3
df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100),0:4]
df_SN7577_some_rows
# columns 0,1,2,78 and 95
df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100),[0,1,2,78,95]]
df_SN7577_some_rows
There is also a loc method which allows you to use the column names.
# columns 0,1,2,78 and 95 using the column names and changing 'iloc' to 'loc'
df_SN7577_some_rows = df_SN7577.loc[range(0, len(df_SN7577), 100),['Q1', 'Q2', 'Q3', 'Q18bii', 'access6' ]]
df_SN7577_some_rows
Sampling
Pandas does have a sample method which allows you to extract a sample of the records from the Dataframe.
df_SN7577.sample(10, replace=False) # ten records, do not select same record twice (this is the default)
df_SN7577.sample(frac=0.05, random_state=1) # 5% of records , same records if run again
Key Points
Import specific columns when reading in a .csv with the usecols parameter
We can easily chain boolean conditions when filtering rows of a pandas dataframe
The loc and iloc methods allow us to get rows with particular labels and at particular integer locations respectively
pandas has a handy sample method which allows us to extract a sample of rows from a dataframe
Data Aggregation using Pandas
Overview
Teaching: 20 min
Exercises: 10 min
Questions
How can I summarise the data in a data frame?
Objectives
Access and summarize data stored in a Data Frame
Perform basic mathematical operations and summary statistics on data in a Pandas Data Frame
Understand missing data
Changing to and from ‘NaN’ values
Using Pandas functions to summarise data in a Data Frame
For variables which contain numerical values we are often interested in various statistical measures relating to those values. For categorical variables we are often interested in how many of each unique value are present in the dataset.
We shall use the SAFI_results.csv dataset to demonstrate how we can obtain these pieces of information.
import pandas as pd
df_SAFI = pd.read_csv("SAFI_results.csv")
df_SAFI
For numeric variables we can obtain a variety of basic statistical information by using the describe() method.
df_SAFI.describe()
This can be done for the Dataframe as a whole, in which case some of the results might have no sensible meaning. If there are any missing values, represented in the display as NaN, you will get a warning message.
You can also use describe() on a single variable basis.
df_SAFI['B_no_membrs'].describe()
There are also a set of methods which allow us to obtain individual values.
print(df_SAFI['B_no_membrs'].min())
print(df_SAFI['B_no_membrs'].max())
print(df_SAFI['B_no_membrs'].mean())
print(df_SAFI['B_no_membrs'].std())
print(df_SAFI['B_no_membrs'].count())
print(df_SAFI['B_no_membrs'].sum())
2
19
7.190839694656488
3.1722704895263734
131
942
Unlike the describe() method, which converts the variable to a float (when it was originally an integer), the individual summary methods only do so for the returned result if needed.
We can do the same thing for the E19_period_use variable:
print(df_SAFI['E19_period_use'].min())
print(df_SAFI['E19_period_use'].max())
print(df_SAFI['E19_period_use'].mean())
print(df_SAFI['E19_period_use'].std())
print(df_SAFI['E19_period_use'].count())
print(df_SAFI['E19_period_use'].sum())
1.0
45.0
12.043478260869565
8.583030848015385
92
1108.0
Exercise
Compare the count values returned for the B_no_membrs and the E19_period_use variables.
- Why do you think they are different?
- How does this affect the calculation of the mean values?
Solution
We know from when we originally displayed the contents of the df_SAFI Dataframe that there are 131 rows in it. This matches the value for the B_no_membrs count. The count for E19_period_use however is only 92. If you look at the values in the E19_period_use column using df_SAFI['E19_period_use'], you will see that there are several NaN values. They also occurred when we used describe() on the full Dataframe. NaN stands for Not a Number, i.e. the value is missing. There are only 92 non-missing values and this is what is reported by the count() method. This value is also used in the calculation of the mean and std values.
Dealing with missing values
We can find out how many variables in our Dataframe contain any NaN values with the code
df_SAFI.isnull().sum()
Column1 0
A01_interview_date 0
A03_quest_no 0
A04_start 0
...
or for a specific variable
df_SAFI['E19_period_use'].isnull().sum()
39
Data from most sources has the potential to include missing data. Whether or not this presents a problem at all depends on what you are planning to do.
We have been using data from two very different sources.
The SN7577 dataset is provided by the UK Data Service. Datasets from the UK Data Service have already been 'cleaned', and it is unlikely that there will be any genuinely missing data. However you may find that data which was missing has been replaced with a value such as '-1' or 'Not Specified'. In cases like these it may be appropriate to replace those values with 'NaN' before you try to process the data further.
The SAFI dataset we have been using comes from a project called ‘Studying African Farmer-led Irrigation’. The data for this project is questionnaire based, but rather than using a paper-based questionnaire, it has been created and is completed electronically via an app on a smartphone. This provides flexibility in the design and presentation of the questionnaire; a section of the questionnaire may only be presented depending on the answer given to some preceding question. This means that there can quite legitimately be a set of ‘NaN’ values in a record (one complete questionnaire) where you would still consider the record to be complete.
We have already seen how we can check for missing values. There are three other actions we need to be able to do:
- Remove complete rows which contain NaN
- Replace NaN with a value of our choice
- Replace specific values with NaN
With these options we can ensure that the data is suitable for the further processing we have planned.
Completely remove rows with NaNs
The dropna() method will delete every row in which any of the variables contain an NaN. For some datasets this may be acceptable. You will need to take care that you have enough rows left for your analysis to have meaning.
df_SAFI = pd.read_csv("SAFI_results.csv")
print(df_SAFI.shape)
df_SAFI.dropna(inplace=True)
print(df_SAFI.shape)
(131, 55)
(0, 55)
Because there are variables in the SAFI dataset which are all NaN, using the dropna() method effectively deletes all of the rows from the Dataframe, probably not what you wanted. Instead we can use the notnull() method as a row selection criterion and delete the rows where a specific variable has NaN values.
df_SAFI = pd.read_csv("SAFI_results.csv")
print(df_SAFI.shape)
df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
print(df_SAFI.shape)
(131, 55)
(39, 55)
Replace NaN with a value of our choice
The E19_period_use variable answers the question: "For how many years have you been irrigating the land?". In some cases the land is not irrigated, and these are represented by NaN in the dataset. So when we run
df_SAFI['E19_period_use'].describe()
we get a count value of 92 and all of the other statistics are based on this count value.
Now suppose that instead of NaN the interviewer had entered a value of 0, to indicate that the land which is not irrigated has been irrigated for 0 years; technically correct.
To see what happens, we can convert all of the NaN values in the E19_period_use column to 0 with the following code:
df_SAFI['E19_period_use'].fillna(0, inplace=True)
If we now run describe() again, you can see that all of the statistics have been changed because the calculations are now based on a count of 131. Probably not what we would have wanted.
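You can verify this by re-running the describe() call:
df_SAFI['E19_period_use'].describe()   # the count is now 131 and the mean is lower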
Conveniently this allows us to demonstrate our 3rd action.
Replace specific values with NaN
Although we can recognise NaN with methods like isnull() or dropna(), actually creating a NaN value and putting it into a Dataframe requires the numpy module. The following code will replace our 0 values with NaN. We can demonstrate that this has occurred by running describe() again and seeing that we now have our original values back.
import numpy as np
df_SAFI['E19_period_use'].replace(0, np.nan, inplace = True)
df_SAFI['E19_period_use'].describe()
Categorical variables
For categorical variables, numerical statistics don’t make any sense.
For a categorical variable we can obtain a list of the unique values used by the variable by using the unique() method.
df_SAFI = pd.read_csv("SAFI_results.csv")
pd.unique(df_SAFI['C01_respondent_roof_type'])
array(['grass', 'mabatisloping', 'mabatipitched'], dtype=object)
Knowing all of the unique values is useful, but what is more useful is knowing how many occurrences of each there are. In order to do this we can use the groupby method.
Having performed the groupby() we can then describe() the results. The format is similar to that which we have seen before, except that the 'grouped by' variable appears to the left and there is a set of statistics for each unique value of the variable.
grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
grouped_data.describe()
You can group by more than one variable at a time by providing them as a list.
grouped_data = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])
grouped_data.describe()
You can also obtain individual statistics if you want.
A11_years_farm = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])['A11_years_farm'].count()
A11_years_farm
C01_respondent_roof_type C02_respondent_wall_type
grass burntbricks 22
muddaub 42
sunbricks 9
mabatipitched burntbricks 6
muddaub 3
...
Exercise
- Read in the SAFI_results.csv dataset.
- Get a list of the different C01_respondent_roof_type values.
- Group by C01_respondent_roof_type and describe the results.
- Remove rows with NULL values for E_no_group_count.
- Repeat steps 2 & 3 and compare the results.
Solution
# Steps 1 and 2
import numpy as np
df_SAFI = pd.read_csv("SAFI_results.csv")
print(df_SAFI.shape)
print(pd.unique(df_SAFI['C01_respondent_roof_type']))

# Step 3
grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
grouped_data.describe()

# Steps 4 and 5
df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
print(df_SAFI.shape)
print(pd.unique(df_SAFI['C01_respondent_roof_type']))
grouped_data.describe()
E_no_group_count is related to whether or not farm plots are irrigated. It has no obvious connection to farm buildings. By restricting the data to non-irrigated plots we have (accidentally?) removed one of the roof types completely.
Key Points
Summarising numerical and categorical variables is a very common requirement
Missing data can interfere with how statistical summaries are calculated
Missing data can be replaced or created depending on requirement
Summarising or aggregation can be done over single or multiple variables at the same time
Joining Pandas Dataframes
Overview
Teaching: 25 min
Exercises: 10 min
Questions
How can I join two Dataframes with a common key?
Objectives
Understand why we would want to join Dataframes
Know what is needed for a join to be possible
Understand the different types of joins
Understand what the joined results tell us about our data
Joining Dataframes
Why do we want to do this
There are many occasions when we have related data spread across multiple files.
The data can be related to each other in different ways. How they are related and how completely we can join the data from the datasets will vary.
In this episode we will consider different scenarios and show how we might join the data. We will use csv files, and in all cases the first step will be to read the datasets into a pandas Dataframe, from where we will do the joining. The csv files we are using are cut down versions of the SN7577 dataset to make the displays more manageable.
First, let’s download the datafiles. They are listed in the setup page for the lesson. Alternatively, you can download the GitHub repository for this lesson. The data files are in the data directory. If you’re using Jupyter, make sure to place these files in the same directory where your notebook file is.
Scenario 1 - Two data sets containing the same columns but different rows of data
Here we want to add the rows from one Dataframe to the rows of the other Dataframe.
In order to do this we can use the pd.concat() function.
import pandas as pd
df_SN7577i_a = pd.read_csv("SN7577i_a.csv")
df_SN7577i_b = pd.read_csv("SN7577i_b.csv")
Have a quick look at what these Dataframes look like with
print(df_SN7577i_a)
print(df_SN7577i_b)
Id Q1 Q2 Q3 Q4
0 1 1 -1 1 8
1 2 3 -1 1 4
2 3 10 3 2 6
3 4 9 -1 10 10
...
Id Q1 Q2 Q3 Q4
0 1277 10 10 4 6
1 1278 2 -1 5 4
2 1279 2 -1 4 5
3 1280 1 -1 2 3
...
The concat() function appends the rows from the two Dataframes to create the df_all_rows Dataframe. When you list this out you can see that all of the data rows are there; however, there is a problem with the index.
df_all_rows = pd.concat([df_SN7577i_a, df_SN7577i_b])
df_all_rows
We didn’t explicitly set an index for any of the Dataframes we have used. For df_SN7577i_a
and df_SN7577i_b
default
indexes would have been created by pandas. When we concatenated the Dataframes the indexes were also concatenated resulting in duplicate entries.
This is really only a problem if you need to access a row by its index. We can fix the problem with the following code.
df_all_rows=df_all_rows.reset_index(drop=True)
# or, alternatively, there's the `ignore_index` option in the `pd.concat()` function:
df_all_rows = pd.concat([df_SN7577i_a, df_SN7577i_b], ignore_index=True)
df_all_rows
What if the columns in the Dataframes are not the same?
df_SN7577i_aa = pd.read_csv("SN7577i_aa.csv")
df_SN7577i_bb = pd.read_csv("SN7577i_bb.csv")
df_all_rows = pd.concat([df_SN7577i_aa, df_SN7577i_bb])
df_all_rows
In this case df_SN7577i_aa has no Q4 column and df_SN7577i_bb has no Q3 column. When they are concatenated, the resulting Dataframe has columns for both Q3 and Q4. For the rows corresponding to df_SN7577i_aa the values in the Q4 column are missing and denoted by NaN. The same applies to Q3 for the df_SN7577i_bb rows.
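If you would rather keep only the columns which are common to both Dataframes, concat also accepts a join parameter; a small sketch:
# keep only the columns present in both Dataframes (the default is join='outer')
df_common_cols = pd.concat([df_SN7577i_aa, df_SN7577i_bb], join='inner')
df_common_cols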
Scenario 2 - Adding the columns from one Dataframe to those of another Dataframe
df_SN7577i_c = pd.read_csv("SN7577i_c.csv")
df_SN7577i_d = pd.read_csv("SN7577i_d.csv")
df_all_cols = pd.concat([df_SN7577i_c, df_SN7577i_d], axis = 1)
df_all_cols
We use the axis=1 parameter to indicate that it is the columns that need to be joined together. Notice that the Id column appears twice, because it was a column in each dataset. This is not particularly desirable, but also not necessarily a problem. However, there are better ways of combining columns from two Dataframes which avoid this problem.
Scenario 3 - Using merge to join columns
We can join columns from two Dataframes using the merge() function. This is similar to the SQL 'join' functionality.
A detailed discussion of different join types is given in the SQL lesson.
You specify the type of join you want using the how parameter. The default is the inner join, which returns the columns from both tables where the key or common column values match in both Dataframes.
The possible values of the how parameter are shown in the picture below (taken from the Pandas documentation).
The different join types behave in the same way as they do in SQL. In Python/pandas, any missing values are shown as NaN.
In order to merge the Dataframes we need to identify a column common to both of them.
df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner')
df_cd
In fact, if there is only one column with the same name in each Dataframe, it will be assumed to be the one you want to join on; in this example, the Id column.
Leaving the join column to default in this way is not best practice. It is better to explicitly name the column using the on parameter.
df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', on = 'Id')
In many circumstances, the column names that you wish to join on are not the same in both Dataframes, in which case you can use the left_on and right_on parameters to specify them separately.
df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', left_on = 'Id', right_on = 'Id')
Practice with data
- Examine the contents of the SN7577i_aa and SN7577i_bb csv files using Excel or equivalent.
- Using the SN7577i_aa and SN7577i_bb csv files, create a Dataframe which is the result of an outer join using the Id column to join on.
- What do you notice about the column names in the new Dataframe?
- Using shift + tab in Jupyter, examine the possible parameters for the merge() function.
- Re-write the code so that the column names which are common to both files have suffixes indicating the filename from which they come.
- If you add the parameter indicator=True, what additional information is provided in the resulting Dataframe?
Solution
df_SN7577i_aa = pd.read_csv("SN7577i_aa.csv")
df_SN7577i_bb = pd.read_csv("SN7577i_bb.csv")
df_aabb = pd.merge(df_SN7577i_aa, df_SN7577i_bb, how='outer', on = 'Id')
df_aabb
df_SN7577i_aa = pd.read_csv("SN7577i_aa.csv")
df_SN7577i_bb = pd.read_csv("SN7577i_bb.csv")
df_aabb = pd.merge(df_SN7577i_aa, df_SN7577i_bb, how='outer', on = 'Id', suffixes=('_aa', '_bb'), indicator = True)
df_aabb
Key Points
You can join pandas Dataframes in much the same way as you join tables in SQL
The concat() function can be used to concatenate two Dataframes by adding the rows of one to the other
concat() can also combine Dataframes by columns, but the merge() function is the preferred way
The merge() function is equivalent to the SQL JOIN clause. 'left', 'right' and 'inner' joins are all possible
Wide and long data formats
Overview
Teaching: 20 min
Exercises: 15 min
Questions
What are long and wide formats?
Why would I want to change between them?
Objectives
Explain difference between long and wide formats and why each might be used
Illustrate how to change between formats using the melt() and pivot() methods
Wide and long data formats
In the SN7577 dataset that we have been using there is a group of columns which record which daily newspapers each respondent reads. Despite the un-informative names like 'daily1', each column refers to a current UK daily national or local newspaper.
Whether the paper is read or not is recorded using the values of 0 or 1 as a boolean indicator. The advantage of using a column for each paper means that should a respondent read multiple newspapers, all of the required information can still be recorded in a single record.
Recording information in this wide format is not always beneficial when trying to analyse the data.
Pandas provides methods for converting data from wide to long format and from long to wide format.
The SN7577 dataset does not contain a variable that can be used to uniquely identify a row. This is often referred to as a ‘primary key’ field (or column).
A dataset doesn’t need to have such a key. None of the work we have done so far has required it.
When we create a pandas Dataframe by importing a csv file, we have seen that pandas will create an index for the rows. This index can be used a bit like a key field, but as we have seen there can be problems with the index when we concatenate two Dataframes together.
In the version of SN7577 that we are going to use to demonstrate long and wide formats we will add a new variable with the name of ‘Id’ and we will restrict the other columns to those starting with the word ‘daily’.
import pandas as pd
df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
We will create a new Dataframe with a single column of ‘Id’.
# create an 'Id' column
df_papers1 = pd.DataFrame(pd.Series(range(1,1287)), index=None, columns=['Id'])
Using the range function, we can create values of Id starting with 1 and going up to 1286 (remember, the second parameter to range is one past the last value used). We have explicitly coded this value because we knew how many rows were in the dataset. If we didn't, we could have used
len(df_SN7577.index) +1
1287
We will create a 2nd Dataframe, based on SN7577 but containing only the columns starting with the word ‘daily’.
There are several ways of doing this; we'll use the one for which we have already covered all of the prerequisites. We will use the filter method of pandas with its like parameter.
df_papers2 = df_SN7577.filter(like= 'daily')
The value supplied to like
can occur anywhere in the column name to be matched (and therefore selected).
Another way
If we knew the column numbers and they were all continuous we could use the
iloc
method and provide the index values of the range of columns we want.df_papers2 = df_SN7577.iloc[:,118:143]
To create the Dataframe that we will use, we will concatenate the two Dataframes we have created.
df_papers = pd.concat([df_papers1, df_papers2], axis = 1)
print(df_papers.index)
print(df_papers.columns)
RangeIndex(start=0, stop=1286, step=1)
Index(['Id', 'daily1', 'daily2', 'daily3', 'daily4', 'daily5', 'daily6',
'daily7', 'daily8', 'daily9', 'daily10', 'daily11', 'daily12',
'daily13', 'daily14', 'daily15', 'daily16', 'daily17', 'daily18',
'daily19', 'daily20', 'daily21', 'daily22', 'daily23', 'daily24',
'daily25'],
dtype='object')
We use axis=1 because we are joining by columns; the default is joining by rows (axis=0).
From ‘wide’ to ‘long’
To make the displays more manageable we will use only the first seven 'daily' columns.
## using df_papers
daily_list = df_papers.columns[1:8]
df_daily_papers_long = pd.melt(df_papers, id_vars = ['Id'], value_vars = daily_list)
# by default, the new columns created will be called 'variable' which is the name of the 'daily'
# and 'value' which is the value of that 'daily' for that 'Id'. So, we will rename the columns
df_daily_papers_long.columns = ['Id','Daily_paper','Value']
df_daily_papers_long
We now have a Dataframe that we can use groupby on. We want to groupby the Daily_paper column and then sum the Value column.
a = df_daily_papers_long.groupby('Daily_paper')['Value'].sum()
a
Daily_paper
daily1 0
daily2 26
daily3 52
From 'long' to 'wide'
The process can be reversed by using the pivot()
method.
Here we need to indicate which column (or columns) remains fixed (this will become an index in the new Dataframe), which column contains the values that are to become column names, and which column contains the values for those columns.
In our case we want to use the Id
column as the fixed column, the Daily_paper
column contains the column names and the Value
column contains the values.
df_daily_papers_wide = df_daily_papers_long.pivot(index = 'Id', columns = 'Daily_paper', values = 'Value')
We can change our Id
index back to an ordinary column with
df_daily_papers_wide.reset_index(level=0, inplace=True)
Exercise
- Find out how many people take each of the daily newspapers by Title.
- Which titles don’t appear to be read by anyone?
There is a file called Newspapers.csv which lists all of the newspaper Titles along with the corresponding 'daily' value.
Hint: Newspapers.csv contains both daily and Sunday newspapers; you can filter out the Sunday papers with the following code:
df_newspapers = df_newspapers[(df_newspapers.Column_name.str.startswith('daily'))]
Solution
- Read in Newspapers.csv file and keep only the dailies.
df_newspapers = pd.read_csv("Newspapers.csv")
df_newspapers = df_newspapers[(df_newspapers.Column_name.str.startswith('daily'))]
df_newspapers
- Create the df_papers Dataframe as we did before.
import pandas as pd
df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
# create an 'Id' column
df_papers1 = pd.DataFrame(pd.Series(range(1, 1287)), index=None, columns=['Id'])
df_papers2 = df_SN7577.filter(like='daily')
df_papers = pd.concat([df_papers1, df_papers2], axis=1)
df_papers
- Create a list of all of the dailies, one way would be
daily_list = []
for i in range(1, 26):
    daily_list.append('daily' + str(i))
- Pass the list as the value_vars parameter to the melt() method
# use melt to create df_daily_papers_long
df_daily_papers_long = pd.melt(df_papers, id_vars=['Id'], value_vars=daily_list)
# change the column names
df_daily_papers_long.columns = ['Id', 'Daily_paper', 'Value']
- merge the two Dataframes with a left join, because we want all of the newspaper Titles to be included.
df_papers_taken = pd.merge(df_newspapers, df_daily_papers_long, how='left', left_on='Column_name', right_on='Daily_paper')
- Then groupby the 'Title' and sum the 'Value'
df_papers_taken.groupby('Title')['Value'].sum()
Key Points
The melt() method can be used to change from wide to long format.
The pivot() method can be used to change from long to wide format.
Aggregations are best done from data in the long format.
Data visualisation using Matplotlib
Overview
Teaching: 25 min
Exercises: 25 min
Questions
How can I create visualisations of my data?
Objectives
Create simple plots using pandas
Import pyplot from the matplotlib library
Customise plots using pyplot
Plotting in Python
As in many programming languages, there is a wide variety of ways to plot in Python. Some do more of the design work for you, and others let you customise the look of the plots and all of the little details yourself. Pandas has basic plots built in that reduce the amount of syntax needed, if your data is already in a DataFrame.
Matplotlib is a Python graphical library that can be used to produce a variety of different graph types. It is fully controllable down to basic elements and includes a module pylab that is somewhere in between (designed to feel like MATLAB plotting, if you happen to have used that before).
The Matplotlib library can be imported using any of the import techniques we have seen. As Pandas is generally imported with import pandas as pd, you will find that Matplotlib is most commonly imported with import matplotlib.pyplot as plt, where plt is the alias.
In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a graph, we want it to display the graph in a cell in the notebook just like any other result. To do this we use the %matplotlib inline directive.
Although we are using Matplotlib in this episode, Pandas can make use of several other graphical libraries available from within Python, such as ggplot (a Python port of R's ggplot2) and Seaborn. Seaborn has some very powerful features and advanced plot types. One of its most useful features is its formatting.
Plotting with Pandas
The Pandas library contains very tight integration with Matplotlib. There are functions in Pandas that automatically call Matplotlib functions to produce graphs.
To plot with Pandas we have to import it as we have done in past episodes.
To tell Jupyter that when we produce a graph we want it to be displayed in a cell in the notebook just like any other result, we use the %matplotlib inline directive. Without that we would need to call the show() function.
import pandas as pd
%matplotlib inline
We also need data loaded into a DataFrame, and it's helpful to look at a few rows to remember what's there. We are going to use the dataset from the setup page, SAFI_full_shortname.csv. For the data to load, make sure you have that file in the same folder where your Jupyter notebook is running. If the file is not in that folder, you will have to provide the full path.
df = pd.read_csv("SAFI_full_shortname.csv")
df.head()
Next, we can plot a histogram of a variable.
df['years_liv'].hist()
We can change the number of bins to make it look how we would like, for example
df['years_liv'].hist(bins=20)
We can also specify the column as a parameter and a grouping column with the by keyword. There are a lot of keyword arguments available to make the plot look better; we can see some of the most likely ones (as decided by the Pandas developers) by pressing Shift + Tab in Jupyter. Let's try layout, figsize, and sharex.
df.hist(column='years_liv',by='village',layout=(1,3),figsize=(12,3),sharex=True)
Scatter plot
The scatter plot requires the x and y coordinates of each of the points being plotted. We can add a third dimension as different colors with the c
argument.
df.plot.scatter(x='gps_Latitude', y='gps_Longitude', c='gps_Altitude', colormap="viridis", figsize=[4,4])
Exercise
- Make a scatter plot of years_farm vs years_liv and color the points by buildings_in_compound
- Make a bar plot of the mean number of rooms per wall type
Compare the two graphs we have just drawn. How do they differ? Are the differences significant?
Solution
Most importantly, the data in the graphs is the same. There are cosmetic differences in the scale points on the x and y axes and in the width of the bars.
The width of the bars can be changed with a parameter in the ‘bar’ function
plt.bar(range(len(s)), s, width = 0.5) # the default width is 0.8
Extension: try plotting by wall and roof type?
Solution
For the scatter plot:
df.plot.scatter(x = 'years_liv', y = 'years_farm', c = 'buildings_in_compound', colormap = 'viridis')
For the barplot: we first need to calculate the mean value of rooms per wall type, then we can make the plot.
rooms_mean = df.groupby('respondent_wall_type').mean()['rooms']
rooms_mean.plot.bar()
Boxplot
A boxplot provides a simple representation of a variety of statistical qualities of a single set of data values. A common use of the boxplot is to compare the statistical variations across a set of variables.
The variables can be independent series or columns of a Dataframe, using the Pandas plot method.
df.boxplot(by ='village',column=['buildings_in_compound'])
We can make it look prettier with Seaborn, much more easily than fixing components manually with Matplotlib. Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn comes with Anaconda; to make it available in our Python session we need to import it.
import seaborn as sns
sns.boxplot(data = df, x = 'village', y = 'buildings_in_compound')
We can also draw linear models in a plot using lmplot()
from seaborn
, e.g. for years_farm
vs years_liv
per village
.
sns.lmplot(x='years_farm', y='years_liv', data=df, hue='village')
In general, most graphs can be broken down into a series of elements which, although typically related in some way, can all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion.
The labels (if any) on the x and y axis are independent of the data values being represented. The title and the legend are also independent objects within the overall graph.
In Matplotlib you create the graph by providing values for all of the individual components you choose to include.
When you are ready, you call the show
function.
Using this same approach, we can plot two sets of data on the same graph.
We will use a scatter plot to demonstrate some of the available features.
Fine-tuning figures with Matplotlib
If we want to do more advanced or lower-level things with our plots, or customise them further, we need to use Matplotlib directly, not through Pandas. First we need to import it.
As with Pandas, Matplotlib can be imported using any of the import techniques we have seen; it is most commonly imported with import matplotlib.pyplot as plt, where plt is the alias.
For demonstration purposes, we are going to use randomly generated data, using the NumPy library (aliased here as np
).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Generate some data for 2 sets of points.
x1 = pd.Series(np.random.rand(20) - 0.5)
y1 = pd.Series(np.random.rand(20) - 0.5)
x2 = pd.Series(np.random.rand(20) + 0.5)
y2 = pd.Series(np.random.rand(20) + 0.5)
# Add some features
plt.title('Scatter Plot')
plt.ylabel('Range of y values')
plt.xlabel('Range of x values')
# plot the points in a scatter plot
plt.scatter(x1, y1, c='red', label='Red Range') # 'c' parameter is the colour and 'label' is the text for the legend
plt.scatter(x2, y2, c='blue', label='Blue Range')
plt.legend(loc=4) # the locations 1,2,3 and 4 are top-right, top-left, bottom-left and bottom-right
# Show the graph with the two sets of points
plt.show()
In the call to the scatter
method, the label
parameter values are used by the legend.
The c
or color
parameter can be set to any colour Matplotlib recognises. Full details of the available colours are on the Matplotlib website, and the markers section will tell you what markers you can use instead of the default 'dots'. There is also an s (size) parameter which allows you to change the size of the marker.
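For example, a minimal sketch (the marker style and size here are arbitrary choices for illustration):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = pd.Series(np.random.rand(20))
y = pd.Series(np.random.rand(20))
# marker='^' swaps the default dots for triangles; s=80 makes them larger
plt.scatter(x, y, c='green', marker='^', s=80)
plt.show()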
Exercise
In the scatterplot the s parameter determines the size of the dots. s can be a simple numeric value, say s=100, which will produce dots all of the same size. However, you can pass a list of values (or a Pandas Series) to provide sizes for the individual dots. This approach is very common as it allows us to provide an extra variable worth of information on the graph.
Modify the code we used for the scatter plot to include a size value for each of the points in the series being plotted. (The downside is that some of the smaller dots may be completely covered by the larger dots. To try and highlight when this has happened, we can change the opacity of the dots.)
Find out which parameter controls the opacity of the dots (clue: it is not called opacity), add it to your code and set it to a reasonable value.
Solution
# Generate some data for 2 sets of points,
# and additional data for the sizes - suitably scaled
x1 = pd.Series(np.random.rand(20) - 0.5)
y1 = pd.Series(np.random.rand(20) - 0.5)
z1 = pd.Series(np.random.rand(20) * 200)
x2 = pd.Series(np.random.rand(20) + 0.5)
y2 = pd.Series(np.random.rand(20) + 0.5)
z2 = pd.Series(np.random.rand(20) * 200)
# Add some features
plt.title('Scatter Plot')
plt.ylabel('Range of y values')
plt.xlabel('Range of x values')
# plot the points in a scatter plot
plt.scatter(x1, y1, c='red', label='Red Range', s=z1, alpha=0.5)   # 's' parameter is the dot size
plt.scatter(x2, y2, c='blue', label='Blue Range', s=z2, alpha=0.5) # 'alpha' is the opacity
plt.legend(loc=4)
plt.show()
Internally, the Pandas plot method calls the bar method of Matplotlib, providing a set of parameters, including the pandas Series of values, to generate the graph. We can use Matplotlib directly to produce a similar graph; in this case we need to pass two parameters: the bar positions (one for each bar we need) and the Pandas Series holding the values.
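A sketch of that direct call (the Series values here are made up for illustration; in practice you could pass the rooms_mean Series from the earlier solution):
import pandas as pd
import matplotlib.pyplot as plt
# a Series of values to plot, standing in for rooms_mean
s = pd.Series([1.5, 1.8, 2.2], index=['grass', 'muddaub', 'burntbricks'])
# first parameter: the bar positions; second parameter: the bar heights
plt.bar(range(len(s)), s)
plt.xticks(range(len(s)), s.index)  # label each bar with the Series index
plt.show()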
Let’s redo the boxplot we did above:
df.boxplot(column = 'buildings_in_compound', by = 'village')
The automatic title of the plot does not look good; we are missing a title for the y-axis and we do not need the extra x-axis title. We can also remove the gridlines. Let's fix these things using functions from plt. Note: all of the adjustments have to go into the same notebook cell as the plot statement itself.
df.boxplot(column = 'buildings_in_compound', by = 'village')
plt.suptitle('') # remove the automatic title
plt.title('Buildings in compounds per village') # add a title
plt.ylabel('Number of buildings') # add a y-axis title
plt.xlabel('') # remove the x-axis title
plt.grid(None) # remove the grid lines
Saving a graph
If you wish to save your graph as an image you can do so using the plt.savefig()
function. The image can be saved as a pdf, jpg or png file by changing the file extension. plt.savefig()
needs to be called at the end of all your plot statements in the same notebook cell.
df.boxplot(column = 'buildings_in_compound', by = 'village')
plt.suptitle('') # remove the automatic title
plt.title('Buildings in compounds per village') # add a title
plt.ylabel('Number of buildings') # add a y-axis title
plt.xlabel('') # remove the x-axis title
plt.grid(None) # remove the grid lines
plt.savefig('safi_boxplot_buildings.pdf') # save as pdf file
plt.savefig('safi_boxplot_buildings.png', dpi = 150) # save as png file, some extra arguments are provided
Exercise
Revisit your favorite plot we've made so far, or make one with your own data, then:
- add axes labels
- add a title
- add a legend
- save it in two different formats
Extension: try plotting by wall and roof type!
Key Points
Graphs can be drawn directly from Pandas, but it still uses Matplotlib
Different graph types have different data requirements
Graphs are created from a variety of discrete components placed on a 'canvas'; you don't have to use them all
Accessing SQLite Databases
Overview
Teaching: 35 min
Exercises: 25 min
Questions
How can I access database tables using Pandas and Python?
What are the advantages of storing data in a database?
Objectives
Use the sqlite3 module to interact with a SQL database
Access data stored in SQLite using Python
Describe the difference in interacting with data stored as a CSV file versus in SQLite
Describe the benefits of accessing data using a database compared to a CSV file
Introducing the sqlite3 module
SQLite is a relational database system. Despite the 'Lite' in the name, it can handle databases in excess of a terabyte. The 'Lite' part really relates to the fact that it is a 'bare bones' system: it provides the mechanisms to create and query databases via a simple command line interface, but not much else. In the SQL lesson we used a Firefox plugin to provide a GUI (Graphical User Interface) to the SQLite database engine.
In this lesson we will use Python code and the sqlite3 module to access the engine. We can use it to create, delete and query database tables.
In practice we spend a lot of the time querying database tables.
Pandas Dataframe vs SQL table
It is very easy, and often very convenient, to think of SQL tables and pandas Dataframes as similar types of objects. All of the data manipulations, slicing, dicing, aggregations and joins associated with SQL and SQL tables can be accomplished with pandas methods operating on a pandas Dataframe.
The difference is that the pandas Dataframe is held in memory within the Python environment. The SQL table can largely be on disc and when you access it, it is the SQLite database engine which is doing the work. This allows you to work with very large tables which your Python environment may not have the memory to hold completely.
A typical use case for SQLite databases is to hold large datasets: you use SQL commands from Python to slice, dice and possibly aggregate the data within the database system, reducing it to something that Python can comfortably process, and then return the results to a Dataframe.
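A minimal sketch of that workflow (the query itself is illustrative, using the SN7577 table introduced below):
import sqlite3
import pandas as pd
con = sqlite3.connect('SN7577.sqlite')
# the counting is done inside the database; only the small summary comes back to Python
df_counts = pd.read_sql_query("SELECT Q1, COUNT(*) AS n FROM SN7577 GROUP BY Q1", con)
con.close()
print(df_counts)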
Accessing data stored in SQLite using Python
We will illustrate the use of the sqlite3
module by connecting to an SQLite database using both core Python and also using pandas.
The database that we will use is SN7577.sqlite. This contains the data from the SN7577 dataset that we have used in other lessons.
Connecting to an SQLite database
The first thing we need to do is import the sqlite3
library. We will import pandas at the same time for convenience.
import sqlite3
import pandas as pd
We will start looking at the sqlite3 library by connecting to an existing database and returning the results of running a query.
Initially we will do this without using Pandas, and then we will repeat the exercise so that you can see the difference.
The first thing we need to do is to make a connection to the database. An SQLite database is just a file. To make a connection to it we only need to use the sqlite3 connect()
function and specify the database file as the first parameter.
The connection is assigned to a variable. You could use any variable name, but 'con' is quite commonly used for this purpose.
con = sqlite3.connect('SN7577.sqlite')
The next thing we need to do is to create a cursor
for the connection and assign it to a variable. We do this using the cursor
method of the connection object.
The cursor allows us to pass SQL statements to the database, have them executed and then get the results back.
To execute an SQL statement we use the execute()
method of the cursor object.
The only parameter we need to pass to execute() is a string containing the SQL query we wish to execute.
In our example we are passing a literal string; it could equally have been contained in a string variable. The string can contain any valid SQL query. It could also be a valid DDL statement such as a "CREATE TABLE ...". In this lesson, however, we will confine ourselves to querying existing database tables.
cur = con.cursor()
cur.execute("SELECT * FROM SN7577")
<sqlite3.Cursor at 0x115e10d50>
The execute()
method doesn’t actually return any data, it just indicates that we want the data provided by running the SELECT statement.
Exercise
- What happens if you ask for a non-existent table, or a non-existent field within a table, or just make any kind of syntax error?
Solution
cur = con.cursor()
# notice the mistyping of 'SELECT'
cur.execute("SELET * FROM SN7577")
In all cases an error message is returned. The error message is not from Python but from SQLite. It is the same error message that you would have got had you made the same errors in the SQLite plugin.
Before we can make use of the results of the query we need to use the fetchall()
method of the cursor.
The fetchall()
method returns a list. Each item in the list is a tuple containing the values from one row of the table. You can iterate through the items in a tuple in the same way as you would do so for a list.
cur = con.cursor()
cur.execute("SELECT * FROM SN7577")
rows = cur.fetchall()
for row in rows:
print(row)
(1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3
...
The output is the data only; you do not get the column names.
The column names are available from the ‘description’ of the cursor.
colnames = []
for description in cur.description:
colnames.append(description[0])
print(colnames)
['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av', 'Q5avi', 'Q5avii', 'Q5aviii', 'Q5aix', 'Q5ax', 'Q5axi', 'Q5axii', 'Q5axiii', 'Q5axiv', 'Q5axv', 'Q5bi', 'Q5bii', 'Q5biii', 'Q5biv', 'Q5bv', 'Q5bvi', 'Q5bvii', 'Q5bviii', 'Q5bix', 'Q5bx', 'Q5bxi', 'Q5bxii', 'Q5bxiii', 'Q5bxiv', 'Q5bxv', 'Q6', 'Q7a', 'Q7b', 'Q8', 'Q9', 'Q10a', 'Q10b', 'Q10c', 'Q10d', 'Q11a',
...
One reason for using a database is the size of the data involved. Consequently it may not be practical to use fetchall(), as this will return the complete result of your query.
An alternative is to use the fetchone() method, which as the name suggests returns only a single row. The cursor keeps track of where you are in the results of the query, so the next call to fetchone() will return the next record. When there are no more records it will return None.
cur = con.cursor()
cur.execute("SELECT * FROM SN7577")
row = cur.fetchone()
print(row)
row = cur.fetchone()
print(row)
(1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3, 2, 4, 4, 2, 2, 2, 4, 2, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
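If we wanted to process the rows one at a time without knowing how many there are, a common pattern is to loop until fetchone() returns None; a sketch (here the 'processing' is simply counting the rows):
cur = con.cursor()
cur.execute("SELECT * FROM SN7577")
count = 0
row = cur.fetchone()
while row is not None:  # fetchone() returns None when the rows run out
    count += 1          # process each row here; we just count them
    row = cur.fetchone()
print(count)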
Exercise
Can you write code to return the first 5 records from the SN7577 table in two different ways?
Solution
import sqlite3
con = sqlite3.connect('SN7577.sqlite')
cur = con.cursor()
# we can use the SQLite 'limit' clause to restrict the number of rows returned and then use 'fetchall'
cur.execute("SELECT * FROM SN7577 LIMIT 5")
rows = cur.fetchall()
for row in rows:
    print(row)
# we can use 'fetchone' in a for loop
cur.execute("SELECT * FROM SN7577")
for i in range(1, 6):
    print(cur.fetchone())
# a third way would be to use the 'fetchmany()' method
cur.execute("SELECT * FROM SN7577")
rows = cur.fetchmany(5)
for row in rows:
    print(row)
Using Pandas to read a database table
When you use Pandas to read a database table, you connect to the database in the same way as before, using the sqlite3 connect() function and providing the filename of the database file.
Pandas has a method read_sql_query
to which you provide both the string containing the SQL query you wish to run and also the connection variable.
The results from running the query are placed in a pandas Dataframe with the table column names automatically added.
con = sqlite3.connect('SN7577.sqlite')
df = pd.read_sql_query("SELECT * from SN7577", con)
# verify that result of SQL query is stored in the Dataframe
print(type(df))
print(df.shape)
print(df.head())
con.close()
Saving a Dataframe as an SQLite table
There may be occasions when it is convenient to save the data in your pandas Dataframe as an SQLite table, for future use or for access from other systems. This can be done using the to_sql() method.
con = sqlite3.connect('SN7577.sqlite')
df = pd.read_sql_query("SELECT * from SN7577", con)
# select only the rows where the response to Q1 is 10, meaning an undecided voter
df_undecided = df[df.Q1 == 10]
print(df_undecided.shape)
# Write the new Dataframe to a new SQLite table
df_undecided.to_sql("Q1_undecided", con)
# If you want to overwrite an existing SQLite table you can use the 'if_exists' parameter
#df_undecided.to_sql("Q1_undecided", con, if_exists="replace")
con.close()
(335, 202)
Deleting an SQLite table
If you have created tables in an SQLite database, you may also want to delete them.
You can do this by using the execute() method of the sqlite3 cursor.
con = sqlite3.connect('SN7577.sqlite')
cur = con.cursor()
cur.execute('drop table if exists Q1_undecided')
con.close()
Exercise
The code below creates an SQLite table as we have done in previous examples. Run this code to create the table.
con = sqlite3.connect('SN7577.sqlite')
df_undecided = df[df.Q1 == 10]
df_undecided.to_sql("Q1_undecided_v2", con)
con.close()
Try using the following pandas code to delete (drop) the table.
pd.read_sql_query("drop table Q1_undecided_v2", con)
- What happens?
- Run this line of code again. What is different?
- Can you explain the difference and does the table now exist or not?
Solution
When the line of code is run the first time you get an error message: 'NoneType' object is not iterable.
When you run it a second time you get a different error message: DatabaseError: Execution failed on sql 'drop table Q1_undecided_v2': no such table: Q1_undecided_v2
The read_sql_query() method is designed to send the SQL containing your query to the SQLite execution engine, which will execute the SQL and return the output to pandas, which will create a Dataframe from the results.
The SQL statement we sent is valid SQL, but it doesn't return rows from a table; it simply reports success or failure (in dropping the table, in this case). The first time we run it the table is deleted and a response to that effect is returned. The response cannot be converted to a Dataframe, hence the first error message, which is a pandas error.
When we run it the second time, the table has already been dropped, so this time the error message is from SQLite saying the table didn't exist. Pandas recognises that this is an SQLite error message and simply passes it on to the user.
The moral of the story: pandas may be better for getting data returned into a Dataframe, but some things are best left to the sqlite3 functions directly.
Key Points
The SQLite database system is directly available from within Python
A database table and a pandas Dataframe can be considered similar structures
Using pandas to return all of the results from a query is simpler than using sqlite3 alone