This lesson is in the early stages of development (Alpha version)

Data Analysis and Visualization with Python for Social Scientists *alpha*: Glossary

Key Points

Introduction to Python
  • Python is an interpreted language

  • The REPL (Read-Eval-Print loop) allows rapid development and testing of code segments

  • Jupyter notebooks build on the REPL concept and allow code, results and documentation to be maintained together and shared

  • Jupyter Notebook is a complete IDE (Integrated Development Environment)

Python basics
  • The Jupyter environment can be used to write code segments and display results

  • Data types in Python are implicit: they are determined by the value assigned to a variable

  • Basic data types are Integer, Float, String and Boolean

  • Lists and Dictionaries are structured data types

  • Arithmetic uses the standard arithmetic operators; precedence can be changed using brackets

  • Help for built-in functions is available through the help() function; further help and code examples are available online

  • In Jupyter you can get help on function parameters using shift+tab

  • Many functions are in fact methods associated with specific object types
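
The points above can be illustrated with a minimal sketch. The variable names and values are invented for this example and are not taken from the lesson.

```python
# Types are inferred from the values assigned; no declarations are needed.
count = 3              # int
price = 2.5            # float
name = "survey"        # str
is_valid = True        # bool

type_names = [type(v).__name__ for v in (count, price, name, is_valid)]
print(type_names)

# Lists and dictionaries are structured data types.
ages = [34, 27, 45]
respondent = {"id": 1, "village": "Example"}

# Brackets change the default operator precedence.
without_brackets = 2 + 3 * 4      # multiplication is done first
with_brackets = (2 + 3) * 4       # addition is done first

# upper() is a method: a function attached to string objects.
shouted = name.upper()

# help(len) would print the documentation for the built-in len() function.
```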

Python control structures
  • Most programs will require ‘Loops’ and ‘Branching’ constructs.

  • The if, elif, else statements allow for branching in code.

  • The for and while statements allow for looping through sections of code

  • The programmer must provide a condition to end a while loop.
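
As a rough sketch of branching and looping working together (the values and labels below are made up for illustration):

```python
# A for loop visits each item of a sequence in turn;
# if / elif / else picks exactly one branch for each item.
labels = []
for value in [-2, 0, 5]:
    if value < 0:
        labels.append("negative")
    elif value == 0:
        labels.append("zero")
    else:
        labels.append("positive")

# A while loop repeats for as long as its condition is True,
# so the body must change something the condition depends on.
countdown = 3
steps = 0
while countdown > 0:
    countdown -= 1    # without this line the loop would never end
    steps += 1
```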

Creating re-usable code
  • Functions are used to create re-usable sections of code

  • Using parameters with functions makes them more flexible

  • You can use functions written by others by importing the libraries containing them into your code
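
One way to picture these points is the sketch below. The function hypotenuse() is a hypothetical example written for this page; math is a module from the Python standard library.

```python
import math  # the standard library's math module supplies sqrt()


def hypotenuse(a, b=1.0):
    """Return the hypotenuse of a right-angled triangle with sides a and b."""
    return math.sqrt(a ** 2 + b ** 2)


result_default = hypotenuse(1.0)        # b falls back to its default of 1.0
result_explicit = hypotenuse(3.0, 4.0)  # both parameters supplied
```

Giving b a default value is only one possible design; it lets the same function be called with one or two arguments.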

Processing data from a file
  • Reading data from files is far more common than program ‘input’ requests or hard coding values

  • Python provides simple means of reading from a text file and writing to a text file

  • Tabular data is commonly recorded in a ‘csv’ file

  • Text files like csv files can be thought of as being a list of strings. Each string is a complete record

  • You can read and write a file one record at a time

  • Python has built-in string methods, such as split(), to parse (split up) records into individual tokens
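
The sketch below shows one way to write and then read a small csv file one record at a time. The file name, column names and values are hypothetical, and a temporary directory is used so that the example is self-contained.

```python
import os
import tempfile

# Hypothetical records; each string is one complete record.
records = ["id,village,members", "1,Alpha,4", "2,Beta,7"]
path = os.path.join(tempfile.mkdtemp(), "example.csv")

with open(path, "w") as f_out:
    for record in records:
        f_out.write(record + "\n")      # write one record at a time

total_members = 0
with open(path, "r") as f_in:
    # The first record holds the column names.
    header = f_in.readline().rstrip("\n").split(",")
    for line in f_in:                   # read one record at a time
        tokens = line.rstrip("\n").split(",")   # split the record into tokens
        total_members += int(tokens[header.index("members")])

print(header, total_members)
```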

Dates and Time
  • Date and Time functions in Python come from the datetime library, which needs to be imported

  • You can use format strings to have dates/times displayed in any representation you like

  • Internally, dates and times are stored in special data structures which allow you to access their component parts
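
A short sketch of parsing, inspecting and formatting a date, assuming an invented timestamp string:

```python
from datetime import datetime

# strptime() parses text into a datetime object according to a format string.
submitted = datetime.strptime("2017-04-02 17:29:08", "%Y-%m-%d %H:%M:%S")

# The component parts are available as attributes.
parts = (submitted.year, submitted.month, submitted.day, submitted.hour)

# strftime() goes the other way and renders the datetime in a chosen format.
day_first = submitted.strftime("%d/%m/%Y")
```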

Processing JSON data
  • JSON is a popular format for transferring data, used by a great many Web-based APIs

  • The JSON data format is very similar to the Python Dictionary structure.

  • The complex structure of a JSON document means that it cannot easily be ‘flattened’ into tabular data

  • We can use Python code to extract values of interest and place them in a csv file
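
As an illustration, the sketch below pulls a few values out of a small nested JSON document and writes them as csv rows. The document and its field names are invented, and an in-memory buffer stands in for a real csv file.

```python
import csv
import io
import json

# A hypothetical JSON document with nested structure.
json_text = """
[
  {"id": 1, "location": {"village": "Alpha"}, "items": ["radio", "bicycle"]},
  {"id": 2, "location": {"village": "Beta"},  "items": []}
]
"""

records = json.loads(json_text)   # a list of Python dictionaries

buffer = io.StringIO()            # stands in for a csv file opened for writing
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["id", "village", "n_items"])
for rec in records:
    # Extract only the values of interest from each nested record.
    writer.writerow([rec["id"], rec["location"]["village"], len(rec["items"])])

csv_text = buffer.getvalue()
```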

Reading data from a file using Pandas
  • pandas is a Python library containing functions and data structures to assist in data analysis

  • pandas data structures are the Series (like a vector) and the DataFrame (like a table)

  • The pandas read_csv function allows you to read an entire csv file into a DataFrame
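
A minimal sketch of read_csv, assuming pandas is installed. The csv content is invented, and io.StringIO stands in for a file on disk; a file path could be passed to read_csv in the same position.

```python
import io

import pandas as pd

# Hypothetical csv content.
csv_text = "id,village,members\n1,Alpha,4\n2,Beta,7\n3,Alpha,5\n"

df = pd.read_csv(io.StringIO(csv_text))   # the whole file becomes one DataFrame
members = df["members"]                   # a single column is a Series
```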

Extracting rows and columns
  • First key point.

Data Aggregation using Pandas
  • Summarising numerical and categorical variables is a very common requirement

  • Missing data can interfere with how statistical summaries are calculated

  • Missing data can be replaced or created depending on requirement

  • Summarising or aggregating can be done over single or multiple variables at the same time
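
The effect of missing data on a summary can be seen in the sketch below. The DataFrame is invented for illustration, and replacing missing values with 0 is only one of several possible choices.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
df = pd.DataFrame({
    "village": ["Alpha", "Alpha", "Beta", "Beta"],
    "members": [4, np.nan, 7, 3],
})

# NaN is skipped by default, so this is the mean of three values.
mean_skipping_nan = df["members"].mean()

# Replacing NaN with 0 changes the result: now four values are averaged.
mean_after_fill = df["members"].fillna(0).mean()

# Summarising a categorical variable.
village_counts = df["village"].value_counts()

# Several aggregations at once, grouped by village.
by_village = df.groupby("village")["members"].agg(["count", "max"])
```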

Joining Pandas DataFrames
  • You can join pandas DataFrames in much the same way as you join tables in SQL

  • The concat() function can be used to concatenate two DataFrames by adding the rows of one to the other.

  • concat() can also combine DataFrames by columns, but the merge() function is the preferred way

  • The merge() function is equivalent to the SQL JOIN clause. ‘left’, ‘right’ and ‘inner’ joins are all possible.
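
A small sketch of concat() and merge(), using three invented DataFrames that share an id column:

```python
import pandas as pd

# Hypothetical tables.
survey_a = pd.DataFrame({"id": [1, 2], "village": ["Alpha", "Beta"]})
survey_b = pd.DataFrame({"id": [3], "village": ["Gamma"]})
incomes = pd.DataFrame({"id": [2, 3, 4], "income": [150, 90, 200]})

# concat() adds the rows of one DataFrame to the other.
stacked = pd.concat([survey_a, survey_b], ignore_index=True)

# merge() joins on a key column, like SQL JOIN.
inner = pd.merge(stacked, incomes, on="id", how="inner")  # ids present in both
left = pd.merge(stacked, incomes, on="id", how="left")    # every row of stacked
```

In the left join, rows of stacked without a match in incomes are kept and their income is filled with NaN.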

Wide and long data formats
  • The melt() method can be used to change from wide to long format

  • The pivot() method can be used to change from the long to wide format

  • Aggregations are best done from data in the long format.
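
The round trip between the two formats might look like the sketch below. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical wide-format data: one column per item.
wide = pd.DataFrame({
    "id": [1, 2],
    "radio": [1, 1],
    "bicycle": [0, 1],
})

# Wide to long: one row per (id, item) pair.
long = wide.melt(id_vars="id", var_name="item", value_name="owned")

# Aggregating is straightforward in the long format.
item_totals = long.groupby("item")["owned"].sum()

# Long back to wide: one column per item again.
wide_again = long.pivot(index="id", columns="item", values="owned")
```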

Data visualisation using Matplotlib
  • Graphs can be drawn directly from pandas, but pandas still uses matplotlib to draw them

  • Different graph types have different data requirements

  • Graphs are created from a variety of discrete components placed on a ‘canvas’; you don’t have to use them all

  • Plotting multiple graphs on a single ‘canvas’ is possible
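
A sketch of two graphs sharing one canvas, assuming pandas and matplotlib are installed. The data are invented, and the "Agg" backend is selected only so the example runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"members": [4, 7, 5, 3, 6], "rooms": [2, 3, 2, 1, 3]})

# One canvas (figure) holding two sets of axes side by side.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# A histogram needs a single numeric variable.
# The pandas plot method draws with matplotlib; ax= places it on our canvas.
df["members"].plot(kind="hist", bins=5, ax=ax_hist)
ax_hist.set_title("Household size")
ax_hist.set_xlabel("members")

# A scatter plot needs two numeric variables; here matplotlib is called directly.
ax_scatter.scatter(df["rooms"], df["members"])
ax_scatter.set_xlabel("rooms")
ax_scatter.set_ylabel("members")

fig.tight_layout()
```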

Accessing SQLite Databases
  • The SQLite database system is directly available from within Python

  • A database table and a pandas DataFrame can be considered similar structures

  • Using pandas to return all of the results from a query is simpler than using sqlite3 alone
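
The two approaches can be compared with the sketch below. The table and its contents are hypothetical, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")   # in-memory database for illustration
con.execute("CREATE TABLE households (id INTEGER, village TEXT, members INTEGER)")
con.executemany(
    "INSERT INTO households VALUES (?, ?, ?)",
    [(1, "Alpha", 4), (2, "Beta", 7), (3, "Alpha", 5)],
)

query = (
    "SELECT village, SUM(members) AS total "
    "FROM households GROUP BY village ORDER BY village"
)

# With sqlite3 alone, the results come back as a list of tuples.
rows = con.execute(query).fetchall()

# With pandas, one call returns a DataFrame with named columns.
df = pd.read_sql_query(query, con)

con.close()
```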

Glossary

0-based indexing
is a way of assigning indices to elements in a sequential, ordered data structure starting from 0, i.e. where the first element of the sequence has index 0.
attribute
a property of an object that can be viewed. It is accessed with a . but without (), for example df.dtypes
boolean
a data type that can be True or False
cast
the process of changing the type of a variable. In Python, the data type names operate as functions for casting, for example int(3.5)
CSV (file)
is an acronym which stands for Comma-Separated Values file. CSV files store tabular data, either numbers, strings, or a combination of the two, in plain text, with columns separated by commas and rows separated by line breaks (newline characters).
database
is an organized collection of data.
DataFrame
is a two-dimensional labeled data structure with columns of (potentially) different type.
data structure
is a particular way of organizing data in memory.
data type
is a particular kind of item that can be assigned to a variable, defined by the values it can take, the programming language in use and the operations that can be performed on it. Examples: int (integer), str (string), float, boolean, list.
docstring
is a recommended documentation string that describes what a Python function does.
faceting
is the act of plotting relationships between set variables in multiple subsets of the data with the results appearing as different panels in the same figure.
float
is a Python data type designed to store positive and negative decimal numbers by means of a floating point representation.
function
is a group of related statements that perform a specific task.
integer
is a Python data type designed to store positive and negative integer numbers.
interactive mode
is an online mode of operation in which the user writes commands directly on the command line one by one and executes them immediately by pressing a key on the keyboard, usually Enter.
join key
is a variable or an array representing the column names over which pandas.DataFrame.join() merges together columns of different data sets.
library
is a set of functions and methods grouped together to perform a specific sort of task.
list
is a Python data structure designed to contain sequences of integers, floats, strings and any combination of the previous. The sequence is ordered and indexed by integers, starting from 0. Elements of a list can be accessed by their index and can be modified.
loop
is a sequence of instructions that is continually repeated until a condition is satisfied.
method
a function that is specific to a type of data. It is accessed via . and requires () to run, for example df.sum()
NaN
is an acronym for Not-a-Number and represents that either a value is missing or the calculation cannot output any meaningful result.
None
is an object that represents no value.
scripting mode
is an offline mode of operation in which the user writes the commands to be executed in a text file (with a .py extension for Python), which is then compiled or interpreted to run the program. Note that Python interprets the script at run time and compiles it to bytecode to speed up execution.
sequential (data structure)
is an ordered group of objects stored in memory which can be accessed specifying their index, i.e. their position, in the structure.
string
is a Python data type designed to store sequences of characters.
tuple
is a Python data structure designed to contain sequences of integers, floats, strings and any combination of the previous. The sequence is ordered and indexed by integers, starting from 0. Elements of a tuple can be accessed by their index but cannot be modified.
variable
a named quantity that can store a value. A variable can store a value of any type, but any given value always has exactly one type.
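
Several of the glossary terms (0-based indexing, list, tuple, cast, NaN and None) can be seen together in the short sketch below. The names and values are invented for illustration.

```python
import math

items = ["a", "b", "c"]       # list: ordered and modifiable
first = items[0]              # 0-based indexing: the first element has index 0
items[0] = "z"                # elements of a list can be changed

fixed = ("a", "b", "c")       # tuple: ordered but not modifiable
try:
    fixed[0] = "z"
    tuple_changed = True
except TypeError:
    tuple_changed = False     # assigning to a tuple element raises TypeError

whole = int(3.5)              # casting a float to an integer drops the fraction
missing = float("nan")        # NaN is a special float value
nan_detected = math.isnan(missing)
nothing = None                # None is an object representing no value
```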

Jupyter Notebook Hints

Esc will take you into command mode where you can navigate around your notebook with arrow keys.

While in command mode:

While in edit mode:

Full Shortcut Listing

Cmd + Shift + P (or Ctrl + Shift + P on Linux and Windows)