Python for Social Science Workshop - Lesson 1


Jose J Alcocer


April 4, 2023


Intro to Python and Its Workspace


1.0 Setting up the environment

Like R, Python can be used with a variety of Integrated Development Environments (IDEs), and the best choice depends mainly on what you are using Python for. Because we will be using Python for data science, we will use the DataSpell IDE from JetBrains. Released in 2021, this IDE was designed specifically for exploratory data analysis: it combines the interactivity of Jupyter notebooks (a very popular web-based application for programming in Python that is the closest thing to R Markdown) with coding assistance similar to that of PyCharm (an IDE mainly used for professional web development) and RStudio.
The big advantages of using DataSpell over the conventional Anaconda Distribution platform (you can learn more about Anaconda here) can be summarized as:

  • Having support for Jupyter notebooks with the addition of coding assistance;
  • Having data view access (similar to RStudio's dataframe viewing experiences);
  • Having a more streamlined process to open a Jupyter notebook (opening one through Anaconda takes more steps and is slower);
  • Having R script compatibility (even though we will not be using it for R because RStudio is undefeated).

DataSpell is offered for free to university students, so as part of today's agenda, we will be downloading it, setting it up, and taking a quick tutorial on it before getting down to the Python basics. The following section outlines the steps for this activity.

  1. Download Anaconda using this link, then install it on your computer.
  2. Apply for JetBrains' educational license here by filling out the web form.
  3. Wait for the confirmation email, then follow it to finish setting up your JetBrains account.
  4. Download and install DataSpell from the product pack.
  5. Upon opening DataSpell for the first time, complete the quick tutorial offered by the program.


1.0.1 Navigating the Jupyter Notebook Environment

When creating a new file, you can create either a Python script or a Jupyter notebook. For the purposes of this workshop, we will work within the confines of a Jupyter notebook. As a result, it is important to become familiar with the shortcuts that will make your time more bearable when programming in Python.

The following is a shortcut guide meant to give you some insight into the keystrokes you can use to navigate across a notebook.

Blank cells are provided below so that you can try some of the keystrokes out.

  • Use the escape key (esc, top left of the keyboard) to go to command mode from edit mode. If the color to the left of the cell is blue, you're in command mode.
    • Use the ↑ and ↓ keys to navigate from cell to cell
    • Use the "M" key to turn a code cell into a markdown cell
    • Use the "Y" key to turn a markdown cell into a code cell
    • Use the "A" and "B" keys to insert blank cells above and below your current location
    • Use the "C" and "V" keys to copy and paste cells
    • Press "D" twice to delete a cell
    • Press Shift-M to merge two cells into one cell
  • Use the enter / return key to go to edit mode from command mode. If the color to the left of the cell is green, you're in edit mode.
    • Enter some text in a code cell
    • Enter some text in a Markdown cell
  • In either mode, use Ctrl-enter to render the result of either type of cell.
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

1.1 Coding Basics

Like in R, Python has similar data types and building blocks necessary for computing more intensive tasks. This section will cover a few of those basic types of commands.

1.1.1 Numeric Object Creation and Arithmetic Operators

Like in R, Python allows us to assign a range of values to objects, which can then be used to perform a wide set of operations. In this section, we will go over basic object creation and basic arithmetic operations that can be done.

While in R you can create an object using either the <- or = operator, in Python you can only use the latter. You can, however, continue to use # to make notes within your lines of code. For example:

In [1]:
import sys

# Creating basic integer objects
A = 5
B = 10
C = 30

Objects created will be stored for viewing in the 'Jupyter Variables' window that can be accessed by clicking on the bottom right tab that also says 'Jupyter Variables'.


Like in R, you can call objects by either using the print() function or simply typing out the object name.

In [ ]:
# Using print function
print(A)
In [ ]:
# Simply calling the object
A

We are able to use several arithmetic operators within Python. Using the objects created above, we will execute some of the commonly used ones.

In [6]:
# Addition
A+B+C
Out[6]:
45
In [7]:
# Subtraction
C-B
Out[7]:
20
In [8]:
# Multiplication
A*B*C
Out[8]:
1500
In [10]:
# Exponentiation
A**A
Out[10]:
3125
In [9]:
# Division
C/A
Out[9]:
6.0

Dividing with / will always produce a floating-point value (i.e., a number with decimals or in scientific notation). If you want to produce an integer (i.e., a number without decimals) from two integers, you can use the floor division operator //, which rounds the result down.

In [11]:
C//A
Out[11]:
6
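
As a small supplementary sketch (the variable names are illustrative and are not objects used elsewhere in this lesson), the following shows two details of floor division that are easy to miss: it rounds down rather than simply dropping the decimals, which matters for negative numbers, and it returns a float when either input is a float.

```python
# Floor division rounds down (toward negative infinity)
quotient_pos = 7 // 2       # 3
quotient_neg = -7 // 2      # -4, not -3

# If either input is a float, the result is a float as well
quotient_float = 7.0 // 2   # 3.0

print(quotient_pos, quotient_neg, quotient_float)
```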

If you want to convert a floating-point variable or a string into an integer, you can use the int() function. You can use the type() function to determine the type (or class, in Python's lingo) of your object. Alternatively, you can see the class of the object in the Jupyter Variables window.

In [15]:
D = 6.88889

# Using `type()` to show it is a float
print(type(D))

int(D)
<class 'float'>
Out[15]:
6

We use the print() function so that both outputs are displayed above. If you do not use print() when you want to see multiple outputs, the cell will only display the result of the last line it executed.


Conversely, you can use the str() function to convert a number into a string (i.e., text or characters).

In [18]:
str(D)

# Like R, you can chain functions inside each other to save space
type(str(D))
Out[18]:
str
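
The conversions above have a few edge cases worth trying out. The sketch below uses illustrative variable names (not objects from this lesson) to show that int() drops the decimals instead of rounding, and that a string containing a decimal number has to pass through float() before it can become an integer.

```python
value = 6.88889

truncated = int(value)     # int() drops the decimals: 6
rounded = round(value)     # round() goes to the nearest integer: 7
from_text = int('42')      # a string of digits converts directly: 42

# A string containing a decimal point cannot be converted by int() directly
conversion_failed = False
try:
    int('6.88889')
except ValueError:
    conversion_failed = True

# Converting to float first, then to int, does work
via_float = int(float('6.88889'))

print(truncated, rounded, from_text, conversion_failed, via_float)
```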

1.1.2 Strings and Character Object Creation

Operations with strings in Python work slightly differently than in R. Unlike in R, where you cannot use any arithmetic operators with string objects, in Python you can use the multiplication operator. For example:

In [23]:
E = 'Hello'

E*7
Out[23]:
'HelloHelloHelloHelloHelloHelloHello'

As you can see, Python interprets the operation of multiplication as an order to repeat the text stored in the object 'E', seven times.


You can also concatenate string objects together with the addition operator.

In [28]:
# Not accounting for space
print('Python'+'Workshop')

# Accounting for space
print('Python '+'Workshop')
PythonWorkshop
Python Workshop
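
One caveat when concatenating: the addition operator only joins strings with other strings. The following sketch (with made-up variable names) shows the error you get when mixing a string and a number, along with two common ways around it: converting with str(), or using an f-string, which is a formatting feature not covered elsewhere in this lesson.

```python
topic = 'Lesson'
number = 1

# Adding a string and an integer directly raises a TypeError
try:
    topic + number
except TypeError:
    print('Cannot add a string and an integer directly')

# Option 1: convert the number to a string first
label_concat = topic + ' ' + str(number)

# Option 2: use an f-string, which inserts the values for you
label_fstring = f'{topic} {number}'

print(label_concat)
print(label_fstring)
```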

1.1.3 Boolean Objects and Operators

Python can also use Boolean operations, which are tools for checking the truth or falsity of boolean variables (variables that take on the value of either True or False). The and operator checks both objects and returns True only if both of them are True. The or operator, on the other hand, returns True if at least one of the two objects is True. For example:

In [30]:
variable1 = True
variable2 = False

# Checking to see if both booleans are true
print(variable1 and variable2)
# Checking to see if at least one boolean is true
print(variable1 or variable2)

variable3 = True

# We should get an output of 'True' here
print(variable1 and variable3)
False
True
True

Booleans are also used for comparison operators, such as > (greater than), < (less than), >= (greater than or equal), <= (less than or equal), and == (equal to).

In [33]:
print(5>2)
print(2>5)
print(5==2)
True
False
False
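
Comparison and Boolean operators are often combined. The sketch below is an illustration with a made-up variable (age); it also uses the != (not equal to) and not operators, which are part of Python but were not listed above.

```python
age = 25

is_adult = age >= 18               # True
in_range = age >= 18 and age < 65  # True: both comparisons hold
chained = 18 <= age < 65           # same check, written as a chained comparison
not_thirty = age != 30             # True: 25 is not equal to 30
negated = not is_adult             # False: `not` flips a boolean

print(is_adult, in_range, chained, not_thirty, negated)
```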

1.2 Multiple Values in Objects (Like Vectors for Python) - Tuples, Lists, and Dictionaries

Like in R, it is possible to store multiple values within a single object. These vectors come in different forms, depending on what you want to use them for. In a nutshell, these types of vectors are:

  • Tuple - an ordered vector that can hold mixed data types and is not mutable;
  • List - an ordered vector that can hold mixed data types and is mutable;
  • Dictionary - a collection of key-value pairs, where each key maps to a specific value.

1.2.1 Tuples and Lists

We will first discuss tuples and lists, as they are very similar in nature with one key distinction. Tuples and lists are similar in the following ways:

  1. They can both store different types of data (e.g., you can store integers, characters, and floats in either a tuple or a list);
  2. They are both ordered, meaning the order in which you create them is kept;
  3. They are both sequential data types, so you can iterate over each observation;
  4. They can both be accessed via indexing.

The only distinction between both of these data type vectors is that tuples are not able to be changed, whereas lists are. In the context of Python, a tuple is an immutable vector, which basically means that no observation within it can be modified.

Why is this important and when should you use tuples over lists or vice versa?

It all comes down to memory and time efficiency. Because lists are mutable, Python allocates some additional memory to them so that they can be changed at any time. As a result, lists take up a bit more memory than tuples. How does this affect the time component? Because of this extra memory, indexing a list can, in principle, take slightly longer than indexing a tuple. Realistically, while there are some memory and time differences, the gap between them can be small or large depending on the kind of data you are working with. The following code is meant to demonstrate these principles.

In [89]:
### The first set of code is meant to compare how much memo

## Generating data for a tuple | Tuples are created by using parentheses `()` when creating an object or by using the `tuple()` function when creating an object
# importing the random and time packages to help us generate n values and measure elapsed time (sys was imported in an earlier cell)
import random
import time

random.seed(4)
tuple1 = tuple(random.sample(range(1,5000),4000))

## Generating data for a list | Lists are created by using the brackets `[]` when creating an object or by using the `list()` function alike
random.seed(4)
list1 = list(random.sample(range(1,5000),4000))

## Getting memory sizes of the tuple and list
print(sys.getsizeof(tuple1), 'bytes of memory for the tuple object')
print(sys.getsizeof(list1), 'bytes of memory for the list object')

## Getting system time data for indexing the tuple and list
start_time = time.time()
for item in tuple1 :
    aa = tuple1[3999]
end_time = time.time()
print("Lookup time for tuple: ", end_time - start_time)

start_time = time.time()
for item in list1 :
    aa = list1[3999]
end_time = time.time()
print("Lookup time for list: ", end_time - start_time)
32040 bytes of memory for the tuple object
32056 bytes of memory for the list object
Lookup time for tuple:  0.0005550384521484375
Lookup time for list:  0.0005030632019042969

The output above shows that, while there are differences, they are minimal in this example.
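
Single timings like these can vary from run to run. As an alternative sketch (this is not the method used above), Python's standard-library timeit module repeats a statement many times and reports the total elapsed time, which tends to give steadier numbers. The object names and the repetition count of 100,000 are arbitrary choices made for illustration.

```python
import random
import timeit

# Build a tuple and a list holding the same 4000 values
random.seed(4)
values = random.sample(range(1, 5000), 4000)
sample_tuple = tuple(values)
sample_list = list(values)

# Time 100,000 lookups of the last element in each object
tuple_time = timeit.timeit(lambda: sample_tuple[3999], number=100_000)
list_time = timeit.timeit(lambda: sample_list[3999], number=100_000)

print('Lookup time for tuple:', tuple_time)
print('Lookup time for list: ', list_time)
```

The exact numbers depend on your machine, so expect them to differ between runs.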


The following code snippet shows a simpler way to create both of these vectors.

In [82]:
# Creating a tuple
tuple2 = (1,2,3.3,'A','B','C',True)
print(type(tuple2))
# Creating a list
list2 = [1,2,3.3,'A','B','C', True]
print(type(list2))
<class 'tuple'>
<class 'list'>

1.2.2 Indexing, Slicing (Subsetting in R), and Mutating

Like in R, we can index specific observations from either a tuple or a list. We do this with the [] operator. Unlike R, which begins counting at 1, Python starts counting at 0, so it is important to keep this in mind when indexing an object.

In [85]:
# Indexing 4th observation from tuple
print(tuple2[3])

# Indexing 7th observation from list
print(list2[6])
A
True

We can take it a step further and subset specific observations from either of these objects via 'slicing'. This is done with the syntax 'list[x:y]', which tells Python to grab the observations from position x up to (but not including) position y. For example:

In [88]:
# Slicing list to get the values that range from 1 all the way up to 3.3 only
print(list2[0:3])

# Slicing list to get all the values that start at A and go to the end
print(list2[3:])
[1, 2, 3.3]
['A', 'B', 'C', True]
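
Slicing also supports negative positions and a step value. The following is a brief sketch using a separate, illustrative list (example_list) with the same contents as the one above, so it can be run on its own.

```python
example_list = [1, 2, 3.3, 'A', 'B', 'C', True]

# Negative positions count backwards from the end
last_item = example_list[-1]     # True
last_three = example_list[-3:]   # ['B', 'C', True]

# A third number sets the step: here, every second observation
every_other = example_list[::2]  # [1, 3.3, 'B', True]

print(last_item)
print(last_three)
print(every_other)
```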

As mentioned above, the main difference between tuples and lists is that only a list can be changed. We can change the content of a list by indexing the position we want and assigning it a new value. This is called 'mutating'.

In [91]:
# Showing original list
print(list2)

# Mutating list by changing the 4th observation (A) to a new letter (J)
list2[3] = 'J'
list2
[1, 2, 3.3, 'A', 'B', 'C', True]
Out[91]:
[1, 2, 3.3, 'J', 'B', 'C', True]
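
For contrast, trying the same assignment on a tuple fails. The sketch below uses an illustrative tuple (example_tuple) to show the error, plus one possible workaround: building a new tuple out of slices of the old one.

```python
example_tuple = (1, 2, 3.3, 'A', 'B', 'C', True)

# Assigning to a position in a tuple raises a TypeError
try:
    example_tuple[3] = 'J'
except TypeError as err:
    print('Tuples cannot be changed:', err)

# Workaround: create a new tuple from pieces of the original
updated_tuple = example_tuple[:3] + ('J',) + example_tuple[4:]
print(updated_tuple)
```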

1.2.3 Dictionaries

Now that we have covered tuples and lists, we can talk about dictionaries. Dictionaries, while similar to lists, do not rely on a numeric index; instead, each element is assigned a key, and that key is used to look up the corresponding element (its value). Dictionaries can be created either with the dict() function or by creating an object and using braces '{}'. The following examples show both methods.

In [97]:
# Creating a dictionary using `dict()` function
dictionary1 = dict(One='1', Two=2, Three='3', Four=4, Five='5', Apple=True)
print(dictionary1)

# Creating a dictionary using object creation and braces
dictionary2 = {'One':'1', 'Two':2, 'Three':'3', 'Four':4, 'Five':'5', 'Apple':True}
print(dictionary2)
{'One': '1', 'Two': 2, 'Three': '3', 'Four': 4, 'Five': '5', 'Apple': True}
{'One': '1', 'Two': 2, 'Three': '3', 'Four': 4, 'Five': '5', 'Apple': True}

Like indexing tuples and lists, you can index a dictionary using the same '[]' operator and typing the exact name of the key.

In [102]:
# Indexing particular keys from dictionary1
print(dictionary1['Four'])
dictionary1['Apple']
4
Out[102]:
True
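
Dictionaries are mutable, so the same '[]' operator can add or change entries. The sketch below uses a small illustrative dictionary (example_dict); the .get() and .items() methods it relies on are standard dictionary methods that are not otherwise covered in this lesson.

```python
example_dict = {'One': '1', 'Two': 2, 'Apple': True}

# Adding a new key and changing the value of an existing key
example_dict['Six'] = 6
example_dict['Two'] = 'two'

# .get() returns a fallback value instead of raising an error for a missing key
missing = example_dict.get('Seven', 'not found')
print(missing)

# .items() lets you loop over each key together with its value
for key, value in example_dict.items():
    print(key, value)
```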

1.3 Functions

We have been using a variety of functions throughout this lesson to get Python to perform a set of operations. Functions are essentially instructions that the computer takes through arguments in order to produce a desired result. Similar to R, we can use built-in functions or we can create our own.


1.3.1 Common built-in Python functions

The following chunk of code demonstrates some functions that can assist your programming in Python. In addition, the link here showcases a list of built-in Python functions available to its users.

In [114]:
# The `len()` function tells you the length of an object
# Using len() to find the length of 'list1' created earlier
print(len(list1))

# 'round()' function rounds a desired number to the place that you'd like
print(round(3.14159265359, 2))

# 'max()' function returns the largest value of an object
print(max(list1))

# 'min()' function returns the smallest value of an object
print(min(list1))

# 'sum()' function calculates the sum of all values found within an object
list4 = [1,2,3,4,5,6]
print(sum(list4))

# 'list()' function converts an object into a list
tuple3 = (1,2,3,4,5,6,7,8,9,10)
print(type(tuple3))
list3 = list(tuple3)
print(type(list3))

# `tuple()` function does the reverse: it converts an object into a tuple
tuple4 = tuple(list3)
print(type(tuple4))
4000
3.14
4999
2
21
<class 'tuple'>
<class 'list'>
<class 'tuple'>

1.3.2 Creating Functions

Creating functions can be very straightforward, and functions can combine multiple operators and arguments depending on your desired outcome. For a basic example, we will create a function that calculates the mean of a list of numbers it is given.

In [115]:
# Defining a function that takes the mean of a list
def function1(X):
    entire_sum = sum(X)
    entire_length = len(X)
    mean = entire_sum/entire_length
    print(mean)

function1(list1)
2499.2215
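
The function above prints its result, which is fine for a quick look. If you want to store the result and reuse it, a common variation is to return the value instead. The sketch below is one way to do so; the names mean_of and scores are made up for this illustration.

```python
def mean_of(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)

scores = [1, 2, 3, 4, 5, 6]

# Because the function returns a value, we can store it in an object
average = mean_of(scores)
print(average)

# ...and reuse it in later calculations
doubled = average * 2
print(doubled)
```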

1.4 Libraries (Like Packages in R)

While we created a function to calculate the mean above, there are already packages that contain functions for this and many other operations. These packages are called libraries in Python, and they are loaded into your workspace with the import statement (libraries that do not ship with Python must be installed first). You can also assign a library a shorthand name when importing it, to avoid having to spell out its entire name each time you use it. In the following example, we use the import statement and add "as x" right after it so that Python stores that library under the shorthand name assigned.

In [117]:
# Importing packages that will be used for the next lesson
import pandas as pd
import numpy as np

# Importing package that can calculate statistical operations such as finding the mean of an object
import statistics as st

From this point, you can use the imported library to conduct operations of interest. In this case, we will use the mean() function from the 'statistics' package to find the mean of 'list1'.

In [118]:
st.mean(list1)
Out[118]:
2499.2215

You must use the library name or its shorthand every time you call one of its functions; otherwise, Python will not know which function you are referring to.
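
To illustrate, the sketch below uses a small made-up list (numbers) and shows what happens when the prefix is left out. It also shows an alternative import form, from ... import ..., which brings a single function into your workspace so that it can be called without a prefix.

```python
import statistics as st
from statistics import median  # imports only the median() function by name

numbers = [1, 2, 3, 4, 5, 6]

# Calling a function through the library's shorthand name
mean_value = st.mean(numbers)

# median() was imported by name, so it needs no prefix
median_value = median(numbers)

# mean() was never imported by name, so calling it without `st.` fails
name_error_raised = False
try:
    mean(numbers)
except NameError:
    name_error_raised = True

print(mean_value, median_value, name_error_raised)
```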