Data Wrangling With Python

Setting Up

We will be using the Windows command line to complete this tutorial. Make sure you have Python installed on your system and that its location is listed in the Path environment variable. If this is your first time using Python from the command line, follow these steps:

  1. Search “Python” in the search bar of the Start menu.
  2. Right-click on the icon and press Open file location.
  3. In the folder you are directed to, select IDLE and go to its properties. Again click on Open file location.
  4. Copy the file path and search ‘Edit the system environment variables’ in the search bar of the Start menu. Click on the icon to open the settings pane.
  5. Click on Environment Variables and then double-click on ‘Path’ in the bottom list.
  6. Click New and paste the file path. Press OK and exit the window.
  7. Similarly, navigate to the Scripts folder and add its path to the Environment Variables. This will make sure pip is ready to use.
  8. If you have done everything right, the Python environment should be ready to use.
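
To check the Path setup described above, you can ask Python itself where the commands are found. This is a minimal sketch, assuming it is run with the Python you just installed; the helper name find_on_path is only an illustration and not part of any library.

```python
import shutil


def find_on_path(command):
    """Return the full path of `command` if it is on PATH, else None."""
    return shutil.which(command)


# Report where (or whether) each command is found.
for name in ("python", "pip"):
    location = find_on_path(name)
    if location is None:
        print(f"{name}: not found on PATH")
    else:
        print(f"{name}: {location}")
```

If either command is reported as not found, revisit the Path entries and open a new command line window so the change is picked up.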

Starting Off…

In this project, we will load a dataset (a csv file) from our system’s command line and perform some operations to clean the data. As you might have guessed, we will be using the Pandas library to implement this project.

Install pandas by typing pip install pandas, and then type python to start the Python interpreter, where the remaining commands are entered.
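
If you are not sure whether the install worked, a quick check from inside the interpreter can help. This is a small sketch using only the standard library; the function name is_installed is hypothetical.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


print("pandas installed:", is_installed("pandas"))
```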

Dropping Off Null Rows

The next step is to import pandas before we start weaving our magic. Open the dataset we will be working on with the pd.read_csv() command, writing the file path in the exact same format used here (with doubled backslashes). The print(df.to_string()) command displays the contents of the csv file after we have dropped all the rows with empty values.

pip install pandas
python
import pandas as pd #use of aliasing
df = pd.read_csv('c:\\Users\\Xavier\\Downloads\\data.csv')
df.dropna(inplace = True) #drops all rows with null values
df.drop_duplicates(inplace = True) #used to drop repeated entries
print(df.to_string())
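
If you do not have a csv file at hand yet, the same idea can be tried on a small table built in memory. This is a sketch with made-up names and heights, not data from the tutorial. It uses the non-inplace form of dropna() and drop_duplicates(), which returns a new table and leaves the original untouched.

```python
import pandas as pd

# Hypothetical sample data: one row has a missing height, one row is repeated.
df = pd.DataFrame(
    {
        "Name": ["Asha", "Ben", "Ben", "Chloe"],
        "Height": [165.0, 180.0, 180.0, None],
    }
)

# Drop rows with empty cells first, then drop repeated rows.
cleaned = df.dropna().drop_duplicates()
print(cleaned.to_string())
```

Only the rows for Asha and the first Ben remain in cleaned, while df still holds all four rows.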

Replacing And Dropping Values

In a column where you store the heights of your friends in cm, an entry of 1.82 won’t make sense and it’s pretty likely that you made a typo. You can replace such values with the code snippet shown here.

pip install pandas
python
import pandas as pd
df = pd.read_csv('c:\\Users\\Xavier\\Downloads\\data.csv')
for x in df.index:
  if df.loc[x, "Height"] < 100:
    df.loc[x, "Height"] = 100 #replace with a value of your choice

You can delete such rows by using this snippet instead:

for x in df.index:
  if df.loc[x, "Height"] < 100:
    df.drop(x, inplace = True)
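
The row-by-row checks in this tutorial can also be written with a boolean mask, which is the more common pandas style and avoids an explicit loop. This is a sketch on made-up data; the 100 cm threshold and the replacement value of 100 are placeholders to adjust for your own dataset.

```python
import pandas as pd

# Hypothetical sample data: 1.82 looks like a height typed in metres by mistake.
df = pd.DataFrame({"Name": ["Asha", "Ben", "Chloe"], "Height": [165.0, 1.82, 172.0]})

too_small = df["Height"] < 100  # True for rows that look like typos

# Option 1: replace the suspicious values (100 is an arbitrary placeholder).
replaced = df.copy()
replaced.loc[too_small, "Height"] = 100

# Option 2: drop the suspicious rows instead.
dropped = df[~too_small]

print(replaced.to_string())
print(dropped.to_string())
```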

Replacing Empty Values

You may wish to replace the contents of the empty cells with the mean, median or mode of the column. You can also pass any value of your choice instead of x, for example to replace all the empty cells with 3.

Replace mode()[0] with median() or mean() to get the median or the mean respectively.

pip install pandas
python
import pandas as pd
df = pd.read_csv('c:\\Users\\Xavier\\Downloads\\data.csv')
x = df["ColumnName"].mode()[0] #most frequent value in the column
df["ColumnName"] = df["ColumnName"].fillna(x) #fills the empty cells with x
print(df.to_string())
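
To compare what the mean, median and mode would put into the empty cells, you can compute all three on the same column. This is a sketch on a made-up Height column, not on the tutorial's csv file.

```python
import pandas as pd

# Hypothetical sample data with one empty cell.
df = pd.DataFrame({"Height": [160.0, 170.0, None, 170.0, 190.0]})

column = df["Height"]
fill_values = {
    "mean": column.mean(),      # average of the non-empty cells
    "median": column.median(),  # middle value of the non-empty cells
    "mode": column.mode()[0],   # most frequent value (first one if there is a tie)
}

for label, value in fill_values.items():
    filled = column.fillna(value)  # returns a new column; df is unchanged
    print(label, value, filled.tolist())
```

Here the mean is 172.5 while the median and mode are both 170.0, so the choice of statistic changes what ends up in the empty cell.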

You are advised to create your own dataset to put into practice whatever we have learnt today. Make sure you type your commands one line at a time.