How to do your first Machine Learning Project?

Karthikeyan Mohan
Feb 15, 2021 · 6 min read

Everything I did to create my first machine learning model.


Background information:

To learn machine learning concepts, there are tons of amazing videos, blogs, and courses online. You can even find guides on how to learn data science in one month, 3 months, 6 months, and so on. Those are excellent if you follow them as suggested. I was confused about which one to choose, and then I found my own way of learning: build real use cases and create the model from scratch, learning each step along the way. This has been working for me so far.

Concepts involved in building a model:

There are a few steps to be carried out before and after building a model. We will understand each step while working on the problem.

  1. Data Exploration.
  2. Data Cleaning.
  3. Model building.
  4. Model evaluation.
  5. Prediction.

Thanks to Python, we can do all these steps easily. We will be using the Pandas, NumPy, Matplotlib, and scikit-learn libraries. Let us take a simple dataset here to build our project. The problem statement is to predict the happiness of a person based on the given dataset.
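A quick note before we start: all the snippets below assume the standard imports shown here (the scikit-learn pieces are imported later, where they are used).

# Standard imports used throughout this project
import pandas as pd               # data loading and manipulation
import numpy as np                # numerical helpers
import matplotlib.pyplot as plt   # plotting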

Data Exploration:

We need to read and understand each column in the dataset and find any patterns or correlations between them.

Input:

#store the dataset in a DataFrame 'income_data'
income_data = pd.read_csv("Data/income_data.csv")
data = income_data.copy() #take a backup of the DataFrame
print(data) #display the data

Output:

     Unnamed: 0    income  happiness
0             1  3.862647   2.314489
1             2  4.979381   3.433490
2             3  4.923957   4.599373
3             4  3.214372   2.791114
4             5  7.196409   5.596398
..          ...       ...        ...
493         494  5.249209   4.568705
494         495  3.471799   2.535002
495         496  6.087610   4.397451
496         497  3.440847   2.070664
497         498  4.530545   3.710193

[498 rows x 3 columns]

So we have 498 records and 3 columns in the dataset, with two variables of interest: income and happiness. We will check the basic statistics of this dataset.

data.describe()

       Unnamed: 0      income   happiness
count  498.000000  498.000000  498.000000
mean   249.500000    4.466902    3.392859
std    143.904482    1.737527    1.432813
min      1.000000    1.506275    0.266044
25%    125.250000    3.006256    2.265864
50%    249.500000    4.423710    3.472536
75%    373.750000    5.991913    4.502621
max    498.000000    7.481521    6.863388

Please ignore the first column's stats; we will remove that column in the cleanup section. We can see the mean, standard deviation, quartiles (including the median), and min/max values of income and happiness.

Let us find the correlation between the two variables.

data.corr()
            Unnamed: 0    income  happiness
Unnamed: 0    1.000000  0.024831   0.029269
income        0.024831  1.000000   0.865634
happiness     0.029269  0.865634   1.000000

Income and happiness are clearly strongly correlated (about 0.87). We will visualize this relationship.

data.plot(x='income', y='happiness', style='o')
plt.title('Income vs Happiness')
plt.xlabel('Income')
plt.ylabel('Happiness')
plt.show()

The scatter plot shows that happiness increases roughly in proportion to income.

Data Cleaning:

Before building our model, we need to transform the variables into numbers, remove non-essential fields, and check that no null values are present.

data.isnull().sum() # Check whether any variable has null values

Unnamed: 0    0
income        0
happiness     0
dtype: int64

There are no null values present, and the variables are already numeric.
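If the dataset had contained missing values or text columns, the cleanup would need an extra step or two. Here is a small, purely hypothetical sketch of what that could look like (not needed for our data):

# Hypothetical cleanup steps; our dataset does not need them
data = data.dropna()  # drop rows with missing values
# data = data.fillna(data.mean(numeric_only=True))  # or fill with the column mean
data = pd.get_dummies(data, drop_first=True)  # turn text columns into numeric dummies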

In our case, the first column is not necessary for the model, so we will drop it.

data.drop(data.columns[[0]], axis=1, inplace=True) #Drop the first column by position
data.head() #Display the first five rows

Output:

     income  happiness
0  3.862647   2.314489
1  4.979381   3.433490
2  4.923957   4.599373
3  3.214372   2.791114
4  7.196409   5.596398
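As a side note, the same column could have been dropped by name instead of by position; this one-liner is equivalent to the drop above (run instead of it, not in addition):

# Equivalent, name-based alternative to the positional drop above
data = data.drop(columns=['Unnamed: 0'])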

Model building:

We have analyzed and cleaned our dataset. Since our target variable is continuous and roughly proportional to the input, we can use linear regression to build the model.

What is Linear regression?

Imagine we have two variables, the number of apples and their price, as in the plot below.

Manually generated Linear Regression plot.

Here we can easily see that the price is directly proportional to the number of apples, and we could write a few lines of code to compute the price for new data. But real data will rarely be this straightforward, so we cannot simply eyeball the slope.

How did we find the pattern in the above example? By reading and processing the data. Similarly, we will feed the data to the machine and let it learn the best-fitting line.

We can go to Canada from India in many ways, but we need to use the optimum route to make our journey more comfortable.

Similarly, the data points are scattered, and we could draw a line anywhere through them. But we need to find the optimum line, the one that minimizes the sum of the squared distances between the line and the actual target values. This method is called least-squares regression.
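To make the arithmetic concrete, here is a tiny NumPy sketch that computes the least-squares slope and intercept directly from made-up apple data (the numbers here are purely hypothetical):

# Hypothetical data: x = number of apples, y = total price
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([10.0, 19.5, 30.5, 40.0, 50.5])
# Closed-form least-squares estimates for slope and intercept
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # roughly 10.15 and -0.35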

Don't worry, though! We do not have to do this by hand: scikit-learn provides a ready-made linear regression model that does it for us.

#Data preparation
input_value = data.iloc[:, :-1].values   # income column as the input feature
target_value = data.iloc[:, 1].values    # happiness column as the target
#Splitting the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(input_value, target_value,
                                                    test_size=0.2, random_state=0)
#Building the model on the training data
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Computing the regression line from the learned coefficients
line = regressor.coef_*input_value + regressor.intercept_
# Plotting the data and the regression line
plt.scatter(input_value, target_value)
plt.plot(input_value, line, color='red')
plt.show()
Model output

Model Evaluation:

This is similar to how we were assessed in school. Suppose there are 10 problems in a chapter, and the teacher knows the answers to all 10. They teach us 8 of them (here, that is training the model) and give us the remaining 2 as an assignment to test whether we understood. Then they evaluate the assignment by comparing our answers with the actual ones.


Similarly, here we split the data into train and test sets, trained the model on the training data, and will evaluate it on the test data by comparing the predictions with the actual results. This tells us how well the model performs.

We will use the Mean Absolute Error method to evaluate the model.

from sklearn import metrics
# Predict happiness for the held-out test data
pred_cv = regressor.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred_cv))

Output :

Mean Absolute Error: 0.6174050608886752
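As a sanity check, Mean Absolute Error is just the average of the absolute differences between the actual and predicted values, so we can reproduce the same number with plain NumPy:

# MAE = mean(|actual - predicted|)
manual_mae = np.mean(np.abs(y_test - pred_cv))
print('Manual MAE:', manual_mae)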

Predicting the value:

We can now run our model on the test data to compare the predicted values with the actual values.

predict_data = pd.DataFrame({'Actual': y_test, 'Predicted': pred_cv})
predict_data

      Actual  Predicted
0   1.775933   3.033184
1   1.877147   2.045445
2   2.465761   1.530116
3   1.560355   2.281021
4   0.898733   1.840929
..       ...        ...
95  3.615471   3.718798
96  4.802092   4.503831
97  4.328417   4.414682
98  5.498147   5.176406
99  1.095999   2.381832
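To see at a glance how close these predictions are, we can also plot the predicted values against the actual ones (a quick sketch; a perfect model would place every point on the red diagonal):

# Scatter of actual vs. predicted happiness on the test set
plt.scatter(y_test, pred_cv)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red')  # ideal line
plt.xlabel('Actual happiness')
plt.ylabel('Predicted happiness')
plt.title('Actual vs. Predicted')
plt.show()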

Now that the model is built, we can give it a new income value as input and predict the happiness score.

income_input = [[1.775933]]
happiness_pred = regressor.predict(income_input)
print("Income :", income_input)
print("Happiness :", happiness_pred)

Income : [[1.775933]]
Happiness : [1.41746106]

GitHub: https://github.com/Karthik1693/Self_Learning_Project/tree/master/Income%20Happiness
