Check Whether a Dataset Is Linear or Not for ML

In machine learning, knowing the type of dataset is very important because model selection depends on it. There are two types of dataset:

  1. Linear dataset: the relationship between X (features) and Y (target) can be described by a straight line, so a straight-line model fits the data well.

  2. Non-linear dataset: the relationship between X and Y cannot be described by a straight line.

We can check this in two ways.

1. Using data visualisation

Here we plot Y against X and try to draw a straight line through the points. If a straight line follows the points closely, the dataset is linear.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Dataset 1: day-wise shares, likes and reads of posts
dataFrame_1 = pd.read_csv('./non_linear_dataset.csv')
print(dataFrame_1.head())

[Screenshot: first rows of dataFrame_1]

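# regplot draws the scatter points plus a fitted straight line
sns.regplot(x=dataFrame_1['shares'], y=dataFrame_1['likes'])
# Set title, labels and limits after regplot, which labels the axes from the column names
plt.title('Is linear or not check')
plt.xlabel('x=shares')
plt.ylabel('likes')
plt.ylim(10, 10000)
plt.show()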

[Screenshot: regression plot of likes against shares for dataset 1]

# Dataset 2: day-wise spend
dataFrame_2 = pd.read_csv('./input/linear_dataset.csv')
print(dataFrame_2.head())

[Screenshot: first rows of dataFrame_2]

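# regplot draws the scatter points plus a fitted straight line
sns.regplot(x=dataFrame_2['days'], y=dataFrame_2['spend'])
# Set title and labels after regplot, which labels the axes from the column names
plt.title('Is linear or not check')
plt.xlabel('Days')
plt.ylabel('Spend')
plt.show()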

[Screenshot: regression plot of spend against days for dataset 2]

Looking at the graphs, in dataset 1 we cannot draw a straight line that follows the points, so dataset 1 is not a linear dataset.

In dataset 2, however, the points lie close to a straight line, so dataset 2 is a linear dataset.
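The CSV files are not included here, so the visual check can also be tried on synthetic data. The following is only a minimal sketch: the two datasets are made up (a noisy straight line and a noisy sine wave), and the output file name is hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)          # fixed seed so the plots are repeatable
x = np.linspace(0, 10, 100)

# Hypothetical data: a noisy straight line and a noisy sine wave
y_linear = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)
y_wavy = 10.0 * np.sin(x) + rng.normal(scale=1.0, size=x.size)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# regplot draws the scatter points and a fitted straight line on each axis
sns.regplot(x=x, y=y_linear, ax=axes[0], ci=None)
axes[0].set_title('Points follow the line: linear')

sns.regplot(x=x, y=y_wavy, ax=axes[1], ci=None)
axes[1].set_title('Points ignore the line: not linear')

fig.tight_layout()
fig.savefig('linearity_check.png')      # hypothetical output file name
```

In the left panel the points should hug the fitted line, while in the right panel they swing above and below it, which is the same pattern we look for in the real datasets.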

2. Using r2_score

Here we fit a Linear Regression model and then check its r2_score. If the r2_score is high (close to 1), the dataset is linear; otherwise it is not.
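For reference, r2_score compares the squared error of the model with the squared error of a baseline that always predicts the mean of y: R2 = 1 - SS_res / SS_tot. Below is a minimal sketch of that calculation using NumPy only. The helper name `manual_r2` and the numbers are made up for illustration and do not come from the datasets in this post.

```python
import numpy as np

def manual_r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Made-up example data (hypothetical, not from the CSV files)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit a straight line y = slope * x + intercept by least squares
slope, intercept = np.polyfit(x, y, deg=1)
y_pred = slope * x + intercept

r2_manual = manual_r2(y, y_pred)
print('Manual R2:', r2_manual)
```

Because these points lie almost exactly on a line, the score comes out close to 1. A model that does no better than predicting the mean would score 0.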

We use the same datasets, so there is no need to read the files again.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Check r2_score for dataset 1
X = dataFrame_1[['shares']]
y = dataFrame_1.likes
model_1 = LinearRegression()
model_1.fit(X, y)
model1_predict = model_1.predict(X)
print('R2 Score is: ', r2_score(y, model1_predict))
# Output: R2 Score is:  0.001235016978507808
# The r2_score is very low, which means dataset 1 is not linear
# Check r2_score for dataset 2
X2 = dataFrame_2[['days']]
y2 = dataFrame_2.spend
model_2 = LinearRegression()
model_2.fit(X2, y2)
model2_predict = model_2.predict(X2)
print('R2 Score is: ', r2_score(y2, model2_predict))
# Output: R2 Score is:  0.9891203611402716
# The r2_score is very high, which means dataset 2 is linear

These are two methods for checking whether a dataset is linear or not. The best possible r2_score is 1.0.
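To see both ends of the scale without the CSV files, the same check can be run on synthetic data. This is only a sketch under stated assumptions: the helper `linear_fit_r2` and both targets (an exact straight line and a parabola) are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def linear_fit_r2(x, y):
    """Fit a straight line to (x, y) and return the R2 score on the same data."""
    X = np.asarray(x, dtype=float).reshape(-1, 1)  # sklearn expects a 2-D feature array
    y = np.asarray(y, dtype=float)
    model = LinearRegression().fit(X, y)
    return r2_score(y, model.predict(X))

x = np.linspace(-5, 5, 101)

# Hypothetical exactly linear target: a straight line explains it fully
r2_linear = linear_fit_r2(x, 3 * x + 2)

# Hypothetical non-linear target: a parabola that is symmetric around x = 0
r2_parabola = linear_fit_r2(x, x ** 2)

print('Linear data R2:  ', round(r2_linear, 4))
print('Parabola data R2:', round(r2_parabola, 4))
```

The parabola depends on x perfectly, yet its score is close to 0. A low r2_score therefore tells us only that a straight line does not fit, not that X and Y are unrelated.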

If you know another method for checking this, please comment!

Comments (2)

Mervej Raj

What should the r2_score be to tell whether the dataset is linear or not?

Biplab Malakar

Software Engineer, JavaScript Developer, MEAN Developer, Node.js Developer, MERN Developer, Hybrid Mobile App Developer and ML Developer

I am not sure. r2_score measures how well the model fits, that is, how much of the variation in y the model explains. So if the score is very low, it means a linear model is not a good fit for this data.

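Normally I use 0.5 (50%) as a rule of thumb: if the r2_score is less than 0.5, I treat the dataset as non-linear. For non-linear data the r2_score can even go negative, which happens when the model predicts worse than always predicting the mean of y, for example on data it was not fitted on.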
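A negative r2_score usually shows up when the model is scored on data it was not fitted on. Here is a minimal sketch with made-up curved data (y = x^2); the train and test ranges are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical curved data: y = x^2
x_train = np.linspace(0, 2, 21).reshape(-1, 1)   # range used for fitting
x_test = np.linspace(3, 5, 21).reshape(-1, 1)    # unseen range used for scoring
y_train = x_train.ravel() ** 2
y_test = x_test.ravel() ** 2

model = LinearRegression().fit(x_train, y_train)

r2_train = r2_score(y_train, model.predict(x_train))
r2_test = r2_score(y_test, model.predict(x_test))

print('R2 on training data:', r2_train)
print('R2 on unseen data:  ', r2_test)
```

On the data it was fitted on, a LinearRegression with an intercept cannot score below 0, because it can always fall back to predicting the mean. On the unseen range the straight line misses the curve badly, so the score drops below 0.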