In machine learning, knowing the type of dataset is very important, because our model selection depends on it. There are two types of dataset:

  1. Linear dataset: if the relationship between X (features) and Y (target) in the dataset can be represented by a straight line, it is a linear dataset.

  2. Non-linear dataset: if the relationship between X and Y cannot be represented by a straight line, it is a non-linear dataset.

We can check this in two ways.

1. Using data visualisation

Here we plot X against Y and try to draw a line through the points. If a straight line follows the points well, the dataset is linear.

import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
# Dataset 1: day-wise shares, likes and reads of posts
dataFrame_1 = pd.read_csv('./non_linear_dataset.csv')

plt.figure()  # start a new figure so the two plots do not overlap
plt.title('Is linear or not check')
plt.ylim(10, 10000)
sns.regplot(x=dataFrame_1['shares'], y=dataFrame_1['likes'])
plt.show()

# Dataset 2: day-wise spend
dataFrame_2 = pd.read_csv('./input/linear_dataset.csv')

plt.figure()
plt.title('Is linear or not check')
sns.regplot(x=dataFrame_2['days'], y=dataFrame_2['spend'])
plt.show()

Looking at the graphs, in dataset1 we are not able to draw a straight line that fits the relationship between X and Y, so dataset1 is not a linear dataset.

But in dataset2 a straight line fits the relationship between X and Y well, so dataset2 is a linear dataset.

2. Using r2_score

Here we fit a Linear Regression model and then check the r2_score. If the r2_score is high (close to 1), the dataset is linear; otherwise it is not.

We are going to use the same datasets, so there is no need to read the files again.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# For dataset1 check r2_score
# X is assumed to be the 'shares' column, the same feature plotted above
X = dataFrame_1[['shares']]
y = dataFrame_1.likes
model_1 = LinearRegression().fit(X, y)
model1_predict = model_1.predict(X)
print('R2 Score is: ', r2_score(y, model1_predict))
R2 Score is:  0.001235016978507808
# here we can see the r2_score is very low, which means the dataset is not linear
# For dataset2 check r2_score
# X2 is assumed to be the 'days' column, the same feature plotted above
X2 = dataFrame_2[['days']]
y2 = dataFrame_2.spend
model_2 = LinearRegression().fit(X2, y2)
model2_predict = model_2.predict(X2)
print('R2 Score is: ', r2_score(y2, model2_predict))
R2 Score is:  0.9891203611402716
# here the r2_score is very high, which means the dataset is linear

These are two methods to check whether our dataset is linear or not. The best possible r2_score is 1.0.
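To try the r2_score check without any CSV file, here is a minimal sketch in plain Python. It is not the scikit-learn code used above: the helper names (fit_line, r_squared, linear_fit_r2) and the two small synthetic datasets are made up for illustration, and the line is fitted with the standard closed-form least-squares formula for a single feature.

```python
def fit_line(xs, ys):
    """Return least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, preds):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def linear_fit_r2(xs, ys):
    """Fit a straight line to (xs, ys) and score it on the same points."""
    slope, intercept = fit_line(xs, ys)
    preds = [slope * x + intercept for x in xs]
    return r_squared(ys, preds)


# Hypothetical data: one exactly linear target, one curved target
xs = list(range(-5, 6))
linear_ys = [3 * x + 2 for x in xs]
curved_ys = [x ** 2 for x in xs]

linear_r2 = linear_fit_r2(xs, linear_ys)
curved_r2 = linear_fit_r2(xs, curved_ys)
print('R2 for y = 3x + 2:', round(linear_r2, 4))
print('R2 for y = x^2:', round(curved_r2, 4))
```

The straight-line target scores at the top of the scale, while the symmetric curve scores at the bottom, because the best straight line through it is flat and explains none of the variation in Y.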

If you have found another method to check, please comment!

What should the r2_score be to say whether the dataset is linear or not?

I am not sure. The r2_score measures how much of the variation in Y the model explains, so a very low score means a linear model does not fit this data.
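For reference, this is the usual definition of the quantity that r2_score computes. The symbols are introduced here and are not from the post itself: y_i are the actual values, \hat{y}_i are the model's predictions, and \bar{y} is the mean of the actual values.

```latex
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```

A perfect fit makes the numerator zero, which gives 1.0. A model that does no better than always predicting the mean gives 0.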

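Normally I use 50% as the cut-off: if the r2_score is less than 50% (0.5), I treat the dataset as not linear. For a badly fitting model the r2_score can even go negative, which means the model predicts worse than always predicting the mean of Y.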
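The 50% rule of thumb and the negative case can be sketched in a few lines of plain Python. The function names (r_squared, looks_linear) and the tiny example values are hypothetical, and 0.5 is only the personal cut-off mentioned above, not a standard threshold.

```python
def r_squared(y_true, y_pred):
    """R2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def looks_linear(r2, threshold=0.5):
    """Apply the rule of thumb: below the threshold means not linear."""
    return r2 >= threshold


y_true = [1, 2, 3]
good_pred = [1.1, 1.9, 3.2]  # close to the actual values
bad_pred = [3, 2, 1]         # worse than always predicting the mean (2)

good_r2 = r_squared(y_true, good_pred)
bad_r2 = r_squared(y_true, bad_pred)
print('good predictions:', round(good_r2, 2), looks_linear(good_r2))
print('bad predictions:', round(bad_r2, 2), looks_linear(bad_r2))
```

The second set of predictions gives a negative score because its squared errors add up to more than those of simply predicting the mean every time.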