Check Whether a Dataset Is Linear or Not for ML
In machine learning, knowing the type of dataset is very important, because our model selection depends on it. There are two types of dataset:
Linear dataset: If the relationship between X (features) and Y (target) can be represented by a straight line, it is a linear dataset.
Non-linear dataset: If the relationship between X and Y cannot be represented by a straight line, it is a non-linear dataset.
We can check this in two ways.
1. Using data visualisation
Here we plot Y against X and try to draw a straight line through the points. If a straight line follows the points, the dataset is linear.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Dataset1, which has day-wise post shares, likes, and reads
dataFrame_1 = pd.read_csv('./non_linear_dataset.csv')
print(dataFrame_1.head())
plt.title('Is linear or not check')
plt.xlabel('x=shares')
plt.ylabel('likes')
plt.ylim(10, 10000)
sns.regplot(x=dataFrame_1['shares'], y=dataFrame_1['likes'])
# Dataset2, which has day-wise spend
dataFrame_2 = pd.read_csv('./input/linear_dataset.csv')
print(dataFrame_2.head())
plt.title('Is linear or not check')
plt.xlabel('Days')
plt.ylabel('Spend')
sns.regplot(x=dataFrame_2['days'], y=dataFrame_2['spend'])
Looking at the graphs, in dataset1 we are not able to draw a straight line that follows the points, so dataset1 is not a linear dataset.
But in dataset2 a straight line follows the points closely, so dataset2 is a linear dataset.
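The visual check can be made a little less subjective by looking at the residuals, that is, how far each point sits above or below the fitted straight line. The sketch below is only an illustration: the two datasets are synthetic stand-ins generated with NumPy (an assumption, not the CSV files used in this post), and `residual_means_by_third` is a hypothetical helper name. If the mean residual changes sign from one part of the x range to the next, the points are bending away from the line.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
x = np.linspace(0, 10, 50)

# Synthetic stand-ins (assumptions, not the post's CSV data):
y_straight = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=x.size)  # straight-line trend
y_curved = (x - 5.0) ** 2 + rng.normal(0, 0.5, size=x.size)   # U-shaped trend


def residual_means_by_third(x, y):
    """Fit y = slope * x + intercept by least squares and return the mean
    residual in the left, middle, and right third of the sorted x values."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    slope, intercept = np.polyfit(x_sorted, y_sorted, deg=1)
    residuals = y_sorted - (slope * x_sorted + intercept)
    return [float(part.mean()) for part in np.array_split(residuals, 3)]


straight_means = residual_means_by_third(x, y_straight)
curved_means = residual_means_by_third(x, y_curved)

# Near-zero means everywhere suggest the line follows the points;
# a positive / negative / positive pattern suggests a curve.
print("straight-line data:", np.round(straight_means, 2))
print("curved data:       ", np.round(curved_means, 2))
```

For the straight-line data the three means stay close to zero, while for the curved data the outer thirds sit above the line and the middle third sits below it.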
2. Using r2_score
Here we fit a Linear Regression model and then check the r2_score. If the r2_score is high (close to 1), the dataset is linear; otherwise it is not linear.
We are going to use the same datasets, so there is no need to read the files again.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# For dataset1, check r2_score
X = dataFrame_1[['shares']]
y = dataFrame_1.likes
model_1 = LinearRegression()
model_1.fit(X, y)
model1_predict = model_1.predict(X)
print('R2 Score is: ', r2_score(y, model1_predict))
R2 Score is: 0.001235016978507808
# Here we can see the r2_score is very low, which means dataset1 is not linear
# For dataset2, check r2_score
X2 = dataFrame_2[['days']]
y2 = dataFrame_2.spend
model_2 = LinearRegression()
model_2.fit(X2, y2)
model2_predict = model_2.predict(X2)
print('R2 Score is: ', r2_score(y2, model2_predict))
R2 Score is: 0.9891203611402716
# Here the r2_score is very high, which means dataset2 is linear
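One caveat with this check: a low score on its own does not say why the straight line fits poorly. Very noisy data that still follows a straight-line trend also gets a low r2_score. A possible way to tell the two cases apart, sketched below under assumptions, is to fit a simple curve as well and see whether the score jumps. This is not the method used above: it swaps in NumPy's `polyfit` for `LinearRegression`, the data is synthetic, and the helper names `r2` and `poly_fit_r2` are made up for the example.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def poly_fit_r2(x, y, degree):
    """R2 of a least-squares polynomial fit of the given degree,
    scored on the same points it was fitted to."""
    coeffs = np.polyfit(x, y, deg=degree)
    return r2(y, np.polyval(coeffs, x))


rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
x = np.linspace(0, 10, 200)

# Assumed synthetic data: a straight-line trend buried in heavy noise,
# and a clean U-shaped curve.
y_noisy_line = 0.5 * x + rng.normal(0, 5.0, size=x.size)
y_curve = (x - 5.0) ** 2 + rng.normal(0, 0.5, size=x.size)

for name, y in [("noisy line", y_noisy_line), ("curve", y_curve)]:
    line_r2 = poly_fit_r2(x, y, 1)  # straight line
    quad_r2 = poly_fit_r2(x, y, 2)  # parabola
    print(f"{name}: line R2 = {line_r2:.3f}, quadratic R2 = {quad_r2:.3f}")
```

Both datasets score low with a straight line, but only the curved one improves sharply with the quadratic fit. A small gap between the two scores points to noise rather than a non-linear shape.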
These are the two methods to check whether our dataset is linear or not. The best possible r2_score is 1.0.
If you have found another method to check this, please comment!
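What should the r2_score be to decide whether the dataset is linear or not?
I am not sure. Actually, r2_score measures how much of the variation in Y the model explains, so if it is very low, it means a linear model does not fit this data well.
Normally I use 50% as the cutoff: if the r2_score is less than 50%, I treat the dataset as non-linear. For a non-linear dataset the r2_score can even go negative, which happens when the model predicts worse than always predicting the mean of Y.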
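To see where 1.0, 0, and negative scores come from, here is the standard R2 formula, 1 - SS_res / SS_tot, written out in plain Python. The function name `r2_score_manual` and the numbers are made up for illustration; the post itself uses scikit-learn's `r2_score`.

```python
def r2_score_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot, written out step by step."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up target values

perfect = r2_score_manual(y_true, y_true)               # predictions match exactly
mean_only = r2_score_manual(y_true, [3.0] * 5)          # always predict the mean
reversed_trend = r2_score_manual(y_true, y_true[::-1])  # predictions run the wrong way

print(perfect, mean_only, reversed_trend)  # prints: 1.0 0.0 -3.0
```

A score of 0 means the model is no better than predicting the mean, and anything below 0 means it is worse than that.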