# Check Whether a Dataset Is Linear or Not for ML

Biplab Malakar

In machine learning, knowing the type of dataset is very important, because model selection depends on it. There are two types of dataset:

1. Linear dataset: the relationship between X (features) and Y (target) can be described by a straight line, so the data points lie close to a line drawn through them.

2. Non-linear dataset: no straight line describes the relationship between X and Y.

# It can be checked in two ways

## 1. Using data visualisation

Here we plot X against Y and try to draw a straight line through the points. If a straight line fits the points well, the dataset is linear.

``````
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
``````
``````
# Dataset1: day-wise shares, likes, and reads of posts
# (dataFrame_1 is assumed to be already loaded as a pandas DataFrame)
plt.title('Is linear or not check')
plt.xlabel('shares')
plt.ylabel('likes')
plt.ylim(10, 10000)
sns.regplot(x=dataFrame_1['shares'], y=dataFrame_1['likes'])
plt.show()
``````
``````
# Dataset2: day-wise spend
# (dataFrame_2 is assumed to be already loaded as a pandas DataFrame)
plt.title('Is linear or not check')
plt.xlabel('Days')
plt.ylabel('Spend')
sns.regplot(x=dataFrame_2['days'], y=dataFrame_2['spend'])
plt.show()
``````

Looking at the graphs, the points of dataset1 do not follow a straight line: the fitted line passes through a scattered cloud of points. So dataset1 is not a linear dataset.

But in dataset2 the points lie close to a straight line drawn through them. So dataset2 is a linear dataset.

## 2. Using r2_score

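Here we fit a Linear Regression model and then check its r2_score. If the r2_score is high (close to 1.0), the dataset is linear; if it is low (close to 0), a straight line explains little of the variation in Y and the dataset is not linear.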
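As a rough illustration of what the score measures, here is a minimal sketch in plain Python. The helper names `fit_line` and `r2` and the toy numbers are assumptions made up for this example and are not part of the datasets used in this post. It fits a least-squares line and computes R2 = 1 - SS_res / SS_tot, which is the quantity `r2_score` reports.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r2(ys, predictions):
    """R2 = 1 - SS_res / SS_tot (illustrative helper)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))  # error left after the fit
    ss_tot = sum((y - mean_y) ** 2 for y in ys)                  # total variation in y
    return 1 - ss_res / ss_tot


# Toy data, made up for illustration
xs = [1, 2, 3, 4, 5, 6, 7]
linear_ys = [2 * x + 1 for x in xs]      # points exactly on a straight line
curved_ys = [(x - 4) ** 2 for x in xs]   # symmetric parabola, no linear trend

for name, ys in [("linear", linear_ys), ("curved", curved_ys)]:
    slope, intercept = fit_line(xs, ys)
    predictions = [slope * x + intercept for x in xs]
    print(name, "R2:", round(r2(ys, predictions), 4))
```

For the points on a line the fit is exact and the score is 1.0. For the symmetric parabola the best straight line is flat, so it explains none of the variation and the score is 0.0.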

We are going to use the same datasets, so there is no need to read the files again; they are already loaded.

``````
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
``````
``````
# Check the r2_score for dataset1
X = dataFrame_1[['shares']]   # features must be 2-D, hence the double brackets
y = dataFrame_1['likes']
model_1 = LinearRegression()
model_1.fit(X, y)
model1_predict = model_1.predict(X)
print('R2 Score is: ', r2_score(y, model1_predict))
``````
``````
R2 Score is:  0.001235016978507808
# The r2_score is very low, which means dataset1 is not linear
``````
``````
# Check the r2_score for dataset2
X2 = dataFrame_2[['days']]    # features must be 2-D, hence the double brackets
y2 = dataFrame_2['spend']
model_2 = LinearRegression()
model_2.fit(X2, y2)
model2_predict = model_2.predict(X2)
print('R2 Score is: ', r2_score(y2, model2_predict))
``````
``````
R2 Score is:  0.9891203611402716
# The r2_score is very high, which means dataset2 is linear
``````
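The r2_score check can also be wrapped into a small reusable helper. The sketch below is only one possible way to do it: the function names `linear_r2` and `looks_linear`, the 0.8 threshold, and the synthetic data are all assumptions chosen for illustration, and the generated arrays are stand-ins, not the datasets used in this post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score


def linear_r2(x, y):
    """Fit a straight line to one feature and return the R2 score on the same data."""
    X = np.asarray(x, dtype=float).reshape(-1, 1)  # scikit-learn expects a 2-D feature array
    y = np.asarray(y, dtype=float)
    model = LinearRegression().fit(X, y)
    return r2_score(y, model.predict(X))


def looks_linear(x, y, threshold=0.8):
    """Hypothetical rule of thumb: treat the data as linear if R2 reaches the threshold."""
    return bool(linear_r2(x, y) >= threshold)


# Synthetic stand-ins (assumed shapes, not the original data)
rng = np.random.default_rng(0)
days = np.arange(1, 101)
spend = 50 * days + rng.normal(0, 100, size=days.size)  # straight line plus noise
shares = rng.uniform(0, 1000, size=200)
likes = rng.uniform(10, 10000, size=200)                 # generated independently of shares

print('days vs spend looks linear:  ', looks_linear(days, spend))
print('shares vs likes looks linear:', looks_linear(shares, likes))
```

The threshold is a judgment call rather than a fixed rule, so it is worth adjusting to how noisy your own data is and confirming the result with a plot.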

These are two methods to check whether a dataset is linear or not. The best possible r2_score is 1.0.

If you know another method to check this, please comment!