The R-square value shows how close the data points are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination when there is more than one independent variable. R-square reflects the proportion of the variance of one variable that is predictable from another variable: it is the ratio of explained variation to total variation.
Let’s work through the R-square formula with a small calculation.
Take the data points from the previous article on Ordinary Least Squares: X-axis values (1, 2, 3, 4, 5) and Y-axis values (3, 4, 2, 4, 5).
To simplify the calculation, arrange the values in a table.
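R-squared formula
$$R^2 = \frac{\sum (Y_p - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$
That is, the squared distance of the predicted values from the mean divided by the squared distance of the actual values from the mean, where $Y_p$ is the predicted Y value and $\bar{Y}$ is the mean of Y.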
Let’s calculate the value using the formula above.
Mean of Y = 3.6
SSe = sum of (Yp – mean)^2 = 1.6
SSt = sum of (Y – mean)^2 = 5.2
R-square = SSe / SSt = 1.6 / 5.2 ≈ 0.31
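The same arithmetic can be checked in a few lines of Python. This is a minimal sketch that assumes NumPy is available; the variable names are illustrative, and `np.polyfit` is used here only as a convenient way to get the least-squares line for these five points.

```python
import numpy as np

# Data points from the example above.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 4, 2, 4, 5], dtype=float)

# Fit a straight line by ordinary least squares (degree-1 polynomial).
# np.polyfit returns the coefficients from highest degree to lowest.
slope, intercept = np.polyfit(x, y, 1)
y_pred = slope * x + intercept

y_mean = y.mean()
ss_explained = np.sum((y_pred - y_mean) ** 2)  # SSe
ss_total = np.sum((y - y_mean) ** 2)           # SSt
r_squared = ss_explained / ss_total

print(f"mean of Y = {y_mean:.1f}")
print(f"SSe = {ss_explained:.2f}, SSt = {ss_total:.2f}")
print(f"R-square = {r_squared:.4f}")
```

Running it should reproduce the hand calculation: SSe of 1.6, SSt of 5.2 and a ratio of roughly 0.31.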
The R-square value varies between 0 and 1. A value close to 0 means the regression model explains little of the variation in Y, and a value close to 1 means the model fits the data well. R-square = 1 means every actual data point coincides with its predicted point, which practically never happens with real data because of noise or other factors in the input data points.
If R-square = 0.85, then 85% of the total variation in Y is covered by the model; the remaining 15% of the variation is treated as random and is not reflected by the model.
The R-square value can be increased by adding independent features such as X1, X2, X3 that each contribute to predicting the dependent value Y, but at the same time we need to watch for overfitting and underfitting of the regression model.
Every point that does not fall exactly on the line contributes some error. It is important to understand these errors in order to judge the goodness of fit of the model, i.e. how representative the model is likely to be in general.
- P1 – Original Y data point for given X.
- P2 – Estimated Y value for given X.
- Y bar – Average of all Y values in the data set.
- SST – Total sum of squares: the variation of P1 around Y bar, sum of (P1 – Y bar)^2.
- SSE – Explained sum of squares: sum of (P2 – Y bar)^2, the portion of SST captured by the regression model.
- SSR – Residual sum of squares: sum of (P1 – P2)^2.
Residual error = Actual value – Predicted value
Our goal is to reduce the residual error.
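SSr = Residual error
SSe = Explained error
R-squared formula
$$R^2 = \frac{SSe}{SSt} = 1 - \frac{SSr}{SSt}$$
The second form holds because SSt = SSe + SSr for a least-squares fit with an intercept.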
- The best-fitting model is one where every data point lies on the line, i.e. SSR = 0.
- In that case SSE equals SST, so SSE/SST is 1.
- A poor fit means a large SSR (since the points do not fall on the line); in the worst case SSE = 0 and therefore SSE/SST = 0.
- SSE/SST is called R-square, or the coefficient of determination.
- For a least-squares fit with an intercept, R-square is always between 0 and 1 and is a measure of the utility of the regression model.
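The two extreme cases in the list above can be checked numerically. The sketch below assumes NumPy; `r_square_parts` is a hypothetical helper written for this illustration, and the predictions for the fitted line (Yp = 0.4 * X + 2.4) are the least-squares values computed for the five example points.

```python
import numpy as np

def r_square_parts(y_actual, y_pred):
    """Return (SST, SSE, SSR, R-square) using the naming from the list above."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_bar = y_actual.mean()
    sst = np.sum((y_actual - y_bar) ** 2)   # total variation
    sse = np.sum((y_pred - y_bar) ** 2)     # explained variation
    ssr = np.sum((y_actual - y_pred) ** 2)  # residual variation
    return sst, sse, ssr, sse / sst

y = [3, 4, 2, 4, 5]

# Best case: every prediction equals the actual value, so SSR = 0.
perfect = r_square_parts(y, y)

# Worst case: always predict the mean of Y, so SSE = 0.
mean_only = r_square_parts(y, [np.mean(y)] * len(y))

# Least-squares line for the example points (Yp = 0.4 * X + 2.4).
ols_line = r_square_parts(y, [2.8, 3.2, 3.6, 4.0, 4.4])

for name, parts in [("perfect", perfect), ("mean only", mean_only), ("OLS line", ols_line)]:
    sst, sse, ssr, r2 = parts
    print(f"{name}: SST={sst:.2f} SSE={sse:.2f} SSR={ssr:.2f} R-square={r2:.3f}")
```

For the least-squares line, SSE and SSR add up to SST, which is why SSE/SST and 1 – SSR/SST give the same R-square there.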
The R-square value tells us how much of the variation in the data the model captures. If the R-square value is 75%, the model captures 75% of the variation in the data. In other words, 75% of the variation in Y can be explained by the variation in X.
This is why the coefficient of determination is described as the ratio of explained variation to total variation.
R-square also depends on the number of X features: if we artificially add more features or independent variables (x1, x2, x3, x4, x5, ..., xn), the R-square value on the training data does not decrease and usually increases.
Let’s understand this by comparing the two example models below.
| | Model 1 | Model 2 |
|---|---|---|
| Features (n) | 6 | 8 |
| R-square | 75% | 86% |
| Rows (m) | 10000 | 10000 |
In the example above, both models have the same number of rows, m = 10000. Model 1 has 6 features and model 2 has 8 features; the extra features are produced from model 1's features by squaring them, so that the model can capture more of the data points.
As we can see, the R-square value increased as we increased the number of features, and the error decreased at the same time.
Model 2 has less error because it tries to capture more of the data points: instead of a straight line, the fitted curve becomes parabolic. At the same time, the model might be overfitted and may not be the optimal model when tested on unseen data points. Model 2 captures more of the data points for the given feature values, but as soon as we change the feature values to test its performance, it may not cover the new data points well. In short, we need to balance the R-square value so that we find the best-fit model, one with low error that can still capture most new data points.
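The effect described above can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the data is randomly generated with a fixed seed (it is not the 10000-row data set of the example), the helpers `fit_ols` and `r_square` are hypothetical names, and R-square is computed as 1 – SSR/SST so that it can also be evaluated on held-out rows, where it may even be negative.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for X with an intercept column prepended."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def r_square(X, y, coef):
    """R-square computed as 1 - SSR/SST for the given coefficients."""
    X1 = np.column_stack([np.ones(len(X)), X])
    residual = y - X1 @ coef
    return 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(0)  # fixed seed; the data is synthetic
m = 40
X = rng.normal(size=(m, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=1.0, size=m)

# Artificial features: squares of the originals plus pure-noise columns.
X_big = np.column_stack([X, X ** 2, rng.normal(size=(m, 10))])

train, test = slice(0, 30), slice(30, m)
results = {}
for name, feats in [("original features", X), ("with artificial features", X_big)]:
    coef = fit_ols(feats[train], y[train])
    results[name] = (
        r_square(feats[train], y[train], coef),  # fit on the training rows
        r_square(feats[test], y[test], coef),    # same model on unseen rows
    )
    print(f"{name}: train R-square={results[name][0]:.3f}, "
          f"test R-square={results[name][1]:.3f}")
```

On the training rows the larger model can never score lower, because it contains the smaller model's features. The held-out score carries no such guarantee, which is the overfitting risk the paragraph above describes.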
In Python, we can find the R-square value with the statsmodels.api module by fitting our training data features with the OLS class.
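```python
# Python code to fit OLS; the summary includes R-squared and Adj. R-squared.
import statsmodels.api as sm

# x_train and y_train are the training features and target prepared beforehand.
# Note: sm.OLS does not add an intercept by itself; use sm.add_constant if one is needed.
model = sm.OLS(y_train, x_train[['X1', 'X2', 'X3']])
result = model.fit()
print(result.summary())
```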
Adjusted R-square adjusts R-square for the number of predictors in the model, so it reflects whether the predictors are actually contributing. Adjusted R-square is never greater than R-square, and it can be negative, although it usually is not. It is needed to judge the model properly as features are added. In the example above, we saw that increasing the number of features (Xn) also increased the R-square value. So at what point should we stop adding artificial features?
If the difference between R-square and Adjusted R-square grows, the model is entering the overfitting stage: we are not increasing the number of rows but are trying to learn more features. In short, the ratio n:m is no longer appropriate.
n = Number of independent features.
m = Number of rows (total values for a given feature).
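With n and m defined this way, and assuming the standard definition Adjusted R-square = 1 – (1 – R-square) * (m – 1) / (m – n – 1), the penalty can be sketched as a small function. The name `adjusted_r_square` is hypothetical; the first two calls reuse the figures of the two models compared earlier, and the third uses made-up numbers to show a negative result.

```python
def adjusted_r_square(r_square, m, n):
    """Adjusted R-square for m rows and n independent features."""
    if m - n - 1 <= 0:
        raise ValueError("need more rows than features plus one")
    return 1.0 - (1.0 - r_square) * (m - 1) / (m - n - 1)

# The two models compared earlier (m = 10000 rows each).
model_1 = adjusted_r_square(0.75, m=10000, n=6)
model_2 = adjusted_r_square(0.86, m=10000, n=8)

# Made-up small sample: few rows and a weak fit push the value below zero.
small_sample = adjusted_r_square(0.05, m=10, n=3)

print(model_1, model_2, small_sample)
```

With 10000 rows the adjustment is tiny, so R-square and Adjusted R-square stay close. The gap widens as n grows relative to m, which is the n:m imbalance mentioned above.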
So a good model is one with a small difference between the R-square and Adjusted R-square values.
For example, consider a model with three independent features (Xn), CRIM, ZN and CHAS, that predict the Y values. Fitting OLS with the Python code above gives an R-square value of .380, or 38%. That is low, so this is not an optimal model: it cannot predict the dependent variable Y well from this limited set of independent features. The data points are scattered and the regression line cannot capture them efficiently.
This model is also not the best because the difference between R-square and Adjusted R-square is comparatively large. To get a better predictive model we need to reduce that difference, and to do so we need to add independent features that actually help predict Y.
To increase the model's efficiency, we either ask the data source for more features or artificially produce features (Xn). As the summary output shows, the R-square value increased from .380 to .936, or 93.6%, which is considered a good model: most of the data points lie near the regression line. In other words, this model is good because the difference between R-square and Adjusted R-square is small.
Wrapping up: the R-square value tells how well the regression model fits the given data points, while the Adjusted R-square value tells how important the features (X1, X2, ..., Xn) are to your model.