Home Energy: Linear Regression

I’m going to start out by trying out some linear regression. The hope is that this will allow us to input new data and output the Energy Production. Should be straightforward. The first model regresses Energy Production on Temperature, the second regresses Energy Production on Daylight, and the third regresses Energy Production on both Temperature and Daylight. We’ll start with these and see what we get.

Energy Production on Temperature gives us a model which is not very good. The good news is that the p value is very low at 2.32e-195. The means the model is a good predictor. Technically, it tells us to reject the null hypothesis that the coefficient is 0. However, R-Squared is very low at .07, meaning a high variance in the prediction. This is bad. It doesn’t seem to be very useful because of this.
The Daylight vs Energy Production seems to be slightly better. It also has a very low p value (0), but a slightly higher R Squared (.28). This means that more of the variance can be explained by the model.

So, what happens when we combine both Temperature and Daylight? We get a p value of 0.0, once again. But a marginally improved R-squared value of .37 as compared to the .28 for the Daylight data.

coef std err t P>|t| 95% Conf. Int.
Intercept 40.3892 7.144 5.654 0.000 26.386 54.392
Temperature 5.0511 0.124 40.856 0.000 4.809 5.293
Daylight 2.6425 0.036 74.090 0.000 2.573 2.712

The next thing I tried was to break the data down by month before doing the regression. The hope being that we could better predict EnergyProduction when we take into account what month it is. So, each month has a separate linear model that it uses. Only Daylight is used to predict. The plot shows different color dots and regression lines for each month. The p values are once again very low. The R Squared values vary to a good degree month to month from a low of .004 in September to .46 in August. This seems to suggest that there could be a large spread in what our model predicts in certain months, while being a much tighter error in other months.

How good the model succeeds will be based on how well it does on new data. The measurement we will use with this will be the MAPE (Mean Absolute Percentage Error). I already broke the data down into training and test. The models were created with the training data and will be measured with the test data. Let’s compare the regression model with Temperature and Daylight vs. the model breaking it down into months.

Model   MAPE Score
Temperature and Daylight Regression 15.06
Daylight Regression by Month 12.88

And just for fun, I expanded the month by month regression to include both Temperature and Daylight. This gave a very small improvement.

Model   MAPE Score
Temperature and Daylight by Month 12.48

Creating a separate model for each month that includes both the Temperature and Daylight factors gives the best predictions. Although, it’s only a marginal improvement on the Daylight model for each month, I’ll stick with it, because it is the best and the cost to implement is low enough. Let’s see if we can improve on this score on future analysis!

Home Energy: Testing and Measuring of Prediction Models

Before we continue, it is important to clarify a couple prediction model concepts. This will allow us to test our models.

This is a side post on how we will evaluate the performance of the models.

We will break the data down into 2 sets. The first set is what will be used to train our model and give it the parameters. This will include the first 80% of the entries. Think of this as teaching the model. The 2nd set will test the model to see how well it performs on new data. This will subsist of the last 20% of the data.

The performance of the models will be compared using Mean Absolute Percentage Error (MAPE). MAPE is calculated by taking the mean of the standardized absolute errors and turning it to a percentage.


This will give us a standardized measurement with which to compare two models. This score will be calculated on the new test data. MAPE percentage error is an intuitive way of understanding the error statistic. As Minitab’s online resource explains, a MAPE score of 5 means that the forecast is off by 5% on average. Other options to measure the error of models would have been R-Squared, Mean Absolute Deviation, or Mean Squared Deviation. I won’t go into the details of these except to explain that R-Squared is the percentage of the response variable variation explained by the model on a scale of 0%-100%. In other words, it is a deviation measure between the model and actual values over a deviation measure between the mean and actual values.
In summary, in order to improve something we have to measure it. So, we want to make sure we have a system in place to systematically measure and compare our models.