I’m going to start out by trying out some linear regression. The hope is that this will allow us to input new data and output the Energy Production. Should be straightforward. The first model regresses Energy Production on Temperature, the second regresses Energy Production on Daylight, and the third regresses Energy Production on both Temperature and Daylight. We’ll start with these and see what we get.
Energy Production on Temperature gives us a model which is not very good. The good news is that the p value is very low at 2.32e-195. The means the model is a good predictor. Technically, it tells us to reject the null hypothesis that the coefficient is 0. However, R-Squared is very low at .07, meaning a high variance in the prediction. This is bad. It doesn’t seem to be very useful because of this.
The Daylight vs Energy Production seems to be slightly better. It also has a very low p value (0), but a slightly higher R Squared (.28). This means that more of the variance can be explained by the model.
So, what happens when we combine both Temperature and Daylight? We get a p value of 0.0, once again. But a marginally improved R-squared value of .37 as compared to the .28 for the Daylight data.
coef | std | err | t | P>|t| | 95% Conf. Int. | |
Intercept | 40.3892 | 7.144 | 5.654 | 0.000 | 26.386 | 54.392 |
Temperature | 5.0511 | 0.124 | 40.856 | 0.000 | 4.809 | 5.293 |
Daylight | 2.6425 | 0.036 | 74.090 | 0.000 | 2.573 | 2.712 |
The next thing I tried was to break the data down by month before doing the regression. The hope being that we could better predict EnergyProduction when we take into account what month it is. So, each month has a separate linear model that it uses. Only Daylight is used to predict. The plot shows different color dots and regression lines for each month. The p values are once again very low. The R Squared values vary to a good degree month to month from a low of .004 in September to .46 in August. This seems to suggest that there could be a large spread in what our model predicts in certain months, while being a much tighter error in other months.
How good the model succeeds will be based on how well it does on new data. The measurement we will use with this will be the MAPE (Mean Absolute Percentage Error). I already broke the data down into training and test. The models were created with the training data and will be measured with the test data. Let’s compare the regression model with Temperature and Daylight vs. the model breaking it down into months.
Model | MAPE Score |
Temperature and Daylight Regression | 15.06 |
Daylight Regression by Month | 12.88 |
And just for fun, I expanded the month by month regression to include both Temperature and Daylight. This gave a very small improvement.
Model | MAPE Score |
Temperature and Daylight by Month | 12.48 |
Creating a separate model for each month that includes both the Temperature and Daylight factors gives the best predictions. Although, it’s only a marginal improvement on the Daylight model for each month, I’ll stick with it, because it is the best and the cost to implement is low enough. Let’s see if we can improve on this score on future analysis!