Home Energy: Cyclical Components

We’re going to try a different way of predicting the Energy Production of a home solar panel. The technique I want to look at uses the seasonal_decompose function in the statsmodels Python package.

However, to start it will help to see how the different houses compare across the months. I graphed each house’s energy production using the same x-axis of dates. The median is shown as the thick, red line.

 

As you can see, every house seems to have a similar change in each month at its own level. However, it’s tough to get a trend out of this that would be useful in predicting future months.

Let’s attempt to break it down, though. We can use statsmodels’ seasonal_decompose function to try to get trend, seasonal, and residual components. The function accepts only one time series, so I pass in the median time series. The median is less influenced by outliers than the mean because it takes only the middle value, whereas the mean averages every value, extremes included. In this dataset it doesn’t seem to be much of an issue, though, because there are no large outliers.

Put simply, the seasonal_decompose function uses a convolution filter to take out the trend. A convolution filter is a type of weighted average that looks not only at previous values but also at subsequent ones. After taking out the trend, the function finds the seasonal pattern, which is essentially the average for each period: the average of all the Julys, the average of all the Augusts, and so on. What’s left is the residual, the last component that makes up the data.
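
To make those steps concrete, here is a minimal hand-rolled sketch of the same idea in NumPy. It is not the statsmodels source, just my reading of the procedure: a centered moving average for the trend, per-month averages of the detrended data for the seasonal part, and whatever remains as the residual. The function name classical_decompose and the synthetic series are made up for illustration.

```python
import numpy as np

def classical_decompose(values, period=12):
    """Additive decomposition sketch: returns (trend, seasonal, residual)."""
    y = np.asarray(values, dtype=float)
    n = len(y)
    # Centered moving-average weights. For an even period the two end
    # points get half weight so the window is symmetric around each month.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)
    # mode="valid" only returns positions where the whole window fits,
    # so the first and last `half` positions keep NaN.
    trend[half:n - half] = np.convolve(y, weights, mode="valid")
    detrended = y - trend
    # Seasonal part: average the detrended values at each position in the cycle
    # (all the Julys, all the Augusts, ...), ignoring the NaN edges.
    means = np.array([np.nanmean(detrended[i::period]) for i in range(period)])
    means -= means.mean()  # center the seasonal effects around zero
    seasonal = np.tile(means, n // period + 1)[:n]
    resid = y - trend - seasonal
    return trend, seasonal, resid

# Synthetic monthly series (invented data): upward trend plus a yearly cycle
t = np.arange(48)
y = 500 + 2 * t + 100 * np.sin(2 * np.pi * t / 12)
trend, seasonal, resid = classical_decompose(y)
print(np.isnan(trend).sum())  # 12: six months at each end have no trend value
```

With a 12-month period the window spans 6 months on each side of a point, which is why the ends of the trend come back empty.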

The reason trend information is missing for 6 months at each end is that the convolution filter averages the 6 months before and after each position, so it runs out of data on either side. The useful information that pops out of these plots is the cyclical pattern in the seasonal component. EnergyProduction seems to depend on the seasons. This is similar to what we saw when we broke the data down by months. Let’s see if we can make use of this.

I combined the trend and seasonality components into one model. This model contains a sine function for cyclicality and a linear component for the overall trend. I used the least squares function from scipy to optimize the parameters in the model function. This function requires initial guesses, for which I used: constant = mean, linear component’s slope = slope of a linearly fitted line, amplitude = 3 * standard deviation / sqrt(2), phase = pi/6. Of the guesses I tried, these gave the best parameter estimates from the least squares optimization.
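
A rough sketch of this kind of fit is below. It uses scipy.optimize.leastsq with the initial guesses listed above. The fixed 12-month period, the time index in months, the function names, and the synthetic data are all assumptions for illustration; the actual code is in the GitHub repository under References.

```python
import numpy as np
from scipy.optimize import leastsq

def model(params, t):
    """Constant + linear trend + sine wave. The 12-month period is assumed."""
    const, slope, amp, phase = params
    return const + slope * t + amp * np.sin(2 * np.pi * t / 12 + phase)

def residuals(params, t, y):
    # leastsq minimizes the sum of squares of whatever this returns
    return y - model(params, t)

def fit_cyclical(t, y):
    """Fit the model using the initial guesses described in the post."""
    slope_guess = np.polyfit(t, y, 1)[0]  # slope of a straight-line fit
    guess = [np.mean(y), slope_guess, 3 * np.std(y) / np.sqrt(2), np.pi / 6]
    fitted, _ = leastsq(residuals, guess, args=(t, y))
    return fitted

# Synthetic example (invented numbers): noisy seasonal series with a slight trend
rng = np.random.default_rng(0)
t = np.arange(60, dtype=float)
y = 600 + 1.5 * t + 80 * np.sin(2 * np.pi * t / 12 + 0.4) + rng.normal(0, 10, t.size)
params = fit_cyclical(t, y)
predictions = model(params, t)
```

The fitted amplitude and phase are not unique (a negative amplitude with the phase shifted by pi gives the same curve), so it is safer to compare predictions than raw parameters.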

The red line of the model shows a pretty good fit to the data, visually: a sinusoidal component with a slight upward trend. Let’s see how it does at predicting new test data.

It looks very similar to the training data. This is not surprising, since even the test data has 500 data points. The MAPE score comes out to 19.1. This is quite a bit worse than our previous best prediction of 12.48, most likely because we are only predicting using the time component and not the additional factors of Temperature and Daylight that the previous model included. Still, it makes for a good exercise in manipulating time series, breaking down components of the data, and fitting cyclical data.

References:
-GitHub Code: https://github.com/262globe/Blog.git
http://stackoverflow.com/questions/26470570/seasonal-decomposition-of-time-series-by-loess-with-python
https://searchcode.com/codesearch/view/86129185/
http://statsmodels.sourceforge.net/devel/generated/statsmodels.tsa.filters.filtertools.convolution_filter.html
http://www.cs.cornell.edu/courses/cs1114/2013sp/sections/s06_convolution.pdf
http://stackoverflow.com/questions/16716302/how-do-i-fit-a-sine-curve-to-my-data-with-pylab-and-numpy

Home Energy: Linear Regression

I’m going to start out by trying out some linear regression. The hope is that this will allow us to input new data and output the Energy Production. Should be straightforward. The first model regresses Energy Production on Temperature, the second regresses Energy Production on Daylight, and the third regresses Energy Production on both Temperature and Daylight. We’ll start with these and see what we get.

Energy Production on Temperature gives us a model which is not very good. The good news is that the p value is very low at 2.32e-195. This does not by itself mean the model is a good predictor; technically, it tells us to reject the null hypothesis that the coefficient is 0. However, R-Squared is very low at .07, meaning the model explains only about 7% of the variance in Energy Production. This is bad, and the model doesn’t seem very useful because of it.
Daylight vs. Energy Production seems to be better. It also has a very low p value (0), but a higher R-Squared (.28). This means that more of the variance can be explained by the model.

So, what happens when we combine both Temperature and Daylight? We get a p value of 0.0 once again, but a marginally improved R-Squared value of .37, compared to .28 for the Daylight-only model.

Term            coef  std err        t   P>|t|    95% Conf. Int.
Intercept    40.3892    7.144    5.654   0.000   26.386   54.392
Temperature   5.0511    0.124   40.856   0.000    4.809    5.293
Daylight      2.6425    0.036   74.090   0.000    2.573    2.712
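
For anyone who wants to reproduce this kind of regression, here is a minimal sketch using NumPy’s least squares solver. It is an illustration only, not the code behind the table above: the helper name fit_ols and the synthetic Temperature and Daylight values are invented, and it reports only the coefficients and R-Squared, not p values.

```python
import numpy as np

def fit_ols(X, y):
    """Ordinary least squares with an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coefs
    ss_res = np.sum((y - fitted) ** 2)    # variation the model misses
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
    return coefs, 1 - ss_res / ss_tot

# Synthetic data (invented): energy driven by temperature and daylight plus noise
rng = np.random.default_rng(1)
temperature = rng.uniform(0, 30, 200)
daylight = rng.uniform(120, 250, 200)
energy = 40 + 5 * temperature + 2.6 * daylight + rng.normal(0, 60, 200)
coefs, r2 = fit_ols(np.column_stack([temperature, daylight]), energy)
```

Passing a single column instead of two gives the one-predictor models described earlier.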

The next thing I tried was to break the data down by month before doing the regression, in the hope that we could better predict EnergyProduction when we take into account what month it is. Each month has its own linear model, and only Daylight is used as a predictor. The plot shows different color dots and regression lines for each month. The p values are once again very low. The R-Squared values vary considerably from month to month, from a low of .004 in September to a high of .46 in August. This suggests that the model’s predictions could have a large spread of error in some months and a much tighter error in others.
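
The per-month idea can be sketched as follows: group the rows by month, fit one straight line per group, and look up the right line when predicting. The function names and the tiny example values are hypothetical, and each month needs at least two distinct Daylight values for the fit to work.

```python
import numpy as np
from collections import defaultdict

def fit_by_month(months, daylight, energy):
    """Fit a separate line energy = slope * daylight + intercept per month."""
    groups = defaultdict(list)
    for m, d, e in zip(months, daylight, energy):
        groups[m].append((d, e))
    models = {}
    for m, rows in groups.items():
        d, e = np.array(rows, dtype=float).T
        slope, intercept = np.polyfit(d, e, 1)  # degree-1 fit: [slope, intercept]
        models[m] = (slope, intercept)
    return models

def predict_by_month(models, months, daylight):
    """Predict each row with the line fitted for that row's month."""
    return np.array([models[m][0] * d + models[m][1]
                     for m, d in zip(months, daylight)])

# Tiny made-up example: two months with different Daylight relationships
months = [1, 1, 1, 2, 2, 2]
daylight = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
energy = [3.0, 5.0, 7.0, 10.0, 9.0, 8.0]
models = fit_by_month(months, daylight, energy)
predictions = predict_by_month(models, [1, 2], [4.0, 4.0])
```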

How well the model succeeds will be judged by how it does on new data. The measurement we will use is the MAPE (Mean Absolute Percentage Error). I already broke the data down into training and test sets. The models were created with the training data and will be measured with the test data. Let’s compare the regression model with Temperature and Daylight vs. the model breaking it down into months.

Model   MAPE Score
Temperature and Daylight Regression 15.06
Daylight Regression by Month 12.88

And just for fun, I expanded the month by month regression to include both Temperature and Daylight. This gave a very small improvement.

Model   MAPE Score
Temperature and Daylight by Month 12.48

Creating a separate model for each month that includes both the Temperature and Daylight factors gives the best predictions. Although it’s only a marginal improvement on the per-month Daylight model, I’ll stick with it because it is the best and the cost to implement is low. Let’s see if we can improve on this score in future analysis!

Home Energy: Testing and Measuring of Prediction Models

Before we continue, it is important to clarify a couple of prediction model concepts. This will allow us to test our models.

This is a side post on how we will evaluate the performance of the models.

We will break the data down into 2 sets. The first set is what will be used to train our model and give it the parameters. This will include the first 80% of the entries. Think of this as teaching the model. The 2nd set will test the model to see how well it performs on new data. This will consist of the last 20% of the data.
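
As a small sketch of that split, assuming the entries are already in the order we want to split them in (the function name and example rows are made up):

```python
def chronological_split(rows, train_fraction=0.8):
    """Return (train, test): the first 80% of rows, then the remaining 20%."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Example with ten placeholder entries
train, test = chronological_split(list(range(10)))
print(len(train), len(test))  # 8 2
```

No shuffling happens here, so the test set is always the tail end of the data.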

The performance of the models will be compared using Mean Absolute Percentage Error (MAPE). MAPE is calculated by dividing each absolute error by the actual value, taking the mean, and expressing the result as a percentage.

http://support.minitab.com/en-us/minitab/17/png/measures_of_accuracy.dita_dctm_Chron0900045780196e20_0.png
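
The same calculation can be written as a short function. This is a sketch that assumes none of the actual values are 0, since each error is divided by the actual value; the example numbers are invented.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned as a percentage."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(errors) / len(errors)

# Two made-up readings, each forecast off by 10% of the actual value
score = mape([100, 200], [110, 180])
```

Here both forecasts miss by 10% of the actual value, so the score comes out at about 10.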

This will give us a standardized measurement with which to compare two models. The score will be calculated on the new test data. Expressing error as a percentage makes MAPE an intuitive statistic: as Minitab’s online resource explains, a MAPE score of 5 means that the forecast is off by 5% on average. Other options for measuring model error would have been R-Squared, Mean Absolute Deviation, or Mean Squared Deviation. I won’t go into the details of these except to explain that R-Squared is the percentage of the response variable’s variation explained by the model, on a scale of 0%-100%. In other words, it is one minus the ratio of a deviation measure between the model and actual values to a deviation measure between the mean and actual values.
In summary, in order to improve something we have to measure it. So, we want to make sure we have a system in place to systematically measure and compare our models.