Home Energy: Testing and Measuring of Prediction Models

Before we continue, it is important to clarify a couple of prediction-model concepts. This will allow us to test our models.

This is a side post on how we will evaluate the performance of the models.

We will break the data down into two sets. The first set is used to train our model and give it its parameters. It will include the first 80% of the entries. Think of this as teaching the model. The second set tests the model to see how well it performs on new data. It will consist of the last 20% of the data.
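To make the split concrete, here is a minimal sketch in Python. It assumes the entries are already sorted from oldest to newest; the function name chronological_split and the sample readings are made up for illustration and are not part of the dataset.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split rows (assumed sorted oldest to newest) into train and test sets."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cutoff = int(len(rows) * train_fraction)
    # The first part teaches the model; the last part is held back as new data.
    return rows[:cutoff], rows[cutoff:]


# Hypothetical example: ten monthly readings in time order.
readings = [510, 540, 600, 720, 690, 580, 530, 500, 560, 640]
train, test = chronological_split(readings)
print(len(train), len(test))
```

Taking the first 80% rather than a random 80% keeps the test set strictly later in time than the training set, which is closer to how a forecast would actually be used.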

The performance of the models will be compared using Mean Absolute Percentage Error (MAPE). MAPE is calculated by taking the absolute error of each prediction, dividing it by the actual value, averaging those ratios, and expressing the result as a percentage.
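As a rough sketch of that calculation in Python (the function name mape and the sample numbers are my own, purely for illustration):

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, returned as a percentage."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    # Absolute error of each prediction, scaled by the actual value.
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical values: errors of 10%, 5% and 0% average out to about 5%.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(mape(actual, predicted))
```

One thing to keep in mind is that MAPE breaks down when an actual value is zero, which is why the sketch refuses that case instead of dividing by zero.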


This gives us a standardized measurement with which to compare two models. The score will be calculated on the new test data. A percentage error is an intuitive way of understanding the error statistic: as Minitab’s online resource explains, a MAPE score of 5 means that the forecast is off by 5% on average. Other options for measuring the error of models would have been R-Squared, Mean Absolute Deviation, or Mean Squared Deviation. I won’t go into the details of these except to explain that R-Squared is the percentage of the response variable variation explained by the model, on a scale of 0%-100%. In other words, it is one minus the ratio of the squared deviation between the model and the actual values to the squared deviation between the mean and the actual values.
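For comparison, the R-Squared ratio described above can be sketched as follows. The function name r_squared is hypothetical, and this is the textbook definition rather than any particular library's implementation.

```python
def r_squared(actual, predicted):
    """R-Squared: 1 - (residual sum of squares / total sum of squares)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mean_actual = sum(actual) / len(actual)
    # Squared deviation between the model and the actual values.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared deviation between the mean and the actual values.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R-Squared is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# A perfect model scores 1.0; always predicting the mean scores 0.0.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

On data the model was not trained on, this value can drop below 0 when the model does worse than simply predicting the mean.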
In summary, in order to improve something we have to measure it. So, we want to make sure we have a system in place to systematically measure and compare our models.

Home Energy: Data Exploration

My first project is to predict the energy production of a solar panel. I got my hands on an energy dataset that relates temperature and daylight to the energy production of houses (I’m assuming this is the energy production of solar panels). The data is monthly and spans roughly two years. It comes from Cambridge Energy Lab’s Challenge. The ultimate goal is to be able to predict Energy Production for an entirely new house. To start off, we’re going to do some data exploration to get a better sense of the data.

The data itself looks as follows:

There are 12,000 rows. Let’s go through the columns:

Label: 0-22 index repeated for each house

House: 500 houses represented

Year/Month: 2011/7-2013/5

Temperature: Degrees Celsius

Statistic         Value
Count             11,500
Mean              14.37
Stdev             8.49
Minimum           0.8
25th Percentile   5.3
50th Percentile   13.2
75th Percentile   22.8
Maximum           29

Daylight: Amount of daylight. I’m not sure of the units.

Statistic         Value
Count             11,500
Mean              189.12
Stdev             29.43
Minimum           133.7
25th Percentile   169.1
50th Percentile   181.8
75th Percentile   205.2
Maximum           271.3

EnergyProduction: (kWh)

Statistic         Value
Count             11,500
Mean              612.75
Stdev             142
Minimum           254
25th Percentile   509
50th Percentile   592
75th Percentile   698
Maximum           1254

These preliminary statistics give us a nice idea of the magnitude and range of the data but, quite frankly, nothing we couldn’t have gotten from a quick browse through the raw data.
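For reference, a summary like the tables above can be produced with Python's standard statistics module. This is only a sketch: the function name summarize and the sample values are made up, and the quartile method ("inclusive") is an assumption, so the percentiles may differ slightly from whatever tool produced the numbers above.

```python
import statistics


def summarize(values):
    """Return the same summary statistics as the tables for one column."""
    # quantiles(..., n=4) returns the three cut points between quartiles.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Count": len(values),
        "Mean": statistics.mean(values),
        "Stdev": statistics.stdev(values),  # sample standard deviation
        "Minimum": min(values),
        "25th Percentile": q1,
        "50th Percentile": median,
        "75th Percentile": q3,
        "Maximum": max(values),
    }


# Hypothetical energy readings, not taken from the dataset.
sample = [480.0, 530.0, 610.0, 700.0, 820.0]
for name, value in summarize(sample).items():
    print(f"{name:<16} {value}")
```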

Next, a preliminary scatter plot of Temperature vs. Energy Production:

There is not a very clear linear relationship here. There might be some clustering to be done, as there seem to be different groups the data could be put into.

And graphing the Daylight variable:

The scatterplot with Daylight as the independent variable seems to show data that is ripe for linear regression. This is something that should be investigated.
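As a first look at what that investigation could involve, here is a minimal sketch of ordinary least squares with a single predictor. The function name fit_line and the daylight and energy values are hypothetical, chosen only to show the mechanics; this is not a model fitted to the real data.

```python
def fit_line(x, y):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical daylight and energy values, made up for illustration.
daylight = [150.0, 180.0, 210.0]
energy = [500.0, 600.0, 700.0]
slope, intercept = fit_line(daylight, energy)
print(slope, intercept)
```

The fitted line would then be used to predict the test set, and those predictions scored with MAPE.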

An important part of the data that I originally overlooked is the month. Graphing the Energy Production month by month over a year should give a fair amount of information. This is because weather is very dependent on what month it is, and Energy Production is very dependent on weather. In theory.

There seems to be some correlation month to month. I’m not sure how well a month by itself can tell us what Energy Production to expect; the more useful information might be what type of change to expect from the previous month. I’ll have to think about this and see what can be done.
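One simple way to look at the change from the previous month is to difference each house's series. A small sketch, with a made-up function name and made-up readings:

```python
def month_over_month_change(values):
    """Return the change from each month to the next for one house."""
    # Pair every month with the following one and subtract.
    return [curr - prev for prev, curr in zip(values, values[1:])]


# Hypothetical energy readings for one house, in time order.
house_energy = [500, 550, 530, 610]
print(month_over_month_change(house_energy))
```

The result has one fewer entry than the input, since the first month has no previous month to compare against.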

In summary, this post gives a cursory overview of the data we are working with. The key will be in what steps we take next. There are a great number of different analyses we could perform, from linear regression relating Temperature or Daylight to EnergyProduction, to K-Means clustering to group the houses in an effort to differentiate them. We’ll have to be smart about the next steps, but in the end we need to try something and see how it works.