How to Load, Visualize, and Explore a Complex Multivariate Multistep Time Series Forecasting Dataset

Real-world time series forecasting is challenging for a whole host of reasons not limited to problem features such as having multiple input variables, the requirement to predict multiple time steps, and the need to perform the same type of prediction for multiple physical sites.


The EMC Data Science Global Hackathon dataset, or the ‘Air Quality Prediction‘ dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.


In this tutorial, you will discover and explore the Air Quality Prediction dataset that represents a challenging multivariate, multi-site, and multi-step time series forecasting problem.


After completing this tutorial, you will know:



  • How to load and explore the chunk-structure of the dataset.

  • How to explore and visualize the input and target variables for the dataset.

  • How to use the new understanding to outline a suite of methods for framing the problem, preparing the data, and modeling the dataset.


Let’s get started.



How to Load, Visualize, and Explore a Complex Multivariate Multistep Time Series Forecasting Dataset
Photo by H Matthew Howarth, some rights reserved.


Tutorial Overview


This tutorial is divided into seven parts; they are:



  1. Problem Description

  2. Load Dataset

  3. Chunk Data Structure

  4. Input Variables

  5. Target Variables

  6. A Wrinkle With Target Variables

  7. Thoughts on Modeling


Problem Description


The EMC Data Science Global Hackathon dataset, or the ‘Air Quality Prediction‘ dataset for short, describes weather conditions at multiple sites and requires a prediction of air quality measurements over the subsequent three days.


Specifically, weather observations such as temperature, pressure, wind speed, and wind direction are provided hourly for eight days for multiple sites. The objective is to predict air quality measurements for the subsequent three days at multiple sites. The forecast lead times are not contiguous; instead, specific lead times must be forecast over the 72-hour forecast period; they are:

+1, +2, +3, +4, +5, +10, +17, +24, +48, +72
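
For later reference, these required lead times can be captured as a simple Python list (a small convenience, not part of the dataset files):

# forecast lead times (in hours) over the 72-hour forecast period
lead_times = [1, 2, 3, 4, 5, 10, 17, 24, 48, 72]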

Further, the dataset is divided into disjoint but contiguous chunks of data, with eight days of data followed by three days that require a forecast.


Not all observations are available at all sites or chunks and not all output variables are available at all sites and chunks. There are large portions of missing data that must be addressed.


The dataset was used as the basis for a short duration machine learning competition (or hackathon) on the Kaggle website in 2012.


Submissions for the competition were evaluated against the true observations that were withheld from participants and scored using mean absolute error (MAE). Submissions were required to specify the value -1,000,000 in those cases where a forecast was not possible due to missing data. In fact, a template of where to insert missing values was provided and had to be adopted for all submissions (what a pain).


A winning entrant achieved a MAE of 0.21058 on the withheld test set (private leaderboard) using a random forest on lagged observations. A write-up of this solution was published on the Kaggle blog.



In this tutorial, we will explore this dataset in order to better understand the nature of the forecast problem and suggest approaches for how it may be modeled.


Load Dataset


The first step is to download the dataset and load it into memory.


The dataset can be downloaded for free from the Kaggle website. You may have to create an account and log in to download the dataset.


Download the entire dataset (e.g. via ‘Download All’) to your workstation and unzip the archive into a folder named ‘AirQualityPrediction‘ in your current working directory.



You should have five files in the AirQualityPrediction/ folder; they are:



  • SiteLocations.csv

  • SiteLocations_with_more_sites.csv

  • SubmissionZerosExceptNAs.csv

  • TrainingData.csv

  • sample_code.r


Our focus will be the ‘TrainingData.csv‘ file that contains the training dataset, specifically the data in chunks, where each chunk comprises eight contiguous days of observations and target variables.


The test dataset (the remaining three days of each chunk) was not available at the time of writing.


Open the ‘TrainingData.csv‘ file and review the contents. The unzipped data file is relatively small (21 megabytes) and will easily fit into RAM.


Reviewing the contents of the file, we can see that the data file contains a header row.


We can also see that missing data is marked with the ‘NA‘ value, which Pandas will automatically convert to NumPy NaN values.
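
By default, Pandas treats the ‘NA‘ string as a missing value, so no special handling is needed; the load below is equivalent to spelling this out explicitly:

# 'NA' strings are parsed as NaN by default; listing them explicitly is equivalent
from pandas import read_csv
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0, na_values=['NA'])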


We can see that the ‘weekday‘ column contains the day as a string, whereas all other data is numeric.


Below are the first few lines of the data file for reference.

"rowID","chunkID","position_within_chunk","month_most_common","weekday","hour","Solar.radiation_64","WindDirection..Resultant_1","WindDirection..Resultant_1018","WindSpeed..Resultant_1","WindSpeed..Resultant_1018","Ambient.Max.Temperature_14","Ambient.Max.Temperature_22","Ambient.Max.Temperature_50","Ambient.Max.Temperature_52","Ambient.Max.Temperature_57","Ambient.Max.Temperature_76","Ambient.Max.Temperature_2001","Ambient.Max.Temperature_3301","Ambient.Max.Temperature_6005","Ambient.Min.Temperature_14","Ambient.Min.Temperature_22","Ambient.Min.Temperature_50","Ambient.Min.Temperature_52","Ambient.Min.Temperature_57","Ambient.Min.Temperature_76","Ambient.Min.Temperature_2001","Ambient.Min.Temperature_3301","Ambient.Min.Temperature_6005","Sample.Baro.Pressure_14","Sample.Baro.Pressure_22","Sample.Baro.Pressure_50","Sample.Baro.Pressure_52","Sample.Baro.Pressure_57","Sample.Baro.Pressure_76","Sample.Baro.Pressure_2001","Sample.Baro.Pressure_3301","Sample.Baro.Pressure_6005","Sample.Max.Baro.Pressure_14","Sample.Max.Baro.Pressure_22","Sample.Max.Baro.Pressure_50","Sample.Max.Baro.Pressure_52","Sample.Max.Baro.Pressure_57","Sample.Max.Baro.Pressure_76","Sample.Max.Baro.Pressure_2001","Sample.Max.Baro.Pressure_3301","Sample.Max.Baro.Pressure_6005","Sample.Min.Baro.Pressure_14","Sample.Min.Baro.Pressure_22","Sample.Min.Baro.Pressure_50","Sample.Min.Baro.Pressure_52","Sample.Min.Baro.Pressure_57","Sample.Min.Baro.Pressure_76","Sample.Min.Baro.Pressure_2001","Sample.Min.Baro.Pressure_3301","Sample.Min.Baro.Pressure_6005","target_1_57","target_10_4002","target_10_8003","target_11_1","target_11_32","target_11_50","target_11_64","target_11_1003","target_11_1601","target_11_4002","target_11_8003","target_14_4002","target_14_8003","target_15_57","target_2_57","target_3_1","target_3_50","target_3_57","target_3_1601","target_3_4002","target_3_6006","target_4_1","target_4_50","target_4_57","target_4_1018","target_4_1601","target_4_2001","target_4_4002","target_4_4101","target_4_6006","target_4_8003","target_5_6006","target_7_57","target_8_57","target_8_4002","target_8_6004","target_8_8003","target_9_4002","target_9_8003"
1,1,1,10,"Saturday",21,0.01,117,187,0.3,0.3,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,6.1816228132982,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,2.38965627997991,NA,5.56815355612325,0.690015329704154,NA,NA,NA,NA,NA,NA,2.84349016287551,0.0920223353681394,1.69321097077376,0.368089341472558,0.184044670736279,0.368089341472558,0.276067006104418,0.892616653070952,1.74842437199465,NA,NA,5.1306307034019,1.34160578423204,2.13879182993514,3.01375212399952,NA,5.67928016629218,NA
2,1,2,10,"Saturday",22,0.01,231,202,0.5,0.6,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,8.47583334194495,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,1.99138023331659,NA,5.56815355612325,0.923259948195698,NA,NA,NA,NA,NA,NA,3.1011527019063,0.0920223353681394,1.94167127626774,0.368089341472558,0.184044670736279,0.368089341472558,0.368089341472558,1.73922213845783,2.14412041407765,NA,NA,5.1306307034019,1.19577906855465,2.72209869264472,3.88871241806389,NA,7.42675098668978,NA
3,1,3,10,"Saturday",23,0.01,247,227,0.5,1.5,NA,NA,NA,14.9,NA,NA,NA,NA,NA,NA,NA,NA,5.8,NA,NA,NA,NA,NA,NA,NA,NA,747,NA,NA,NA,NA,NA,NA,NA,NA,750,NA,NA,NA,NA,NA,NA,NA,NA,743,NA,NA,NA,NA,NA,2.67923294292042,8.92192983362627,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,1.7524146053186,NA,5.56815355612325,0.680296803933673,NA,NA,NA,NA,NA,NA,3.06434376775904,0.0920223353681394,2.52141198908702,0.460111676840697,0.184044670736279,0.368089341472558,0.368089341472558,1.7852333061419,1.93246904273093,NA,NA,5.13639545700122,1.40965825154816,3.11096993445111,3.88871241806389,NA,7.68373198968942,NA
4,1,4,10,"Sunday",0,0.01,219,218,0.2,1.2,NA,NA,NA,14,NA,NA,NA,NA,NA,NA,NA,NA,4.8,NA,NA,NA,NA,NA,NA,NA,NA,751,NA,NA,NA,NA,NA,NA,NA,NA,754,NA,NA,NA,NA,NA,NA,NA,NA,748,NA,NA,NA,NA,NA,2.67923294292042,5.09824561921501,NA,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,0.114975168664303,NA,2.38965627997991,NA,5.6776192223642,0.612267123540305,NA,NA,NA,NA,NA,NA,3.21157950434806,0.184044670736279,2.374176252498,0.460111676840697,0.184044670736279,0.368089341472558,0.276067006104418,1.86805340797323,2.08890701285676,NA,NA,5.21710200739181,1.47771071886428,2.04157401948354,3.20818774490271,NA,4.83124285639335,NA
...

We can load the data file into memory using the Pandas read_csv() function and specify the header row on line 0.

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)

We can also get a quick idea of how much missing data there is in the dataset. We can do that by first trimming the first six columns to remove the metadata and string weekday data, then converting the remaining columns to floating point values.

# trim and transform to floats
values = dataset.values
data = values[:, 6:].astype('float32')

We can then calculate the total number of missing observations and the percentage of values that are missing.

# summarize amount of missing data
total_missing = count_nonzero(isnan(data))
percent_missing = total_missing / data.size * 100
print('Total Missing: %d/%d (%.1f%%)' % (total_missing, data.size, percent_missing))

The complete example is listed below.

# summarize missing data in the dataset
from numpy import isnan
from numpy import count_nonzero
from pandas import read_csv
# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# summarize
print(dataset.shape)
# trim and transform to floats
values = dataset.values
data = values[:, 6:].astype('float32')
# summarize amount of missing data
total_missing = count_nonzero(isnan(data))
percent_missing = total_missing / data.size * 100
print('Total Missing: %d/%d (%.1f%%)' % (total_missing, data.size, percent_missing))

Running the example first prints the shape of the loaded dataset.


We can see that we have about 37,000 rows and 95 columns. We know these numbers are misleading given that the data is in fact divided into chunks and the columns are divided into the same observations at different sites.


We can also see that a little over 40% of the data is missing. This is a lot. The data is very patchy and we are going to have to understand this well before modeling the problem.

(37821, 95)
Total Missing: 1443977/3366069 (42.9%)







Chunk Data Structure


A good starting point is to look at the data in terms of the chunks.


Chunk Durations


We can group data by the ‘chunkID’ variable (column index 1).


If each chunk is eight days and the observations are hourly, then we would expect (8 * 24) or 192 rows of data per chunk.


If there are 37,821 rows of data, then there must be chunks with more or fewer than 192 hours, as 37,821 / 192 is about 196.98, which is not a whole number of chunks.


Let’s first split the data into chunks. We can first get a list of the unique chunk identifiers.

chunk_ids = unique(values[:, 1])

We can then collect all rows for each chunk identifier and store them in a dictionary for easy access.

chunks = dict()
# group rows by chunk id
for chunk_id in chunk_ids:
   selection = values[:, 1] == chunk_id
   chunks[chunk_id] = values[selection, :]

The to_chunks() function defined below takes a NumPy array of the loaded data and returns a dictionary mapping each chunk_id to the rows for that chunk.

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
   chunks = dict()
   # get the unique chunk ids
   chunk_ids = unique(values[:, chunk_ix])
   # group rows by chunk id
   for chunk_id in chunk_ids:
      selection = values[:, chunk_ix] == chunk_id
      chunks[chunk_id] = values[selection, :]
   return chunks

The ‘position_within_chunk‘ in the data file indicates the order of a row within a chunk. At this stage, we assume that rows are already ordered and do not need to be sorted. A skim of the raw data file seems to confirm this assumption.
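
If we want more than a skim, a quick check can confirm the assumption; the sketch below (assuming the chunks dictionary produced by to_chunks()) asserts that the ‘position_within_chunk‘ values are strictly increasing within every chunk:

# check that rows within each chunk are already ordered by position
for c_id, rows in chunks.items():
   positions = rows[:, 2]
   assert all(positions[i] < positions[i + 1] for i in range(len(positions) - 1))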


Once the data is sorted into chunks, we can calculate the number of rows in each chunk and have a look at the distribution, such as with a box and whisker plot.

# plot distribution of chunk durations
def plot_chunk_durations(chunks):
   # chunk durations in hours
   chunk_durations = [len(v) for k,v in chunks.items()]
   # boxplot
   pyplot.subplot(2, 1, 1)
   pyplot.boxplot(chunk_durations)
   # histogram
   pyplot.subplot(2, 1, 2)
   pyplot.hist(chunk_durations)
   pyplot.show()

The complete example that ties all of this together is listed below.

# split data into chunks
from numpy import unique
from pandas import read_csv
from matplotlib import pyplot

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
   chunks = dict()
   # get the unique chunk ids
   chunk_ids = unique(values[:, chunk_ix])
   # group rows by chunk id
   for chunk_id in chunk_ids:
      selection = values[:, chunk_ix] == chunk_id
      chunks[chunk_id] = values[selection, :]
   return chunks

# plot distribution of chunk durations
def plot_chunk_durations(chunks):
   # chunk durations in hours
   chunk_durations = [len(v) for k,v in chunks.items()]
   # boxplot
   pyplot.subplot(2, 1, 1)
   pyplot.boxplot(chunk_durations)
   # histogram
   pyplot.subplot(2, 1, 2)
   pyplot.hist(chunk_durations)
   pyplot.show()

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
print('Total Chunks: %d' % len(chunks))
# plot chunk durations
plot_chunk_durations(chunks)

Running the example first prints the number of chunks in the dataset.


We can see that there are 208 chunks, which suggests that the number of hourly observations must indeed vary across the chunks.

Total Chunks: 208

A box and whisker plot and a histogram plot of chunk durations is created. We can see that indeed the median is 192, meaning that most chunks have eight days of observations or close to it.


We can also see a long tail of durations down to about 25 rows. Although there are not many of these cases, we would expect them to be challenging to forecast given the lack of data.
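
To complement the plots, the durations can also be summarized numerically; a small optional addition:

# summarize chunk durations in hours
from numpy import median
durations = [len(rows) for rows in chunks.values()]
print('min=%d, median=%d, max=%d' % (min(durations), median(durations), max(durations)))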


The distribution also raises questions about how contiguous the observations within each chunk may be.



Box and whisker plot and Histogram plot of chunk duration in hours


Chunk Contiguousness


It may be helpful to get an idea of how contiguous (or not) the observations are within those chunks that do not have the full eight days of data.


One approach to considering this is to create a line plot for each discontiguous chunk and show the gaps in the observations.


We can do this on a single plot. Each chunk has a unique identifier, from 1 to 208, and we can use this as the value for the series and mark missing observations within the eight day interval via NaN values that will not appear on the plot.


Inverting this, we can assume that we have NaN values for all time steps within a chunk, then use the ‘position_within_chunk‘ column (index 2) to determine the time steps that do have values and mark them with the chunk id.


The plot_discontiguous_chunks() function below implements this behavior, creating one series or line for each chunk with missing rows, all on the same plot. The expectation is that breaks in the line will help us see how contiguous or discontiguous these incomplete chunks happen to be.

# plot chunks that do not have all data
def plot_discontiguous_chunks(chunks, row_in_chunk_ix=2):
   n_steps = 8 * 24
   for c_id,rows in chunks.items():
      # skip chunks with all data
      if rows.shape[0] == n_steps:
         continue
      # create empty series
      series = [nan for _ in range(n_steps)]
      # mark all rows with data
      for row in rows:
         # convert to zero offset
         r_id = row[row_in_chunk_ix] - 1
         # mark value
         series[r_id] = c_id
      # plot
      pyplot.plot(series)
   pyplot.show()

The complete example is listed below.

# plot discontiguous chunks
from numpy import nan
from numpy import unique
from pandas import read_csv
from matplotlib import pyplot

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
   chunks = dict()
   # get the unique chunk ids
   chunk_ids = unique(values[:, chunk_ix])
   # group rows by chunk id
   for chunk_id in chunk_ids:
      selection = values[:, chunk_ix] == chunk_id
      chunks[chunk_id] = values[selection, :]
   return chunks

# plot chunks that do not have all data
def plot_discontiguous_chunks(chunks, row_in_chunk_ix=2):
   n_steps = 8 * 24
   for c_id,rows in chunks.items():
      # skip chunks with all data
      if rows.shape[0] == n_steps:
         continue
      # create empty series
      series = [nan for _ in range(n_steps)]
      # mark all rows with data
      for row in rows:
         # convert to zero offset
         r_id = row[row_in_chunk_ix] - 1
         # mark value
         series[r_id] = c_id
      # plot
      pyplot.plot(series)
   pyplot.show()

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
# plot discontiguous chunks
plot_discontiguous_chunks(chunks)

Running the example creates a single figure with one line for each of the chunks with missing data.


The number and lengths of the breaks in the line for each chunk give an idea of how discontiguous the observations within each chunk happen to be.


Many of the chunks do have long stretches of contiguous data, which is a good sign for modeling.


There are cases where chunks have very few observations and those observations that are present are in small contiguous patches. These may be challenging to model.


Further, not all of these chunks have observations at the end of the chunk, the period right before a forecast is required. These chunks specifically will be a challenge for models that seek to persist recent observations.
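
We can quantify this with a short check; the sketch below (assuming the chunks dictionary and one-based ‘position_within_chunk‘ values) counts the incomplete chunks whose last available row falls before the final time step of the eight day window:

# count incomplete chunks with no data at the end of the 8-day window
n_steps = 8 * 24
n_missing_tail = 0
for c_id, rows in chunks.items():
   if rows.shape[0] < n_steps and rows[-1, 2] < n_steps:
      n_missing_tail += 1
print('Incomplete chunks missing the final time step: %d' % n_missing_tail)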


The discontiguous nature of the series data within the chunks will also make it challenging to evaluate models. For example, one cannot simply split the data of a chunk in half, train on the first half, and test on the second when the observations are patchy, at least not for the incomplete chunks.



Line plots of chunks with discontinuous observations


Daily Coverage Within Chunks


The discontiguous nature of the chunks also suggests that it may be important to look at the hours covered by each chunk.


The time of day is important in environmental data, and models that assume that each chunk covers the same daily or weekly cycle may stumble if the start and end time of day vary across chunks.


We can quickly check this by plotting the distribution of the first hour (in a 24 hour day) of each chunk.


The number of bins in the histogram is set to 24 so we can clearly see the distribution for each hour of the day in 24-hour time.


Further, when collecting the first hour of the chunk, we are careful to only collect it from those chunks that have all eight days of data, in case a chunk with missing data does not have observations at the beginning of the chunk, which we know happens.

# plot distribution of chunk start hour
def plot_chunk_start_hour(chunks, hour_in_chunk_ix=5):
   # chunk start hour
   chunk_start_hours = [v[0, hour_in_chunk_ix] for k,v in chunks.items() if len(v)==192]
   # boxplot
   pyplot.subplot(2, 1, 1)
   pyplot.boxplot(chunk_start_hours)
   # histogram
   pyplot.subplot(2, 1, 2)
   pyplot.hist(chunk_start_hours, bins=24)
   pyplot.show()

The complete example is listed below.

# plot distribution of chunk start hour
from numpy import nan
from numpy import unique
from pandas import read_csv
from matplotlib import pyplot

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
   chunks = dict()
   # get the unique chunk ids
   chunk_ids = unique(values[:, chunk_ix])
   # group rows by chunk id
   for chunk_id in chunk_ids:
      selection = values[:, chunk_ix] == chunk_id
      chunks[chunk_id] = values[selection, :]
   return chunks

# plot distribution of chunk start hour
def plot_chunk_start_hour(chunks, hour_in_chunk_ix=5):
   # chunk start hour
   chunk_start_hours = [v[0, hour_in_chunk_ix] for k,v in chunks.items() if len(v)==192]
   # boxplot
   pyplot.subplot(2, 1, 1)
   pyplot.boxplot(chunk_start_hours)
   # histogram
   pyplot.subplot(2, 1, 2)
   pyplot.hist(chunk_start_hours, bins=24)
   pyplot.show()

# load dataset
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
# plot distribution of chunk start hour
plot_chunk_start_hour(chunks)

Running the example creates a box and whisker plot and a histogram of the first hour within each chunk.


We can see a reasonably uniform distribution of the start time across the 24 hours in the day.


Further, this means that the interval to be forecast for each chunk will also vary across the 24 hour period. This adds a wrinkle for models that might expect a standard three day forecast period (midnight to midnight).



Distribution of first hour in observations within each chunk


Now that we have some idea of the chunk-structure of the data, let’s take a closer look at the input variables that describe the meteorological observations.


Input Variables


There are 56 input variables.


The first six (indexes 0 to 5) are metadata information for the chunk and time of the observations. They are:

rowID
chunkID
position_within_chunk
month_most_common
weekday
hour

The remaining 50 describe meteorological information for specific sites; they are:

Solar.radiation_64
WindDirection..Resultant_1
WindDirection..Resultant_1018
WindSpeed..Resultant_1
WindSpeed..Resultant_1018
Ambient.Max.Temperature_14
Ambient.Max.Temperature_22
Ambient.Max.Temperature_50
Ambient.Max.Temperature_52
Ambient.Max.Temperature_57
Ambient.Max.Temperature_76
Ambient.Max.Temperature_2001
Ambient.Max.Temperature_3301
Ambient.Max.Temperature_6005
Ambient.Min.Temperature_14
Ambient.Min.Temperature_22
Ambient.Min.Temperature_50
Ambient.Min.Temperature_52
Ambient.Min.Temperature_57
Ambient.Min.Temperature_76
Ambient.Min.Temperature_2001
Ambient.Min.Temperature_3301
Ambient.Min.Temperature_6005
Sample.Baro.Pressure_14
Sample.Baro.Pressure_22
Sample.Baro.Pressure_50
Sample.Baro.Pressure_52
Sample.Baro.Pressure_57
Sample.Baro.Pressure_76
Sample.Baro.Pressure_2001
Sample.Baro.Pressure_3301
Sample.Baro.Pressure_6005
Sample.Max.Baro.Pressure_14
Sample.Max.Baro.Pressure_22
Sample.Max.Baro.Pressure_50
Sample.Max.Baro.Pressure_52
Sample.Max.Baro.Pressure_57
Sample.Max.Baro.Pressure_76
Sample.Max.Baro.Pressure_2001
Sample.Max.Baro.Pressure_3301
Sample.Max.Baro.Pressure_6005
Sample.Min.Baro.Pressure_14
Sample.Min.Baro.Pressure_22
Sample.Min.Baro.Pressure_50
Sample.Min.Baro.Pressure_52
Sample.Min.Baro.Pressure_57
Sample.Min.Baro.Pressure_76
Sample.Min.Baro.Pressure_2001
Sample.Min.Baro.Pressure_3301
Sample.Min.Baro.Pressure_6005

Really, there are only eight meteorological input variables:

Solar.radiation
WindDirection..Resultant
WindSpeed..Resultant
Ambient.Max.Temperature
Ambient.Min.Temperature
Sample.Baro.Pressure
Sample.Max.Baro.Pressure
Sample.Min.Baro.Pressure

These variables are recorded across 12 unique sites; they are:

1, 14, 22, 50, 52, 57, 64, 76, 1018, 2001, 3301, 6005
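
We can confirm these counts by splitting the input column headers on their final underscore; a quick sketch, assuming the dataset DataFrame from the loading example:

# derive unique variable types and site ids from the input column headers
headers = dataset.columns[6:56]
var_types = set(h.rsplit('_', 1)[0] for h in headers)
site_ids = set(int(h.rsplit('_', 1)[1]) for h in headers)
print('Unique input variables: %d' % len(var_types))
print('Unique input sites: %d' % len(site_ids))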

The data is beautifully complex.


Not all variables are recorded at all sites.


There is some overlap in the site identifiers used in the target variables, such as 1, 50, 64, etc.


There are site identifiers used in the target variables that are not used in the input variables, such as 4002. There are also site identifiers used in the input variables that are not used in the target variables, such as 14.


This suggests, at the very least, that not all variables are recorded at all locations and that the recording stations are heterogeneous across sites. Further, there might be something special about sites that only collect measurements of a given type, or that collect all measurements.


Let’s take a closer look at the data for the input variables.


Temporal Structure of Inputs for a Chunk


We can start off by looking at the structure and distribution of inputs per chunk.


The first few chunks that have all eight days of observations have chunkID values of 1, 3, and 5.


We can enumerate all of the input columns and create one line plot for each. This will create a time series line plot for each input variable to give a rough idea of how each varies across time.


We can repeat this for a few chunks to get an idea how the temporal structure may differ across chunks.


The function below named plot_chunk_inputs() takes the data in chunk format and a list of chunk ids to plot. It will create a figure with 50 line plots, one for each input variable, and n lines per plot, one for each chunk.

# plot all inputs for one or more chunk ids
def plot_chunk_inputs(chunks, c_ids):
   pyplot.figure()
   inputs = range(6, 56)
   for i in range(len(inputs)):
      ax = pyplot.subplot(len(inputs), 1, i+1)
      ax.set_xticklabels([])
      ax.set_yticklabels([])
      column = inputs[i]
      for chunk_id in c_ids:
         rows = chunks[chunk_id]
         pyplot.plot(rows[:,column])
   pyplot.show()

The complete example is listed below.

# plot inputs for a chunk
from numpy import unique
from pandas import read_csv
from matplotlib import pyplot

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
   chunks = dict()
   # get the unique chunk ids
   chunk_ids = unique(values[:, chunk_ix])
   # group rows by chunk id
   for chunk_id in chunk_ids:
      selection = values[:, chunk_ix] == chunk_id
      chunks[chunk_id] = values[selection, :]
   return chunks

# plot all inputs for one or more chunk ids
def plot_chunk_inputs(chunks, c_ids):
   pyplot.figure()
   inputs = range(6, 56)
   for i in range(len(inputs)):
      ax = pyplot.subplot(len(inputs), 1, i+1)
      ax.set_xticklabels([])
      ax.set_yticklabels([])
      column = inputs[i]
      for chunk_id in c_ids:
         rows = chunks[chunk_id]
         pyplot.plot(rows[:,column])
   pyplot.show()

# load data
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
# plot inputs for some chunks
plot_chunk_inputs(chunks, [1])

Running the example creates a single figure with 50 line plots, one for each of the meteorological input variables.


The plots are hard to see, so you may want to increase the size of the created figure.


We can see that the observations for the first five variables look pretty complete; these are solar radiation, wind direction, and wind speed. The rest of the variables appear pretty patchy, at least for this chunk.



Parallel Time Series Line Plots For All Input Variables for 1 Chunk


We can update the example and plot the input variables for the first three chunks with the full eight days of observations.

plot_chunk_inputs(chunks, [1, 3, 5])

Running the example creates the same 50 line plots, each with three series or lines per plot, one for each chunk.


Again, the figure makes the individual plots hard to see, so you may need to increase the size of the figure to better review the patterns.


We can see that these three chunks do show similar structures within each line plot. This is a helpful finding, as it suggests that it may be useful to model the same variables across multiple chunks.



Parallel Time Series Line Plots For All Input Variables for 3 Chunks


It does raise the question as to whether the distribution of the variables differs greatly across sites.


Input Data Distribution


We can look at the distribution of input variables crudely using box and whisker plots.


The plot_chunk_input_boxplots() function below will create one box and whisker plot per input feature for the data in one chunk.

# boxplot for input variables for a chunk
def plot_chunk_input_boxplots(chunks, c_id):
   rows = chunks[c_id]
   pyplot.boxplot(rows[:,6:56])
   pyplot.show()

The complete example is listed below.

# boxplots of inputs for a chunk
from numpy import unique
from numpy import isnan
from numpy import count_nonzero
from pandas import read_csv
from matplotlib import pyplot

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
   chunks = dict()
   # get the unique chunk ids
   chunk_ids = unique(values[:, chunk_ix])
   # group rows by chunk id
   for chunk_id in chunk_ids:
      selection = values[:, chunk_ix] == chunk_id
      chunks[chunk_id] = values[selection, :]
   return chunks

# boxplot for input variables for a chunk
def plot_chunk_input_boxplots(chunks, c_id):
   rows = chunks[c_id]
   pyplot.boxplot(rows[:,6:56])
   pyplot.show()

# load data
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
# boxplot for input variables
plot_chunk_input_boxplots(chunks, 1)

Running the example creates 50 boxplots, one for each input variable for the observations in the first chunk in the training dataset.


We can see that variables of the same type tend to have a similar spread of observations, and that each group of variables appears to have different units: perhaps degrees for wind direction, hectopascals for pressure, degrees Celsius for temperature, and so on.



Box and whisker plots of input variables for one chunk


It may be interesting to further investigate the distribution and spread of observations for each of the eight variable types. This is left as a further exercise.
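
As a starting point for that exercise, the input columns can be grouped by variable type; a sketch, again assuming the loaded dataset DataFrame:

# group input column indexes by variable type for per-type analysis
groups = dict()
for ix in range(6, 56):
   var_type = dataset.columns[ix].rsplit('_', 1)[0]
   groups.setdefault(var_type, []).append(ix)
for var_type, columns in groups.items():
   print(var_type, columns)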


We have some rough ideas about the input variables, and perhaps they may be useful in predicting the target variables. We cannot be sure.


We can now turn our attention to the target variables.


Target Variables


The goal of the forecast problem is to predict multiple variables across multiple sites for three days.


There are 39 time series variables to predict.


From the column header, they are:

"target_1_57","target_10_4002","target_10_8003","target_11_1","target_11_32","target_11_50","target_11_64","target_11_1003","target_11_1601","target_11_4002","target_11_8003","target_14_4002","target_14_8003","target_15_57","target_2_57","target_3_1","target_3_50","target_3_57","target_3_1601","target_3_4002","target_3_6006","target_4_1","target_4_50","target_4_57","target_4_1018","target_4_1601","target_4_2001","target_4_4002","target_4_4101","target_4_6006","target_4_8003","target_5_6006","target_7_57","target_8_57","target_8_4002","target_8_6004","target_8_8003","target_9_4002","target_9_8003"

The naming convention for these column headers is:

target_[variable identifier]_[site identifier]

We can convert these column headers into a small dataset of variable ids and site ids with a little regex.
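
One way to do this is sketched below; it parses the header row only (via nrows=1) and also writes the result to the ‘targets.txt‘ file used shortly after:

# extract (variable id, site id) pairs from the target column names
import re
from pandas import read_csv
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0, nrows=1)
lines = ['var, site']
for name in dataset.columns:
   match = re.match(r'^target_(\d+)_(\d+)$', name)
   if match:
      lines.append('%s,%s' % match.groups())
# save the pairs for later analysis
with open('targets.txt', 'w') as f:
   f.write('\n'.join(lines))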


The results are as follows:

var, site
1,57
10,4002
10,8003
11,1
11,32
11,50
11,64
11,1003
11,1601
11,4002
11,8003
14,4002
14,8003
15,57
2,57
3,1
3,50
3,57
3,1601
3,4002
3,6006
4,1
4,50
4,57
4,1018
4,1601
4,2001
4,4002
4,4101
4,6006
4,8003
5,6006
7,57
8,57
8,4002
8,6004
8,8003
9,4002
9,8003

Helpfully, the targets are grouped by variable id.


We can see that one variable may have to be predicted across multiple sites; for example, variable 11 predicted at sites 1, 32, 50, and so on:

var, site
11,1
11,32
11,50
11,64
11,1003
11,1601
11,4002
11,8003

We can see that different variables may need to be predicted for a given site. For example, site 50 requires variables 11, 3, and 4:

var, site
11,50
3,50
4,50

We can save the small dataset of targets to a file called ‘targets.txt‘ and load it up for some quick analysis.

# summarize targets
from numpy import unique
from pandas import read_csv
# load dataset
dataset = read_csv('targets.txt', header=0)
values = dataset.values
# summarize unique
print('Unique Variables: %d' % len(unique(values[:, 0])))
print('Unique Sites: %d' % len(unique(values[:, 1])))

Running the example prints the number of unique variables and sites.


We can see that the 39 target variables are far fewer than the 168 (12 * 14) we would have if every variable were predicted at every site.

Unique Variables: 12
Unique Sites: 14

Let’s take a closer look at the data for the target variables.


Temporal Structure of Targets for a Chunk


We can start off by looking at the structure and distribution of targets per chunk.


The first few chunks that have all eight days of observations have chunkID values of 1, 3, and 5.


We can enumerate all of the target columns and create one line plot for each. This will create a time series line plot for each target variable to give a rough idea of how it varies across time.


We can repeat this for a few chunks to get a rough idea of how the temporal structure may vary across chunks.


The function below, named plot_chunk_targets(), takes the data in chunk format and a list of chunk ids to plot. It will create a figure with 39 line plots, one for each target variable, and n lines per plot, one for each chunk.

# plot all targets for one or more chunk ids
def plot_chunk_targets(chunks, c_ids):
   pyplot.figure()
   targets = range(56, 95)
   for i in range(len(targets)):
      ax = pyplot.subplot(len(targets), 1, i+1)
      ax.set_xticklabels([])
      ax.set_yticklabels([])
      column = targets[i]
      for chunk_id in c_ids:
         rows = chunks[chunk_id]
         pyplot.plot(rows[:,column])
   pyplot.show()

The complete example is listed below.

# plot targets for a chunk
from numpy import unique
from pandas import read_csv
from matplotlib import pyplot

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
   chunks = dict()
   # get the unique chunk ids
   chunk_ids = unique(values[:, chunk_ix])
   # group rows by chunk id
   for chunk_id in chunk_ids:
      selection = values[:, chunk_ix] == chunk_id
      chunks[chunk_id] = values[selection, :]
   return chunks

# plot all targets for one or more chunk ids
def plot_chunk_targets(chunks, c_ids):
   pyplot.figure()
   targets = range(56, 95)
   for i in range(len(targets)):
      ax = pyplot.subplot(len(targets), 1, i+1)
      ax.set_xticklabels([])
      ax.set_yticklabels([])
      column = targets[i]
      for chunk_id in c_ids:
         rows = chunks[chunk_id]
         pyplot.plot(rows[:,column])
   pyplot.show()

# load data
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
# plot targets for some chunks
plot_chunk_targets(chunks, [1])

Running the example creates a single figure with 39 line plots for chunk identifier “1”.


The plots are small, but give a rough idea of the temporal structure for the variables.


We can see that there are more than a few variables that have no data for this chunk. These cannot be forecasted directly, and probably not indirectly.


This suggests that, in addition to not having all variables for all sites, even the variables specified in the column header may not be present for some chunks.
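
We can make this concrete by counting the target columns that contain no data at all within a single chunk; a quick sketch using the chunks dictionary:

# count target columns with no data at all in chunk 1
from numpy import isnan
rows = chunks[1]
target_data = rows[:, 56:].astype('float32')
n_empty = sum(isnan(target_data[:, i]).all() for i in range(target_data.shape[1]))
print('Empty target columns in chunk 1: %d' % n_empty)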


We can also see breaks in some of the series for missing values. This suggests that even though we may have observations for each time step within the chunk, we may not have a contiguous series for all variables in the chunk.


There is a cyclic structure to many of the plots. Most have eight peaks, very likely corresponding to the eight days of observations within the chunk. This seasonal structure could be modeled directly, and perhaps removed from the data when modeling and added back to the forecasted interval.
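
For example, one simple way to remove a daily cycle from hourly observations is 24-hour seasonal differencing; a minimal sketch (differences involving missing values will themselves be NaN):

# remove a daily seasonal cycle via 24-hour differencing
def seasonal_difference(series, period=24):
   # subtract the observation from the same hour on the previous day
   return [series[i] - series[i - period] for i in range(period, len(series))]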


There does not appear to be any trend to the series.



Parallel Time Series Line Plots For All Target Variables for 1 Chunk


We can re-run the example and plot the target variables for the first three chunks with complete data.

# plot targets for some chunks
plot_chunk_targets(chunks, [1, 3, 5])

Running the example creates a figure with 39 plots and three time series per plot, one for the targets for each chunk.


The plot is busy, and you may want to increase the size of the plot window to better see the comparison across the chunks for the target variables.


For many of the variables that have a cyclic daily structure, we can see the structure repeated across the chunks.


This is encouraging as it suggests that modeling a variable for a site may be helpful across chunks.


Further, plots 4 through 11 correspond to variable 11 across eight different sites. The strong similarity in temporal structure across these plots suggests that modeling the data per variable, across the sites where it is measured, may be beneficial.



Parallel Time Series Line Plots For All Target Variables for 3 Chunks


Boxplot Distribution of Target Variables


It is also useful to take a look at the distribution of the target variables.


We can start by taking a look at the distribution of each target variable for one chunk by creating box and whisker plots for each target variable.


A separate boxplot can be created for each target side-by-side, allowing the shape and range of values to be directly compared on the same scale.

# boxplot for target variables for a chunk
def plot_chunk_targets_boxplots(chunks, c_id):
   rows = chunks[c_id]
   pyplot.boxplot(rows[:,56:])
   pyplot.show()

The complete example is listed below.

# boxplots of targets for a chunk
from numpy import unique
from numpy import isnan
from numpy import count_nonzero
from pandas import read_csv
from matplotlib import pyplot

# split the dataset by 'chunkID', return a dict of id to rows
def to_chunks(values, chunk_ix=1):
   chunks = dict()
   # get the unique chunk ids
   chunk_ids = unique(values[:, chunk_ix])
   # group rows by chunk id
   for chunk_id in chunk_ids:
      selection = values[:, chunk_ix] == chunk_id
      chunks[chunk_id] = values[selection, :]
   return chunks

# boxplot for target variables for a chunk
def plot_chunk_targets_boxplots(chunks, c_id):
   rows = chunks[c_id]
   pyplot.boxplot(rows[:,56:])
   pyplot.show()

# load data
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# group data by chunks
values = dataset.values
chunks = to_chunks(values)
# boxplot for target variables
plot_chunk_targets_boxplots(chunks, 1)

Running the example creates a figure containing 39 boxplots, one for each of the 39 target variables for the first chunk.


We can see that many of the variables have a median close to zero or one; we can also see a large asymmetrical spread for most variables, suggesting the variables likely have skewed distributions with outliers.


It is encouraging that boxplots 4 through 11, for variable 11 across eight sites, show a similar distribution. This is further supporting evidence that the data may be grouped by variable and used to fit a model that could be used across sites.



Box and whisker plots of target variables for one chunk


We can re-create this plot using data across all chunks to see dataset-wide patterns.


The complete example is listed below.

# boxplots of targets for all chunks
from pandas import read_csv
from matplotlib import pyplot

# boxplot for all target variables
def plot_target_boxplots(values):
   pyplot.boxplot(values[:,56:])
   pyplot.show()

# load data
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# boxplot for target variables
values = dataset.values
plot_target_boxplots(values)

Running the example creates a new figure showing 39 box and whisker plots for the entire training dataset regardless of chunk.


It is a little bit of a mess, with the circle outliers obscuring the main data distributions.


We can see that outlier values extend into the range of 5 to 10 units. This suggests there might be some use in standardizing and/or rescaling the targets when modeling.
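
A minimal sketch of standardizing the target columns while ignoring missing values, assuming the values array from the example above, might look as follows:

# standardize target columns, ignoring NaN values
from numpy import nanmean, nanstd
target_data = values[:, 56:].astype('float32')
# note: columns that are entirely NaN will remain NaN (and raise runtime warnings)
standardized = (target_data - nanmean(target_data, axis=0)) / nanstd(target_data, axis=0)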


Perhaps the most useful finding is that there are some targets that do not have any (or very much) data regardless of chunk. These columns probably should be excluded from the dataset.



Box and whisker plots of target variables for all training data


Apparently Empty Target Columns


We can investigate the apparent missing data further by creating a bar chart of the ratio of missing data per column, excluding the metadata columns at the beginning (e.g. the first five columns).


The plot_col_percentage_missing() function below implements this.

# bar chart of the ratio of missing data per column
def plot_col_percentage_missing(values, ix_start=5):
   ratios = list()
   # skip early columns, with meta data or strings
   for col in range(ix_start, values.shape[1]):
      col_data = values[:, col].astype('float32')
      ratio = count_nonzero(isnan(col_data)) / len(col_data) * 100
      ratios.append(ratio)
      if ratio > 90.0:
         print(col, ratio)
   col_id = [x for x in range(ix_start, values.shape[1])]
   pyplot.bar(col_id, ratios)
   pyplot.show()

The complete example is listed below.

# summarize missing data per column
from numpy import isnan
from numpy import count_nonzero
from pandas import read_csv
from matplotlib import pyplot

# bar chart of the ratio of missing data per column
def plot_col_percentage_missing(values, ix_start=5):
   ratios = list()
   # skip early columns, with meta data or strings
   for col in range(ix_start, values.shape[1]):
      col_data = values[:, col].astype('float32')
      ratio = count_nonzero(isnan(col_data)) / len(col_data) * 100
      ratios.append(ratio)
      if ratio > 90.0:
         print(col, ratio)
   col_id = [x for x in range(ix_start, values.shape[1])]
   pyplot.bar(col_id, ratios)
   pyplot.show()

# load data
dataset = read_csv('AirQualityPrediction/TrainingData.csv', header=0)
# plot ratio of missing data per column
values = dataset.values
plot_col_percentage_missing(values)

Running the example first prints the column id (zero offset) and the ratio of missing data, if the ratio is above 90%.


We can see that there are in fact no columns that are entirely missing, but that there are a dozen (12) columns with more than 90% missing data.


Interestingly, seven of these are target variables (index 56 or higher).

11 91.48885539779488
20 91.48885539779488
29 91.48885539779488
38 91.48885539779488
47 91.48885539779488
58 95.38880516115385
66 96.9805134713519
68 95.38880516115385
72 97.31630575606145
86 95.38880516115385
92 95.38880516115385
94 95.38880516115385

A bar chart of column index number to ratio of missing data is created.


We can see that there might be some stratification to the ratio of missing data, with a cluster below 10%, a cluster around 70%, and a cluster above 90%.
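
We can count the columns in each apparent cluster; a rough sketch, assuming the values array from the example above and cluster boundaries chosen by eye:

# count columns in each apparent cluster of missing-data ratios
from numpy import isnan
from numpy import count_nonzero
ratios = [count_nonzero(isnan(values[:, col].astype('float32'))) / values.shape[0] * 100
   for col in range(5, values.shape[1])]
print('below 10%%: %d' % sum(1 for r in ratios if r < 10))
print('10%% to 90%%: %d' % sum(1 for r in ratios if 10 <= r <= 90))
print('above 90%%: %d' % sum(1 for r in ratios if r > 90))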


We can also see a separation between input variable and target variables where the former are quite regular as they show the same variable type measured across different sites.


Such small amounts of data for some target variables suggest the need to leverage other factors besides past observations in order to make predictions.



