Carl Chester Lloyd

ML.NET Regression with Bike Sharing

July 21, 2019

Machine learning is currently the hot thing, and promises to continue to be the hot thing for the foreseeable future. Unlike with Bitcoin and blockchain technology, companies are investing in machine learning, deep learning, neural networks, and artificial intelligence with a clear understanding of the benefits they can extract from the investment. Of course, this has attracted my interest, along with that of many other software engineers.

Most of the tools for doing machine learning seem to be built in Python, and while it is fun to learn new languages, I was excited when I saw that Microsoft was building a framework for supporting machine learning in C#. It is called ML.NET. This will be the first in a series of posts exploring ML.NET, and generally playing with machine learning.

The Data Set

For this foray into machine learning with ML.NET we are going to be using a bike sharing data set supplied by the UCI Machine Learning Repository. This repository is an excellent source for people wanting to learn machine learning by getting their hands dirty, and diving right into real data sets.

The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

ReadMe

The ReadMe is an excellent source of information about this data set. The key task is laid out pretty plainly in it. We are going to look at just the daily data, and try to predict the daily rental count.

“Regression: Predication of bike rental count hourly or daily based on the environmental and seasonal settings.”

There is also a nice listing of the fields in the data set, and what the values correspond to.

  • instant: record index
  • dteday : date
  • season : season (1:spring, 2:summer, 3:fall, 4:winter)
  • yr : year (0: 2011, 1:2012)
  • mnth : month ( 1 to 12)
  • hr : hour (0 to 23)
  • holiday : whether day is holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
  • weekday : day of the week
  • workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
  • weathersit :

    • 1: Clear, Few clouds, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
  • temp : Normalized temperature in Celsius. The values are divided by 41 (max)
  • atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
  • hum: Normalized humidity. The values are divided by 100 (max)
  • windspeed: Normalized wind speed. The values are divided by 67 (max)
  • casual: count of casual users
  • registered: count of registered users
  • cnt: count of total rental bikes including both casual and registered

With that let’s get started.

Basic Setup

There are four main steps in a machine learning project like this. 1. Clean and set up the data. 2. Train a model based on the data. 3. Evaluate the model. 4. Use the model to make predictions. The main tool in ML.NET is the MLContext. The documentation explains this well, but basically it is the entry point used to create, and access every part of a machine learning project.

The common context for all ML.NET operations. Once instantiated by the user, it provides a way to create components for data preparation, feature engineering, training, prediction, model evaluation. It also allows logging, execution control, and the ability to set repeatable random numbers.

static readonly string _trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "bkTrain.csv");
static readonly string _testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "bkTest.csv");
MLContext mlContext = new MLContext(seed: 0);

The seed is used for internal random number generation. Fixing it makes repeated runs give the same results.

Classes are used in ML.NET to describe the shape of the data: which columns get loaded, and where the prediction ends up. We are going to create two classes, a feature class, and a prediction class.

public class BikeShare
    {
        [LoadColumn(0)]
        public float instant;
        [LoadColumn(1)]
        public string dteday; // the date column is text (for example 2011-01-01), so load it as a string
        [LoadColumn(2)]
        public float season;
        [LoadColumn(3)]
        public float yr;
        [LoadColumn(4)]
        public float mnth;
        [LoadColumn(5)]
        public float holiday;
        [LoadColumn(6)]
        public float weekday;
        [LoadColumn(7)]
        public float workingday;
        [LoadColumn(8)]
        public float weathersit;
        [LoadColumn(9)]
        public float temp;
        [LoadColumn(10)]
        public float atemp;
        [LoadColumn(11)]
        public float hum;
        [LoadColumn(12)]
        public float windspeed;
        [LoadColumn(13)]
        public float casual;
        [LoadColumn(14)]
        public float registered;
        [LoadColumn(15)]
        public float cnt;
    }

This is pretty straightforward. Every column in the dataset gets a field, and each field gets an attribute called LoadColumn with a fieldIndex parameter which lets ML.NET know which column goes to which field when the data is loaded.

 public class BikeSharePrediction
    {
        [ColumnName("Score")]
        public float cnt;
    }

This class is for what we are trying to predict. Regression trainers in ML.NET write their output to a column named Score, and the ColumnName attribute maps that column onto our cnt field. The same Score column will be used later for evaluation.

Cleaning and Dividing the data

Thankfully this dataset does not need to be cleaned up as it has already been preprocessed. All categorical variables have already been changed to an enumerated value to make data processing easier. Furthermore, null and/or nonsense values have been removed from the dataset. The data is then divided into two sets: one for training, and another for testing. Here we used the conventional 80/20 split for train/test data. The test set lets us evaluate our model against data it has not seen.

static readonly string _trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "bkTrain.csv");
static readonly string _testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "bkTest.csv");

This gives us the paths to the training and test data files. For each file, the property Copy to Output Directory should be changed to Copy if newer so that the files end up next to the built executable.
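The post does not show how the 80/20 split was produced, so here is one way it could be done, as a minimal sketch in plain C#. The DataSplitter class, its method names, and the seeded shuffle are assumptions for illustration, not the code used for the results below.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper (not part of ML.NET) that splits a CSV into train and test files.
public static class DataSplitter
{
    // Shuffles the rows with a fixed seed, then cuts them into train and test portions.
    public static (List<string> Train, List<string> Test) Split(
        IReadOnlyList<string> rows, double trainFraction = 0.8, int seed = 0)
    {
        var rng = new Random(seed);

        // Fisher-Yates shuffle on a copy so the caller's list is left untouched.
        var shuffled = rows.ToList();
        for (int i = shuffled.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
        }

        int trainCount = (int)Math.Round(shuffled.Count * trainFraction);
        return (shuffled.Take(trainCount).ToList(), shuffled.Skip(trainCount).ToList());
    }

    // Reads a CSV that has a header row and writes two CSVs that both keep the header.
    public static void SplitFile(string sourcePath, string trainPath, string testPath)
    {
        string[] lines = File.ReadAllLines(sourcePath);
        string header = lines[0];
        var (train, test) = Split(lines.Skip(1).ToList());

        File.WriteAllLines(trainPath, new[] { header }.Concat(train));
        File.WriteAllLines(testPath, new[] { header }.Concat(test));
    }
}
```

Assuming the daily file from the UCI download is named day.csv, a call such as DataSplitter.SplitFile("day.csv", "Data/bkTrain.csv", "Data/bkTest.csv") would produce the two files used here. ML.NET also offers mlContext.Data.TrainTestSplit for splitting an IDataView in memory, which avoids the extra files.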

Training

This is the part where we create a model based on the training data.

IDataView dataView = mlContext.Data.LoadFromTextFile<BikeShare>(_trainDataPath, hasHeader: true, separatorChar: ',');

We load the data from the training file by passing the path in. We then tell it if there is a header, and how the data is formatted. In this case there is a header, and it is comma separated. This gives us a nice dataview that we can operate on.

var pipeline = mlContext.Transforms.CopyColumns(outputColumnName: "Label", inputColumnName: "cnt")
                   .Append(mlContext.Transforms.Concatenate("Features", "season", "yr", "mnth", "holiday", "workingday", "weathersit", "temp", "hum", "windspeed"))
                   .Append(mlContext.Regression.Trainers.FastTree());

Here we build a pipeline. We use CopyColumns to create a Label column from the cnt column so ML.NET knows what it is trying to predict. We then add a list of features which will be used to make the prediction. However, we will not be using all of the columns. We are leaving out: instant, dteday, hr, casual, registered, and cnt. The instant and date will not tell us anything. The hour is not available in the daily dataset. The casual and registered counts add up to cnt, the count we are trying to predict, so using them would give the answer away. Note that the pipeline as written also leaves weekday and atemp out of the feature list.

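After this we pick the trainer to use to build the model. The problem we are trying to solve is a regression problem so we will use a regression trainer. In this case we are going to use FastTree, a gradient-boosted decision tree trainer that ships in the separate Microsoft.ML.FastTree NuGet package.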

var model = pipeline.Fit(dataView);

This is the part where everything comes together. Now that the pipeline is built we run our data contained in the dataview through our pipeline with Fit. Now we have a model.

R-Squared and RMSE

Before we get to evaluating our model, then using it for prediction, let’s go over a couple of metrics we are going to use for the evaluation. Namely r-squared, or the coefficient of determination, and RMSE, or root mean square error. Wikipedia's definition of the first:

In statistics, the coefficient of determination, denoted R² or r² and pronounced “R squared”, is the proportion of the variance in the dependent variable that is predictable from the independent variable.

What this means is that r-squared measures how much of the variation in the target our features explain. In other words, it is how strong the relationship between our model and the target is. The r-squared usually falls between 0 and 1 and is interpreted as 0% to 100%. The closer this number is to 1 the better. On test data it can dip below 0 when a model does worse than simply predicting the mean.
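In symbols, the usual definition looks like this. The notation is introduced here for illustration: y_i is the actual rental count for day i, \hat{y}_i is the model's prediction, \bar{y} is the mean of the actual counts, and n is the number of days being evaluated.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The numerator is the error left over after the model's predictions, and the denominator is the error we would get by always guessing the mean. A perfect model drives the fraction to 0 and r-squared to 1.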

For RMSE once again Wikipedia has a solid definition:

The root-mean-square deviation or root-mean-square error is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed.

Basically this is how far off the predicted values from our model are from the true values, measured in the same units as the target. The smaller this number is the better.
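Written out, with y_i as the actual rental count for day i, \hat{y}_i as the predicted count, and n as the number of days (symbols chosen here for illustration):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.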

Evaluation

IDataView dataView = mlContext.Data.LoadFromTextFile<BikeShare>(_testDataPath, hasHeader: true, separatorChar: ',');

We start off by loading the data from our test file. This is done the same way we loaded the training data, and gives us a convenient dataview.

var predictions = model.Transform(dataView);

Then run the test data through the model. This gives us another data view containing the predictions for each entry in the test data.

var metrics = mlContext.Regression.Evaluate(predictions, "Label", "Score");

Now it is just a matter of evaluation. The predictions are evaluated against the actual values which were copied into the Label column earlier. Metrics are then returned, and RSquared and RootMeanSquaredError can be accessed from that.
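To make those two metrics less of a black box, both can be computed by hand from arrays of actual and predicted values. This is a minimal sketch in plain C#. RegressionCheck is a hypothetical helper written for this post, not part of ML.NET, and it is only meant as a sanity check on what the numbers mean.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of ML.NET) that computes RMSE and r-squared by hand.
public static class RegressionCheck
{
    // Root mean squared error: square root of the average squared residual.
    public static double Rmse(double[] actual, double[] predicted)
    {
        Validate(actual, predicted);
        double sumSquares = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        return Math.Sqrt(sumSquares / actual.Length);
    }

    // R-squared: 1 minus (residual sum of squares / total sum of squares).
    // The result is not meaningful when every actual value is identical,
    // because the total sum of squares is then zero.
    public static double RSquared(double[] actual, double[] predicted)
    {
        Validate(actual, predicted);
        double mean = actual.Average();
        double residual = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();
        double total = actual.Sum(a => (a - mean) * (a - mean));
        return 1.0 - residual / total;
    }

    private static void Validate(double[] actual, double[] predicted)
    {
        if (actual.Length == 0 || actual.Length != predicted.Length)
            throw new ArgumentException("Arrays must be non-empty and equal in length.");
    }
}
```

Feeding in the Label and Score values for the test rows should land close to what Evaluate reports, allowing for small floating point differences.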

Prediction

We can also use the model that we created to make predictions on unseen data.

var predictionFunction = mlContext.Model.CreatePredictionEngine<BikeShare, BikeSharePrediction>(model);

Once we have the PredictionEngine we can create BikeShare objects holding feature values, then run them through the predictionFunction to predict the rental count.

var prediction = predictionFunction.Predict(sample);
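The snippets in this post leave out how the sample object is built, so here is a hedged end-to-end sketch that stitches the pieces together. It assumes the Microsoft.ML and Microsoft.ML.FastTree NuGet packages are installed and that Data/bkTrain.csv exists next to the executable. The feature values in the sample are made up for illustration, and the date column is typed as a string in this sketch.

```csharp
using System;
using System.IO;
using Microsoft.ML;
using Microsoft.ML.Data;

public class BikeShare
{
    [LoadColumn(0)] public float instant;
    [LoadColumn(1)] public string dteday;   // text date, unused by the model
    [LoadColumn(2)] public float season;
    [LoadColumn(3)] public float yr;
    [LoadColumn(4)] public float mnth;
    [LoadColumn(5)] public float holiday;
    [LoadColumn(6)] public float weekday;
    [LoadColumn(7)] public float workingday;
    [LoadColumn(8)] public float weathersit;
    [LoadColumn(9)] public float temp;
    [LoadColumn(10)] public float atemp;
    [LoadColumn(11)] public float hum;
    [LoadColumn(12)] public float windspeed;
    [LoadColumn(13)] public float casual;
    [LoadColumn(14)] public float registered;
    [LoadColumn(15)] public float cnt;
}

public class BikeSharePrediction
{
    [ColumnName("Score")] public float cnt;
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);
        string trainPath = Path.Combine(Environment.CurrentDirectory, "Data", "bkTrain.csv");

        // Load the training data and fit the same pipeline described in the post.
        IDataView trainData = mlContext.Data.LoadFromTextFile<BikeShare>(
            trainPath, hasHeader: true, separatorChar: ',');
        var pipeline = mlContext.Transforms.CopyColumns(outputColumnName: "Label", inputColumnName: "cnt")
            .Append(mlContext.Transforms.Concatenate("Features", "season", "yr", "mnth", "holiday",
                "workingday", "weathersit", "temp", "hum", "windspeed"))
            .Append(mlContext.Regression.Trainers.FastTree());
        var model = pipeline.Fit(trainData);

        var predictionFunction =
            mlContext.Model.CreatePredictionEngine<BikeShare, BikeSharePrediction>(model);

        // Hypothetical day: a clear summer working day in the second year of data.
        // Only the fields used as features need values; the rest keep their defaults.
        var sample = new BikeShare
        {
            season = 2, yr = 1, mnth = 7, holiday = 0, workingday = 1,
            weathersit = 1, temp = 0.7f, hum = 0.5f, windspeed = 0.2f
        };

        var prediction = predictionFunction.Predict(sample);
        Console.WriteLine($"Predicted cnt: {prediction.cnt}");
    }
}
```

Remember that temp, hum, and windspeed are normalized, so the sample has to use the same 0 to 1 scale as the training data rather than raw readings.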

Results

Using this setup we were able to get pretty good results.

RSquared Score:      0.73
Root Mean Squared Error:      964.99

Predicted cnt: 2616.396, actual cnt: 2729
Predicted cnt: 6751.012, actual cnt: 7013

The r-squared score tells us that about 73% of the variation in our target, the bike rental count for the day, can be explained by the features that we used. Furthermore, when making predictions our typical error, as measured by RMSE, was about 965 rentals in a given day. This seems high at first, but considering the max is 8714 this represents an 11% error (964.99 / 8714 ≈ 0.11). We can see from the two sample predictions, which miss by roughly 4% each, that we are not too far off. The model looks to be reliable. Not great, but not bad.

Hopefully this was as fun for you as it was for me. I’ll be coming back to this to see what else we can do. Maybe we can get even better results.


Written by Carl Lloyd. He spends his time playing with technology, and learning new things.