Carl Chester Lloyd


Bike Sharing Revisited

August 23, 2019


The last time I looked at the bike sharing dataset for regression, I was doing a couple of things either wrong or in a suboptimal way. I probably still am even after this article, but at least I will be a little closer this time around.

  1. The way I was splitting the data for testing and training was wrong.

  2. The way I was evaluating the model was suboptimal.

Now let’s look at what was off, and the better way of doing things.

Splitting data for training and testing

Last time I manually split the data into a training set with 80% of the data and a test set with the remaining 20%, each in its own file, bkTrain and bkTest respectively. The split ratio was fine, but the way I produced it was wrong. I wrote my own script to split the data, and I took the training set from the first 80% of the file. One problem with this is that I did not need to write my own script. The bigger problem is that I should not have taken the first 80% of the rows without randomizing the data first; this dataset is ordered by date, so a sequential split leaves the test set looking different from the training set.

Luckily, ML.NET comes with the correct way of doing this part of the process: the TrainTestSplit method. There is no need to write a custom script to split data, because this method does exactly that. Setting the testFraction parameter to 0.2 gives a nice 80/20 train/test split of our data.

DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
IDataView trainData = dataSplit.TrainSet;
IDataView testData = dataSplit.TestSet;
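TrainTestSplit handles the shuffling for me, but it is worth seeing why that matters. Below is a rough sketch, in plain C# over row indices rather than an IDataView, of the shuffle-then-partition idea. This is not what the library does internally; the seed and the 731-row count are just illustrative:

```csharp
using System;
using System.Linq;

class SplitSketch
{
    static void Main()
    {
        // 731 row indices, matching the size of the bike sharing dataset.
        int[] rows = Enumerable.Range(0, 731).ToArray();

        // Shuffle with a fixed seed so the split is reproducible,
        // then take 80% for training and the remaining 20% for testing.
        var rng = new Random(0);
        int[] shuffled = rows.OrderBy(_ => rng.Next()).ToArray();

        int trainCount = (int)(shuffled.Length * 0.8);
        int[] train = shuffled.Take(trainCount).ToArray();
        int[] test = shuffled.Skip(trainCount).ToArray();

        Console.WriteLine($"{train.Length} train rows, {test.Length} test rows");
        // prints "584 train rows, 147 test rows"
    }
}
```

Shuffling first means both sets sample the whole date range instead of the test set being the last few months of data.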

Model Evaluation with Cross Validation

The second thing that was not exactly wrong, but suboptimal, was the evaluation of the model. Before, I was holding back 20% of the data to test my model. That method is generally better suited to larger datasets. With only 731 entries this is a fairly small dataset, so a cross-validation approach is a better fit.

This method works in a similar way to the previous one, but it is done many times over, with different data each time. Instead of testing only against the particular data I set aside up front, I train and then test against a new split of the data k times. The technique is known as k-fold cross-validation, and each fold uses a different train/test (or train/validation) split.

For this I will use five folds. The metrics are then averaged over the five evaluations to get the final results.

IDataView transformedData = model.Transform(dataView);

var cvResults = mlContext.Regression.CrossValidate(transformedData, estimator, numberOfFolds: 5);
var rSquared = cvResults.Average(fold => fold.Metrics.RSquared);
var rmse = cvResults.Average(fold => fold.Metrics.RootMeanSquaredError);
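ML.NET does all of the fold bookkeeping inside CrossValidate, but the mechanics are simple enough to sketch by hand. This is not the library's implementation, just the idea: shuffle once, then rotate which slice of the rows serves as the validation set (the seed and row count are illustrative):

```csharp
using System;
using System.Linq;

class KFoldSketch
{
    static void Main()
    {
        int n = 731;  // rows in the dataset
        int k = 5;    // number of folds

        // Shuffle the row indices once, then carve them into k folds.
        var rng = new Random(0);
        int[] shuffled = Enumerable.Range(0, n).OrderBy(_ => rng.Next()).ToArray();

        for (int fold = 0; fold < k; fold++)
        {
            // Every index whose position modulo k equals the fold number
            // goes into the validation set; the rest are training data.
            int[] validation = shuffled.Where((_, i) => i % k == fold).ToArray();
            int[] training   = shuffled.Where((_, i) => i % k != fold).ToArray();

            // Fold 0 gets 147 validation rows and the rest get 146,
            // since 731 does not divide evenly by 5.
            Console.WriteLine($"Fold {fold}: {training.Length} train, {validation.Length} validation");
        }
    }
}
```

Every row ends up in the validation set exactly once, which is why the averaged metrics are a fairer estimate than a single held-out split on a small dataset.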

Putting it all together

The estimator is the FastTree regression trainer, like last time, and the MLContext is created with a fixed seed. Main then calls Train followed by CrossValidation. Train loads the data into a data view, which is then split. A pipeline is constructed from the features and the target (or label), with the estimator appended, and the pipeline is fit on the training data. In CrossValidation we load the data once more, transform it through the trained model, then run CrossValidate on the transformedData using the estimator for the specified number of folds, in this case five. The Transform call on the newly loaded data view is important: it validates the schema of the data view and checks that the Features column has been specified. Once CrossValidate is done, the metrics are averaged over the number of folds.

       static readonly string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "day.csv");
       static void Main(string[] args)
       {
           var mlContext = new MLContext(seed: 0);
           var estimator = mlContext.Regression.Trainers.FastTree();

           var model = Train<BikeShare>(mlContext, _dataPath, estimator);

           CrossValidation<BikeShare>(mlContext, model, _dataPath, estimator);
       }
       
       public static ITransformer Train<T>(MLContext mlContext, string dataPath, IEstimator<ITransformer> estimator)
       {
           IDataView dataView = mlContext.Data.LoadFromTextFile<T>(dataPath, hasHeader: true, separatorChar: ',');
           DataOperationsCatalog.TrainTestData dataSplit = mlContext.Data.TrainTestSplit(dataView, testFraction: 0.2);
           IDataView trainData = dataSplit.TrainSet;
           IDataView testData = dataSplit.TestSet;

           var pipeline = mlContext.Transforms.CopyColumns(outputColumnName: "Label", inputColumnName: "cnt")
                   .Append(mlContext.Transforms.Concatenate("Features", "season", "yr", "mnth", "holiday", "weekday", "workingday", "weathersit", "atemp", "temp", "hum", "windspeed"))
                   .Append(estimator);

           Console.WriteLine("=============== Create and Train the Model ===============");
           var model = pipeline.Fit(trainData);
           Console.WriteLine("=============== End of training ===============");
           Console.WriteLine();

           return model;
       }

       private static ITransformer CrossValidation<T>(MLContext mlContext, ITransformer model, string dataPath, IEstimator<ITransformer> estimator)
       {
           IDataView dataView = mlContext.Data.LoadFromTextFile<T>(dataPath, hasHeader: true, separatorChar: ',');
           IDataView transformedData = model.Transform(dataView);

           var cvResults = mlContext.Regression.CrossValidate(transformedData, estimator, numberOfFolds: 5);
           var rSquared = cvResults.Average(fold => fold.Metrics.RSquared);
           var rmse = cvResults.Average(fold => fold.Metrics.RootMeanSquaredError);

           ITransformer[] models = cvResults.OrderByDescending(fold => fold.Metrics.RSquared).Select(fold => fold.Model).ToArray();
           ITransformer topModel = models[0];

           Console.WriteLine();
           Console.WriteLine($"*************************************************");
           Console.WriteLine($"*       Model quality metrics cross validation        ");
           Console.WriteLine($"*------------------------------------------------");
           Console.WriteLine($"*       RSquared Score: {rSquared:0.##}     ");
           Console.WriteLine($"*       Root Mean Squared Error:      {rmse:#.##}");
           Console.WriteLine($"*************************************************");

           return topModel;
       }

Results

The previous results:

RSquared Score:      0.73
Root Mean Squared Error:      964.99

The new results:

RSquared Score:      0.88
Root Mean Squared Error:      680.03

The results here are better than last time. The RSquared score jumped by 0.15, and the root mean squared error decreased by 284.96 bikes. This means our model is better at explaining the target with our features, and our average error is down by nearly 30%. I would say using these methods was a success. I hope you go off, use them too, and don't repeat my mistakes.


Carl Lloyd

Written by Carl Lloyd. He spends his time playing with technology, and learning new things.