In this tutorial we're going to forecast the number of bikes in 5 bike stations from the city of Toulouse. We'll do so by building a simple model step by step. The dataset contains 182,470 observations. Let's first take a peak at the data.

In [1]:

Copied!





from pprint import pprint
from river import datasets

dataset = datasets.Bikes()

for x, y in dataset:
    pprint(x)
    print(f"Number of available bikes: {y}")
    break
from pprint import pprint
from river import datasets

dataset = datasets.Bikes()

for x, y in dataset:
    pprint(x)
    print(f"Number of available bikes: {y}")
    break

{'clouds': 75,
 'description': 'light rain',
 'humidity': 81,
 'moment': datetime.datetime(2016, 4, 1, 0, 0, 7),
 'pressure': 1017.0,
 'station': 'metro-canal-du-midi',
 'temperature': 6.54,
 'wind': 9.3}
Number of available bikes: 1

Let's start by using a simple linear regression on the numeric features. We can select the numeric features and discard the rest of the features using a Select. Linear regression is very likely to go haywire if we don't scale the data, so we'll use a StandardScaler to do just that. We'll evaluate the model by measuring the mean absolute error. Finally we'll print the score every 20,000 observations.

In [2]:

Copied!





from river import compose
from river import linear_model
from river import metrics
from river import evaluate
from river import preprocessing
from river import optim

model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model |= preprocessing.StandardScaler()
model |= linear_model.LinearRegression(optimizer=optim.SGD(0.001))

metric = metrics.MAE()

evaluate.progressive_val_score(
    dataset.take(50000), model, metric, print_every=5_000
)
from river import compose
from river import linear_model
from river import metrics
from river import evaluate
from river import preprocessing
from river import optim

model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model |= preprocessing.StandardScaler()
model |= linear_model.LinearRegression(optimizer=optim.SGD(0.001))

metric = metrics.MAE()

evaluate.progressive_val_score(
    dataset.take(50000), model, metric, print_every=5_000
)

[5,000] MAE: 4.258195

[10,000] MAE: 4.495646

[15,000] MAE: 4.752093

[20,000] MAE: 4.912763

[25,000] MAE: 4.93422

[30,000] MAE: 5.164359

[35,000] MAE: 5.320904

[40,000] MAE: 5.333578

[45,000] MAE: 5.35498

[50,000] MAE: 5.378719

Out[2]:

MAE: 5.378719

The model doesn't seem to be doing that well, but then again we didn't provide a lot of features. Generally, a good idea for this kind of problem is to look at an average of the previous values. For example, for each station we can look at the average number of bikes per hour. To do so we first have to extract the hour from the moment field. We can then use a TargetAgg to aggregate the values of the target.

In [3]:

Copied!





from river import feature_extraction
from river import stats


def get_hour(x):
    x["hour"] = x["moment"].hour
    return x


model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model += get_hour | feature_extraction.TargetAgg(
    by=["station", "hour"], how=stats.Mean()
)
model |= preprocessing.StandardScaler()
model |= linear_model.LinearRegression(optimizer=optim.SGD(1e-2))

metric = metrics.MAE()

evaluate.progressive_val_score(
    dataset.take(50000), model, metric, print_every=5_000
)
from river import feature_extraction
from river import stats


def get_hour(x):
    x["hour"] = x["moment"].hour
    return x


model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model += get_hour | feature_extraction.TargetAgg(
    by=["station", "hour"], how=stats.Mean()
)
model |= preprocessing.StandardScaler()
model |= linear_model.LinearRegression(optimizer=optim.SGD(1e-2))

metric = metrics.MAE()

evaluate.progressive_val_score(
    dataset.take(50000), model, metric, print_every=5_000
)

[5,000] MAE: 70.674347

[10,000] MAE: 37.085408

[15,000] MAE: 25.784916

[20,000] MAE: 20.189606

[25,000] MAE: 16.932216

[30,000] MAE: 14.674797

[35,000] MAE: 13.090288

[40,000] MAE: 11.851672

[45,000] MAE: 10.82784

[50,000] MAE: 10.110409

Out[3]:

MAE: 10.110409

By adding a single feature, we've managed to significantly reduce the mean absolute error. At this point you might think that the model is getting slightly complex, and is difficult to understand and test. Pipelines have the advantage of being terse, but they aren't always to debug. Thankfully river has some ways to relieve the pain.

The first thing we can do it to visualize the pipeline, to get an idea of how the data flows through it.

In [4]:

Copied!

model
model

Out[4]:

['clouds', [...]

Select (
  clouds
  humidity
  pressure
  temperature
  wind
)

get_hour


def get_hour(x):
    x["hour"] = x["moment"].hour
    return x

y_mean_by_station_and_hour

TargetAgg (
  by=['station', 'hour']
  how=Mean ()
  target_name="y"
)

StandardScaler

StandardScaler (
  with_std=True
)

LinearRegression

LinearRegression (
  optimizer=SGD (
    lr=Constant (
      learning_rate=0.01
    )
  )
  loss=Squared ()
  l2=0.
  l1=0.
  intercept_init=0.
  intercept_lr=Constant (
    learning_rate=0.01
  )
  clip_gradient=1e+12
  initializer=Zeros ()
)

The debug_one method shows what happens to an input set of features, step by step.

And now comes the catch. Up until now we've been using the progressive_val_score method from the evaluate module. What this does is that it sequentially predicts the output of an observation and updates the model immediately afterwards. This way of proceeding is often used for evaluating online learning models. But in some cases it is the wrong approach.

When evaluating a machine learning model, the goal is to simulate production conditions in order to get a trust-worthy assessment of the performance of the model. In our case, we typically want to forecast the number of bikes available in a station, say, 30 minutes ahead. Then, once the 30 minutes have passed, the true number of available bikes will be available and we will be able to update the model using the features available 30 minutes ago.

What we really want is to evaluate the model by forecasting 30 minutes ahead and only updating the model once the true values are available. This can be done using the moment and delay parameters in the progressive_val_score method. The idea is that each observation in the stream of the data is shown twice to the model: once for making a prediction, and once for updating the model when the true value is revealed. The moment parameter determines which variable should be used as a timestamp, while the delay parameter controls the duration to wait before revealing the true values to the model.

In [5]:

Copied!





import datetime as dt

evaluate.progressive_val_score(
    dataset=dataset.take(50000),
    model=model.clone(),
    metric=metrics.MAE(),
    moment="moment",
    delay=dt.timedelta(minutes=30),
    print_every=5_000,
)
import datetime as dt

evaluate.progressive_val_score(
    dataset=dataset.take(50000),
    model=model.clone(),
    metric=metrics.MAE(),
    moment="moment",
    delay=dt.timedelta(minutes=30),
    print_every=5_000,
)

[5,000] MAE: 68.449039

[10,000] MAE: 36.322091

[15,000] MAE: 25.515207

[20,000] MAE: 20.198137

[25,000] MAE: 17.023774

[30,000] MAE: 14.857317

[35,000] MAE: 13.370759

[40,000] MAE: 12.199763

[45,000] MAE: 11.218521

[50,000] MAE: 10.522559

Out[5]:

MAE: 10.522559

The performance is a bit worse, which is to be expected. Indeed, the task is more difficult: the model is only shown the ground truth 30 minutes after making a prediction.

Goin' Deep¶

Rebuilding Linear Regression in PyTorch¶

In [6]:

Copied!





from deep_river.regression import Regressor
from river import feature_extraction
from river import stats
import torch


class LinearRegression(torch.nn.Module):
    def __init__(self, n_features, outputSize=1):
        super(LinearRegression, self).__init__()
        self.linear = torch.nn.Linear(n_features, outputSize)

    def forward(self, x):
        out = self.linear(x)
        return out


model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model += get_hour | feature_extraction.TargetAgg(
    by=["station", "hour"], how=stats.Mean()
)
model |= preprocessing.StandardScaler()
model |= Regressor(
    module=LinearRegression(10),
    loss_fn="mse",
    optimizer_fn="sgd",
    lr=1e-2,
)
from deep_river.regression import Regressor
from river import feature_extraction
from river import stats
import torch


class LinearRegression(torch.nn.Module):
    def __init__(self, n_features, outputSize=1):
        super(LinearRegression, self).__init__()
        self.linear = torch.nn.Linear(n_features, outputSize)

    def forward(self, x):
        out = self.linear(x)
        return out


model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model += get_hour | feature_extraction.TargetAgg(
    by=["station", "hour"], how=stats.Mean()
)
model |= preprocessing.StandardScaler()
model |= Regressor(
    module=LinearRegression(10),
    loss_fn="mse",
    optimizer_fn="sgd",
    lr=1e-2,
)

In [7]:

Copied!





import datetime as dt

evaluate.progressive_val_score(
    dataset=dataset.take(50000),
    model=model.clone(),
    metric=metrics.MAE(),
    moment="moment",
    delay=dt.timedelta(minutes=30),
    print_every=5_000,
)
import datetime as dt

evaluate.progressive_val_score(
    dataset=dataset.take(50000),
    model=model.clone(),
    metric=metrics.MAE(),
    moment="moment",
    delay=dt.timedelta(minutes=30),
    print_every=5_000,
)

[5,000] MAE: 69.777636

[10,000] MAE: 36.989336

[15,000] MAE: 25.960218

[20,000] MAE: 20.53185

[25,000] MAE: 17.290755

[30,000] MAE: 15.079789

[35,000] MAE: 13.561451

[40,000] MAE: 12.366619

[45,000] MAE: 11.366838

[50,000] MAE: 10.656045

Out[7]:

MAE: 10.656045

Building RNN Models¶

In [8]:

Copied!





from deep_river.regression import RollingRegressor
from river import feature_extraction
from river import stats
import torch


class RnnModule(torch.nn.Module):

    def __init__(self, n_features, hidden_size):
        super().__init__()
        self.n_features = n_features
        self.rnn = torch.nn.RNN(
            input_size=n_features, hidden_size=hidden_size, num_layers=1
        )
        self.fc = torch.nn.Linear(in_features=hidden_size, out_features=1)

    def forward(self, X, **kwargs):
        output, hn = self.rnn(X)  # lstm with input, hidden, and internal state
        return self.fc(output[-1, :])


model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model += get_hour | feature_extraction.TargetAgg(
    by=["station", "hour"], how=stats.Mean()
)
model |= preprocessing.StandardScaler()
model |= RollingRegressor(
    module=RnnModule(10, hidden_size=16),
    loss_fn="mse",
    optimizer_fn="sgd",
    lr=1e-2,
    hidden_size=20,
    window_size=32,
)
from deep_river.regression import RollingRegressor
from river import feature_extraction
from river import stats
import torch


class RnnModule(torch.nn.Module):

    def __init__(self, n_features, hidden_size):
        super().__init__()
        self.n_features = n_features
        self.rnn = torch.nn.RNN(
            input_size=n_features, hidden_size=hidden_size, num_layers=1
        )
        self.fc = torch.nn.Linear(in_features=hidden_size, out_features=1)

    def forward(self, X, **kwargs):
        output, hn = self.rnn(X)  # lstm with input, hidden, and internal state
        return self.fc(output[-1, :])


model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model += get_hour | feature_extraction.TargetAgg(
    by=["station", "hour"], how=stats.Mean()
)
model |= preprocessing.StandardScaler()
model |= RollingRegressor(
    module=RnnModule(10, hidden_size=16),
    loss_fn="mse",
    optimizer_fn="sgd",
    lr=1e-2,
    hidden_size=20,
    window_size=32,
)

In [9]:

Copied!





import datetime as dt

evaluate.progressive_val_score(
    dataset=dataset.take(50000),
    model=model.clone(),
    metric=metrics.MAE(),
    moment="moment",
    delay=dt.timedelta(minutes=30),
    print_every=5_000,
)
import datetime as dt

evaluate.progressive_val_score(
    dataset=dataset.take(50000),
    model=model.clone(),
    metric=metrics.MAE(),
    moment="moment",
    delay=dt.timedelta(minutes=30),
    print_every=5_000,
)

[5,000] MAE: 4.308362

[10,000] MAE: 4.350874

[15,000] MAE: 4.23753

[20,000] MAE: 4.249423

[25,000] MAE: 4.292688

[30,000] MAE: 4.319149

[35,000] MAE: 4.380308

[40,000] MAE: 4.357141

[45,000] MAE: 4.262677

[50,000] MAE: 4.308476

Out[9]:

MAE: 4.308476

Building LSTM Models¶

In [10]:

Copied!





class LstmModule(torch.nn.Module):

    def __init__(self, n_features, hidden_size=1):
        super().__init__()
        self.n_features = n_features
        self.hidden_size = hidden_size
        self.lstm = torch.nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,
            num_layers=1,
            bidirectional=False,
        )
        self.fc = torch.nn.Linear(in_features=hidden_size, out_features=1)

    def forward(self, X, **kwargs):
        output, (hn, cn) = self.lstm(
            X
        )  # lstm with input, hidden, and internal state
        return self.fc(output[-1, :])


model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model += get_hour | feature_extraction.TargetAgg(
    by=["station", "hour"], how=stats.Mean()
)
model |= preprocessing.StandardScaler()
model |= RollingRegressor(
    module=LstmModule(10, hidden_size=16),
    loss_fn="mse",
    optimizer_fn="sgd",
    lr=1e-2,
    hidden_size=20,
    window_size=32,
)
class LstmModule(torch.nn.Module):

    def __init__(self, n_features, hidden_size=1):
        super().__init__()
        self.n_features = n_features
        self.hidden_size = hidden_size
        self.lstm = torch.nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,
            num_layers=1,
            bidirectional=False,
        )
        self.fc = torch.nn.Linear(in_features=hidden_size, out_features=1)

    def forward(self, X, **kwargs):
        output, (hn, cn) = self.lstm(
            X
        )  # lstm with input, hidden, and internal state
        return self.fc(output[-1, :])


model = compose.Select("clouds", "humidity", "pressure", "temperature", "wind")
model += get_hour | feature_extraction.TargetAgg(
    by=["station", "hour"], how=stats.Mean()
)
model |= preprocessing.StandardScaler()
model |= RollingRegressor(
    module=LstmModule(10, hidden_size=16),
    loss_fn="mse",
    optimizer_fn="sgd",
    lr=1e-2,
    hidden_size=20,
    window_size=32,
)

In [11]:

Copied!





import datetime as dt

evaluate.progressive_val_score(
    dataset=dataset.take(50000),
    model=model.clone(),
    metric=metrics.MAE(),
    moment="moment",
    delay=dt.timedelta(minutes=30),
    print_every=5_000,
)
import datetime as dt

evaluate.progressive_val_score(
    dataset=dataset.take(50000),
    model=model.clone(),
    metric=metrics.MAE(),
    moment="moment",
    delay=dt.timedelta(minutes=30),
    print_every=5_000,
)

[5,000] MAE: 4.245994

[10,000] MAE: 4.000317

[15,000] MAE: 3.823724

[20,000] MAE: 3.819412

[25,000] MAE: 3.86971

[30,000] MAE: 3.820139

[35,000] MAE: 3.843358

[40,000] MAE: 3.789714

[45,000] MAE: 3.731681

[50,000] MAE: 3.758691

Out[11]:

MAE: 3.758691

In [ ]:

Bike-sharing forecasting¶

Goin' Deep¶

Rebuilding Linear Regression in PyTorch¶

Building RNN Models¶

Building LSTM Models¶