The Thinking Network (Installment 10): Predictive Routing with Machine Learning

Installment 10 of The Thinking Network. Intelligence enters the loop. We implement a machine learning model using linear regression to predict network congestion and proactively reroute BGP traffic.


Architecture Overview: Phase 7 (The Predictive Brain)

Objective: Transition from reactive threshold monitoring to anticipatory, AI-driven traffic engineering.

Core Technologies: Machine Learning (Linear Regression), Predictive Analytics, Python data science libraries, and Time-Series forecasting.

The Goal: Train a local model on 60 seconds of rolling telemetry data to predict latency SLA breaches 30 seconds into the future, triggering a proactive BGP reroute before the network actually degrades.

Everything before this installment was necessary.

IS-IS built the routing foundation. BGP provided the control plane lever. The health audit verified the fabric was sound. The sensing layer made the network's state legible. The reactive actuator proved the network could respond to a condition without human intervention.

All of that was prelude.

This installment is about the moment the system stopped waiting for a problem to arrive and started reading where the data was going.


The Difference Between Reactive and Predictive

Phase 6 - the reactive actuator - works like this: measure the current value, compare it to the threshold, act if the threshold is crossed.

The gap between the threshold crossing and the action is real. Phase 6 polls every ten seconds. If latency crosses 5ms at second zero of a cycle, the actuator does not fire until second ten. During those ten seconds, the network is in breach. The SLA is being violated. No corrective action has been taken yet.

Phase 7 asks a different question. Not: is the current value above the threshold? But: given where this value has been for the last sixty seconds, where will it be in thirty seconds?

If the answer is above the threshold, the action happens now. Not in response to a breach. In anticipation of one.

That is the distinction that matters. The reactive system catches the problem. The predictive system prevents it.


Machine Learning in Networking: How Linear Regression Works

It is worth being precise about the model being used here, because the word "AI" carries expectations that linear regression does not always meet.

Linear regression fits a straight line through a set of data points. Given a sequence of timestamped latency measurements - sixty seconds of history from Prometheus - it calculates the line that best describes the trend in that data. That line has a slope. A positive slope means values are rising. A negative slope means they are falling.

Once the line is fitted, it is extended. The model calculates where the line will be thirty seconds from now if the current trend continues. That is the prediction.

This is not deep learning. It is not a neural network. It does not require a GPU. It does not require thousands of training examples. It runs on the i7-12800H in this machine without measurable CPU impact. The entire computation - fetch data, fit model, predict - completes in milliseconds.

What makes it powerful is not the sophistication of the algorithm. It is the quality and continuity of the data it reads. Sixty seconds of timestamped latency measurements, collected every ten seconds by Phase 5, is a dataset that contains directional information. The model reads that direction and projects it forward.
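The fit-and-extend step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Phase 7 code: the function names fit_line and predict_at are hypothetical, the sample values are made up so the arithmetic is easy to follow, and the closed-form least-squares formulas stand in for scikit-learn's LinearRegression, which fits the same line when there is a single input variable.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_at(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept


# Made-up window: seconds since the window start, latency in ms.
seconds = [0, 10, 20, 30, 40, 50, 60]
latency_ms = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]

slope, intercept = fit_line(seconds, latency_ms)
predicted = predict_at(slope, intercept, seconds[-1] + 30)  # 30s past the last sample
print(f"slope={slope:.3f} ms/s, predicted(+30s)={predicted:.2f} ms")
```

In this made-up window the latest sample is 4.0ms, still under a 5ms threshold, while the projected value is 5.5ms. A check on the current value alone would stay quiet; a check on the projection would not.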


The Startup Problem

When Phase 7 first started alongside Phases 5 and 6, the predictions were deeply negative:

12:00:53  Sense: 1.56ms | Predicted (+30s): -0.30ms | Trend: stable/falling
12:01:03  Sense: 1.41ms | Predicted (+30s): -4.87ms | Trend: stable/falling
12:01:13  Sense: 1.67ms | Predicted (+30s): -6.58ms | Trend: stable/falling

Latency cannot be negative. These predictions were meaningless. The model was not broken - it was fitting a line through contaminated data.

The Prometheus history at that moment contained the previous session's measurements: the artificially injected 8ms delay, the breaches, the zeros when Phase 5 had died. The model saw a dataset that had peaked high and then dropped to zero. It fitted a steeply falling line through that history. Extended thirty seconds into the future, that line went deeply negative.
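The mechanism can be reproduced with a toy window. The sketch below is an illustration under stated assumptions: the helper name project is hypothetical, both lists of values are invented (they are not the measurements from this run), and the closed-form least-squares fit stands in for the LinearRegression model.

```python
def project(xs, ys, x_future):
    """Fit a least-squares line through (xs, ys) and evaluate it at x_future."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y + slope * (x_future - mean_x)


seconds = [0, 10, 20, 30, 40, 50, 60]

# Invented values: two stale high readings, two zeros, then nominal samples.
contaminated = [8.5, 9.0, 0.0, 0.0, 1.5, 1.6, 1.56]
# Invented values: a window in which only nominal samples remain.
clean = [1.5, 1.6, 1.56, 1.41, 1.67, 1.83, 1.65]

stale_prediction = project(seconds, contaminated, 90)  # 30s past the window end
clean_prediction = project(seconds, clean, 90)
print(f"contaminated window: {stale_prediction:.2f} ms")
print(f"clean window:        {clean_prediction:.2f} ms")
```

The first projection comes out negative because the old high readings sit at the start of the window and drag the fitted line steeply downward. The second stays close to the nominal values.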

The fix was patience. Wait for fresh nominal data to accumulate and push the contaminated history out of the sixty-second window. After approximately ninety seconds of clean operation, the predictions stabilized:

12:02:03  Sense: 1.83ms | Predicted (+30s): 2.62ms | Trend: rising
12:02:13  Sense: 1.65ms | Predicted (+30s): 1.95ms | Trend: rising
12:02:33  Sense: 1.67ms | Predicted (+30s): 1.40ms | Trend: stable/falling
12:03:13  Sense: 2.14ms | Predicted (+30s): 2.17ms | Trend: rising
12:03:23  Sense: 1.52ms | Predicted (+30s): 1.59ms | Trend: rising

Predictions tracking actual values. The model working on clean data. The system ready.

The Injection

With the model stable, 8ms of artificial delay was injected on the client container's fabric-facing interface:


docker exec clab-nbl-diamond-v1-client-l3-1 tc qdisc add dev eth1 root netem delay 8ms

The next sensing cycle detected the impact:

12:04:09  Sense | L2: 0.04ms | L3: 10.80ms | SLA: 5.0ms
12:04:09  [SLA BREACH #1] L3 RTT: 10.80ms > 5.0ms threshold

L3 jumped from 2.49ms to 10.80ms in one cycle. The impairment was immediate and severe.

The Moment

Four seconds after the breach was detected by Phase 5, Phase 7 processed that data point:

12:04:13  Sense: 10.80ms | Predicted (+30s): 6.56ms | Trend: stable/falling
12:04:13  [AI ALERT] Predicted breach: 6.56ms > 5.0ms -- rerouting proactively
12:04:13    [AI ALERT] Calling Phase 6 actuator --force-reroute

The prediction was 6.56ms - above the 5ms threshold. Phase 7 did not wait. It called Phase 6 with the --force-reroute flag immediately.
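How Phase 7 hands the decision to Phase 6 is not shown in this installment. One plausible shape, assuming the actuator is a separate Python script started as a child process, is sketched below. The script name phase6_actuator.py, the helper build_reroute_command, the timeout, and the runner parameter are all hypothetical; only the --force-reroute flag and the trigger_proactive_reroute name come from the article.

```python
import subprocess
import sys

# Hypothetical script name: the article does not show how Phase 6 is launched.
ACTUATOR_SCRIPT = "phase6_actuator.py"


def build_reroute_command(script=ACTUATOR_SCRIPT):
    """Build the argument list for a forced reroute request."""
    return [sys.executable, script, "--force-reroute"]


def trigger_proactive_reroute(script=ACTUATOR_SCRIPT, timeout_s=10.0,
                              runner=subprocess.run):
    """Run the actuator once and report whether it exited cleanly."""
    try:
        result = runner(
            build_reroute_command(script),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # Assumed policy: treat a hung actuator as a failed reroute.
        return False
    return result.returncode == 0


print(build_reroute_command())
```

Passing the runner in as a parameter keeps the function testable without touching a live fabric; in normal use it defaults to subprocess.run.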

Phase 6 responded:

12:04:13  [PREDICTIVE] Force reroute requested by Phase 7 AI brain
12:04:13    [ACTION] SLA violated -- rerouting via Path B (srl1->srl3)
12:04:13    Setting BGP local-preference: 100 -> 50 on srl1
12:04:14    [ACTION] Reroute applied -- traffic now on Path B

One second from Phase 7's decision to BGP committed on srl1.

The sequence that matters:

  • 12:04:09 - Phase 5 detects breach: L3 at 10.80ms
  • 12:04:13 - Phase 7 reads that data point, model predicts 6.56ms in 30s
  • 12:04:13 - Phase 7 calls Phase 6 --force-reroute
  • 12:04:14 - BGP Local Preference changed to 50 on srl1. Traffic moves to Path B.

The reactive actuator also fired on the same cycle at 12:04:18 - it saw the breach independently and attempted its own reroute. By that point the reroute had already happened. The two systems working in parallel, the predictive one winning by four seconds.
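The intervals in that sequence can be checked directly from the logged timestamps. The short sketch below only does the subtraction; the helper name seconds_between is hypothetical, and the four timestamps are the ones quoted above.

```python
from datetime import datetime


def seconds_between(start, end):
    """Whole seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


breach_sensed = "12:04:09"        # Phase 5 detects the breach
predictive_decision = "12:04:13"  # Phase 7 predicts and calls Phase 6
reroute_applied = "12:04:14"      # BGP change committed on srl1
reactive_fired = "12:04:18"       # reactive actuator fires on its own

sense_to_decision = seconds_between(breach_sensed, predictive_decision)
decision_to_commit = seconds_between(predictive_decision, reroute_applied)
predictive_lead = seconds_between(reroute_applied, reactive_fired)
print(sense_to_decision, decision_to_commit, predictive_lead)
```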

The Race at Recovery

When the congestion was removed, something instructive happened at 12:05:38. Phase 7 and Phase 6 executed on the same tick, pulling in opposite directions:

12:05:38  Sense: 2.07ms | Predicted (+30s): 6.34ms  ← Phase 7 still sees rising trend
12:05:38  [AI ALERT] Predicted breach: 6.34ms -- rerouting proactively  ← Phase 7 fires
12:05:38  [PREDICTIVE] Force reroute requested by Phase 7 AI brain
12:05:38  Sense | L3: 2.07ms | State: REROUTED  ← Phase 6 reads same cycle
12:05:38  SLA RECOVERY: 2.07ms <= 5.0ms  ← Phase 6 detects recovery
12:05:38    Normal routing restored -- traffic back on Path A  ← Phase 6 restores

Phase 7 was still predicting a breach at 6.34ms. Phase 6 saw the actual value at 2.07ms and restored normal routing. The recovery logic won.

This is the model's limitation made visible. Linear regression fitted to a window that still contained stressed data saw a rising trend that no longer existed in reality. The sixty-second lookback window had not yet washed out the high measurements from the injection period. The model was extrapolating from history that was no longer representative of current conditions.

This is not a failure. It is accurate behavior from the model given the data it had. It is also the honest reason why the next phase exists - the trust layer - which evaluates whether the model's recent decisions have been accurate and governs how much authority it is allowed to exercise.
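The wash-out effect described above can be pictured with a fixed-size window. The sketch below assumes the sixty-second lookback holds exactly six samples (one every ten seconds); the real query may return a different count. The readings are invented, and stressed_count is a hypothetical helper.

```python
from collections import deque

WINDOW_S = 60
SAMPLE_INTERVAL_S = 10
WINDOW_SAMPLES = WINDOW_S // SAMPLE_INTERVAL_S  # assumed: 6 samples per window
SLA_THRESHOLD_MS = 5.0


def stressed_count(window):
    """Number of samples in the window that are above the SLA threshold."""
    return sum(1 for value in window if value > SLA_THRESHOLD_MS)


# Invented readings: the window starts full of stressed samples.
window = deque([10.8] * WINDOW_SAMPLES, maxlen=WINDOW_SAMPLES)

remaining = []
for _ in range(WINDOW_SAMPLES):
    window.append(2.0)  # invented nominal reading after recovery
    remaining.append(stressed_count(window))

print(remaining)  # stressed samples still in the window after each new reading
```

Under these assumptions the stressed readings leave the window one at a time, so the model keeps seeing some of them for a full sixty seconds after the link has recovered.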

What Was Built

Phase 7 reads sixty seconds of latency history from Prometheus. It trains a LinearRegression model on that data - timestamps as the independent variable, latency values as the dependent variable. It projects the fitted line thirty seconds forward. If the projection exceeds 5ms, it calls Phase 6 with --force-reroute.

The code that does this:


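import time

import numpy as np
from sklearn.linear_model import LinearRegression

# X: timestamps as a 2-D array (one column); y: latency values in ms
X, y = get_historical_data()  # 60s of (timestamp, latency_ms) pairs from Prometheus
model = LinearRegression()
model.fit(X, y)

# Extend the fitted line to the future timestamp
future_time = np.array([[time.time() + PREDICT_AHEAD_S]])  # now + 30s
prediction = float(model.predict(future_time)[0])

# Act on the projection, not on the current value
if prediction > SLA_THRESHOLD_MS:
    logger.warning(f"[AI ALERT] Predicted breach: {prediction:.2f}ms -- rerouting proactively")
    trigger_proactive_reroute()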

Six lines of model logic. The sophistication is not in the algorithm. It is in the architecture that surrounds it - the continuous sensing data that feeds it, the actuator that responds to it, the fabric that carries out its decisions.
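The get_historical_data() helper is not shown in this installment. One way it might shape its output, assuming the data comes from the Prometheus HTTP API range query endpoint (/api/v1/query_range), is sketched below. The function name matrix_to_xy, the metric name l3_rtt_ms, and the sample payload are hypothetical; the response layout (a "matrix" result whose values are [timestamp, "value-as-string"] pairs) follows the Prometheus API format.

```python
import json


def matrix_to_xy(payload, series_index=0):
    """Convert a Prometheus range-query (matrix) response into X and y.

    X is a list of one-element rows (the 2-D shape a regression model
    expects); y is a list of latency values as floats.
    """
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    result = payload["data"]["result"]
    if not result:
        return [], []
    values = result[series_index]["values"]  # [unix_ts, "value"] pairs
    X = [[float(ts)] for ts, _ in values]
    y = [float(value) for _, value in values]  # sample values arrive as strings
    return X, y


# Hypothetical response body with an invented metric name and timestamps.
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "matrix",
    "result": [
      {
        "metric": {"__name__": "l3_rtt_ms"},
        "values": [[1700000000, "1.56"], [1700000010, "1.41"], [1700000020, "1.67"]]
      }
    ]
  }
}
""")

X, y = matrix_to_xy(sample)
print(X, y)
```

Returning empty lists for an empty result leaves it to the caller to skip a prediction cycle when Prometheus has no samples in the window.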

Why This Is Intelligence

The reactive actuator is a thermostat. It measures and responds.

Phase 7 is something different. It reads a trajectory. It reasons about where that trajectory leads. It acts on that reasoning before the outcome it is predicting has occurred.

That is the definition of anticipatory behavior. Not reflexes. Not threshold response. Reading the direction of a changing situation and acting before it becomes a problem.

The network sensed a developing condition. It forecast the consequence. It acted. No human was in the loop at any point between the impairment being injected and the BGP reroute being committed.

That is the moment intelligence entered this network.

Build Record

June 28, 2026 - Dell Precision 3571 - Garuda Linux rolling - SR Linux v26.3.2

Startup - contaminated history (expected):

12:00:53  Sense: 1.56ms | Predicted (+30s): -0.30ms | Trend: stable/falling
12:01:13  Sense: 1.67ms | Predicted (+30s): -6.58ms | Trend: stable/falling

Model stabilized on clean data (~90s after start):

12:03:13  Sense: 2.14ms | Predicted (+30s): 2.17ms | Trend: rising
12:03:23  Sense: 1.52ms | Predicted (+30s): 1.59ms | Trend: rising

Congestion injected - predictive breach and reroute:

12:04:09  Sense | L3: 10.80ms  ← breach detected by Phase 5
12:04:13  Sense: 10.80ms | Predicted (+30s): 6.56ms
12:04:13  [AI ALERT] Predicted breach: 6.56ms > 5.0ms -- rerouting proactively
12:04:13  [PREDICTIVE] Force reroute requested by Phase 7 AI brain
12:04:14  [ACTION] Reroute applied -- traffic now on Path B  (Local-Pref 100 -> 50)

Congestion removed - recovery with model/reactive race:

12:05:31  Sense | L3: 2.07ms  ← congestion cleared
12:05:38  [AI ALERT] Predicted: 6.34ms  ← model still sees rising trend from history
12:05:38  SLA RECOVERY: 2.07ms  ← Phase 6 reactive detects actual recovery
12:05:38  Normal routing restored -- traffic back on Path A  (Local-Pref 50 -> 100)

Phase 7 gate criterion met: Predicted RTT shown every 10s AND proactive reroute fired before reactive actuator on breach detection.

Next installment: The Trust Layer. The system that watches the AI. A formula that scores the reliability of the predictive brain's decisions and governs its authority over the network.