The Thinking Network (Installment 8): The Sensing Layer & Streaming Telemetry

Installment 8 of The Thinking Network. We give the network eyes and ears by replacing legacy SNMP with high-frequency streaming telemetry, Prometheus, and gNMI on SR Linux.

Architecture Overview: Phase 5 (The Sensing Layer)

Objective: Implement real-time network observability and state awareness.

Core Technologies: Streaming Telemetry, gNMI / gRPC, Prometheus (Time-Series Database), and Nokia SR Linux.

The Goal: Establish the data ingestion pipeline required for predictive AI modeling, allowing the network to monitor its own health at the microsecond level.

The fabric is healthy. IS-IS has converged. BGP sessions are established. The health audit passed 17 checks. The diamond network is running.

But it cannot see itself yet.

It has no way to know what latency looks like right now. It has no record of what latency looked like sixty seconds ago. It cannot tell whether BGP sessions are stable or whether one dropped and recovered while no one was watching.

Before a network can act on its own state, it has to read its own state. What we are building here is a modern network observability stack—moving from static polling to a streaming telemetry pipeline. Phase 5 builds the network's ability to see itself.


The Problem with Polling

The conventional approach to network monitoring is polling. A management system sends a request to a device every few minutes. The device responds. The management system stores the response and moves to the next device.

Polling works. It is also slow, incomplete, and fundamentally reactive. A BGP session that flaps and recovers between poll cycles is invisible. A latency spike that lasts thirty seconds and then clears is invisible. By the time a poll cycle captures a problem, that problem may have already caused service impact.

The alternative is streaming telemetry. Instead of asking the device periodically, you subscribe to the data you care about. The device sends it to you continuously, at its own cadence, the moment a value changes or a timer fires.
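To make the blind spot concrete, here is a small simulation. It is only a sketch: the one-hour timeline, the 300-second poll interval, and the 30-second flap are invented numbers, not measurements from this lab. It compares what a poller records against what an on-change stream would report.

```python
def polled_states(timeline, poll_interval_s):
    """Samples a poller would record: one reading every poll_interval_s seconds."""
    return [(t, timeline[t]) for t in range(0, len(timeline), poll_interval_s)]


def on_change_events(timeline):
    """Events an on-change stream would emit: the initial state plus every transition."""
    events = [(0, timeline[0])]
    for t in range(1, len(timeline)):
        if timeline[t] != timeline[t - 1]:
            events.append((t, timeline[t]))
    return events


# Hypothetical hour of per-second BGP state: the session drops at t=400 s
# and recovers at t=430 s.
timeline = ["established"] * 3600
for t in range(400, 430):
    timeline[t] = "idle"

polled = polled_states(timeline, poll_interval_s=300)
streamed = on_change_events(timeline)

# The poller samples at 0, 300, 600, ... and never lands inside the flap.
poll_saw_flap = any(state != "established" for _, state in polled)
# The stream reports both the drop and the recovery.
stream_saw_flap = any(state != "established" for _, state in streamed)
```

With these numbers the poller records nothing but "established", while the stream carries two extra events: the drop at 400 s and the recovery at 430 s.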

The protocol this lab uses is gNMI - gRPC Network Management Interface. Nokia SR Linux exposes its complete operational state via gNMI on port 57400. The sensing agent subscribes at boot and receives a continuous stream.
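A subscription is described by a request that lists paths, a mode, and a sample interval. The sketch below builds one in the dictionary shape that the pygnmi client accepts for its subscribe calls. The helper name build_subscription, the json_ietf encoding, and the ten-second interval are assumptions for illustration; the agent's actual request is not shown in this series.

```python
def build_subscription(paths, interval_s=10):
    """Build a stream-mode subscription request in the dict shape pygnmi expects.

    gNMI expresses sample_interval in nanoseconds, so seconds are converted here.
    """
    return {
        "subscription": [
            {
                "path": path,
                "mode": "sample",
                "sample_interval": interval_s * 1_000_000_000,
            }
            for path in paths
        ],
        "mode": "stream",
        "encoding": "json_ietf",  # assumption: chosen for illustration
    }


request = build_subscription(
    ["/network-instance[name=default]/protocols/bgp/neighbor"]
)

# With an open pygnmi session `gc`, the request would be passed to one of the
# client's subscribe methods (for example gc.subscribe2(subscribe=request) in
# recent releases; check the installed version before relying on it).
```

Switching a subscription entry from "sample" to "on_change" asks the device to send an update only when the value changes, which is the behavior that catches a short flap.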


What Gets Measured

The sensing agent collects three things every ten seconds.

L2 EVPN latency - a ping from client-l2-1 to client-l2-2 through the bridged EVPN domain. This measures how fast the L2 forwarding plane is. In this lab it runs between 0.03ms and 0.2ms under normal conditions.

L3 VPRN latency - a ping from client-l3-1 to the L3 gateway at 10.100.4.1 on srl4. This is the primary SLA metric. The threshold is 5ms. This is what the actuator watches and what the predictive brain forecasts.

BGP session state - for each BGP neighbor on srl1, a gNMI query returns the current session state. The sensing agent converts this to a binary metric: 1.0 for established, 0.0 for anything else.

These three numbers, exported continuously to Prometheus, are the nervous system of everything that follows.


The gNMI Connection

The sensing agent opens a gNMI channel to each node using the node's management IP and the credentials from .env:

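from pygnmi.client import gNMIclient

# SRL_USERNAME and SRL_PASSWORD are loaded from .env before this point.
with gNMIclient(
    target=('172.20.20.11', 57400),  # node management IP, gNMI port
    username=SRL_USERNAME,
    password=SRL_PASSWORD,
    skip_verify=True                 # lab only: skip certificate validation
) as gc:
    # Read the operational state of every BGP neighbor in the default instance.
    bgp_path = '/network-instance[name=default]/protocols/bgp/neighbor'
    response = gc.get(path=[bgp_path], datatype='state')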

The ssl_target_name_override warning that appears in the terminal output is expected. It means the TLS connection is using skip_verify=True - appropriate for a lab environment where we control both ends and do not need certificate validation.

The BGP state for each neighbor comes back as structured data. The sensing agent extracts the session state field, maps it to 1.0 or 0.0, and exports it as a labeled Prometheus gauge.
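The mapping step can be sketched as a pure function over the response dictionary. The response shape below (a notification list, update entries, and neighbor records keyed by peer-address and session-state) is an assumption about what the get call returns for this path, and bgp_metrics is a hypothetical helper name, not the agent's actual code.

```python
def bgp_metrics(response):
    """Map each BGP neighbor to 1.0 (established) or 0.0 (anything else)."""
    metrics = {}
    for notification in response.get("notification", []):
        for update in notification.get("update", []):
            val = update.get("val")
            if not isinstance(val, dict):
                continue
            for neighbor in val.get("neighbor", []):
                peer = neighbor.get("peer-address")
                if peer is None:
                    continue
                state = neighbor.get("session-state", "")
                metrics[peer] = 1.0 if state == "established" else 0.0
    return metrics


# Illustrative response in the assumed shape; not captured lab output.
sample_response = {
    "notification": [
        {
            "update": [
                {
                    "val": {
                        "neighbor": [
                            {"peer-address": "172.2.255.255", "session-state": "established"},
                            {"peer-address": "172.3.255.255", "session-state": "active"},
                        ]
                    }
                }
            ]
        }
    ]
}

status = bgp_metrics(sample_response)
```

Each entry in the result then becomes one labeled gauge sample, with the peer address as the neighbor label.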

The Client IP Problem

When Phase 5 first started, L3 latency read 0.0ms on every cycle.

The get_ping_rtt function returns 0.0 on failure. L3 consistently returning 0.0 meant the ping was failing, not succeeding instantly.
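The source of get_ping_rtt is not shown here, so the following is a hypothetical reconstruction of the parsing half only, under the assumption that the function reads the RTT out of ping's text output. It shows why a failed ping and an impossibly fast one look identical once both collapse to 0.0.

```python
import re

# Matches the "time=1.80 ms" (or "time<1 ms") field in a ping reply line.
_RTT_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def parse_ping_rtt(output):
    """Return the first RTT in milliseconds found in ping output, or 0.0 if none."""
    match = _RTT_RE.search(output)
    return float(match.group(1)) if match else 0.0


# Illustrative reply line, not captured lab output.
ok_output = "64 bytes from 10.100.4.1: icmp_seq=1 ttl=63 time=1.80 ms"
# Failure summary as seen in the direct test.
failed_output = "1 packets transmitted, 0 received, 100% packet loss"

rtt_ok = parse_ping_rtt(ok_output)
rtt_failed = parse_ping_rtt(failed_output)
```

Returning None on failure, or exporting a separate success gauge, would make a dead path distinguishable from a fast one. That is a design option, not something this lab currently does.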

The direct test confirmed it:


docker exec clab-nbl-diamond-v1-client-l3-1 ping -c 1 -W 1 10.100.4.1
1 packets transmitted, 0 received, 100% packet loss

The L3-VPRN gateway on srl4 was up and configured correctly - 10.100.4.1/24 on ethernet-1/4.0, state up. The problem was on the client side.

client-l3-1 connects to srl1 via eth1. That interface had no IPv4 address and no route to the 10.100.4.0/24 subnet. The container was connected to the fabric but had no way to reach anything through it.

The fix:


docker exec clab-nbl-diamond-v1-client-l3-1 ip addr add 10.100.1.2/24 dev eth1
docker exec clab-nbl-diamond-v1-client-l3-1 ip route add 10.100.4.0/24 via 10.100.1.1

After those two commands, L3 latency came in at 1.80ms on the next sensing cycle.

This is a known gap in the current lab setup - these IP and route assignments do not persist across Containerlab restarts. When the lab redeploys, the commands need to run again. A post-deploy script is the right fix. It is on the backlog.
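One possible shape for that post-deploy script, as a sketch. The table layout, the function names, and the injectable runner are assumptions for illustration; only the container name, addresses, and the two ip commands come from the fix above.

```python
import subprocess

LAB = "nbl-diamond-v1"

# Per-client ip commands to replay after every deploy (taken from the manual fix).
CLIENT_FIXES = {
    "client-l3-1": [
        ["ip", "addr", "add", "10.100.1.2/24", "dev", "eth1"],
        ["ip", "route", "add", "10.100.4.0/24", "via", "10.100.1.1"],
    ],
}


def build_commands(lab=LAB, fixes=CLIENT_FIXES):
    """Expand the fix table into full `docker exec` argument lists."""
    commands = []
    for client, ip_cmds in fixes.items():
        container = f"clab-{lab}-{client}"
        for ip_cmd in ip_cmds:
            commands.append(["docker", "exec", container, *ip_cmd])
    return commands


def apply(commands, runner=subprocess.run):
    """Run each command and return (command line, return code) pairs.

    A non-zero code is recorded rather than raised, since `ip addr add`
    fails harmlessly when the address is already present.
    """
    results = []
    for cmd in commands:
        proc = runner(cmd, capture_output=True, text=True)
        results.append((" ".join(cmd), proc.returncode))
    return results


# After a deploy, on a host with Docker available:
#   for line, rc in apply(build_commands()):
#       print("ok" if rc == 0 else "failed", line)
```

Keeping the commands in a table means additional clients can be added without touching the logic.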

What Prometheus Receives

With the sensing agent running and the client IP configured, the Prometheus endpoint at http://localhost:9999/metrics exports:


# HELP nokia_latency_l2_ms L2 EVPN latency in milliseconds
# TYPE nokia_latency_l2_ms gauge
nokia_latency_l2_ms{lab="nbl-diamond-v1"} 0.056

# HELP nokia_latency_l3_ms L3 VPRN latency in milliseconds (primary SLA metric)
# TYPE nokia_latency_l3_ms gauge
nokia_latency_l3_ms{lab="nbl-diamond-v1"} 2.06

# HELP nokia_bgp_status BGP session state (1=established, 0=down)
# TYPE nokia_bgp_status gauge
nokia_bgp_status{lab="nbl-diamond-v1",neighbor="172.2.255.255",node="srl1"} 1.0
nokia_bgp_status{lab="nbl-diamond-v1",neighbor="172.3.255.255",node="srl1"} 1.0
nokia_bgp_status{lab="nbl-diamond-v1",neighbor="172.4.255.255",node="srl1"} 1.0

Three BGP sessions, all 1.0. L2 at 0.056ms. L3 at 2.06ms. Both well under the 5ms threshold.

The Prometheus container at 172.20.20.20:9090 scrapes this endpoint every five seconds and stores the time series. Every value this lab has ever measured is in that database. The predictive AI layer in Phase 7 will query that history to train its model.
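Reading that history back goes through Prometheus's HTTP API, whose query_range endpoint returns a series over a time window. The sketch below builds such a request and flattens the JSON reply. The helper names and the sample payload are illustrative, and this says nothing about how the Phase 7 code actually queries the database.

```python
from urllib.parse import urlencode

PROMETHEUS_URL = "http://172.20.20.20:9090"


def range_query_url(metric, start_ts, end_ts, step_s=5, base_url=PROMETHEUS_URL):
    """Build a query_range URL for one metric between two Unix timestamps."""
    params = urlencode(
        {"query": metric, "start": start_ts, "end": end_ts, "step": f"{step_s}s"}
    )
    return f"{base_url}/api/v1/query_range?{params}"


def extract_series(payload):
    """Flatten the first series of a query_range reply into (timestamp, value) pairs."""
    results = payload.get("data", {}).get("result", [])
    if not results:
        return []
    return [(float(ts), float(value)) for ts, value in results[0]["values"]]


url = range_query_url("nokia_latency_l3_ms", start_ts=0, end_ts=60)

# Illustrative reply in Prometheus's matrix format; the values are made up.
sample_payload = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"__name__": "nokia_latency_l3_ms", "lab": "nbl-diamond-v1"},
                "values": [[0, "1.80"], [5, "2.06"]],
            }
        ],
    },
}

series = extract_series(sample_payload)
```

Fetching the URL with any HTTP client and passing the decoded JSON to extract_series yields a list ready for a training pipeline.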

Why This Architecture Matters

There is a difference between a network that is monitored and a network that is aware.

A monitored network has an external system watching it from outside. The network itself has no knowledge of its own state. It cannot act on what it cannot read.

An aware network has sensing built into its own operational layer. The sensing agent runs alongside the protocols. The data it collects is available to every other component of the intelligence stack - the actuator, the predictive brain, the trust monitor. The network can read itself and give that reading to systems that act on it.

This is Phase 5's contribution. Not the individual metrics - those are straightforward. The contribution is making the network's state legible to everything above it.

Build Record

June 14, 2026 - Dell Precision 3571 - Garuda Linux rolling - SR Linux v26.3.2

2026-05-17 11:22:12 [INFO] Phase 5: Unified Sensing Agent | Lab: nbl-diamond-v1
2026-05-17 11:22:12 [INFO] SLA threshold: 5.0ms
2026-05-17 11:22:12 [INFO] Prometheus exporter started on port 9999
2026-05-17 11:22:12 [INFO] Starting sense loop...
2026-05-17 11:22:12 [INFO] Sense | L2: 0.03ms | L3: 1.80ms | SLA: 5.0ms

Client IP fix applied:


docker exec clab-nbl-diamond-v1-client-l3-1 ip addr add 10.100.1.2/24 dev eth1
docker exec clab-nbl-diamond-v1-client-l3-1 ip route add 10.100.4.0/24 via 10.100.1.1

Prometheus output verified:

nokia_latency_l2_ms{lab="nbl-diamond-v1"} 0.056
nokia_latency_l3_ms{lab="nbl-diamond-v1"} 2.06
nokia_bgp_status{...,neighbor="172.2.255.255",node="srl1"} 1.0
nokia_bgp_status{...,neighbor="172.3.255.255",node="srl1"} 1.0
nokia_bgp_status{...,neighbor="172.4.255.255",node="srl1"} 1.0

Phase 5 gate criterion met: nokia_latency_l3_ms returning real values AND all BGP sessions showing 1.0.
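The gate can also be checked mechanically against the text served by the metrics endpoint. This is a sketch: it treats "real values" as strictly greater than 0.0, because 0.0 is what a failed ping reports, and that reading of the criterion is an assumption. The parser is deliberately minimal and ignores optional timestamps.

```python
def parse_metrics(text):
    """Parse exposition-format text into (name, value) pairs, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        samples.append((name, float(value)))
    return samples


def phase5_gate(text):
    """True when L3 latency is non-zero and every BGP session reports 1.0."""
    samples = parse_metrics(text)
    l3 = [v for name, v in samples if name == "nokia_latency_l3_ms"]
    bgp = [v for name, v in samples if name == "nokia_bgp_status"]
    l3_ok = bool(l3) and all(v > 0.0 for v in l3)
    bgp_ok = bool(bgp) and all(v == 1.0 for v in bgp)
    return l3_ok and bgp_ok


verified_output = """\
nokia_latency_l2_ms{lab="nbl-diamond-v1"} 0.056
nokia_latency_l3_ms{lab="nbl-diamond-v1"} 2.06
nokia_bgp_status{lab="nbl-diamond-v1",neighbor="172.2.255.255",node="srl1"} 1.0
nokia_bgp_status{lab="nbl-diamond-v1",neighbor="172.3.255.255",node="srl1"} 1.0
nokia_bgp_status{lab="nbl-diamond-v1",neighbor="172.4.255.255",node="srl1"} 1.0
"""

gate_passed = phase5_gate(verified_output)
# Before the client IP fix, L3 latency read 0.0 and the gate would fail.
gate_before_fix = phase5_gate(verified_output.replace("} 2.06", "} 0.0"))
```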

Next installment: The First Act. The actuator reads the sensing data, detects a breach, and reaches into BGP. The network does something on its own for the first time.