The Thinking Network

The Thinking Network (Installment 7): Automated Network Health Audits

Installment 7 of The Thinking Network. We run a programmatic health audit on the SR Linux fabric, validating IS-IS adjacencies and BGP sessions before starting the intelligence layer.

Thomas Harvey

10 Jun 2026 • 6 min read

Architecture Overview: Phase 4 (Health Audit)

Objective: Programmatically validate the deterministic state of the network prior to enabling autonomous control.

Core Technologies: Python automation, state verification, gNMI telemetry, and IS-IS adjacency and BGP session auditing.

The Goal: Execute a 17-point automated health check across the containerized fabric to prevent the AI layer from ingesting garbage data from an unverified control plane.

The lab deployed cleanly.

Thirteen containers. Four Nokia SR Linux routers. Four Linux client workloads. A full telemetry stack - Prometheus, Grafana, Loki, Promtail, gNMIc. The ContainerLab table printed in the terminal, every row showing the same status: running.

Then the health audit ran.

Fifteen failures.

Why There Is a Health Audit

Before the intelligence layer starts, the fabric has to be healthy. Not assumed healthy. Verified healthy.

This is not a bureaucratic step. It is a design principle. An autonomous system that makes decisions based on network state needs accurate information about network state. If IS-IS adjacencies are not formed, if BGP sessions are not established, if the routing table is wrong - the AI layer will read garbage data and produce garbage decisions. The health audit is the gate between building the fabric and trusting it.

Phase 4 checks six things:

IS-IS adjacencies are up on all four nodes
BGP sessions are established on all four nodes
The route table on srl1 contains loopbacks for all other nodes via IS-IS
L2 EVPN traffic passes end-to-end between client containers
L3 VPRN traffic reaches the gateway
The gNMI server is running and reachable on all four nodes

Every one of these has to pass before Phase 5 - the sensing layer - is allowed to start.
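The six check types expand into individual checks per node or per path. The sketch below shows one way that expansion could look. The node list, function name, and check labels are illustrative assumptions, not taken from the audit script; only the resulting count is meant to line up with the 17 entries in the final audit log in the Build Record.

```python
# Illustrative names only; the real audit script may organize checks differently.
NODES = ["srl1", "srl2", "srl3", "srl4"]


def build_check_plan(nodes=NODES, route_source="srl1"):
    """Expand the six check types into one entry per individual check."""
    plan = []
    plan += [f"{n}: IS-IS adjacency" for n in nodes]
    plan += [f"{n}: BGP neighbors" for n in nodes]
    # Loopback routes are checked from one node toward every other node.
    plan += [f"{route_source}: route to {n} loopback"
             for n in nodes if n != route_source]
    # One end-to-end test per data plane service.
    plan += ["L2-EVPN: client reachability", "L3-VPRN: gateway reachability"]
    plan += [f"{n}: gNMI server" for n in nodes]
    return plan
```

Calling len(build_check_plan()) gives 4 + 4 + 3 + 2 + 4 checks, which is where a 17-point audit comes from on a four-node fabric.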

The First Run

Health Audit Results: 2 PASSED | 15 FAILED

Two things passed: the L2 and L3 data plane tests. Client containers could ping each other through the fabric. That told us the containers were up and the fundamental forwarding was working.

Everything else failed. IS-IS down. BGP down. Routes missing. gNMI unreachable.

The first instinct when a health audit fails at this scale is to assume a configuration problem. The startup configs were wrong. The topology file had an error. Something fundamental was broken.

That instinct was wrong.

What Was Actually Happening

The lab had been running for less than ten minutes when the first audit ran. Nokia SR Linux is a carrier-grade network operating system. It does not rush. After Docker reports a container as running, the NOS inside that container is still initializing - loading its internal applications, processing the startup configuration, forming its understanding of the world.

The documentation for this lab says to wait two to four minutes after deployment before running the health audit. The audit ran at nine minutes. And still failed.

That ruled out timing.
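Timing was not the cause here, but the gap between "container running" and "NOS ready" is real, and a fixed sleep is a blunt way to cover it. A minimal sketch of polling instead follows. It assumes a hypothetical probe callable that returns True once a node answers a command cleanly; the 240-second default mirrors the upper end of the two-to-four-minute guidance and is not a value from the audit script.

```python
import time


def wait_until(probe, timeout_s=240.0, interval_s=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it returns True or timeout_s elapses.

    probe is a hypothetical zero-argument callable, for example one that
    runs a show command on a node and returns True when the output parses.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A False return would then mean "still not ready after the timeout", which is a different finding from "ready, but the check failed".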

The Command Syntax Problem

The health audit script was written against documentation for a different version of SR Linux. This lab runs version 26.3.2. The commands in the script were for an older release.

SR Linux's CLI is structured. Commands parse token by token. If a token is not valid in the current context, the parser says so clearly and lists the options that are valid.

The IS-IS check was running:

show network-instance default protocols isis instance 1 adjacency

The response:

Parsing error: Unknown token 'instance'.
Options are: adjacency, database, hostnames, interface, summary

The word instance does not exist in this CLI path at this point in the command. The correct syntax is:

show network-instance default protocols isis adjacency

One word removed. The output that came back showed two adjacencies, both in state up, exactly as expected.

The network was healthy. The script had the wrong words.

The Case Sensitivity Problem

Fixing the command was not enough. The script checked for the string UP in the output. The actual output used up in lowercase.

# What the script checked
is_up = "UP" in output

# What SR Linux returned
| ethernet-1/1.0 | 0000.0000.0002 | L2 | 172.254.254.2 | up |

Python string matching is case-sensitive. "UP" in output returns False when the output contains only the lowercase up. The adjacencies were there; the check was looking for the wrong case.

Fixed to:

is_up = "up" in output.lower()
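A lowercase substring test fixes the case problem, but it would also match words such as "startup" or "update" anywhere in the output. A stricter sketch compares the state column exactly. It assumes the state is the final pipe-delimited cell of a row, as in the sample row above; the function name is hypothetical.

```python
def adjacency_is_up(output):
    """True if at least one pipe-delimited row has 'up' as its final cell."""
    for line in output.splitlines():
        # Drop the outer pipes, then split the row into trimmed cells.
        cells = [c.strip().lower() for c in line.strip().strip("|").split("|")]
        if len(cells) > 1 and cells[-1] == "up":
            return True
    return False
```

Lines without any pipe collapse to a single cell and are ignored, so free text and error messages cannot satisfy the check.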

The BGP Command Problem

The BGP check was running:

show network-instance default protocols bgp neighbor summary

SR Linux parsed summary as an IP address argument and rejected it:

Error: Invalid ipaddress argument: summary

The correct command drops summary:

show network-instance default protocols bgp neighbor

The output that came back showed three established sessions on every node. The iBGP full mesh was up. It had been up the entire time.
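Both failures so far were the CLI rejecting the command, not the network being down. A sketch of separating the two cases follows, so a rejected command surfaces as a script error instead of a failed health check. The function name and return values are hypothetical. The markers are the two error prefixes quoted above; other releases may word their errors differently.

```python
# Error prefixes seen in the CLI responses quoted in this article.
CLI_ERROR_MARKERS = ("Parsing error:", "Error:")


def classify_cli_output(output):
    """Return 'command_error' if the CLI rejected the command, else 'ok'."""
    for line in output.splitlines():
        # str.startswith accepts a tuple and matches any of its entries.
        if line.strip().startswith(CLI_ERROR_MARKERS):
            return "command_error"
    return "ok"
```

Running this before any state parsing means "IS-IS down" is only ever reported when the node actually returned adjacency data.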

The Route Table Problem

The route check was looking for loopback addresses in the output of:

show network-instance default route-table ipv4-unicast

ipv4-unicast is not a valid token in this context. Stepping through the CLI options showed that show network-instance default ipv4 offers a subcommand called route, so the correct command turned out to be:

show network-instance default ipv4 route

The output showed exactly what was expected. All three remote loopbacks present, all via IS-IS, all with correct next-hops.
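The presence check can be sketched as a search for each expected loopback address in the command output. The function name is an assumption, and the sketch only confirms that an address appears; it does not verify the IS-IS origin or the next-hop, which would need knowledge of the table layout not shown here.

```python
import re


def missing_loopbacks(route_output, expected):
    """Return the expected addresses that do not appear in route_output.

    expected maps a node name to its loopback address. The match is bounded
    so that 172.2.255.255 is not found inside a longer number such as
    1172.2.255.255.
    """
    missing = {}
    for node, addr in expected.items():
        pattern = r"(?<![\d.])" + re.escape(addr) + r"(?!\d)"
        if not re.search(pattern, route_output):
            missing[node] = addr
    return missing
```

With the addresses from the Build Record log, expected would be {"srl2": "172.2.255.255", "srl3": "172.3.255.255", "srl4": "172.4.255.255"}, and an empty result means every loopback was found.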

The gNMI Problem

The gNMI check was running:

show system grpc-server

grpc-server is not a valid token in show system. The options are aaa, application, lldp, logging, network-instance.

The right command is:

show system application

That command returns a table of the applications on the node and their states. grpc_server appears in it with state running. The check was updated to look for that string.
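Looking for both words on the same line avoids a false pass when grpc_server is listed in some other state while a different process is running. The sketch below assumes one application per line, with the name and state as separate whitespace- or pipe-delimited words; the exact table layout is not shown in this article, and the function name is hypothetical.

```python
def grpc_server_running(app_output):
    """True if some line lists grpc_server together with the state 'running'."""
    for line in app_output.splitlines():
        # Treat pipes as separators so table borders do not stick to words.
        words = line.lower().replace("|", " ").split()
        if "grpc_server" in words and "running" in words:
            return True
    return False
```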

The Second Run

With the four command fixes and the case-sensitivity fix applied:

Health Audit Results: 17 PASSED | 0 FAILED

All checks passed -- fabric is healthy
Phase 4 gate criterion MET
Next step: python3 phase5_sensing.py

IS-IS adjacencies up on all four nodes. BGP sessions established on all four nodes. All loopback routes present via IS-IS. L2 and L3 data plane verified. gNMI server confirmed running on all four nodes.

The fabric was healthy. It had been healthy since shortly after deployment. The audit script had been asking the wrong questions in the wrong language.
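The summary line and the gate decision can both be derived from a list of per-check results. A minimal sketch, assuming each check yields a (name, passed) pair; the function names are illustrative and the output format simply copies the summary line shown above.

```python
def summarize(results):
    """results is a list of (check_name, passed) pairs."""
    passed = sum(1 for _, ok in results if ok)
    failed = len(results) - passed
    line = f"Health Audit Results: {passed} PASSED | {failed} FAILED"
    return passed, failed, line


def gate(results):
    """Return a process exit code: 0 only if checks ran and none failed."""
    _, failed, line = summarize(results)
    print(line)
    # An empty result list is treated as a failure, not as a clean pass.
    return 0 if results and failed == 0 else 1
```

A caller could hand the return value to sys.exit, so that the next phase is only launched when the audit exits with 0.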

What This Is Actually About

There is a version of this story that frames the troubleshooting as a problem. The script was wrong, we fixed it, moving on.

That is not the right frame.

What happened here is exactly what this series is documenting. A system that is claimed to be autonomous has to be built from verifiable components. Before trusting the intelligence layer to make decisions, every piece of the stack underneath it has to be interrogated directly - not assumed, not estimated, not taken on faith.

The health audit failed because the questions were wrong. Finding the right questions required going directly to the network. Running commands by hand. Reading the actual output. Updating the script to match reality rather than documentation.

This is how you build something you can trust. Not by assuming it works. By checking.

The fabric is healthy. The gate is open. The sensing layer starts next.

Build Record

Deployed May 17, 2026 - Dell Precision 3571 - Garuda Linux rolling - ContainerLab 0.71.0 - SR Linux v26.3.2

First audit run:

Health Audit Results: 2 PASSED | 15 FAILED

Fixes applied:

isis instance 1 adjacency → isis adjacency (invalid token)
"UP" in output → "up" in output.lower() (case sensitivity)
bgp neighbor summary → bgp neighbor (invalid argument)
route-table ipv4-unicast → ipv4 route (invalid token)
show system grpc-server → show system application (invalid token)
gNMI check updated to look for grpc_server with state running

Final audit run:

2026-05-17 10:48:04 [INFO] [PASS] SRL1: IS-IS adjacency
2026-05-17 10:48:05 [INFO] [PASS] SRL2: IS-IS adjacency
2026-05-17 10:48:05 [INFO] [PASS] SRL3: IS-IS adjacency
2026-05-17 10:48:06 [INFO] [PASS] SRL4: IS-IS adjacency
2026-05-17 10:48:06 [INFO] [PASS] SRL1: BGP neighbors (4/3 established)
2026-05-17 10:48:07 [INFO] [PASS] SRL2: BGP neighbors (4/3 established)
2026-05-17 10:48:07 [INFO] [PASS] SRL3: BGP neighbors (4/3 established)
2026-05-17 10:48:08 [INFO] [PASS] SRL4: BGP neighbors (4/3 established)
2026-05-17 10:48:09 [INFO] [PASS] srl1 has route to srl2 loopback (172.2.255.255)
2026-05-17 10:48:09 [INFO] [PASS] srl1 has route to srl3 loopback (172.3.255.255)
2026-05-17 10:48:09 [INFO] [PASS] srl1 has route to srl4 loopback (172.4.255.255)
2026-05-17 10:48:11 [INFO] [PASS] L2-EVPN: client-l2-1 -> 172.20.20.32
2026-05-17 10:48:15 [INFO] [PASS] L3-VPRN: client-l3-1 -> 10.100.4.1
2026-05-17 10:48:16 [INFO] [PASS] SRL1: gNMI server
2026-05-17 10:48:17 [INFO] [PASS] SRL2: gNMI server
2026-05-17 10:48:17 [INFO] [PASS] SRL3: gNMI server
2026-05-17 10:48:17 [INFO] [PASS] SRL4: gNMI server

Health Audit Results: 17 PASSED | 0 FAILED
Phase 4 gate criterion MET

Next installment: The Sensing Layer. gNMI streams real-time data from every node. The network begins to read itself.