Most failures in connected systems do not happen in the lab. They happen at the seams - the moment a device leaves a bench where the power is clean, the network is fast and the engineer is standing next to it, and enters a world of brownouts, dead spots, half-flashed firmware and a support engineer three hundred kilometres away holding a phone. The question we keep returning to in this strand of Product Labs is uncomfortable for anyone with an electronics background: how much of reliable IoT is actually an electronics problem, and how much is an operations problem wearing an electronics costume? Our working answer, tested against real builds rather than slideware, is that the electronics is the easy half. Provisioning, connectivity, diagnostics and field support are where systems quietly rot. This page is a field note on what we explore, what we build, and where our convictions are still being corrected by contact with reality.
What we explore and build
Our work here begins, unusually, with space rather than silicon. A real deployment lives on a site with geometry - boundaries, obstructions, vantage points, dead zones - and we treat that geometry as authoritative input. We take spatial data exported from survey and mapping tools, convert it through a pipeline (KML into GeoJSON into zone grids), and run coverage operations directly: which cells a camera or sensor can actually see, where the blind spots fall, and how many nodes it takes to close them. We have learned to do a surprising amount of this in the browser, treating geospatial computation as a first-class design surface rather than a back-office GIS task. Crucially, the diagrams we hand to clients and installers are generated programmatically from that same authoritative spatial data, not redrawn by hand - so the topology map, the coverage model and the bill of materials cannot silently disagree with each other.
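The grid-and-coverage step can be sketched in a few lines. This is a minimal illustration, not our production pipeline: it assumes the KML has already been converted to a GeoJSON Polygon feature, that coordinates have been projected into a local planar frame in metres, and that a sensor covers a simple circle of fixed radius (obstructions and field of view are ignored). The function names are hypothetical.

```python
import math


def exterior_ring(feature):
    """Exterior ring of a GeoJSON Polygon feature as a list of (x, y) tuples."""
    geometry = feature["geometry"]
    if geometry["type"] != "Polygon":
        raise ValueError("this sketch handles Polygon geometries only")
    # GeoJSON positions may carry an optional altitude; keep x and y only.
    return [(x, y) for x, y, *_ in geometry["coordinates"][0]]


def point_in_polygon(x, y, ring):
    """Ray-casting test: count boundary crossings to the right of the point."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zone_grid(ring, cell):
    """Centres of square cells (side `cell`) whose centre lies inside the site."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    min_x, min_y = min(xs), min(ys)
    nx = math.ceil((max(xs) - min_x) / cell)
    ny = math.ceil((max(ys) - min_y) / cell)
    cells = []
    for i in range(nx):
        for j in range(ny):
            cx = min_x + (i + 0.5) * cell
            cy = min_y + (j + 0.5) * cell
            if point_in_polygon(cx, cy, ring):
                cells.append((cx, cy))
    return cells


def split_coverage(cells, sensors, radius):
    """Partition cells into those within `radius` of any sensor and the rest."""
    covered, blind = [], []
    for cx, cy in cells:
        seen = any(math.hypot(cx - sx, cy - sy) <= radius for sx, sy in sensors)
        (covered if seen else blind).append((cx, cy))
    return covered, blind
```

For a 40 m square site on a 10 m grid, a single sensor at (10, 10) with a 15 m radius covers 4 of the 16 cells. The remaining 12 form the blind-spot list that drives node placement and, in turn, the bill of materials.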
From there the work moves to the nodes themselves. We build over-the-air provisioning and firmware distribution for ESP32-class devices, including the awkward but essential capability of serial-over-USB flashing directly from an Android handset running Termux - because the reality of field work is that the most reliable computer on site is often the phone in the installer's pocket. We design the hardware bill of materials, model the power budget, plan the network topology, and place an on-site control node that can gate, stage and observe the fleet without depending on a perfect connection back to the cloud. None of this is exotic. It is the unglamorous plumbing that decides whether a deployment survives its first storm.
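The power-budget arithmetic is simple enough to show. The sketch below assumes a duty-cycled node with one active burst per reporting period and a flat sleep current otherwise. All figures used with it are illustrative rather than measured, and the usable-capacity derating of 0.8 is an assumption standing in for temperature, ageing and cut-off voltage effects.

```python
def average_current_ma(active_ma, active_s, sleep_ma, period_s):
    """Time-weighted average current over one wake/sleep period, in mA."""
    if not 0 <= active_s <= period_s:
        raise ValueError("active time must fit inside the period")
    sleep_s = period_s - active_s
    return (active_ma * active_s + sleep_ma * sleep_s) / period_s


def battery_life_hours(capacity_mah, avg_ma, usable_fraction=0.8):
    """Estimated runtime in hours; usable_fraction is an assumed derating."""
    if avg_ma <= 0:
        raise ValueError("average current must be positive")
    return capacity_mah * usable_fraction / avg_ma
```

With a hypothetical node drawing 120 mA for 2 s every 5 minutes and 0.02 mA asleep, the average is about 0.82 mA, so a 3000 mAh cell derated to 80% lasts roughly 2,900 hours, or around four months. The active burst dominates the total, which is why shortening connection time usually matters more than shaving sleep current.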
The industry context: this is a discipline, not a marketing claim
It would be easy to treat all of this as bespoke craft. It is not - and the standards literature makes that clear. Device identity, for example, has a named, well-specified pattern. AWS IoT Core documents fleet provisioning 'by claim': devices ship with a shared claim certificate, and on first connection the platform registers each device through a provisioning template and issues it a unique X.509 client certificate, using the CreateKeysAndCertificate, CreateCertificateFromCsr and RegisterThing APIs. If a claim certificate is compromised, deactivating it blocks any further provisioning with that certificate, but it does not affect devices that have already enrolled and hold their own certificates. That is exactly the practitioner pattern we build toward - bootstrapping a generic identity into a per-device one - now with a vendor specification to hold ourselves against.
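The message flow behind provisioning by claim can be sketched without an MQTT client. The topic names and the certificateOwnershipToken field follow the AWS IoT Core fleet provisioning MQTT API as we understand it. The helper names, the template name and the SerialNumber parameter are hypothetical, and the empty JSON object used as the first request body is an assumption. A real device would publish these over a TLS connection authenticated with the claim certificate.

```python
import json

# Request topic for CreateKeysAndCertificate with JSON payloads.
CREATE_KEYS_TOPIC = "$aws/certificates/create/json"


def response_topics(request_topic):
    """Topics the device subscribes to before publishing a request."""
    return request_topic + "/accepted", request_topic + "/rejected"


def create_keys_request():
    """Step 1: ask the platform for a unique key pair and certificate.

    The request carries no fields; sending "{}" is an assumption of this sketch.
    """
    return CREATE_KEYS_TOPIC, "{}"


def register_thing_request(template_name, create_response, parameters):
    """Step 2: exchange the ownership token for a registered thing.

    create_response is the parsed JSON from the /accepted topic of step 1.
    parameters are whatever the provisioning template declares.
    """
    topic = f"$aws/provisioning-templates/{template_name}/provision/json"
    payload = {
        "certificateOwnershipToken": create_response["certificateOwnershipToken"],
        "parameters": parameters,
    }
    return topic, json.dumps(payload)
```

The device stores the returned certificate and private key, reconnects with them, and stops using the claim certificate from then on.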
The same maturity exists for device lifecycle. The Eclipse Sparkplug specification defines an MQTT topic namespace, payload encoding and session-state management for industrial IoT, using birth and death certificates (NBIRTH/NDEATH for edge nodes, DBIRTH/DDEATH for devices) layered on MQTT's Last Will and Testament so the whole system has real-time awareness of which devices are alive, what they can do and what state they hold. This sharpens a conviction we hold across our edge and AI work: a device's lifecycle is itself a contract carried over the wire, not ad-hoc telemetry to be parsed hopefully later. That idea connects directly to how we think about 'Event Contracts as the Coordination Layer for Mixed Human and Agent Teams' in software systems - the same discipline of making availability and capability explicit, whether the participant is a microcontroller or an autonomous agent.
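To make the lifecycle contract concrete, here is a minimal sketch of a host-side tracker driven only by Sparkplug B topic names, which take the form spBv1.0/group/message_type/edge_node/device, with the device segment omitted for node-level messages. It is deliberately simplified: it ignores the protobuf payload, sequence numbers and host STATE messages, and the FleetState class is our own hypothetical name, not part of the specification.

```python
NAMESPACE = "spBv1.0"


def parse_topic(topic):
    """Split a node or device topic into (group, msg_type, node, device)."""
    parts = topic.split("/")
    if len(parts) not in (4, 5) or parts[0] != NAMESPACE:
        raise ValueError(f"not a Sparkplug B node/device topic: {topic!r}")
    _, group, msg_type, node = parts[:4]
    device = parts[4] if len(parts) == 5 else None
    return group, msg_type, node, device


class FleetState:
    """Tracks which edge nodes and devices are currently alive."""

    def __init__(self):
        self.nodes = {}    # (group, node) -> online flag
        self.devices = {}  # (group, node, device) -> online flag

    def apply(self, topic):
        group, msg_type, node, device = parse_topic(topic)
        if msg_type == "NBIRTH":
            self.nodes[(group, node)] = True
        elif msg_type == "NDEATH":
            # A dead edge node takes every device behind it offline too.
            self.nodes[(group, node)] = False
            for key in self.devices:
                if key[:2] == (group, node):
                    self.devices[key] = False
        elif msg_type == "DBIRTH" and device is not None:
            self.devices[(group, node, device)] = True
        elif msg_type == "DDEATH" and device is not None:
            self.devices[(group, node, device)] = False
        # Data and command messages do not change liveness in this sketch.
```

Because NDEATH is registered as the edge node's Last Will, the broker publishes it on an unclean disconnect, so the tracker learns about a lost node without polling.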
Firmware distribution is equally specified. Mender's OTA guidance prescribes atomic A/B dual-partition image updates with automatic rollback to the last good state, signed-artifact verification, resumable transfers and phased rollouts that avoid every device polling and updating at once - and quantifies the stakes, estimating that around 8.5% of devices in a large fleet can fail within three years when supported by a poorly designed update mechanism. The discipline of field observability follows the same logic: a fleet is only as recoverable as it is legible, which means capturing crashes, faults and reboot reasons, resource and flash-wear trends, and connectivity signals such as signal strength, reconnect attempts and disconnections. The serial tooling underneath much of this is unremarkable by design - an open-source flashing utility driving the ESP32 ROM bootloader runs perfectly well as a Python command-line tool on an Android/Termux host against a /dev/tty serial port, which is precisely why the phone-in-the-pocket workflow is viable rather than a gimmick.
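Two of the mechanisms above, A/B slots with automatic rollback and phased rollouts, reduce to very little logic. The sketch below is a generic illustration under our own assumptions, not Mender's implementation: the BootState fields, the single trial boot, and the SHA-256 bucketing of device IDs are all hypothetical choices, and signature verification is reduced to a boolean supplied by the caller.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootState:
    active: str = "A"              # slot known to be good
    pending: Optional[str] = None  # slot written but not yet committed
    tries_left: int = 0            # trial boots remaining for the pending slot


def install(state, signature_ok):
    """Write the new image to the inactive slot; refuse unsigned artifacts."""
    if not signature_ok:
        return False
    state.pending = "B" if state.active == "A" else "A"
    state.tries_left = 1
    return True


def select_boot_slot(state):
    """Bootloader decision on every boot, including after a power cut."""
    if state.pending and state.tries_left > 0:
        state.tries_left -= 1
        return state.pending
    if state.pending:
        # Trial boot was used and never committed: roll back.
        state.pending = None
    return state.active


def commit(state):
    """Called by the new firmware once its health checks pass."""
    if state.pending:
        state.active, state.pending, state.tries_left = state.pending, None, 0


def in_rollout(device_id, percent):
    """Deterministically place a device in the first `percent` of the fleet."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

If the new image crashes or loses power before calling commit, the next boot falls back to the old slot with no operator involved. Because the bucket is derived from the device ID, raising the rollout percentage only adds devices and never removes any that were already included.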
What we are learning
The thesis that keeps surviving contact with the field is that edge reliability is governed by what happens when things are not ideal - and that this is the primary axis of failure, not a finishing polish. Gartner has predicted that by 2025 half of enterprise edge computing solutions deployed before a cohesive strategy is in place will fail to meet their deployment-time, functionality and cost goals, and that edge proofs-of-concept must explicitly test for scale and tolerance to disconnection. That matches what we see: the demo that worked on the bench is not evidence of anything. The proof is the node that loses power mid-update, reconnects on a weak signal, and comes back to a known-good state without a truck roll.
Connected systems fail at the seams, not in the silicon. Reliable IoT is as much an operations problem as an electronics one - and the deployments that survive are the ones designed for the bad day, not the demo day.
We are also learning where the genuinely new compute fits. Edge AI on microcontroller-class hardware is now a measurable engineering discipline, not a buzzword: MLPerf Tiny from MLCommons benchmarks inference on devices running roughly 10-250 MHz at a few milliwatts with models around 100 kB and below, across keyword spotting, visual wake words, tiny image classification and machine-sound anomaly detection. And the hardware choice has measured consequences: a 2024 study of heterogeneous edge SoCs found NPUs roughly 58.6% faster at matrix-vector multiplication and up to 3.2x faster on video and LLM-style workloads at lower power, while GPUs were faster on some matrix and LSTM workloads but more energy-hungry. The broader lesson is that you match the workload to the accelerator rather than betting on one compute unit. This is why our power budgeting and BOM decisions are now AI-aware from the start, and why we resist treating a device as a thin client when a few milliwatts of local inference could remove a round trip on a flaky link.
An honest note on the stage
This strand is at Production stage, which we mean precisely: patterns described here are running in real deployments, not concepts on a whiteboard. We are not, however, claiming an AI track record we have not earned - the field is young, and we are seasoned delivery, architecture and product engineers applying emerging tools in the open. The spatial pipeline, OTA provisioning and on-site control node have earned their place by surviving real sites. The edge-AI elements are newer and held more loosely; we find the trade-offs the benchmarks describe persuasive, but our own evidence is still accumulating. We try to be candid about that line. The same honesty applies to firmware: the discipline holds regardless of fleet size - signed firmware, encrypted transport, fail-safe rollback, staged testing on a representative sample and power-aware scheduling - and getting any of those wrong is how a fleet becomes a liability rather than an asset.
Where this points
The reason this work sits inside Product Labs rather than in a hardware silo is that the lessons generalise. The seams that break connected devices are the same seams that break distributed software: implicit contracts, optimistic assumptions about availability, and diagnostics added as an afterthought. The discipline of generating diagrams from authoritative data, gating change through a control node, and treating lifecycle as a contract is the same discipline we bring to 'The Integration Seam Is Where AI-Generated Software Breaks: Payments, Identity and the Limits of Generation' and to our wider 'Delivery Architecture: The Translation Layer' thinking. We will keep publishing what we learn here as the deployments age - especially the failures, which are where the genuine engineering lives.