09-21-2021 06:04 AM
TL;DR - I've got a home networking issue I've been trying to diagnose for over a year, and after an exhaustive troubleshooting effort I've arrived at a theory: Verizon is limiting the number of active outbound (or inbound, for that matter) TCP/IP sessions per FIOS customer. If anyone else is seeing symptoms like this, I'd love to hear about it.
A year and a half ago I moved into a house with a lot of space, deployed my own Cisco RV345 router and a dnsmasq DNS/DHCP server, and used the extensive coax infrastructure in the house to create a "MoCA backbone" with Actiontec APs terminating in 6 locations throughout the house. Initially, this worked incredibly well: stellar WiFi coverage, excellent throughput, and solid reliability.
Slowly over the first 6-9 months in the house, I started building out my "smarthome" environment: a lot of Wemo devices, Google Nest cameras/doorbells, Amazon Alexas, WiFi-enabled garage door openers, etc. Devices add up quickly, and around Christmas last year, after adding a big batch of Wemo smart plugs, I was at around 100 individual WiFi-enabled devices online. That's when everything went south.
Nest thermostats would go "offline"... then a simple reset would bring them back immediately... then another would go "offline". A handful of my Wemo devices would always show offline, and turning "everything" off at once (about 50 Wemo devices) would always leave a bunch on. Wemos have caused me a LOT of pain over the years with reliability issues, and after seeing a bunch of posts online I suspected it was something to do with them, and maybe a hardware revision issue on the Nest thermostats.
Yes, I did countless hours of troubleshooting: rebooting, inspecting responses from my dnsmasq server for both DNS and DHCP, network scans, etc. The only thing in common was that it was the "IoT" devices with a cloud-based control method that would suffer, and routinely it was like pushing on a balloon: reset one IoT device and it would come online, then another would go off. As far as concurrent connections go, my Cisco/Actiontec rig was well within its limits.
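For anyone wanting to replicate the DNS/DHCP checks: with `log-queries` and `log-dhcp` enabled in dnsmasq, a quick script can confirm that a given device both got a lease (DHCPACK) and is actually issuing DNS queries. This is just a sketch; the log lines, IPs, and hostnames below are hypothetical samples, not captures from my network:

```python
# Hypothetical dnsmasq log excerpt (format as produced with
# log-queries + log-dhcp enabled). Replace with your real log file.
SAMPLE_LOG = """\
Sep 21 06:04:01 dnsmasq-dhcp[1234]: DHCPACK(eth0) 192.168.1.61 aa:bb:cc:dd:ee:01 wemo-plug-1
Sep 21 06:04:02 dnsmasq[1234]: query[A] api.xbcs.net from 192.168.1.61
Sep 21 06:04:02 dnsmasq[1234]: reply api.xbcs.net is 34.210.9.8
Sep 21 06:05:10 dnsmasq-dhcp[1234]: DHCPACK(eth0) 192.168.1.62 aa:bb:cc:dd:ee:02 nest-thermostat
""".splitlines()

def device_health(log_lines, ip):
    """Return (got_dhcp_lease, made_dns_queries) for one client IP."""
    got_lease = any("DHCPACK" in l and f" {ip} " in l for l in log_lines)
    queried = any("query[" in l and l.endswith(f"from {ip}") for l in log_lines)
    return got_lease, queried

# The Wemo plug leased an address AND resolved names: DHCP/DNS look healthy.
print(device_health(SAMPLE_LOG, "192.168.1.61"))
# The thermostat leased an address but logged no queries: worth a closer look.
print(device_health(SAMPLE_LOG, "192.168.1.62"))
```

In my case every "offline" device passed both checks, which is exactly what pushed me to look past DNS/DHCP and at the session layer.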
So, I kinda lost it in frustration. 🙂 I removed a bunch of Wemo plugs I wasn't using after the holidays, RMA'd my 3 Google Nest thermostats, and decided to swap out the entire rig for a new professional Ubiquiti system, deploying 10 access points through the house and the front and back yards. In addition, I finally started decommissioning all the Wemo switches and plugs and went with Kasa instead. I'm still using my own dnsmasq server for DNS and DHCP so I can see what's transpiring.
Initially everything was awesome. Ubiquiti is expensive, but the insight is excellent, deployment is a snap, and performance, especially handovers between APs, is commercial-grade. My Kasa switches and plugs were WAY better than Wemo and working flawlessly, so I decided to put on the heavy push and install a lot more.
Things were working perfectly until last week, when I deployed about a dozen Kasa plugs throughout the bedrooms for local lamp control. The next day, I noticed several of my MyQ garage doors were showing "offline", so I reset the WiFi on them but was still having issues. Then I opened my Nest app to see one of the thermostats was offline... then I noticed a couple of Kasa plugs showing offline. At this point I did a little mental math and realized I'm at about 120 WiFi-enabled devices, a decent number of which require "cloud-based control" for app usage.
So, the hair pulling begins again. I've been up for two days solid, and I have literally tried everything: COMPLETELY re-installed my Ubiquiti rig, converted off my custom dnsmasq deployment to the Ubiquiti built-in DNS/DHCP, rebooted everything, and run network scans that find every system on my network online. With Ubiquiti I'm getting good detail on traffic, enough to prove I have no issues with systems joining my WiFi or getting an IP address. My network is solid, and yet a good 5-10 Kasa/Google/MyQ/other IoT devices using a cloud management service are showing as "offline" at any time... if I reset them, they come back, but then some other device goes offline shortly after.
Two COMPLETELY different networking deployments, Wemo replaced with Kasa, a bigger WiFi rollout, etc., but as soon as I hit the same number of "cloud controlled IoT devices" that tripped me up 6 months ago on my Cisco+Actiontec rig, I'm in the same situation with random devices going offline. Honestly, the only conclusion I can come to is that Verizon has some sort of limit on active inbound or outbound sessions, and as I scale up I'm hitting that limit. I've been doing significant IP networking projects over several decades, and while all signs point to this, I'd love to hear if anyone else is experiencing something similar.
Troubleshooting this further is super challenging... I'm going to rip out the last batch of Kasa plugs to see if things calm down, but if they do, I still come back to my theory that Verizon limits the number of active inbound/outbound sessions per residential customer. Happy to hear any other theories!
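If anyone wants to test the session-limit theory on their own Linux-based router, one approach is to watch the kernel's connection-tracking count over time and see whether it plateaus around when devices start dropping. A minimal sketch, with assumptions flagged: the `/proc` path is the standard Linux netfilter counter, but the 2048 figure is purely a placeholder; I have no idea what limit (if any) Verizon actually enforces.

```python
import time
from pathlib import Path

# Standard Linux netfilter counter (requires the nf_conntrack module).
CONNTRACK_COUNT = Path("/proc/sys/net/netfilter/nf_conntrack_count")

# ASSUMPTION: placeholder ceiling for illustration only; the real limit,
# if one exists upstream, is unknown.
SUSPECTED_LIMIT = 2048

def near_limit(n, limit=SUSPECTED_LIMIT, margin=0.9):
    """True when the entry count is within `margin` of the suspected ceiling."""
    return n >= margin * limit

def watch(interval=60):
    """Log the conntrack table size periodically to correlate with outages."""
    while True:
        n = int(CONNTRACK_COUNT.read_text())
        flag = "  <-- near suspected limit" if near_limit(n) else ""
        print(f"{time.strftime('%H:%M:%S')} conntrack entries: {n}{flag}")
        time.sleep(interval)
```

Running `watch()` during a "devices going offline" episode and comparing the curve against the device count would at least show whether the table size tracks the symptoms.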
09-21-2021 06:53 AM
I'm on the Ubiquiti system now, and I can see 2,463 concurrent flow entries: 268 in ESTABLISHED, 1,661 in TIME_WAIT, 268 UDP, and a bunch of others. What's really interesting is that when I run a conntrack analysis, I see a bunch of UDP sessions marked as "UNREPLIED", and these correspond to the devices with cloud services that are showing as "down" in their respective apps.
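For reference, here's roughly how I'm tallying states from the conntrack table. The flow lines below are hypothetical samples in the format `conntrack -L` emits; on the router itself you'd pipe in the real table instead of this embedded snippet:

```python
import re
from collections import Counter

# Hypothetical sample of `conntrack -L` output; IPs are made up.
SAMPLE = """\
tcp      6 431999 ESTABLISHED src=192.168.1.50 dst=34.210.9.8 sport=50514 dport=443 src=34.210.9.8 dst=71.160.0.10 sport=443 dport=50514 [ASSURED] use=1
tcp      6 117 TIME_WAIT src=192.168.1.51 dst=34.210.9.9 sport=50520 dport=443 src=34.210.9.9 dst=71.160.0.10 sport=443 dport=50520 use=1
udp      17 28 src=192.168.1.61 dst=52.34.5.6 sport=40001 dport=443 [UNREPLIED] src=52.34.5.6 dst=71.160.0.10 sport=443 dport=40001 use=1
udp      17 170 src=192.168.1.62 dst=8.8.8.8 sport=40100 dport=53 src=8.8.8.8 dst=71.160.0.10 sport=53 dport=40100 [ASSURED] use=1
"""

def summarize(lines):
    """Tally flow states and collect LAN sources of unanswered UDP flows."""
    states = Counter()
    unreplied = []  # LAN source IPs whose outbound UDP never got a reply
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        proto = fields[0]
        if proto == "tcp":
            # TCP entries carry an explicit state in the 4th field.
            states[fields[3]] += 1
        elif proto == "udp":
            # UDP entries don't; classify by whether a reply was ever seen.
            state = "UNREPLIED" if "[UNREPLIED]" in fields else "REPLIED"
            states["udp " + state] += 1
            if state == "UNREPLIED":
                m = re.search(r"src=(\S+)", line)
                if m:
                    unreplied.append(m.group(1))
    return states, unreplied

states, unreplied = summarize(SAMPLE.splitlines())
print(dict(states))   # per-state tally
print(unreplied)      # cross-check these IPs against the apps' "offline" list
```

Cross-referencing the `unreplied` list against the devices each app reports as "down" is what made the pattern jump out for me.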