My Journey recovering from a major UniFi Controller failure
I have quite a complicated UniFi networking setup that segregates more than 20 different networks, and I had been using a well-known third-party UniFi hosting provider since 2018.
Three days ago they updated their UniFi Controller, which I had been using for free as a long-term beta member. The update broke it, and they neither admitted the problem nor offered to help.
I asked them for a backup, since they had cut the Site Migrate and Backup functionality out of my account. They refused, so I had to redo my network…
Since my current config was still running on the UniFi gear, I wanted a smooth transition instead of rebuilding from scratch. I also didn’t want to completely overhaul my rack, and mess up my HA setup, just so that the Proxmox node serving the UniFi Controller as a VM would always be connected to the gear.
I publish my UniFi Controller for control traffic so that it can serve other locations I support (friends, family), so I had to make sure it was always available from the Internet. This also made the migration easier, since the gear only needs Internet access to be adopted and re-integrated into the network.
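A quick way to confirm that a published controller is reachable from outside is a plain TCP connect test. The following is only a minimal sketch: the function name `is_reachable` is made up for illustration, the hostname in the usage comment is a placeholder, and port 8080 is UniFi's default inform port, which may differ behind a reverse proxy such as HAProxy.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False


# Example usage (placeholder hostname, assumed default inform port):
# print(is_reachable("unifi.example.com", 8080))
```

Running this from a machine outside the home network, for example over a mobile hotspot, tests the published path rather than the local one.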
The Before
Above you see the basic map view of the network, followed by a broader look at the network setup.
The Plan
- Install the new UniFi Controller in a VM on one Proxmox host
- Pre-configure the VLANs and networks from documentation
- Publish the controller via pfSense/HAProxy
- Change the UniFi Controller DHCP IP on the USG via the command line
- Reset the access points one by one
- Adopt them with the new controller
- Reset switch2 and adopt it
- Prepare the USG ports on switch2
- Move the USG patch cables (one by one, keeping the uplink)
- Reset the core switch (this activates the STP-blocked secondary link between switch1 and switch2)
- Adopt the core switch and configure its DSL modem, USG and trunk ports
- Move the Proxmox node and one AP to switch2
- Reset switch1, adopt it and configure the NAS, Proxmox and UAP ports
- Move the Proxmox node and the AP back
The Result
All went smoothly because I had planned for redundancy of the Internet connection, the Proxmox uplink and the switch links, so I never had to plug into a switch via Ethernet to configure something that was isolated from the rest of the network.
Well, not quite. I had one issue to overcome: when I reset the DSL switch, I lost connectivity to the UniFi Controller VM, since its uplink went through the pfSense connected to DSL. So I moved the VM into the VLAN of the modem to connect it directly to the provider NAT, just for the adoption of the core switch.
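The reasoning behind "which device can I reset without cutting off the controller" can be checked ahead of time with a small reachability test on a graph. The sketch below uses a hypothetical, heavily simplified topology (the names in `LINKS` and the helper functions are illustrative, not my real cabling): the controller VM has two switch uplinks, but only one path to the Internet through the core switch.

```python
from collections import deque


def can_reach(links, start, goal, down=()):
    """Breadth-first search over undirected links, ignoring devices in `down`."""
    down = set(down)
    if start in down or goal in down:
        return False
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen and nxt not in down:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical, simplified topology (assumption for illustration only).
LINKS = [
    ("controller_vm", "switch1"),
    ("controller_vm", "switch2"),
    ("switch1", "switch2"),
    ("switch1", "coreswitch"),
    ("switch2", "coreswitch"),
    ("coreswitch", "pfsense_dsl"),
    ("pfsense_dsl", "internet"),
]


def unsafe_resets(links, devices, start="controller_vm", goal="internet"):
    """Return the devices whose reset would cut `start` off from `goal`."""
    return [d for d in devices if not can_reach(links, start, goal, down=[d])]


print(unsafe_resets(LINKS, ["switch1", "switch2", "coreswitch"]))
```

In this model either access switch can be reset safely, while the core switch is the single point of failure, which mirrors the one issue described above.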
But wait, where is the USG?
This was a bit trickier: I had to make sure the controller had Internet access while the central router was down. How did I make it work?
Remember the LTE uplink that is available to pfSense as a fallback for the published services?
Well, here in Germany our LTE providers don’t give us public IPs with port forwarding, even though the contract is a home Internet access plan and not just a data plan.
So I have a cloud-hosted VM acting as a VPN exit node for the LTE connection. I simply connected my DSL pfSense to the same instance via VPN and configured NAT to fail over to the other link when one goes down, effectively bypassing my Cloudflare load balancer, which relies on port forwarding on the DSL line.
With this setup I could reset the USG while the controller stayed connected through the VPN connections, bypassing the USG.
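The failover idea itself is simple: keep the links in priority order and send traffic to the first one that is up. The following is only an illustrative sketch of that selection logic, not the actual NAT configuration on the cloud VM; the link names and the `pick_active_link` helper are hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_active_link(
    links_by_priority: Sequence[str],
    is_up: Callable[[str], bool],
) -> Optional[str]:
    """Return the first link in priority order that reports as up, else None."""
    for name in links_by_priority:
        if is_up(name):
            return name
    return None


# Hypothetical link names: the DSL tunnel is preferred, LTE is the fallback.
PRIORITY = ["dsl_vpn", "lte_vpn"]

# Simulated health state standing in for a real tunnel health check.
status = {"dsl_vpn": False, "lte_vpn": True}
print(pick_active_link(PRIORITY, lambda name: status[name]))
```

In a real setup the `is_up` check would come from whatever health probe the VPN or firewall provides.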
How did I connect to it after the reset?
I created a WLAN attached to the USG’s native LAN network and connected to it with my Mac to configure the USG.
Once the USG was adopted, I had migrated my entire UniFi network to a new controller without any interruption to any hosted service.
The side effect: I can now cancel my Cloudflare load balancer subscription!
Originally published at https://www.pierewoehl.de on June 12, 2021.