Monitoring, managing and troubleshooting large-scale networks. Almost four years ago I came to NANOG and mostly complained about the state of network monitoring, par for the course for me. A lot has changed since then; we've solved many of the problems I raised. Perhaps more importantly, we've fundamentally changed how we manage, monitor and troubleshoot our network. We plan to share what we learned, what went well, and best of all, what went oh so terribly wrong. The driving philosophy behind this effort is that by taking an engineering approach to operations, you can greatly reduce the time to discover, mitigate and resolve issues on your network. We analyzed our faults, our pain points and the work that consumed most of our time. This let us prioritize what to tackle first; we were surprised by what turned out to cause the most outages, and by how much impact minor network issues can have when they land in the wrong place. As a result, today the majority of the faults that occur in our network are automatically detected and mitigated, all without human intervention.

Notes:
Facebook engineering with some great war stories on microbursts and FB apps that violate TCP backoff behavior.
They had a DB app that would detect congestion and move traffic to other links. As a result, they would see congestion events that migrated between network bundles on other pairs of gear.
They had to fix the algorithm in the DB app (Tau), which resolved the bizarre congestion issue that behaved like a broken TCP state machine. It took them months to years to find and resolve. Issue 1: the service teams were complaining while the network group couldn't find the issue first.
- Created a new detection suite to look for loss and latency (they re-invented smokeping alerting). They found that the acceptable loss rate for TCP/IP was 0.1%.
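A minimal sketch of the kind of smokeping-style check described above, assuming a probe system that reports sent/received counts and RTT samples per path. The thresholds, path names and function shape are my assumptions, not Facebook's tooling; the 0.1% loss threshold comes from the talk.

```python
LOSS_ALERT_THRESHOLD = 0.001   # 0.1% -- the loss level the talk cites as TCP's tolerance
LATENCY_ALERT_MS = 5.0         # hypothetical intra-datacenter RTT budget

def check_path(path, probes_sent, probes_received, rtts_ms):
    """Return a list of alert dicts for one probed path (empty list if healthy)."""
    alerts = []
    loss = (probes_sent - probes_received) / probes_sent
    if loss > LOSS_ALERT_THRESHOLD:
        alerts.append(("loss", loss))
    if rtts_ms and max(rtts_ms) > LATENCY_ALERT_MS:
        alerts.append(("latency", max(rtts_ms)))
    return [{"path": path, "reason": r, "value": v} for r, v in alerts]

# 10,000 probes with 15 lost is 0.15% loss -- enough to trip the alert
alerts = check_path("rackA->rackB", 10_000, 9_985, [0.4, 0.5, 0.6])
```

The point of alerting on such tiny loss rates is the next note: well below the level a ping-based check would normally flag, TCP throughput has already collapsed.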
A tiny amount of loss could result in a 50 percent reduction in throughput, even at minimal RTT.
Any loss is interpreted as congestion, and TCP aggressively backs off. They tried various TCP congestion-control algorithms: Vegas was worst, Illinois best, a 4x difference in performance.
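The throughput collapse can be illustrated with the well-known Mathis model for Reno-style TCP, throughput ≤ (MSS/RTT) · C/√p. This is a generic back-of-envelope model, not the analysis from the talk, but it shows how a 0.1% loss rate caps throughput even at a 1 ms RTT:

```python
import math

def mathis_throughput_mbps(mss_bytes, rtt_ms, loss_rate):
    """Mathis-model upper bound for Reno-style TCP: BW <= (MSS/RTT) * C/sqrt(p)."""
    C = math.sqrt(3.0 / 2.0)
    rtt_s = rtt_ms / 1000.0
    return (mss_bytes * 8 / rtt_s) * C / math.sqrt(loss_rate) / 1e6

# At a 1 ms RTT with a 1460-byte MSS, 0.1% loss caps a flow near 450 Mb/s,
# while 0.001% loss would allow ten times that.
cap = mathis_throughput_mbps(1460, 1.0, 0.001)
clean = mathis_throughput_mbps(1460, 1.0, 0.00001)
```

Since throughput scales with 1/√p, every 100x increase in loss costs 10x in throughput, which is why sub-percent loss rates matter so much.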
Next - DB team reported that they couldn't recover from network outages - wouldn't ramp back up. Found that diff algo in recover was 15x - reno didn't recover, cubic recovered rapidly.
Facebook deployed cubic everywhere.
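On Linux the congestion-control algorithm is a per-host sysctl, so a fleet-wide rollout like this reduces to pushing one setting everywhere. A sketch of the standard knobs (the talk does not describe their deployment mechanism; CUBIC has also been the Linux default since kernel 2.6.19):

```shell
# List the algorithms this kernel offers
sysctl net.ipv4.tcp_available_congestion_control

# Switch the running system to CUBIC
sysctl -w net.ipv4.tcp_congestion_control=cubic

# Persist across reboots
echo "net.ipv4.tcp_congestion_control = cubic" >> /etc/sysctl.d/90-tcp.conf
```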
Tuned their detection system to lower the number of alerts; interface issues are the primary thing to look at. They have a set of tools that locates interface issues and automatically resolves them by moving traffic to other links and opening a ticket. Auto-resolution: no new change control or SLAs, no new staff.
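The drain-and-ticket loop described above can be sketched as below. Everything here is a hypothetical stand-in (the interface name, the error threshold, and the `drain`/`open_ticket` callbacks), since the talk describes the behavior, not the implementation:

```python
ERROR_THRESHOLD = 100  # hypothetical error count per polling interval

def remediate(interface, counters, drain, open_ticket):
    """Drain a misbehaving interface and file a ticket; return the action taken."""
    if counters.get("crc_errors", 0) < ERROR_THRESHOLD:
        return "ok"
    drain(interface)          # shift traffic onto the remaining links in the bundle
    open_ticket(interface)    # a human repairs the optic/cable later
    return "drained"

# Usage with recording callbacks in place of real device/ticketing calls
actions = []
result = remediate(
    "eth-1/0/3",
    {"crc_errors": 512},
    drain=lambda i: actions.append(("drain", i)),
    open_ticket=lambda i: actions.append(("ticket", i)),
)
```

The key design point from the talk is that the machine mitigates immediately (drain) while the slow part, physical repair, goes to a ticket queue.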
A set of scripts called "FBAR" processes 3.6B syslog messages; 1 percent are real issues. FBAR then interacts with FB devices 750K times per month, with a 97 percent resolution rate.
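A toy version of that triage step, assuming FBAR-style signature matching over raw syslog lines; the signatures and message formats here are invented for illustration. Only lines matching a known signature become actionable events, which is how billions of messages reduce to the ~1% worth acting on:

```python
import re

# Hypothetical signatures; real systems would have a large, curated rule set
SIGNATURES = [
    ("link_flap", re.compile(r"Interface \S+ changed state to down")),
    ("power_fail", re.compile(r"PSU\d+ failure")),
]

def triage(lines):
    """Return (actionable_events, noise_count) for a batch of syslog lines."""
    events, noise = [], 0
    for line in lines:
        for name, pattern in SIGNATURES:
            if pattern.search(line):
                events.append((name, line))
                break
        else:
            noise += 1
    return events, noise

batch = [
    "Jan 1 00:00:01 sw1: Interface eth1 changed state to down",
    "Jan 1 00:00:02 sw1: user admin logged in",
]
events, noise = triage(batch)
```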
They would need 150 engineers to resolve the same set of issues in the same amount of time.
https://www.facebook.com/groups/netengcode/