One of the many things I did last summer (2020) was set up HAProxy as a load-balancer for my CS department’s incoming ssh connections. I completed this project in partnership with my CS department’s wonderful sysadmin, Jeff Knerr. This page is a modified mirror of Jeff’s page that describes our setup.
The Swarthmore CS department maintains its own servers (dns, web, email, ldap, etc.) and lab computers. The 100 or so lab computers, all of which run linux, are spread out over the various lab locations on campus.
To access the lab computers remotely, students ssh in. This is an especially popular feature for students who either (1) aren’t on campus or (2) want to work from elsewhere on campus. The former is now especially relevant, as our institution has halved the number of students on campus and eliminated in-person computer lab access in response to the pandemic.
To make this easy for students, we used one address (e.g., lab.myschool.edu) that students could ssh to, which would load-balance the ssh connections across our various lab computers. Additionally, if any computers happen to reboot or be down, we wanted them automatically (and quickly) removed from the load-balancing rotation, and put back into the rotation when they were up again.
- One server (we use debian) runs HAProxy (v1.8.19) and has the hostname you want the students to use when connecting (e.g.,
- Nobody actually logs in to the HAProxy server except admins, so you don’t need to set up student accounts on the HAProxy server
- Set up HAProxy to bind to port 22 (see the `listen ssh` part of the config file below)
- Also set up sshd on the HAProxy server to run on a different port (set in `/etc/ssh/sshd_config`) so your admins can still get to it (`ssh -p 9000 lab.myschool.edu`)
- Set up the same ssh host keys for all of the lab computers you want in the load-balancing rotation, otherwise your users will get WARNING SSH HOST KEY CHANGED messages each time they get sent to a different computer
- You probably want to put all of these ssh host pub keys in a `/etc/ssh/ssh_known_hosts2` file, and distribute it to all of your lab computers
For example, here’s a simplified section of our `ssh_known_hosts2` file (with fake names and IP addresses), only showing one of the key types (ed25519) for each host:
    lab,lab.myschool.edu,188.8.131.52 ssh-ed25519 AAAAO3Nzaer56DI1NTE5AAAAIJfPzJHRiiiwhrGposISykHMLvpcowKnjRbUxb028Klx root@hostA
    hostA,hostA.myschool.edu,184.108.40.206 ssh-ed25519 AAAAO3Nzaer56DI1NTE5AAAAIJfPzJHRiiiwhrGposISykHMLvpcowKnjRbUxb028Klx root@hostA
    hostB,hostB.myschool.edu,220.127.116.11 ssh-ed25519 AAAAO3Nzaer56DI1NTE5AAAAIJfPzJHRiiiwhrGposISykHMLvpcowKnjRbUxb028Klx root@hostA
    hostC,hostC.myschool.edu,18.104.22.168 ssh-ed25519 AAAAO3Nzaer56DI1NTE5AAAAIJfPzJHRiiiwhrGposISykHMLvpcowKnjRbUxb028Klx root@hostA
So each lab computer (hostA, hostB, hostC) has the same `ssh_host_ed25519_key.pub` file (in `/etc/ssh`). Do the same for any other host key types you use (rsa, ecdsa, etc). If a student ssh’s to any of those hosts, by any name or IP address (hostA, hostB.myschool.edu, 22.214.171.124), they should see the same ssh host key.
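As a rough illustration of the shared-key idea above, here is a minimal Python sketch that builds `ssh_known_hosts2`-style lines and checks that every key type in the file maps to exactly one key. The function names, the placeholder key, and the 192.0.2.x addresses are all hypothetical, and the parser assumes plain (unhashed, marker-free) entries like the ones shown above.

```python
from collections import defaultdict


def known_hosts_line(names, key_type, key_blob, comment=""):
    """Build one entry: comma-separated names, key type, key, optional comment."""
    fields = [",".join(names), key_type, key_blob]
    if comment:
        fields.append(comment)
    return " ".join(fields)


def keys_by_type(known_hosts_text):
    """Map each key type to the set of distinct key blobs found in the text."""
    seen = defaultdict(set)
    for line in known_hosts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) < 3:
            continue  # not a "names type key" entry
        _names, key_type, key_blob = parts[:3]
        seen[key_type].add(key_blob)
    return dict(seen)


def all_hosts_share_keys(known_hosts_text):
    """True when the file has entries and each key type has exactly one key."""
    by_type = keys_by_type(known_hosts_text)
    return bool(by_type) and all(len(blobs) == 1 for blobs in by_type.values())


# Hypothetical inventory: placeholder key and documentation-range IPs.
SHARED_KEY = "FAKEKEYBLOB"
HOSTS = {"hostA": "192.0.2.11", "hostB": "192.0.2.12", "hostC": "192.0.2.13"}

text = "\n".join(
    known_hosts_line([name, f"{name}.myschool.edu", ip],
                     "ssh-ed25519", SHARED_KEY, "root@hostA")
    for name, ip in HOSTS.items()
)
print(text)
print(all_hosts_share_keys(text))
```

A check like this could run before the file is distributed, so that a host with a stray key is caught before students see a host-key warning.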
And here’s a simplified version of our `/etc/haproxy/haproxy.cfg` file (from a server running Debian 10 (Buster)):
    global
        log /dev/log local0
        log /dev/log local1 notice
        chroot /var/lib/haproxy
        stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
        stats timeout 30m
        maxconn 2500
        user haproxy
        group haproxy
        daemon

    defaults
        log global
        mode tcp
        timeout connect 10s
        timeout client 36h
        timeout server 36h
        option dontlognull
        errorfile 400 /etc/haproxy/errors/400.http
        errorfile 403 /etc/haproxy/errors/403.http
        errorfile 408 /etc/haproxy/errors/408.http
        errorfile 500 /etc/haproxy/errors/500.http
        errorfile 502 /etc/haproxy/errors/502.http
        errorfile 503 /etc/haproxy/errors/503.http
        errorfile 504 /etc/haproxy/errors/504.http

    listen ssh
        bind *:22
        balance leastconn
        mode tcp
        option tcp-check
        tcp-check expect rstring SSH-2.0-OpenSSH.*
        server hostA 126.96.36.199:22 check inter 10s fall 2 rise 1
        server hostB 188.8.131.52:22 check inter 10s fall 2 rise 1
        server hostC 184.108.40.206:22 check inter 10s fall 2 rise 1
        # lots more hosts here (we currently have 42 hosts in here)

    listen stats
        mode http
        maxconn 15
        bind *:443 ssl crt /etc/letsencrypt/live/lab.myschool.edu/haproxy.pem
        stats enable
        stats show-node
        stats uri /stats
        stats refresh 10s
        stats auth adminname:adminpassword
        stats hide-version
Some things to note in the above config:
- We set the timeouts to 36 hours, since students sometimes set up long-running (overnight) jobs, and we figured they would check back in on them after 12-24 hours
- For each server line, the `check inter 10s fall 2 rise 1` part controls how HAProxy checks (every 10 sec) for offline (2 failed checks) and online (1 successful check) hosts
- The `tcp-check` line means HAProxy is looking for a string that starts with SSH-2.0-OpenSSH when checking if the ssh service is up on each host. Try `telnet hostname 22` to see what your sshd prints.
- Not shown here, we also use zabbix (`system.run[netstat -a -n | grep ESTABLISHED | wc -l]`) and grafana to make a pretty dashboard showing connections to HAProxy vs time, so we can see how the service is used during the week
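To make the health-check notes above more concrete, here is a minimal Python sketch of the same idea: read the banner a host sends on port 22, match it against the regex from the `tcp-check expect` line, and track up/down state with the `fall 2 rise 1` thresholds. This is only an approximation of what HAProxy does internally; the names `read_banner`, `banner_ok`, and `HostState` are hypothetical.

```python
import re
import socket

# Same regex as the "tcp-check expect rstring" line in the config.
BANNER_PATTERN = re.compile(r"SSH-2.0-OpenSSH.*")


def banner_ok(banner):
    """True if the text the server sent looks like an OpenSSH banner."""
    return BANNER_PATTERN.search(banner) is not None


def read_banner(host, port=22, timeout=5.0):
    """Connect and return the first bytes the server sends, or '' on failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            return conn.recv(256).decode("ascii", errors="replace").strip()
    except OSError:  # refused, unreachable, or timed out
        return ""


class HostState:
    """Track up/down state using fall/rise thresholds."""

    def __init__(self, fall=2, rise=1, up=True):
        self.fall = fall
        self.rise = rise
        self.up = up
        self.streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed):
        """Feed in one check result and return the resulting up/down state."""
        if check_passed == self.up:
            self.streak = 0
        else:
            self.streak += 1
            needed = self.fall if self.up else self.rise
            if self.streak >= needed:
                self.up = not self.up
                self.streak = 0
        return self.up


# Simulated check results: one isolated failure, then two failures in a row.
state = HostState(fall=2, rise=1)
history = [state.record(r) for r in [True, False, True, False, False, True]]
print(history)
```

In the simulated run, the single failure does not take the host out of rotation, the two consecutive failures do, and one successful check brings it back. Against a real host, each result would come from something like `banner_ok(read_banner("hostA.myschool.edu"))` run every 10 seconds.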
Two semesters in and the load balancing server is doing great. Students have used the service continuously, with sometimes close to 150 concurrent connections distributed over 40 computers! (That’s a lot for our tiny CS department.)
I set up the HAProxy server (at the behest of Jeff) to replace a round-robin DNS that was trying to load-balance across 10 machines. That worked, but the number of hosts we could use was too small. Moreover, it was slow to respond to unreachable hosts (we had to notice the host was down, then change the dns records to remove the host, then worry about cached dns data).
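To illustrate how this differs from round-robin DNS, here is a simplified Python sketch of the idea behind the `balance leastconn` setting in the config: among the hosts currently marked up, pick the one with the fewest active connections. HAProxy's real algorithm also handles weights and ties differently; the function name and the connection counts below are hypothetical.

```python
def pick_leastconn(active_connections, up_hosts):
    """Return the up host with the fewest active connections, or None."""
    candidates = [host for host in active_connections if host in up_hosts]
    if not candidates:
        return None  # every host is down
    return min(candidates, key=lambda host: active_connections[host])


# Hypothetical snapshot: hostB is the least loaded but is currently down.
connections = {"hostA": 4, "hostB": 1, "hostC": 2}
print(pick_leastconn(connections, up_hosts={"hostA", "hostC"}))
```

Unlike a DNS rotation, a down host simply never appears among the candidates, so no records need to change and no cached answers need to expire.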
Should You Adopt This Setup?
Maybe. Here is a list of pros and cons that we assembled:
| Pros | Cons |
|---|---|
| Single hostname for students to remember | One single point of failure if our HAProxy server goes down (but students can still ssh directly to any lab computer) |
| Hosts quickly and automatically taken out of the rotation if offline, added back when online | All computers need to have the same ssh host keys (not too hard if you already manage them with ansible) |
| ssh load-sharing across 40+ computers | |
| Easy to monitor (see stats example below) | |
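On the monitoring point, HAProxy's stats page can also be exported as CSV (by appending `;csv` to the stats URI), which makes it easy to script against. The Python sketch below parses such an export and lists each server's status and current session count. The sample text is a hand-written, heavily trimmed stand-in (a real export has many more columns), and `server_status` is a hypothetical helper name.

```python
import csv
import io


def server_status(stats_csv, proxy="ssh"):
    """Map each server in the given proxy to (status, current sessions)."""
    # The export's header line starts with "# ", so strip that prefix first.
    reader = csv.DictReader(io.StringIO(stats_csv.lstrip("# ")))
    return {
        row["svname"]: (row["status"], int(row["scur"] or 0))
        for row in reader
        # Skip the aggregate FRONTEND/BACKEND rows; keep the real servers.
        if row["pxname"] == proxy and row["svname"] not in ("FRONTEND", "BACKEND")
    }


# Trimmed, made-up sample using only four of the real column names.
SAMPLE = """# pxname,svname,scur,status
ssh,FRONTEND,5,OPEN
ssh,hostA,3,UP
ssh,hostB,0,DOWN
ssh,hostC,2,UP
ssh,BACKEND,5,UP
"""

for name, (status, sessions) in server_status(SAMPLE).items():
    print(f"{name}: {status}, {sessions} current sessions")
```

Fetching the real export would also need the credentials from the `stats auth` line, which this sketch leaves out.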
Helpful Guides and Documentation
There were several crucial documents that we followed to get this up and running. First and foremost was the HAProxy documentation. We also consulted Eugene Petrenko’s blog post on Load Balancing SSH and Evan Carmi’s blog post on Setup HAProxy stats over HTTPS.