Azure Load Balancer to become more efficient

26th February 2018 Anthony Mashford

Azure recently introduced an advanced, more efficient Load Balancer platform. This platform adds a whole new set of capabilities for customer workloads using the new Standard Load Balancer. One of the key additions the new Load Balancer platform brings is a simpler, more predictable, and more efficient outbound port allocation algorithm.

This algorithm is already integrated with Standard Load Balancer, and we are now bringing this advantage to the rest of Azure.

Load Balancer and Source NAT

Azure deployments use one or more of three scenarios for outbound connectivity, depending on the customer’s deployment model and the resources utilized and configured. Azure uses Source Network Address Translation (SNAT) to enable these scenarios. When multiple private IP addresses or roles share the same public IP (a public IP address assigned to a Load Balancer, or an automatically assigned public IP address for standalone VMs), Azure uses port masquerading SNAT (PAT) to translate private IP addresses to public IP addresses using the ephemeral ports of the public IP address. PAT does not apply when Instance Level Public IP addresses (ILPIP) are assigned.

Where multiple instances share a public IP address, each instance behind an Azure Load Balancer VIP is pre-allocated a fixed number of ephemeral ports for PAT (SNAT ports), which it uses for masquerading outbound flows. The number of pre-allocated ports per instance is determined by the size of the backend pool; see the New SNAT algorithm section for details.

Why are we changing the SNAT algorithm?

Existing deployments in Azure use a version of SNAT port allocation that is designed to be dynamic in nature. This version allocates 160 outbound ports per instance to start with and follows an on-demand model thereafter. The backend instances initiate connections using these ports and, in the default configuration, free them for reuse after 4 minutes of idle time. If multiple simultaneous outbound connections exhaust the allocated SNAT ports, the requesting instances are allocated a small number of additional ports, depending on availability and the rate of requests.

This model works well for services with a distributed model that create uniform outbound flows, or for services that need to establish flows with many different external endpoints. However, for services that need multiple simultaneous flows to a few external destinations, the initial port allocation is exhausted in a short period, and they experience intermittent connection failures. It is very challenging for services to predict exactly how many ports they’ll get and how many connections they’ll be able to initiate. With the on-demand model, the ports are not evenly distributed, which results in a longer pending state for SNAT port allocation for some of the instances in the pool. The newer algorithm addresses these challenges.

New SNAT algorithm

The new Azure Load Balancer platform introduces a more robust, simple, and predictable port allocation algorithm. In this model, all the available ports are pre-allocated and evenly distributed amongst the backend pool of the Load Balancer, depending on the pool size. Each IP configuration gets a pre-determined number of ports. Your services can make decisions on the distribution of connections amongst the backend pool instances and make efficient use of resources. The change will help customers design their services better and with fewer scaling limitations.

The following table shows the number of SNAT ports allocated based on the size of the backend pool:

Pool Size	Pre-Allocated SNAT ports per IP configuration
1 – 50	1024
51 – 100	512
101 – 200	256
201 – 400	128
401 – 800	64
801 – 1000	32

For more details on the allocation, please refer to the Ephemeral port pre-allocation for port masquerading SNAT (PAT) section of the Understanding outbound connections in Azure article.
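The table above can be expressed as a small lookup. The following is a minimal sketch for planning purposes only: the tier values are copied from the table, while the names `SNAT_TIERS` and `snat_ports_per_ip_config` are hypothetical and not part of any Azure SDK.

```python
# Tiers copied from the table above:
# (largest pool size in the tier, pre-allocated SNAT ports per IP configuration).
SNAT_TIERS = [
    (50, 1024),
    (100, 512),
    (200, 256),
    (400, 128),
    (800, 64),
    (1000, 32),
]


def snat_ports_per_ip_config(pool_size: int) -> int:
    """Return the pre-allocated SNAT port count for a backend pool of this size."""
    if pool_size < 1:
        raise ValueError("pool size must be at least 1")
    for upper_bound, ports in SNAT_TIERS:
        if pool_size <= upper_bound:
            return ports
    # The table in this article stops at 1000 instances.
    raise ValueError("pool size exceeds the largest tier in the table (1000)")
```

For example, `snat_ports_per_ip_config(50)` returns 1024, while `snat_ports_per_ip_config(51)` returns 512.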

Migration

We plan to adopt this new allocation algorithm across Azure, making SNAT allocation easier to manage for the platform as well as for customers. Migration of existing deployments to the new SNAT port allocation algorithm is targeted for Summer 2018.

This will completely replace the older algorithm. A more detailed schedule will be announced in the future.

What type of SNAT allocation do I get?

For our customers deploying in Azure, this might raise the question of what type of port allocation they will see with their services. Let’s categorize these into different scenarios:

New deployments

All new deployments in Azure will use the newer port allocation model described above. This applies to deployments based on Standard SKU Load Balancer, deployments based on Basic SKU Load Balancer, and any Classic cloud service deployments.

Furthermore, if you have existing deployments, but are migrating or redeploying those services, the newer instances will all provision with the new port allocation algorithm.

Existing deployments

All the existing deployments will keep using the older SNAT port allocation scheme for now. However, existing deployments will be migrated to the new algorithm as well, which will change the experience for the existing deployments and make it consistent everywhere.

How does it affect my services?

SNAT port allocation plays an important role in outbound connectivity for Azure instances, as discussed above. So far, services have had on-demand allocation of ports: starting from a small allocation of 160 ports, some instances could grow to a very large allocation in the tens of thousands of ports, while others in the pool or availability set consumed only a small number. Large numbers or high rates of port allocation usually also cause intermittent failures.

However, with the new allocation model, each IP configuration will get a fixed number of SNAT ports, which will be selected for outbound flows. Once the available ports are exhausted, no more allocation will be possible. This might impact the services that initiate a very high number of simultaneous SNAT connections from individual instances. If your services fall under this category, you might want to rethink the service design and look for the possible mitigations. The Managing SNAT Port Exhaustion section in the Understanding Outbound Connections article expands on better SNAT port management techniques.
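To judge whether a service falls into this category, the fixed allocation can be compared with the peak number of simultaneous outbound flows from a single instance. The sketch below assumes the worst case, in which every simultaneous flow consumes its own SNAT port (for example, all flows going to the same destination). The function name `snat_headroom` is hypothetical, and the tier values repeat the allocation table earlier in this article.

```python
from bisect import bisect_left

# Upper bound of each pool-size tier and the matching per-IP-configuration
# SNAT port allocation, as listed in the allocation table in this article.
_TIER_UPPER_BOUNDS = [50, 100, 200, 400, 800, 1000]
_TIER_PORTS = [1024, 512, 256, 128, 64, 32]


def snat_headroom(pool_size: int, peak_flows_per_instance: int) -> int:
    """Return spare SNAT ports per instance; a negative value signals exhaustion.

    Assumption: each simultaneous outbound flow uses one SNAT port.
    """
    if not 1 <= pool_size <= 1000:
        raise ValueError("pool size must be between 1 and 1000")
    # bisect_left finds the first tier whose upper bound is >= pool_size.
    allocated = _TIER_PORTS[bisect_left(_TIER_UPPER_BOUNDS, pool_size)]
    return allocated - peak_flows_per_instance
```

With a pool of 40 instances and a peak of 900 flows per instance, `snat_headroom(40, 900)` returns 124. With a pool of 120 instances and a peak of 300 flows, `snat_headroom(120, 300)` returns -44, meaning the instance would run out of ports under this assumption.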

With the new model, if you are scaling up, the number of SNAT ports allocated per instance drops to half once the instance count grows beyond the upper bound of the current pool size range. This could also affect services by resetting existing connections in order to free up some ports for redistribution.
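The effect of a scale-up can be checked ahead of time by comparing the allocation before and after the change. This is a minimal sketch using the tier values from the allocation table in this article; `ports_for` and `scale_up_effect` are hypothetical helper names, not Azure APIs.

```python
# (largest pool size in the tier, SNAT ports per IP configuration),
# taken from the allocation table in this article.
TIERS = ((50, 1024), (100, 512), (200, 256), (400, 128), (800, 64), (1000, 32))


def ports_for(pool_size: int) -> int:
    """Look up the per-IP-configuration SNAT port allocation for a pool size."""
    if pool_size < 1:
        raise ValueError("pool size must be at least 1")
    for upper_bound, ports in TIERS:
        if pool_size <= upper_bound:
            return ports
    raise ValueError("pool size exceeds the largest tier in the table (1000)")


def scale_up_effect(current_size: int, new_size: int) -> tuple[int, int, bool]:
    """Return (ports before, ports after, whether the allocation shrinks)."""
    before = ports_for(current_size)
    after = ports_for(new_size)
    return before, after, after < before
```

Growing a pool from 50 to 51 instances crosses a tier boundary, so `scale_up_effect(50, 51)` returns `(1024, 512, True)`. Growing from 10 to 40 instances stays within the first tier, so `scale_up_effect(10, 40)` returns `(1024, 1024, False)`.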

What should I do right now?

Review and familiarize yourself with the scenarios and patterns described in Managing SNAT port exhaustion for guidance on how to design for reliable and scalable scenarios.

How do I get an exception from this migration?

The port allocation algorithm is a platform level change. No exceptions will be granted to the customers. However, we do understand that you are running critical production workloads in Azure and want to ensure a safer implementation of mitigations or wait for a critical period before changing the service logic. Please reach out to Azure Support in such scenarios with your deployment information and we’ll work with you to ensure no disruption to your services.

Source: Azure Blog Feed

Mashford's Musings