Autoscaling, a key feature of Kubernetes, lets you improve the resource utilization of your cluster by automatically adjusting the application’s resources or replicas depending on the load at that time.
This blog talks about Pod Autoscaling in Kubernetes and how to set up and configure autoscalers to optimize the resource utilization of your application.
Horizontal Pod Autoscaling
What is the Horizontal Pod Autoscaler?
The Horizontal Pod Autoscaler (HPA) scales the number of pods of a ReplicaSet, Deployment, or StatefulSet based on per-pod metrics received from the resource metrics API (metrics.k8s.io) provided by metrics-server, the custom metrics API (custom.metrics.k8s.io), or the external metrics API (external.metrics.k8s.io).
Prerequisite
Verify that the metrics-server is already deployed and running using the command below, or deploy it using instructions here.
HPA fetches per-pod resource metrics (like CPU, memory) from the resource metrics API and calculates the current metric value based on the mean values of all targeted pods. It compares the current metric value with the target metric value specified in the HPA spec and produces a ratio used to scale the number of desired replicas.
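The ratio described above corresponds to the formula given in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). Here is a minimal Python sketch of it. The function name is illustrative, and the 0.1 tolerance assumes the controller's default (--horizontal-pod-autoscaler-tolerance) has not been changed.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Return the replica count the HPA formula would ask for.

    Illustrative helper, not part of any Kubernetes client library.
    """
    ratio = current_value / target_value
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(3, 200, 100))  # load is double the target
print(desired_replicas(4, 50, 100))   # load is half the target
print(desired_replicas(3, 105, 100))  # within tolerance, no change
```

The sketch ignores the stabilization window, pod readiness, and missing metrics, all of which the real controller also takes into account.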
A. Setup: Create a Deployment and HPA resource
In this blog post, I have used the config below to create a deployment of 3 replicas, with some memory load defined by "--vm-bytes", "850M".
Let's create an HPA resource for this deployment with multiple metric blocks defined. The HPA will consider each metric one by one, calculate the desired replica count for each of them, and then select the highest replica count.
We have defined the minimum number of replicas HPA can scale down to as 1 and the maximum number that it can scale up to as 10.
Target Average Utilization and Target Average Value imply that the HPA should scale the replicas up or down to keep the Current Metric Value equal to, or as close as possible to, the Target Metric Value.
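As a rough sketch, an HPA resource with these settings could be built as follows. It assumes the autoscaling/v2 API, in which the targets are written as averageUtilization and averageValue. The names memory-demo and memory-demo-hpa are placeholders, not the names used in this post. The target values of 50% CPU and 500Mi memory, and the replica bounds of 1 and 10, are the ones this walkthrough uses.

```python
import json

# Placeholder names: "memory-demo-hpa" and "memory-demo" are hypothetical.
hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "memory-demo-hpa"},
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "memory-demo",
        },
        "minReplicas": 1,
        "maxReplicas": 10,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    # Percentage of the CPU request, averaged over pods.
                    "target": {"type": "Utilization", "averageUtilization": 50},
                },
            },
            {
                "type": "Resource",
                "resource": {
                    "name": "memory",
                    # Absolute value, averaged over pods.
                    "target": {"type": "AverageValue", "averageValue": "500Mi"},
                },
            },
        ],
    },
}

# kubectl accepts JSON manifests as well as YAML.
print(json.dumps(hpa, indent=2))
```

Saving the printed output to a file and applying it with kubectl apply -f should be equivalent to writing the same manifest in YAML.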
HPA calculates pod utilization as the total usage of all containers in the pod divided by their total request. It checks each container individually and returns without a result if any container has no request set.
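The calculation as described above can be sketched like this. The function and the dictionary layout are hypothetical, and the real controller's handling of a missing request may differ in detail (this sketch returns None, where the controller may instead report an error for the metric).

```python
def pod_utilization(containers):
    """Return pod utilization in percent, or None if a request is missing.

    Each container is a dict with "usage" and an optional "request",
    both in the same unit (for example millicores).
    """
    total_usage = 0
    total_request = 0
    for container in containers:
        if not container.get("request"):
            # No request set: utilization is undefined for this pod.
            return None
        total_usage += container["usage"]
        total_request += container["request"]
    return 100 * total_usage / total_request


# Hypothetical two-container pods, values in millicores.
with_requests = [{"usage": 120, "request": 250}, {"usage": 60, "request": 250}]
missing_request = [{"usage": 120, "request": 250}, {"usage": 60}]

print(pod_utilization(with_requests))
print(pod_utilization(missing_request))
```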
The calculated Current Metric Value for memory, i.e., 894188202666m, is higher than the Target Average Value of 500Mi, so the replicas need to be scaled up.
The calculated Current Metric Value for CPU, i.e., 36%, is lower than the Target Average Utilization of 50%, so by CPU alone the replicas would not need to be scaled up.
Replicas are calculated for both metrics and the highest replica count is selected. So, the replicas are scaled up to 6 in this case.
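The arithmetic behind this decision can be checked with a short Python sketch. It plugs the values from this walkthrough (3 current replicas, a minimum of 1, a maximum of 10) into the documented formula ceil(currentReplicas * current / target). Variable names are illustrative.

```python
import math

MIB = 1024 * 1024

current_replicas = 3
min_replicas, max_replicas = 1, 10

# (current value, target value) per metric, in matching units.
metrics = {
    # 894188202666m is in milli-bytes, so divide by 1000 to get bytes.
    "memory": (894188202666 / 1000, 500 * MIB),
    # CPU is compared as a percentage of the request.
    "cpu": (36, 50),
}


def proposal(current, target):
    """Replica count proposed by a single metric."""
    return math.ceil(current_replicas * current / target)


proposals = {name: proposal(cur, tgt) for name, (cur, tgt) in metrics.items()}
# The HPA takes the largest proposal, then clamps it to [min, max].
desired = max(min_replicas, min(max_replicas, max(proposals.values())))

print(proposals)
print(desired)
```

Memory proposes 6 replicas and CPU proposes 3, so the larger value, 6, is the one applied.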
HPA using Custom metrics
We will use the prometheus-adapter resource to expose custom application metrics to custom.metrics.k8s.io/v1beta1, which are retrieved by HPA. By defining our own metrics through the adapter’s configuration, we can let HPA perform scaling based on our custom metrics.
A. Setup: Install Prometheus Adapter
Create prometheus-adapter.yaml with the content below:
Here, the current calculated metric value is 18666m. The m suffix denotes milli-units, so 18666m means 18.666, which is close to what we expect from the per-pod values: (33 + 11 + 10) / 3 = 18. Since this is less than the target average value (i.e., 50), the HPA scales down the replicas to bring the Current Metric Value : Target Metric Value ratio as close to 1 as possible. Hence, the replicas are scaled down to 2 and later to 1.
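To make the milli-unit conversion and the first scale-down step concrete, here is a small Python sketch. The parse_milli helper is hypothetical and handles only the m suffix, not the full Kubernetes quantity grammar. The starting count of 3 replicas is assumed from the deployment used earlier in this post.

```python
import math


def parse_milli(quantity):
    """Convert a quantity such as '18666m' or '50' to a float.

    Simplified, hypothetical helper: only the 'm' (milli) suffix is handled.
    """
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


current = parse_milli("18666m")
target = parse_milli("50")

# First scaling step, starting from 3 replicas.
desired = math.ceil(3 * current / target)

print(current)
print(desired)
```

The first step yields 2 replicas. The later drop to 1 depends on the metric values observed after that step, which this sketch does not model.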
Vertical Pod Autoscaling
What is the Vertical Pod Autoscaler?
Vertical Pod Autoscaling (VPA) ensures that a container’s resources are not under- or over-utilized. It recommends optimized CPU and memory request/limit values, and can also automatically update them for you so that the cluster resources are used efficiently.
Architecture
VPA consists of 3 components:
VPA admission controller: Once you deploy and enable the Vertical Pod Autoscaler in your cluster, every pod submitted to the cluster goes through this webhook, which checks whether a VPA object references it.
VPA recommender: The recommender pulls the current and past resource consumption (CPU and memory) data for each container from the metrics-server running in the cluster and provides optimal resource recommendations based on it, so that a container uses only what it needs.
VPA updater: The updater checks at regular intervals whether each pod is running within the recommended range. If it is not, the updater accepts the pod for update and evicts it so that the resource recommendation can be applied.
Installation
If you are on Google Cloud Platform, you can simply enable vertical-pod-autoscaling:
Use the same deployment config to create a new deployment with "--vm-bytes", "850M". Then create a VPA resource in Recommendation Mode with updateMode: "Off".
minAllowed is an optional parameter that specifies the minimum CPU request and memory request allowed for the container.
maxAllowed is an optional parameter that specifies the maximum CPU request and memory request allowed for the container.
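A VPA resource in recommendation-only mode might look roughly like the sketch below, built as a Python dictionary and printed as JSON. It assumes the autoscaling.k8s.io/v1 API. The object names and the minAllowed/maxAllowed values are hypothetical placeholders, not the ones used in this post.

```python
import json

# Placeholder names and bounds, chosen only for illustration.
vpa = {
    "apiVersion": "autoscaling.k8s.io/v1",
    "kind": "VerticalPodAutoscaler",
    "metadata": {"name": "memory-demo-vpa"},
    "spec": {
        "targetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "memory-demo",
        },
        # "Off" means recommendations are computed but never applied.
        "updatePolicy": {"updateMode": "Off"},
        "resourcePolicy": {
            "containerPolicies": [
                {
                    "containerName": "*",
                    "minAllowed": {"cpu": "100m", "memory": "50Mi"},
                    "maxAllowed": {"cpu": "1", "memory": "2000Mi"},
                }
            ]
        },
    },
}

# kubectl accepts JSON manifests as well as YAML.
print(json.dumps(vpa, indent=2))
```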
B. Check the Pod’s Resource Utilization
Check the resource utilization of the pods. Below, you can see that only ~50Mi of memory is being used out of 1000Mi, and only ~30m of CPU out of 1000m. This clearly indicates that the pod resources are underutilized.
Target: The recommended CPU request and memory request for the container that will be applied to the pod by VPA.
Uncapped Target: The recommended CPU request and memory request for the container if you had not configured minAllowed/maxAllowed bounds in the VPA definition. These values will not be applied to the pod. They’re used only as a status indication.
Lower Bound: The minimum recommended CPU request and memory request for the container. There is a --pod-recommendation-min-memory-mb flag that determines the minimum amount of memory the recommender will set—it defaults to 250MiB.
Upper Bound: The maximum recommended CPU request and memory request for the container. It helps the VPA updater avoid evicting pods that are close to the recommended target values. Over time, the Upper Bound is expected to converge toward the Target recommendation.
The Target recommendation cannot go below the minAllowed defined in the VPA spec.
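The relationship between the Uncapped Target and the Target can be pictured as a simple clamp to the optional minAllowed/maxAllowed bounds. This is a minimal sketch under that assumption. The helper name and the memory figures are hypothetical.

```python
def cap(value, min_allowed=None, max_allowed=None):
    """Clamp a recommendation to the optional minAllowed/maxAllowed bounds."""
    if min_allowed is not None:
        value = max(value, min_allowed)
    if max_allowed is not None:
        value = min(value, max_allowed)
    return value


# Hypothetical memory figures in Mi, chosen only for illustration.
uncapped_target = 35
target = cap(uncapped_target, min_allowed=50, max_allowed=2000)

print(target)
```

With a minAllowed of 50, an uncapped recommendation of 35 is raised to 50. Without bounds, the value passes through unchanged.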
E. Stress Loading Pods
Let’s recreate the deployment with memory request and limit set to 2000Mi and "--vm-bytes", "500M".
Gradually stress load one of these pods to increase its memory utilization. You can log in to the pod and run stress --vm 1 --vm-bytes 1400M --timeout 120000s.
Limits vs. Requests: VPA always works with the requests defined for a container, not the limits. The VPA recommendations are therefore applied to the container requests, and VPA preserves the limit-to-request ratio originally specified for each container.
For example, if the initial container configuration defines a 100Mi memory request and a 300Mi memory limit, then when the VPA target recommendation is 150Mi of memory, the container's memory request will be updated to 150Mi and its memory limit to 450Mi.
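The proportional limit update can be reproduced with a few lines of Python, using the numbers from the example above. The function name is illustrative, and rounding down with integer division is an assumption made for this sketch.

```python
def scaled_limit(old_request, old_limit, new_request):
    """Keep the limit-to-request ratio constant when the request changes.

    Uses integer arithmetic; a non-whole result is rounded down.
    """
    return new_request * old_limit // old_request


# Request 100 with limit 300 is a 1:3 ratio, so a request of 150 gives 450.
print(scaled_limit(100, 300, 150))
```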
Selective Container Scaling
If you have a pod with multiple containers and you want to opt some of them out, you can use the "Off" mode to turn off recommendations for a container.
You can also set containerName: "*" to include all containers.
The Horizontal Pod Autoscaler and the Vertical Pod Autoscaler serve different purposes, and one can be more useful than the other depending on your application’s requirements.
The HPA can be useful when, for example, your application is serving a large number of lightweight (low resource-consuming) requests. In that case, scaling the number of replicas distributes the workload across the pods. The VPA, on the other hand, can be useful when your application serves heavyweight requests, which require more resources.