Table of contents
- The Beginning
- k8s API
- The kubeconfig file
- The kubectl
- The API-Server
- Control loops
tl;dr If all went well, you should see three nginx pods running on your cluster
But what is going on under the hood?
A really good thing about K8s is that it hides its complexity behind simple abstractions and user-friendly APIs. Still, I think that to fully understand the value it offers, it's good to know a bit about its internals. In this post, I'll trace the path a K8s Deployment object takes from the time we enter
kubectl create deploy nginx --image=nginx --replicas=3 to when we finally see the three running nginx pods.
The Kubernetes API is an HTTP REST API. This API is the real Kubernetes user interface. Kubernetes is fully controlled through this API. This means that every Kubernetes operation is exposed as an API endpoint and can be executed by an HTTP request to this endpoint.
Kubernetes is a fully resource-centered system. That means, Kubernetes maintains an internal state of resources, and all Kubernetes operations are CRUD operations on these resources. Kubernetes is controlled by manipulating these resources (and Kubernetes figures out what to do based on the current state of resources).
api-server talks HTTP(S) and exposes some REST endpoints. The
kubeconfig file contains information about your clusters' endpoints, your users' auth info, and a mapping between clusters and users. Almost all K8s-related components and clients use this file to connect to the cluster. The default
kubeconfig file resides in the user's
home dir at
kubectl create deploy nginx --image=nginx --replicas=3
While server-side validation is favored and is the de facto standard, the first thing that
kubectl will do is perform some trivial client-side validations. This ensures that the requests that should always fail (like invalid spec key, wrong indentation, etc) will fail fast without unnecessary server load.
At this point
kubectl will serialize the request and start forming the HTTP request it will send to
User creds are almost always stored in the
kubeconfig file. To locate the config file, kubectl does the following:
- If a --kubeconfig flag is provided, use that
- If the KUBECONFIG env var is set, use that
- otherwise, look at the default (
After parsing the file, it then determines the current context to use, the current cluster to point to, and any auth information associated with the current user.
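The lookup order above can be sketched in a few lines of Python. This is only an illustration, not kubectl's actual code: the function name is made up, and the fallback path `~/.kube/config` is the conventional default that I'm assuming here.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def kubeconfig_candidates(flag: Optional[str], env: Mapping[str, str], home: Path) -> List[Path]:
    """Return the kubeconfig file(s) to load, following the precedence described above."""
    if flag:
        # 1. An explicit --kubeconfig flag wins over everything else.
        return [Path(flag)]
    from_env = env.get("KUBECONFIG", "")
    if from_env:
        # 2. KUBECONFIG may hold several paths joined by the OS path separator.
        return [Path(p) for p in from_env.split(os.pathsep) if p]
    # 3. Assumption: the conventional default location under the home directory.
    return [home / ".kube" / "config"]
```

Calling `kubeconfig_candidates(None, os.environ, Path.home())` would mimic what happens when no flag is passed.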
API groups and version negotiation
K8s uses a versioned API that is categorized into "API groups". An API group is meant to categorize similar resources so that they're easier to reason about. It also provides a better alternative to a singular monolithic API. The API group of a Deployment is named
apps, and its most recent version is
After kubectl generates the object, it starts to find the appropriate API group and version for it. This discovery stage is called version negotiation and involves kubectl scanning the
/apis path on the remote API to retrieve all possible API groups. Since
kube-apiserver exposes its schema document (in OpenAPI format) at this path, it's easy for clients to perform their own discovery. kubectl caches the discovery result at
After the API Group and version discovery, kubectl will validate if the resource we are going to create is actually supported by the server. And then it'll
POST to the appropriate API. Generally the API path will be
So, in our case request will be sent to
/apis/apps/v1/namespaces/default/deployments with POST data being the object data. You can check this with kubectl's
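To make the path layout concrete, here is a small sketch of how a client could assemble the REST path for a resource. The helper name is hypothetical; the only facts it relies on are that named groups live under `/apis/<group>/<version>` and that the core group (empty name) lives under `/api/<version>`.

```python
def resource_path(group: str, version: str, resource: str, namespace: str = "") -> str:
    """Build the collection path for a resource type (simplified sketch)."""
    # The core group has an empty name and is served under /api instead of /apis.
    prefix = f"/apis/{group}/{version}" if group else f"/api/{version}"
    if namespace:
        return f"{prefix}/namespaces/{namespace}/{resource}"
    # Cluster-scoped resources have no namespace segment.
    return f"{prefix}/{resource}"
```

For our Deployment, `resource_path("apps", "v1", "deployments", "default")` yields the path shown above.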
So our request has been sent, woaah! What now? We have already talked about the
kube-apiserver briefly, but this is where it takes center stage.
All K8s clusters have two categories of users: service accounts (managed by K8s) and normal users. For normal users, when the apiserver starts up, it looks at all the CLI flags the user provided and assembles a list of suitable authenticators. For example, if a
--client-ca-file has been passed in, it appends the x509 authenticator; if it sees
--token-auth-file provided, it appends the token authenticator to the list. Additionally, webhook token authentication and an authenticating proxy can be used to plug in custom authenticators.
As HTTP requests are made to the API server, plugins attempt to associate the following attributes with the request:
- Username: a string that identifies the end-user. Common values might be kube-admin or firstname.lastname@example.org.
- UID: a string that identifies the end-user and attempts to be more consistent and unique than username.
- Groups: a set of strings, each of which indicates the user's membership in a named logical collection of users. Common values might be
All values are opaque to the authentication system and only hold significance when interpreted by an authorizer.
You can enable multiple authentication methods at once. You should usually use at least two methods:
- service account tokens for service accounts
- at least one other method for normal user authentication.
When multiple authenticator modules are enabled, the first module to successfully authenticate the request short-circuits evaluation. The API server does not guarantee the order authenticators run in.
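A minimal sketch of that short-circuiting chain, assuming a toy request shape (a plain dict with hypothetical `client_cert_cn` and `bearer_token` keys) rather than the API server's real types:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class UserInfo:
    username: str
    uid: str = ""
    groups: List[str] = field(default_factory=list)


Request = Dict[str, str]
Authenticator = Callable[[Request], Optional[UserInfo]]


def authenticate(request: Request, chain: List[Authenticator]) -> Optional[UserInfo]:
    """Try each authenticator; the first one that succeeds ends the evaluation."""
    for authenticator in chain:
        user = authenticator(request)
        if user is not None:
            return user
    return None  # every authenticator failed, so the request is rejected


def cert_authenticator(request: Request) -> Optional[UserInfo]:
    # Hypothetical: pretend the TLS layer already verified the cert and exposed its CN.
    cn = request.get("client_cert_cn")
    return UserInfo(username=cn) if cn else None


def token_authenticator(request: Request) -> Optional[UserInfo]:
    # Hypothetical in-memory token table standing in for a real token source.
    known = {"s3cr3t": UserInfo("nakam", "1001", ["devs"])}
    return known.get(request.get("bearer_token", ""))
```

With `chain = [cert_authenticator, token_authenticator]`, a request carrying only a valid token falls through the first authenticator and is accepted by the second.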
Let's talk about a couple of authenticators next.
X509 Client Certs
Client certificate authentication is enabled by passing the
--client-ca-file=SOMEFILE option to the API server. If a client certificate is presented and verified against the above CA, the common name (
CN) of the subject is used as the user name for the request. As of Kubernetes 1.4, client certificates can also indicate a user's group memberships using the certificate's organization(
For example, the following command would create a CSR for the username "nakam", belonging to two groups, "devops" and "devs".
openssl req -new -key key.pem -out csr.pem -subj "/CN=nakam/O=devops/O=devs"
Static Token File
The API server reads bearer tokens from a file when given the
--token-auth-file=SOMEFILE option on the command line. Currently, tokens last indefinitely, and the token list cannot be changed without restarting the API server.
The token file is a csv file with a minimum of 3 columns: token, user name, user uid, followed by optional group names.
When authenticating from an HTTP client, the API server expects an
Authorization header with a value of
Authorization: Bearer 31ada4fd-adec-460c-809a-9e56ceb75269
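Here is a rough sketch of how such a token file could be parsed and matched against an incoming header. It is not the API server's implementation; the user name, uid and groups in the usage example are made up, and the groups column is assumed to be a single quoted, comma-separated field.

```python
import csv
import io
from typing import Dict, List, NamedTuple, Optional


class TokenUser(NamedTuple):
    username: str
    uid: str
    groups: List[str]


def load_token_file(text: str) -> Dict[str, TokenUser]:
    """Parse CSV rows of the form: token,user,uid[,"group1,group2"]."""
    users: Dict[str, TokenUser] = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) < 3:
            raise ValueError(f"expected at least 3 columns, got {len(row)}")
        token, username, uid = (col.strip() for col in row[:3])
        groups = [g.strip() for g in row[3].split(",") if g.strip()] if len(row) > 3 else []
        users[token] = TokenUser(username, uid, groups)
    return users


def user_for(authorization: str, users: Dict[str, TokenUser]) -> Optional[TokenUser]:
    """Resolve an Authorization header value to a user, or None if it does not match."""
    scheme, _, token = authorization.partition(" ")
    if scheme != "Bearer":
        return None
    return users.get(token.strip())
```

For example, loading the single line `31ada4fd-adec-460c-809a-9e56ceb75269,nakam,1001,"devops,devs"` and calling `user_for("Bearer 31ada4fd-adec-460c-809a-9e56ceb75269", users)` returns the `nakam` user with both groups.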
Service Account Tokens
A service account is an automatically enabled authenticator that uses signed bearer tokens to verify requests. The plugin takes two optional flags:
- --service-account-key-file: A file containing a PEM encoded key for signing bearer tokens. If unspecified, the API server's TLS private key will be used.
- --service-account-lookup: If enabled, tokens which are deleted from the API will be revoked.
Service accounts are usually created automatically by the API server and associated with pods running in the cluster through the ServiceAccount Admission Controller. Bearer tokens are mounted into pods at well-known locations but are perfectly valid to use outside the cluster and can be used to create identities for long standing jobs that wish to talk to the Kubernetes API.
Service accounts authenticate with the username
system:serviceaccount:(NAMESPACE):(SERVICEACCOUNT), and are assigned to the groups
Custom authenticator plugins can be specified as Webhooks. This is how
AWS IAM users are authenticated to EKS clusters using aws-iam-authenticator. The authenticator gets its configuration information from the
aws-auth ConfigMap in
Once the authentication is done, the authenticator can send extra information to the api-server to set the client identity. The API server can be configured to identify users from request header values, such as
X-Remote-User. An authenticating proxy can set the request header value.
- --requestheader-username-headers: Required, case-insensitive. Header names to check, in order, for the user identity. The first header containing a value is used as the username.
- --requestheader-group-headers: 1.6+. Optional, case-insensitive. "X-Remote-Group" is suggested. Header names to check, in order, for the user's groups. All values in all specified headers are used as group names.
For example, with this configuration:
GET / HTTP/1.1
X-Remote-User: fido
X-Remote-Group: dogs
X-Remote-Group: dachshunds
would result in this user info:
name: fido
groups:
- dogs
- dachshunds
In order to prevent header spoofing, the authenticating proxy is required to present a valid client certificate to the API server for validation against the specified CA before the request headers are checked.
- --requestheader-client-ca-file: Required. PEM-encoded certificate bundle. A valid client certificate must be presented and validated against the certificate authorities in the specified file before the request headers are checked for user names.
- --requestheader-allowed-names: Optional. List of Common Name values (CNs). If set, a valid client certificate with a CN in the specified list must be presented before the request headers are checked for user names. If empty, any CN is allowed.
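The header handling described above can be approximated like this. It is a sketch only: the function name is hypothetical, headers are modelled as a list of (name, value) pairs so that repeated headers survive, and the client certificate check that must happen first is deliberately left out.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def user_from_headers(
    headers: Sequence[Tuple[str, str]],
    username_headers: Sequence[str],
    group_headers: Sequence[str],
) -> Optional[Dict[str, object]]:
    """Derive user info from proxy-set headers (assumes the proxy's cert was already verified)."""

    def values(name: str) -> List[str]:
        # Header names are matched case-insensitively.
        return [v for k, v in headers if k.lower() == name.lower()]

    username = None
    for name in username_headers:
        non_empty = [v for v in values(name) if v]
        if non_empty:
            username = non_empty[0]  # first header containing a value wins
            break
    if username is None:
        return None
    # All values of all configured group headers become group names.
    groups = [v for name in group_headers for v in values(name)]
    return {"name": username, "groups": groups}
```

Feeding it the `fido` request shown earlier with `X-Remote-User` and `X-Remote-Group` as the configured header names gives the same user info.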
If every authenticator fails, the request fails and an error is returned. If authentication succeeds, the Authorization header is removed from the request, and user information is added to its context. This gives future steps (such as authorization and admission controllers) the ability to access the previously established identity of the user.
Okay, so. The request has been sent and we have successfully proved to kube-apiserver that we are who we say we are. What a relief 😌 !!
But we might not have permissions to do what we might want to do (after all, identity and permissions are not the same thing).
The kube-apiserver handles authorization the same way it did authentication: based on flag inputs, it will assemble a chain of authorizers that will be run against every incoming request. When multiple authorization modules are configured, each is checked in sequence. If any authorizer approves or denies a request, that decision is immediately returned and no other authorizer is consulted. If all modules have no opinion on the request, then the request is denied. A deny returns an HTTP status code 403.
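As a sketch of that decision logic (not the real implementation; the attribute dict and the two example authorizers are invented for illustration):

```python
from enum import Enum
from typing import Callable, Dict, List


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NO_OPINION = "no opinion"


Attributes = Dict[str, str]
Authorizer = Callable[[Attributes], Decision]


def authorize(attrs: Attributes, chain: List[Authorizer]) -> bool:
    """Return True if the request may proceed; False corresponds to an HTTP 403."""
    for authorizer in chain:
        decision = authorizer(attrs)
        if decision is Decision.ALLOW:
            return True   # first explicit approval ends the evaluation
        if decision is Decision.DENY:
            return False  # first explicit denial ends the evaluation
    return False  # nobody had an opinion, so the request is denied


def allow_admin(attrs: Attributes) -> Decision:
    # Hypothetical rule: a user literally named "admin" may do anything.
    return Decision.ALLOW if attrs.get("user") == "admin" else Decision.NO_OPINION


def deny_delete(attrs: Attributes) -> Decision:
    # Hypothetical rule: nobody may delete.
    return Decision.DENY if attrs.get("verb") == "delete" else Decision.NO_OPINION
```

Note that the order of the chain matters: `[allow_admin, deny_delete]` lets the admin delete, while `[deny_delete, allow_admin]` does not.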
Kubernetes reviews only the following API request attributes:
- user - The user string provided during authentication.
- group - The list of group names to which the authenticated user belongs.
- extra - A map of arbitrary string keys to string values, provided by the authentication layer.
- API - Indicates whether the request is for an API resource.
- Request path - Path to miscellaneous non-resource endpoints like
- API request verb - API verbs like
deletecollection are used for resource requests.
- HTTP request verb - Lowercased HTTP methods like
delete are used for non-resource requests (requests to endpoints other than
/apis/<group>/<version>/... are considered "non-resource requests").
- Resource - The ID or name of the resource that is being accessed (for resource requests only)
- Subresource - The subresource that is being accessed (for resource requests only).
- Namespace - The namespace of the object that is being accessed (for namespaced resource requests only).
- API group - The API Group being accessed (for resource requests only). An empty string designates the core API group.
The Kubernetes API server may authorize a request using one of several authorization modes:
Node authorization is a special-purpose authorization mode that grants permissions to kubelets based on the pods they are scheduled to run.
Role-based access control (RBAC) is a method of regulating access to computer or network resources based on the roles of individual users within an enterprise. In this context, access is the ability of an individual user to perform a specific task, such as view, create, or modify a file.
When specified, RBAC (Role-Based Access Control) uses the
rbac.authorization.k8s.io API group to drive authorization decisions, allowing admins to dynamically configure permission policies through the Kubernetes API.
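To give a feel for how RBAC rules are evaluated, here is a simplified sketch. It assumes rules made of API groups, resources and verbs with `*` as a wildcard, and that permissions are purely additive; bindings, namespaces, resource names and non-resource URLs are left out, and the class and function names are my own.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class PolicyRule:
    api_groups: Tuple[str, ...]
    resources: Tuple[str, ...]
    verbs: Tuple[str, ...]


def _covers(allowed: Tuple[str, ...], value: str) -> bool:
    # "*" matches anything; otherwise the value must be listed explicitly.
    return "*" in allowed or value in allowed


def rbac_allows(rules: Iterable[PolicyRule], api_group: str, resource: str, verb: str) -> bool:
    """A request is allowed if at least one rule covers its group, resource and verb."""
    return any(
        _covers(rule.api_groups, api_group)
        and _covers(rule.resources, resource)
        and _covers(rule.verbs, verb)
        for rule in rules
    )
```

A rule like `PolicyRule(("apps",), ("deployments",), ("get", "list", "create"))` would let us create our nginx Deployment but not delete it.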
To enable RBAC, start the apiserver with
Webhook mode relies on a WebHook, which is an HTTP callback: an HTTP POST that occurs when something happens.
Check API Access
You can check whether you are authorized to perform an action using the
kubectl auth can-i subcommand. The command uses the
SelfSubjectAccessReview API in
authorization.k8s.io group to determine if the current user can perform a given action, and works regardless of the authorization mode used.
kubectl auth can-i create deployments --namespace dev
Administrators can combine this with user impersonation to determine which actions other users can perform.
kubectl auth can-i list secrets --namespace dev --as dave
kubectl auth can-i list pods --namespace target --as system:serviceaccount:dev:default
At this point, from kube-apiserver's perspective, it believes we are who we say we are and that we are permitted to continue. But with Kubernetes, other parts of the system have strong opinions about what should and should not be permitted to happen. This is where admission controllers enter the picture.
Admission controllers intercept the request to ensure that it matches the wider expectations and rules of the cluster. They are the last control before an object is persisted to etcd. Admission controllers may be "validating", "mutating", or both. Mutating controllers may modify objects related to the requests they admit; validating controllers may not. They limit requests to create, delete, or modify objects. They do not limit requests to read objects.
The way admission controllers work is similar to the way authenticators and authorizers work, but with one difference: unlike authenticator and authorizer chains, if a single admission controller fails, the entire request is rejected immediately and an error is returned to the end user.
Dynamic Admission Controls
In addition to compiled-in admission plugins, admission plugins can be developed as extensions and run as webhooks configured at runtime. These webhooks are HTTP callbacks that receive admission requests and do something with them. You can define two types of admission webhooks: validating admission webhooks and mutating admission webhooks. Mutating admission webhooks are invoked first, and can modify objects sent to the API server to enforce custom defaults. After all object modifications are complete, and after the incoming object is validated by the API server, validating admission webhooks are invoked and can reject requests to enforce custom policies.
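The two phases can be sketched as plain functions instead of real webhooks. Everything here is illustrative: the object is a simple dict, and the default label and the replica limit are made-up policies.

```python
import copy
from typing import Callable, Dict, List

Obj = Dict[str, object]


class AdmissionRejected(Exception):
    """Raised by a validator to reject the whole request."""


def admit(obj: Obj, mutators: List[Callable[[Obj], Obj]], validators: List[Callable[[Obj], None]]) -> Obj:
    """Run the mutating phase first, then the validating phase; any rejection aborts."""
    admitted = copy.deepcopy(obj)  # never modify the caller's object
    for mutate in mutators:
        admitted = mutate(admitted)
    for validate in validators:
        validate(admitted)  # raises AdmissionRejected on failure
    return admitted


def default_team_label(obj: Obj) -> Obj:
    # Hypothetical mutating policy: every object gets a "team" label.
    obj.setdefault("labels", {}).setdefault("team", "unassigned")
    return obj


def max_ten_replicas(obj: Obj) -> None:
    # Hypothetical validating policy: cap the replica count.
    if obj.get("replicas", 1) > 10:
        raise AdmissionRejected("replicas must not exceed 10")
```

Our three-replica nginx Deployment would pass and come out with the extra label, while a fifty-replica one would be rejected before anything is stored.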
By this point K8s has fully vetted our request and permitted it to go forth by storing it in etcd. K8s uses etcd as its storage backend for all cluster data.
In our case (deploying 3 nginx replicas, in case you forgot about that 😉):
- A create handler was created
- Resource was decoded/deserialized
- Auditing was done, admission controllers were run
- If everything went well, api-server saves the resource to etcd by delegating to the storage provider.
- Any create errors are caught and, finally, the storage provider performs a get call to ensure the object was actually created.
- The HTTP response is constructed and sent back.
At this point, the end user will get a
deployment.apps/nginx created message.
kubectl create deploy nginx --image=nginx --replicas=3
deployment.apps/nginx created
But, it just means that a deployment object is persisted to the storage. There are no nginx pods running yet.
K8s uses control loops, also called controllers, to bring the current state closer to the desired state. A controller tracks at least one K8s resource type; these objects have a
spec field that represents the desired state. Controllers are run in parallel by the
In our case,
replicaset controllers will be used.
After a Deployment record is stored in etcd and initialized, it is made visible via kube-apiserver. When this new resource is available, it is detected by the Deployment controller, whose job it is to listen for changes to Deployment records. In our case, the controller registers a specific callback for create events via an informer (an informer is a pattern that allows controllers to subscribe to storage events and easily list resources they're interested in).
This handler will be executed when our Deployment first becomes available and will start by adding the object to an internal work queue. By the time it gets around to processing our object, the controller will inspect our Deployment and realise that there are no
ReplicaSet records associated with it. It does this by querying kube-apiserver with label selectors.
After realising none exist, it will begin a scaling process to start resolving state. It does this by rolling out (i.e. creating) a ReplicaSet resource (note: all the steps from authentication and authorization through persisting the object in etcd happen for the ReplicaSet as well), assigning it a label selector, and giving it the revision number of 1. The ReplicaSet's
PodSpec is copied from the Deployment's manifest, as well as other relevant metadata.
The status is then updated and it then re-enters the same reconciliation loop waiting for the deployment to match a desired, completed state. Since the Deployment controller is only concerned about creating ReplicaSets, this reconciliation stage needs to be continued by the next controller, the ReplicaSet controller.
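One pass of that reconciliation could look roughly like the sketch below. It works on an in-memory list instead of the API server, uses flat dicts rather than real manifests, and the ReplicaSet naming scheme is hypothetical.

```python
import copy
from typing import Dict, List, Optional


def reconcile_deployment(deployment: Dict, replicasets: List[Dict]) -> Optional[Dict]:
    """Create a ReplicaSet for the Deployment if no matching one exists yet."""
    selector = deployment["selector"]
    # Find existing ReplicaSets whose labels satisfy the Deployment's label selector.
    matching = [
        rs for rs in replicasets
        if all(rs["labels"].get(key) == value for key, value in selector.items())
    ]
    if matching:
        return None  # nothing to create on this pass
    replicaset = {
        "name": deployment["name"] + "-rs1",  # hypothetical naming scheme
        "labels": dict(selector),
        "revision": 1,
        "replicas": deployment["replicas"],
        # The pod template is copied from the Deployment.
        "template": copy.deepcopy(deployment["template"]),
        "owner": deployment["name"],
    }
    replicasets.append(replicaset)
    return replicaset
```

Running it twice against the same store shows the idempotent nature of a control loop: the first pass creates the ReplicaSet, the second finds it and does nothing.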
Now we have a Deployment object and a ReplicaSet object, but still no pods. This is where the ReplicaSet controller comes in. It monitors the lifecycle of ReplicaSets and their dependent resources (Pods). Like most other controllers, it does this by triggering handlers on certain events (the event we're interested in is creation).
When a ReplicaSet is created (courtesy of the Deployment controller), the RS controller inspects the state of the new ReplicaSet and realizes there is a skew between what exists and what is required. It then seeks to reconcile this state by bumping the number of pods that belong to the ReplicaSet.
Kubernetes enforces object hierarchies through Owner References. Not only does this ensure that child resources are garbage-collected once a resource managed by the controller is deleted (cascading deletion), it also provides an effective way for parent resources to not fight over their children. So the pods will have an owner reference to the RS, and the RS will have one to the Deployment.
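The ReplicaSet side of the story, including the owner reference, can be sketched the same way. Again this is a toy model: pods are dicts in a list, the owner reference is a single `owner` field, pod names are hypothetical, and scaling down is ignored.

```python
from typing import Dict, List


def reconcile_replicaset(replicaset: Dict, pods: List[Dict]) -> List[Dict]:
    """Create the pods that are missing for a ReplicaSet (scale-up only)."""
    # Only pods that point back at this ReplicaSet count towards its replicas.
    owned = [pod for pod in pods if pod.get("owner") == replicaset["name"]]
    missing = replicaset["replicas"] - len(owned)
    created = []
    for index in range(missing):  # an empty range when nothing is missing
        pod = {
            "name": f"{replicaset['name']}-{len(owned) + index}",  # hypothetical naming
            "owner": replicaset["name"],  # stands in for the owner reference
            "nodeName": "",               # not scheduled yet
            "phase": "Pending",
        }
        pods.append(pod)
        created.append(pod)
    return created
```

For a ReplicaSet with `replicas: 3` and an empty pod list, the first pass creates three Pending pods and a second pass creates none.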
Now, we have a Deployment, an RS and three Pod objects in etcd. Our pods, however, are stuck in
Pending state because they haven't been scheduled to a node.
The scheduler to the rescue!
The scheduler runs as a standalone component and operates in the same way as other controllers. It filters pods that have an empty NodeName field in their PodSpec and attempts to find a suitable Node that the pod can reside on. The way the default scheduling algorithm works is the following:
- filtering - finds the set of Nodes where it's feasible to schedule the Pod. For example, the
PodFitsResources filter checks whether a candidate Node has enough available resources to meet a Pod's specific resource requests.
- scoring - the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules. The highest ranked node is then selected for scheduling.
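The two phases map nicely onto a filter followed by a max. The sketch below is a toy version: the node and pod shapes are invented, the filter only looks at CPU and memory, and the scoring rule (prefer the node with the most CPU left after placement) is a made-up stand-in for the real scoring rules.

```python
from typing import Dict, List, Optional


def schedule(pod: Dict[str, int], nodes: List[Dict]) -> Optional[str]:
    """Pick a node name for the pod, or None if no node is feasible."""
    # Filtering: keep only nodes with enough free CPU (millicores) and memory (MiB).
    feasible = [
        node for node in nodes
        if node["free_cpu_m"] >= pod["cpu_m"] and node["free_mem_mi"] >= pod["mem_mi"]
    ]
    if not feasible:
        return None  # the pod stays Pending

    # Scoring (hypothetical rule): prefer the node with the most CPU left afterwards.
    def score(node: Dict) -> int:
        return node["free_cpu_m"] - pod["cpu_m"]

    return max(feasible, key=score)["name"]
```

With nodes `a` (1000m), `b` (4000m) and `c` (2000m) and a pod requesting 1500m, node `a` is filtered out and `b` wins the scoring round.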
Once the algorithm finds a node, the scheduler sends a bind object with the pod and the node ref to the apiserver. API-server sets the
NodeName to the one in the bind object and sets its
PodScheduled status condition to
The filtering and scoring behavior of the scheduler can be configured using:
- Scheduling Policies: allow you to configure Predicates for filtering and Priorities for scoring.
- Scheduling Profiles: allow you to configure Plugins that implement different scheduling stages.
Or you can write your own custom scheduler and use it by specifying the
schedulerName field in
Okay, let's see what has been done so far: the HTTP request passed authentication, authorization, and admission control stages; a Deployment, ReplicaSet, and three Pod resources were persisted to etcd; and, finally, each Pod was scheduled to a suitable node. However, the state we are talking about exists purely in etcd. The next steps involve distributing this state across the worker nodes, which is the whole point of a distributed system like Kubernetes! The way this happens is through a component called the
Kubelet runs on every node on the cluster and is responsible for managing the lifecycle of pods. This means it handles all of the translation logic between the abstraction of a "Pod" (which is really just a Kubernetes concept) and its building blocks, containers. It also handles all of the associated logic around mounting volumes, container logging, and many more important things.
Like a controller, kubelet periodically queries the apiserver for pods whose
NodeName matches the name of the node the kubelet is running on. Once it has that list, it detects changes by comparing against its own internal cache and begins to synchronise the state. Let's take a look at what that synchronization process looks like when the pod is being created (we are creating nginx pods, remember?):
- Record pod worker start latency since kubelet first saw the pod.
- Call generateAPIPodStatus to prepare a v1.PodStatus for the pod, which represents the state of a Pod's current Phase. The Phase of a Pod is a high-level summary of where the Pod is in its lifecycle. Examples include
- Update the status of the pod in the status manager, which is tasked with asynchronously updating the etcd record via the apiserver.
- Next, the kubelet checks whether the pod can be run. A series of softAdmitHandlers are checked; these include enforcing AppArmor profiles and NO_NEW_PRIVS (github.com/kubernetes/kubernetes/blob/v1.23..; see also pod-security-policies, kubernetes.io/docs/concepts/policy/pod-secu...). Pods denied at this stage will stay in the Pending state indefinitely.
- Create Cgroups for the pod and apply resource parameters. This is to enable better Quality of Service (QoS) handling for pods.
- Create a mirror pod if the pod is a static pod, and does not already have a mirror pod
- Data directories are created for the pod. These include the pod directory (usually
/var/lib/kubelet/pods/<podID>), its volumes directory (
<podDir>/volumes) and its plugins directory (
- Wait for volumes to attach/mount defined in
Spec.Volumes. Depending on the type of volume being mounted, some pods will need to wait longer (e.g. cloud or NFS volumes).
- Fetch the pull secrets for the pod defined in
- Call the container runtime's SyncPod callback; the container runtime then runs the containers.
Now, our containers (not
pod anymore) are ready to be launched. The software that does this launching is called the Container Runtime (
Since v1.5.0, the kubelet has been using a concept called CRI (Container Runtime Interface) to interact with concrete container runtimes. The CRI is a plugin interface that enables the kubelet to use a wide variety of container runtimes without needing to recompile the cluster components.
The CRI defines the main gRPC protocol for communication between the kubelet and the container runtime.
During the Container Runtime (CR) SyncPod call, the kubelet makes the following gRPC calls to the underlying CR via the CRI:
- Compute sandbox and container changes.
- Kill pod sandbox if necessary - when sandbox changes or containers in it are dead.
- Create ephemeral containers - These are used to debug actual application containers.
- Create init containers.
- Create normal containers.
In Docker (a container runtime), creating a sandbox involves creating a
pause container. The
pause container provides a way to create
cgroups and various Linux namespaces (network, IPC, PID, etc.) and to share them among all the containers of the pod. That's why all the containers of a pod can reach one another at
localhost. Pod networking is set up using a CNI plugin during sandbox creation.
Pod networking and CNI
Our Pod is now created: a pause container which hosts all of the namespaces to allow inter-pod communication and our nginx container. But how does networking work and how is it set up?
Kubelet delegates the task of networking setup to a
CNI plugin. CNI stands for Container Network Interface and operates in a similar way to the Container Runtime Interface. CNI is an abstraction that allows different network providers to use different networking implementations for containers.
During pod sandbox creation, the kubelet calls the registered CNI plugin with pod-specific data (its name, network namespace, etc.). Once the plugin finishes its job, the pod should be able to reach other pods on the same host.
By this time, pods on the same node can communicate with each other. But what if two pods on different machines want to communicate?
There are several ways to accomplish this. For instance, AWS EKS does it using native VPC networking while solutions like
flannel do it using overlay networking.
Note: Inter-host networking isn't configured with every pod creation. It's usually done during cluster creation.
As we discussed earlier, once the
PodSandbox is created, the kubelet will:
- Kill the sandbox if necessary
- Create ephemeral containers - These are used to debug actual application containers.
- Create init containers.
- Create normal containers.
Normal containers are created in the same way:
- Pull container image
- Create Container via CRI.
podSandboxConfig are passed to the CR.
- The container is then registered with the CPU and memory managers. This allows containers in Guaranteed QoS pods with integer CPU requests access to exclusive CPUs on the node.
- The container is then started
- If there are any post-start container lifecycle hooks, they are run
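The per-container steps above can be sketched against a fake runtime that only records what it was asked to do. Nothing here is the real CRI: `RecordingRuntime` and its method names are hypothetical stand-ins, and the CPU and memory manager registration step is skipped.

```python
from typing import Dict, List, Tuple


class RecordingRuntime:
    """Hypothetical stand-in for a container runtime; it only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str]] = []

    def pull_image(self, image: str) -> None:
        self.calls.append(("pull", image))

    def create_container(self, sandbox_id: str, name: str) -> str:
        container_id = f"{sandbox_id}/{name}"
        self.calls.append(("create", container_id))
        return container_id

    def start_container(self, container_id: str) -> None:
        self.calls.append(("start", container_id))

    def run_hook(self, container_id: str, hook: str) -> None:
        self.calls.append(("hook", f"{container_id}:{hook}"))


def start_containers(runtime: RecordingRuntime, sandbox_id: str, containers: List[Dict[str, str]]) -> None:
    """Pull, create and start each container in order, then run its post-start hook if any."""
    for container in containers:
        runtime.pull_image(container["image"])
        container_id = runtime.create_container(sandbox_id, container["name"])
        runtime.start_container(container_id)
        if container.get("post_start"):
            runtime.run_hook(container_id, container["post_start"])
```

Calling `start_containers(runtime, "sandbox-1", [{"name": "nginx", "image": "nginx"}])` leaves `runtime.calls` holding the pull, create and start steps in that order.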
Is this it? I guess so.
After all this, we should have three nginx pods running in our cluster.
kubectl create deploy nginx --image=nginx --replicas=3
deployment.apps/nginx created
kubectl get deploy nginx --show-labels
NAME    READY   UP-TO-DATE   AVAILABLE   AGE   LABELS
nginx   3/3     3            3           80s   app=nginx
kubectl get pods -lapp=nginx
NAME                     READY   STATUS    RESTARTS   AGE
nginx-6799fc88d8-4qcft   1/1     Running   0          89s
nginx-6799fc88d8-dlngw   1/1     Running   0          89s
nginx-6799fc88d8-gntj5   1/1     Running   0          89s