Tuesday, January 18, 2022

Debugging why the deletion of PVC in Kubernetes is running indefinitely

This post is about troubleshooting a Kubernetes deployment and assumes a good working knowledge of Kubernetes.

Issue

It all started when Helm chart deployments triggered from our DevOps pipelines began failing. The pipelines use the --atomic flag so that failed deployments are rolled back.

The deployments worked until last Friday, and no changes were made to the Helm charts after that date.

Troubleshooting

The notable message we found when running kubectl get all is as follows.

0/3 nodes are available: 3 persistentvolumeclaim "my-azurefile-pvc" is being deleted.
Normal   NotTriggerScaleUp  2m42s (x91 over 17m)  cluster-autoscaler  pod didn't trigger scale-up: 1 persistentvolumeclaim "my-azurefile-pvc" is being deleted

This shows there is a PersistentVolumeClaim hanging around that needs to be deleted. The container is not starting because it is waiting for the PVC, which is not ready.

The PVC was in the Terminating state. Since it was stuck there without actually terminating, the next step was to delete my-azurefile-pvc and redeploy the Helm chart. When we tried to delete it with the command below, the command never completed.

kubectl delete pvc/my-azurefile-pvc -n <namespace> --force

Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
persistentvolumeclaim "my-azurefile-pvc" force deleted

Googling suggested that the PVC might have finalizers. We checked and found that it did have one.
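One way to make this check is sketched below as a small shell helper. It prints the PVC's deletion timestamp and its finalizers using kubectl's jsonpath output. The function name pvc_finalizers is made up for illustration. A PVC stuck in Terminating usually has a deletion timestamp set and still carries a finalizer such as kubernetes.io/pvc-protection, although this post does not record which finalizer was actually present.

```shell
# Hypothetical helper: show why a PVC may be stuck in Terminating.
# Usage: pvc_finalizers <pvc-name> <namespace>
pvc_finalizers() {
  # Prints "<deletionTimestamp><TAB><finalizers>" for the given PVC.
  # An empty first field means the PVC has not been marked for deletion.
  kubectl get pvc "$1" -n "$2" \
    -o jsonpath='{.metadata.deletionTimestamp}{"\t"}{.metadata.finalizers}{"\n"}'
}
```

It would be called as pvc_finalizers my-azurefile-pvc <namespace>.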


Next we had to remove the finalizer using the command below. We first tried it with the namespace, and it showed a weird error message as follows.

kubectl patch pvc my-azurefile-pvc -p '{"metadata":{"finalizers":null}}' -n <namespace>


Error from server (BadRequest): invalid character 'm' looking for beginning of object key string

I could not find any explanation online of where this invalid 'm' character comes from.
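One possible explanation, offered as an assumption because the post does not say which shell was used: some shells (Windows PowerShell is a common case) strip the inner double quotes before kubectl sees them. The server then receives {metadata:{finalizers:null}}, and 'm' is the first character where a quoted key was expected. A way to sidestep shell quoting altogether is to put the patch in a file and pass it with kubectl patch's --patch-file flag, which needs a reasonably recent kubectl. The sketch below is written for a POSIX shell, and the helper name clear_pvc_finalizers is hypothetical. Removing finalizers bypasses the protection they provide, so it is worth confirming first that nothing still needs the PVC.

```shell
# Hypothetical helper: remove all finalizers from a PVC via a patch file,
# so the JSON never has to survive command-line quoting.
# Usage: clear_pvc_finalizers <pvc-name> <namespace>
clear_pvc_finalizers() {
  patch_file="$(mktemp)" || return 1
  # Same patch body as the inline command in the post.
  printf '%s\n' '{"metadata":{"finalizers":null}}' > "$patch_file"
  kubectl patch pvc "$1" -n "$2" --patch-file "$patch_file"
  status=$?
  rm -f "$patch_file"
  return "$status"
}
```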

Then we tried without the namespace, using the command below. This time it said that the PVC resource is not available in the default namespace or that I don't have permission. In fact, I don't have permission in the default namespace; I only have permission in the namespace allocated to the application's dev instance. The behavior is weird, as kubectl does not seem able to find the resource in the right namespace.

kubectl patch pvc my-azurefile-pvc -p '{"metadata":{"finalizers":null}}'

Then we went back into research mode and found that a PVC cannot be deleted while any pod in the cluster is still using it. We checked all the pods and saw that a pod from a different Helm deployment was using it.

kubectl describe pod/<pod id> -n <namespace>
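To find such pods without describing them one by one, something like the following could be used. It is only a sketch: the helper name pods_using_pvc is hypothetical, and it assumes awk is available alongside kubectl. It lists every pod in the namespace together with the claim names of its volumes, then prints the pods that reference the given PVC. Kubernetes normally keeps the kubernetes.io/pvc-protection finalizer on a claim while a pod uses it, which would explain why the deletion never finished.

```shell
# Hypothetical helper: list pods in a namespace that mount a given PVC.
# Usage: pods_using_pvc <pvc-name> <namespace>
pods_using_pvc() {
  # One line per pod: "<pod-name><TAB><claim names separated by spaces>"
  kubectl get pods -n "$2" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}' |
    awk -F'\t' -v pvc="$1" '{
      # Print the pod name if any of its claims matches the PVC exactly.
      n = split($2, claims, " ")
      for (i = 1; i <= n; i++) if (claims[i] == pvc) { print $1; break }
    }'
}
```

It would be called as pods_using_pvc my-azurefile-pvc <namespace>. Each pod it prints can then be inspected with the kubectl describe command above.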

We uninstalled the deployment that created the above pod, and the issue was solved.

Root cause

We are not sure what prevented the PVC from being deleted in the first place. Perhaps someone changed things in the cluster directly rather than by updating the Helm charts through the DevOps pipeline.

