Mastering the Art of Troubleshooting Stuck TMC Self-Managed Deployments in VCD

Troubleshooting a Stuck TMC Self-Managed Deployment in VCD

As a follow-up to my previous blog post on the VCD Extension for Tanzu Mission Control, I want to share some troubleshooting tips for a self-managed deployment that gets stuck during configuration. Specifically, I will cover what happens when an incorrect value is passed for the DNS zone: the deployment hangs and does not terminate on its own.

Background
----------

In my previous post, I covered the end-to-end deployment steps for TMC self-managed in VCD. During configuration, I made the mistake of passing an incorrect value for the DNS zone. After waiting a couple of hours, I found that the task was still running, which blocked me from reinstalling with the correct configuration.

Symptoms
--------

When I checked the pods in the tmc-local namespace, many of them were stuck in either the CreateContainerConfigError or the CrashLoopBackOff state. In addition, the failed task, "Execute global 'post-create' action", showed the installer complaining that reconciliation of the tmc package installation had failed.
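
As a rough sketch of that check, the snippet below filters a pod listing down to the pods in the two states mentioned above. It assumes you have `kubectl` access to the cluster that hosts TMC self-managed; `stuck_pods` is a hypothetical helper name of my own, not part of any tool.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# the names of pods whose STATUS column (third column) shows one of the
# two stuck states described in this post. The header row is skipped.
stuck_pods() {
  awk 'NR > 1 && ($3 == "CreateContainerConfigError" || $3 == "CrashLoopBackOff") { print $1 }'
}
```

You would typically run it as `kubectl get pods -n tmc-local | stuck_pods`, then use `kubectl describe pod <name> -n tmc-local` on any pod it lists to see why the container cannot start.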

Causes and Resolution
---------------------

After discussing the problem with the Engineering team, we determined that this is a known issue with the solution add-on agent in VCD. VCD gives add-on agents a fixed lifetime of 2 hours and kills the agent when that time is up. The subtask timed out after 2 hours, but because the agent had already been killed, the task status was never updated. To avoid this, set a timeout smaller than 2 hours (7200s) when creating the TMC instance, for example 5400s (90 minutes).
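
The arithmetic behind that recommendation can be sketched as follows. The 2-hour agent limit and the 5400s example come from this post; the variable names are my own, and the idea that the remaining time is what lets the failure be reported is my reading of the issue rather than a documented guarantee.

```shell
# Fixed lifetime VCD gives a solution add-on agent, per this post: 2 hours.
agent_limit_s=$((2 * 60 * 60))   # 7200 seconds

# Example installation timeout suggested in this post.
install_timeout_s=5400           # 90 minutes

# Time left between the installation timing out and the agent being killed.
headroom_s=$((agent_limit_s - install_timeout_s))

if [ "$install_timeout_s" -lt "$agent_limit_s" ]; then
  echo "timeout fits: ${headroom_s}s ($((headroom_s / 60)) min) of headroom"
else
  echo "timeout too large: the agent would be killed first"
fi
```

With the values above, the installation times out 30 minutes before the agent reaches its limit.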

Steps to Resolve
----------------

To troubleshoot and resolve this issue, follow these steps:

1. (Optional) Export Environment Variables: You can export the relevant settings as environment variables, which makes it easier to check for problems with the DNS zone or other configuration values. Note: consult the GSS team before running the commands below in a production environment.

2. Generate VCD Auth Token: To generate a VCD auth token, run the following command:

```
vcd authentication generate
```

3. Retrieve the TMC-SM RDE: To retrieve the runtime defined entity (RDE) for the TMC self-managed (SM) instance, run the following command:

```
vcd tmc-sm rde retrieve
```

4. Mark the TMC-SM Instance as Failed: To mark the TMC-SM instance as failed, run the following command:

```
vcd tmc-sm instance fail --name
```

5. Delete the Failed TMC-SM Instance: Once the TMC-SM instance has been forcefully marked as failed, delete it. In my case, the deletion then went fine and the instance was cleaned up.
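
After the cleanup, one way to confirm that nothing is left running is to count the pods remaining in the tmc-local namespace. This is a minimal sketch that assumes `kubectl` access to the cluster and assumes the deletion removes the workloads from that namespace; `remaining_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and print
# how many pods it lists. The header row and blank lines are not counted,
# so an empty listing prints 0.
remaining_pods() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }'
}
```

Running `kubectl get pods -n tmc-local | remaining_pods` should print 0 once the instance is fully cleaned up. If it does not, the listing itself shows which pods are still around.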

Conclusion
----------

In this blog post, I discussed how to troubleshoot a stuck TMC self-managed deployment in VCD when configuration fails due to an incorrect DNS zone value. By understanding the known issue with the solution addon agent in VCD, setting a smaller timeout value, and following the steps outlined above, you can resolve this issue and successfully deploy TMC self-managed in VCD.

Stay tuned for my next post, where I will discuss another troubleshooting scenario that I encountered in my lab.