This script was created for cases where multiple Node Not Ready alerts are received on an AKS cluster and, because of ephemeral storage or the internal remediator triggering a node re-image, it is difficult to find the root cause of these issues.
It is deployed as a DaemonSet and delivers two scripts into the /tmp directory of every node.
- The Python script uses the following environment variables for detailed configuration:
  - NODE_BACKUP: if set to True, the script skips the performance and connectivity tests and executes only the backup of the node logs to the Azure Storage file share.
  - GLOBAL_DELAY: integer delay, in seconds, between iterations of the testing loop. Defaults to 10 seconds if not set.
  - CPU_MAX: threshold for maximum CPU utilization. If the measured value is higher than this configured value, the log export is triggered.
  - MEM_MAX: same as CPU_MAX, but for memory utilization.
  - RUN_FOR: integer number of minutes the testing loop will run. Defaults to 5 minutes if not set.
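A minimal sketch of how the script could read these variables with the documented fallbacks. The helper name `load_config` and the 90% defaults for CPU_MAX and MEM_MAX are assumptions for illustration; the README does not state the built-in thresholds.

```python
import os

def load_config() -> dict:
    """Read the documented environment variables, applying defaults."""
    return {
        "node_backup": os.getenv("NODE_BACKUP", "False").lower() == "true",
        "global_delay": int(os.getenv("GLOBAL_DELAY", "10")),  # seconds between loop iterations
        "cpu_max": float(os.getenv("CPU_MAX", "90")),          # percent; default is an assumption
        "mem_max": float(os.getenv("MEM_MAX", "90")),          # percent; default is an assumption
        "run_for": int(os.getenv("RUN_FOR", "5")),             # minutes the loop runs
    }
```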
For implementation, we need an Azure Storage Account with a file share named logs. The connection string is provided through the CONN_STR environment variable inside the YAML manifest file. If we want to execute only the node log extraction (the backup.sh script), we set the NODE_BACKUP environment variable to True.
Ex. export NODE_BACKUP=True
Two functions are called by default during this operation:
It takes the host and port as parameters; in our example these are 127.0.0.1 and the kubelet port 10250. If the script is unable to open a socket on this port during a loop iteration, it triggers the node_backup() and upload() functions to send the required logs to the file share. To prevent multiple uploads, a Boolean flag (is_uploaded) is implemented that skips sending the logs if this operation has already been done.
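The connectivity check above can be sketched with the standard library; the function name `kubelet_reachable` and the timeout value are illustrative assumptions, not the script's actual identifiers.

```python
import socket

def kubelet_reachable(host: str = "127.0.0.1", port: int = 10250,
                      timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable: kubelet is down.
        return False
```

In the main loop, a False return would flip the is_uploaded flag logic and call node_backup() followed by upload().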
Using the psutil package, the script reads the CPU utilization (over a 5-second interval) and the memory consumption as percentages. If these values are higher than the configured environment values (or the defaults implemented in code), the node backup and upload process is also triggered.
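A sketch of that performance check using psutil's `cpu_percent` (with a sampling interval) and `virtual_memory`. The function name and the 90% fallback thresholds are assumptions for illustration.

```python
import psutil

def performance_exceeded(cpu_max: float = 90.0, mem_max: float = 90.0,
                         interval: float = 5.0) -> bool:
    """Return True if CPU or memory utilization exceeds its threshold."""
    cpu = psutil.cpu_percent(interval=interval)  # average CPU % over the interval
    mem = psutil.virtual_memory().percent        # current memory utilization in %
    return cpu > cpu_max or mem > mem_max
```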
Triggers the creation of a tar gzip archive of /var/log/journal, /tmp/aksreport, and /var/log/containers.
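A minimal version of that archiving step with the standard tarfile module; the archive naming scheme and the skip-if-missing check are assumptions, not necessarily what the script does.

```python
import os
import tarfile
import time

def node_backup(paths=("/var/log/journal", "/tmp/aksreport",
                       "/var/log/containers")) -> str:
    """Create a gzipped tar of the given log paths and return its filename."""
    archive = f"/tmp/node-logs-{int(time.time())}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            if os.path.exists(path):  # skip paths absent on this node
                tar.add(path)
    return archive
```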
This function takes as a parameter the name of the archive created by the node_backup function and uses the Azure File Share Python SDK to open a connection to the Storage Account and upload the file.
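The upload step could look like the sketch below, assuming the azure-storage-file-share package and the `logs` share plus CONN_STR variable described earlier. This is an illustrative shape, not the script's exact code, and it needs a real Storage Account to run.

```python
import os
from azure.storage.fileshare import ShareFileClient

def upload(archive_path: str) -> None:
    """Upload the archive produced by node_backup() to the 'logs' file share."""
    client = ShareFileClient.from_connection_string(
        conn_str=os.environ["CONN_STR"],          # set in the DaemonSet manifest
        share_name="logs",
        file_path=os.path.basename(archive_path),
    )
    with open(archive_path, "rb") as data:
        client.upload_file(data)
```

After a successful upload, the is_uploaded flag mentioned earlier would be set so repeated loop iterations do not re-send the same logs.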