
Unable to get pod logs from AKS cluster #97

Closed
nitinkhandelwal26 opened this issue Oct 28, 2020 · 16 comments
Labels
question Further information is requested

Comments

@nitinkhandelwal26

nitinkhandelwal26 commented Oct 28, 2020

Hello Team,

Trying to get pod logs using:
kubectl logs csi-secrets-store-9wr95 -n cluster-baseline-settings -c secrets-store
and getting the output below:
Error from server: Get https://aks-npuser01-42213062-vmss00000c:10250/containerLogs/cluster-baseline-settings/csi-secrets-store-9wr95/secrets-store: dial tcp 10.10.128.197:10250: i/o timeout

In the portal, however, the pod shows as running:
[screenshot: CSI_POD]
[screenshot: get-pods-clusterbaseline]

I don't know why I am unable to fetch logs from any pods. Can you please help me here?
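
One way to see whether a network security group is involved is to check whether an NSG is attached to the node pool subnet and list its rules, looking for anything that would block TCP 10250 (the kubelet port in the error above). A minimal sketch with the Azure CLI; the resource group, VNet, subnet, and NSG names below are placeholders for this deployment:

# Is an NSG attached to the node pool subnet? (placeholder names)
az network vnet subnet show \
  --resource-group rg-enterprise-networking-spokes \
  --vnet-name vnet-spoke-bu0001a0008-00 \
  --name snet-clusternodes \
  --query networkSecurityGroup.id

# If so, list its rules and look for anything denying TCP 10250 within the VNet.
az network nsg rule list \
  --resource-group rg-enterprise-networking-spokes \
  --nsg-name nsg-snet-clusternodes \
  --output table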

@nitinkhandelwal26
Author

nitinkhandelwal26 commented Oct 28, 2020

Matrix server pod:
[screenshot: matrix-server-pods]

Traefik deployment:
[screenshot: deploydescribetraffic]

Traefik pods in portal:
[screenshot: TraefikPods]

Traefik kubectl logs error:
[screenshot: TraefikLogError]

User permissions:
[screenshot: AKS permissions]

@nitinkhandelwal26
Author

This issue seems to be related: Azure/AKS#1544

@nitinkhandelwal26
Author

We had created a subnet-level NSG, and when we removed the NSG from the subnet it started working.
Resolved, thanks.

Can you please guide us on which ports/rules should be applied to a subnet-level NSG so that log fetching works with the NSG in place?
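
For reference, the port in the original error is TCP 10250 (kubelet), so at minimum a subnet-level NSG has to leave intra-VNet traffic to that port open. A minimal sketch of such a rule, not a complete ruleset; the NSG name, resource group, and priority are placeholders:

# Allow intra-VNet traffic to the kubelet port on the node subnet NSG (placeholder names).
az network nsg rule create \
  --resource-group rg-enterprise-networking-spokes \
  --nsg-name nsg-snet-clusternodes \
  --name AllowKubeletFromVnet \
  --priority 200 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes VirtualNetwork \
  --source-port-ranges '*' \
  --destination-address-prefixes VirtualNetwork \
  --destination-port-ranges 10250

Any additional cluster add-ons would still need their own openings on top of this.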

@neilpeterson
Contributor

@nitinkhandelwal26 I see that you have closed this issue. Did you find the information that you need?

@nitinkhandelwal26
Author

Thanks @neilpeterson for the support. No, I still haven't found that information, but my issue was resolved by removing the subnet-level NSG; I still need to work out the required rules. If you have any information on that, it would be really helpful.
@ckittel is working on a PR related to dependencies on public images, which might fix this issue since we would no longer need to reach public repos. But I still don't know which ports and IPs need to be allowed in the subnet-level NSG to fetch logs from pods in the AKS cluster.

@ckittel
Member

ckittel commented Oct 29, 2020

We don't document any subnet level NSG specific port requirements that I'm aware of outside of our general egress guidance. Obviously AKS applies NSG rules to the NICs in your cluster, but if you're applying at the subnet level as you said, your responsibility is to ensure they don't interfere with normal healthy traffic.

If you find a ruleset that works for you, do please share. I'm guessing you'll see your ruleset look similar to the documented ruleset in the link above + any additional configuration added to the cluster that demands even more openings.

Out of curiosity, what specific problem are you looking to solve with added subnet-level NSG rules that the NIC-level NSG rules + egress via FW doesn't already solve for?

@ckittel ckittel added the question Further information is requested label Oct 29, 2020
@nitinkhandelwal26
Author

@ckittel Our network team, which provides the hub-spoke network, has implemented this.

@nitinkhandelwal26
Author

I will surely share the ruleset once I fix that; for now I have removed the NSGs to make it work. Thank you @ckittel for your support.

@Cogax

Cogax commented Sep 30, 2021

I also have this issue. The kubelet (port 10250) is not reachable from the kube-apiserver. I can get pods, but I can't access logs (timeout error as mentioned above). I added an inbound rule in the spoke-nodepools NSG for the ports mentioned here https://docs.microsoft.com/en-us/azure/aks/limit-egress-traffic and here https://kubernetes.io/docs/reference/ports-and-protocols/. It's still not working. What's wrong with my approach?
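
One way to narrow this down is to test the kubelet port from a pod inside the cluster: if TCP 10250 is reachable pod-to-node but kubectl logs still times out, the block is more likely on the tunnel path back to the API server than in the node subnet NSG. A sketch, assuming a node IP taken from kubectl get nodes -o wide and that the nicolaka/netshoot image is allowed by your egress firewall and image policies:

# Replace 10.240.0.4 with one of your node IPs. The kubelet requires auth, so an HTTP
# code such as 401/403 still proves the port is reachable; 000 after ~5s means a timeout.
kubectl run net-debug --rm -it --restart=Never --image=nicolaka/netshoot -- \
  curl -sk -m 5 -o /dev/null -w '%{http_code}\n' https://10.240.0.4:10250/healthz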

@ckittel
Member

ckittel commented Sep 30, 2021

@Cogax thanks for reporting.

We have an FAQ entry on this error, but it's not very detailed.

We've seen some inconsistency in the required network rules depending on whether your cluster is running konnectivity or not. That seems to depend on which region you deploy into (as far as we can tell so far).

I'm curious: when you run kubectl get deployments -n kube-system, does konnectivity-agent show up for you, or do you see tunnelfront instead?

NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
azure-policy           1/1     1            1           46h
azure-policy-webhook   1/1     1            1           46h
coredns                2/2     2            2           46h
coredns-autoscaler     1/1     1            1           46h
konnectivity-agent     1/1     1            1           46h
metrics-server         1/1     1            1           46h
omsagent-rs            1/1     1            1           46h

I just deployed this cluster two days ago and I can freely get logs (see the examples below) from various pods across the two node pools without hitting that error. But I know we've run into the error you're seeing before as well, so any help in triage will be appreciated. Let's start with konnectivity vs. tunnelfront and see if that helps narrow this down.

Also, what region did you deploy into?

cc: @abossard

Examples:

❯ kubectl logs aspnetapp-deployment-56b77c4f79-tptcj -n a0008
{"EventId":50,"LogLevel":"Warning","Category":"Microsoft.AspNetCore.DataProtection.Repositories.EphemeralXmlRepository","Message":"Using an in-memory repository. Keys will not be persisted to storage.","State":{"Message":"Using an in-memory repository. Keys will not be persisted to storage.","{OriginalFormat}":"Using an in-memory repository. Keys will not be persisted to storage."}}

❯ kubectl logs aad-pod-identity-nmi-xsbc2 -n cluster-baseline-settings
I0928 14:04:33.062515       1 main.go:111] running NMI in namespaced mode: false
I0928 14:04:33.062526       1 nmi.go:53] initializing in standard mode
I0928 14:04:33.062533       1 probes.go:41] initialized health probe on port 8085
I0928 14:04:33.062539       1 probes.go:44] started health probe

❯ kubectl logs azure-policy-webhook-59b69cfc84-pq42w -n kube-system
{"level":"info","ts":"2021-09-28T13:56:30.436919146Z","msg":"fetching secret","log-id":"0008c6fa-b-1","method":"github.com/Azure/azure-policy-kubernetes/pkg/webhook.(*k8sClient).getSecret"}
{"level":"info","ts":"2021-09-28T13:56:30.455736905Z","msg":"fetching validating webhook configuration","log-id":"0008c6fa-b-2","method":"github.com/Azure/azure-policy-kubernetes/pkg/webhook.(*k8sClient).getValidationWebhookConfig"}

@ckittel ckittel reopened this Sep 30, 2021
@Cogax

Cogax commented Sep 30, 2021

@ckittel

Thanks for your quick response. I found your FAQ article while researching this issue, but it didn't help me. I opened all inbound traffic (any source, destination, port, etc.) on all NSGs, but it had no effect.

I removed the whole setup, so I can't give you exact answers. I will recreate it later and check whether the issue still exists. Some information I have at the moment:

  • region was westeurope for everything
  • the Azure subscription is a test subscription with limited quotas
  • because of quota limitations, I used only 1 node (min=1, max=1) for the system node pool as well as the user node pool:
"agentPoolProfiles": [
{
  "name": "npsystem",
  "count": 1,
  "vmSize": "Standard_DS2_v2",
  "osDiskSizeGB": 80,
  "osDiskType": "Ephemeral",
  "osType": "Linux",
  "minCount": 1,
  "maxCount": 1,
  "vnetSubnetID": "[variables('vnetNodePoolSubnetResourceId')]",
  "enableAutoScaling": true,
  "type": "VirtualMachineScaleSets",
  "mode": "System",
  "scaleSetPriority": "Regular",
  "scaleSetEvictionPolicy": "Delete",
  "orchestratorVersion": "[parameters('kubernetesVersion')]",
  "enableNodePublicIP": false,
  "maxPods": 30,
  "availabilityZones": ["1", "2", "3"],
  "upgradeSettings": {
    "maxSurge": "33%"
  },
  "nodeTaints": ["CriticalAddonsOnly=true:NoSchedule"]
},
{
  "name": "npuser01",
  "count": 1,
  "vmSize": "Standard_DS3_v2",
  "osDiskSizeGB": 120,
  "osDiskType": "Ephemeral",
  "osType": "Linux",
  "minCount": 1,
  "maxCount": 1,
  "vnetSubnetID": "[variables('vnetNodePoolSubnetResourceId')]",
  "enableAutoScaling": true,
  "type": "VirtualMachineScaleSets",
  "mode": "User",
  "scaleSetPriority": "Regular",
  "scaleSetEvictionPolicy": "Delete",
  "orchestratorVersion": "[parameters('kubernetesVersion')]",
  "enableNodePublicIP": false,
  "maxPods": 30,
  "availabilityZones": ["1", "2", "3"],
  "upgradeSettings": {
    "maxSurge": "33%"
  }
}
]
  • I did not apply Flux. Instead, I executed these YAML files manually:
kubectl apply -f kubernetes/new-aks/cluster-manifests/kube-system/container-azm-ms-agentconfig.yaml
kubectl apply -f kubernetes/new-aks/cluster-manifests/cluster-baseline-settings/kured.yaml
kubectl apply -f kubernetes/new-aks/cluster-manifests/cluster-baseline-settings/aad-pod-identity.yaml
kubectl apply -f kubernetes/new-aks/cluster-manifests/a0008/ingress-network-policy.yaml
  • In my console I had a list of all pods. The issue appeared when I wanted to deploy the Traefik components; they were always stuck in ContainerCreating and Pending. This is why I wanted to check the logs of the aad-pod-identity-nmi pod. This is the list:
NAMESPACE                   NAME                                                   READY   STATUS              RESTARTS   AGE
a0008                       traefik-ingress-controller-54ff76688d-c4n2t            0/1     ContainerCreating   0          13h
a0008                       traefik-ingress-controller-54ff76688d-nm5k9            0/1     Pending             0          15h
cluster-baseline-settings   aad-pod-identity-mic-59545c8bc7-75d66                  1/1     Running             0          13h
cluster-baseline-settings   aad-pod-identity-mic-59545c8bc7-7mbx2                  1/1     Running             0          13h
cluster-baseline-settings   aad-pod-identity-nmi-mzvz4                             1/1     Running             0          13h
cluster-baseline-settings   kured-swvrh                                            1/1     Running             0          13h
cluster-baseline-settings   kured-xghl8                                            1/1     Running             0          13h
default                     node-debugger-aks-npuser01-58613137-vmss000001-q8fq9   1/1     Running             0          29m
gatekeeper-system           gatekeeper-audit-6856c7d886-clp5d                      1/1     Running             0          13h
gatekeeper-system           gatekeeper-controller-7bff99d7dc-2dn28                 1/1     Running             0          13h
gatekeeper-system           gatekeeper-controller-7bff99d7dc-hr4xd                 1/1     Running             0          13h
kube-system                 aks-link-79f56b9565-5n2v8                              1/1     Running             0          149m
kube-system                 aks-link-79f56b9565-dsgh6                              1/1     Running             0          149m
kube-system                 aks-secrets-store-csi-driver-9lmpv                     3/3     Running             2          13h
kube-system                 aks-secrets-store-csi-driver-vwzp9                     3/3     Running             2          13h
kube-system                 aks-secrets-store-provider-azure-vswz4                 1/1     Running             0          13h
kube-system                 aks-secrets-store-provider-azure-zq49c                 1/1     Running             0          13h
kube-system                 azure-cni-networkmonitor-6gnqm                         1/1     Running             0          13h
kube-system                 azure-cni-networkmonitor-z9gw6                         1/1     Running             0          13h
kube-system                 azure-ip-masq-agent-c2rtt                              1/1     Running             0          13h
kube-system                 azure-ip-masq-agent-lh9vp                              1/1     Running             0          13h
kube-system                 azure-npm-4qrs2                                        1/1     Running             0          13h
kube-system                 azure-npm-btqv7                                        1/1     Running             0          13h
kube-system                 azure-policy-6f77469b44-6pn2w                          1/1     Running             0          13h
kube-system                 azure-policy-webhook-59b69cfc84-2gl7l                  1/1     Running             0          13h
kube-system                 coredns-86846667d7-lqbsv                               1/1     Running             0          13h
kube-system                 coredns-86846667d7-r6wtz                               1/1     Running             0          13h
kube-system                 coredns-autoscaler-5f85dc856b-xzfkt                    1/1     Running             0          13h
kube-system                 csi-azuredisk-node-l4rqh                               3/3     Running             0          13h
kube-system                 csi-azuredisk-node-zd676                               3/3     Running             0          13h
kube-system                 csi-azurefile-node-h5shc                               3/3     Running             0          13h
kube-system                 csi-azurefile-node-wxh45                               3/3     Running             0          13h
kube-system                 kube-proxy-cv2wr                                       1/1     Running             0          13h
kube-system                 kube-proxy-qgpbn                                       1/1     Running             0          13h
kube-system                 metrics-server-6bc97b47f7-g5hwr                        0/1     Running             249        13h
kube-system                 omsagent-4sp2t                                         1/1     Running             0          13h
kube-system                 omsagent-f8sxs                                         1/1     Running             0          13h
kube-system                 omsagent-rs-7c5979787c-849m4                           1/1     Running             0          13h
  • I don't know if there was a konnectivity-agent running

Hope that helps. I will recreate the whole setup I did and update this issue later.

@ckittel
Member

ckittel commented Sep 30, 2021

I think we're on to something. westeurope has been the problem every time so far. You've got aks-link in the output above, which means your cluster is indeed not running konnectivity.

We made a change to this repo to migrate to konnectivity's network rulesets (basically everything over 443) instead of what tunnelfront/aks-link required (see #199, specifically the removal of the rules in hub-regionA.json). That's probably what's causing this, if I had to guess. Usually that manifests as other errors as well, not just log fetching, which is an interesting wrinkle here.

See the related conversation happening at #223, where @brk3 had a similar observation (also in westeurope) and found a workaround by adding back the rules we removed when AKS moved to konnectivity (comment: #223 (comment)).

@ccyflai

ccyflai commented Nov 8, 2021

I encountered the same problem getting pod logs until I allowed the node IPs to reach port 9000 of the API server via a hub firewall network rule. This is documented below. I would suggest amending the ARM template hub-regionA.json for that.

https://docs.microsoft.com/en-us/azure/aks/limit-egress-traffic#azure-global-required-network-rules
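
For clusters still running tunnelfront/aks-link, a hedged sketch of that extra hub firewall rule with the Azure CLI (the linked doc lists TCP 9000 and UDP 1194 for the tunnel); the firewall name, resource group, collection name, priority, and node subnet CIDR are placeholders, and this assumes classic rule collections rather than a firewall policy:

# Temporary network rule for tunnelfront/aks-link clusters; remove it once konnectivity rolls out.
az network firewall network-rule create \
  --resource-group rg-enterprise-networking-hubs \
  --firewall-name fw-hub-regionA \
  --collection-name aks-tunnel \
  --name node-to-apiserver-tunnel \
  --priority 300 \
  --action Allow \
  --protocols TCP UDP \
  --source-addresses 10.240.0.0/22 \
  --destination-addresses AzureCloud \
  --destination-ports 9000 1194

The equivalent rule could instead be added back to hub-regionA.json, as suggested above.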

@ckittel
Member

ckittel commented Nov 8, 2021

@ccyflai -- can I ask what region you were deploying to? Just want to see if the pattern continues to emerge here.

Glad you added that extra firewall rule to proceed. Don't forget to remove it once konnectivity is used within your cluster, as it won't be necessary anymore.

@ccyflai

ccyflai commented Nov 8, 2021

I deployed in southeastasia.

@ckittel
Member

ckittel commented Nov 30, 2021

It looks like konnectivity is rolling out more broadly now. Since the egress affordances for aks-link have been replaced with the simplified egress rules found in this reference implementation for konnectivity, I'm going to close this issue. But if your region doesn't use konnectivity, then the conversation above will help. It's just a matter of timing between the two, unfortunately.
