When SIGKILL Isn't Enough: Debugging an Unkillable Docker Container

May 2026

I self-host a Matrix server where I chat with some of my friends. Recently, my client suddenly lost its connection to the server. I assumed it was a small issue (possibly with the network on my phone) that would resolve itself, but it persisted.


At that point, I had to log into my server and try to pin down the cause of the outage. My Matrix application runs inside a Docker container, so the natural first step was to look at the container logs. They showed no sign of anything wrong, which was odd, since it was clear that I could not send any messages.


Any attempt to restart or kill the service was also futile, since I kept getting the error below:

Error response from daemon: cannot kill container: tried to kill container, but did not receive an exit event  

This was starting to get interesting! Whatever had messed up my container process must have hijacked its ability to respond to kill signals as well. Not even a sudo kill -9 <pid> was working at this point. Inspecting the container also showed that it was up but unhealthy.
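For anyone who wants to script that inspection step, here is a minimal sketch in Python. It assumes the standard JSON layout printed by docker inspect (an array with one object per container, with State.Status, State.Pid and, when the image defines a health check, State.Health.Status). The function names and the sample values are made up for illustration.

```python
import json
import subprocess


def summarize_container_state(inspect_json: str) -> dict:
    """Pull the fields of interest out of `docker inspect` JSON output."""
    # `docker inspect` prints a JSON array with one object per container.
    state = json.loads(inspect_json)[0]["State"]
    # "Health" is only present when the image defines a HEALTHCHECK.
    health = state.get("Health", {}).get("Status")
    return {"status": state["Status"], "health": health, "pid": state["Pid"]}


def inspect_container(name: str) -> dict:
    """Run `docker inspect` for one container (needs access to the Docker daemon)."""
    result = subprocess.run(
        ["docker", "inspect", name], capture_output=True, text=True, check=True
    )
    return summarize_container_state(result.stdout)


if __name__ == "__main__":
    # Trimmed, illustrative sample; real output contains many more fields.
    sample = json.dumps(
        [{"State": {"Status": "running", "Pid": 2738535,
                    "Health": {"Status": "unhealthy"}}}]
    )
    print(summarize_container_state(sample))
```

The pid field is the container's main process as seen from the host, which is the PID to look up under /proc.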


Further inspection of the container process revealed something interesting. As shown in the snippet below, my process had entered, and remained in, D-state:

$ cat /proc/2738535/status  
Name:    python  
Umask:    0022  
State:    D (disk sleep) # the smoking gun!!
...
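The same check can be scripted. Below is a rough sketch, assuming a Linux host with /proc mounted, that reads the State: line of every /proc/<pid>/status file and lists the PIDs currently in D-state. The function names are my own, and it only looks at each process's main thread, so a process with a single stuck worker thread could be missed.

```python
from pathlib import Path


def parse_state(status_text: str) -> str:
    """Return the one-letter state code from /proc/<pid>/status contents."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            # The line looks like "State:\tD (disk sleep)".
            return line.split()[1]
    raise ValueError("no State: line found")


def find_d_state_pids(proc_root: str = "/proc") -> list[int]:
    """List PIDs whose main thread is currently in uninterruptible sleep."""
    root = Path(proc_root)
    if not root.is_dir():
        # Not a Linux host (or /proc is not mounted): nothing to report.
        return []
    pids = []
    for entry in root.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            if parse_state((entry / "status").read_text()) == "D":
                pids.append(int(entry.name))
        except (OSError, ValueError):
            # The process exited mid-scan or we lack permission; skip it.
            continue
    return sorted(pids)


if __name__ == "__main__":
    print(find_d_state_pids())
```

A single run is only a snapshot, and briefly seeing a process in D-state is normal. Running the scan a few times and looking for a PID that never leaves the list is closer to what I was seeing.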

Without going into much detail, D-state is uninterruptible sleep: a process enters it while blocked inside the kernel, typically waiting on I/O. This is completely normal, but if a process stays in this state for a long time, it is usually a sign that something is wrong. Common causes include slow or failing hardware, issues with network file systems and mounts, and kernel bugs (which was most likely the case here [1]).


While in this state, the process becomes unresponsive and cannot be woken up from userspace. Because of that, it won’t respond even to kill signals; a signal, SIGKILL included, simply stays pending until the process leaves D-state. That explains why our container process could not be killed.
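One way to see that the kill signal was queued rather than lost is to look at the pending-signal masks in the same /proc/<pid>/status file. The sketch below assumes the usual Linux format, where SigPnd (per-thread) and ShdPnd (process-wide) are hexadecimal bitmasks and bit n-1 stands for signal number n. The function names and the sample text are hypothetical, and I did not capture this output during the incident.

```python
import signal


def pending_signals(status_text: str) -> set[int]:
    """Return signal numbers listed as pending in /proc/<pid>/status contents."""
    pending = set()
    for line in status_text.splitlines():
        # SigPnd is the per-thread pending mask, ShdPnd the process-wide one.
        if line.startswith(("SigPnd:", "ShdPnd:")):
            mask = int(line.split()[1], 16)
            # Bit n-1 of the mask corresponds to signal number n.
            pending |= {n for n in range(1, 65) if mask & (1 << (n - 1))}
    return pending


def sigkill_is_pending(status_text: str) -> bool:
    """True if SIGKILL has been sent but not yet acted on."""
    return int(signal.SIGKILL) in pending_signals(status_text)


if __name__ == "__main__":
    # Illustrative sample, not output captured from the incident.
    sample = "State:\tD (disk sleep)\nSigPnd:\t0000000000000000\nShdPnd:\t0000000000000100\n"
    print(sigkill_is_pending(sample))
```

On a stuck process, a SIGKILL that stays pending alongside State: D would be consistent with the behaviour described above.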


Fix

The easiest fix at this point was to just reboot the machine, and that is exactly what I did.


More reading

Chris Down has a very cool blog post going over D-state processes. Please check it out!



  1. I have intentionally tried to keep this post as short as possible, but looking at the kernel logs, I pinned down the issue to a call to __split_huge_pmd that was triggered by the Python process calling madvise().