Hi All,
I have an older CentOS 7.4 system that is used for computationally heavy work. It has a 32G root filesystem, of which 33% is consumed. Lately, one particular set of jobs (run through the SGE batch scheduler) seems to cause a peculiar condition to occur in which the root filesystem space is exhausted but I can't find any files that I can use to identify which process is causing the problem. I'm guessing that the problem is being triggered by a certain set of jobs since it only occurs when those jobs are running.
I would like to identify what the process is that is causing this condition to occur and my standard approach is to identify which files are using the otherwise empty space and then identify which process is using the files. It's not working, I haven't been able to identify the files. They aren't there. All attempts to measure disk usage of / by files shows that the disk usage is only a percentage of available space and that there should be space available.
I know that space can be consumed by deleted files for which file handles are still open. These are not visible to ls, but they can be detected with techniques like:
lsof | grep deleted
and, for sizing:
find /proc/*/fd -ls 2>/dev/null | grep '(deleted)' | \
  sed 's!.*\(/proc[^ ]*\).*!\1!' | xargs wc -c | sort -nr
Applying this technique during one of these episodes shows nothing of interest.
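For reference, the /proc scan described above can be sketched without lsof or wc. This is a minimal sketch, not the poster's actual procedure: the function name deleted_open_files and its optional proc_root argument are made up for illustration. It uses stat on the /proc/PID/fd link to read the allocated block count instead of reading each file, and it does not filter by filesystem, so entries that live in memory (for example memfd objects) will also appear.

```shell
# Print "allocated_bytes pid fd target" for every open file whose
# /proc/PID/fd link ends in " (deleted)", largest first.
# Runs in a subshell so the helper variables do not leak.
deleted_open_files() (
    proc_root=${1:-/proc}   # hypothetical override, useful only for testing
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Unreadable or vanished entries are skipped silently.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') ;;
            *) continue ;;
        esac
        # stat -L follows the /proc link to the still-open inode.
        # %b = allocated blocks, %B = size in bytes of each such block.
        sizes=$(stat -L -c '%b %B' "$fd" 2>/dev/null) || continue
        blocks=${sizes% *}
        bsize=${sizes#* }
        pid=${fd#"$proc_root"/}
        pid=${pid%%/*}
        printf '%s %s %s %s\n' "$((blocks * bsize))" "$pid" "${fd##*/}" "$target"
    done | sort -nr
)
```

Run as root during an episode, something like deleted_open_files | head -20 would list the largest holders first. Without root, only the caller's own processes are readable.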
I'm definitely missing something. Are there any other techniques for identifying other "invisible" files that I can look for?
Thanks in advance for any advice.
centos2@foxengines.net wrote:
It's not working, I haven't been able to identify the files. They aren't there. All attempts to measure disk usage of / by files shows that the disk usage is only a percentage of available space and that there should be space available.
Sparse files? How are you determining how much free space you have?
On Thu, Oct 08, 2020 at 11:12:54AM -0400, Yves Bellefeuille wrote:
centos2@foxengines.net wrote:
It's not working, I haven't been able to identify the files. They aren't there. All attempts to measure disk usage of / by files shows that the disk usage is only a percentage of available space and that there should be space available.
Sparse files? How are you determining how much free space you have?
Thanks for your response.
I didn't attempt to find sparse files specifically but there were no files (or dot-files) at the top level of / that contained any significant data.
The sum of the sizes of all of the directories at the top level of / reported by du did not match the amount of disk space used at the time of the problem. I don't have a transcript of that session but I was using commands like:
find / -maxdepth 1 -xdev -type d | while read -r d; do du -shx "$d"; done
I poked around in /var and /tmp a lot but didn't find anything that would contradict the output of the previous command.
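The comparison described here, df's "used" figure against what du can actually find, can be put into one helper so the gap is a number rather than an impression. This is a sketch under assumptions: the name fs_gap is made up, and it relies on the POSIX output format of df -P. It needs root to be meaningful on /, since du silently skips directories it cannot read.

```shell
# Print df "used" and the du total (both in KiB) for one filesystem,
# plus the difference between them.
fs_gap() (
    mnt=${1:-/}
    # df -P guarantees one line per filesystem; column 3 is "Used".
    used_kb=$(df -kP "$mnt" | awk 'NR==2 {print $3}')
    # -x stays on this filesystem, -s prints a single total.
    du_kb=$(du -skx "$mnt" 2>/dev/null | cut -f1)
    echo "df_used_kb=$used_kb du_kb=$du_kb gap_kb=$((used_kb - du_kb))"
)
```

A large positive gap_kb from fs_gap / during an episode would confirm that the space is held by something du cannot reach by walking the directory tree.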
At this point I started searching for deleted files for which the space had not been reclaimed. Finding nothing, I thought there was something I hadn't run into before, and I didn't know what to look for.
I'm not confident I understand your meaning in the second sentence. I didn't try to determine how much free space I had because there wasn't any. The root filesystem was at 100% capacity and services were failing. I was just trying to find out what had taken it all since normal usage is around 33% or so, according to df. Rebooting the computer eliminates the problem. When it comes back up, the disk usage is again at 33%. Whatever it is vanishes during a reboot.
What does
lsof | grep DEL
show?
Regards, Simon
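For context on the DEL pattern: in lsof's FD column, DEL marks a deleted file that a process still has memory-mapped, which a search for "(deleted)" among open descriptors can miss. The same information is available from /proc/PID/maps. The following is a sketch only: the function name deleted_maps and the proc_root argument are made up for illustration.

```shell
# Print "pid path (deleted)" for every memory mapping whose backing
# file has been unlinked, one line per distinct pid/path pair.
deleted_maps() (
    proc_root=${1:-/proc}   # hypothetical override, useful only for testing
    for maps in "$proc_root"/[0-9]*/maps; do
        [ -r "$maps" ] || continue
        pid=${maps#"$proc_root"/}
        pid=${pid%%/*}
        # maps fields: address perms offset dev inode path...
        # Rebuild the path from field 6 onward so embedded spaces survive.
        awk -v pid="$pid" '/ \(deleted\)$/ {
            path = $6
            for (i = 7; i <= NF; i++) path = path " " $i
            print pid, path
        }' "$maps" 2>/dev/null
    done | sort -u
)
```

As with the descriptor scan, this needs root to see other users' processes.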
On Thu, Oct 08, 2020 at 06:38:55PM +0200, Simon Matter wrote:
What does
lsof | grep DEL
show?
Well, it doesn't show anything now because I ran out of time to troubleshoot and had to reboot the computer which eliminated whatever was using the space. I didn't look for this pattern but I did look for "deleted". I will add this pattern to the list for the next time the problem occurs. I was hoping I would get some responses like this--thanks!
On Thu, Oct 08, 2020 at 11:12:54AM -0400, Yves Bellefeuille wrote:
centos2@foxengines.net wrote:
It's not working, I haven't been able to identify the files. They aren't there. All attempts to measure disk usage of / by files shows that the disk usage is only a percentage of available space and that there should be space available.
This could also be helpful:
Regards, Simon
On Thu, Oct 08, 2020 at 12:31:34PM -0400, centos2@foxengines.net wrote:
find / -maxdepth 1 -xdev -type d | while read -r d; do du -shx "$d"; done
If you want to use du to find sparse files, add --apparent-size.
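To illustrate the --apparent-size suggestion, a small helper can print both figures side by side. This is a sketch assuming GNU du (as shipped with CentOS); the name size_pair is made up for illustration.

```shell
# Print "apparent_kb allocated_kb path" for one file.
# Apparent size is the length of the file; allocated size is the
# disk space actually backing it.
size_pair() (
    apparent=$(du -k --apparent-size -- "$1" | cut -f1)
    allocated=$(du -k -- "$1" | cut -f1)
    echo "$apparent $allocated $1"
)
```

A sparse file shows an apparent size larger than its allocated size. Since plain du already counts allocated blocks, a sparse file would look bigger in ls than in du, so by itself it would probably not account for space that df reports as used but du cannot find.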
Does the filesystem have a fixed number of inodes? Perhaps the problem is the number of files, not their sizes.
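The inode question can be checked with df -i. Filesystems such as ext4 fix the inode count when the filesystem is created, so they can report "no space left" while blocks are still free. A minimal sketch follows; the helper name inode_use_pct is made up, and it assumes the POSIX-format output of df -iP on stdin.

```shell
# Read `df -iP` output on stdin and print the IUse% value of the first
# filesystem line, without the trailing percent sign.
inode_use_pct() {
    awk 'NR==2 { sub(/%$/, "", $5); print $5 }'
}

# Typical use during an episode:
#   df -iP / | inode_use_pct
```

A value near 100 while df -h still shows free blocks would point at the number of files rather than their sizes.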
Does the filesystem have an explicit free list? If so, I'd expect there to be tools that could tell you how much was on it.