[CentOS] SATA tape drive on 4.7

Mon Mar 23 04:39:14 UTC 2009
Ross Walker <rswwalker at gmail.com>

On Mar 22, 2009, at 11:04 PM, C Linus Hicks <linush at verizon.net> wrote:

> I have gotten a hard system lockup several times using star on my  
> system
> with a Quantum DLT-V4 SATA tape drive. I am including what got  
> recorded
> in /var/log/messages from the most recent event showing stack trace.
>
> Output from uname -a:
>
> Linux lh10 2.6.9-78.0.13.ELsmp #1 SMP Wed Jan 14 15:55:36 EST 2009
> x86_64 x86_64 x86_64 GNU/Linux
>
> Mar 22 22:02:59 lh10 kernel: Bad page state at free_hot_cold_page  
> (in process 'star', page 00000102382b1ec8)
> Mar 22 22:02:59 lh10 kernel: flags:0x0100000c mapping: 
> 00000102146a4240 mapcount:2 count:0
> Mar 22 22:02:59 lh10 kernel: Backtrace:
> Mar 22 22:02:59 lh10 kernel:
> Mar 22 22:02:59 lh10 kernel: Call Trace:<ffffffff8015f0cf>{bad_page 
> +112} <ffffffff8015f7b1>{free_hot_cold_page+130}
> Mar 22 22:02:59 lh10 kernel:         
> <ffffffffa00d17b8>{:st:sgl_unmap_user_pages+67}  
> <ffffffffa00d17df>{:st:release_buffering+27}
> Mar 22 22:02:59 lh10 kernel:        <ffffffffa00d6196>{:st:st_write 
> +2692} <ffffffff80194b39>{inode_update_time+159}
> Mar 22 22:02:59 lh10 kernel:        <ffffffff8017beaa>{vfs_write 
> +207} <ffffffff8017bf92>{sys_write+69}
> Mar 22 22:02:59 lh10 kernel:        <ffffffff801102f6>{system_call 
> +126}
> Mar 22 22:02:59 lh10 kernel: Trying to fix it up, but a reboot is  
> needed
> Mar 22 22:02:59 lh10 kernel: Bad page state at free_hot_cold_page  
> (in process 'star', page 000001023eb70ed8)
> Mar 22 22:02:59 lh10 kernel: flags:0x0100000c mapping: 
> 00000102146a4240 mapcount:2 count:0
> Mar 22 22:02:59 lh10 kernel: Backtrace:
> Mar 22 22:02:59 lh10 kernel:
> Mar 22 22:02:59 lh10 kernel: Call Trace:<ffffffff8015f0cf>{bad_page 
> +112} <ffffffff8015f7b1>{free_hot_cold_page+130}
> Mar 22 22:02:59 lh10 kernel:         
> <ffffffffa00d17b8>{:st:sgl_unmap_user_pages+67}  
> <ffffffffa00d17df>{:st:release_buffering+27}
> Mar 22 22:02:59 lh10 kernel:        <ffffffffa00d6196>{:st:st_write 
> +2692} <ffffffff80194b39>{inode_update_time+159}
> Mar 22 22:02:59 lh10 kernel:        <ffffffff8017beaa>{vfs_write 
> +207} <ffffffff8017bf92>{sys_write+69}
> Mar 22 22:02:59 lh10 kernel:        <ffffffff801102f6>{system_call 
> +126}
> Mar 22 22:02:59 lh10 kernel: Trying to fix it up, but a reboot is  
> needed
> Mar 22 22:02:59 lh10 kernel: Bad page state at free_hot_cold_page  
> (in process 'star', page 000001023d3bc220)
> Mar 22 22:02:59 lh10 kernel: flags:0x0100000c mapping: 
> 00000102146a4240 mapcount:2 count:0
> Mar 22 22:02:59 lh10 kernel: Backtrace:
> Mar 22 22:02:59 lh10 kernel:
> Mar 22 22:02:59 lh10 kernel: Call Trace:<ffffffff8015f0cf>{bad_page 
> +112} <ffffffff8015f7b1>{free_hot_cold_page+130}
> Mar 22 22:02:59 lh10 kernel:         
> <ffffffffa00d17b8>{:st:sgl_unmap_user_pages+67}  
> <ffffffffa00d17df>{:st:release_buffering+27}
> Mar 22 22:02:59 lh10 kernel:        <ffffffffa00d6196>{:st:st_write 
> +2692} <ffffffff80194b39>{inode_update_time+159}
> Mar 22 22:02:59 lh10 kernel:        <ffffffff8017beaa>{vfs_write 
> +207} <ffffffff8017bf92>{sys_write+69}
> Mar 22 22:02:59 lh10 kernel:        <ffffffff801102f6>{system_call 
> +126}
> Mar 22 22:02:59 lh10 kernel: Trying to fix it up, but a reboot is  
> needed
> Mar 22 22:02:59 lh10 kernel: Bad page state at free_hot_cold_page  
> (in process 'star', page 0000010238c68e80)
> Mar 22 22:02:59 lh10 kernel: flags:0x0100000c mapping: 
> 00000102146a4240 mapcount:2 count:0
> Mar 22 22:03:00 lh10 kernel: Backtrace:
> Mar 22 22:03:00 lh10 kernel:
> Mar 22 22:03:00 lh10 kernel: Call Trace:<ffffffff8015f0cf>{bad_page 
> +112} <ffffffff8015f7b1>{free_hot_cold_page+130}
> Mar 22 22:03:00 lh10 kernel:         
> <ffffffffa00d17b8>{:st:sgl_unmap_user_pages+67}  
> <ffffffffa00d17df>{:st:release_buffering+27}
> Mar 22 22:03:00 lh10 kernel:        <ffffffffa00d6196>{:st:st_write 
> +2692} <ffffffff80194b39>{inode_update_time+159}
> Mar 22 22:03:00 lh10 kernel:        <ffffffff8017beaa>{vfs_write 
> +207} <ffffffff8017bf92>{sys_write+69}
> Mar 22 22:03:00 lh10 kernel:        <ffffffff801102f6>{system_call 
> +126}
> Mar 22 22:03:00 lh10 kernel: Trying to fix it up, but a reboot is  
> needed
> Mar 22 22:03:00 lh10 kernel: Bad page state at free_hot_cold_page  
> (in process 'star', page 000001023c14d570)
> Mar 22 22:03:00 lh10 kernel: flags:0x0100000c mapping: 
> 00000102146a4240 mapcount:2 count:0
> Mar 22 22:03:00 lh10 kernel: Backtrace:
> Mar 22 22:03:00 lh10 kernel:
> Mar 22 22:03:00 lh10 kernel: Call Trace:<ffffffff8015f0cf>{bad_page 
> +112} <ffffffff8015f7b1>{free_hot_cold_page+130}
> Mar 22 22:03:00 lh10 kernel:         
> <ffffffffa00d17b8>{:st:sgl_unmap_user_pages+67}  
> <ffffffffa00d17df>{:st:release_buffering+27}
> Mar 22 22:03:00 lh10 kernel:        <ffffffffa00d6196>{:st:st_write 
> +2692} <ffffffff80194b39>{inode_update_time+159}
> Mar 22 22:03:00 lh10 kernel:        <ffffffff8017beaa>{vfs_write 
> +207} <ffffffff8017bf92>{sys_write+69}
> Mar 22 22:03:00 lh10 kernel:        <ffffffff801102f6>{system_call 
> +126}
> Mar 22 22:03:00 lh10 kernel: Trying to fix it up, but a reboot is  
> needed
> Mar 22 22:03:00 lh10 kernel: Bad page state at free_hot_cold_page  
> (in process 'X', page 000001023c14d570)
> Mar 22 22:03:00 lh10 kernel: flags:0x01020008 mapping: 
> 00000101ec811360 mapcount:0 count:0
> Mar 22 22:03:00 lh10 kernel: Backtrace:
> Mar 22 22:03:00 lh10 kernel:
> Mar 22 22:03:00 lh10 kernel: Call Trace:<ffffffff8015f0cf>{bad_page 
> +112} <ffffffff8015f7b1>{free_hot_cold_page+130}
> Mar 22 22:03:00 lh10 kernel:        <ffffffff8015f86a>{__pagevec_free 
> +39} <ffffffff80165a5d>{release_pages+366}
> Mar 22 22:03:01 lh10 kernel:        <ffffffff8016d921>{__vma_link 
> +66} <ffffffff8016dd53>{vma_adjust+815}
> Mar 22 22:03:01 lh10 kernel:         
> <ffffffff80165e4b>{__pagevec_mark_accessed+394}  
> <ffffffff80166091>{lru_add_drain+91}
> Mar 22 22:03:01 lh10 kernel:        <ffffffff8016e971>{do_munmap 
> +355} <ffffffff8016f55e>{sys_brk+136}
> Mar 22 22:03:01 lh10 kernel:        <ffffffff801102f6>{system_call 
> +126}
> Mar 22 22:03:01 lh10 kernel: Trying to fix it up, but a reboot is  
> needed
>
> star had been running about 20 to 30 minutes at the time of the  
> crash. I
> have seen two or three crashes previously all denoting "Bad page state
> at free_hot_cold_page."
>
> This appears to have been reported as a bug in 2.6.12:
>
> https://lists.linux-foundation.org/pipermail/bugme-new/2005-August/012862.html
>
> It was updated March 10, 2008 as rejected unreproducible. I have seen
> the problem three times in three days now without trying to reproduce
> it. Can I get any help on getting a fix?

This seems it may be a bad memory issue. When star is running it is  
causing a lot of io to be cached putting pressure on regions of memory  
that might otherwise go unused.

Think about running a memtest first or just swap the memory out for  
other memory completely compatible with your chipset since memory is  
probably cheaper then your time is these days.

You can always test that memory offline and if it's ok redeploy it  
otherwise a memtest on a machine with a large memory could take days.

-Ross