
How to Troubleshoot a Linux Kernel Crash? (2)

Date: 2015-03-06 17:08:48  Author: qipeng  Source: 系統之家

　　Next I went on to analyze another error in the kernel log: "BUG: soft lockup - CPU#N stuck for 4278190091s! [qmgr/master:PID]". I have edited the message slightly. The N after CPU# stands for a specific CPU number, which differs from server to server, and the process name and PID inside the square brackets also vary, although the process is always either qmgr or master. The statistics are as follows:

　　IP:        107       108       109       110       111       112       113       114

　　Log time:  13:01:20  14:03:34  14:05:44  14:22:44  14:19:58  14:17:12  14:22:49  14:19:58

　　Error log type and process: 1 qmgr, 1 master; 2 qmgr, 1 qmgr; 2 master, 1 qmgr; 1 qmgr; 2 master, 1 qmgr; 2 master

　　Error type 1 is the error mentioned earlier, which does not cause the kernel to hang. Type 2 is the error analyzed here, which leads to a Linux kernel panic. As the statistics show, only 107 and 110 did not hang at the time.
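
　　The message quoted above can be picked apart with a short script when many log files have to be tallied. The following is a minimal sketch, not the author's tooling: the sample line, its CPU number and its PID are made up for illustration, and the helper name parse_soft_lockup is hypothetical. It also prints the reported duration in hexadecimal and in years, which makes it easy to see that the figure is far larger than any real stall could be.

```python
import re

# Pattern for the soft lockup message quoted in the article.
SOFT_LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return cpu, secs, comm and pid as a dict, or None if the line does not match."""
    m = SOFT_LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m.group("cpu")),
        "secs": int(m.group("secs")),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
    }


# Hypothetical sample line: the CPU number and PID are invented.
sample = "BUG: soft lockup - CPU#3 stuck for 4278190091s! [qmgr:1234]"
info = parse_soft_lockup(sample)

# Express the reported duration in years to show how implausible it is.
years = info["secs"] / (365.25 * 24 * 3600)
print(info["comm"], hex(info["secs"]), round(years, 1))
```

　　The duration 4278190091 is 0xff00000b in hexadecimal and corresponds to well over a century, so it cannot be a measured stall time.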

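　　Continuing with the kernel error logs above, I found one striking thing in common: the value 4278190091s. First, what this value means. Normally, if a CPU goes more than 10s without feeding the watchdog (running the watchdog routine), the kernel reports a soft lockup error and hangs. Here, however, the value is 4278190091s, and it is identical on every machine, so it can reasonably be treated as a fixed, deterministic error. To check this idea I searched the Red Hat website for the error message and, to my great excitement, found the very same bug (URL: https://access.redhat.com/knowledge/solutions/68466). The affected Red Hat release and kernel version both match ours (Red Hat 6.2 corresponds to CentOS 6.2). The error description and solution are as follows: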

　　Does Red Hat Enterprise Linux 6 or 5 have a reboot problem which is caused by sched_clock() overflow around 208.5 days?

　　(Updated 21 Feb 2013, 5:11 AM GMT. KCS Solution content by Marc Milgram.)

　　Issue

　　•Linux Kernel panics when sched_clock() overflows after an uptime of around 208.5 days.

　　•Red Hat Enterprise Linux 6.1 system reboots with sched_clock() overflow after an uptime of around 208.5 days

　　•This symptom may happen on the systems using the CPU which has TSC.

　　•A process showing BUG: soft lockup - CPU#N stuck for 4278190091s!

　　Environment

　　•Red Hat Enterprise Linux 6

　　◦Red Hat Enterprise Linux 6.0, 6.1 and 6.2 are affected

　　◦several kernels affected, see below

　　◦TSC clock source - **see root cause

　　•Red Hat Enterprise Linux 5

　　◦Red Hat Enterprise Linux 5.3, 5.6, 5.8: please refer to the resolution section for affected kernels

　　◦Red Hat Enterprise Linux 5.0, 5.1, 5.2, 5.4, 5.5, 5.7: all kernels affected

　　◦Red Hat Enterprise Linux 5.9 and later are not affected

　　◦TSC clock source - **see root cause

　　•An approximate uptime of around 208.5 days.

　　Resolution

　　•Red Hat Enterprise Linux 6

　　◦Red Hat Enterprise Linux 6.x: update to kernel-2.6.32-279.el6 (from RHSA-2012-0862) or later. This kernel is already part of RHEL6.3GA. This fix was implemented with (private) bz765720.

　　◦Red Hat Enterprise Linux 6.2: update to kernel-2.6.32-220.4.2.el6 (from RHBA-2012-0124) or later. This fix was implemented with (private) bz781974.

　　◦Red Hat Enterprise Linux 6.1 Extended Update Support: update to kernel-2.6.32-131.26.1.el6 (from RHBA-2012-0424) or later. This fix was implemented with (private) bz795817.

　　•Red Hat Enterprise Linux 5

　　◦architecture x86_64/64bit

　　■Red Hat Enterprise Linux 5.x: upgrade to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later. RHEL5.9GA and later already contain this fix.

　　■Red Hat Enterprise Linux 5.8.z: upgrade to kernel-2.6.18-308.11.1.el5 (from RHSA-2012-1061) or later.

　　■Red Hat Enterprise Linux 5.6.z: upgrade to kernel-2.6.18-238.40.1.el5 (from RHSA-2012-1087) or later.

　　■Red Hat Enterprise Linux 5.3.z: upgrade to kernel-2.6.18-128.39.1.el5 (from RHBA-2012-1093) or later.

　　◦architecture x86/32bit

　　■Red Hat Enterprise Linux 5.x: upgrade to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later. RHEL5.9GA and later already contain this fix.

　　■Red Hat Enterprise Linux 5.8.z: upgrade to kernel-2.6.18-308.13.1.el5 (from RHSA-2012-1174) or later.

　　■Red Hat Enterprise Linux 5.6.z: upgrade to kernel-2.6.18-238.40.1.el5 (from RHSA-2012-1087) or later.

　　■Red Hat Enterprise Linux 5.3.z: upgrade to kernel-2.6.18-128.39.1.el5 (from RHBA-2012-1093) or later.

　　Root Cause

　　•An insufficiently designed calculation in the CPU accelerator in the previous kernel caused an arithmetic overflow in the sched_clock() function. This overflow led to a kernel panic or any other unpredictable trouble on the systems using the Time Stamp Counter (TSC) clock source.

　　•This problem will occur only when system uptime becomes 208.5 days or exceeds 208.5 days.

　　•This update corrects the aforementioned calculation so that this arithmetic overflow and kernel panic can no longer occur under these circumstances.

　　•On Red Hat Enterprise 5, this problem is a timing issue and very very rare to happen.

　　•**Switching to another clocksource is usually not a workaround for most of customers as the TSC is a fast access clock whereas the HPET and PMTimer are both slow access clocks. Using notsc would be a significant performance hit.

　　Diagnostic Steps

　　Note: This issue could likely happen in numerous locals that deal with time in the kernel. For example, a user running a non-Red Hat kernel had the kernel panic with a soft lockup in __ticket_spin_lock.

　　From the information above we can be confident that this is a bug in the Linux kernel. The cause is also described in detail above: on kernels for the x86_64 architecture, an overflow occurs once uptime exceeds 208.5 days.
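
　　The 208.5-day figure can be reproduced with a little arithmetic. The sketch below rests on an assumption that the quoted article does not spell out: that the nanosecond timestamp is computed as a 64-bit product shifted right by 10 bits, so the product wraps once the nanosecond count reaches 2^64 / 2^10 = 2^54. The names SCALE_SHIFT and days_until_overflow are hypothetical and exist only for this illustration.

```python
NS_PER_DAY = 24 * 3600 * 10**9

# Assumption (not stated in the source): nanoseconds come from
# (cycles * scale) >> 10 in a 64-bit integer, so the intermediate product
# wraps when the nanosecond count reaches 2**64 / 2**10 = 2**54.
SCALE_SHIFT = 10
overflow_ns = 2**64 >> SCALE_SHIFT
overflow_days = overflow_ns / NS_PER_DAY


def days_until_overflow(proc_uptime_text):
    """Days left before uptime reaches the overflow point (negative once past it).

    proc_uptime_text is the content of /proc/uptime, whose first field
    is the uptime in seconds, for example "17971200.00 123.45".
    """
    uptime_s = float(proc_uptime_text.split()[0])
    return overflow_days - uptime_s / 86400


print(round(overflow_days, 1))
```

　　On a live host the function could be called as days_until_overflow(open("/proc/uptime").read()) to see how close an unpatched machine is to the threshold.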

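　　Although the information above confirmed the cause of the kernel panic, I wanted to know whether Taobao's kernel engineers had run into the same problem, so I contacted a Taobao kernel engineer on QQ whom I had chatted with before. It turned out that they had hit the same error and had also been unable to reproduce it. Their solution was likewise to upgrade the kernel.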
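
　　Because the fix is a kernel upgrade, it helps to check whether a host already runs a fixed build. The sketch below compares a kernel release string with the fixed RHEL 6.2 build quoted above (kernel-2.6.32-220.4.2.el6). It is a simplified numeric comparison written for this illustration, not RPM's real version-comparison algorithm, and the helper names release_key and has_fix are hypothetical.

```python
import re

# Fixed build for RHEL 6.2 as listed in the quoted resolution section.
FIXED_RHEL62 = "2.6.32-220.4.2.el6"


def release_key(release):
    """Turn '2.6.32-220.4.2.el6.x86_64' into (2, 6, 32, 220, 4, 2).

    Only the numeric fields before the '.el' dist tag are kept.
    """
    head = release.split(".el")[0]
    return tuple(int(n) for n in re.findall(r"\d+", head))


def has_fix(running, fixed=FIXED_RHEL62):
    """True if the running release is at or beyond the fixed build."""
    return release_key(running) >= release_key(fixed)


# Sample release strings; on a live host use platform.release() instead.
for rel in ("2.6.32-220.el6.x86_64", "2.6.32-279.el6.x86_64"):
    print(rel, has_fix(rel))
```

　　For other release streams, such as the RHEL 5 builds listed above, the fixed parameter would need to be set to the matching build.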

　　4. Summary

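　　That concludes this walkthrough of troubleshooting a Linux kernel crash. As the article shows, tracking down kernel problems is fairly difficult and takes both patience and technical skill.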
