
How Do You Troubleshoot a Linux Kernel Crash?

Date: 2015-03-06 17:08:48  Author: qipeng  Source: 系統之家

  When the Linux kernel panics, the whole system goes down with it. On a personal machine that is merely annoying; on company servers it can cause real damage. Below is a walkthrough of how one such kernel crash was tracked down.


  1. Overview

  At some point, several of the Linux servers running a project's production distributed file system went down with kernel panics, seriously hurting the project's ability to serve its users and causing quite a stir inside the company. Investigation of the incident established that the machines had hit a Linux kernel panic, and over two phases of troubleshooting the cause of that panic was essentially pinned down. The main tools were searching for related reports online, analyzing the kernel error logs, and trying to construct conditions that would reproduce the crash. This document is a summary of my own work over the whole investigation.

  2. Phase One

  When the problem first appeared everyone was under pressure and working late every night, and many plans were drawn up for reproducing the problem and locating the cause. In this first phase I spent two solid weeks on the analysis. Since almost everything I did then was written up in detailed analysis documents at the time, below I mainly summarize the measures I took and what they achieved.

  The first phase itself went in several steps. The obvious first goal was to reproduce what had happened in production, so I started from the kernel error logs of the affected machines. According to the logs the panics were triggered by the qmgr and master processes (at least that is what the messages suggested). Combined with the symptoms at the time, namely very high load and machines that could no longer serve requests, my first idea was to write a program that kept sending mail, since qmgr and master are both mail-related processes. When the program ran I got briefly excited: the load shot up and the machine became sluggish (even ssh logins slowed down), which looked close to the production symptoms. But the kernel never panicked no matter how hard I pushed the sending rate, and it logged no errors either, so this lead was gradually ruled out.
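
  As a rough illustration of that kind of load generator (a reconstruction after the fact, not the original program; the sendmail path and the recipient address are placeholder assumptions), something along these lines keeps the local mail daemons busy:

    /* Sketch of a mail-flood load generator: repeatedly pipe a tiny message
     * to the local MTA so its queue-manager processes stay busy.
     * The binary path and recipient below are illustrative assumptions. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        for (;;) {
            FILE *mta = popen("/usr/sbin/sendmail -t", "w");
            if (!mta) {
                perror("popen");
                return 1;
            }
            fputs("To: test@localhost\n"
                  "Subject: load test\n"
                  "\n"
                  "kernel panic reproduction load\n", mta);
            pclose(mta);
            usleep(1000);   /* crude rate control; lower it to push load higher */
        }
    }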

  Since all of the failing servers ran the distributed file system, the suspicion naturally fell on it. The business monitoring data for that time window, however, showed nothing unusual on the file system side, and the data volume was not large. I still set the distributed file system up on a few virtual machines and wrote a Java program that kept writing files into the cluster through the file system client, while also running the mail-sending program, to get as close to the production environment as I could. Many long runs later nothing resembling the production failure had appeared, and I had no better way to reproduce it.

  With reproduction off the table, the only option left was to work directly from the kernel error messages. The analysis itself was straightforward: find the code that raised the error, then study the surrounding code and its callers. The detailed analysis is in last year's documents.

  From that code analysis and from similar bugs reported online, the cause was basically narrowed down to an overflow in the CPU scheduling time calculation, which made the watchdog raise a panic and hang the kernel. Based on that hypothesis I tried to construct the overflow condition by modifying the kernel: a kernel module altered the kernel's time counter. The modification itself succeeded, but unfortunately it killed the kernel outright, so reproducing the problem by patching the kernel failed as well.

  后面也陸續(xù)咨詢了很多公司外面熟悉內(nèi)核的技術(shù)人員,他們根據(jù)我們提供的信息業(yè)給出了自己的分析,但是也沒(méi)有很好的重現(xiàn)方法和確切的定位錯(cuò)誤原因,而且不同的人給出的結(jié)論差異也比較大。

  So after two to three weeks of continuously chasing the problem, the first phase ended without a definite answer.

  3. Phase Two

  With the new year I picked the problem up again on the first day back. I set myself a simple plan: from 5 to 8 every evening I would work on locating the kernel problem, and learn the relevant kernel internals along the way.

  This time I started from a different angle. Last year the analysis had been based on the kernel logs of a single server, and that approach was exhausted, so I decided to analyze the logs of all the failed servers together (fortunately I had earlier asked operations for the kernel logs of every failed server and saved them, otherwise there would have been nothing left to work with) and look for what they had in common. The first common point was a warning printed by the trace subsystem, "Delta way too big!...". According to the available information this warning does not by itself hang a Linux system, and indeed not all of our production servers had become unreachable over ssh. Still, I found a similar bug on Red Hat's official site (https://access.redhat.com/knowledge/solutions/70051), together with a resolution. The bug information and resolution are as follows:

  why kernel is throwing "Delta way too big" out with WARNING: at kernel/trace/ring_buffer.c:1988 rb_reserve_next_event+0x2ce/0x370 messages

  Issue

  kernel is throwing "Delta way too big" out with kernel oops on server

  Environment

  •Red Hat Enterprise Linux 6 service pack 1

  Resolution

  The "Delta way too big" warning might appear on a system with an unstable sched clock right after the system is resumed, if tracing was enabled during the suspend.

  Since it's not really a bug, and the unstable sched clock is otherwise fast and reliable, we suggest keeping the sched clock in any case and just making a note of it in the warning itself, or disabling tracing with: # echo 0 > /sys/kernel/debug/tracing/tracing_on

  Root Cause

  In this case ftrace was involved due to the call to ftrace_raw_event_power_end (debugfs is mounted and ftrace loaded in this case); the messages have to do with problems calculating a time stamp for a ring buffer entry.

  Message comes from here and appears to be indicating problems with time stability.

  1966 static int
  1967 rb_add_time_stamp(struct ring_buffer_per_cpu *cpu_buffer,
  1968                   u64 *ts, u64 *delta)
  1969 {
  1970         struct ring_buffer_event *event;
  1971         static int once;
  1972         int ret;
  1973
  1974         if (unlikely(*delta > (1ULL << 59) && !once++)) {
  1975                 int local_clock_stable = 1;
  1976 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
  1977                 local_clock_stable = sched_clock_stable;
  1978 #endif
  1979                 printk(KERN_WARNING "Delta way too big! %llu"
  1980                        " ts=%llu write stamp = %llu\n%s",
  1981                        (unsigned long long)*delta,
  1982                        (unsigned long long)*ts,
  1983                        (unsigned long long)cpu_buffer->write_stamp,
  1984                        local_clock_stable ? "" :
  1985                        "If you just came from a suspend/resume,\n"
  1986                        "please switch to the trace global clock:\n"
  1987                        "  echo global > /sys/kernel/debug/tracing/trace_clock\n");
  1988                 WARN_ON(1);

  This is called from rb_reserve_next_event() here.

  2122         /*
  2123          * Only the first commit can update the timestamp.
  2124          * Yes there is a race here. If an interrupt comes in
  2125          * just after the conditional and it traces too, then it
  2126          * will also check the deltas. More than one timestamp may
  2127          * also be made. But only the entry that did the actual
  2128          * commit will be something other than zero.
  2129          */
  2130         if (likely(cpu_buffer->tail_page == cpu_buffer->commit_page &&
  2131                    rb_page_write(cpu_buffer->tail_page) ==
  2132                    rb_commit_index(cpu_buffer))) {
  2133                 u64 diff;
  2134
  2135                 diff = ts - cpu_buffer->write_stamp;
  2136
  2137                 /* make sure this diff is calculated here */
  2138                 barrier();
  2139
  2140                 /* Did the write stamp get updated already? */
  2141                 if (unlikely(ts < cpu_buffer->write_stamp))
  2142                         goto get_event;
  2143
  2144                 delta = diff;
  2145                 if (unlikely(test_time_stamp(delta))) {
  2146
  2147                         commit = rb_add_time_stamp(cpu_buffer, &ts, &delta);  <---- HERE

  This has to do with time stamping for ring buffer entries.

  From the information above you can see that this is exactly the same code, and the same line of analysis, that I had gone through last year; I just could not be sure about the conclusion, because without a reproduction it remained a conjecture.
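
  As a side note on the check at line 1974 above (my own back-of-the-envelope illustration, not part of the Red Hat article): the (1ULL << 59) threshold corresponds to a delta of roughly 18 years expressed in nanoseconds, so reaching it points to a broken clock rather than real elapsed time.

    /* How large is the "Delta way too big" threshold, really? */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t threshold_ns = 1ULL << 59;            /* threshold checked in rb_add_time_stamp() */
        double seconds = threshold_ns / 1e9;
        double years   = seconds / (365.25 * 86400.0);

        printf("1 << 59 ns = %llu ns ~= %.0f s ~= %.1f years\n",
               (unsigned long long)threshold_ns, seconds, years);
        return 0;
    }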

  后面又繼續(xù)分析內(nèi)核中出現(xiàn)的另一個(gè)錯(cuò)誤,“BUG: soft lockup – CPU#N stuck for 4278190091s! [qmgr/master:進(jìn)程號(hào)]”,對(duì)上面的錯(cuò)誤信息我做了一點(diǎn)點(diǎn)處理,CPU#后面的N是對(duì)應(yīng)的一個(gè)具體的cpu編號(hào),這個(gè)在每一臺(tái)服務(wù)器是不一樣的,還有就是最后中括號(hào)中的進(jìn)程和進(jìn)程號(hào)碼不同,不過(guò)就是qmgr和master。如下統(tǒng)計(jì):

  IP    Time       Error log type and process
  107   13:01:20   type 1 (qmgr), type 1 (master)
  108   14:03:34   type 2 (qmgr), type 1 (qmgr)
  109   14:05:44   type 2 (master), type 1 (qmgr)
  110   14:22:44   type 1 (qmgr)
  111   14:19:58   type 2 (master), type 1 (qmgr)
  112   14:17:12   type 2 (master), type 1 (qmgr)
  113   14:22:49   type 2 (master), type 1 (qmgr)
  114   14:19:58   type 2 (master)

  錯(cuò)誤類型1就是上面提到的不會(huì)一起內(nèi)核掛起的錯(cuò)誤,2就是現(xiàn)在分析的這個(gè)錯(cuò)誤,會(huì)導(dǎo)致linux內(nèi)核panic。可以看出只有107和110當(dāng)時(shí)是沒(méi)有掛起的。

  Continuing with the kernel error logs, one big common point stood out: the value 4278190091s. To explain what it means: normally, if a CPU goes more than 10 seconds without touching the watchdog, the kernel reports a soft lockup and hangs. Here, however, the reported value was 4278190091 seconds (roughly 135 years), and it was identical on every server, so it can only be understood as a fixed value produced by some arithmetic error. To verify this idea I searched Red Hat's official site for the message and, to my great excitement, found exactly the same bug (https://access.redhat.com/knowledge/solutions/68466). The affected Red Hat release and kernel version matched ours (RHEL 6.2, to which CentOS 6.2 corresponds). The error information and resolution are as follows:

  Does Red Hat Enterprise Linux 6 or 5 have a reboot problem which is caused by sched_clock() overflow around 208.5 days?


  Issue

  •Linux Kernel panics when sched_clock() overflows after an uptime of around 208.5 days.

  •Red Hat Enterprise Linux 6.1 system reboots with sched_clock() overflow after an uptime of around 208.5 days

  •This symptom may happen on the systems using the CPU which has TSC.

  •A process showing BUG: soft lockup - CPU#N stuck for 4278190091s!

  Environment

  •Red Hat Enterprise Linux 6

  ◦Red Hat Enterprise Linux 6.0, 6.1 and 6.2 are affected

  ◦several kernels affected, see below

  ◦TSC clock source - **see root cause

  •Red Hat Enterprise Linux 5

  ◦Red Hat Enterprise Linux 5.3, 5.6, 5.8: please refer to the resolution section for affected kernels

  ◦Red Hat Enterprise Linux 5.0, 5.1, 5.2, 5.4, 5.5, 5.7: all kernels affected

  ◦Red Hat Enterprise Linux 5.9 and later are not affected

  ◦TSC clock source - **see root cause

  •An approximate uptime of around 208.5 days.

  Resolution

  •Red Hat Enterprise Linux 6

  ◦Red Hat Enterprise Linux 6.x: update to kernel-2.6.32-279.el6 (from RHSA-2012-0862) or later. This kernel is already part of RHEL6.3GA. This fix was implemented with (private) bz765720.

  ◦Red Hat Enterprise Linux 6.2: update to kernel-2.6.32-220.4.2.el6 (from RHBA-2012-0124) or later. This fix was implemented with (private) bz781974.

  ◦Red Hat Enterprise Linux 6.1 Extended Update Support: update to kernel-2.6.32-131.26.1.el6 (from RHBA-2012-0424) or later. This fix was implemented with (private) bz795817.

  •Red Hat Enterprise Linux 5

  ◦architecture x86_64/64bit

  ■Red Hat Enterprise Linux 5.x: upgrade to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later. RHEL5.9GA and later already contain this fix.

  ■Red Hat Enterprise Linux 5.8.z: upgrade to kernel-2.6.18-308.11.1.el5 (from RHSA-2012-1061) or later.

  ■Red Hat Enterprise Linux 5.6.z: upgrade to kernel-2.6.18-238.40.1.el5 (from RHSA-2012-1087) or later.

  ■Red Hat Enterprise Linux 5.3.z: upgrade to kernel-2.6.18-128.39.1.el5 (from RHBA-2012-1093) or later.

  ◦architecture x86/32bit

  ■Red Hat Enterprise Linux 5.x: upgrade to kernel-2.6.18-348.el5 (from RHBA-2013-0006) or later. RHEL5.9GA and later already contain this fix.

  ■Red Hat Enterprise Linux 5.8.z: upgrade to kernel-2.6.18-308.13.1.el5 (from RHSA-2012-1174) or later.

  ■Red Hat Enterprise Linux 5.6.z: upgrade to kernel-2.6.18-238.40.1.el5 (from RHSA-2012-1087) or later.

  ■Red Hat Enterprise Linux 5.3.z: upgrade to kernel-2.6.18-128.39.1.el5 (from RHBA-2012-1093) or later.

  Root Cause

  •An insufficiently designed calculation in the CPU accelerator in the previous kernel caused an arithmetic overflow in the sched_clock() function. This overflow led to a kernel panic or any other unpredictable trouble on the systems using the Time Stamp Counter (TSC) clock source.

  •This problem will occur only when system uptime becomes 208.5 days or exceeds 208.5 days.

  •This update corrects the aforementioned calculation so that this arithmetic overflow and kernel panic can no longer occur under these circumstances.

  •On Red Hat Enterprise 5, this problem is a timing issue and very very rare to happen.

  •**Switching to another clocksource is usually not a workaround for most of customers as the TSC is a fast access clock whereas the HPET and PMTimer are both slow access clocks. Using notsc would be a significant performance hit.

  Diagnostic Steps

  Note: This issue could likely happen in numerous locations that deal with time in the kernel. For example, a user running a non-Red Hat kernel had the kernel panic with a soft lockup in __ticket_spin_lock.

  The information above fully confirms that this is a Linux kernel bug, and its cause is described there in detail: on the affected x86_64 kernel versions, once the system has been up for more than about 208.5 days the calculation overflows.
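
  Where does the 208.5-day figure come from? On these kernels the TSC is converted to nanoseconds roughly as ns = cycles * cyc2ns >> 10, and the 64-bit multiplication wraps once about 2^54 ns have accumulated. The sketch below is my own user-space illustration of that arithmetic (the 2 GHz clock rate is an assumed example, not a value taken from our servers):

    /* User-space illustration of the sched_clock()/TSC overflow described in
     * the Red Hat solution above: ns = cycles * cyc2ns >> 10, where the 64-bit
     * product wraps. The 2.0 GHz CPU frequency is an assumed example value. */
    #include <stdio.h>
    #include <stdint.h>

    #define CYC2NS_SCALE_FACTOR 10     /* shift used by the kernel's cyc2ns conversion */

    int main(void)
    {
        uint64_t cpu_khz = 2000000ULL;                                      /* assumed 2.0 GHz CPU */
        uint64_t cyc2ns  = (1000000ULL << CYC2NS_SCALE_FACTOR) / cpu_khz;   /* ns per cycle, scaled by 1024 */

        /* First TSC value whose product with cyc2ns no longer fits in 64 bits. */
        uint64_t overflow_cycles = UINT64_MAX / cyc2ns + 1;
        double   overflow_days   = (double)overflow_cycles / (cpu_khz * 1000.0) / 86400.0;

        printf("cyc2ns scale           : %llu\n", (unsigned long long)cyc2ns);
        printf("64-bit multiply wraps  : after %llu cycles\n", (unsigned long long)overflow_cycles);
        printf("uptime at that point   : ~%.1f days\n", overflow_days);
        printf("i.e. about 2^54 ns     : %.1f days\n", (double)(1ULL << 54) / 1e9 / 86400.0);
        return 0;
    }

  Once the product wraps, sched_clock() jumps to a bogus value, which is also consistent with the fixed, nonsensical "stuck for 4278190091s" figure in the soft lockup reports.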

  Although the information above confirmed the cause of the kernel panic, I still wanted to know whether Taobao's kernel engineers had run into the same issue, so over QQ I asked a Taobao kernel engineer I had talked with before. It turned out they had hit the same error, had likewise been unable to reproduce it, and their solution was also to upgrade the kernel.
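
  Until affected machines can be moved to a fixed kernel, it is at least worth watching how close each host is getting to the dangerous uptime. A small sketch of such a check (my own addition; the 200-day warning threshold is an arbitrary example), reading /proc/uptime:

    /* Warn when a host's uptime approaches the ~208.5-day sched_clock() overflow
     * window on affected kernels. The 200-day threshold is an example value. */
    #include <stdio.h>

    int main(void)
    {
        double uptime_sec = 0.0;
        FILE *fp = fopen("/proc/uptime", "r");

        if (!fp) {
            perror("fopen /proc/uptime");
            return 1;
        }
        if (fscanf(fp, "%lf", &uptime_sec) != 1) {
            fprintf(stderr, "failed to parse /proc/uptime\n");
            fclose(fp);
            return 1;
        }
        fclose(fp);

        double uptime_days = uptime_sec / 86400.0;
        printf("uptime: %.1f days\n", uptime_days);
        if (uptime_days > 200.0)
            printf("WARNING: approaching the ~208.5-day sched_clock() overflow on affected kernels\n");
        return 0;
    }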

  4. Summary

  That is the story of how this Linux kernel crash was tracked down. As it shows, troubleshooting kernel problems is fairly hard and takes a good deal of patience and skill.

標(biāo)簽 內(nèi)核

發(fā)表評(píng)論

0

沒(méi)有更多評(píng)論了

評(píng)論就這些咯,讓大家也知道你的獨(dú)特見(jiàn)解

立即評(píng)論

以上留言僅代表用戶個(gè)人觀點(diǎn),不代表系統(tǒng)之家立場(chǎng)

其他版本軟件

熱門教程

人氣教程排行

Linux系統(tǒng)推薦

官方交流群 軟件收錄