Fatal Trap 12 when attempting to detach drive

jcooper · Feb 1, 2013

Hello All,

This is more of a 'what should I have done' type of question than a complaint.

I had a drive that reported a pending sector via smartctl. The status reported for the drive in the pool was 'null'. I replaced the drive via the GUI with a standby drive, and the pool resilvered. Once that was done I clicked 'detach' and the system stopped responding. I checked the console and saw the Fatal trap 12 and the debugger prompt.

Code:

Fatal Trap 12: page fault while in kernel mode
cpuid=4; apic id = 04
fault virtual address = 0x8
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff8052e845
stack pointer = 0x28:ffffff80000f1a30
frame pointer = 0x28:ffffff80000f1b70
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2 (g_event)
[thread pid 2 tid 100021 ]
Stopped at g_part_ctlreq+0x1375: cmpq $0x80c1a180,0x8(%r13)

smartctl data from the drive prior to system hang:

Code:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       733
  3 Spin_Up_Time            0x0027   180   169   021    Pre-fail  Always       -       3975
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       90
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   199   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       2327
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       90
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       88
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1
194 Temperature_Celsius     0x0022   128   110   000    Old_age   Always       -       19
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1403         -
# 2  Extended offline    Completed without error       00%        46         -
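
For anyone who wants to check a table like this automatically, here is a minimal sketch that pulls the raw values out of the attribute rows and flags a non-zero Current_Pending_Sector count. The function name `parse_smart_attributes` and the sample text are illustrative only. It assumes the ten-column layout shown above and a plain integer in the RAW_VALUE column, which not every drive reports.

```python
# Sketch only: assumes the 10-column smartctl attribute layout shown above
# and that RAW_VALUE is a plain integer (some drives print extra text there).

SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""


def parse_smart_attributes(text):
    """Return {attribute_id: (name, raw_value)} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs


attrs = parse_smart_attributes(SAMPLE)
name, pending = attrs[197]
if pending > 0:
    print(f"{name}: {pending} pending sector(s)")
```

Header lines and the self-test log are skipped because they either do not start with a numeric ID or have fewer than ten columns.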

Unfortunately I didn't notice when the error actually occurred and only noticed when I logged into the GUI to do another task. I've since set up email alerts. I did grab this snippet from the logs:

Code:

Jan 25 02:11:13 freenas smartd[2936]: Device: /dev/da1 [SAT], 1 Currently unreadable (pending) sectors
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 c3 b1 0 1 0 0 length 131072 SMID 716 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 bd b1 0 1 0 0 length 131072 SMID 745 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 be b1 0 1 0 0 length 131072 SMID 75 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 bf b1 0 1 0 0 length 131072 SMID 564 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 c0 b1 0 1 0 0 length 131072 SMID 665 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: 
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 c1 b1 0 1 0 0 length 131072 SMID 926 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 c2 b1 0 1 0 0 length 131072 SMID 1000 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 bb b1 0 1 0 0 length 131072 SMID 361 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 bc b1 0 1 0 0 length 131072 SMID 782 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): WRITE(10). CDB: 2a 0 cb 27 ba b1 0 1 0 0 
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): CAM status: SCSI Status Error
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): SCSI status: Check Condition
Jan 25 02:19:48 freenas kernel: (da1:mps0:0:5:0): SCSI sense: ILLEGAL REQUEST asc:21,0 (Logical block address out of range)
Jan 25 02:20:00 freenas kernel: (da1:mps0:0:5:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0 length 0 SMID 140 terminated ioc 804b scsi 0 state c xfer 0
Jan 25 02:20:00 freenas kernel: (da1:mps0:0:5:0): READ(10). CDB: 28 0 e8 e0 86 90 0 0 10 0 length 8192 SMID 170 terminated ioc 804b scsi 0 state c xfer 0
Jan 25 02:20:00 freenas kernel: (da1:mps0:0:5:0): READ(10). CDB: 28 0 e8 e0 84 90 0 0 10 0 length 8192 SMID 615 terminated ioc 804b scsi 0 state c xfer 0
Jan 25 02:20:00 freenas kernel: 
Jan 25 02:20:00 freenas kernel: (da1:mps0:0:5:0): READ(10). CDB: 28 0 0 40 2 90 0 0 10 0 length 8192 SMID 421 terminated ioc 804b scsi 0 state c xfer 0
Jan 25 02:20:39 freenas kernel: (da1:mps0:0:5:0): TEST UNIT READY. CDB: 0 0 0 0 0 0 length 0 SMID 652 terminated ioc 804b scsi 0 state 0 xfer 0
Jan 25 02:20:39 freenas kernel: mps0: mpssas_alloc_tm freezing simq
Jan 25 02:20:39 freenas kernel: mps0: mpssas_remove_complete on handle 0x000a, IOCStatus= 0x0
Jan 25 02:20:39 freenas kernel: mps0: mpssas_free_tm releasing simq
Jan 25 02:20:39 freenas kernel: (da1:mps0:0:5:0): lost device - 4 outstanding, 1 refs
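
As a reading aid for the CDB fields in the log above, here is a minimal sketch that decodes a 10-byte READ(10)/WRITE(10) command block into its starting LBA and transfer length. The helper name `decode_rw10` is made up for illustration, and the 512-byte block size is an assumption. It does agree with the "length 131072" the kernel prints for a 256-block write.

```python
# Sketch only: decodes the standard 10-byte READ(10)/WRITE(10) CDB layout.
# block_size=512 is an assumption, not something the log states directly.

def decode_rw10(cdb_hex, block_size=512):
    """Decode a CDB given as space-separated hex bytes, as printed by the kernel."""
    cdb = [int(b, 16) for b in cdb_hex.split()]
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "opcode": cdb[0],                                  # 0x2a = WRITE(10), 0x28 = READ(10)
        "lba": int.from_bytes(bytes(cdb[2:6]), "big"),     # bytes 2-5: starting LBA
        "blocks": int.from_bytes(bytes(cdb[7:9]), "big"),  # bytes 7-8: transfer length
        "bytes": int.from_bytes(bytes(cdb[7:9]), "big") * block_size,
    }


write_cmd = decode_rw10("2a 0 cb 27 c3 b1 0 1 0 0")
read_cmd = decode_rw10("28 0 e8 e0 86 90 0 0 10 0")
print(hex(write_cmd["lba"]), write_cmd["blocks"], write_cmd["bytes"])
print(hex(read_cmd["lba"]), read_cmd["blocks"], read_cmd["bytes"])
```

Whether a decoded LBA is really past the end of the disk depends on the drive's capacity, which is not shown in this thread.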

Would I have avoided this issue by setting the bad drive to offline before doing the replace, and then detaching it if necessary? Or is there something else in play in this situation?

Thanks

jcooper · Feb 1, 2013

Followup question:

I assume it is safe to restart the machine?

paleoN · Feb 1, 2013

jcooper said:
I assume it is safe to restart the machine?

What other choice do you have at this point? The system did panic, yes?

jcooper · Feb 1, 2013

Fair point. :)
