Tuesday, September 1, 2015

Debugging Process Hang

A few month ago, when I was launching my Visual Studio, it took almost 2 minutes until it was loaded completely. It seemed like a process hang. I wondered what the problem was, so I decided to dump the system into a complete dump file.

I looked for a running thread of msdev.exe after I opened the dump file with WinDbg.

0: kd> !process 0 3f msdev.exe
PROCESS fffffa800fce6b30
    SessionId: 1  Cid: 1a10    Peb: 7efdf000  ParentCid: 1268
    DirBase: 1dfa1c000  ObjectTable: fffff8a00862a110  HandleCount: 314.
    Image: MSDEV.EXE
    VadRoot fffffa80107d3c90 Vads 367 Clone 0 Private 12752. Modified 197443. Locked 0.
    DeviceMap fffff8a007ec5160
    Token                             fffff8a0157176f0
    ElapsedTime                       00:00:45.997
    UserTime                          00:00:00.000
    KernelTime                        00:00:00.062
    ...

        THREAD fffffa80126a0060  Cid 1a10.19d8  Teb: 000000007efdb000 Win32Thread: fffff900c2d53c20 RUNNING on processor 3

        Not impersonating
        DeviceMap                 fffff8a007ec5160
        Owning Process            fffffa800fce6b30       Image:         MSDEV.EXE
        Attached Process          N/A            Image:         N/A
        Wait Start TickCount      131878         Ticks: 0
        Context Switch Count      156311         IdealProcessor: 7                 LargeStack
        UserTime                  00:00:21.762
        KernelTime                00:00:13.041
*** ERROR: Module load completed but symbols could not be loaded for MSDEV.EXE
        Win32 Start Address MSDEV (0x000000000040155c)
        Stack Init fffff8800c441530 Current fffff8800c441210
        Base fffff8800c442000 Limit fffff8800c435000 Call fffff8800c441580
        Priority 9 BasePriority 8 UnusualBoost 0 ForegroundBoost 0 IoPriority 2 PagePriority 5

        Child-SP          RetAddr           Call Site

        fffff880`0c440680 fffff880`012aa404 Ntfs!NtfsFindPrefix+0x222
        fffff880`0c440730 fffff880`012a7af3 Ntfs!NtfsFindStartingNode+0x6e4
        fffff880`0c440800 fffff880`0128a149 Ntfs!NtfsCommonCreate+0x3d3
        fffff880`0c4409d0 fffff880`01140367 Ntfs!NtfsNetworkOpenCreate+0x119
        fffff880`0c440cb0 fffff880`011438ba fltmgr!FltpPerformFastIoCall+0x357
        fffff880`0c440d10 fffff880`01161a9f fltmgr!FltpPassThroughFastIo+0xda
        fffff880`0c440d50 fffff800`035d3c72 fltmgr!FltpFastIoQueryOpen+0x10f
        fffff880`0c440df0 fffff800`035cf878 nt!IopParseDevice+0x1249
        fffff880`0c440f50 fffff800`035d0a96 nt!ObpLookupObjectName+0x588
        fffff880`0c441040 fffff800`035aef66 nt!ObOpenObjectByName+0x306
        fffff880`0c441110 fffff800`032d2e53 nt!NtQueryAttributesFile+0x145
        fffff880`0c4413a0 00000000`7754168a nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`0c4413a0)
        00000000`0008c9f8 00000000`73a2ae19 ntdll!NtQueryAttributesFile+0xa
        00000000`0008ca00 00000000`73a1d18f wow64!whNtQueryAttributesFile+0x91
        00000000`0008ca80 00000000`750c2776 wow64!Wow64SystemServiceEx+0xd7
        00000000`0008d340 00000000`73a1d286 wow64cpu!ServiceNoTurbo+0x2d
        00000000`0008d400 00000000`73a18a90 wow64!RunCpuSimulation+0xa
        00000000`0008d450 00000000`739e2c52 wow64!Wow64KiUserCallbackDispatcher+0x204
        00000000`0008d7a0 00000000`775411f5 wow64win!whcbfnDWORD+0xe2
        00000000`0008e190 00000000`739efe4a ntdll!KiUserCallbackDispatcherContinue (TrapFrame @ 00000000`0008e058)
        00000000`0008e218 00000000`739caf02 wow64win!ZwUserMessageCall+0xa
        00000000`0008e220 00000000`739e281f wow64win!whNT32NtUserMessageCallCB+0x32
        00000000`0008e270 00000000`739cb03e wow64win!Wow64DoMessageThunk+0x8b
        00000000`0008e2b0 00000000`73a1d18f wow64win!whNtUserMessageCall+0x12e
        00000000`0008e350 00000000`750c2776 wow64!Wow64SystemServiceEx+0xd7
        00000000`0008ec10 00000000`73a1d286 wow64cpu!ServiceNoTurbo+0x2d
        00000000`0008ecd0 00000000`73a1c69e wow64!RunCpuSimulation+0xa
        00000000`0008ed20 00000000`77534966 wow64!Wow64LdrpInitialize+0x42a
        00000000`0008f270 00000000`77531937 ntdll!LdrpInitializeProcess+0x17e3
        00000000`0008f760 00000000`7751c34e ntdll! ?? ::FNODOBFM::`string'+0x28ff0
        00000000`0008f7d0 00000000`00000000 ntdll!LdrInitializeThunk+0xe


The thread was consuming UserTime 21 seconds and KernelTime 13 seconds. The other threads were not meaningful because they were suspended all.

I could be aware of that the running thread was handling a file by calling NtQueryAttributesFile from the call stack. I wondered what the file was as well so I decided to get the file name from the first parameter of NtQueryAttributesFile.

Even though analyzing 64 bit call stack is a little tough task but I can suggest simplified steps as following.


1. Finding the stack location of the first parameter

This is the function prototype of NtQueryAttributesFile.

NTSTATUS 
NtQueryAttributesFile(
    POBJECT_ATTRIBUTES ObjectAttributes,
    PFILE_BASIC_INFORMATION FileInformation
);

The first parameter ObjectAttributes contains a file name so we need to find a location where ObjectAttributes saved from the code flow of NtQueryAttributesFile.

0: kd> uf nt!NtQueryAttributesFile
nt!NtQueryAttributesFile:
fffff800`035aee20 48895c2418      mov     qword ptr [rsp+18h],rbx
fffff800`035aee25 56              push    rsi
fffff800`035aee26 57              push    rdi
fffff800`035aee27 4154            push    r12
fffff800`035aee29 4881ec70020000  sub     rsp,270h
fffff800`035aee30 488b05a9fbe9ff  mov     rax,qword ptr [nt!_security_cookie (fffff800`0344e9e0)]
fffff800`035aee37 4833c4          xor     rax,rsp
fffff800`035aee3a 4889842460020000 mov     qword ptr [rsp+260h],rax
fffff800`035aee42 488bda          mov     rbx,rdx
fffff800`035aee45 488bf1          mov     rsi,rcx  ; saving 1st param (rsi <= rcx)
fffff800`035aee48 65488b042588010000 mov   rax,qword ptr gs:[188h]
fffff800`035aee51 0fb6b8f6010000  movzx   edi,byte ptr [rax+1F6h]
fffff800`035aee58 4084ff          test    dil,dil
fffff800`035aee5b 7432            je      nt!NtQueryAttributesFile+0x6e (fffff800`035aee8f)

nt!NtQueryAttributesFile+0x3d:
fffff800`035aee5d 488bca          mov     rcx,rdx
fffff800`035aee60 f6c207          test    dl,7
fffff800`035aee63 750e            jne     nt!NtQueryAttributesFile+0x53 (fffff800`035aee73)

nt!NtQueryAttributesFile+0x45:
fffff800`035aee65 488b0594c1f5ff  mov     rax,qword ptr [nt!MmUserProbeAddress (fffff800`0350b000)]
fffff800`035aee6c 483bd0          cmp     rdx,rax
fffff800`035aee6f 7308            jae     nt!NtQueryAttributesFile+0x58 (fffff800`035aee79)

nt!NtQueryAttributesFile+0x51:
fffff800`035aee71 eb09            jmp     nt!NtQueryAttributesFile+0x5b (fffff800`035aee7c)

nt!NtQueryAttributesFile+0x53:
fffff800`035aee73 e818070b00      call    nt!ExRaiseDatatypeMisalignment (fffff800`0365f590)
fffff800`035aee78 cc              int     3

nt!NtQueryAttributesFile+0x58:
fffff800`035aee79 488bc8          mov     rcx,rax

nt!NtQueryAttributesFile+0x5b:
fffff800`035aee7c 0fb601          movzx   eax,byte ptr [rcx]
fffff800`035aee7f 8801            mov     byte ptr [rcx],al
fffff800`035aee81 0fb64127        movzx   eax,byte ptr [rcx+27h]
fffff800`035aee85 884127          mov     byte ptr [rcx+27h],al
fffff800`035aee88 eb05            jmp     nt!NtQueryAttributesFile+0x6e (fffff800`035aee8f)

nt!NtQueryAttributesFile+0x6e:
fffff800`035aee8f 41bcb8000000    mov     r12d,0B8h
fffff800`035aee95 4d8bc4          mov     r8,r12
fffff800`035aee98 33d2            xor     edx,edx
fffff800`035aee9a 488d4c2450      lea     rcx,[rsp+50h]
fffff800`035aee9f e8ec6ed2ff      call    nt!memset (fffff800`032d5d90)
fffff800`035aeea4 41bb08000000    mov     r11d,8
fffff800`035aeeaa 6644895c2450    mov     word ptr [rsp+50h],r11w
fffff800`035aeeb0 664489642452    mov     word ptr [rsp+52h],r12w
fffff800`035aeeb6 b807000000      mov     eax,7
fffff800`035aeebb 6689842496000000 mov     word ptr [rsp+96h],ax
fffff800`035aeec3 c78424a800000001000000 mov dword ptr [rsp+0A8h],1
fffff800`035aeece c784249000000000402000 mov dword ptr [rsp+90h],204000h
fffff800`035aeed9 48899c24b0000000 mov     qword ptr [rsp+0B0h],rbx
fffff800`035aeee1 488d842410010000 lea     rax,[rsp+110h]
fffff800`035aeee9 48898424b8000000 mov     qword ptr [rsp+0B8h],rax
fffff800`035aeef1 c68424d100000001 mov     byte ptr [rsp+0D1h],1
fffff800`035aeef9 488d842450010000 lea     rax,[rsp+150h]
fffff800`035aef01 48898424d8000000 mov     qword ptr [rsp+0D8h],rax
fffff800`035aef09 4889b42480000000 mov     qword ptr [rsp+80h],rsi     <= saving rsi to a stack area
fffff800`035aef11 c78424e000000020000000 mov dword ptr [rsp+0E0h],20h
fffff800`035aef1c 65488b042588010000 mov   rax,qword ptr gs:[188h]
fffff800`035aef25 48ff8030030000  inc     qword ptr [rax+330h]
fffff800`035aef2c 65ff042564220000 inc     dword ptr gs:[2264h]
fffff800`035aef34 488d442440      lea     rax,[rsp+40h]
fffff800`035aef39 4889442430      mov     qword ptr [rsp+30h],rax
fffff800`035aef3e 488d442450      lea     rax,[rsp+50h]
fffff800`035aef43 4889442428      mov     qword ptr [rsp+28h],rax
fffff800`035aef48 c744242080000000 mov     dword ptr [rsp+20h],80h
fffff800`035aef50 4533c9          xor     r9d,r9d
fffff800`035aef53 440fb6c7        movzx   r8d,dil
fffff800`035aef57 488b15eac0f5ff  mov     rdx,qword ptr [nt!IoFileObjectType (fffff800`0350b048)]
fffff800`035aef5e 488bce          mov     rcx,rsi    <= rsi is passed to 1st param of ObOpenObjectByName
fffff800`035aef61 e82a180200      call    nt!ObOpenObjectByName (fffff800`035d0790)


We know that a first parameter of a function is passed through rcx on 64 bit platform so I traced the data flow of rcx from entry point of NtQueryAttributesFile.

I can summarize the code snippet like this.

fffff800`035aee45 mov rsi,rcx  ; saving 1st param (rsi <= rcx)
fffff800`035aef09 mov qword ptr [rsp+80h],rsi  ; saving rsi to [rsp+80]
fffff800`035aef5e mov rcx,rsi  ; rsi is passed to 1st param of ObOpenObjectByName

The first parameter of ObOpenObjectByName is also POBJECT_ATTRIBUTES.

NTSTATUS STDCALL
ObOpenObjectByName(
  POBJECT_ATTRIBUTES ObjectAttributes,
  POBJECT_TYPE ObjectType,
  PVOID ParseContext,
  KPROCESSOR_MODE AccessMode,
  ACCESS_MASK DesiredAccess,
  PACCESS_STATE PassedAccessState,
  PHANDLE Handle
);

I can be sure that rsi contains the first parameter of NtQueryAttributesFile because it is not changed between fffff8000'035aee45 and fffff800'035aef5e, and it is saved on [rsp+80h].


2. Finding the base pointer of stack frame

I need to know the rsp value when NtQueryAttributesFile is running in order to dereference [rsp+80h] in the stack frame. I can get the rsp of NtQueryAttributesFile from the call stack.

Child-SP          RetAddr           Call Site
fffff880`0c441040 fffff800`035aef66 nt!ObOpenObjectByName+0x306
fffff880`0c441110 fffff800`032d2e53 nt!NtQueryAttributesFile+0x145

WinDbg shows Child-SP kindly, it is rsp in the function so rsp of NtQueryAttributesFile is fffff880'0c441110.

On 64 bit Windows, rsp is a base pointer of stack frame like ebp on 32 bit platform. The rsp value is not changed until the function returns.


3. Checking the parameter

Now, we can read the value stored on [rsp+80].

0: kd> dq fffff880`0c441110+80
fffff880`0c441190  00000000`0008d2e0 00000000`00000000
fffff880`0c4411a0  00070000`00200000 00000000`00000000
...

0: kd> dq 00000000`0008d2e0
00000000`0008d2e0  00000032`00000030 00000000`00000000
00000000`0008d2f0  00000000`0008d2d0 00000000`00000040
00000000`0008d300  00000000`00000000 00000000`00000000

The 0x00000000'0008d2e0 is POBJECT_ATTRIBUTES ObjectAttributes.

0: kd> dt _OBJECT_ATTRIBUTES
ntdll!_OBJECT_ATTRIBUTES
   +0x000 Length           : Uint4B
   +0x008 RootDirectory    : Ptr64 Void
   +0x010 ObjectName       : Ptr64 _UNICODE_STRING
   +0x018 Attributes       : Uint4B
   +0x020 SecurityDescriptor : Ptr64 Void
   +0x028 SecurityQualityOfService : Ptr64 Void

I am interested in the ObjectName field, it has 0x00000000'0008d2d0.

0: kd> dt _UNICODE_STRING 00000000`0008d2d0
ntdll!_UNICODE_STRING
 "\??\D:\TEMP\Project\TestTool\dummy\dummy.cpp"
   +0x000 Length           : 0x5A
   +0x002 MaximumLength    : 0x21a
   +0x008 Buffer           : 0x00000000`04ffd1b8  "\??\D:\TEMP\Project\TestTool\dummy\dummy.cpp"


4. The cause of the process hang

"Oops, the path does not exist."

I realized something after I looked at the path. I had removed the path a few days ago before the process hang happened. The Visual Studio might have a bug that it could not deal with a removed path of recent workspaces correctly.

The process hang was fixed after I had deleted all temporary files of Visual Studio. :)


Sunday, November 16, 2014

Debugging memory corruption (Advanced)

Sometimes software debugging is too tired task if there is no evidence to find a root cause of the problem, in that case, we need to concentrate on collecting any little clues in the crash dump. Therefore, I’d like to name the subtitle of this article as “Looking for a needle in a haystack”.

Today, we are going to look at a crash dump caused by memory corruption. We can guess that this is a memory corruption case and mydrv.sys might corrupt the memory when we open the crash dump at a glance but it’s hard to find the source code line which has the bug in mydrv.sys because this crash dump does not have any call stack related with mydrv.sys. Generally, after memory corruption has been happened and then the kernel would be crashed later, so we have to invent some proper way in order to find out the root cause of the problem.

Let’s go with "!analyze -v"

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************
KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e)
This is a very common bugcheck.  Usually the exception address pinpoints
the driver/function that caused the problem.  Always note this address
as well as the link date of the driver/image that contains this address.
Some common problems are exception code 0x80000003.  This means a hard
coded breakpoint or assertion was hit, but this system was booted
/NODEBUG.  This is not supposed to happen as developers should never have
hardcoded breakpoints in retail code, but ...
If this happens, make sure a debugger gets connected, and the
system is booted /DEBUG.  This will let us see why this breakpoint is
happening.
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: 8092e20b, The address that the exception occurred at
Arg3: f3a4ea60, Trap Frame
Arg4: 00000000
Debugging Details:
------------------
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - 0x%08lx
FAULTING_IP: 
nt!ObpLookupDirectoryEntry+ee
8092e20b 394608          cmp     dword ptr [esi+8],eax
TRAP_FRAME:  f3a4ea60 -- (.trap 0xfffffffff3a4ea60)
ErrCode = 00000000
eax=0000e470 ebx=e1fdb600 ecx=e1753c2c edx=00000011 esi=00740065 edi=e1753be8
eip=8092e20b esp=f3a4ead4 ebp=f3a4eaec iopl=0         nv up ei pl nz na pe nc
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010206
nt!ObpLookupDirectoryEntry+0xee:
8092e20b 394608          cmp     dword ptr [esi+8],eax ds:0023:0074006d=????????
Resetting default scope
DEFAULT_BUCKET_ID:  DRIVER_FAULT
BUGCHECK_STR:  0x8E
PROCESS_NAME:  iexplore.exe
CURRENT_IRQL:  0
LAST_CONTROL_TRANSFER:  from 8085bba7 to 8087c480
STACK_TEXT:  
f3a4e62c 8085bba7 0000008e c0000005 8092e20b nt!KeBugCheckEx+0x1b
f3a4e9f0 808346b4 f3a4ea0c 00000000 f3a4ea60 nt!KiDispatchException+0x3a2
f3a4ea58 80834668 f3a4eaec 8092e20b badb0d00 nt!CommonDispatchException+0x4a
f3a4ea78 80935a06 00000000 f3a4eb67 f3a4eb7c nt!Kei386EoiHelper+0x186
f3a4eaec 8092c2ab e1753b98 f3a4eb14 00000050 nt!SepAccessCheck+0x41d
f3a4eb58 80939f97 00000000 f3a4ec3c 00000000 nt!ObpLookupObjectName+0x45f
f3a4ec48 809305c8 00000000 00000000 00000000 nt!ObReferenceObjectByName+0x8e
f3a4ed10 80920300 0559cd88 0315b948 0315b954 nt!NtSecureConnectPort+0x1d6
f3a4ed3c 80833bdf 0559cd88 0315b948 0315b954 nt!NtConnectPort+0x24
f3a4ed3c 7c9685ec 0559cd88 0315b948 0315b954 nt!KiFastCallEntry+0xfc
...
STACK_COMMAND:  kb
FOLLOWUP_IP: 
nt!ObpLookupDirectoryEntry+ee
8092e20b 394608          cmp     dword ptr [esi+8],eax
SYMBOL_STACK_INDEX:  0
SYMBOL_NAME:  nt!ObpLookupDirectoryEntry+ee
FOLLOWUP_NAME:  MachineOwner
MODULE_NAME: nt
IMAGE_NAME:  ntkrnlmp.exe
DEBUG_FLR_IMAGE_TIMESTAMP:  45ec14ca
FAILURE_BUCKET_ID:  0x8E_nt!ObpLookupDirectoryEntry+ee
BUCKET_ID:  0x8E_nt!ObpLookupDirectoryEntry+ee
Followup: MachineOwner
---------


According to the messages, this is a very common bugcheck but there is not useful information to debug.
There is no my module name on the call stack except NT kernel, so there are no hints to analyze this problem.

At the end of the messages, it points a function that the exception occurred at is nt!ObpLookupDirectoryEntry+ee, but it is not useful either because it’s really rare case that it is from NT kernel bug, most cases are from third party driver’s bug or mine. Therefore, we need to make every possible effort to find out any hint of the crash.

Let’s begin with .trap command.

0: kd> .trap 0xfffffffff3a4ea60

ErrCode = 00000000

eax=0000e470 ebx=e1fdb600 ecx=e1753c2c edx=00000011 esi=00740065 edi=e1753be8
eip=8092e20b esp=f3a4ead4 ebp=f3a4eaec iopl=0         nv up ei pl nz na pe nc
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010206
nt!ObpLookupDirectoryEntry+0xee:
8092e20b 394608          cmp     dword ptr [esi+8],eax ds:0023:0074006d=????????


It was crashed when the instruction at 0x8092e20b which dereferences 0x0074006d was executed.  
0x0074006d is invalid address so the exception could not be handled.

Where is the value 0x0074006d from?
We can trace back the last executed instructions.

0: kd> ub eip

nt!ObpLookupDirectoryEntry+0xd1:

8092e1eb f00fb111        lock cmpxchg dword ptr [ecx],edx
8092e1ef 85c0            test    eax,eax
8092e1f1 0f8554ccfdff    jne     nt!ObpLookupDirectoryEntry+0xd9 (8090ae4b)
8092e1f7 c746103412dddd  mov     dword ptr [esi+10h],0DDDD1234h
8092e1fe 8b33            mov     esi,dword ptr [ebx]    <=== esi = 00740065 from [ebx]
8092e200 85f6            test    esi,esi
8092e202 0f848a470000    je      nt!ObpLookupDirectoryEntry+0x122 (80932992)
8092e208 8b45f8          mov     eax,dword ptr [ebp-8]
0: kd> u eip
nt!ObpLookupDirectoryEntry+0xee:
8092e20b 394608          cmp     dword ptr [esi+8],eax    <=== faulting address
8092e20e 0f851a690000    jne     nt!ObpLookupDirectoryEntry+0x112 (80934b2e)
8092e214 8b4604          mov     eax,dword ptr [esi+4]
8092e217 0fb648f4        movzx   ecx,byte ptr [eax-0Ch]

0: kd> r


Last set context:

eax=0000e470 ebx=e1fdb600 ecx=e1753c2c edx=00000011 esi=00740065 edi=e1753be8
eip=8092e20b esp=f3a4ead4 ebp=f3a4eaec iopl=0         nv up ei pl nz na pe nc
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010206
nt!ObpLookupDirectoryEntry+0xee:
8092e20b 394608          cmp     dword ptr [esi+8],eax ds:0023:0074006d=????????


0x0074006d was from esi+8 where esi was 0x00740065.
0x00740065 in esi was from above address 0x8092e1fe, it was from [ebx].

The current value in ebx is 0xe1fdb600 and the data in the address should be a valid memory pointer, but the data 0x00740065 was not valid memory pointer.
We can check the memory information of 0xe1fdb600 using !pool command.

0: kd> !pool e1fdb600

Pool page e1fdb600 region is Paged pool

 e1fdb000 size:  290 previous size:    0  (Allocated)  Gla:
 e1fdb290 size:   10 previous size:  290  (Free)       9.0.
 ...
 e1fdb490 size:  168 previous size:   a0  (Allocated)  MyMm
e1fdb5f8 is not a valid large pool allocation, checking large session pool...
e1fdb5f8 is freed (or corrupt) pool
Bad previous allocation size @e1fdb5f8, last size was 2d
***
*** An error (or corruption) in the pool was detected;
...
None found


There are several memory informations but it can't display memory informations from 0xe1fdb5f8, the message indicates that the memory area could be freed or corrupted. So we can guess the original data in 0xe1fdb600 might be valid memory pointer, but it was corrupted or overwritten by someone else.

Generally, a driver using the valid memory 0xe1fdb490 nearby corrupted area is suspect. The memory tag of 0xe1fdb490 is MyMm and this tag belongs to mydrv.sys. Therefore, mydrv.sys might corrupt the data in 0xe1fdb600 while it handles it's memory which has MyMm tag. 

However, we can’t fix this problem yet because we don’t know where mydrv.sys uses 0xe1fdb490, MyMm is general purpose memory tag in this driver so it’s hard to find specific code using this memory tag. How can we find the source code position of mydrv.sys we have to fix?

We have only one clue that mydrv.sys had used the memory 0xe1fdb490, we have to find the code using it. To simplify this problem, I settle an assumption that 0xe1fd490 is saved in the memory area of mydrv.sys as global variable.

We can check memory range of mydrv.sys like this:

0: kd> lmvm mydrv
start    end        module name
f3fa3000 f400ba00   mydrv   (deferred)             
    Image path: \??\C:\WINDOWS\system32\drivers\mydrv.sys
    Image name: mydrv.sys
    ...


mydrv.sys is located from 0xf3fa3000 to 0xf400ba00. I will search for a global variable containing 0xe1db490. I have to search for 0xe1fdb498 (0xe1fdb490 + 8) because 0xe1fdb490 displayed in !pool command output is the address of memory header, the real address allocated to mydrv.sys is 0xe1fdb498 where 0xe1fdb490 + memory header size.

We can search the memory as following:

0: kd> s f3fa3000 f400ba00 98 b4 fd e1


There is no result. It means that there is no global variable containing this value in mydrv.sys.
We need to change our strategy, next step would be to search whole memory area in order to find the memory allocated to mydrv.sys, it is like looking for a needle in a haystack. :)

I'll check the paged pool memory range because mydrv.sys uses only Paged Pool.
We can check Paged Pool range using !pool command without parameter.

0: kd> !pool

Paged Pool: e1000000 .. eb7fffff
e1000000: 1000 - busy
e1001000: 1000 - busy
e1002000: 1000 - busy
...
...
...
e3e74000: 22000 - busy
e3e9d000: 15000 - free
e3eb2000: 41000 - free
e3ef3000: 22000 - busy
e3f15000: a0c000 - free
e4921000: 6edf000 - free


The Paged Pool range is from 0xe1000000 to 0xeb7fffff but those are all freed from 0xe3f15000 so the valid address range for searching would be from 0xe1000000 to 0xe3f15000. Let’s look for a needle(0xe1fdb498) in the pool range.

0: kd> s e1000000 e3f15000 98 b4 fd e1
e115aec8  98 b4 fd e1 00 00 00 00-05 06 26 0c 42 63 4d 63  ..........&.MyMm


Wow! I’ve got a needle at 0xe115aec8. As you can see, we can find a meaningful information with any little clue.

To check the memory information, use !pool command.

0: kd> !pool e115aec8
Pool page e115aec8 region is Paged pool
 e115a000 size:  9d0 previous size:    0  (Allocated)  MyMm
 e115a9d0 size:   18 previous size:  9d0  (Allocated)  Ntf0 ...
 e115ae20 size:   48 previous size:   68  (Allocated)  Ntfc
 e115ae68 size:   40 previous size:   48  (Allocated)  MyMm
*e115aea8 size:   28 previous size:   40  (Allocated) *MyMm
  Owning component : Unknown (update pooltag.txt)


0xe115aec8 is within 0xe115aea8 allocation and memory tag MyMm means that it belongs to mydrv.sys
MyMm is general purpose memory tag, we can’t know yet what the memory is for.

I will search the pool for 0xe115ea8 in order to verify which code uses the memory.
We need to use 0xe115aeb0, added memory header size 8.

0: kd> s e1000000 e3f15000 b0 ae 15 e1
e18407e4  b0 ae 15 e1 a0 b4 fd e1-32 00 00 00 c4 ea a4 f3  ........2.......


0xe18407e4 contains 0xe115aeb0, check it with !pool.

0: kd> !pool e18407e4
Pool page e18407e4 region is Paged pool
 e1840000 size:   b8 previous size:    0  (Allocated)  Port (Protected)
 e18400b8 size:   10 previous size:   b8  (Allocated)  Glnk
  ...
 e1840720 size:   30 previous size:   18  (Allocated)  CMVa
 e1840750 size:   18 previous size:   30  (Allocated)  Uspi Process: 85b31c88
*e1840768 size:   98 previous size:   18  (Allocated) *MyMm  Owning component : Unknown (update pooltag.txt)
 e1840800 size:   50 previous size:   98  (Allocated)  LStb ...


Hmm… it is same pattern. Search for 0xe1840770 one more time.

0: kd> s e1000000 e3f15000 70 07 84 e1
e2f3f874  70 07 84 e1 48 21 13 e2-01 00 00 00 00 00 00 00  p...H!..........


0xe2f3f874 contains 0xe1840770 , check it with !pool.

0: kd> !pool e2f3f874
Pool page e2f3f874 region is Paged pool
 e2f3f000 size:  100 previous size:    0  (Allocated)  IoNm
 e2f3f100 size:   98 previous size:  100  (Allocated)  NtFs ...
 e2f3f760 size:  108 previous size:   38  (Allocated)  CM39
*e2f3f868 size:   58 previous size:  108  (Allocated) *MySt
  Owning component : Unknown (update pooltag.txt)
 e2f3f8c0 size:   80 previous size:   58  (Allocated)  IoNm ...


0xe2f3f874 is within 0xe2f3f868 allocation and memory tag MySt means that it belongs to mydrv.sys as well, but MySt tag is used for specific structure in the driver so we can retrieve the structure from source code and match with the memory data. We are lucky!

typedef struct _MY_STRUCT {
 ULONG            ulType;
 DRV_STRUCT  *pDrvStruct1;
 DRV_STRUCT  *pDrvStruct2;
 ...
} MY_STRUCT;


This is MY_STRUCT structure and TYPE_ID of MY_STRUCT is defined as 0x10. 
We have to make sure the first field ulType in the memory 0xe2f3f870 (0xe2f3f868 + 8).

0: kd> db e2f3f870
e2f3f870  10 00 00 00 70 07 84 e1-48 21 13 e2 01 00 00 00  ....p...H!......
e2f3f880  00 00 00 00 00 00 00 00-65 73 00 00 00 00 00 00  ........es......
e2f3f890  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................


It has correct value 0x10 so we are sure the memory is for MY_STRUCT. From now, we can trace back the addresses we've found above with structure defined in the source code of mydrv.sys. 

Use dt command with structure type and address.

0: kd> dt mydrv!MY_STRUCT e2f3f870
   +0x000 ulType          : 16
   +0x004 pDrvStruct1    : 0xe1840770 
   +0x008 pDrvStruct2    : 0xe2132148 
   ...


We can see the familiar address 0xe1840770 and the type of pDrvStruct1 is DRV_STRUCT.

0: kd> dt mydrv!DRV_STRUCT e1840770
   +0x000 id               : 64 
    ...
   +0x074 pPathStruct   : 0xe115aeb0 
   +0x078 pOsPath   : 0xe1fdb4a0  -> 0x5c
   ...


We can see the familiar address 0xe115aeb0 as well and the type of pPathStruct is PATH_STRUCT.

0: kd> dt mydrv!PATH_STRUCT e115aeb0
   +0x000 id               : 6d
   ...
   +0x018 pPathData     : 0xe1fdb498


Eventaully, we can see the address 0xe1fdb498 and the type of pPathData is PATH_DATA.

Let’s check the contents of the structure.

0: kd> dt mydrv!PATH_DATA e1fdb498
   +0x000 pPrev            : (null) 
   +0x004 pLimit           : 0xe1fdb5f2  "u"
   +0x008 Data             : []  "\"

0: kd> db e1fdb498
e1fdb498  00 00 00 00 f2 b5 fd e1-5c 00 44 00 65 00 76 00  ........\.D.e.v.
e1fdb4a8  69 00 63 00 65 00 5c 00-48 00 61 00 72 00 64 00  i.c.e.\.H.a.r.d.
e1fdb4b8  64 00 69 00 73 00 6b 00-56 00 6f 00 6c 00 75 00  d.i.s.k.V.o.l.u.
e1fdb4c8  6d 00 65 00 32 00 5c 00-57 00 49 00 4e 00 44 00  m.e.2.\.W.I.N.D.

e1fdb4d8  4f 00 57 00 53 00 5c 00-73 00 79 00 73 00 74 00  O.W.S.\.s.y.s.t.
e1fdb4e8  65 00 6d 00 33 00 32 00-5c 00 6f 00 6c 00 65 00  e.m.3.2.\.o.l.e.
e1fdb4f8  33 00 32 00 2e 00 64 00-6c 00 6c 00 00 00 6c 00  3.2...d.l.l...l.
e1fdb508  6c 00 00 00 6c 00 6c 00-00 00 74 00 6f 00 72 00  l...l.l...t.o.r.
e1fdb518  5c 00 4c 00 6f 00 63 00-61 00 6c 00 20 00 53 00  \.L.o.c.a.l. .S.
e1fdb528  65 00 74 00 74 00 69 00-6e 00 67 00 73 00 5c 00  e.t.t.i.n.g.s.\.
e1fdb538  54 00 65 00 6d 00 70 00-6f 00 72 00 61 00 72 00  T.e.m.p.o.r.a.r.
e1fdb548  79 00 20 00 49 00 6e 00-74 00 65 00 72 00 6e 00  y. .I.n.t.e.r.n.
e1fdb558  65 00 74 00 20 00 46 00-69 00 6c 00 65 00 73 00  e.t. .F.i.l.e.s.
e1fdb568  5c 00 43 00 6f 00 6e 00-74 00 65 00 6e 00 74 00  \.C.o.n.t.e.n.t.
e1fdb578  2e 00 49 00 45 00 35 00-5c 00 39 00 46 00 4c 00  ..I.E.5.\.9.F.L.
e1fdb588  4b 00 4f 00 4d 00 4a 00-31 00 5c 00 6d 00 61 00  K.O.M.J.1.\.m.a.
e1fdb598  70 00 5b 00 31 00 5d 00-2e 00 67 00 69 00 66 00  p.[.1.]...g.i.f.
e1fdb5a8  00 00 5d 00 2e 00 67 00-69 00 66 00 00 00 69 00  ..]...g.i.f...i.
e1fdb5b8  66 00 00 00 00 00 69 00-66 00 00 00 69 00 63 00  f.....i.f...i.c.
e1fdb5c8  65 00 5c 00 48 00 61 00-72 00 64 00 64 00 69 00  e.\.H.a.r.d.d.i.
e1fdb5d8  73 00 6b 00 56 00 6f 00-6c 00 75 00 6d 00 65 00  s.k.V.o.l.u.m.e.
e1fdb5e8  32 00 5c 00 44 00 6f 00-63 00 75 00 6d 00 65 00  2.\.D.o.c.u.m.e.
e1fdb5f8  6e 00 74 00 73 00 53 00-65 00 74 00 74 00 69 00  n.t.s.S.e.t.t.i.
e1fdb608  6e 00 67 00 73 00 5c 00-41 00 64 00 6d 00 69 00  n.g.s.\.A.d.m.i.
e1fdb618  6e 00 69 00 73 00 74 00-72 00 61 00 74 00 6f 00  n.i.s.t.r.a.t.o.
e1fdb628  72 00 5c 00 4c 00 6f 00-63 00 61 00 6c 00 20 00  r.\.L.o.c.a.l. .
e1fdb638  53 00 65 00 74 00 74 00-69 00 6e 00 67 00 73 00  S.e.t.t.i.n.g.s.
e1fdb648  5c 00 54 00 65 00 6d 00-70 00 6f 00 72 00 61 00  \.T.e.m.p.o.r.a.
e1fdb658  72 00 79 00 20 00 49 00-6e 00 74 00 65 00 72 00  r.y. .I.n.t.e.r.
e1fdb668  6e 00 65 00 74 00 20 00-46 00 69 00 6c 00 65 00  n.e.t. .F.i.l.e.
e1fdb678  73 00 5c 00 43 00 6f 00-6e 00 74 00 65 00 6e 00  s.\.C.o.n.t.e.n.
e1fdb688  74 00 2e 00 49 00 45 00-35 00 5c 00 39 00 46 00  t...I.E.5.\.9.F.
e1fdb698  4c 00 4b 00 4f 00 4d 00-4a 00 31 00 5c 00 62 00  L.K.O.M.J.1.\.b.
e1fdb6a8  30 00 32 00 5f 00 6f 00-6e 00 5f 00 6f 00 66 00  0.2._.o.n._.o.f.
e1fdb6b8  66 00 2d 00 4c 00 5b 00-31 00 5d 00 2e 00 67 00  f.-.L.[.1.]...g.
e1fdb6c8  69 00 66 00 00 00 01 00-88 20 c6 85 88 a8 cd 85  i.f...... ......


It is filled with string data from 0xe1fdb498 to 0xe1fdb6c8 but the range of memory allocated to 0xe1fdb498 is from 0xe1fdb490 to 0xe1fdb5f7.

0: kd> !pool e1fdb498
Pool page e1fdb498 region is Paged pool
 e1fdb000 size:  290 previous size:    0  (Allocated)  Gla:
 ...
 e1fdb3f0 size:   a0 previous size:    8  (Allocated)  Gla8
*e1fdb490 size:  168 previous size:   a0  (Allocated) *MyMm
  Owning component : Unknown (update pooltag.txt)
e1fdb5f8 is not a valid large pool allocation, checking large session pool...
e1fdb5f8 is freed (or corrupt) pool


Maybe, 0xe1fdb5f8 was allocated to NT kernel so there are NT kernel functions on the top of the call stack in this crash dump. NT kernel lost its data and was crashed because mydrv.sys overwrote the string data over its allocation range.

Finally, We've found a bug of the function handling PATH_DATA in mydrv.sys corrupts a memory in the paged pool. 

It was my fault. :)

Monday, June 2, 2014

Suspended Process (KiDeliverApc)

There are a lot of software problems we can experience when we are using our computer. Today, I am going to talk about a story of analysis for a problem from suspended process.

It was a system hang case. There were a lot of processes suspended in the system. I am going to find out who suspends the processes.

Here is a call stack of taskmgr.exe in the system.


THREAD 8956e660  Cid 0c18.0bcc  Teb: 7ffdd000 Win32Thread: e2975360 WAIT: (Suspended) KernelMode Non-Alertable SuspendCount 1
...
    Owning Process            899cd5a8       Image:      taskmgr.exe
    Attached Process          N/A            Image:         N/A
...
    ChildEBP RetAddr  
    aad52c18 80504d50 nt!KiSwapContext+0x2f (FPO: [Uses EBP] [0,0,4])
    aad52c24 804fcf40 nt!KiSwapThread+0x8a (FPO: [0,0,0])
    aad52c4c 8050448c nt!KeWaitForSingleObject+0x1c2 (FPO: [5,5,4])
    aad52c64 80500dfa nt!KiSuspendThread+0x18 (FPO: [3,0,0])
    aad52cac 80504d6e nt!KiDeliverApc+0x124 (FPO: [3,10,0])
    aad52cc4 804fcf40 nt!KiSwapThread+0xa8 (FPO: [0,0,0])
    aad52cec 805c108c nt!KeWaitForSingleObject+0x1c2 (FPO: [5,5,4])
    aad52d50 8054289c nt!NtWaitForSingleObject+0x9a (FPO: [Non-Fpo])
    aad52d50 7c93e514 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ aad52d64)
    00dbff84 7c802532 ntdll!KiFastSystemCallRet
    00dbff98 0100d90d kernel32!WaitForSingleObject+0x12 (FPO: [2,0,0])
    00dbffb4 7c80b699 taskmgr!WorkerThread+0x3a (FPO: [1,0,4])
    00dbffec 00000000 kernel32!BaseThreadStart+0x37 (FPO: [Non-Fpo])


We can see KiSwapContext call on top of the call stack, it usually means context switching but there is an odd thing in this call stack that there are two KiSwapThread calls, this is not usual case. The KiSwapThread in the middle of the call stack is for context switching, what is the second KiSwapTherad on top of the call stack?

You can see a call flow that KiDeliverApc, KiSuspendThread, KeWaitForSingleObject, KiSwapThread, so the second KiSwapThread is from KiSuspendThread. I can figure out that some process called SuspendThread API to taskmgr.exe and it didn’t call ResumeThread.

Now we need to find out a process to call SuspendThread to all other processes in the system. In order to find the caller of SuspendThread, first of all, we have to understand KiDeliverApc function to call KiSuspendThread and APC(Asynchronous Procedure Call) mechanism in Windows.

Windows kernel has a concept of IRQL (Interrupt Request Level) and thread scheduler of Windows kernel do thread context switching at DISPATCH_LEVEL (It is IRQL 2). After scheduler chooses thread to run, IRQL becomes lower to APC_LEVEL and the APCs queued in the thread are executed and then IRQL becomes lower to PASSIVE_LEVEL that the thread codes can run.

Therefore, KiDeliverApc means that it delivers APCs queued in the thread and KiSuspendThread called by KiDeliverApc means that KiSuspendThread was inserted into APC queue of the thread by APC. The prototype of KiDeliverApc is as following.


VOID
NTAPI
KiDeliverApc(IN KPROCESSOR_MODE DeliveryMode,
             IN PKEXCEPTION_FRAME ExceptionFrame,
             IN PKTRAP_FRAME TrapFrame)

It doesn’t take any APC structures through it’s parameters. I wondered how KiDeliverApc deliver APC of the thread, so I analyzed inside of KiDeliverApc function. Following pseudo codes represent the functionality of it.

KiDeliverApc()
{
    PKTHREAD Thread = KeGetCurrentThread();

    //
    // If APCs exist in ApcState.ApcListHead of current thread,
    // KiDeliverApc gets all APCs from the list and call KernelRoutine, NormalRoutine.
    //    
    while (!IsListEmpty(&Thread->ApcState.ApcListHead[KernelMode]))
    {
        ApcListEntry = Thread->ApcState.ApcListHead[KernelMode].Flink;
        Apc = CONTAINING_RECORD(ApcListEntry, KAPC, ApcListEntry);

        KernelRoutine = Apc->KernelRoutine;
        NormalRoutine = Apc->NormalRoutine;
        KernelRoutine(Apc, ...);
        NormalRoutine(...);
    }
}

The point is that someone inserts APC containing KiSuspendThread as NormalRoutine to ApcState.ApcListHead of current thread.

To find out who is someone, I tried to set memory break point at the address of Thread->ApcState.ApcListHead in order to get debugger break when any APC is inserted to this thread where ApcState field offset is 0x34 in KTHREAD structure.


kd> dt _KTHREAD ApcState.
nt!_KTHREAD
   +0x034 ApcState  : 
      +0x000 ApcListHead : [2] _LIST_ENTRY
      +0x010 Process   : Ptr32 _KPROCESS
      +0x014 KernelApcInProgress : UChar
      +0x015 KernelApcPending : UChar
      +0x016 UserApcPending : UChar
   +0x138 ApcStatePointer : [2] 
   +0x165 ApcStateIndex : UChar

kd> ba w4 (Address of Thread+0x34)


Following call stack is when debugger has got the break point. This is an operation during thread creation time.

kd> k
ChildEBP RetAddr  
f9b9bb60 804fdaa3 nt!KiInsertQueueApc+0x79
f9b9bb80 805c7436 nt!KeSuspendThread+0x67
f9b9bcc4 805c7f02 nt!PspCreateThread+0x570
f9b9bd3c 8053ea48 nt!NtCreateThread+0xfc
f9b9bd3c 7c93e514 nt!KiFastCallEntry+0xf8

kd> ub nt!KiInsertQueueApc+0x79
nt!KiInsertQueueApc+0x64:
804ff374 eb3f            jmp     nt!KiInsertQueueApc+0xa5 (804ff3b5)
804ff376 0fbeda          movsx   ebx,dl
804ff379 8d3cdf          lea     edi,[edi+ebx*8]
804ff37c 8b5f04          mov     ebx,dword ptr [edi+4]
804ff37f 8d700c          lea     esi,[eax+0Ch]       ; ApcListEntry
804ff382 893e            mov     dword ptr [esi],edi
804ff384 895e04          mov     dword ptr [esi+4],ebx  
804ff387 8933            mov     dword ptr [ebx],esi ; Insertion to list


A list is inserted at instruction pointer 804ff387, it means that ebx is the address of Thread->ApcState.ApcListHead and esi is an address of ListEntry. esi was from instruction pointer 804ff37f, it is eax+0Ch. We can be aware of esi is ApcListEntry and eax is an address of KAPC by following structure.

kd> dt _KAPC @eax
nt!_KAPC
   +0x000 Type             : 18
   +0x002 Size             : 48
   +0x004 Spare0           : 0
   +0x008 Thread           : 0x812f1020 _KTHREAD
   +0x00c ApcListEntry     : _LIST_ENTRY [ 0x812f1054 - 0x812f1054 ]
   +0x014 KernelRoutine    : 0x80501ed4     void  nt!KiSuspendNop+0
   +0x018 RundownRoutine   : 0x805277ba     void  nt!PopAttribNop+0
   +0x01c NormalRoutine    : 0x8050230a     void  nt!KiSuspendThread+0
   +0x020 NormalContext    : (null) 
   +0x024 SystemArgument1  : (null) 
   +0x028 SystemArgument2  : (null) 
   +0x02c ApcStateIndex    : 0 ''
   +0x02d ApcMode          : 0 ''
   +0x02e Inserted         : 0 ''


We can see nt!KiSuspendThread+0 in NormalRoutine at offset 0x01c. We've found an instruction pointer where KAPC is inserted and field offset of KAPC where KiSuspendThread pointer saved, now I am going to set a break point at 804ff389 (nt!KiInsertQueueApc+0x79) with a condition only if the NormalRoutine is KiSuspendThread in order to get a break point when someone inserts an APC having KiSuspendThread.

kd> u nt!KiSuspendThread L1
nt!KiSuspendThread:
8050230a 64a124010000    mov     eax,dword ptr fs:[00000124h]

kd> bp nt!KiInsertQueueApc+79 ".if (poi(@eax+1c)==8050230a) {} .else {gc}”

After several break hits from thread creation, eventually, I've got a call stack as following.

kd> k
ChildEBP RetAddr  
f766ecc8 804fdaa3 nt!KiInsertQueueApc+0x79
f766ece8 805cb671 nt!KeSuspendThread+0x67
f766ed28 805cb706 nt!PsSuspendThread+0x6f
f766ed44 805cb8fe nt!PsSuspendProcess+0x28
f766ed58 8053ea48 nt!NtSuspendProcess+0x40
f766ed58 7c93e514 nt!KiFastCallEntry+0xf8
WARNING: Stack unwind information not available. Following frames may be wrong.
00000000 00000000 ntdll!KiFastSystemCallRet

This call stack means that this process called Win32 API SuspendProcess in user mode and it has a responsibility to call ResumeProcess but it didn’t, so the process is the root cause of system hang.