PE Parser Module

The PE Parser loads executable files into memory, validates their PE structure, extracts sections and imports, and computes per-section entropy to identify the highest-entropy section. It is the data foundation for the classification pipeline.

Related: Scanner Module · ML Classifier Module · Scan Pipeline Flow


1. Source Files

| File | Purpose |
|------|---------|
| scanner/pe_parser.cpp | File loading, PE validation, section/import parsing, entropy calculation, SEH wrapper |
| scanner/pe_parser.h | PEParser class declaration, ParsedFile struct, public function signatures |

2. Module Architecture

classDiagram
    class PEParser {
        -LPVOID m_fileData
        -DWORD m_fileSize
        -bool m_is64Bit
        -vector~PIMAGE_SECTION_HEADER~ m_sections
        -DWORD m_importCount
        +PEParser(LPVOID fileData, DWORD fileSize)
        +Initialize() bool
        +Is64Bit() bool
        +GetSectionCount() DWORD
        +GetImportCount() DWORD
        +GetSections() vector
        -ParseSections() void
        -ParseImports() void
    }

    class ParsedFile {
        +bool is64Bit
        +DWORD sectionCount
        +DWORD importCount
        +DWORD textSize
        +float textEntropy
        +vector~BYTE~ textOpcodes
    }

    class StaticHelpers {
        +NormalizePath(input, output)$ bool
        +SafeParsePE(path, out)$ bool
        +SafeParsePE_SEH(path, out)$ bool
        -LoadFile(path, data, size)$ bool
        -CalculateEntropy(data, size)$ float
        -RvaToPtr(rva, base, size, sections)$ BYTE*
        -ExtractMaxEntropySection(...)$ bool
        -ParsePE_CPP(path, out)$ bool
    }

    StaticHelpers --> PEParser : creates & uses
    StaticHelpers --> ParsedFile : populates

3. Parsing Pipeline

flowchart TD
    Entry["SafeParsePE_SEH(path, out)"]
    
    Entry --> SEH["__try"]
    SEH --> Normalize["NormalizePath(path)\n→ Win32 path"]
    
    Normalize --> Load["LoadFile()\nCreateFileW(SHARE_READ|WRITE|DELETE)\n+ HeapAlloc + ReadFile"]
    
    Load --> Init["PEParser::Initialize()"]
    
    subgraph Validation["PE Validation"]
        DOS["Validate DOS Header\ne_magic == 0x5A4D ('MZ')"]
        NT["Validate NT Header\ne_lfanew bounds check\nSignature == 0x00004550 ('PE\\0\\0')"]
        Bits["Determine Bitness\nOptionalHeader.Magic\n0x20B → 64-bit\n0x10B → 32-bit"]
        DOS --> NT --> Bits
    end
    
    Init --> DOS
    
    Bits --> Sections["ParseSections()\nEnumerate IMAGE_SECTION_HEADER[]"]
    Sections --> Imports["ParseImports()\nWalk IMAGE_IMPORT_DESCRIPTOR[]\nvia RvaToPtr()"]
    
    Imports --> MaxEnt["ExtractMaxEntropySection()\nCalculate entropy for ALL sections\nSelect highest"]
    
    MaxEnt --> Result["Populate ParsedFile:\n- is64Bit\n- sectionCount\n- importCount\n- textSize\n- textEntropy\n- textOpcodes[0..4095]"]
    
    Result --> Free["HeapFree(fileData)"]
    
    SEH --> Except["__except\nEXCEPTION_EXECUTE_HANDLER"]
    Except --> RetFalse["Return FALSE"]

    style Validation fill:#4361ee,color:#fff
    style MaxEnt fill:#e07a5f,color:#fff
    style Result fill:#2d6a4f,color:#fff
    style RetFalse fill:#e63946,color:#fff
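
The result record at the end of the pipeline can be sketched in portable C++ as follows. Field names follow the class diagram, with fixed-width integer types standing in for DWORD and BYTE. `FillFromSection` and `kMaxOpcodeBytes` are hypothetical names used only for illustration, and the 4096-byte cap is an assumption read off the textOpcodes[0..4095] range in the flowchart.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the ParsedFile fields from the class diagram, using portable
// fixed-width types in place of DWORD and BYTE.
struct ParsedFile {
    bool is64Bit = false;
    std::uint32_t sectionCount = 0;
    std::uint32_t importCount = 0;
    std::uint32_t textSize = 0;
    float textEntropy = 0.0f;
    std::vector<std::uint8_t> textOpcodes;
};

// Assumed cap, taken from the textOpcodes[0..4095] range in the flowchart.
constexpr std::size_t kMaxOpcodeBytes = 4096;

// Hypothetical helper: record the selected section's size and entropy and
// copy at most kMaxOpcodeBytes of its raw bytes into the result.
inline void FillFromSection(ParsedFile& out, const std::uint8_t* data,
                            std::uint32_t size, float entropy) {
    out.textSize = size;
    out.textEntropy = entropy;
    const std::size_t n = std::min<std::size_t>(size, kMaxOpcodeBytes);
    out.textOpcodes.assign(data, data + n);
}
```

In this sketch textSize keeps the section's full size while textOpcodes holds only the leading bytes, so a consumer can tell when the sample was truncated.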

4. Path Normalization

The kernel driver sends paths in NT format. The scanner converts them to Win32 paths.

| Input Format | Path Type | Conversion |
|--------------|-----------|------------|
| \\?\C:\... | Already Win32 | Use as-is |
| \\.\device\... | Device path | Use as-is |
| C:\... | Standard Win32 | Use as-is |
| \??\C:\... | NT symlink | Strip \??\ prefix |
| \Device\HarddiskVolumeX\... | NT device path | Map via QueryDosDeviceW() |
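
The prefix rules in the table can be sketched as a small classifier. This is a minimal, portable illustration rather than the code in pe_parser.cpp: `PathKind`, `ClassifyPath`, `StartsWith`, `IsDriveLetterPath`, and `StripNtSymlinkPrefix` are hypothetical names, and the comparison is case-sensitive for simplicity.

```cpp
#include <string>

enum class PathKind { Win32, NtSymlink, NtDevice, Unknown };

// True when s begins with prefix (case-sensitive).
inline bool StartsWith(const std::wstring& s, const std::wstring& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

// True for paths of the form X:\... where X is an ASCII letter.
inline bool IsDriveLetterPath(const std::wstring& s) {
    if (s.size() < 3 || s[1] != L':' || s[2] != L'\\') return false;
    const wchar_t c = s[0];
    return (c >= L'A' && c <= L'Z') || (c >= L'a' && c <= L'z');
}

// Classify a path according to the prefix table above.
inline PathKind ClassifyPath(const std::wstring& path) {
    if (StartsWith(path, LR"(\\?\)") || StartsWith(path, LR"(\\.\)") ||
        IsDriveLetterPath(path)) {
        return PathKind::Win32;      // use as-is
    }
    if (StartsWith(path, LR"(\??\)")) return PathKind::NtSymlink;
    if (StartsWith(path, LR"(\Device\)")) return PathKind::NtDevice;
    return PathKind::Unknown;
}

// Drop the four-character \??\ prefix, leaving a drive-letter path.
inline std::wstring StripNtSymlinkPrefix(const std::wstring& path) {
    return StartsWith(path, LR"(\??\)") ? path.substr(4) : path;
}
```

Only the NtDevice case needs a system query; the other rows are pure string handling.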

Volume Resolution Algorithm

flowchart TD
    Input["\\Device\\HarddiskVolume3\\Windows\\test.exe"]
    Input --> GetDrives["GetLogicalDriveStringsW()\n→ C:\\ D:\\ E:\\"]

    GetDrives --> ForEach["For each drive letter"]
    ForEach --> Query["QueryDosDeviceW('C:')\n→ \\Device\\HarddiskVolume3"]

    Query --> Match{"Path starts with\ndevice name?"}
    Match -->|Yes| Map["Replace device prefix\nwith drive letter\n→ C:\\Windows\\test.exe"]
    Match -->|No| Next["Try next drive"]
    Next --> ForEach

    style Map fill:#2d6a4f,color:#fff
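
The matching step of the volume resolution loop can be sketched without any Win32 calls by passing the drive table in as data. In the real module the (device, drive) pairs would come from GetLogicalDriveStringsW() and QueryDosDeviceW(); `DriveMap` and `MapDevicePath` are hypothetical names. The component-boundary check is an assumption added here, not something the flowchart states.

```cpp
#include <string>
#include <utility>
#include <vector>

// (device name, drive) pairs, for example
// { L"\\Device\\HarddiskVolume3", L"C:" }. Supplied by the caller so the
// sketch stays portable.
using DriveMap = std::vector<std::pair<std::wstring, std::wstring>>;

// Replace a leading NT device name with its drive letter.
// Returns false when no entry in the table matches.
inline bool MapDevicePath(const std::wstring& ntPath, const DriveMap& drives,
                          std::wstring& win32Path) {
    for (const auto& entry : drives) {
        const std::wstring& device = entry.first;
        const std::wstring& drive = entry.second;
        if (device.empty() ||
            ntPath.compare(0, device.size(), device) != 0) {
            continue;  // path does not start with this device name
        }
        // Require a component boundary so that HarddiskVolume1 does not
        // match a path that lives on HarddiskVolume10.
        if (ntPath.size() > device.size() && ntPath[device.size()] != L'\\') {
            continue;
        }
        win32Path = drive + ntPath.substr(device.size());
        return true;
    }
    return false;
}
```

Without the boundary check, a plain prefix comparison would resolve a path on volume 10 to whichever drive is mounted on volume 1.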

5. PE Validation Details

DOS Header Check

Offset 0x00: e_magic must equal 0x5A4D ("MZ")
Offset 0x3C: e_lfanew must be > 0 and within file bounds

NT Header Check

Offset e_lfanew + 0x00: Signature must equal 0x00004550 ("PE\0\0")
Offset e_lfanew + 0x18: OptionalHeader.Magic determines bitness
  0x10B → IMAGE_NT_OPTIONAL_HDR32_MAGIC (32-bit)
  0x20B → IMAGE_NT_OPTIONAL_HDR64_MAGIC (64-bit)
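
The checks above can be sketched against a raw byte buffer using the listed offsets directly. This is an illustration under stated assumptions, not the PEParser::Initialize() implementation: `PeKind`, `ReadAt`, and `ClassifyHeaders` are hypothetical names, and the reads assume a little-endian host, which matches the PE on-disk byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class PeKind { Invalid, Pe32, Pe64 };

// Bounds-checked, alignment-safe read of a T at byte offset off.
// Assumes a little-endian host.
template <typename T>
inline bool ReadAt(const std::uint8_t* base, std::size_t size,
                   std::size_t off, T& out) {
    if (off > size || size - off < sizeof(T)) return false;
    std::memcpy(&out, base + off, sizeof(T));
    return true;
}

// Apply the DOS and NT header checks and report the bitness.
inline PeKind ClassifyHeaders(const std::uint8_t* base, std::size_t size) {
    std::uint16_t e_magic = 0;
    if (!ReadAt(base, size, 0x00, e_magic) || e_magic != 0x5A4D) {
        return PeKind::Invalid;                   // missing "MZ"
    }
    std::int32_t e_lfanew = 0;
    if (!ReadAt(base, size, 0x3C, e_lfanew) || e_lfanew <= 0) {
        return PeKind::Invalid;                   // bad NT header offset
    }
    const std::size_t nt = static_cast<std::size_t>(e_lfanew);
    std::uint32_t signature = 0;
    if (!ReadAt(base, size, nt, signature) || signature != 0x00004550) {
        return PeKind::Invalid;                   // missing "PE\0\0"
    }
    std::uint16_t magic = 0;
    if (!ReadAt(base, size, nt + 0x18, magic)) return PeKind::Invalid;
    if (magic == 0x10B) return PeKind::Pe32;
    if (magic == 0x20B) return PeKind::Pe64;
    return PeKind::Invalid;
}
```

Every field is read through the same bounds-checked helper, so an e_lfanew that points past the end of the file is rejected instead of dereferenced.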

PE Layout Diagram

flowchart TB
    subgraph PE["PE File Layout"]
        direction TB
        DOS_H["DOS Header (64 bytes)\ne_magic = 'MZ'\ne_lfanew → offset to NT header"]
        Stub["DOS Stub\n(legacy code)"]
        NT_H["NT Header\nSignature = 'PE\\0\\0'\nFileHeader\nOptionalHeader"]
        Sec["Section Headers\n.text, .rdata, .data, .rsrc, ..."]
        SecData["Section Data\n(code, imports, resources)"]
        
        DOS_H --> Stub --> NT_H --> Sec --> SecData
    end

    style PE fill:#1a1a2e,color:#fff

6. Import Counting

The import count is determined by walking the IMAGE_IMPORT_DESCRIPTOR array:

  1. Get the Import Directory RVA from OptionalHeader.DataDirectory[1]
  2. Convert RVA to file pointer using RvaToPtr() (maps RVA through section headers)
  3. Count non-null IMAGE_IMPORT_DESCRIPTOR entries (stop at .Name == 0)

Result: Total number of DLLs imported by the PE (not individual functions).
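
The three steps can be sketched as follows. `SectionInfo` is a hypothetical, trimmed-down stand-in for IMAGE_SECTION_HEADER, and `CountImportedDlls` is an illustrative name; the RvaToPtr signature here is modelled on the class diagram but is an assumption. The 20-byte descriptor size and the Name field at offset 12 follow the IMAGE_IMPORT_DESCRIPTOR layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical subset of IMAGE_SECTION_HEADER needed for RVA mapping.
struct SectionInfo {
    std::uint32_t virtualAddress;  // RVA where the section starts
    std::uint32_t rawOffset;       // PointerToRawData
    std::uint32_t rawSize;         // SizeOfRawData
};

// Map an RVA to a pointer into the file buffer, or nullptr if no section
// backs it with file data.
inline const std::uint8_t* RvaToPtr(std::uint32_t rva,
                                    const std::uint8_t* base, std::size_t size,
                                    const std::vector<SectionInfo>& sections) {
    for (const SectionInfo& s : sections) {
        if (rva < s.virtualAddress) continue;
        const std::uint32_t delta = rva - s.virtualAddress;
        if (delta >= s.rawSize) continue;
        const std::uint64_t fileOff = std::uint64_t{s.rawOffset} + delta;
        if (fileOff >= size) return nullptr;
        return base + fileOff;
    }
    return nullptr;
}

constexpr std::size_t kDescriptorSize = 20;  // sizeof(IMAGE_IMPORT_DESCRIPTOR)
constexpr std::size_t kNameOffset = 12;      // offset of the Name RVA field

// Count descriptors until the terminating entry whose Name is zero.
inline std::uint32_t CountImportedDlls(std::uint32_t importRva,
                                       const std::uint8_t* base,
                                       std::size_t size,
                                       const std::vector<SectionInfo>& sections) {
    const std::uint8_t* p = RvaToPtr(importRva, base, size, sections);
    std::uint32_t count = 0;
    while (p != nullptr) {
        const std::size_t off = static_cast<std::size_t>(p - base);
        if (size - off < kDescriptorSize) break;  // truncated table
        std::uint32_t nameRva = 0;
        std::memcpy(&nameRva, p + kNameOffset, sizeof(nameRva));
        if (nameRva == 0) break;                  // terminator reached
        ++count;
        p += kDescriptorSize;
    }
    return count;
}
```

The loop also stops when the table runs off the end of the buffer, so a file with no terminating descriptor yields a finite count.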


7. Entropy Calculation

Shannon entropy is calculated per section using byte-frequency analysis:

\[H = -\sum_{i=0}^{255} p_i \log_2 p_i\]

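Where $p_i = \frac{\text{count}(\text{byte}_i)}{\text{total bytes}}$ is the fraction of bytes in the section equal to $i$. Byte values that never occur ($p_i = 0$) contribute nothing to the sum, and $H$ ranges from 0 to 8 bits per byte.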

| Entropy Range (bits/byte) | Typical Content |
|---------------------------|-----------------|
| 0.0 – 1.0 | Null/zero-filled sections |
| 3.0 – 5.0 | Structured data, strings |
| 5.0 – 6.5 | Compiled code |
| 6.5 – 7.0 | Compressed data |
| 7.0 – 8.0 | Encrypted/packed data, random bytes |

The parser selects the section with the maximum entropy across all sections, not just code sections. This is critical for detecting packed malware where the payload may reside in custom-named sections.
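
A minimal sketch of the entropy formula and the maximum-entropy selection follows. CalculateEntropy takes its name and parameters from the class diagram, but the body is an assumption about how it could be written. `SectionBytes` and `MaxEntropySection` are hypothetical names.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shannon entropy of a byte buffer, in bits per byte (0.0 to 8.0).
inline float CalculateEntropy(const std::uint8_t* data, std::size_t size) {
    if (data == nullptr || size == 0) return 0.0f;
    std::array<std::size_t, 256> counts{};       // byte-frequency histogram
    for (std::size_t i = 0; i < size; ++i) ++counts[data[i]];
    double h = 0.0;
    for (std::size_t c : counts) {
        if (c == 0) continue;                    // p_i = 0 contributes nothing
        const double p = static_cast<double>(c) / static_cast<double>(size);
        h -= p * std::log2(p);
    }
    return static_cast<float>(h);
}

// Hypothetical view of one section's raw bytes.
struct SectionBytes {
    const std::uint8_t* data;
    std::size_t size;
};

// Return the index of the highest-entropy section (any section, not only
// code), or -1 when the list is empty. bestEntropy receives its entropy.
inline int MaxEntropySection(const std::vector<SectionBytes>& sections,
                             float& bestEntropy) {
    int best = -1;
    bestEntropy = 0.0f;
    for (std::size_t i = 0; i < sections.size(); ++i) {
        const float h = CalculateEntropy(sections[i].data, sections[i].size);
        if (best < 0 || h > bestEntropy) {
            best = static_cast<int>(i);
            bestEntropy = h;
        }
    }
    return best;
}
```

As a worked check, a buffer that alternates between two byte values has $p = 0.5$ for each, giving $H = -2 \cdot 0.5 \log_2 0.5 = 1$ bit per byte, while a buffer containing each of the 256 byte values equally often reaches the maximum of 8.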


8. File Loading Strategy

CreateFileW(path, GENERIC_READ, 
    FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, ...)

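The FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE flags request the most permissive share mode, so the open does not conflict with handles that other processes already hold on the file. This allows the scanner to read:

  • DLLs currently loaded by other processes
  • Executables currently running
  • Files being written by other processes
  • System files that other components hold open

Share modes only resolve sharing conflicts. They do not override file permissions (ACLs), and the open can still fail if another process holds the file without read sharing.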

The entire file is read into a heap buffer (HeapAlloc + ReadFile). This is simpler than memory mapping and avoids file-mapping complications with in-use executables.


9. SEH Protection

The outermost entry point (SafeParsePE_SEH) wraps all parsing in a Structured Exception Handler:

__try {
    return SafeParsePE(path, out);
}
__except (EXCEPTION_EXECUTE_HANDLER) {
    return FALSE;  // Malformed PE caused access violation
}

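Because EXCEPTION_EXECUTE_HANDLER handles every structured exception, this catches access violations caused by:

  • Corrupted PE headers pointing outside the buffer
  • Maliciously crafted section headers with invalid offsets
  • Integer overflows in size calculations that produce out-of-range offsets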
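
All three cases reduce to a read outside the file buffer. As a complementary sketch, not taken from pe_parser.cpp, a range check can be written so that it cannot overflow. `RangeInBuffer` is a hypothetical name.

```cpp
#include <cstdint>

// Returns true when [offset, offset + length) lies inside a buffer of
// bufferSize bytes. The check uses subtraction, so offset + length is
// never computed and cannot wrap around.
inline bool RangeInBuffer(std::uint32_t offset, std::uint32_t length,
                          std::uint32_t bufferSize) {
    return offset <= bufferSize && length <= bufferSize - offset;
}
```

The naive form `offset + length <= bufferSize` accepts offset 0xFFFFFFF0 with length 0x20, because the 32-bit sum wraps to 0x10. The version above rejects it.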

Next Steps