PE Parser Module
The PE Parser loads executable files into memory, validates their PE structure, extracts sections and imports, and computes per-section entropy to select the highest-entropy section. It is the data foundation for the classification pipeline.
Related: Scanner Module · ML Classifier Module · Scan Pipeline Flow
1. Source Files
| File | Purpose |
|---|---|
| scanner/pe_parser.cpp | File loading, PE validation, section/import parsing, entropy calculation, SEH wrapper |
| scanner/pe_parser.h | PEParser class declaration, ParsedFile struct, public function signatures |
2. Module Architecture
classDiagram
class PEParser {
-LPVOID m_fileData
-DWORD m_fileSize
-bool m_is64Bit
-vector~PIMAGE_SECTION_HEADER~ m_sections
-DWORD m_importCount
+PEParser(LPVOID fileData, DWORD fileSize)
+Initialize() bool
+Is64Bit() bool
+GetSectionCount() DWORD
+GetImportCount() DWORD
+GetSections() vector
-ParseSections() void
-ParseImports() void
}
class ParsedFile {
+bool is64Bit
+DWORD sectionCount
+DWORD importCount
+DWORD textSize
+float textEntropy
+vector~BYTE~ textOpcodes
}
class StaticHelpers {
+NormalizePath(input, output)$ bool
+SafeParsePE(path, out)$ bool
+SafeParsePE_SEH(path, out)$ bool
-LoadFile(path, data, size)$ bool
-CalculateEntropy(data, size)$ float
-RvaToPtr(rva, base, size, sections)$ BYTE*
-ExtractMaxEntropySection(...)$ bool
-ParsePE_CPP(path, out)$ bool
}
StaticHelpers --> PEParser : creates & uses
StaticHelpers --> ParsedFile : populates
3. Parsing Pipeline
flowchart TD
Entry["SafeParsePE_SEH(path, out)"]
Entry --> SEH["__try"]
SEH --> Normalize["NormalizePath(path)\n→ Win32 path"]
Normalize --> Load["LoadFile()\nCreateFileW(SHARE_READ|WRITE|DELETE)\n+ HeapAlloc + ReadFile"]
Load --> Init["PEParser::Initialize()"]
subgraph Validation["PE Validation"]
DOS["Validate DOS Header\ne_magic == 0x5A4D ('MZ')"]
NT["Validate NT Header\ne_lfanew bounds check\nSignature == 0x00004550 ('PE\\0\\0')"]
Bits["Determine Bitness\nOptionalHeader.Magic\n0x20B → 64-bit\n0x10B → 32-bit"]
DOS --> NT --> Bits
end
Init --> DOS
Bits --> Sections["ParseSections()\nEnumerate IMAGE_SECTION_HEADER[]"]
Sections --> Imports["ParseImports()\nWalk IMAGE_IMPORT_DESCRIPTOR[]\nvia RvaToPtr()"]
Imports --> MaxEnt["ExtractMaxEntropySection()\nCalculate entropy for ALL sections\nSelect highest"]
MaxEnt --> Result["Populate ParsedFile:\n- is64Bit\n- sectionCount\n- importCount\n- textSize\n- textEntropy\n- textOpcodes[0..4095]"]
Result --> Free["HeapFree(fileData)"]
SEH --> Except["__except\nEXCEPTION_EXECUTE_HANDLER"]
Except --> RetFalse["Return FALSE"]
style Validation fill:#4361ee,color:#fff
style MaxEnt fill:#e07a5f,color:#fff
style Result fill:#2d6a4f,color:#fff
style RetFalse fill:#e63946,color:#fff
4. Path Normalization
The kernel driver sends paths in NT format. The scanner converts them to Win32 paths.
| Input Path | Format | Conversion |
|---|---|---|
| \\?\C:\... | Already Win32 | Use as-is |
| \\.\device\... | Device path | Use as-is |
| C:\... | Standard Win32 | Use as-is |
| \??\C:\... | NT symlink | Strip \??\ prefix |
| \Device\HarddiskVolumeX\... | NT device path | Map via QueryDosDeviceW() |
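The prefix rules in the table above can be sketched as a small classifier. This is a minimal sketch rather than the module's actual NormalizePath(): the names PathKind, ClassifyPath, and NormalizeSimplePath are hypothetical, matching is case-sensitive for brevity, and the \Device\... case is left to a separate volume lookup.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of the path forms listed in the table.
enum class PathKind { Win32, NtDosDevices, NtDevice, Unknown };

// True when `s` begins with `prefix` (case-sensitive).
static bool StartsWith(const std::wstring& s, const std::wstring& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
}

static bool IsAsciiLetter(wchar_t c) {
    return (c >= L'A' && c <= L'Z') || (c >= L'a' && c <= L'z');
}

PathKind ClassifyPath(const std::wstring& path) {
    // \\?\ and \\.\ are already usable by Win32 file APIs.
    if (StartsWith(path, L"\\\\?\\") || StartsWith(path, L"\\\\.\\")) return PathKind::Win32;
    // Drive-letter form such as C:\...
    if (path.size() >= 3 && IsAsciiLetter(path[0]) && path[1] == L':' && path[2] == L'\\')
        return PathKind::Win32;
    if (StartsWith(path, L"\\??\\")) return PathKind::NtDosDevices;
    if (StartsWith(path, L"\\Device\\")) return PathKind::NtDevice;
    return PathKind::Unknown;
}

// Handles every row of the table except \Device\..., which needs a volume lookup.
bool NormalizeSimplePath(const std::wstring& in, std::wstring& out) {
    switch (ClassifyPath(in)) {
    case PathKind::Win32:
        out = in;                // use as-is
        return true;
    case PathKind::NtDosDevices:
        assert(in.size() >= 4);  // guaranteed by the prefix match
        out = in.substr(4);      // strip the four-character \??\ prefix
        return true;
    default:
        return false;            // NtDevice or Unknown: caller must resolve further
    }
}
```

Returning false for \Device\... paths keeps the cheap string checks separate from the drive enumeration that the next subsection describes.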
Volume Resolution Algorithm
flowchart TD
Input["\\Device\\HarddiskVolume3\\Windows\\test.exe"]
Input --> GetDrives["GetLogicalDriveStringsW()\n→ C:\\ D:\\ E:\\"]
GetDrives --> ForEach["For each drive letter"]
ForEach --> Query["QueryDosDeviceW('C:')\n→ \\Device\\HarddiskVolume3"]
Query --> Match{"Path starts with\ndevice name?"}
Match -->|Yes| Map["Replace device prefix\nwith drive letter\n→ C:\\Windows\\test.exe"]
Match -->|No| Next["Try next drive"]
Next --> ForEach
style Map fill:#2d6a4f,color:#fff
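The matching step of this algorithm can be illustrated without any Win32 calls. In the sketch below the drive-to-device pairs are passed in as data. On Windows they would be gathered with GetLogicalDriveStringsW() and QueryDosDeviceW(), which this example does not call. DriveMap and MapDevicePath are hypothetical names. The separator check after the device name is an assumption of this sketch, added so that HarddiskVolume1 is not mistaken for a prefix of HarddiskVolume10.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Each entry pairs a drive string with its NT device name,
// e.g. { L"C:", L"\\Device\\HarddiskVolume3" }.
using DriveMap = std::vector<std::pair<std::wstring, std::wstring>>;

// Rewrites an NT device path to a drive-letter path using the supplied mapping.
bool MapDevicePath(const std::wstring& ntPath, const DriveMap& drives, std::wstring& out) {
    for (const auto& entry : drives) {
        const std::wstring& drive = entry.first;
        const std::wstring& device = entry.second;
        assert(!drive.empty());  // a mapping without a drive string is a caller bug

        const std::size_t n = device.size();
        if (n == 0 || ntPath.compare(0, n, device) != 0) continue;  // not a prefix

        // Require a path separator (or end of string) right after the device name.
        if (ntPath.size() > n && ntPath[n] != L'\\') continue;

        out = drive + ntPath.substr(n);  // replace device prefix with drive letter
        return true;
    }
    return false;  // no drive maps to this device
}
```

With a mapping of C: to \Device\HarddiskVolume3, the input \Device\HarddiskVolume3\Windows\test.exe becomes C:\Windows\test.exe, as in the diagram.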
5. PE Validation Details
DOS Header Check
Offset 0x00: e_magic must equal 0x5A4D ("MZ")
Offset 0x3C: e_lfanew must be > 0 and within file bounds
NT Header Check
Offset e_lfanew + 0x00: Signature must equal 0x00004550 ("PE\0\0")
Offset e_lfanew + 0x18: OptionalHeader.Magic determines bitness
0x10B → IMAGE_NT_OPTIONAL_HDR32_MAGIC (32-bit)
0x20B → IMAGE_NT_OPTIONAL_HDR64_MAGIC (64-bit)
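The checks listed above can be expressed directly against a raw byte buffer. The following is a minimal sketch, assuming a little-endian host and using the offsets stated in this section. PeBitness, ReadAt, and CheckPeHeaders are hypothetical names and are not taken from pe_parser.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

enum class PeBitness { Invalid, Pe32, Pe64 };

// Copies a T from data[offset], failing if the read would run past `size`.
template <typename T>
static bool ReadAt(const std::uint8_t* data, std::size_t size, std::size_t offset, T& value) {
    if (offset > size || size - offset < sizeof(T)) return false;
    std::memcpy(&value, data + offset, sizeof(T));  // assumes a little-endian host
    return true;
}

PeBitness CheckPeHeaders(const std::uint8_t* data, std::size_t size) {
    assert(data != nullptr || size == 0);

    std::uint16_t e_magic = 0;
    std::int32_t e_lfanew = 0;  // signed LONG in IMAGE_DOS_HEADER
    std::uint32_t signature = 0;
    std::uint16_t optMagic = 0;

    // DOS header: "MZ" at offset 0x00, e_lfanew at offset 0x3C.
    if (!ReadAt(data, size, 0x00, e_magic) || e_magic != 0x5A4D) return PeBitness::Invalid;
    if (!ReadAt(data, size, 0x3C, e_lfanew) || e_lfanew <= 0) return PeBitness::Invalid;

    // NT header: "PE\0\0" at e_lfanew, OptionalHeader.Magic at e_lfanew + 0x18.
    const std::size_t nt = static_cast<std::size_t>(e_lfanew);
    if (!ReadAt(data, size, nt, signature) || signature != 0x00004550) return PeBitness::Invalid;
    if (!ReadAt(data, size, nt + 0x18, optMagic)) return PeBitness::Invalid;

    if (optMagic == 0x10B) return PeBitness::Pe32;
    if (optMagic == 0x20B) return PeBitness::Pe64;
    return PeBitness::Invalid;
}
```

Every read is bounds-checked before it happens, so a truncated or hostile file is rejected by the return value rather than by a fault.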
PE Layout Diagram
flowchart TB
subgraph PE["PE File Layout"]
direction TB
DOS_H["DOS Header (64 bytes)\ne_magic = 'MZ'\ne_lfanew → offset to NT header"]
Stub["DOS Stub\n(legacy code)"]
NT_H["NT Header\nSignature = 'PE\\0\\0'\nFileHeader\nOptionalHeader"]
Sec["Section Headers\n.text, .rdata, .data, .rsrc, ..."]
SecData["Section Data\n(code, imports, resources)"]
DOS_H --> Stub --> NT_H --> Sec --> SecData
end
style PE fill:#1a1a2e,color:#fff
6. Import Counting
The import count is determined by walking the IMAGE_IMPORT_DESCRIPTOR array:
- Get the Import Directory RVA from OptionalHeader.DataDirectory[1]
- Convert the RVA to a file pointer using RvaToPtr() (maps the RVA through the section headers)
- Count non-null IMAGE_IMPORT_DESCRIPTOR entries (stop at .Name == 0)
Result: Total number of DLLs imported by the PE (not individual functions).
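The walk described above can be sketched with plain structs in place of the Windows SDK types. This is an illustration under stated assumptions, not the module's code. SectionInfo and CountImports are hypothetical names, and the RvaToPtr shown here is a simplified stand-in that only resolves RVAs backed by raw file data. The descriptor size (20 bytes) and the offset of Name (12) follow the IMAGE_IMPORT_DESCRIPTOR layout. A little-endian host is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Only the IMAGE_SECTION_HEADER fields needed for RVA translation.
struct SectionInfo {
    std::uint32_t virtualAddress;
    std::uint32_t pointerToRawData;
    std::uint32_t sizeOfRawData;
};

// Maps an RVA to a pointer inside the file buffer, or nullptr if no section backs it.
const std::uint8_t* RvaToPtr(std::uint32_t rva, const std::uint8_t* base, std::size_t size,
                             const std::vector<SectionInfo>& sections) {
    for (const SectionInfo& s : sections) {
        if (rva < s.virtualAddress) continue;
        const std::uint32_t delta = rva - s.virtualAddress;
        if (delta >= s.sizeOfRawData) continue;  // outside this section's raw data
        const std::uint64_t fileOffset = std::uint64_t(s.pointerToRawData) + delta;
        if (fileOffset >= size) return nullptr;  // header points past the buffer
        return base + fileOffset;
    }
    return nullptr;
}

// Counts import descriptors (one per imported DLL) until an entry with Name == 0.
std::uint32_t CountImports(std::uint32_t importDirRva, const std::uint8_t* base,
                           std::size_t size, const std::vector<SectionInfo>& sections) {
    assert(base != nullptr);
    constexpr std::size_t kDescriptorSize = 20;  // sizeof(IMAGE_IMPORT_DESCRIPTOR)
    constexpr std::size_t kNameOffset = 12;      // offset of the Name field

    std::uint32_t count = 0;
    const std::uint8_t* p = RvaToPtr(importDirRva, base, size, sections);
    // Stop when fewer than one full descriptor remains in the buffer.
    while (p != nullptr && static_cast<std::size_t>((base + size) - p) >= kDescriptorSize) {
        std::uint32_t nameRva = 0;
        std::memcpy(&nameRva, p + kNameOffset, sizeof(nameRva));
        if (nameRva == 0) break;  // terminating descriptor
        ++count;
        p += kDescriptorSize;
    }
    return count;
}
```

The loop is bounded by the buffer size, so a descriptor table without a terminator cannot run past the end of the file data.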
7. Entropy Calculation
Shannon entropy is calculated per section using byte-frequency analysis:
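\[H = -\sum_{i=0}^{255} p_i \log_2 p_i\]
where $p_i = \frac{\text{count}(\text{byte}_i)}{\text{total bytes}}$ is the fraction of the section's bytes equal to $i$. Terms with $p_i = 0$ are taken as zero, so $H$ ranges from 0 to 8 bits per byte.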
| Entropy Range | Typical Content |
|---|---|
| 0.0 – 1.0 | Null/zero-filled sections |
| 3.0 – 5.0 | Structured data, strings |
| 5.0 – 6.5 | Compiled code |
| 6.5 – 7.0 | Compressed data |
| 7.0 – 8.0 | Encrypted/packed data, random bytes |
The parser selects the section with the maximum entropy across all sections, not just code sections. This is critical for detecting packed malware where the payload may reside in custom-named sections.
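The entropy formula and the maximum-entropy selection can be sketched as follows. The function name CalculateEntropy mirrors the helper listed in the module architecture, but the body is an assumption about how such a helper might look. SectionBytes and MaxEntropySection are hypothetical names introduced for this example.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Shannon entropy in bits per byte (0.0 to 8.0) from a byte-frequency histogram.
float CalculateEntropy(const std::uint8_t* data, std::size_t size) {
    if (size == 0) return 0.0f;
    assert(data != nullptr);

    std::array<std::size_t, 256> counts{};  // zero-initialized histogram
    for (std::size_t i = 0; i < size; ++i) ++counts[data[i]];

    double h = 0.0;
    for (std::size_t c : counts) {
        if (c == 0) continue;  // p = 0 contributes nothing to the sum
        const double p = static_cast<double>(c) / static_cast<double>(size);
        h -= p * std::log2(p);
    }
    return static_cast<float>(h);
}

// Raw bytes of one section (hypothetical view type for this sketch).
struct SectionBytes {
    const std::uint8_t* data;
    std::size_t size;
};

// Returns the index of the highest-entropy section, or -1 if the list is empty.
int MaxEntropySection(const std::vector<SectionBytes>& sections, float& bestEntropy) {
    int best = -1;
    bestEntropy = 0.0f;
    for (std::size_t i = 0; i < sections.size(); ++i) {
        const float h = CalculateEntropy(sections[i].data, sections[i].size);
        if (best < 0 || h > bestEntropy) {
            best = static_cast<int>(i);
            bestEntropy = h;
        }
    }
    return best;
}
```

An all-zero buffer yields 0.0 and a buffer containing each byte value exactly once yields 8.0, which are the two ends of the range in the table.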
8. File Loading Strategy
CreateFileW(path, GENERIC_READ,
FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, ...)
The FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE flags allow the scanner to read:
- DLLs currently loaded by other processes
- Executables currently running
- Files being written by other processes
- System files held open by the OS or by services (the share flags avoid sharing violations; they do not bypass ACL checks)
The entire file is read into a heap buffer (HeapAlloc + ReadFile). This is simpler than memory mapping and avoids file-mapping complications with in-use executables.
9. SEH Protection
The outermost entry point (SafeParsePE_SEH) wraps all parsing in a Structured Exception Handler:
__try {
return SafeParsePE(path, out);
}
__except (EXCEPTION_EXECUTE_HANDLER) {
return FALSE; // Malformed PE caused access violation
}
This catches access violations from:
- Corrupted PE headers pointing outside the buffer
- Maliciously crafted section headers with invalid offsets
- Integer overflows in size calculations that produce out-of-bounds pointers
Next Steps
- See how parsed data becomes features: ML Classifier Module
- Full pipeline context: Scan Pipeline Flow
- How to extend parsing: Adding Detection Rules