ML Classifier Module

The ML Classifier takes a feature vector extracted from a PE file and produces a SCAN_RESULT verdict: CLEAN, SUSPICIOUS, MALICIOUS, or ERROR. The current implementation is a baseline heuristic over entropy and import count, and it returns only CLEAN or MALICIOUS; SUSPICIOUS and ERROR are reserved for future use.

Related: PE Parser Module · Policy Engine Module · Scan Pipeline Flow · Design Decisions


1. Source Files

| File | Purpose |
| --- | --- |
| scanner/ml.cpp | Classify() implementation |
| scanner/ml.h | Classify() declaration |
| scanner/features.cpp | ExtractFeatures() implementation |
| scanner/features.h | FeatureVector struct, ExtractFeatures() declaration |

2. Pipeline Position

flowchart LR
    PE["ParsedFile\n(from PE Parser)"]
    FE["ExtractFeatures()"]
    ML["Classify()"]
    POL["ApplyPolicy()"]
    
    PE -->|"ParsedFile"| FE
    FE -->|"FeatureVector"| ML
    ML -->|"SCAN_RESULT"| POL
    
    style FE fill:#4361ee,color:#fff
    style ML fill:#e07a5f,color:#fff
    style POL fill:#2d6a4f,color:#fff

3. Feature Extraction

FeatureVector Structure

| Field | Type | Source | Description |
| --- | --- | --- | --- |
| entropy | float | ParsedFile.textEntropy | Shannon entropy of the highest-entropy section (0.0–8.0) |
| importCount | int | ParsedFile.importCount | Number of imported DLLs |

Extraction Logic

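// Copies the parser's values straight into the feature vector.
// Both pointers are dereferenced without a null check.
void ExtractFeatures(const ParsedFile* parsed, FeatureVector* fv) {
    fv->entropy = parsed->textEntropy;      // entropy reported by the PE parser
    fv->importCount = parsed->importCount;  // import count reported by the PE parser
}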

This is a direct mapping: no normalization or scaling is applied.
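
A minimal sketch of how the mapping might be called. The ParsedFile and FeatureVector definitions here are reduced stand-ins that model only the two fields named in the table above, and FeaturesFor() is a hypothetical helper that does not appear in the source.

```cpp
// Hypothetical stand-in for the parser output; the real ParsedFile
// very likely carries more fields than these two.
struct ParsedFile {
    float textEntropy;
    int importCount;
};

// Field names and types follow the FeatureVector table.
struct FeatureVector {
    float entropy;
    int importCount;
};

// Same direct mapping as shown above.
void ExtractFeatures(const ParsedFile* parsed, FeatureVector* fv) {
    fv->entropy = parsed->textEntropy;
    fv->importCount = parsed->importCount;
}

// Hypothetical convenience wrapper illustrating a typical call.
FeatureVector FeaturesFor(const ParsedFile& parsed) {
    FeatureVector fv{};
    ExtractFeatures(&parsed, &fv);
    return fv;
}
```

Because the values are copied unchanged, any scaling a future model needs would have to be added either here or inside the classifier.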


4. Classification Algorithm

flowchart TD
    Input["FeatureVector {\n  entropy: float,\n  importCount: int\n}"]
    
    Input --> E{"entropy > 6.99?"}
    
    E -->|"≤ 6.99"| Clean["SCAN_CLEAN ✅"]
    
    E -->|"> 6.99"| I{"importCount > 10?"}
    
    I -->|"≤ 10"| Clean
    
    I -->|"> 10"| Malicious["SCAN_MALICIOUS 🚨"]

    style Clean fill:#2d6a4f,color:#fff
    style Malicious fill:#e63946,color:#fff
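
The flowchart above can be sketched in C++ roughly as follows. The Classify() signature (a const pointer, mirroring ExtractFeatures()) and the constant names are assumptions; only the threshold values and the two outcomes come from this page.

```cpp
// Values follow the SCAN_RESULT enumeration documented on this page.
enum SCAN_RESULT {
    SCAN_CLEAN = 0,
    SCAN_SUSPICIOUS = 1,
    SCAN_MALICIOUS = 2,
    SCAN_ERROR = 3
};

struct FeatureVector {
    float entropy;
    int importCount;
};

// Constant names are illustrative; the values are the documented thresholds.
constexpr float kEntropyThreshold = 6.99f;
constexpr int kImportThreshold = 10;

// Assumed signature: the real declaration lives in scanner/ml.h.
SCAN_RESULT Classify(const FeatureVector* fv) {
    // Low-entropy files are CLEAN regardless of import count.
    if (fv->entropy <= kEntropyThreshold) {
        return SCAN_CLEAN;
    }
    // High entropy alone is not enough; the import count must also be high.
    if (fv->importCount > kImportThreshold) {
        return SCAN_MALICIOUS;
    }
    return SCAN_CLEAN;
}
```

Both comparisons are strict, so a file sitting exactly on either threshold is classified as CLEAN.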

Decision Boundary

quadrantChart
    title Classification Decision Space
    x-axis "Low Import Count" --> "High Import Count"
    y-axis "Low Entropy" --> "High Entropy"
    quadrant-1 "MALICIOUS"
    quadrant-2 "CLEAN"
    quadrant-3 "CLEAN"
    quadrant-4 "CLEAN"

The classifier uses a simple axis-aligned decision boundary with two thresholds:

| Threshold | Value | Rationale |
| --- | --- | --- |
| Entropy | > 6.99 | Sections with entropy at or above roughly 7.0 are statistically likely to contain compressed or encrypted data, a hallmark of packed malware |
| Import Count | > 10 | Packed malware tends to unpack at runtime, needing imports for memory management, process injection, and I/O. Legitimate high-entropy files (e.g., compressed resources) typically have fewer functional imports |

Classification Truth Table

| Entropy | Import Count | Verdict |
| --- | --- | --- |
| ≤ 6.99 | Any | CLEAN |
| > 6.99 | ≤ 10 | CLEAN |
| > 6.99 | > 10 | MALICIOUS |

5. SCAN_RESULT Enum

classDiagram
    class SCAN_RESULT {
        <<enumeration>>
        SCAN_CLEAN = 0
        SCAN_SUSPICIOUS = 1
        SCAN_MALICIOUS = 2
        SCAN_ERROR = 3
    }

| Value | Usage | Currently Returned |
| --- | --- | --- |
| SCAN_CLEAN | File is benign | ✅ Yes |
| SCAN_SUSPICIOUS | File is suspicious but not confirmed malicious | ❌ Reserved for future use |
| SCAN_MALICIOUS | File is classified as malware | ✅ Yes |
| SCAN_ERROR | Parsing or analysis error | ❌ Reserved for future use |
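
As a rough illustration, the enumeration could be declared as below. The numeric values come from the diagram; ScanResultName() is a hypothetical logging helper that is not part of the documented module.

```cpp
#include <string>

// Numeric values match the class diagram on this page.
enum SCAN_RESULT {
    SCAN_CLEAN = 0,
    SCAN_SUSPICIOUS = 1,
    SCAN_MALICIOUS = 2,
    SCAN_ERROR = 3
};

// Hypothetical helper: maps a verdict to a short label for logs.
std::string ScanResultName(SCAN_RESULT result) {
    switch (result) {
        case SCAN_CLEAN:      return "CLEAN";
        case SCAN_SUSPICIOUS: return "SUSPICIOUS";  // reserved, not returned today
        case SCAN_MALICIOUS:  return "MALICIOUS";
        case SCAN_ERROR:      return "ERROR";       // reserved, not returned today
    }
    return "UNKNOWN";  // defensive fallback for out-of-range values
}
```

Callers that switch on the verdict should probably handle all four values even though only two are produced today, so that enabling the reserved values later does not change their behavior silently.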

6. Design Rationale

Why Entropy?

Malware packers (UPX, Themida, custom packers) compress or encrypt the payload section, producing near-random byte distributions. Shannon entropy measures this randomness:

  • Normal compiled code: Entropy typically 5.0–6.5 due to repeating instruction patterns, padding, and structured data
  • Packed/encrypted code: Entropy typically 7.0–8.0 due to uniform byte distribution
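
For reference, Shannon entropy over a byte buffer can be computed as sketched below. This page does not show how the PE parser derives textEntropy, so the function name and signature are assumptions; the formula itself is the standard H = -Σ p(b) · log2 p(b) over the 256 possible byte values.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Illustrative helper (not from the source): returns entropy in bits
// per byte, in the range 0.0 to 8.0. An empty buffer yields 0.0.
double ShannonEntropy(const std::uint8_t* data, std::size_t size) {
    if (data == nullptr || size == 0) {
        return 0.0;
    }

    // Histogram of byte values.
    std::array<std::size_t, 256> counts{};
    for (std::size_t i = 0; i < size; ++i) {
        ++counts[data[i]];
    }

    // Sum -p * log2(p) over every byte value that actually occurs.
    double entropy = 0.0;
    for (std::size_t count : counts) {
        if (count == 0) {
            continue;
        }
        const double p = static_cast<double>(count) / static_cast<double>(size);
        entropy -= p * std::log2(p);
    }
    return entropy;
}
```

A buffer holding a single repeated byte scores 0.0, while a buffer in which all 256 byte values occur equally often scores 8.0, which is why packed or encrypted sections cluster near the top of the range.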

Why Import Count?

  • Packed malware still needs runtime imports (VirtualAlloc, WriteProcessMemory, CreateThread, etc.) to unpack and execute its payload
  • Legitimate packed files (e.g., UPX-packed utilities) tend to have minimal imports since they’re often simple tools
  • This second threshold reduces false positives from legitimately compressed resources (high entropy, low imports)

Known Limitations

| Limitation | Impact | Mitigation Path |
| --- | --- | --- |
| High false-positive rate on legitimately packed software | UPX-packed tools flagged as malicious | Add UPX signature detection |
| No API-level import analysis | Cannot distinguish malicious imports from benign | Parse import names, score suspicious APIs |
| Binary threshold (no confidence score) | No nuance: either CLEAN or MALICIOUS | Replace with weighted scoring model |
| Single-section entropy | Ignores per-section patterns | Analyze entropy distribution across all sections |
| No opcode analysis | Stored textOpcodes are unused | Implement opcode frequency analysis |

7. Extending the Classifier

The FeatureVector struct and the Classify() function are designed to be extended. See Adding Detection Rules for a step-by-step guide on:

  1. Adding new features to FeatureVector
  2. Extracting them in ExtractFeatures()
  3. Incorporating them into the classification logic in Classify()
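
The three steps above might look like the following sketch. Everything beyond the two existing features is hypothetical: the importNames field on ParsedFile, the suspiciousApiCount feature, the API list, and the SUSPICIOUS rule with its cutoff of 2 are placeholders chosen for illustration, not values taken from the project.

```cpp
#include <string>
#include <vector>

enum SCAN_RESULT { SCAN_CLEAN = 0, SCAN_SUSPICIOUS = 1, SCAN_MALICIOUS = 2, SCAN_ERROR = 3 };

// Hypothetical: assumes the parser is extended to expose import names.
struct ParsedFile {
    float textEntropy;
    int importCount;
    std::vector<std::string> importNames;
};

// Step 1: add the new feature to FeatureVector.
struct FeatureVector {
    float entropy;
    int importCount;
    int suspiciousApiCount;  // hypothetical new feature
};

// Step 2: extract it in ExtractFeatures().
void ExtractFeatures(const ParsedFile* parsed, FeatureVector* fv) {
    static const char* const kSuspiciousApis[] = {
        "VirtualAlloc", "WriteProcessMemory", "CreateThread"};
    fv->entropy = parsed->textEntropy;
    fv->importCount = parsed->importCount;
    fv->suspiciousApiCount = 0;
    for (const std::string& name : parsed->importNames) {
        for (const char* api : kSuspiciousApis) {
            if (name == api) {
                ++fv->suspiciousApiCount;
            }
        }
    }
}

// Step 3: use it in Classify(). The existing two-threshold rule is kept;
// the extra branch is an illustrative placeholder, not a tuned rule.
SCAN_RESULT Classify(const FeatureVector* fv) {
    if (fv->entropy <= 6.99f) {
        return SCAN_CLEAN;
    }
    if (fv->importCount > 10) {
        return SCAN_MALICIOUS;
    }
    if (fv->suspiciousApiCount >= 2) {
        return SCAN_SUSPICIOUS;
    }
    return SCAN_CLEAN;
}
```

Keeping the original rule first means existing verdicts are unchanged; the new feature only affects files that previously fell into the high-entropy, low-import CLEAN case.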

Future Feature Candidates

| Feature | Source | Detection Value |
| --- | --- | --- |
| Suspicious API imports | Import name table | Detects process injection, keylogging |
| Section name anomalies | Section headers | Packed malware uses non-standard names |
| Entry point location | OptionalHeader | Entry point outside .text indicates packing |
| Resource entropy | Resource section | Encrypted payloads hidden in resources |
| Opcode frequency | First 4KB of text | Obfuscated code has unusual opcode distribution |
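
As one example from the table, a section-name anomaly feature could be prototyped along these lines. The function names and the list of "standard" names are assumptions made for illustration; a real implementation would need a list tuned to the toolchains it expects to see.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: true if the name is one commonly emitted by
// mainstream compilers and linkers. The list is illustrative, not exhaustive.
bool IsStandardSectionName(const std::string& name) {
    static const char* const kStandardNames[] = {
        ".text", ".data", ".rdata", ".rsrc", ".reloc",
        ".idata", ".edata", ".pdata", ".bss", ".tls"};
    for (const char* standard : kStandardNames) {
        if (name == standard) {
            return true;
        }
    }
    return false;
}

// Hypothetical feature: number of sections with a non-standard name.
int CountNonStandardSections(const std::vector<std::string>& sectionNames) {
    int count = 0;
    for (const std::string& name : sectionNames) {
        if (!IsStandardSectionName(name)) {
            ++count;
        }
    }
    return count;
}
```

UPX, for instance, names its sections UPX0 and UPX1, so a file packed with it would score 2 on this feature. A non-standard name alone is a weak signal, so this would most plausibly be combined with the existing entropy and import features.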

Next Steps