ML Classifier Module

The ML Classifier takes a feature vector extracted from a PE file and produces a `SCAN_RESULT` verdict. The enum defines four values (CLEAN, SUSPICIOUS, MALICIOUS, ERROR), but the current implementation, an entropy-and-import-count heuristic that serves as a baseline classifier, returns only CLEAN or MALICIOUS.

1. Source Files

File	Purpose
`scanner/ml.cpp`	`Classify()` implementation
`scanner/ml.h`	`Classify()` declaration
`scanner/features.cpp`	`ExtractFeatures()` implementation
`scanner/features.h`	`FeatureVector` struct, `ExtractFeatures()` declaration

2. Pipeline Position

flowchart LR
    PE["ParsedFile\n(from PE Parser)"]
    FE["ExtractFeatures()"]
    ML["Classify()"]
    POL["ApplyPolicy()"]
    
    PE -->|"ParsedFile"| FE
    FE -->|"FeatureVector"| ML
    ML -->|"SCAN_RESULT"| POL
    
    style FE fill:#4361ee,color:#fff
    style ML fill:#e07a5f,color:#fff
    style POL fill:#2d6a4f,color:#fff

3. Feature Extraction

FeatureVector Structure

Field	Type	Source	Description
`entropy`	`float`	`ParsedFile.textEntropy`	Shannon entropy of the highest-entropy section (0.0–8.0)
`importCount`	`int`	`ParsedFile.importCount`	Number of imported DLLs

Extraction Logic

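// Copies the two parsed fields into the feature vector unchanged.
// Neither pointer is checked, so callers must pass valid, non-null arguments.
void ExtractFeatures(const ParsedFile* parsed, FeatureVector* fv) {
    fv->entropy = parsed->textEntropy;      // Shannon entropy, 0.0–8.0
    fv->importCount = parsed->importCount;  // raw count, not scaled
}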

This is a direct mapping — no normalization or scaling is applied.
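To make the mapping concrete, here is a minimal sketch of the two structs involved, using only the fields named in the table above. The layout of `ParsedFile` and the null-checking wrapper `ExtractFeaturesChecked` are assumptions for illustration; the real definitions live in `scanner/features.h` and may differ.

```cpp
// Sketch only: the real ParsedFile has more fields than shown here.
struct ParsedFile {
    float textEntropy;  // assumed to be a float, matching FeatureVector::entropy
    int   importCount;
};

// Field names and types follow the FeatureVector table.
struct FeatureVector {
    float entropy;
    int   importCount;
};

// Hypothetical wrapper (not part of the documented API): performs the same
// direct copy as ExtractFeatures(), but rejects null pointers first.
bool ExtractFeaturesChecked(const ParsedFile* parsed, FeatureVector* fv) {
    if (parsed == nullptr || fv == nullptr) {
        return false;
    }
    fv->entropy = parsed->textEntropy;
    fv->importCount = parsed->importCount;
    return true;
}
```

A caller could use the boolean result to decide whether to skip classification for a file whose parse failed.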

4. Classification Algorithm

flowchart TD
    Input["FeatureVector {\n  entropy: float,\n  importCount: int\n}"]
    
    Input --> E{"entropy > 6.99?"}
    
    E -->|"≤ 6.99"| Clean["SCAN_CLEAN ✅"]
    
    E -->|"> 6.99"| I{"importCount > 10?"}
    
    I -->|"≤ 10"| Clean
    
    I -->|"> 10"| Malicious["SCAN_MALICIOUS 🚨"]

    style Clean fill:#2d6a4f,color:#fff
    style Malicious fill:#e63946,color:#fff

Decision Boundary

quadrantChart
    title Classification Decision Space
    x-axis "Low Import Count" --> "High Import Count"
    y-axis "Low Entropy" --> "High Entropy"
    quadrant-1 "MALICIOUS"
    quadrant-2 "CLEAN"
    quadrant-3 "CLEAN"
    quadrant-4 "CLEAN"

The classifier uses a simple axis-aligned decision boundary with two thresholds:

Threshold	Value	Rationale
Entropy	> 6.99	Sections with entropy of roughly 7.0 or higher are statistically likely to contain compressed or encrypted data, a hallmark of packed malware
Import Count	> 10	Packed malware tends to unpack at runtime, needing imports for memory management, process injection, and I/O. Legitimate high-entropy files (e.g., compressed resources) typically have fewer functional imports

Classification Truth Table

Entropy	Import Count	Verdict
≤ 6.99	Any	CLEAN
> 6.99	≤ 10	CLEAN
> 6.99	> 10	MALICIOUS
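The truth table above can be sketched as a single function. The name `ClassifySketch`, its by-reference signature, and the constant names are assumptions; the actual `Classify()` in `scanner/ml.cpp` may be structured differently, but the thresholds and verdicts here follow the table.

```cpp
// Enum values as documented for SCAN_RESULT.
enum SCAN_RESULT {
    SCAN_CLEAN = 0,
    SCAN_SUSPICIOUS = 1,
    SCAN_MALICIOUS = 2,
    SCAN_ERROR = 3
};

struct FeatureVector {
    float entropy;
    int   importCount;
};

// Hypothetical constant names; the values come from the threshold table.
constexpr float kEntropyThreshold = 6.99f;
constexpr int   kImportThreshold  = 10;

// Returns SCAN_MALICIOUS only when BOTH thresholds are exceeded;
// every other combination is SCAN_CLEAN.
SCAN_RESULT ClassifySketch(const FeatureVector& fv) {
    if (fv.entropy > kEntropyThreshold && fv.importCount > kImportThreshold) {
        return SCAN_MALICIOUS;
    }
    return SCAN_CLEAN;
}
```

Both comparisons are strict, so a file with exactly 10 imports is classified as CLEAN regardless of its entropy.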

5. SCAN_RESULT Enum

classDiagram
    class SCAN_RESULT {
        <<enumeration>>
        SCAN_CLEAN = 0
        SCAN_SUSPICIOUS = 1
        SCAN_MALICIOUS = 2
        SCAN_ERROR = 3
    }

Value	Usage	Currently Returned
`SCAN_CLEAN`	File is benign	✅ Yes
`SCAN_SUSPICIOUS`	File is suspicious but not confirmed malicious	❌ Reserved for future use
`SCAN_MALICIOUS`	File is classified as malware	✅ Yes
`SCAN_ERROR`	Parsing or analysis error	❌ Reserved for future use
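Because two of the four values are reserved, code that consumes a verdict should still handle all of them. The sketch below shows one way to do that; `ScanResultName` is a hypothetical helper, not part of the documented API, and the enum is written as a plain C++ enum on the assumption that the real declaration looks similar.

```cpp
#include <string>

enum SCAN_RESULT {
    SCAN_CLEAN = 0,
    SCAN_SUSPICIOUS = 1,
    SCAN_MALICIOUS = 2,
    SCAN_ERROR = 3
};

// Hypothetical helper: maps each verdict to a printable name.
// All four values are covered even though the classifier currently
// returns only SCAN_CLEAN and SCAN_MALICIOUS.
std::string ScanResultName(SCAN_RESULT result) {
    switch (result) {
        case SCAN_CLEAN:      return "SCAN_CLEAN";
        case SCAN_SUSPICIOUS: return "SCAN_SUSPICIOUS";
        case SCAN_MALICIOUS:  return "SCAN_MALICIOUS";
        case SCAN_ERROR:      return "SCAN_ERROR";
    }
    return "UNKNOWN";  // reached only if an unexpected value is passed in
}
```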

6. Design Rationale

Why Entropy?

Packers commonly used by malware (UPX, Themida, custom packers) compress or encrypt the payload section, producing near-random byte distributions. Shannon entropy, measured in bits per byte on a 0.0–8.0 scale, quantifies this randomness:

Normal compiled code: Entropy typically 5.0–6.5 due to repeating instruction patterns, padding, and structured data
Packed/encrypted code: Entropy typically 7.0–8.0 due to uniform byte distribution
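For reference, the Shannon entropy of a byte buffer is H = -Σ p_i · log2(p_i), where p_i is the fraction of bytes that have value i. The sketch below computes it for an arbitrary buffer. The function name `ShannonEntropy` is hypothetical, and the PE Parser's own entropy routine may be implemented differently.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the Shannon entropy of `bytes` in bits per byte (0.0 to 8.0).
// An empty buffer is defined here to have entropy 0.0.
double ShannonEntropy(const std::vector<std::uint8_t>& bytes) {
    if (bytes.empty()) {
        return 0.0;
    }

    // Histogram of how often each of the 256 byte values occurs.
    std::array<std::size_t, 256> counts{};
    for (std::uint8_t b : bytes) {
        ++counts[b];
    }

    const double total = static_cast<double>(bytes.size());
    double entropy = 0.0;
    for (std::size_t count : counts) {
        if (count == 0) {
            continue;  // a value that never occurs contributes nothing
        }
        const double p = static_cast<double>(count) / total;
        entropy -= p * std::log2(p);
    }
    return entropy;
}
```

A buffer made of one repeated byte scores 0.0, while a buffer in which all 256 byte values appear equally often scores 8.0, which is why the feature is bounded to the 0.0–8.0 range.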

Why Import Count?

Packed malware still needs runtime imports (VirtualAlloc, WriteProcessMemory, CreateThread, etc.) to unpack and execute its payload
Legitimate packed files (e.g., UPX-packed utilities) tend to have minimal imports since they’re often simple tools
This second threshold reduces false positives from legitimately compressed resources (high entropy, low imports)

Known Limitations

Limitation	Impact	Mitigation Path
High false-positive rate on legitimately packed software	UPX-packed tools that exceed the import threshold are flagged as malicious	Add UPX signature detection
No API-level import analysis	Cannot distinguish malicious imports from benign	Parse import names, score suspicious APIs
Binary threshold (no confidence score)	Verdict is either CLEAN or MALICIOUS, with no intermediate result	Replace with weighted scoring model
Single-section entropy	Ignores per-section patterns	Analyze entropy distribution across all sections
No opcode analysis	Stored `textOpcodes` are unused	Implement opcode frequency analysis
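As an illustration of the "weighted scoring model" mitigation, the sketch below turns the two existing features into a score between 0.0 and 1.0 instead of a binary verdict. Everything in it is an assumption: the function name `MaliciousScore`, the weights, and the normalization ranges are placeholders chosen for readability, not tuned or measured values.

```cpp
#include <algorithm>

struct FeatureVector {
    float entropy;
    int   importCount;
};

// Placeholder weights for illustration only; they sum to 1.0.
constexpr float kEntropyWeight = 0.6f;
constexpr float kImportWeight  = 0.4f;

// Clamps x into the range [0, 1].
float Clamp01(float x) {
    return std::min(1.0f, std::max(0.0f, x));
}

// Hypothetical scoring function: higher means more likely malicious.
float MaliciousScore(const FeatureVector& fv) {
    // Assumed normalization: entropy 6.0..8.0 maps to 0..1.
    const float entropyPart = Clamp01((fv.entropy - 6.0f) / 2.0f);
    // Assumed normalization: 0..20 imports maps to 0..1.
    const float importPart = Clamp01(static_cast<float>(fv.importCount) / 20.0f);
    return kEntropyWeight * entropyPart + kImportWeight * importPart;
}
```

A score like this would let the classifier map a middle band to `SCAN_SUSPICIOUS` rather than forcing every file into CLEAN or MALICIOUS.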

7. Extending the Classifier

The `FeatureVector` struct and the `Classify()` function are designed to be extended. See Adding Detection Rules for a step-by-step guide on:

Adding new features to FeatureVector
Extracting them in ExtractFeatures()
Incorporating them into the classification logic in Classify()

Future Feature Candidates

Feature	Source	Detection Value
Suspicious API imports	Import name table	Detects process injection, keylogging
Section name anomalies	Section headers	Packed malware uses non-standard names
Entry point location	OptionalHeader	Entry point outside `.text` indicates packing
Resource entropy	Resource section	Encrypted payloads hidden in resources
Opcode frequency	First 4 KB of `.text`	Obfuscated code has unusual opcode distribution
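The three extension steps can be sketched using the entry point location candidate from the table above. All names ending in `Ex`, the `entryPointInText` field, and the choice to return `SCAN_SUSPICIOUS` for this case are hypothetical; the sketch illustrates the shape of an extension, not an existing or planned rule.

```cpp
enum SCAN_RESULT {
    SCAN_CLEAN = 0,
    SCAN_SUSPICIOUS = 1,
    SCAN_MALICIOUS = 2,
    SCAN_ERROR = 3
};

// Hypothetical parser output; entryPointInText is NOT a documented field.
struct ParsedFileEx {
    float textEntropy;
    int   importCount;
    bool  entryPointInText;
};

// Step 1: add the new feature to the feature vector.
struct FeatureVectorEx {
    float entropy;
    int   importCount;
    bool  entryPointOutsideText;
};

// Step 2: extract it alongside the existing fields.
void ExtractFeaturesEx(const ParsedFileEx& parsed, FeatureVectorEx& fv) {
    fv.entropy = parsed.textEntropy;
    fv.importCount = parsed.importCount;
    fv.entryPointOutsideText = !parsed.entryPointInText;
}

// Step 3: use it in the classification logic. The existing rule is kept
// first; the new rule only adds a SUSPICIOUS band for high-entropy files
// that fall below the import threshold.
SCAN_RESULT ClassifyEx(const FeatureVectorEx& fv) {
    const bool highEntropy = fv.entropy > 6.99f;
    if (highEntropy && fv.importCount > 10) {
        return SCAN_MALICIOUS;
    }
    if (highEntropy && fv.entryPointOutsideText) {
        return SCAN_SUSPICIOUS;
    }
    return SCAN_CLEAN;
}
```

Keeping the original rule ahead of the new one means files already classified as MALICIOUS keep that verdict, so the extension can only upgrade some previously CLEAN results to SUSPICIOUS.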

Next Steps

See what happens after classification: Policy Engine Module
Understand the PE data source: PE Parser Module
Step-by-step guide to adding rules: Adding Detection Rules