ML Classifier Module
The ML Classifier takes a feature vector extracted from a PE file and produces a SCAN_RESULT verdict: CLEAN, SUSPICIOUS, MALICIOUS, or ERROR. The current implementation is a baseline heuristic over entropy and import count, and it returns only CLEAN or MALICIOUS.
Related: PE Parser Module · Policy Engine Module · Scan Pipeline Flow · Design Decisions
1. Source Files
| File | Purpose |
|---|---|
| scanner/ml.cpp | Classify() implementation |
| scanner/ml.h | Classify() declaration |
| scanner/features.cpp | ExtractFeatures() implementation |
| scanner/features.h | FeatureVector struct, ExtractFeatures() declaration |
2. Pipeline Position
flowchart LR
PE["ParsedFile\n(from PE Parser)"]
FE["ExtractFeatures()"]
ML["Classify()"]
POL["ApplyPolicy()"]
PE -->|"ParsedFile"| FE
FE -->|"FeatureVector"| ML
ML -->|"SCAN_RESULT"| POL
style FE fill:#4361ee,color:#fff
style ML fill:#e07a5f,color:#fff
style POL fill:#2d6a4f,color:#fff
3. FeatureVector Structure
| Field | Type | Source | Description |
|---|---|---|---|
| entropy | float | ParsedFile.textEntropy | Shannon entropy of the highest-entropy section (0.0–8.0) |
| importCount | int | ParsedFile.importCount | Number of imported DLLs |
void ExtractFeatures(const ParsedFile* parsed, FeatureVector* fv) {
    // Copy both features straight from the parser output.
    fv->entropy = parsed->textEntropy;
    fv->importCount = parsed->importCount;
}
This is a direct mapping: no normalization or scaling is applied.
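To make the data flow concrete, here is a minimal sketch of the two structs involved, inferred from the field table above. The `ParsedFile` shown is a reduced, hypothetical stand-in (the real struct is defined by the PE Parser Module and may carry more fields), and `ExtractFeatures()` is restated only so the sketch compiles on its own.

```cpp
// Reduced stand-in for the parser output (assumption: only the two
// fields read by ExtractFeatures() are modeled here).
struct ParsedFile {
    float textEntropy;  // entropy value computed by the PE parser
    int   importCount;  // import count computed by the PE parser
};

// Layout inferred from the field table; see scanner/features.h for the
// authoritative definition.
struct FeatureVector {
    float entropy;      // copied from ParsedFile.textEntropy (0.0 to 8.0)
    int   importCount;  // copied from ParsedFile.importCount
};

// Direct field-to-field copy, with no normalization or scaling.
void ExtractFeatures(const ParsedFile* parsed, FeatureVector* fv) {
    fv->entropy = parsed->textEntropy;
    fv->importCount = parsed->importCount;
}
```

Because the values are copied unscaled, any threshold applied later in `Classify()` is expressed in the same raw units the parser produced.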
4. Classification Algorithm
flowchart TD
Input["FeatureVector {\n entropy: float,\n importCount: int\n}"]
Input --> E{"entropy > 6.99?"}
E -->|"≤ 6.99"| Clean["SCAN_CLEAN ✅"]
E -->|"> 6.99"| I{"importCount > 10?"}
I -->|"≤ 10"| Clean
I -->|"> 10"| Malicious["SCAN_MALICIOUS 🚨"]
style Clean fill:#2d6a4f,color:#fff
style Malicious fill:#e63946,color:#fff
Decision Boundary
quadrantChart
title Classification Decision Space
x-axis "Low Import Count" --> "High Import Count"
y-axis "Low Entropy" --> "High Entropy"
quadrant-1 "MALICIOUS"
quadrant-2 "CLEAN"
quadrant-3 "CLEAN"
quadrant-4 "CLEAN"
The classifier uses a simple axis-aligned decision boundary with two thresholds:
| Threshold | Value | Rationale |
|---|---|---|
| Entropy | > 6.99 | Sections with entropy of roughly 7.0 or above are likely to contain compressed or encrypted data, a hallmark of packed malware |
| Import Count | > 10 | Packed malware tends to unpack at runtime, needing imports for memory management, process injection, and I/O. Legitimate high-entropy files (e.g., compressed resources) typically have fewer functional imports |
Classification Truth Table
| Entropy | Import Count | Verdict |
|---|---|---|
| ≤ 6.99 | Any | CLEAN |
| > 6.99 | ≤ 10 | CLEAN |
| > 6.99 | > 10 | MALICIOUS |
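The truth table can be expressed in a few lines of C++. The following is a sketch rather than the contents of scanner/ml.cpp: the pointer-taking signature of `Classify()` mirrors `ExtractFeatures()` and is an assumption, and the constant names `kEntropyThreshold` and `kImportThreshold` are hypothetical.

```cpp
enum SCAN_RESULT {
    SCAN_CLEAN = 0,
    SCAN_SUSPICIOUS = 1,
    SCAN_MALICIOUS = 2,
    SCAN_ERROR = 3
};

struct FeatureVector {
    float entropy;
    int   importCount;
};

// Hypothetical names for the two documented thresholds.
constexpr float kEntropyThreshold = 6.99f;
constexpr int   kImportThreshold  = 10;

// Assumed signature; the real declaration lives in scanner/ml.h.
SCAN_RESULT Classify(const FeatureVector* fv) {
    // Both conditions must hold for a MALICIOUS verdict.
    if (fv->entropy > kEntropyThreshold &&
        fv->importCount > kImportThreshold) {
        return SCAN_MALICIOUS;
    }
    // Every other combination falls through to CLEAN.
    return SCAN_CLEAN;
}
```

Both comparisons are strict, so a file with exactly 10 imports is classified CLEAN regardless of its entropy.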
5. SCAN_RESULT Enum
classDiagram
class SCAN_RESULT {
<<enumeration>>
SCAN_CLEAN = 0
SCAN_SUSPICIOUS = 1
SCAN_MALICIOUS = 2
SCAN_ERROR = 3
}
| Value | Usage | Currently Returned |
|---|---|---|
| SCAN_CLEAN | File is benign | ✅ Yes |
| SCAN_SUSPICIOUS | File is suspicious but not confirmed malicious | ❌ Reserved for future use |
| SCAN_MALICIOUS | File is classified as malware | ✅ Yes |
| SCAN_ERROR | Parsing or analysis error | ❌ Reserved for future use |
6. Design Rationale
Why Entropy?
Malware packers (UPX, Themida, custom packers) compress or encrypt the payload section, producing near-random byte distributions. Shannon entropy measures this randomness:
- Normal compiled code: Entropy typically 5.0–6.5 due to repeating instruction patterns, padding, and structured data
- Packed/encrypted code: Entropy typically 7.0–8.0 due to uniform byte distribution
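For reference, Shannon entropy over bytes is the sum of -p·log2(p) across the 256 possible byte values, where p is the fraction of the buffer occupied by that value. The result ranges from 0.0 (a single repeated byte) to 8.0 (all values equally frequent). The sketch below shows one way to compute it; the function name `ShannonEntropy` is hypothetical, and the PE parser's own routine may differ.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: Shannon entropy of a byte buffer, in bits per byte.
double ShannonEntropy(const std::uint8_t* data, std::size_t size) {
    if (data == nullptr || size == 0) {
        return 0.0;  // an empty buffer carries no information
    }

    // Histogram of byte values.
    std::array<std::size_t, 256> counts{};
    for (std::size_t i = 0; i < size; ++i) {
        ++counts[data[i]];
    }

    double entropy = 0.0;
    for (std::size_t count : counts) {
        if (count == 0) {
            continue;  // absent values contribute nothing
        }
        const double p =
            static_cast<double>(count) / static_cast<double>(size);
        entropy -= p * std::log2(p);
    }
    return entropy;
}
```

A buffer holding each byte value exactly once reaches the 8.0 maximum, which is why packed or encrypted sections cluster near the top of the range.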
Why Import Count?
- Packed malware still needs runtime imports (VirtualAlloc, WriteProcessMemory, CreateThread, etc.) to unpack and execute its payload
- Legitimate packed files (e.g., UPX-packed utilities) tend to have minimal imports since they're often simple tools
- This second threshold reduces false positives from legitimately compressed resources (high entropy, low imports)
Known Limitations
| Limitation | Impact | Mitigation Path |
|---|---|---|
| High false-positive rate on legitimately packed software | UPX-packed tools flagged as malicious | Add UPX signature detection |
| No API-level import analysis | Cannot distinguish malicious imports from benign | Parse import names, score suspicious APIs |
| Binary threshold (no confidence score) | No nuance: the verdict is either CLEAN or MALICIOUS | Replace with weighted scoring model |
| Single-section entropy | Ignores per-section patterns | Analyze entropy distribution across all sections |
| No opcode analysis | Stored textOpcodes are unused | Implement opcode frequency analysis |
7. Extending the Classifier
The FeatureVector struct and the Classify() function are designed to be extended. See Adding Detection Rules for a step-by-step guide on:
- Adding new features to FeatureVector
- Extracting them in ExtractFeatures()
- Incorporating them into the classification logic in Classify()
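As a rough illustration of those three steps, the sketch below adds a suspicious-API count. Everything new here is hypothetical: the `importNames` field on `ParsedFile`, the `suspiciousApiCount` feature, the three-name API list, and the `>= 2` rule are illustrative assumptions, and the `Classify()` signature is assumed as well. The Adding Detection Rules guide remains the authoritative procedure.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

enum SCAN_RESULT {
    SCAN_CLEAN = 0,
    SCAN_SUSPICIOUS = 1,
    SCAN_MALICIOUS = 2,
    SCAN_ERROR = 3
};

// Stand-in for the parser output; importNames is an assumed field.
struct ParsedFile {
    float textEntropy;
    int importCount;
    std::vector<std::string> importNames;
};

// Step 1: add the new feature to FeatureVector.
struct FeatureVector {
    float entropy;
    int importCount;
    int suspiciousApiCount;  // hypothetical new feature
};

// Step 2: extract it in ExtractFeatures().
void ExtractFeatures(const ParsedFile* parsed, FeatureVector* fv) {
    // Illustrative list only; these APIs also appear in benign software.
    static const std::unordered_set<std::string> kSuspiciousApis = {
        "VirtualAlloc", "WriteProcessMemory", "CreateThread"};

    fv->entropy = parsed->textEntropy;
    fv->importCount = parsed->importCount;
    fv->suspiciousApiCount = 0;
    for (const std::string& name : parsed->importNames) {
        if (kSuspiciousApis.count(name) != 0) {
            ++fv->suspiciousApiCount;
        }
    }
}

// Step 3: use it in Classify(), after the existing two-threshold rule.
SCAN_RESULT Classify(const FeatureVector* fv) {
    if (fv->entropy > 6.99f && fv->importCount > 10) {
        return SCAN_MALICIOUS;
    }
    if (fv->suspiciousApiCount >= 2) {  // illustrative cutoff
        return SCAN_SUSPICIOUS;
    }
    return SCAN_CLEAN;
}
```

Keeping the original rule first preserves existing verdicts; the new feature only changes files that would previously have been classified CLEAN.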
Future Feature Candidates
| Feature | Source | Detection Value |
|---|---|---|
| Suspicious API imports | Import name table | Detects process injection, keylogging |
| Section name anomalies | Section headers | Packed malware uses non-standard names |
| Entry point location | OptionalHeader | Entry point outside .text indicates packing |
| Resource entropy | Resource section | Encrypted payloads hidden in resources |
| Opcode frequency | First 4KB of text | Obfuscated code has unusual opcode distribution |
Next Steps