SoK: A Tale of Reduction, Security, and Correctness
In September 2023, our paper was published at ESORICS (European Symposium on Research in Computer Security), one of the top-tier venues for security research. The work was a collaboration between me and researchers at SRI International, Georgia Institute of Technology, Stony Brook University, and the University of Arizona.
The Problem: Software Bloat
Modern software ships with far more code than any single deployment actually uses. A web server might include support for dozens of protocols, a media library bundles codecs for formats nobody requests, and standard libraries carry functions that never get called.
This isn't just a storage problem — it's a security problem. Every line of unused code is attack surface. CVEs regularly target features that the vast majority of deployments never invoke. The Heartbleed vulnerability in OpenSSL, for instance, exploited a feature (TLS heartbeats) that most servers didn't need.
Software debloating aims to fix this: automatically remove code that isn't needed for a specific deployment configuration, reducing the attack surface without changing functionality.
What We Evaluated
Our systematization of knowledge (SoK) evaluated the landscape of C/C++ debloating tools across three dimensions:
- Reduction — How much code does the tool actually remove?
- Security — Does debloating meaningfully reduce the attack surface (CVEs, gadgets)?
- Correctness — Does the debloated program still behave correctly?
We tested 10 debloating tools against a common benchmark suite, applying each tool to real-world programs like Nginx, SQLite, and cURL. The evaluation was deliberately adversarial: we didn't just check whether the debloated program passed its own test suite; we also probed for subtle behavioral differences.
Tools evaluated:
├── Static analysis-based
│   ├── CHISEL
│   ├── RAZOR
│   └── OCCAM
├── Dynamic trace-based
│   ├── PIECE-WISE
│   ├── BinRec
│   └── TRIMMER
└── Hybrid approaches
    ├── LMCAS
    ├── DEBLOAT
    ├── BLADE
    └── C-Reduce (baseline)

Key Findings
The Reduction-Correctness Trade-off
The tools that achieved the most aggressive reduction were also the most likely to break program semantics. This isn't surprising in hindsight, but the degree of the trade-off was striking.
Dynamic trace-based tools, which observe program execution on sample inputs and remove unexercised code, achieved impressive reduction ratios. But they're fundamentally limited by the completeness of their training inputs. Miss an edge case in your traces, and the debloated program silently drops functionality.
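The failure mode can be sketched in a few lines. This is a hypothetical example of my own (not taken from any evaluated tool or benchmark): a numeric parser whose overflow guard is almost never exercised by ordinary inputs, making it look dead to a trace-based debloater.

```c
#include <limits.h>
#include <string.h>

/* Hypothetical sketch: a decimal size parser with a rarely exercised
 * overflow guard. If the training traces used for debloating never
 * include an oversized value, the guard branch is never executed, and
 * a trace-based tool may classify it as dead code and remove it,
 * silently changing behavior on exactly the adversarial inputs that
 * matter most. Returns -1 on malformed or out-of-range input. */
long parse_size(const char *s) {
    long val = 0;
    if (*s == '\0') return -1;                       /* reject empty input */
    for (; *s; s++) {
        if (*s < '0' || *s > '9') return -1;         /* reject non-digits */
        int digit = *s - '0';
        if (val > (LONG_MAX - digit) / 10)           /* overflow guard: cold path */
            return -1;
        val = val * 10 + digit;
    }
    return val;
}
```

Training on inputs like `"1024"` exercises the loop but never the guard; only an input longer than `LONG_MAX` allows (e.g. a 25-digit string) reaches it.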
Security Gains Are Real but Nuanced
Debloating does remove real CVEs. We found that aggressive debloating of Nginx eliminated code associated with several known vulnerabilities. However, the gadget reduction (ROP/JOP gadgets available to an attacker) was less consistent than the raw code reduction might suggest. Critical gadgets in core code paths tend to survive debloating since, by definition, those paths are used.
Correctness Is the Hard Part
The most concerning finding: several tools produced debloated binaries that passed standard test suites but exhibited subtle behavioral differences under adversarial testing. These weren't crashes — they were silent semantic changes.
For instance, one debloated version of a parser would accept malformed input that the original correctly rejected. Another silently dropped error logging, making the program appear to work while losing observability.
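A stripped-down illustration of this class of bug (hypothetical code, not from the actual tools or benchmarks): an "original" validator and a "debloated" variant whose rejection branch was removed, where both agree on every input the test suite covers.

```c
#include <ctype.h>

/* Hypothetical sketch of the silent-semantic-change failure mode.
 * The original validator rejects malformed identifiers; the debloated
 * variant stands in for a binary whose validation branch was removed
 * because no test ever exercised it. A test suite containing only
 * well-formed inputs cannot tell the two apart. */
int original_is_valid_id(const char *s) {
    if (*s == '\0') return 0;                          /* reject empty */
    for (; *s; s++)
        if (!isalnum((unsigned char)*s)) return 0;     /* reject punctuation */
    return 1;
}

int debloated_is_valid_id(const char *s) {
    (void)s;
    return 1;  /* rejection path removed: accepts everything, no crash */
}
```

On `"abc123"` both return 1 and the test suite passes; on `"../etc"` the original returns 0 while the debloated variant returns 1. Nothing crashes, which is why only differential, adversarial probing surfaces the divergence.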
Our Contribution
Beyond the empirical evaluation, we proposed a framework for evaluating debloating tools that considers:
- Functional correctness — not just test-suite pass rates, but semantic equivalence under diverse inputs
- Security metrics — CVE elimination, gadget surface reduction, and preservation of security checks
- Practicality — build system integration, scalability to large codebases, and maintenance burden
We also identified open research questions: How do you verify that debloating preserves security-critical error handling? Can formal methods provide guarantees that testing can't? How should debloating interact with ongoing software updates?
Reflections
Working on this paper was my entry point into academic security research. The rigor required — every claim backed by data, every methodology decision justified — shaped how I think about engineering problems too.
The core lesson translates directly to software design: removing code is a security improvement, but only if you can prove you removed the right code. This applies whether you're debloating a binary or refactoring a codebase. Every deletion needs confidence in its correctness.
The paper is available in the ESORICS 2023 proceedings for those interested in the full methodology and results.