Mbox Viewer

Deduplication

The process of detecting and removing duplicate email messages from an archive, typically by comparing Message-ID values, to avoid redundancy when merging multiple MBOX files.

Duplicate messages arise naturally when managing email archives over time. For example, if you run two Google Takeout exports six months apart and combine them, messages from the overlapping period will appear in both MBOX files. Merging without deduplication stores each of those messages twice in the combined archive, which inflates thread counts and clutters search results with repeated hits.

The most reliable deduplication key is the Message-ID header, which is intended to be globally unique per message. Two messages with the same Message-ID are therefore treated as duplicates. A deduplication pass over a set of MBOX files can identify these collisions and either skip the duplicate during import or remove it from the merged output.
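
As an illustration of such a pass, here is a minimal sketch using Python's standard `mailbox` module. It is not how Mbox Viewer is implemented; the function name `merge_mbox_dedup` and its return value are assumptions made for this example. It keeps the first copy of each Message-ID and skips later ones, and it always keeps messages that have no Message-ID at all.

```python
import mailbox


def merge_mbox_dedup(source_paths, dest_path):
    """Append messages from source_paths to dest_path, skipping repeated Message-IDs.

    Hypothetical helper for illustration. Returns (kept, skipped).
    """
    seen = set()          # Message-ID values already written to the output
    kept = skipped = 0
    dest = mailbox.mbox(dest_path)  # created if it does not exist yet
    try:
        for path in source_paths:
            src = mailbox.mbox(path, create=False)
            try:
                for msg in src:
                    raw_id = msg.get("Message-ID")
                    key = str(raw_id).strip() if raw_id else ""
                    if key and key in seen:
                        skipped += 1   # same Message-ID seen earlier: duplicate
                        continue
                    if key:
                        seen.add(key)
                    # Messages without a Message-ID are kept unconditionally here.
                    dest.add(msg)
                    kept += 1
            finally:
                src.close()
        dest.flush()
    finally:
        dest.close()
    return kept, skipped
```

Calling `merge_mbox_dedup(["export-jan.mbox", "export-jul.mbox"], "merged.mbox")` (file names are placeholders) would write each overlapping message once. Because the sketch appends to `dest_path`, it should be pointed at a new or empty file.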

Edge cases in deduplication include messages with missing Message-IDs (common in very old or malformed mail) and messages with identical Message-IDs but different content (caused by buggy sending software). Robust tools handle these by combining Message-ID with a hash of key headers or the full message body as a secondary fingerprint. Mbox Viewer uses Message-ID comparison when merging archives to keep the result clean.
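
The secondary-fingerprint idea can be sketched as follows. This is one possible approach under stated assumptions, not a description of any particular tool: the choice of headers in `FINGERPRINT_HEADERS`, the use of SHA-256, and the names `fingerprint` and `dedup_key` are all hypothetical.

```python
import hashlib
from email.message import Message

# Assumed set of "key headers"; a real tool might choose differently.
FINGERPRINT_HEADERS = ("From", "To", "Date", "Subject")


def fingerprint(msg: Message) -> str:
    """Hash selected headers plus the decoded body parts of a message."""
    h = hashlib.sha256()
    for name in FINGERPRINT_HEADERS:
        # Collapse whitespace so differently folded header lines hash the same.
        value = " ".join(str(msg.get(name, "")).split())
        h.update(f"{name}:{value}\n".encode("utf-8", "replace"))
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        h.update(part.get_payload(decode=True) or b"")
    return h.hexdigest()


def dedup_key(msg: Message) -> tuple:
    """Combine Message-ID (empty string if missing) with the fingerprint."""
    raw_id = msg.get("Message-ID")
    msg_id = str(raw_id).strip() if raw_id else ""
    return (msg_id, fingerprint(msg))
```

With this key, two messages count as duplicates only when both the Message-ID and the fingerprint match. Messages that share a Message-ID but differ in content are both kept, and messages without a Message-ID are still deduplicated when their headers and body are identical.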
