mbox files - can they be "compacted"? - Discuss

mbox files - can they be "compacted"?


Rafał Radecki

6 Apr 2014

12:09 p.m.

Hi All ;)

Is there an option to compact large mbox files from the shell? I did not find anything on Google. I have some very large, constantly updated mbox files and would like to know if they can be made smaller with any tool. AFAIK mutt does such an operation when, for example, an email is deleted, but I am curious whether there are other options.

BR, Rafal.


Nicolas Thierry-Mieg

6 Apr

12:57 p.m.

On 04/06/2014 02:09 PM, Rafał Radecki wrote:

...

Hi All ;)

Is there an option to compact large mbox files from the shell? I did not find anything on Google. I have some very large, constantly updated mbox files and would like to know if they can be made smaller with any tool.

rm makes them a lot smaller. gzip not as much, but you can get the content back.

sorry for the noise... :-)

Mr Queue

6:27 p.m.

On Sun, 6 Apr 2014 14:09:45 +0200 Rafał Radecki radecki.rafal@gmail.com wrote:

...

Is there an option to compact large mbox files from the shell?

A couple of different filesystems support on-the-fly compression. ZFS and Btrfs come to mind.

-- Q: What's tan and black and looks great on a lawyer? A: A doberman.

Chris

14 Apr

4:31 a.m.

On 04/06/2014 02:09 PM, Rafał Radecki wrote:

...

I have some very large constantly updated mbox files

I don't know a tool to compact them, but I would consider converting them to Maildir. Although they won't need less space, handling them will be easier.
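A minimal sketch of such a conversion, assuming Python's standard `mailbox` module is available. The function name `mbox_to_maildir` and the paths are illustrative only and do not come from this thread; the source mbox is locked while it is read because the original poster's files are constantly updated.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir; return the count."""
    src = mailbox.mbox(mbox_path, create=False)       # fail if the mbox is missing
    dst = mailbox.Maildir(maildir_path, create=True)  # creates cur/, new/, tmp/
    copied = 0
    src.lock()  # keep delivery agents that honour locks out while we read
    try:
        for message in src:
            dst.add(message)  # each message becomes one file in the Maildir
            copied += 1
        dst.flush()
    finally:
        src.unlock()
        src.close()
        dst.close()
    return copied
```

The source mbox is left untouched, so it can be removed only after the copy has been checked.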

- Chris

Keith Keller

5:25 a.m.

On 2014-04-14, Chris ch2009@arcor.de wrote:

...

On 04/06/2014 02:09 PM, Rafał Radecki wrote:

...
I have some very large constantly updated mbox files

I don't know a tool to compact them, but I would consider converting them to Maildir. Although they won't need less space, handling them will be easier.

In the context of the OP, when mutt tries to deal with a message (e.g., deleting, moving to a folder), it can be boatloads faster, since handling the message works on a small file which contains just that message. Deleting a message from an mbox mailbox, for example, requires rewriting the entire changed mbox file to disk (minus the deleted message). Deleting a message from a Maildir mailbox is just removing one file from a directory.
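The rewrite-on-delete behaviour described here is also what makes an mbox smaller from a script. The following is a sketch using Python's standard `mailbox` module, not a description of what mutt does internally; `expunge_mbox` and the `should_delete` callback are hypothetical names chosen for illustration.

```python
import mailbox
import os


def expunge_mbox(mbox_path, should_delete):
    """Drop messages for which should_delete(msg) is true and rewrite the file.

    Returns (bytes_before, bytes_after).
    """
    before = os.path.getsize(mbox_path)
    box = mailbox.mbox(mbox_path, create=False)
    box.lock()  # the file is about to be replaced; keep other writers out
    try:
        doomed = [key for key, msg in box.iteritems() if should_delete(msg)]
        for key in doomed:
            box.remove(key)
        box.flush()  # writes a new file without the removed messages
    finally:
        box.unlock()
        box.close()
    return before, os.path.getsize(mbox_path)
```

Note that the cost is proportional to the size of the whole mbox, not to the size of the deleted messages, which is the point being made above.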

--keith

-- kkeller@wombat.san-francisco.ca.us

Russell Miller

5:41 a.m.

On Apr 13, 2014, at 10:25 PM, Keith Keller kkeller@wombat.san-francisco.ca.us wrote:

...

In the context of the OP, when mutt tries to deal with a message (e.g., deleting, moving to a folder), it can be boatloads faster, since handling the message works on a small file which contains just that message. Deleting a message from an mbox mailbox, for example, requires rewriting the entire changed mbox file to disk (minus the deleted message). Deleting a message from a Maildir mailbox is just removing one file from a directory.

HOWEVER. When a directory grows too large, the OS can take a long time to seek through the directory, which can cause its own set of problems. And this makes cleaning out a maildir directory selectively a real pain. Maildir really could do with a hashing mechanism.

--Russell

John R Pierce

5:47 a.m.

On 4/13/2014 10:41 PM, Russell Miller wrote:

...

HOWEVER. When a directory grows too large, the OS can take a long time to seek through the directory, which can cause its own set of problems. And this makes cleaning out a maildir directory selectively a real pain. Maildir really could do with a hashing mechanism.

some file systems are better at this than others... like, xfs does quite well with 1000s of small files in a directory.

I wonder what thunderbird uses? I have 12000 messages in my 'centos' folder, 24720 in another folder, yet it seems quite snappy to find and delete individual messages.

-- john r pierce 37N 122W somewhere on the middle of the left coast

Lamar Owen

15 Apr

6:25 p.m.

On 04/14/2014 01:47 AM, John R Pierce wrote:

...

I wonder what thunderbird uses?

One of the mbox variants; that's why it does 'compaction.'

Scott Robbins

14 Apr

2:38 p.m.

On Sun, Apr 13, 2014 at 10:41:14PM -0700, Russell Miller wrote:

...

On Apr 13, 2014, at 10:25 PM, Keith Keller kkeller@wombat.san-francisco.ca.us wrote:

...
In the context of the OP, when mutt tries to deal with a message (e.g., deleting, moving to a folder), it can be boatloads faster, since handling the message works on a small file which contains just that message. Deleting a message from an mbox mailbox, for example, requires rewriting the entire changed mbox file to disk (minus the deleted message). Deleting a message from a Maildir mailbox is just removing one file from a directory.

Time spent with mutt searching a directory can be drastically cut by using caching. See my old page, http://home.roadrunner.com/~computertaijutsu/mutt.html#IMAP

Even if not using IMAP, using a $HOME/.mutt_cache can greatly speed things up.

-- Scott Robbins PGP keyID EB3467D6 ( 1B48 077D 66F6 9DB0 FDC2 A409 FA54 EB34 67D6 ) gpg --keyserver pgp.mit.edu --recv-keys EB3467D6

Bill Campbell

5:29 p.m.

On Sun, Apr 13, 2014, Russell Miller wrote:

...

On Apr 13, 2014, at 10:25 PM, Keith Keller kkeller@wombat.san-francisco.ca.us wrote:

...
In the context of the OP, when mutt tries to deal with a message (e.g., deleting, moving to a folder), it can be boatloads faster, since handling the message works on a small file which contains just that message. Deleting a message from an mbox mailbox, for example, requires rewriting the entire changed mbox file to disk (minus the deleted message). Deleting a message from a Maildir mailbox is just removing one file from a directory.

...

HOWEVER. When a directory grows too large, the OS can take a long time to seek through the directory, which can cause its own set of problems. And this makes cleaning out a maildir directory selectively a real pain. Maildir really could do with a hashing mechanism.

We have been using Maildir with courier-imap for decades, and haven't had an issue with this. My security folder typically holds 25,000+ messages from the last 7 days, and accessing it either with IMAP or directly with mutt isn't a problem.

I have written various scripts over the years to convert from various mail storage formats, ranging from SCO's horrible ctrl-A-delimited format through U.W. IMAP, and ones that query other IMAP servers to convert their folder structures to local Maildir.

Maildir is generally very easy to handle with standard *nix command line tools. We have moved mail servers with tens of thousands of email customers for some regional ISPs by rsync'ing from the old server to the new one to get the bulk of the mail across before cutting over to the new machine. Then we shut the old server down, change the DNS to point to the new one, and finally do a new rsync --delete to update the new machine. There's a period where some deleted messages may reappear in the client's mailbox before the final rsync is complete, but all new messages appear immediately.

Bill

-- INTERNET: bill@celestial.com Bill Campbell; Celestial Software LLC URL: http://www.celestial.com/ PO Box 820; 6641 E. Mercer Way Voice: (206) 236-1676 Mercer Island, WA 98040-0820 Fax: (206) 232-9186 Skype: jwccsllc (206) 855-5792 Never blame a legislative body for not doing something. When they do nothing, that don't hurt anybody. When they do something is when they become dangerous. -- Will Rogers

Russell Miller

15 Apr

2:49 a.m.

...

We have been using Maildir with courier-imap for decades, and haven't had an issue with this. My security folder typically holds 25,000+ messages from the last 7 days, and accessing it either with IMAP or directly with mutt isn't a problem.

I have written various scripts over the years to convert from various mail storage formats, ranging from SCO's horrible ctrl-A-delimited format through U.W. IMAP, and ones that query other IMAP servers to convert their folder structures to local Maildir.

Maildir is generally very easy to handle with standard *nix command line tools.

As some have noted, modern filesystems are better at this than ones such as ext2. However, even in the best of cases, there are still situations where maildirs with a lot of messages are awkward to handle. Specifically, if you're trying to find specific messages based on criteria that are not easily discernible from the inode, for example, things with attachments. The awkwardness comes from the fact that there is a maximum size for a command's argument list, so you can't just use * in the shell. You have to use a bit more script-fu, such as find, etc.
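One way around the argument-list limit is to walk the directory from a script instead of expanding a glob, since the limit only applies to arguments passed to a new process. The sketch below assumes a standard Maildir layout with `cur/` and `new/` subdirectories; `messages_with_attachments` is a hypothetical helper name, and it relies on Python's `email` package to decide what counts as an attachment.

```python
import email
import os
from email import policy


def messages_with_attachments(maildir_path):
    """Yield paths of messages in cur/ and new/ that carry an attachment."""
    for sub in ("cur", "new"):
        folder = os.path.join(maildir_path, sub)
        if not os.path.isdir(folder):
            continue
        # scandir streams directory entries one at a time, so no huge
        # argument list is ever built.
        with os.scandir(folder) as entries:
            for entry in entries:
                if not entry.is_file():
                    continue
                with open(entry.path, "rb") as fh:
                    msg = email.message_from_binary_file(fh, policy=policy.default)
                if next(msg.iter_attachments(), None) is not None:
                    yield entry.path
```

Every message still has to be opened and parsed, so this fixes the awkwardness of the glob, not the cost of the search.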

Even if there aren't huge issues with doing this, it's an easily fixed thing. Allowing directories to have hundreds of thousands of entries as a matter of course, even if it's something that causes no issues in many cases, to me is an architectural issue.

But then, I noticed my beard is starting to turn grey the other day, so maybe I should just get out the COBOL and tell everyone how we did it when I was a kid.

--Russell

Les Mikesell

4:51 a.m.

On Mon, Apr 14, 2014 at 9:49 PM, Russell Miller duskglow@gmail.com wrote:

...

...
Even if there aren't huge issues with doing this, it's an easily fixed thing. Allowing directories to have hundreds of thousands of entries as a matter of course, even if it's something that causes no issues in many cases, to me is an architectural issue.

Even if modern systems sort-of handle it, it still seems like a bad thing to do when you consider that opening a file for writing has to atomically decide whether that name already exists before creating it - so other concurrent create/delete operations have to be blocked.

-- Les Mikesell lesmikesell@gmail.com

John R Pierce

5:16 a.m.

On 4/14/2014 9:51 PM, Les Mikesell wrote:

...

Even if modern systems sort-of handle it, it still seems like a bad thing to do when you consider that opening a file for writing has to atomically decide whether that name already exists before creating it - so other concurrent create/delete operations have to be blocked.

the better file systems (xfs, zfs, ntfs at least) use a b-tree directory structure, so finding a filename out of 10s of 1000s is very little overhead.

-- john r pierce 37N 122W somewhere on the middle of the left coast

Keith Keller

7:52 a.m.

On 2014-04-15, Russell Miller duskglow@gmail.com wrote:

...

As some have noted, modern filesystems are better at this than ones such as ext2. However, even in the best of cases, there are still situations where maildirs with a lot of messages are awkward to handle. Specifically, if you're trying to find specific messages based on criteria that are not easily discernible from the inode, for example, things with attachments.

This will be bad with an mbox mailbox too. Actually it'll be worse, because it'll be too hard to tell which message the grep hits.
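A plain grep over an mbox reports matching lines, not matching messages. A sketch of a per-message search, assuming Python's standard `mailbox` module; `grep_mbox` is a hypothetical name, and the match is a simple byte-substring test on the raw message, so encoded bodies (base64, quoted-printable) would not match.

```python
import mailbox


def grep_mbox(mbox_path, needle):
    """Return (key, subject) for each message whose raw bytes contain needle."""
    hits = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for key in box.iterkeys():
            if needle in box.get_bytes(key):  # needle must be bytes
                hits.append((key, box[key].get("Subject", "")))
    finally:
        box.close()
    return hits
```

The mailbox is only read here, so no lock is taken; on a constantly updated file the result is a snapshot at best.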

--keith

-- kkeller@wombat.san-francisco.ca.us

Lamar Owen

6:24 p.m.

On 04/14/2014 01:41 AM, Russell Miller wrote:

...

HOWEVER. When a directory grows too large, the OS can take a long time to seek through the directory, which can cause its own set of problems. And this makes cleaning out a maildir directory selectively a real pain. Maildir really could do with a hashing mechanism.

Worse, if the dir gets too big, even after files are deleted it can be very slow. I had one case with >1,000,000 messages in a single maildir (spam on steroids, was getting 80,000 messages per hour overnight); after it was cleaned out to <1,000 messages it still took several minutes to ls the dir, and the machine's responsiveness went through the floor. Copying to a new dir and renaming fixed the slowdown; the directory was >50MB (the directory itself, not its contents).
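The directory's own size can be read with a stat call, and the fix can be scripted. This is a sketch under several assumptions: it moves entries with rename rather than copying them as described above, the `.rebuild` suffix is a made-up temporary name, nothing else may be delivering into the directory while it runs, and ownership of the new directory is not restored. For a Maildir it would be applied to `cur/` or `new/` individually.

```python
import os


def directory_entry_size(path):
    """Size in bytes of the directory itself, not of the files in it."""
    return os.stat(path).st_size


def rebuild_directory(path):
    """Move every entry into a fresh directory, then swap it into place.

    Returns the number of entries moved.
    """
    path = path.rstrip(os.sep)
    fresh = path + ".rebuild"  # hypothetical temporary name
    mode = os.stat(path).st_mode & 0o7777
    os.mkdir(fresh)
    os.chmod(fresh, mode)  # match the old permissions regardless of umask
    # Take the listing first: renaming entries while iterating a directory
    # can cause some of them to be skipped.
    names = os.listdir(path)
    for name in names:
        os.rename(os.path.join(path, name), os.path.join(fresh, name))
    os.rmdir(path)  # fails if anything was added in the meantime
    os.rename(fresh, path)
    return len(names)
```

Whether the rebuilt directory is actually smaller depends on the filesystem; comparing `directory_entry_size` before and after shows if it helped.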

I'd rather have mbox for plain text e-mail storage, and a database for something really high performance.

discuss@lists.centos.org

14 comments

11 participants

participants (11)

Bill Campbell
Chris
John R Pierce
Keith Keller
Lamar Owen
Les Mikesell
Mr Queue
Nicolas Thierry-Mieg
Rafał Radecki
Russell Miller
Scott Robbins