About MARC
News
Have a list you want added?
What is MARC?
Why we built MARC
Why didn't anyone answer my mail?
Current TODO list
Robot policy
Some notes on Privacy

Have a list you want archived? Tell us about it. Let us know the list's email address, web site (if any), and its list-management address / what software the list runs on (Majordomo, ezmlm, Mailman, etc) if you know it. If you can include pointers to existing list archives or some other way for us to get our hands on old messages, we'll back-fill our database when we add the list, so it'll start out with some history. Let us know if you are a list owner or just an interested user, etc (for some lists it isn't clear if public archives are desired, so we'll defer to the list owners in those cases).

This engine is the official home for the KDE Project's mailing list archives. The KDE lists can be found at:
    http://lists.kde.org

The rest of MARC (if you got here from the KDE archive) lives at:
    https://marc.info/


Thanks to MARC's donors over the years. (Note, we are no longer soliciting donations.)
The following have donated $500 or more (oldest first). I'm naming my next hard drive after you!

The following people/organizations have donated $100-$499. You rule!

The following people have donated $50-$99. You rule too (though not as much :-P)

Thanks also to the couple dozen who have sent in anything from $2 - $49 ;)

Please contact me if you would like your name above to be a link somewhere, or if you think you should be on the above lists but aren't. Thanks!


What is MARC, anyway?

The Mailing list ARChives (MARC) are currently homed at KoreLogic data centers. Previously, they lived at 10 East - formerly the AIMS group, made up of the people who used to be Progressive Computer Concepts, Inc back when I worked with them - until they were acquired one last time. (Note, if you work for GE and can get the marc.theaimsgroup.com DNS CNAME / redirect restored, please contact me...)

MARC is a project we devised, developed, and (try to!) maintain in our spare time. It is an RDBMS (MySQL, to be exact) driven database of mailing list messages, viewable and browsable by list, thread, author, or searchable via a full-text search engine. Its interface is no-frills but highly functional, designed to be useable even over slow links or with text-only browsers like lynx.

As of 2014-04-02, the MARC archive has 70 million emails across about 3500 mailing lists, from over seven million different authors. It gets about 350,000 new mails per month, and over 35 million total web-hits per month.

Why is it here?

It started in late 1996 or so...

We always thought there ought to be a DejaNews for mailing lists (that is, back when there was a DejaNews... <sniff>). There is a great deal of useful, highly specialized information which circulates through mailing lists. While many lists have a dedicated archive available via the Web (some of which are quite full-featured, with searchable indexes, etc), we knew of no project to catalog these myriad sites, or to give them a consistent interface.

Signing up for all the mailing lists we were interested in was not an option -- a single high-volume list can get you into the habit of ignoring (with the help of procmail) all list mail for a week at a time, and then simply deleting the huge backlog because you don't have time to read it. And then, when you have a specific topic you're interested in, or realize that some thread has gotten interesting, you can't go back and reread its development. We know this first-hand ;)

We often found ourselves researching questions and issues about which we could find no discussions in Usenet, but which -- once we tracked down the right list -- were a Frequently Asked Question in a particular mailing list. But finding that list, and finding an archive for that list, and digging through those archives... was often a very time-consuming affair.

So we began subscribing mail aliases to a number of lists about which we were particularly interested, about topics we dealt with every day and were always interested in knowing more about, and started archiving them with Hypermail. Lists like MySQL, MiniSQL, Linux-Kernel, AXP-Linux, Multi-User NT, Bugtraq, and... well, the list kept growing. This was the first generation, if you will, of our mailing list archives. Various search engines picked up these pages, and we started having random (and repeat) visitors. So we figured we were not the only people who thought such a thing was useful.

About its present form

As we used the archives more and more for high-volume lists, the shortcomings of the model became evident. For one thing, we had no good searching capability. For another, the Hypermail archives were cumbersome for high-volume lists (our monthly index files got to be quite large for, say, Linux-Kernel, which got well over 2,000 messages in an average month--now more like 17,000). Also, there was no easy means to add "privacy features" (more on that later). Lastly, disk space consumption rose quickly, as storing one flat HTML file per message isn't a terribly efficient way to pack this kind of data.

These issues combined with one other factor: we wanted to play with MySQL. We'd been using MiniSQL for years, developing perl-based CGI, custom client/server applications for Windows and X desktops, including everything from database-driven web sites to enterprise corporate intranet systems. However, MySQL offered a multithreaded engine with exceptional performance (especially on joins -- a sore spot for MiniSQL). And MySQL's perl5 interface was nearly identical to that of MiniSQL. We wanted to give MySQL a thorough breaking-in period, both to get used to the new features and slight syntactical differences, and to reassure ourselves that it was ready to handle the load and the responsibility for the large amounts of mission-critical software packages we had developed for customers.

We white-boarded a scheme for storing mailing list messages in a relational database, worked on methods of arranging the data for maximum performance with given hardware and scalability as our datasets grew, and started coding. But only in our spare time, as we had no customer to bill the project to, yet. After a few days we had a working system (and I had a mild case of carpal tunnel).

Our experiences since then

First of all, we were then and continue to be impressed by MySQL, for its stability, performance, range of features, and price (it's free unless you want to purchase support). We've really enjoyed the project, too, although early thoughts of turning it into *real* competition with the big boys (DejaNews, Excite, etc) have been mostly squashed by a number of factors, including waning interest on our part to work on it (kind of like ncLinux...), lack of interested sponsors (ditto), and the warm fuzzies we get from contributing a free resource to the computing world as karmic payback for all the free tools (GNU software, Linux, MySQL, etc) we've been using for years.

However, we still use the site ourselves heavily, and continue to make improvements in the interface and in the back-end as time allows.

Feature requests and bug reports are welcome. Some problems I know about and just haven't gotten around to fixing; some enhancements I have planned and, well, ditto. But your comments are welcome. Please email any comments/suggestions/concerns to webguy@marc.info. Particularly, we are always interested in requests for additions to our list of lists to archive.

Why didn't anyone answer my mail?

There are several reasons why mails to us / about MARC might go unanswered for a long time (or forever).

TODO list

Here are a few of the things which are known issues that need improving, or features I'm itching to implement, and have a plan of how I'm going to do it, but which are more than an afternoon's hacking to get done:

Robot policy

In theory, we don't mind people snarfing down some MARC pages for off-line reading. (I travel a lot, and sometimes want to pull down long threads before hitting the road to read locally, etc.)

On the other hand... first, if we think you are a spam-bot address-harvester, no death is slow or painful enough. Also, even if well-intentioned, a robot crawling MARC can sometimes create a DoS; if the robot sustains many parallel requests (or we happen to be hit by multiple different robots at the same time) and doesn't back off if the site starts to slow down, it can bog down the server. In a perfect world MARC would scale better, and would automatically recognize abusive robots 100% accurately, 100% of the time. But since it's not a perfect world... we may throttle traffic from you if your IP, user-agent, or IP/user-agent combination has misbehaved in the past.

If you want to crawl MARC, please be sure you have a delay between requests, say one or two seconds. If you think we've mis-identified you as a robot, please feel free to contact us. Please include information we'll need to find your activity in our logs, such as the time you get this message, the IP address(es) you are browsing from, and the user-agent (web browser) you are using.
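
A minimal sketch, in Python, of a crawler paced along these lines: one request at a time, a pause between requests, and a longer pause when the site starts responding slowly. The function names, the user-agent string, and the specific thresholds (2 seconds base delay, doubling after a response slower than 5 seconds) are assumptions made for illustration, not values MARC prescribes.

```python
import time
import urllib.request


def fetch_url(url, timeout=30):
    """Fetch one page and return its body as bytes.

    The User-Agent value is a hypothetical placeholder; use one that
    identifies you honestly.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-offline-reader/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def polite_crawl(urls, fetch=fetch_url, base_delay=2.0, slow_after=5.0,
                 max_delay=60.0, sleep=time.sleep, clock=time.monotonic):
    """Fetch urls strictly one at a time, pausing between requests.

    The pause starts at base_delay seconds. It doubles (capped at
    max_delay) after any response that takes longer than slow_after
    seconds, and returns to base_delay once a response is fast again.
    """
    delay = base_delay
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)          # never issue requests back to back
        started = clock()
        pages[url] = fetch(url)   # sequential: no parallel requests
        elapsed = clock() - started
        if elapsed > slow_after:
            delay = min(delay * 2, max_delay)   # site is slow: back off
        else:
            delay = base_delay
    return pages
```

The `fetch`, `sleep`, and `clock` parameters exist only so the pacing logic can be exercised without touching the network; in normal use the defaults are enough, e.g. `polite_crawl(list_of_thread_urls)`.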

Some notes about privacy

We have developed a bad attitude about spam over the years and we know we are not the only ones. We take steps to respect people's privacy and to try to minimize the usefulness of our site to would-be spammers gathering email addresses.

First of all, we respect the 'x-no-archive' mail header à la DejaNews -- if your mail message includes a 'x-no-archive: yes' header it will be dropped from our feed, to respect the wishes of list members who wish to keep their posts private. Since some mail clients still do not allow adding arbitrary mail headers, we've recently adopted what Deja (now Google) added when we weren't looking: we also check for 'X-No-Archive: yes' in the first line of the *body* of mails. Please email us if you find a message in our database posted by you with the 'x-no-archive' header set. (Note to list admins: the ezmlm list management software gratuitously adds 'X-No-Archive: yes' to every mail that passes through the list by default. It's not possible to tell if the mail was originally sent with that header set or not. That'll prevent any ezmlm'd message from appearing in MARC, until/unless the default list config is changed by an administrator.)
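
The two checks described above can be sketched roughly as follows, in Python. This is an illustration only, not MARC's actual code: the function names are hypothetical, and the exact matching rules (case-insensitive, tolerant of extra whitespace, first part of a multipart message) are assumptions.

```python
import re
from email import message_from_string

# Matches a body line such as "X-No-Archive: yes" (assumed to be
# case-insensitive and tolerant of surrounding whitespace).
BODY_MARKER = re.compile(r"^\s*x-no-archive\s*:\s*yes\s*$", re.IGNORECASE)


def first_body_line(msg):
    """Return the first line of the message body, or '' if there is none."""
    part = msg
    while part.is_multipart():
        subparts = part.get_payload()
        if not subparts:
            return ""
        part = subparts[0]        # assumption: only the first part counts
    payload = part.get_payload()
    lines = payload.splitlines() if isinstance(payload, str) else []
    return lines[0] if lines else ""


def wants_no_archive(raw_message):
    """True if the mail opts out via the header or the first body line."""
    msg = message_from_string(raw_message)
    header = str(msg.get("X-No-Archive", ""))   # header names are case-insensitive
    if header.strip().lower() == "yes":
        return True
    return bool(BODY_MARKER.match(first_body_line(msg)))
```

A message for which `wants_no_archive` returns true would simply be dropped before it reaches the database.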

Second, any database view that includes multiple messages (viewing threads, browsing lists, or browsing the results of a search) will show only the real name of the sender, or, failing that, the username of the sender with the @domain.com stripped. So, index pages cannot be pulled and parsed by address-trolling robots. Such a robot would have to pull every individual message to obtain a list of addresses.
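
As a rough illustration of that rule, here is a Python sketch that picks what an index page would show for a given From header. The function name is hypothetical and this is not MARC's actual code; it relies only on the standard library's `email.utils.parseaddr`.

```python
from email.utils import parseaddr


def index_display_name(from_header):
    """Return the sender label for multi-message views.

    Prefer the real name; failing that, use the username with the
    @domain part stripped, so no full address appears on index pages.
    """
    real_name, address = parseaddr(from_header)
    if real_name.strip():
        return real_name.strip()
    return address.split("@", 1)[0]
```

For example, `index_display_name("Jane Doe <jane@example.com>")` gives `Jane Doe`, while a bare `jane@example.com` gives `jane`.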

More recently, other archives have started stripping the names and addresses off of posts entirely. That starts to get into some sticky copyright issues--by removing attribution entirely you are no longer giving credit where credit is due--so it's a difficult balance. That's an easy step for list-admins to take themselves--they "own" the contents of their lists, and list-members agree to whatever their terms & conditions are by using their list--but since we're a third party and don't have the explicit or implicit permission of each poster to remove their attributions, we can't do that. However, I have implemented some address munging which obfuscates, but does not destroy or remove, the original poster's address--at signs and periods are replaced with ' () ' and ' ! ' respectively. Of course most spammers will evolve to handle these eventually... We still do not tamper with message bodies, however, since that might break patches, PGP signatures, etc.
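
The munging itself is simple enough to show in a couple of lines. A Python sketch follows (the function name is hypothetical, and this is an illustration rather than the code MARC runs); it would be applied to addresses in displayed headers only, never to message bodies.

```python
def munge_address(address):
    """Obfuscate an address: '@' becomes ' () ' and '.' becomes ' ! '."""
    return address.replace("@", " () ").replace(".", " ! ")


print(munge_address("jane.doe@example.com"))
# prints: jane ! doe () example ! com
```

A human can still read the result back into a working address, which is the point: attribution is preserved, just not in a form a naive harvester will recognize.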

Third, we only archive and make available mailing lists whose contents are already public, such as those available from at least one other site on the net, or which have an open policy. Most of our archives are totally unofficial, and as such, when in doubt, we first verify that the list maintainers have allowed at least one other site to carry archives of messages before we make them available to the rest of the world, to respect the wishes of list administrators who wish to keep their lists private. List administrators are invited to email us requests to cease archiving their list and/or making the archive publicly accessible if they wish for any reason. We also welcome submissions of "blurbs" to go along with a particular list's archive describing the topic of the list, its home, maintainers, etc (something we're not very good about keeping up to date for all the lists, left to our own devices).

We are very, very, very reluctant to make any changes to database contents once a message comes in. We've received threats from clueless companies' lawyers because of archived bugtraq posts pointing out security flaws, for example. If we honored occasional "oops I didn't mean to post that" mails, we would be editing content, and those clueless lawyers might have a leg to stand on. As a result, our position is that we will only remove a message for one of these reasons:

-A list admin asks us to remove a private list we've accidentally made public archives for, in which case, poof, the whole list is gone (after we are sufficiently sure it's really the list admin requesting it, and not a forged mail, etc).
-On request from an original poster (that we can come close to verifying, given that on the Internet everyone is a dog) that is agreed-to by the list admin/owner.
-A message is clearly, without a doubt, useless spam that got through a list's (or our) filters.
-A court orders us to remove a message.

Two other circumstances will cause a mail to be dropped: one, technical glitches (of course those hardly ever happen...). Two, when we are bulk-inserting older mails, we might filter out known spam from the old list spools before adding them.

#include "stddisclaimer.h"

Thanks, and enjoy! (Um, and send pizza.)

Hank Leininger


