About MARC

About MARC News
Have a list you want added?
What is MARC?
Why we built MARC
Why didn't anyone answer my mail?
Current TODO list
Robot policy
Some notes on Privacy

Have a list you want archived? Tell us about it. Let us know the list's email address, web site (if any), and its list-management address / what software the list runs on (Majordomo, ezmlm, Mailman, etc) if you know it. If you can include pointers to existing list archives or some other way for us to get our hands on old messages, we'll back-fill our database when we add the list, so it'll start out with some history. Let us know if you are a list owner or just an interested user, etc (for some lists it isn't clear if public archives are desired, so we'll defer to the list owners in those cases).

This engine is the official home for the KDE Project's mailing list archives. The KDE lists can be found at:
http://lists.kde.org

The rest of MARC (if you got here from the KDE archive) lives at:
https://marc.info/

Thanks to MARC's donors over the years. (Note, we are no longer soliciting donations.)
The following have donated $500 or more (oldest first). I'm naming my next hard drive after you!

The following people/organizations have donated $100-$499. You rule!

The following people have donated $50-$99. You rule too (though not as much :-P)

Derek Morr
Robert Peichaer
Nohlex GmbH

Thanks also to the couple dozen who have sent in anything from $2 - $49 ;)

Please contact me if you would like your name above to be a link somewhere, or if you think you should be on the above lists but aren't. Thanks!

What is MARC, anyway?

The Mailing list ARChives (MARC) is currently homed at KoreLogic data centers. Previously, it lived at 10 East - formerly the AIMS group, made up of the people who used to be Progressive Computer Concepts, Inc back when I worked with them - until they were acquired one last time. (Note, if you work for GE and can get the marc.theaimsgroup.com DNS CNAME / redirect restored, please contact me...)

MARC is a project we devised, developed, and (try to!) maintain in our spare time. It is an RDBMS-driven (MySQL, to be exact) database of mailing list messages, viewable and browsable by list, thread, or author, and searchable via a full-text search engine. Its interface is no-frills but highly functional, designed to be usable even over slow links or with text-only browsers like lynx.

As of 2014-04-02, the MARC archive has 70 million emails across about 3500 mailing lists, from over seven million different authors. It gets about 350,000 new mails per month, and over 35 million total web-hits per month.

Why is it here?

It started in late 1996 or so...

We always thought there ought to be a DejaNews for mailing lists (that is, back when there was a DejaNews... <sniff>). There is a great deal of useful, highly specialized information which circulates through mailing lists. While many lists have a dedicated archive available via the Web (some of which are quite full-featured, with searchable indexes, etc), we knew of no project to catalog these myriad sites, or to give them a consistent interface.

Signing up for all the mailing lists we were interested in was not an option -- a single high-volume list can get you into the habit of ignoring (with the help of procmail) all list mail for a week at a time, and then simply deleting the huge backlog because you don't have time to read it. And then, when you have a specific topic you're interested in, or realize that some thread has gotten interesting, you can't go back and reread its development. We know this first-hand ;)

We often found ourselves researching questions and issues about which we could find no discussions in Usenet, but which -- once we tracked down the right list -- turned out to be Frequently Asked Questions on a particular mailing list. But finding that list, and finding an archive for that list, and digging through those archives... was often a very time-consuming affair.

So we began subscribing mail aliases to a number of lists in which we were particularly interested, on topics we dealt with every day and always wanted to know more about, and started archiving them with Hypermail. Lists like MySQL, MiniSQL, Linux-Kernel, AXP-Linux, Multi-User NT, Bugtraq, and... well, the list kept growing. This was the first generation, if you will, of our mailing list archives. Various search engines picked up these pages, and we started having random (and repeat) visitors. So we figured we were not the only people who thought such a thing was useful.

About its present form

As we used the archives more and more for high-volume lists, the shortcomings of the model became evident. For one thing, we had no good searching capability. For another, the Hypermail archives were cumbersome for high-volume lists (our monthly index files got to be quite large for, say, Linux-Kernel, which got well over 2,000 messages in an average month--now more like 17,000). Also, there was no easy means to add "privacy features" (more on that later). Lastly, disk space consumption rose quickly, as storing flat HTML files, one per message, isn't a terribly efficient way to pack this kind of data.

These issues combined with one other factor: we wanted to play with MySQL. We'd been using MiniSQL for years, developing perl-based CGI and custom client/server applications for Windows and X desktops, including everything from database-driven web sites to enterprise corporate intranet systems. However, MySQL offered a multithreaded engine with exceptional performance (especially on joins -- a sore spot for MiniSQL). And MySQL's perl5 interface was nearly identical to that of MiniSQL. We wanted to give MySQL a thorough breaking-in period, both to get used to the new features and slight syntactical differences, and to reassure ourselves that it was ready to handle the load and the responsibility for the large number of mission-critical software packages we had developed for customers.

We white-boarded a scheme for storing mailing list messages in a relational database, worked on methods of arranging the data for maximum performance with given hardware and scalability as our datasets grew, and started coding. But only in our spare time, as we had no customer to bill the project to, yet. After a few days we had a working system (and I had a mild case of carpal tunnel).

Our experiences since then

First of all, we were then and continue to be impressed by MySQL, for its stability, performance, range of features, and price (it's free unless you want to purchase support). We've really enjoyed the project, too, although early thoughts of turning it into *real* competition with the big boys (DejaNews, Excite, etc) have been mostly squashed by a number of factors, including waning interest on our part to work on it (kind of like ncLinux...), lack of interested sponsors (ditto), and the warm fuzzies we get from contributing a free resource to the computing world as karmic payback for all the free tools (GNU software, Linux, MySQL, etc) we've been using for years.

However, we still use the site ourselves heavily, and continue to make improvements in the interface and in the back-end as time allows.

Feature requests and bug reports are welcome. Some problems I know about and just haven't gotten around to fixing; some enhancements I have planned and, well, ditto. But your comments are welcome. Please email any comments/suggestions/concerns to webguy@marc.info. Particularly, we are always interested in requests for additions to our list of lists to archive.

Why didn't anyone answer my mail?

There are several reasons why mails to us / about MARC might go unanswered for a long time (or forever).

You sent us (webguy@marc.info) a question about some project whose lists we archive, which should have gone to the relevant list instead. Keep in mind, MARC does not home any lists or projects, only their archives. As such we aren't experts in most of the things discussed on lists we archive. For those topics we do have some clue about, we try to help out on the related lists--like other helpful folks ;) We simply can't take any direct support questions for non-MARC-stuff. Too often such mails just get bit-bucketed. However, we'll try to reply with a pointer to this note and a guess as to the list(s) to which the question probably ought to have been sent (if we know).
You sent a fairly generic mail on a non-urgent matter (e.g. a request for a new list to be added). Since MARC is all spare-time volunteer-supported, list requests, etc often get batched up until we have time to get to them, and usually until we have several in a row to do. Bug reports, on the other hand, especially critical bugs or ones which impact lots of users, we try to give higher priority (but still, we have day jobs).
Your mail ended up in my mailbox and I thought it would take longer to respond to than I had at the moment (in the middle of my day job or something else, or required think-time or research, etc). And then it got buried behind a hundred others, and I eventually forgot about it before answering, because I suck. When in doubt, please do not hesitate to prod us, once every few days / week, until you get a satisfactory response. I will not be in the least insulted or bothered by this; I only offer a blanket apology that it is sometimes necessary.
You sent an email that looked so much like spam that it was trapped by a filter automatically, or I deleted it without even reading it and realizing it came from a real human. Sorry; I get about 40 real emails and 150-250 spams a day. I probably drop at least one legitimate mail a week. Again, don't hesitate to resend something that seems to have gone into the bit-bucket.

TODO list

Here are a few known issues that need improving, or features I'm itching to implement and have a plan for, but which would take more than an afternoon's hacking to get done:

Improve the search engine so that it behaves more intelligently wrt commonly used special characters, most notably _ (underscore) which currently is used as a word-splitting character (and thus is impossible to search for, itself). Also, remove the three-letter-minimum limit on search terms. Unfortunately these will require a complete rebuild of the full-text indexes, which would take many weeks. For practical purposes it may not get done until after the next hardware upgrade.
Some people have asked for list views to show the oldest message in a thread on "top", rather than the newest. Others have asked for the oldest threads first, so that you "Next"-page to newer messages, rather than putting most-recent-on-top. The current scheme is obviously what makes most sense to me, but whatever ;) This shouldn't be too hard to implement as another user-configurable setting, so I'll do it.
Improve the DoS/flood handling. The current methods are gross and kludgy. Apache's mod_throttle might do it, if I hadn't heard scary (unconfirmed) rumors about security problems with it...
I am still, personally, the bottleneck for getting new lists added. I have all kinds of concerns about making list setup fully automated, but it would still be better if it wasn't a manual process that required my attention/review. Maybe a possibility would be a list-setup interface which project owners would have access to update (so a couple of KDE people would be able to create KDE lists, Apache people create their lists, and so on). That still means that one-off projects are in the dark, though. And, the idea implies some work to get off the ground. I don't like work. Then again I dislike wasted manual effort, and/or being a point of failure, even more. Hmm....
*Insert your wishlist item here*. Feature requests followed by lots of other people asking for the same thing (um, or bags of money) will get bumped up in my priority list.
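
To illustrate the search limitation in the first item above, here is a rough Python sketch. It is not MARC's actual indexer (which isn't shown here); the function names and regular expressions are assumptions made up for illustration. It only shows why an underscore can't be searched for when it is treated as a word splitter, and what a minimum term length drops.

```python
import re


def tokenize_current(text, min_length=3):
    """Hypothetical model of the behavior described: '_' splits words,
    and terms shorter than min_length are discarded."""
    # Anything that is not an ASCII letter or digit (including '_') is a separator.
    words = re.split(r"[^A-Za-z0-9]+", text.lower())
    return [w for w in words if len(w) >= min_length]


def tokenize_proposed(text):
    """Hypothetical model of the wished-for behavior: '_' is part of a word
    and there is no minimum term length."""
    # \w matches letters, digits and the underscore.
    return re.findall(r"\w+", text.lower())


print(tokenize_current("mod_throttle in vm"))   # ['mod', 'throttle']
print(tokenize_proposed("mod_throttle in vm"))  # ['mod_throttle', 'in', 'vm']
```

Because the two tokenizers produce different terms for the same text, switching from one to the other means every stored message has to be re-indexed, which is why the item above talks about a complete rebuild.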

Robot policy

In theory, we don't mind people snarfing down some MARC pages for off-line reading. (I travel a lot, and sometimes want to pull down long threads before hitting the road to read locally, etc.)

On the other hand... first, if we think you are a spam-bot address-harvester, no death is slow or painful enough. Also, even if well-intentioned, a robot crawling MARC can sometimes create a DoS; if the robot sustains many parallel requests (or we happen to be hit by multiple different robots at the same time) and doesn't back off if the site starts to slow down, it can bog down the server. In a perfect world MARC would scale better, and would automatically recognize abusive robots 100% accurately, 100% of the time. But since it's not a perfect world... we may throttle traffic from you if your IP, user-agent, or IP/user-agent combination has misbehaved in the past.

If you want to crawl MARC, please be sure you have a delay between requests, say one or two seconds. If you think we've mis-identified you as a robot, please feel free to contact us. Please include information we'll need to find your activity in our logs, such as the time you get this message, the IP address(es) you are browsing from, and the user-agent (web browser) you are using.
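
For anyone writing such a robot, a minimal Python sketch of the "one request at a time, with a pause in between" idea might look like the following. The names polite_crawl and fetch_url are made up for this example, and the two-second default is just the upper end of the delay suggested above; treat it as a starting point under those assumptions, not as an official client.

```python
import time
import urllib.request


def fetch_url(url):
    """Fetch one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def polite_crawl(urls, delay_seconds=2.0, fetch=fetch_url, sleep=time.sleep):
    """Fetch the given URLs sequentially (never in parallel),
    pausing delay_seconds between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

The fetch and sleep parameters exist only so the loop can be exercised without touching the network; in normal use you would call polite_crawl(list_of_urls) and leave them alone.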

Some notes about privacy

We have developed a bad attitude about spam over the years and we know we are not the only ones. We take steps to respect people's privacy and to try to minimize the usefulness of our site to would-be spammers gathering email addresses.

First of all, we respect the 'x-no-archive' mail header à la DejaNews -- if your mail message includes an 'x-no-archive: yes' header it will be dropped from our feed, to respect the wishes of list members who wish to keep their posts private. Since some mail clients still do not allow adding arbitrary mail headers, we've recently adopted what Deja (now Google) added when we weren't looking: we also check for 'X-No-Archive: yes' in the first line of the *body* of mails. Please email us if you find a message in our database posted by you with the 'x-no-archive' header set. (Note to list admins: the ezmlm list management software gratuitously adds 'X-No-Archive: yes' to every mail that passes through the list by default. It's not possible to tell if the mail was originally sent with that header set or not. That'll prevent any ezmlm'd message from appearing in MARC, until/unless the default list config is changed by an administrator.)
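
As a rough illustration of the two checks described above (header, then first line of the body), here is a small Python sketch using the standard email module. It is not MARC's actual filter; the function name wants_no_archive is invented for this example, and the case-insensitive matching and the decision to skip multipart bodies are assumptions.

```python
from email import message_from_string


def wants_no_archive(raw_message):
    """Return True if the message asks not to be archived (hypothetical helper)."""
    msg = message_from_string(raw_message)

    # Check 1: the mail header. Header names are matched case-insensitively.
    header = msg.get("X-No-Archive", "")
    if str(header).strip().lower() == "yes":
        return True

    # Check 2: the first line of the body.
    # Assumption: only plain (non-multipart) bodies are examined here.
    payload = msg.get_payload()
    if isinstance(payload, str):
        lines = payload.splitlines()
        first_line = lines[0] if lines else ""
        if first_line.strip().lower() == "x-no-archive: yes":
            return True

    return False
```

Note that a check like this cannot tell a header set by the poster from one added by list software, which is exactly the ezmlm problem mentioned above.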

Second, any database view that includes multiple messages (viewing threads, browsing lists, or browsing the results of a search) will show only the real name of the sender, or, failing that, the username of the sender with the @domain.com stripped. So, index pages cannot be pulled and parsed by address-trolling robots. Such a robot would have to pull every individual message to obtain a list of addresses.
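
The rule for multi-message views (real name if there is one, otherwise the username without the domain) can be sketched in a few lines of Python. This is only an illustration under assumptions: index_display_name is a made-up name, and the standard library's parseaddr stands in for whatever parsing MARC really does.

```python
from email.utils import parseaddr


def index_display_name(from_header):
    """Name to show on index pages: the real name if present,
    otherwise the local part of the address (domain stripped)."""
    real_name, address = parseaddr(from_header)
    if real_name:
        return real_name
    # No real name given: keep only what comes before the '@'.
    return address.split("@", 1)[0]


print(index_display_name("Jane Doe <jane@example.com>"))  # Jane Doe
print(index_display_name("jane@example.com"))             # jane
```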

More recently, other archives have started stripping the names and addresses off of posts entirely. That starts to get into some sticky copyright issues--by removing attribution entirely you are no longer giving credit where credit is due--so it's a difficult balance. That's an easy step for list-admins to take themselves--they "own" the contents of their lists, and list-members agree to whatever their terms & conditions are by using their list--but since we're a third party and don't have the explicit or implicit permission of each poster to remove their attributions, we can't do that. However, I have implemented some address munging which obfuscates, but does not destroy or remove, the original poster's address--at signs and periods are replaced with ' () ' and ' ! ' respectively. Of course most spammers will evolve to handle these eventually... We still do not tamper with message bodies, however, since that might break patches, PGP signatures, etc.
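
The munging described above amounts to two string substitutions. A minimal Python sketch (the function name munge_address is made up for this example, and it assumes the whole address is munged the same way):

```python
def munge_address(address):
    """Obfuscate an address without destroying it:
    '@' becomes ' () ' and '.' becomes ' ! '."""
    return address.replace("@", " () ").replace(".", " ! ")


print(munge_address("jane.doe@example.com"))  # jane ! doe () example ! com
```

A human can still read the result back into the original address, which is the point: the attribution is obscured from naive harvesters, not removed.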

Third, we only archive and make available mailing lists whose contents are already public, such as those available from at least one other site on the net, or which have an open policy. Most of our archives are totally unofficial, and as such, when in doubt, we first verify that the list maintainers have allowed at least one other site to carry archives of messages before we make them available to the rest of the world, to respect the wishes of list administrators who wish to keep their lists private. List administrators are invited to email us requests to cease archiving their list and/or making the archive publicly accessible if they wish for any reason. We also welcome submissions of "blurbs" to go along with a particular list's archive describing the topic of the list, its home, maintainers, etc (something we're not very good about keeping up to date for all the lists, left to our own devices).

We are very, very, very reluctant to make any changes to database-contents once a message comes in. We've received threats from clueless companies' lawyers because of archived bugtraq posts pointing out security flaws, for example. If we honored occasional "oops I didn't mean to post that" mails, we would be editing content, and those clueless lawyers might have a leg to stand on. As a result, our position is that we will only remove a message for one of these reasons:

- A list admin asks us to remove a private list we've accidentally made public archives for, in which case, poof, the whole list is gone (after we are sufficiently sure it's really the list admin requesting it, and not a forged mail, etc).
- On request from an original poster (that we can come close to verifying, given that on the Internet everyone is a dog) that is agreed-to by the list admin/owner.
- A message is clearly, without a doubt, useless spam that got through a list's (or our) filters.
- A court orders us to remove a message.

Two other circumstances will cause a mail to be dropped: one, technical glitches (of course those hardly ever happen...). Two, when we are bulk-inserting older mails, we might filter out known spam from the old list spools before adding them.

#include "stddisclaimer.h"

Thanks, and enjoy! (Um, and send pizza.)

Hank Leininger
