
List:       mozilla-rdf
Subject:    Re: Splitting Open Dir files
From:       Danny Ayers <Danny.Ayers@highpeak.ac.uk>
Date:       1999-08-18 9:38:19

Thanks.
I'm not used to Perl, so I expect to have fun with this! I would appreciate a
line or two on usage, to save me a bit of head scratching.
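(For my own notes, a minimal usage sketch, inferred from the argument checks in Dan's script below; the data file name is an assumption. And splitting the raw dump into ~1MB pieces, as I originally wanted, is just `split`:)

```shell
# Hypothetical invocation -- the script takes <datafile> <dbname>
# per its own usage message; "content.rdf.u8" is an assumed filename:
#   perl dmoz2rdf.pl content.rdf.u8 ./dmoz-db

# Splitting the big dump into ~1MB pieces first (GNU/BSD split):
printf 'some dmoz data\n' > dmoz-sample.txt   # stand-in for the real dump
split -b 1m dmoz-sample.txt dmoz-part-        # pieces named dmoz-part-aa, ...
ls dmoz-part-*
```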

Re. the pornfilter: very good idea. I've had a browse through the data and
there does seem to be loads of it. OK, once all systems (whatever they may be)
are up and running 100%, it would probably be acceptable to include this stuff,
but having development bogged down by such junk just doesn't make sense...

Cheers,
Danny.

Dan Brickley wrote:

> On Tue, 17 Aug 1999, Danny Ayers wrote:
>
> > Hi,
> > I'm wanting to experiment with the Open Directory RDF files, but the
> > fact that they're so HUGE creates a bit of a problem. What I would like
> > is to split these into, say, 1MB chunks and work on them. I suppose I
> > stand a better chance on a Linux box than on my usual Win machine, but
> > I'm not sure what utilities are available for this kind of stuff.
> > Any suggestions (Win or Lin) or tips?
>
> Attached is a simple and nasty hack of a Perl script I've been using that
> takes the DMoz data chunk by chunk, separating on whitespace lines. In
> other words, we forget that it is XML and RDF (or that it nearly is) and
> just process it based on observable patterns in the data, like blank lines
> separating bite-size chunks.
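[The blank-line chunking described here is Perl's paragraph mode (`$/ = ''`); for anyone like me who is more at home outside Perl, a rough Python sketch of the same idea, with a made-up two-record sample:]

```python
import re

def chunks(text):
    """Yield blank-line-separated records, like Perl's $/ = '' paragraph mode."""
    for block in re.split(r"\n\s*\n", text.strip()):
        if block:
            yield block

# Tiny made-up sample with two blank-line-separated records:
sample = '<Topic r:id="Top/Arts">\n</Topic>\n\n<Topic r:id="Top/Science">\n</Topic>\n'
print(sum(1 for _ in chunks(sample)))   # prints 2
```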
>
> The script actually asserts the facts straight into an on-disk RDF
> triple store; at least it did until the triplestore keeled over under the
> sheer size of the dataset. More on which separately. You could replace
> this code with whatever your application hopes to achieve.
>
> I've copied the perl script below in case the attachment doesn't make it
> through the mail-to-news gateway. It's gross, as I say, but seemed to
> work most of the time.
>
> Dan
>
> ps. please no flames about the 'pornfilter' line in the script. I swear
> the data files would be vastly smaller if all the adult stuff was in a
> different downloadable...
>
> #!/usr/bin/perl
> use lib "../rudolf-perl";
> #use strict;
> use RDF::API::Database;
>
> $|=1;   # unbuffered output
> $/='';  # paragraph mode: read blank-line-separated chunks
> my $datafile=shift || die "Usage: $0 <datafile> <dbname>";
> my $dbname=shift || die "Usage: $0 <datafile> <dbname>";
>
> #################################
> # initialise an RDF database
> my $dbobj = RDF::API::Database->new(); # connect to database
> $dbobj->set('module'=>'RDF::RDFdb'); # set backend module to use
> $dbobj->set('rw'=>'1'); # rw
> $dbobj->set('datastore'=>$dbname); # where is data directory
>
> $dbobj->open(); # open database
>
> $dbobj->index(); # build an initial index
>
> open(DMOZ,"$datafile") || die "No datafile $datafile";
>
> ################# Main processing loop
>
> my $i=0;
> while (my $line = <DMOZ>) {
>         $i++;
>         if ($i % 100 == 0) {    # flush to disk every 100 chunks
>                 $dbobj->close();
>                 $dbobj->open();
>                 print "Chunk $i flushing to disk!\n";
>                 }
> #       print "Got chunk\n";
>         $line =~ s/\n//g;       # flatten the chunk to a single line
>         while ($line =~ m#<Topic r:id="(.*)">(.*)</Topic>#g) {
> #               print "Matched Topic 1:$1 2:$2\n";
>                 processTopic($1,$2);
>                 }
>         while ($line =~ m#<ExternalPage about="(.*)">(.*)</ExternalPage>#g) {
> #               print "Matched ExternalPage\n";
>                 processExternalPage($1,$2);
>                 }
>         } # Done with each chunk of the datafile
> #################
>
> ## clean up fake predicate URIs (slightly)
> sub tidyPredicate{
>         my $p = shift;
>         return($p) if ($p =~ m/:/);     # already namespace-qualified
>         return('moz:'.$p);
> }
>
> ################# Deal with contents of Topic element
> sub processTopic {
>         my ($id,$data)=@_;
> #       if ($id =~ m/Adult/) { return();} #PORNFILTER#
>
>         my ($catid) = $data =~ m#<catid>([^<]+)</catid>#;
>         my ($title) = $data =~ m#<d:Title>([^<]+)</d:Title>#;
>
>         print "Topic: $id catid: $catid Title: $title\n";
>
>         $dbobj->assert('r:type',$id,'moz:Topic'); # Stash this on disk
>         $dbobj->assert('catid',$id,"\"$catid\"");
>
>         # Store various RDF properties
>         while ($data =~ m#<([^\s]+)\s+r:resource="([^"]+)"/>#g) {
>                 my $predicate=tidyPredicate($1);
>                 $dbobj->assert($predicate,$id,$2);
>                 print "Asserting $id -$predicate-> $2\n";
>                 }
> }# end Topic handler
>
> ################# Deal with contents of ExternalPage element
> sub processExternalPage {
>         my ($id,$data)=@_; # ID is a URI here
>         # should check to see if we have a topic entry linking
>         # to this URI; if not we've filtered out that Topic
>
>         print "Page: URI= $id\n";
>
>         $dbobj->assert('r:type',$id,'moz:ExternalPage'); # Stash this on disk
>
>         # Store various RDF properties
>         while ($data =~ m#<([^\s]+)\s+r:resource="([^"]+)"/>#g) {
>                 my $predicate=tidyPredicate($1);
>                 $dbobj->assert($predicate,$id,$2);
>                 print "Asserting $id -$predicate-> $2\n";
>                 }
>         ## ditto for literal (String) valued properties
>         while ($data =~ m#<([^>]+)>([^<]+)</[^>]+>#g) {
>                 my $predicate=tidyPredicate($1);
>                 $dbobj->assert($predicate, $id, '"'.$2.'"');
>                 print "Asserting $id -$predicate-> '$2'\n";
>                 }
> }# end ExternalPage handler
>
> $dbobj->index(); # one final index pass
>
> $dbobj->close(); # close database
>
>   ------------------------------------------------------------------------
>                      Name: dmoz2rdf.pl
>    dmoz2rdf.pl       Type: Plain Text (TEXT/PLAIN)
>                  Encoding: BASE64
>               Description: cheesy Dmoz-processing perl script
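[For a usage illustration of the scraping approach, here is a rough Python equivalent of the script's Topic handler; the record and URL are made up, and this is a sketch of the pattern-matching idea, not a port of the RDF::API::Database storage:]

```python
import re

def topic_triples(record):
    """Pattern-match one blank-line-delimited <Topic> record into
    (predicate, subject, object) triples, mirroring processTopic
    in the Perl script (no real XML parsing)."""
    flat = record.replace("\n", "")               # same as $line =~ s/\n//g
    m = re.search(r'<Topic r:id="(.*)">(.*)</Topic>', flat)
    if not m:
        return []
    topic_id, body = m.groups()
    triples = [("r:type", topic_id, "moz:Topic")]
    for pred, res in re.findall(r'<(\S+)\s+r:resource="([^"]+)"/>', body):
        if ":" not in pred:                       # tidyPredicate equivalent
            pred = "moz:" + pred
        triples.append((pred, topic_id, res))
    return triples

# Made-up record for illustration:
rec = '<Topic r:id="Top/Arts">\n<link r:resource="http://example.org/"/>\n</Topic>'
for t in topic_triples(rec):
    print(t)
```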

