List: nutch-general
Subject: [Nutch-general] Re: dedup vs. session ids
From: Hans <benedict@chemie.de>
Date: 2005-06-29 15:53:07
Message-ID: 42C2C3E3.3020605@chemie.de
Andy Liu wrote:
> URL normalization occurs during parsing. If your index isn't that
> big, it may be easier to start your crawl from scratch.
Can I re-parse without re-fetching? Or is only the parsed data stored on disk?
Can I re-fetch only some servers while keeping the data of the other servers intact?
(It's only a handful of my servers that use session ids.)
Will the old pages with badly normalized URLs get overwritten by the new ones, or
will I have to delete them manually?
Thanks for your help!
Regards,
Hans Benedict
> On 6/29/05, Hans Benedict <benedict@chemie.de> wrote:
> > Juho, thanks, that was what I was looking for.
> >
> > What I still don't understand: When is this URL-Normalization done? Or
> > more precisely: What will I have to do with my already crawled pages?
> > Reindex? Update the db? A simple dedup did not seem to do the job...
> >
> > Regards,
> >
> > Hans Benedict
> >
> > _________________________________________________________________
> > Chemie.DE Information Service GmbH Hans Benedict
> > Seydelstraße 28 mailto: benedict@chemie.de
> > 10117 Berlin, Germany Tel +49 30 204568-40
> > Fax +49 30 204568-70
> >
> > www.Chemie.DE | www.ChemieKarriere.NET
> > www.Bionity.COM | www.BioKarriere.NET
> >
> > Juho Mäkinen wrote:
> >
> > > Take a look under conf/regex-normalize.xml.
> > >
> > > I don't know how it works, but it seems to do just what you need:
> > > removing session data from GET URLs. By default it is configured to
> > > remove PHPSESSID variables, but you should easily be able to figure
> > > out how to customize it for your needs.
> > >
> > > - Juho Mäkinen, http://www.juhonkoti.net
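
[For readers finding this in the archive: the file Juho mentions holds pattern/substitution pairs that Nutch's regex URL normalizer applies to every URL. A minimal sketch of a rule that strips a PHPSESSID parameter might look like this; the pattern below is illustrative, not necessarily the default that ships with Nutch:]

```xml
<!-- Sketch of a rule for conf/regex-normalize.xml (illustrative pattern,
     not necessarily the shipped default). -->
<regex-normalize>
  <regex>
    <!-- Match a PHPSESSID=... query parameter together with the
         delimiter that precedes it, and keep only that delimiter. -->
    <pattern>([;&amp;\?])PHPSESSID=[a-zA-Z0-9]*</pattern>
    <substitution>$1</substitution>
  </regex>
</regex-normalize>
```

[A rule like this rewrites a URL such as page.php?PHPSESSID=abc123&x=1 to page.php?&x=1, so a real configuration typically adds follow-up rules to clean up leftover "?&" or trailing "&" delimiters.]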
> > >
> > > On 6/27/05, Hans Benedict <benedict@chemie.de> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am crawling some sites that use session ids. As the crawler does not
> > > > use cookies, they are put in the URL's query string. This results in
> > > > thousands of pages that are - based on the visible content - duplicates,
> > > > but are not detected as such, because the URLs contained in the HTML are
> > > > different.
> > > >
> > > > Has anybody found a solution to this problem? Is there a way to activate
> > > > cookies for the crawler?
> > > >
> > > > --
> > > > Kind regards,
> > > >
> > > > Hans Benedict
> > > >
--
Hans