List: nutch-general
Subject: [Nutch-general] Re: dedup vs. session ids
From: Hans <benedict@chemie.de>
Date: 2005-06-29 15:53:07
Message-ID: 42C2C3E3.3020605@chemie.de
Andy Liu wrote:
> URL normalization occurs during parsing. If your index isn't that
> big, it may be easier to start your crawl from scratch.
Can I re-parse without re-fetching? Or is only the parsed data stored on disk?
Can I re-fetch only some servers while keeping the data of the other servers intact?
(It's only a handful of my servers that use session ids.)
Will the old pages with badly normalized URLs get overwritten by the new ones, or
will I have to delete them manually?
Thanks for your help!
Regards,
Hans Benedict
> On 6/29/05, Hans Benedict <benedict@chemie.de> wrote:
> > Juho, thanks, that was what I was looking for.
> >
> > What I still don't understand: When is this URL-Normalization done? Or
> > more precisely: What will I have to do with my already crawled pages?
> > Reindex? Update the db? A simple dedup did not seem to do the job...
> >
> > Regards,
> >
> > Hans Benedict
> >
> > _________________________________________________________________
> > Chemie.DE Information Service GmbH Hans Benedict
> > Seydelstraße 28 mailto: benedict@chemie.de
> > 10117 Berlin, Germany Tel +49 30 204568-40
> > Fax +49 30 204568-70
> >
> > www.Chemie.DE | www.ChemieKarriere.NET
> > www.Bionity.COM | www.BioKarriere.NET
> >
> > Juho Mäkinen wrote:
> >
> > > Take a look under conf/regex-normalize.xml.
> > >
> > > I don't know how it works, but it seems to do just what you need:
> > > removing session data from GET URLs. By default it is configured to
> > > remove PHPSESSID variables, but you should easily be able to figure
> > > out how to customize it for your needs.
> > >
> > > - Juho Mäkinen, http://www.juhonkoti.net
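
[For readers finding this in the archive: the file Juho mentions holds pattern/substitution pairs that Nutch's regex URL normalizer applies to every URL. A minimal sketch of a rule that strips a PHPSESSID parameter might look like this; the pattern below is illustrative, not necessarily the default that ships with Nutch:]

```xml
<!-- Sketch of a rule for conf/regex-normalize.xml (illustrative pattern,
     not necessarily the shipped default). -->
<regex-normalize>
  <regex>
    <!-- Match a PHPSESSID=... query parameter together with the
         delimiter that precedes it, and keep only that delimiter. -->
    <pattern>([;&amp;\?])PHPSESSID=[a-zA-Z0-9]*</pattern>
    <substitution>$1</substitution>
  </regex>
</regex-normalize>
```

[A rule like this rewrites a URL such as page.php?PHPSESSID=abc123&x=1 to page.php?&x=1, so a real configuration typically adds follow-up rules to clean up leftover "?&" or trailing "&" delimiters.]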
> > >
> > > On 6/27/05, Hans Benedict <benedict@chemie.de> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am crawling some sites that use session ids. As the crawler does not
> > > > use cookies, they are put in the URL's query string. This results in
> > > > thousands of pages that are - based on the visible content - duplicates,
> > > > but are not detected as such, because the URLs contained in the HTML are
> > > > different.
> > > >
> > > > Has anybody found a solution to this problem? Is there a way to activate
> > > > cookies for the crawler?
> > > >
> > > > --
> > > > Kind regards,
> > > >
> > > > Hans Benedict
> > > >
--
Hans