List:       nutch-general
Subject:    [Nutch-general] robots.txt
From:       david.wojciechowski () uniklinik-freiburg ! de
Date:       2006-06-30 8:29:43
Message-ID: OF18D9D3DB.5087697B-ONC125719D.002DF7F9-C125719D.002EAB46 () uniklinik-freiburg ! de

Hi,

I use Nutch 0.7.1 to crawl a few intranet servers.
Yesterday I tried to exclude some directories using robots.txt,
but nothing changed: the directories are still being crawled.
I copied this robots.txt to the server:

User-agent: NutchCVS
Disallow: /cgi-bin/
Disallow: /manuals/

The User-agent "NutchCVS" and the robots agent name in nutch-default
are the same.

Can anyone help me with this problem?
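
To rule out a syntax problem in the file itself, the rules can be run through an independent parser. The following is only a minimal sketch using Python's standard urllib.robotparser, not Nutch's own robots.txt handling, whose agent matching may differ. The host name intranet.example and the helper is_allowed are hypothetical and exist only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted above.
ROBOTS_TXT = """\
User-agent: NutchCVS
Disallow: /cgi-bin/
Disallow: /manuals/
"""


def is_allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under the given rules.

    Hypothetical helper: it uses Python's parser, which is an
    assumption-laden stand-in for whatever Nutch does internally.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # intranet.example is a placeholder host, not the real server.
    for path in ("/manuals/index.html", "/cgi-bin/search", "/index.html"):
        url = "http://intranet.example" + path
        print(path, is_allowed("NutchCVS", url))
```

If this check reports the two directories as blocked for "NutchCVS", the file is probably well-formed, and the cause is more likely on the crawler side, for example which agent name it actually matches against or whether it fetches robots.txt from that host at all.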

I'm crawling with this command:

bin/nutch crawl urls -dir crawl060621 -depth 15 &> crawl060621.log &

Regards,
David

==========================================================

David Wojciechowski
Universitätsklinikum Freiburg
Klinikrechenzentrum
Agnesenstrasse 6-8
D-79106 Freiburg

Telefon: 0761 / 270 - 1842
Fax:     0761 / 270 - 2276
E-Mail:  david.wojciechowski@uniklinik-freiburg.de

==========================================================