List: nutch-general
Subject: [Nutch-general] robots.txt
From: david.wojciechowski () uniklinik-freiburg ! de
Date: 2006-06-30 8:29:43
Message-ID: OF18D9D3DB.5087697B-ONC125719D.002DF7F9-C125719D.002EAB46 () uniklinik-freiburg ! de
Hi,
I use Nutch 0.7.1 to crawl a few intranet servers.
Yesterday I tried to exclude some directories with a robots.txt file,
but nothing changed: the directories were still crawled.
I copied this robots.txt to the server:
User-agent: NutchCVS
Disallow: /cgi-bin/
Disallow: /manuals/
The User-agent "NutchCVS" and the robots agent name in nutch-default
are the same.
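To rule out a problem with the file itself, the rules above can be checked offline. The following is a minimal sketch using Python's standard-library robots.txt parser, not Nutch's own parser, so it only shows that the file is well-formed and that the rules apply to the agent name "NutchCVS". The host name intranet.example and the helper is_allowed are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content quoted in this message.
ROBOTS_TXT = """\
User-agent: NutchCVS
Disallow: /cgi-bin/
Disallow: /manuals/
"""

def is_allowed(agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under ROBOTS_TXT.

    Hypothetical helper: it uses Python's parser, whose agent
    matching may differ from the matching done by Nutch.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    # intranet.example is a placeholder host, not the real server.
    for path in ("/cgi-bin/test", "/manuals/index.html", "/index.html"):
        url = "http://intranet.example" + path
        print(path, is_allowed("NutchCVS", url))
```

If the excluded paths are reported as disallowed here, the file is fine and the cause is more likely on the crawler side, for example the file not being served from the root of the host or the crawler using a different agent name than expected.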
Can anyone help me with this problem?
I'm crawling with this command:
bin/nutch crawl urls -dir crawl060621 -depth 15 &> crawl060621.log &
Greetings,
David
==========================================================
David Wojciechowski
Universitätsklinikum Freiburg
Klinikrechenzentrum
Agnesenstrasse 6-8
D-79106 Freiburg
Telefon : 0761 / 270 - 1842
Fax: 0761 / 270 - 2276
E-Mail : david.wojciechowski@uniklinik-freiburg.de
==========================================================