
List:       nutch-developers
Subject:    [Nutch-dev] [jira] Commented: (NUTCH-528) CrawlDbReader: add some
From:       Doğacan Güney (JIRA) <jira@apache.org>
Date:       2007-07-30 11:06:52
Message-ID: 26009947.1185793612995.JavaMail.jira@brutus


    [ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516366 ]

Doğacan Güney commented on NUTCH-528:
-------------------------------------

This is a personal nit, but the CLI options look awkward. Why not something like:

bin/nutch readdb crawl/crawldb -stats -sortByHost 

or, even better (if we can also sort by other fields):

bin/nutch readdb crawl/crawldb -stats -sort host|foo|bar
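If "-sort" took a value as suggested, the parsing might look roughly like this. This is a hypothetical sketch using the option names from the suggestion above, not actual Nutch code; the field list is illustrative only:

```java
import java.util.Arrays;
import java.util.List;

public class ReadDbOptions {
  // Illustrative set of sortable fields; "host" is the one suggested above.
  static final List<String> SORT_FIELDS = Arrays.asList("host", "status", "score");

  /** Returns the field following "-sort", or null if the flag is absent. */
  static String parseSortField(String[] args) {
    for (int i = 0; i < args.length - 1; i++) {
      if ("-sort".equals(args[i])) {
        String field = args[i + 1];
        if (!SORT_FIELDS.contains(field)) {
          throw new IllegalArgumentException("Unknown sort field: " + field);
        }
        return field;
      }
    }
    return null; // no -sort given: keep the default (unsorted) stats
  }
}
```

The advantage over a bare "-sortByHost" flag is that new sort keys become new values rather than new flags.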

Same is true for CSV:

bin/nutch readdb crawl/crawldb -dump FOLDER -format csv|normal (with the default being 'normal', i.e. the old output format)

Besides these issues, I think these changes are useful and I would like to see them in.
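For the CSV dump quoted in the description below, the per-record serialization could be sketched as follows. This is a hedged illustration only: the field order mirrors the sample output, but the helper and its parameters are hypothetical and are not taken from the attached patch:

```java
public class CsvRow {
  // Quote a string field; absent values are rendered as "null",
  // as in the sample dump below.
  static String quote(String s) {
    return "\"" + (s == null ? "null" : s) + "\"";
  }

  /** Joins the per-URL fields with ';', following the sample dump layout. */
  static String toCsv(String url, int status, String statusName,
                      String fetchTime, String modifiedTime,
                      int retries, float interval, float score) {
    return quote(url) + ";" + status + ";" + quote(statusName) + ";"
        + fetchTime + ";" + modifiedTime + ";" + retries + ";"
        + interval + ";" + score + ";" + quote(null) + ";" + quote(null);
  }
}
```

Note that a semicolon separator (rather than a comma) matches the sample output and imports cleanly into Excel in many European locales.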

> CrawlDbReader: add some new stats + dump into a csv format
> ----------------------------------------------------------
> 
> Key: NUTCH-528
> URL: https://issues.apache.org/jira/browse/NUTCH-528
> Project: Nutch
> Issue Type: Improvement
> Environment: Java 1.6, Linux 2.6
> Reporter: Emmanuel Joke
> Assignee: Emmanuel Joke
> Priority: Minor
> Fix For: 1.0.0
> 
> Attachments: NUTCH-528.patch
> 
> 
> * I've improved the stats to list the number of URLs by status and by host. This option is not mandatory. For instance, if you set the sortByHost option (bin/nutch readdb crawl/crawldb -stats sortByHost), it will show:
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls:	36
> retry 0:	36
> min score:	0.0020
> avg score:	0.059
> max score:	1.0
> status 1 (db_unfetched):	33
> www.yahoo.com :	33
> status 2 (db_fetched):	3
> www.yahoo.com :	3
> CrawlDb statistics: done
> Of course without this option the stats are unchanged.
> * I've added a new option to dump the crawldb into a CSV format. It will then be easy to import the file into Excel and compute some more complex statistics: bin/nutch readdb crawl/crawldb -dump FOLDER toCsv. Extract of the file:
> Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
> "http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
> "http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> "http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> * I've removed some unused code (CrawlDbDumpReducer), as confirmed by Andrzej.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers


