List: nutch-developers
Subject: [Nutch-dev] [jira] Commented: (NUTCH-528) CrawlDbReader: add some
From: Doğacan Güney (JIRA) <jira@apache.org>
Date: 2007-07-30 11:06:52
Message-ID: 26009947.1185793612995.JavaMail.jira () brutus
[ https://issues.apache.org/jira/browse/NUTCH-528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516366 ]
Doğacan Güney commented on NUTCH-528:
-------------------------------------
This is my personal nit, but the CLI options look odd. Why not something like:
bin/nutch readdb crawl/crawldb -stats -sortByHost
or, even better (if we can also sort by other keys):
bin/nutch readdb crawl/crawldb -stats -sort host|foo|bar
The same is true for CSV:
bin/nutch readdb crawl/crawldb -dump FOLDER -format csv|normal
(with the default being 'normal', i.e. the old output format)
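To make the suggested flag style concrete, here is a minimal sketch of how such options could be parsed. This is purely illustrative (CrawlDbReader is Java and parses its arguments by hand); the flag names follow the suggestion above, and Python's argparse is used only for brevity:

```python
# Hypothetical sketch of the suggested CLI shape -- not Nutch's actual
# argument parsing. Single-dash long options work fine in argparse.
import argparse

parser = argparse.ArgumentParser(prog="readdb")
parser.add_argument("crawldb")                       # e.g. crawl/crawldb
parser.add_argument("-stats", action="store_true")   # print statistics
parser.add_argument("-sort", choices=["host"], default=None)
parser.add_argument("-dump", metavar="FOLDER")       # dump destination
parser.add_argument("-format", choices=["csv", "normal"], default="normal")

args = parser.parse_args(["crawl/crawldb", "-stats", "-sort", "host"])
print(args.sort)      # -> host
print(args.format)    # -> normal (the old output format stays the default)
```

The point of `-format csv|normal` over a bare `toCsv` token is that new output formats can be added later without inventing a new flag each time.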
Besides these issues, I think these changes are useful and I would like to see them in.
> CrawlDbReader: add some new stats + dump into a csv format
> ----------------------------------------------------------
>
> Key: NUTCH-528
> URL: https://issues.apache.org/jira/browse/NUTCH-528
> Project: Nutch
> Issue Type: Improvement
> Environment: Java 1.6, Linux 2.6
> Reporter: Emmanuel Joke
> Assignee: Emmanuel Joke
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: NUTCH-528.patch
>
>
> * I've improved the stats to list the number of URLs by status and by host.
> This option is not mandatory. For instance, if you set the sortByHost
> option, it will show: bin/nutch readdb crawl/crawldb -stats sortByHost
> CrawlDb statistics start: crawl/crawldb
> Statistics for CrawlDb: crawl/crawldb
> TOTAL urls: 36
> retry 0: 36
> min score: 0.0020
> avg score: 0.059
> max score: 1.0
> status 1 (db_unfetched): 33
> www.yahoo.com : 33
> status 2 (db_fetched): 3
> www.yahoo.com : 3
> CrawlDb statistics: done
> Of course without this option the stats are unchanged.
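The per-status, per-host breakdown shown above can be pictured as a simple grouped count. The following is a hedged sketch of that idea only (CrawlDbReader is implemented as a MapReduce job in Java; the record list and the naive host extraction here are made up for illustration):

```python
# Illustration of tallying URL counts by (status, host), as in the
# sortByHost output above. Not CrawlDbReader's implementation.
from collections import Counter

records = [
    ("http://www.yahoo.com/", "db_unfetched"),
    ("http://www.yahoo.com/help.html", "db_unfetched"),
    ("http://www.yahoo.com/index.html", "db_fetched"),
]

by_status_host = Counter()
for url, status in records:
    host = url.split("/")[2]          # naive host extraction, sketch only
    by_status_host[(status, host)] += 1

print(by_status_host[("db_unfetched", "www.yahoo.com")])   # -> 2
```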
> * I've added a new option to dump the crawldb into a CSV format. It is then
> easy to load the file into Excel and compute more complex statistics:
> bin/nutch readdb crawl/crawldb -dump FOLDER toCsv
> Extract of the file:
> Url;Status code;Status name;Fetch Time;Modified Time;Retries since fetch;Retry interval;Score;Signature;Metadata
> "http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.04151206;"null";"null"
> "http://www.yahoo.com/help.html";1;"db_unfetched";Wed Jul 25 15:08:09 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> "http://www.yahoo.com/contacts.html";1;"db_unfetched";Wed Jul 25 15:08:12 CST 2007;Thu Jan 01 08:00:00 CST 1970;0;2592000.0;30.0;0.0032467535;"null";"null"
> * I've removed some unused code (CrawlDbDumpReducer), as confirmed by Andrzej.
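The dump in the extract above is semicolon-delimited with quoted string fields, so it can be consumed by any standard CSV reader. A minimal sketch, assuming only the format visible in the extract (the sample below uses a subset of the columns for brevity):

```python
# Sketch: parse the semicolon-delimited dump format shown in the
# extract. The sample string imitates that format with fewer columns;
# nothing here is Nutch's actual code.
import csv
import io

sample = (
    'Url;Status code;Status name;Fetch Time;Score\n'
    '"http://www.yahoo.com/";1;"db_unfetched";Wed Jul 25 14:59:59 CST 2007;0.04151206\n'
)

reader = csv.DictReader(io.StringIO(sample), delimiter=";")
rows = list(reader)
print(rows[0]["Status name"])   # -> db_unfetched
```

Excel's text-import wizard handles the same delimiter directly, which is the use case the patch description mentions.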
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.