List: quanta
Subject: Re: [Quanta] crashes when VPL is selected (3.4.0)
From: Paulo Moura Guedes <moura () kdewebdev ! org>
Date: 2005-03-21 18:44:22
Message-ID: 200503211844.22223.moura () kdewebdev ! org
Well, I can't reproduce the crash with Quanta and kdelibs from CVS and the
rest of KDE from the 3.3 series. It's a little difficult for me to address this
problem...
Can someone test the attached file on VPL?
On Monday 21 March 2005 18:34, Gour wrote:
> Paulo Moura Guedes (moura@kdewebdev.org) wrote:
> > On every file?
>
> On practically every one on the site I'm working at, except the most
> simple page.
>
> > If not, send me the one please.
>
> In the attachment here is one - not from the site.
>
> Sincerely,
> Gour
--
Paulo Moura Guedes
Linux Caixa Mágica - http://caixamagica.org
KDE Web Development - http://kdewebdev.org
["htdig.html" (text/html)]
<html><head><title>Installing and configuring the ht://Dig search engine</title>
<link rel=stylesheet type="text/css" href="../scrounge.css">
</head>
<body bgcolor="#ffffff">
<center>
<table border=0 width=60% cellspacing=0 cellpadding=0>
<tr><td valign="center" bgcolor="#cc3300">
<img src="../scrounge3r.gif" alt="scrounge.org"><br>
</td></tr></table>
<h2>Installing and Configuring the ht://Dig Search Engine</h2></center>
<p>ht://Dig is an excellent search engine to install on your web server. <a \
href="../search.html">Try it out!</a> See the <a \
href="http://www.htdig.org/require.html">Features and Requirements</a> page for more \
information. Check the <a href="http://www.htdig.org/">ht://Dig home page</a> for \
the latest news and updates. I'm going to cover some additional installation and \
configuration hints.
<h3>Getting it going</h3>
<ul>
<li><b><a href="#quickstart">Quick Start (for the intrepid)</a></b>
<p><li><b><a href="#longform">Installation (Long form)</a></b>
<ul>
<li><a href="#installrpm">Installing the RPM</a>
<li><a href="#installtarball">Installing the tarball</a>
<ul>
<li><a href="#configapache">Configuring Apache (tarball only)</a>
</ul>
<li><a href="#whereis">Where everything is</a>
<li><a href="#configconf">Configuring the htdig.conf file</a>
<li><a href="#digging">Generating the search index</a>
<li><a href="#searching">Doing a search. Finally.</a>
<li><a href="#troubleshooting">Troubleshooting</a>
</ul>
</ul>
<h3>Tips and Techniques</h3>
<ul>
<li><a href="#customizesearch">Customizing the search results</a>
<li><a href="#date">Making the date display all four digits of the year in search \
results</a> <li><a href="#rundig2">An alternate rundig script</a>
<li><a href="#pdf">Indexing PDF files</a>
<li><a href="#doc">Indexing Microsoft Word files</a>
<li><a href="#logging">Logging search requests</a>
</ul>
<p>Please report any <a href="mailto:wayne@scrounge.org">errors or omissions to \
me</a>. Suggestions are welcome too. Thank you.
<p><hr>
<h1>Getting it going</h1>
<a name="quickstart"></a>
<h3>Quick Start (for the intrepid)</h3>
<p>If you are using Red Hat or Mandrake Linux and you are reasonably familiar with \
using Apache, you might get by with these Quick Start instructions. \
Otherwise, use the <a href="#longform">complete instructions</a>.
<ul>
<li>(As root) get and install the RPM. (Full information <a \
href="#installrpm">here</a>. Note the vixie-cron issue for Red Hat 5.0-5.1.)
<p><li>Edit <tt>/etc/htdig/htdig.conf</tt> and check to see that <tt><a \
href="http://www.htdig.org/attrs.html#start_url">start_url:</a></tt> correctly points \
to what you want to index on your server. Watch out because the RPM installer adds a \
<em>second</em> <tt>start_url:</tt> definition at the end of the file.
<p><li>Type <tt>rundig -v</tt> to create the search index database. You \
<em>should</em> see indications that it is indexing each file. If not and it appears \
to be "hanging," abort with Ctrl-C and check your configuration.
<p><li>You should now be able to search by accessing <tt>search.html</tt>, which is \
installed in <tt>/home/httpd/html</tt>. <a \
href="http://www.htdig.org/hts_method.html">How searching works</a>.
<p><li>It worked? Good. Now look through the rest of this document to learn more \
about configuring ht://Dig. If it <em>didn't</em> work, then well, the same advice \
applies: look through the rest of this document.
</ul>
<p>Note that the RPM installer created a cron job in <tt>/etc/cron.daily</tt> that \
will run <tt>/usr/sbin/rundig</tt> once a day so that the search index will \
automatically be updated once a day.
<p>But you still should look over the rest of this documentation.
<p>
<a name="longform"></a>
<h3>Installation (Long form)</h3>
<p>Before you start, you should look over the <a \
href="http://www.htdig.org/require.html">Features and Requirements</a> page. Ht://Dig \
is available in source "tarball" and Red Hat style RPM distributions. The RPM \
distribution is much easier to install, but the tarball gives you more flexibility in \
specifying the locations where everything will be installed. Your choice. This \
document is going to cover installing both the htdig 3.1.5.tar.gz "tarball" and the \
RPM file. The <a href="http://www.htdig.org/where.html">Where to get it</a> page is \
the best place to get the most recent version of ht://Dig.
<!-- <p>The ht://Dig installation instructions are excellent. Follow them after \
reading my comments. -->
<a name="installrpm"></a>
<h3>Installing the RPM</h3>
<p>Mandrake 7.2 has ht://Dig on the install CD and it might already be installed on your \
system. Red Hat 7.0 has it on the "Power Tools" CD. You can get other RPM \
distributions <a href="http://www.htdig.org/files/binaries/">from here</a>. (Or <a \
href="http://www.scrc.umanitoba.ca/htdig/rpms/">from here</a>.) Download <em>one</em> \
of these: <ul>
<p>htdig-3.1.5-0.i386.rpm (Red Hat 4.2)<br>
htdig-3.1.5-0glibc.i386.rpm (Red Hat 5.x) *<br>
htdig-3.1.5-0glibc21.i386.rpm (for glibc-2.1, Red Hat 6.0, 7.0**)
</ul>
<p> Put it somewhere on your Linux machine and (as root) type \
<tt>rpm -Uvh htdig*.rpm</tt>. Bang, it's installed. Now skip to <a \
href="#whereis">Where everything is</a>.
<span class="smalltext">
<blockquote><small>
<p class="smalltext">* There is a bug with vixie-cron for Red Hat 5.0 and 5.1. The \
ht://Dig team recommends upgrading to a newer version of vixie-cron. Look for \
vixie-cron-3.0.1-37.5.2.i386.rpm. This affects you, because the RPM installer \
installs <b>rundig</b> as an <b>/etc/cron.daily</b> job. Get the updated vixie-cron \
<a href="http://www.scrc.umanitoba.ca/htdig/rpms/">from here</a>.
<p class="smalltext">** If you are using Red Hat 7.0 and don't have the Power Tools \
CD, then you can use htdig-3.1.5-0glibc21.i386.rpm, but it needs some additional work \
to get it going. You must first install compat-libstdc++-6.2-2.9.0.9.i386.rpm from \
the first Red Hat 7.0 install CD. The default HTML directory in previous versions of \
Red Hat was /home/httpd/html. It is now /var/www/html. \
htdig-3.1.5-0glibc21.i386.rpm installs several things in /home/httpd/html. These \
need to be moved to /var/www/html.
<p class="smalltext">Move search.html and the htdig directory to /var/www/html. You \
must also move /home/httpd/cgi-bin/htsearch to /var/www/cgi-bin/htsearch. The \
'local_urls' variable in /etc/htdig/htdig.conf needs to be modified because it refers \
to /home/httpd/html.
</small></blockquote></span>
<a name="installtarball"></a>
<h3>Installing the tarball</h3>
<p>For the tarball, you should decide where you want ht://Dig to install its \
programs. <!-- I went with the default. You might want to change this. --> You \
must decide this before you install it, because you can't move it after you have it \
installed. (Except by deleting the entire installation and re-installing from \
scratch.) The default is to install in the <tt>/opt/www</tt> directory. The \
assorted ht://Dig binaries and configuration files will be located in this directory \
tree. You must configure your Web server to execute the ht://Dig CGI programs from \
here. If this is not acceptable, then change these locations during the installation \
procedure.
<p>OK, now follow the <a href="http://www.htdig.org/install.html">ht://Dig \
installation instructions</a>. (You probably should open them in a new window so \
that you can refer to this page.) When you get to the <b>Configure</b> step, you \
have the opportunity to edit the <TT>CONFIGURE</TT> script that defines where \
everything will get installed. If you want to go with the default location, then \
just continue on through the procedure.
<a name="configapache"></a>
<h3>Configuring Apache (tarball only)</h3>
<blockquote>
<p class="smalltext">The RPM installation should need no Apache configuration \
changes, because everything goes in "standard" locations. Assuming that your \
installation uses the standard locations.... </blockquote>
<p>Assuming that you installed ht://Dig in the default <tt>/opt/www</tt> directory, \
here are the configuration changes that you should add to your Apache configuration \
file(s).
<p><table border="1" cellspacing=0 cellpadding=8>
<tr valign="top"><td>
<tt>Alias /htdig/ /opt/www/htdocs/htdig/</tt></td><td>
So that you can "point" to assorted graphic files. e.g.,<br>
<tt>&lt;img src="/htdig/htdig.gif"&gt;</tt> Also, the default <tt>search.html</tt> \
file is located here.
<p>It is a really good idea to keep the <tt>/htdig/</tt> definition, because the \
template files that are used to display the search results all refer to \
<tt>htdig/</tt> to locate files. </td></tr>
<tr valign="top"><td>
<tt>ScriptAlias /htdig-cgi/ /opt/www/cgi-bin/</tt></td><td>
This is how you access the htsearch program for searching, e.g.,<br>
<tt>&lt;form method="post" action="/htdig-cgi/htsearch"&gt;</tt>
</td></tr>
<tr valign="top"><td>
<pre>
&lt;Directory /opt/www/cgi-bin/&gt;
AllowOverride None
Options ExecCGI
&lt;/Directory&gt;
</pre></td><td>
So that Apache will allow access to the ht://Dig cgi-bin directory.
</td></tr>
</table>
<p>After editing your Apache configuration files, type \
<tt>/etc/rc.d/init.d/httpd restart</tt> to restart Apache.
<a name="whereis"></a>
<h3>Where everything is</h3>
<table border="1" cellspacing=0 cellpadding=7>
<tr valign="top"><th>Name</th><th>RPM locations</th><th>Tarball (Default \
locations)</th><th>Used for</th></tr>
<a name="configdir"></a>
<tr valign="top"><td><a \
href="http://www.htdig.org/config.html#htdig.conf">${CONFIG_DIR}</a></td><td>/etc/htdig</td><td>/opt/www/htdig/conf</td><td>htdig.conf \
configuration file</td></tr>
<a name="commondir"></a>
<tr valign="top"><td>${COMMON_DIR}</td><td>/var/lib/htdig/common</td><td>/opt/www/htdig/common</td><td>Template \
files used for search results</td></tr>
<a name="bindir"></a>
<tr valign="top"><td>${BIN_DIR}</td><td>/usr/sbin</td><td>/opt/www/htdig/bin</td><td>rundig \
and other "digging" binaries</td></tr>
<a name="databasedir"></a>
<tr valign="top"><td>${DATABASE_DIR}</td><td>/var/lib/htdig/db</td><td>/opt/www/htdig/db</td><td>The \
search index database files.</td></tr>
<a name="cgibindir"></a>
<tr valign="top"><td>${CGIBIN_DIR}</td><td>/home/httpd/cgi-bin</td><td>/opt/www/cgi-bin</td><td>htsearch</td></tr>
<a name="imagdir"></a>
<tr valign="top"><td>${IMAGE_DIR}</td><td>/home/httpd/html/htdig</td><td>/opt/www/htdocs/htdig</td><td>htdig.gif, \
and other graphic files</td></tr>
<a name="searchdir"></a>
<tr valign="top"><td><a \
href="http://www.htdig.org/config.html#search.html">${SEARCH_DIR}</a></td><td>/home/httpd/html</td><td>/opt/www/htdocs/htdig</td><td>search.html \
sample search form</td></tr>
<!--
<tr valign="top"><td></td><td></td><td></td><td></td></tr>
<tr valign="top"><td></td><td></td><td></td><td></td></tr>
<tr valign="top"><td></td><td></td><td></td><td></td></tr>
<tr valign="top"><td></td><td></td><td></td><td></td></tr>
-->
<tr valign="bottom"><th>Name</th><th>RPM locations</th><th>Tarball (Default \
locations)</th><th>Used for</th></tr> </table>
<a name="configconf"></a>
<h3>Configuring the htdig.conf file</h3>
<blockquote>
<p class="smalltext"><b>Important note for RPM users:</b> The RPM installation \
program attempts to configure ht://Dig so that it will work "out of the box." They \
installed the various files in "standard" Red Hat locations. One thing that is never \
standard, however, is the name of your machine. The ht://Dig RPM installer attempts \
to glean this information from your existing configuration files and <em>appends new \
definitions at the <b>end</b> of the htdig.conf file</em>, in addition to the "stock" \
definitions that are scattered throughout the htdig.conf file. This includes the all \
important <tt>start_url:</tt> variable. Variable definitions at the end of the file \
override earlier definitions. Bear this in mind as you are scrolling through \
htdig.conf. </blockquote>
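<p>A minimal sketch, in Python, of the "last definition wins" rule described in the note above. It is not part of ht://Dig: the function name <tt>effective_settings</tt> and the sample text are hypothetical, and the parsing is deliberately simplified (it only understands <tt>name: value</tt> lines, <tt>#</tt> comment lines, and backslash continuations). It may help you see which <tt>start_url:</tt> is actually in effect.

```python
def effective_settings(conf_text):
    """Return {attribute: value}, keeping the LAST definition of each attribute.

    Simplified on purpose: handles 'name: value' lines, '#' comment lines,
    and a trailing backslash that continues a value on the next line.
    """
    settings = {}
    pending = ""
    for raw in conf_text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if line.endswith("\\"):
            # Join this line with the next one.
            pending = line[:-1].rstrip() + " "
            continue
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition(":")
        if sep:
            # A later definition silently replaces an earlier one.
            settings[name.strip()] = value.strip()
    return settings


# Hypothetical excerpt: a stock definition followed by an appended one.
SAMPLE = """\
start_url: http://www.htdig.org/
# ... many other attributes ...
start_url: http://192.168.1.1/
"""

print(effective_settings(SAMPLE)["start_url"])  # http://192.168.1.1/
```

<p>To try it on a real file, read the contents of your own <tt>htdig.conf</tt> into a string and pass that string to the function.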
<p>Edit <tt><a href="#configdir">${CONFIG_DIR}</a>/htdig.conf</tt>. Scroll down and \
find the <tt><a href="http://www.htdig.org/attrs.html#start_url">start_url:</a></tt> \
line. This line defines what ht://Dig will index for searching. The default is to \
index the http://www.htdig.org/ site. This is not a good site to test with, because \
it takes a <em>long</em> time to index. Change this to point to a "site" on your own \
machine. For speed, change the URL to use your machine's IP address, rather than the \
full domain name. For example, if your machine is addressed as 192.168.1.1, then set \
<tt>start_url:</tt> to be <tt>http://192.168.1.1/</tt>
<p><b>start_url: must specify the site exactly as it is accessed through your web \
server. </b>
<p>This is because ht://Dig works like a web crawler and accesses your HTML pages the \
same way as a web browser does. So use a browser to access the site on your own \
machine, then use that same URL in <tt>start_url:</tt>.
<blockquote><small>
<p class="smalltext">Using the IP address to refer to the site is a shortcut for \
testing. This IP address will be returned in the search results, so 192.168.1.1, for \
example, isn't what you would use when you release the search form to the public. In \
this case, you either have to set start_url: to the actual domain that the site uses, \
or (preferably) use <em>two</em> configuration files (one for digging and another for \
searching) and use the <b><a \
href="http://www.htdig.org/attrs.html#url_part_aliases">url_part_aliases</a></b> \
directive to translate from a local IP address to the real domain. This is more \
complicated than what you should be doing until you have it working and are familiar \
with the basic operations.
<p class="smalltext">For an additional speed boost, check out the <b><a \
href="http://www.htdig.org/attrs.html#local_urls">local_urls:</a></b> directive that \
lets ht://Dig access the files through the local filesystem, rather than having to go \
through the web server. But, again, wait until you have ht://Dig working and are \
reasonably familiar with how everything works before you try using this. \
</small></blockquote>
<p>You should create a <tt>robots.txt</tt> file in the server's root directory to \
specify what you do <em>not</em> want ht://Dig (or any other search engine!) to \
index. Here is a sample <tt>robots.txt</tt> file:
<blockquote><pre>
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
</pre>
<p>From <a href="http://info.webcrawler.com/mak/projects/robots/norobots.html">A \
Standard for Robot Exclusion</a>. </blockquote>
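<p>One way to check a <tt>robots.txt</tt> file before you index is to feed it to Python's standard <tt>urllib.robotparser</tt> module. This is only a sketch: the helper name <tt>allowed</tt> and the user-agent string <tt>htdig</tt> are assumptions made for illustration (the sample rules apply to every agent because of <tt>User-agent: *</tt>), and Python's parser is not guaranteed to interpret every rule exactly as ht://Dig does.

```python
from urllib.robotparser import RobotFileParser

# The sample robots.txt from above, unchanged.
ROBOTS_TXT = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
"""


def allowed(path, agent="htdig"):
    """Return True if `agent` may fetch `path` according to ROBOTS_TXT.

    The agent name is an assumption for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "http://www.example.com" + path)


for path in ("/index.html", "/tmp/scratch.html", "/foo.html"):
    print(path, allowed(path))
```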
<p><a href="http://www.htdig.org/confindex.html">Reference for all configuration file \
directives</a>
<a name="digging"></a>
<h3>Generating the search index</h3>
<p>Before you can search you must generate the search index database. Change to \
<tt><a href="#bindir">${BIN_DIR}</a></tt>. Use the <tt>rundig</tt> script to run the \
ht://Dig programs to index your site. Type <tt>./rundig -v</tt>. Rundig will run the \
<tt>htdig</tt> "digging" (indexing) and <tt>htmerge</tt> (second step of creating the \
search index) programs. The <tt>-v</tt> option tells them to be verbose, meaning \
that you should see each file as it is indexed, followed by indications of the \
merging activity.
<p>This <em>should</em> complete in a reasonable length of time (depending on the \
size of your site.) If you see prolonged periods of inactivity, then press Ctrl-C to \
abort the programs and check <tt>start_url:</tt> in the <tt><a \
href="#configdir">${CONFIG_DIR}</a>/htdig.conf</tt> configuration file. If indexing \
is taking too long for testing, consider changing <tt>start_url:</tt> to only index a \
subset of your site until you are done wrestling with the configuration file.
<p>Note that you must update the index whenever the site is updated. If your site is \
large and indexing is time consuming, then you might want to do the indexing in a <a \
href="cron.html">cron</a> job that is run in the middle of the night.
<blockquote><small>
<p class="smalltext">RPM users should know that the RPM installer creates an \
<b>/etc/cron.daily</b> job that will automatically run <b>rundig</b> once a day. \
This may be all that you need.
<p class="smalltext">When you get the configuration file squared away, then use \
<b>./rundig -s</b> for a considerably shorter display. Alternatively, if \
something is giving you problems then try using <b>./rundig -vvv</b> for an \
<em>extremely</em> detailed and verbose display. In this case, you would probably \
want to redirect the output to a file. \
<b>./rundig -vvv > debug.txt</b> Then load <b>debug.txt</b> in an \
editor.
<p class="smalltext">Right now the only way you have to generate the index is by \
running the <b>rundig</b> (or <b><a href="#rundig2">rundig2</a></b>) script, which \
can be limiting because it generates the whole index from scratch each time that it \
is run. This has two undesirable side effects: (1) it takes time and machine \
resources, and (2) searching returns no results while the <b>rundig</b> script is \
running.
<p class="smalltext">There are other ways to do the search index database updating to \
sidestep these issues. You should examine the command line options for the indexing \
programs so that you can develop an indexing procedure that best suits your site's \
needs.
</small></blockquote>
<p>More information on the <a href="http://www.htdig.org/htdig.html">htdig</a>,
<a href="http://www.htdig.org/htmerge.html">htmerge</a>,
<a href="http://www.htdig.org/htnotify.html">htnotify</a>, and
<a href="http://www.htdig.org/htfuzzy.html">htfuzzy</a> programs that are used to \
generate the search index database.
<a name="searching"></a>
<h3>Doing a search. Finally.</h3>
<p>Look at <tt><a href="#searchdir">${SEARCH_DIR}</a>/search.html</tt>. This is your \
sample <a href="http://www.htdig.org/hts_form.html">search form</a>.
<blockquote><small>
<p class="smalltext">For the tarball installation, you probably have to change one \
line, because we defined the CGI directory to be <b>htdig-cgi</b> in the Apache \
configuration file. So change </small>
<pre>&lt;form method="post" action="/cgi-bin/htsearch"&gt;
</pre>
<small>
<p class="smalltext">to
</small>
<pre>&lt;form method="post" action="/htdig-cgi/htsearch"&gt;
</pre>
<small>
<p class="smalltext">and save the file.
</small></blockquote>
<p>Now use a browser to access this search form. If the IP address of your server is \
192.168.1.1, then enter either <tt>192.168.1.1/htdig/search.html</tt> (tarball) or \
<tt>192.168.1.1/search.html</tt> (RPM) as the URL for your browser. You should see \
the search form. Enter a word that you know is somewhere on your site. Click the \
search button.
<p>(Fingers are crossed.)
<p>You <em>should</em> see the search results displayed, almost instantly.
<p>More information on the <a href="http://www.htdig.org/htsearch.html">htsearch</a> \
CGI program that does the actual searching.
<a name="troubleshooting"></a>
<h3>Troubleshooting</h3>
<p>If something isn't working right, the first thing to do is to go back and check \
your configuration and try repeating the above procedures. If this doesn't help, \
then the <a href="http://www.htdig.org/">ht://Dig site</a> has a lot of valuable \
reference material. Check the <a \
href="http://www.htdig.org/config.html">configuration page</a>, check the <a \
href="http://www.htdig.org/FAQ.html">FAQ</a>. Check the on-line reference section. \
Most important, make sure to visit the <a \
href="http://www.htdig.org/mailarchive.html">ht://Dig Mailing List Archive</a>. The \
ht://Dig community provides <em>excellent</em> support. Most (if not all) common \
"why doesn't this work" type questions have already been asked and answered on the \
mailing list, or in the FAQ.
<p><b>Use the search box at the bottom of <a href="http://www.htdig.org/">the main \
ht://Dig page</a> to search the archives (and the rest of the ht://Dig site.)</b>
<p>
<h1>Tips and Techniques</h1>
<a name="customizesearch"></a>
<h3>Customizing the search results</h3>
<p>Examine <tt><a href="#searchdir">${SEARCH_DIR}</a>/search.html</tt>. You use this \
as a basis for how you want the search forms to look. The search results are defined \
by the template files that are located in <tt><a \
href="#commondir">${COMMON_DIR}</a></tt>. You edit these to change how the search \
results are displayed.
<ul>
<li><a href="http://www.htdig.org/hts_form.html">Search form</a>
<li><a href="http://www.htdig.org/hts_templates.html">Template files</a>
<li><a href="http://www.htdig.org/config.html">Configuration documentation</a> has \
more information on these files. <li><a \
href="http://www.htdig.org/htsearch.html">Htsearch</a> <li><a \
href="http://www.htdig.org/hts_method.html">How searching works</a> </ul>
<p>One tricky part is that ht://Dig <em>totally ignores</em> the template files \
unless you add a <tt><a \
href="http://www.htdig.org/attrs.html#template_map">template_map directive</a></tt> \
to <tt>htdig.conf</tt>. Like this:
<pre>
this_base: myweb
search_results_header: ${common_dir}/${this_base}/header.html
search_results_footer: ${common_dir}/${this_base}/footer.html
nothing_found_file: ${common_dir}/${this_base}/nomatch.html
syntax_error_file: ${common_dir}/${this_base}/syntax.html
template_map: Long builtin-long ${common_dir}/${this_base}/long.html \
Short builtin-short ${common_dir}/${this_base}/short.html \
Default default ${common_dir}/${this_base}/long.html
template_name: Default
</pre>
<p>In this case I defined a new variable, <tt>this_base:</tt>, with a value of \
<tt>myweb</tt>. The way I use this is to first create a <tt>myweb</tt> directory \
under <tt><a href="#commondir">${COMMON_DIR}</a></tt> and copy all the template \
files into it <em>before</em> I start editing them. This leaves an untouched set \
of the template files.
<p>Once this was done, I went through and edited all the template files so that \
they displayed the way I wanted, e.g., editing <tt>${COMMON_DIR}/myweb/header.html, \
${COMMON_DIR}/myweb/footer.html</tt>, etc. This method is also valuable if you are \
indexing (and searching) multiple sites and are using multiple configuration files. \
You keep each different set of template files in a different directory (defined by \
the value that is assigned to <tt>this_base</tt>.)
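<p>To see how a value such as <tt>${common_dir}/${this_base}/header.html</tt> resolves to a real path, here is a small Python sketch of the substitution. It only illustrates the idea and is not ht://Dig's own code: the <tt>expand</tt> helper is hypothetical, and the <tt>common_dir</tt> value is simply the default tarball location from the table above.

```python
import re


def expand(value, variables):
    """Replace each ${name} in `value` with variables[name].

    Illustrative only; ht://Dig performs its own expansion internally.
    """
    def lookup(match):
        return variables[match.group(1)]

    return re.sub(r"\$\{(\w+)\}", lookup, value)


# Assumed values: the default tarball COMMON_DIR plus the this_base example.
variables = {
    "common_dir": "/opt/www/htdig/common",
    "this_base": "myweb",
}

print(expand("${common_dir}/${this_base}/header.html", variables))
# /opt/www/htdig/common/myweb/header.html
```

<p>Changing <tt>this_base</tt> to another name switches every template path to a different directory in one step, which is the point of using one configuration file per site.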
<p>Optional: you could also separate the database files by defining them like this:
<pre>
database_base: ${database_dir}/${this_base}
</pre>
<p>The database files default to be named like <tt>db.docdb, db.word.db</tt>, etc. \
Making the above change would result in the database files being named like \
<tt>myweb.docdb, myweb.word.db</tt>, etc. Again, this is important if you are using \
multiple configuration files to manage multiple search databases on the same machine. \
If you are only using one search database, then you can ignore defining \
<tt>database_base:</tt>.
<!--
<p>You would have to make a new directory for each different \
<tt>${database_dir}/${this_base}</tt>. For example, if <tt>${database_dir}</tt> is \
defined as <tt>/opt/www/htdig/db</tt> and <tt>${this_base}</tt> is defined as \
<tt>myweb</tt>, then you would create the \
<tt>/opt/www/htdig/db/myweb</tt> directory.
-->
<a name="date"></a>
<h3>Making the date display all four digits of the year in search results</h3>
<p>Add a <tt><a href="http://www.htdig.org/attrs.html#date_format">date_format:</a></tt> \
command to <tt>htdig.conf</tt>.
<p>Example: <tt>date_format: %m/%d/%Y</tt> will display like \
<b>01/23/2000</b>.
<p>See <tt>man strftime</tt> for full reference.
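<p>If you want to preview a format string before putting it in <tt>htdig.conf</tt>, Python's <tt>strftime</tt> accepts the same common conversion codes as the C function. This is a quick sketch; the helper name <tt>format_result_date</tt> is made up for the example, and less common codes may behave differently between platforms.

```python
from datetime import date


def format_result_date(d, date_format="%m/%d/%Y"):
    """Format a date the way a strftime-style date_format string would."""
    return d.strftime(date_format)


sample = date(2000, 1, 23)
print(format_result_date(sample))              # 01/23/2000 (four-digit year)
print(format_result_date(sample, "%m/%d/%y"))  # 01/23/00   (two-digit year)
```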
<a name="rundig2"></a>
<h3>An alternate rundig script</h3>
<p>ht://Dig supplies the <tt>rundig</tt> script that is sufficient to manage some \
ht://Dig indexing operations. But <tt>rundig</tt> doesn't support all the possible \
<tt>htdig, htmerge, and htfuzzy</tt> command line options. It is also difficult to \
use when you are specifying a different configuration file, because you have to type \
in the complete path to the configuration file.
<p>I have modified <tt>rundig</tt> to address this. The modified script is named \
<tt>rundig2</tt>. It now supports all the command line options. It also supplies \
the path and file extension when you use the <tt>-c config file</tt> option.
<p>Download either
<ul>
<a href="rundig2tar.txt">rundig2tar.txt</a> (For a default tarball installation)<br>
<a href="rundig2rpm.txt">rundig2rpm.txt</a> (For an RPM installation)
</ul>
<p>Download whichever of these is most appropriate. Rename it to be \
<tt>rundig2</tt>, check to see that the variables that define locations \
(<tt>DBDIR</tt>, etc.) are correct, move it to <tt><a \
href="#bindir">${BIN_DIR}</a></tt>, and chmod it to be executable. \
(<tt>chmod 755 rundig2</tt>)
<p>Now you can use <tt>rundig2</tt> instead of <tt>rundig</tt> when you are creating \
the database files. If <tt>rundig2</tt> doesn't work for you, for some reason, then \
go back to using <tt>rundig</tt> and <a href="mailto:wayne@scrounge.org">please let \
me know about it</a>.
<a name="pdf"></a>
<h3>Indexing PDF files</h3>
<p>Ht://Dig will index Adobe Acrobat PDF files quite nicely, but it needs some \
additional configuration. You must download and install a PDF-to-text converter and \
do some additional configuration. Here's how.
<p>Download the <a href="http://www.foolabs.com/xpdf/">Xpdf package</a> from the <a \
href="http://www.foolabs.com/xpdf/download.html">Xpdf Download page</a>. Linux Intel \
users can download the pre-compiled binaries (x86, Linux 2.0 (libc6)). Once you have \
the binaries, copy <tt>pdftotext</tt> and <tt>pdfinfo</tt> to a suitable \
location (<tt><a href="#bindir">${BIN_DIR}</a></tt> or <tt>/usr/bin</tt>, for \
example).
<p>Alternatively, you can also use one of these <a \
href="http://www.scrc.umanitoba.ca/htdig/rpms/">Xpdf RPM files</a>. Download \
<em>one</em> of these files:
<p><ul>
xpdf-0.90.0.i386.rpm (Red Hat 4.2)<br>
xpdf-0.90.0glibc.i386.rpm (Red Hat 5.x)<br>
xpdf-0.90.0glibc21.i386.rpm (Red Hat 6.x)<br>
</ul>
<p>Install the RPM (<tt>rpm -Uvh xpdf*.rpm</tt>) and <tt>pdftotext</tt> and \
<tt>pdfinfo</tt> will be installed in <tt>/usr/bin</tt>. (Double check the location \
with <tt>rpm -ql xpdf</tt>.)
<!--
<p>Download <tt>conv_doc.pl.gz</tt> from <a \
href="http://www.htdig.org/files/contrib/parsers/">http://www.htdig.org/files/contrib/parsers/</a>. \
Then use gunzip to "unzip" it.
<blockquote><small>
<p class="smalltext">You <em>might</em> have trouble unzipping conv_doc.pl.gz after \
downloading it with your browser. Try right clicking on it. Or try using a real \
FTP program. You can also get (ungzipped) conv_doc.pl <a \
href="http://www.scrc.umanitoba.ca/htdig/rpms">from here</a>. </small></blockquote>
-->
<p>Download <tt>conv_doc.pl</tt> <a \
href="http://www.scrc.umanitoba.ca/htdig/rpms">from here</a> and copy it to your \
<tt><a href="#bindir">${BIN_DIR}</a></tt> directory. Chmod it to be executable. \
(<tt>chmod 755 conv_doc.pl</tt>) Then load it in your editor and change \
the <tt>$CATPDF</tt> variable to point to where <tt>pdftotext</tt> is and change \
<tt>$PDFINFO</tt> to where <tt>pdfinfo</tt> is.
<p>Finally, edit <tt><a href="#configdir">${CONFIG_DIR}</a>/htdig.conf</tt> and add
<pre>
external_parsers: application/pdf->text/html <i>/usr/local/bin/</i>conv_doc.pl
</pre>
<p>Replace <tt><i>/usr/local/bin/</i></tt> with the location where you copied \
<tt>conv_doc.pl</tt>. <a href="http://www.htdig.org/attrs.html#external_parsers">More \
about the external_parsers: directive</a>.
<p><b>Important note.</b> ht://Dig must read each PDF file in its entirety in order \
to index it. This is affected by the <tt>max_doc_size:</tt> directive in \
<tt>htdig.conf</tt>. Make sure that <tt>max_doc_size:</tt> is set to be larger than \
your largest PDF file.
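<p>To pick a safe value for <tt>max_doc_size:</tt>, you need to know how big your largest PDF file is. The following Python sketch walks a directory tree and reports that size in bytes. The function name <tt>largest_pdf_size</tt> is hypothetical, and the demonstration runs on a throwaway directory; in practice you would point it at your own document root.

```python
import os
import tempfile


def largest_pdf_size(root):
    """Return the size in bytes of the largest .pdf file under root (0 if none)."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".pdf"):
                path = os.path.join(dirpath, name)
                largest = max(largest, os.path.getsize(path))
    return largest


# Demonstration on a temporary directory with two PDFs and one other file.
with tempfile.TemporaryDirectory() as docroot:
    for name, size in (("small.pdf", 100), ("big.PDF", 5000), ("notes.txt", 9000)):
        with open(os.path.join(docroot, name), "wb") as handle:
            handle.write(b"x" * size)
    print(largest_pdf_size(docroot))  # 5000
```

<p>Set <tt>max_doc_size:</tt> to something comfortably above the number that this reports for your site.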
<blockquote><small>
<p class="smalltext">pdftotext is pretty nifty. It can also be interfaced to lynx. \
Check /etc/lynx.cfg and ~/.mailcap. </small></blockquote>
<a name="doc"></a>
<h3>Indexing Microsoft Word files</h3>
<p> Installing a Microsoft Word to text converter is similar to <a \
href="#pdf">Indexing PDF Files</a>. Follow the procedures there to install and \
configure <tt>conv_doc.pl</tt>. The only difference is that you install a \
Word-to-Text converter, such as <a \
href="http://www.fe.msk.ru/~vitus/catdoc/">catdoc</a>. These go together, so it is \
almost as easy to install both the Word and PDF converters at the same time. \
<tt>conv_doc.pl</tt> is already partially configured to use <tt>catdoc</tt>. Add
<pre>
external_parsers: application/msword->text/html <i>/usr/local/bin/</i>conv_doc.pl
</pre>
<p>to <tt><a href="#configdir">${CONFIG_DIR}</a>/htdig.conf</tt>. If you were \
installing both the PDF and Word converters, then you'd add
<pre>
external_parsers: application/msword->text/html <i>/usr/local/bin/</i>conv_doc.pl \
\
application/pdf->text/html <i>/usr/local/bin/</i>conv_doc.pl
</pre>
<p>Again, replace <i>/usr/local/bin/</i> with the location where you have actually \
installed the <tt>conv_doc.pl</tt> script.
<a name="logging"></a>
<h3>Logging search requests</h3>
<p>It is valuable to have a record of what people are searching for so that you know \
what they are interested in. This can give you hints on additional content that you \
need to add to your site.
<p>To log search requests, add <tt><a \
href="http://www.htdig.org/attrs.html#logging">logging:</a> true</tt> to your \
configuration file. This will direct the system logging facility to log search \
requests.
<p>However, you might want to change the logfile to which syslog sends these \
messages. (By default they go to <tt>/var/log/messages</tt>.) To do this, edit \
your <tt>/etc/syslog.conf</tt> file and add this to it:
<pre>
# Log ht://Dig search requests
local5.* /var/log/htdig
</pre>
<p><b>Remember to use tabs and <i>NOT</i> spaces in your <tt>syslog.conf</tt> file. \
Otherwise it won't work.</b>
<p>The system will now log search requests to both <tt>/var/log/messages</tt> and \
<tt>/var/log/htdig</tt>, so now you have to tell it not to log search requests \
to <tt>/var/log/messages</tt>. To do this, add <tt>;local5.none</tt> to your \
<tt>/var/log/messages</tt> line. It should look something like this:
<pre>
# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none<font color="#FF0000"><b>;local5.none</b></font> \
/var/log/messages </pre>
<p>For the changes to take effect, you'll need to restart your <tt>syslog</tt> \
daemon. To do so, run:
<pre>
killall -HUP syslogd
</pre>
<p>That will force <tt>syslogd</tt> to re-read its config file for the changes to \
take effect.
<p>See <tt>man 5 syslog.conf</tt> for more information.
<p><i>Syslog information courtesy of Bruce A. Buhler</i>
<p><hr>
<p>Back to the <a href="../index.html#linux">scrounge.org home page.</a>
</body></html>
_______________________________________________
Quanta mailing list
Quanta@mail.kde.org
https://mail.kde.org/mailman/listinfo/quanta