(A version of this article appeared on ApacheToday.com, June 2, 2000)
If you're like me, you probably loathe updating directory index pages. You add a new file or folder to your website, and then you have to find the other pages that should link to it and update them, not to mention the toil of updating all of those pages again if the page name or location changes!
I solve this problem, quite simply, by creating directory index scripts using Perl. The largest of these scripts indexes a directory on my private web server whose folders contain pages about my projects. My entire website is logically organized (logical to me, anyway) using directories to house and nest information, and my "projects" page is no different.
Figure 1
From a filesystem standpoint, every directory in my "projects" directory contains a different project. Every project directory has an "index" HTML file. Every "index" HTML file has a title. Figure 1 demonstrates this. Keeping these rules in mind, it is easy to write a short Perl script that makes one's life much easier.
Configuring Apache
This script resides in the root of the "projects" folder, and is called "index.pl". In order for Apache to consider "index.pl" a directory index script, we have to configure the httpd.conf file to include "index.pl" as a valid directory index file. You may use "index.cgi" instead of "index.pl" if you prefer. My DirectoryIndex statement is shown below. Apache tries these entries one at a time, from left to right, and serves the first one it finds. You will probably want to place "index.html" ahead of "index.pl" if the majority of your directory index pages are HTML pages and not these handy scripts.
DirectoryIndex index.pl index.html index.php index.cgi index.htm
Regardless of what you call these scripts, make sure you let Apache know how to handle them, by using the AddHandler directive in your config file. Below is an excerpt of mine.
AddHandler cgi-script .pl .cgi
Thinking About the Problem
Recall the environment I mentioned earlier:
1. Every directory in my "projects" directory contains a different project
2. Every project directory has an "index.html" file
3. Every "index.html" file has a title
Given this organizational structure, our little script has to do only four things:
1. Obtain a list of directories
2. For every directory, open the "index.html" file if it exists
3. For every "index.html" file, extract the title of the page
4. For every pilfered title, print it back to the user as a link to the given page
Step 1 - Obtain a list of directories
A clumsy but easy way to acquire a list of directories is to read the entire contents of the root directory we want to index (the "projects" directory in our example) into an array.
1: my $dir="/usr/local/apache/htdocs/projects/";
2: my $url="http://mattwork.potsdam.edu/projects/";
3: opendir(PRJD,"$dir");
4: my @dirs=readdir PRJD;
5: closedir(PRJD);
Line 1 places the filesystem name of the root folder we want to index into a scalar variable called "$dir". Line 2 places the URL we will (eventually) substitute for the filesystem name into a scalar variable called "$url". Line 3 opens the directory, and assigns a directory handle named "PRJD" to it. Line 4 places the names of all the objects in that directory into the array called "@dirs". Line 5 simply closes the open directory. From here, we have all of the directories (as well as files and symbolic links) stored in the "@dirs" array. It may not be obvious yet why we need both "$dir" and "$url", but that will become clear in Step 4.
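The snippet above never checks whether opendir actually succeeded. As a minimal sketch of the same step with a lexical directory handle and an error check, something like the following could be used. The helper name "list_entries" is hypothetical and is not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return every entry
# name in $dir exactly as readdir reports it, including "." and "..".
sub list_entries {
    my ($dir) = @_;
    # Stop with a readable message if the directory cannot be opened.
    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @entries = readdir $dh;
    closedir $dh;
    return @entries;
}

# Example call, using the path from the article:
# my @dirs = list_entries("/usr/local/apache/htdocs/projects/");
```

Dying on failure is only one choice. In a CGI script you may prefer to print an error page instead.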
Step 2 - For every directory, open the "index.html" file if it exists
So now we need to iterate over our "@dirs" array. The simplest way to do this is with the "for" statement. Every time the loop goes around, a special scalar named "$_" will contain the name of the current object (be it directory, file, or link), and the loop will terminate when all objects have been processed. For sanity's sake, the code snippet below skips any object that isn't a directory, as well as any whose name starts with a "." (period).
1: for(sort @dirs) {
2: if($_ =~ /^\./) { next; }
3: unless(-d "$dir$_") { next; }
4: my $path="$dir$_/index.html";
5: my $title=Get_Title("$path");
6: # print the entry here
7: }
Line 1 starts the loop, iterating over our array-o-directory-objects. The "sort" function sorts the list of directories alphabetically, for a bit more user-friendliness. Line 2 says "if this object begins with a dot, then skip this object and cycle the loop". Line 3 says "if this object is not a directory, then skip this object and cycle the loop". Line 4 places the "$dir$_/index.html" magic into the "$path" scalar variable. Line 5 calls a mystery function (that we will be writing very soon) to extract the title from the webpage. Line 6 is a comment, holding the place for some code we will shoehorn in here during Step 4. Line 7 is the end of this loop.
If you look at line 4 above, you'll notice a "$dir$_/" magic. The "$dir" is the variable we set up in Step 1 that contains the filesystem name of the root directory we want to index. The "$_" variable, as I mentioned before, contains the name of the object we're currently processing. The trailing "/" is simply to append a slash to the end of the directory name. So, if we were currently processing the "Apache" object, "$dir" would contain "/usr/local/apache/htdocs/projects/", "$_" would contain "Apache" so "$dir$_/" would be the concatenation of all three which is "/usr/local/apache/htdocs/projects/Apache/"! What wonderful magic.
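The two "next" tests in lines 2 and 3 of the loop can also be expressed as a single "grep" filter that runs before the loop starts. The following is only a sketch of that alternative. The helper name "visible_dirs" is hypothetical, and it assumes "$dir" ends in a slash, as it does in Step 1.

```perl
use strict;
use warnings;

# Hypothetical helper: given the root directory (with a trailing slash)
# and the entry names from readdir, keep only the directories whose
# names do not start with a dot, and return them sorted.
sub visible_dirs {
    my ($dir, @entries) = @_;
    my @keep = grep { !/^\./ && -d "$dir$_" } @entries;
    return sort @keep;
}

# Example use inside the script:
# for my $name (visible_dirs($dir, @dirs)) { ... }
```

With the filtering done up front, the body of the loop only has to build the path and fetch the title.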
Step 3 - For every "index.html" file, extract the title of the page
As I mentioned when describing Line 5 of Step 2, we have to write a "Get_Title" function that takes in the name of the file, processes it, and returns the title of the page. Fortunately, titles are pretty easy to extract.
1: sub Get_Title {
2: my $filename=shift;
3: unless(-f "$filename") { return("NO INDEX"); }
4: open(HTML,"<$filename");
5: while(<HTML>){
6: if($_ =~ /<title>(.*)<\/title>/i) {
7: close HTML;
8: return "$1";
9: }
10: }
11: close HTML;
12: return "Untitled";
13: }
Don't let this snippet scare you; it's actually quite logical once dissected. Line 1 declares the function "Get_Title". Line 2 takes the parameter we passed to the function (that's the "$dir$_/index.html" from Line 4 in Step 2), and shifts it into the scalar variable "$filename". Line 3 says "unless this is a file, return the text 'NO INDEX'". Line 4 opens the file for reading and assigns the handle "HTML" to it. Line 5 begins a "while" iteration over every line of the open file (every line causes a new iteration of the loop, and the contents of the line are stored in the special variable "$_"). Line 6 says "if this line contains a '<title>' and a '</title>', place the stuff in between in the special variable '$1' and continue inside the braces". Note that both tags must sit on the same line for the match to succeed. Line 7 is inside the "if" statement, and closes the HTML file. Line 8 returns the text of the title and exits the function. Line 9 ends the "if" statement. Line 10 ends the "while" statement. Line 11 closes the HTML file if no title has been found. Line 12 returns the word "Untitled" in the event that no title has been found. Line 13 ends the function.
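Because "Get_Title" reads one line at a time, a title whose opening and closing tags fall on different lines comes back as "Untitled". As a hedged variation, not the version used in this article, the whole file can be read at once so that the match may span lines. The name "Get_Title_Slurp" is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical variation on Get_Title: read the whole file in one go so
# that a title split across several lines is still found.
sub Get_Title_Slurp {
    my ($filename) = @_;
    return "NO INDEX" unless -f $filename;
    open(my $fh, '<', $filename) or return "NO INDEX";
    local $/;                      # no record separator: read everything
    my $html = <$fh>;
    close $fh;
    return "Untitled" unless defined $html;
    # /i ignores case, /s lets "." match newlines, ".*?" stops at the
    # first closing tag.
    if ($html =~ m{<title[^>]*>\s*(.*?)\s*</title>}is) {
        my $title = $1;
        $title =~ s/\s+/ /g;       # collapse newlines and runs of spaces
        return $title;
    }
    return "Untitled";
}
```

This trades a little memory for robustness, which is usually acceptable for index pages. A real HTML parser would be more robust still.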
The "Get_Title" function is a bit complex in code, but I like how it demonstrates a lot of Perl's power and flexibility. The "if" statement in line 6 contains a regular expression that is case-insensitive (note the "i" after the last "/"), so that <TITLE>, <title>, and <titLE> all look the same to the "if".
Step 4 - For every pilfered title, print it back to the user as a link to the given page
I noted back in my description of Line 6 in Step 2, that we needed to add some code that displays the proper HTML link for the viewer of our index. Before we get to that, we need to do a little house cleaning. We need to shoehorn in an HTML header, and perhaps some introductory text on the line before the "for @dirs" on Line 1 of Step 2. At the very least, we need to send the HTTP content header to the viewer's browser, and probably should send a little more. The snippet below is an example of such.
1: print "Content-Type: text/html\n\n";
2: print "<html><head><title>Project Index Page</title></head><body>\n";
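Please note the two newline characters ("\n\n") at the end of Line 1. They are essential: the blank line they produce marks the end of the HTTP headers. Line 2 is optional and may be left out for brevity.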
So, now we're back to sending the correct link information to the viewer. The code below replaces the comment I left on Line 6 of Step 2.
1: $path =~ s/$dir/$url/i;
2: print "<a href=\"$path\">$title</a><br>\n";
Line 1 uses a substitution pattern to replace the filesystem name with the appropriate URL name. Line 2 prints the HTMLized entry we want: the title of the page showing, and the underlying link to that page.
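Two details in this step deserve care. The substitution treats "$dir" as a regular expression, so a "." in the path would match any character, and a title containing "&" or "<" would be printed as raw HTML. The sketch below shows one possible way to handle both. The helper names "escape_html", "path_to_url", and "link_line" are hypothetical and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in HTML
# text and in double-quoted attribute values.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: turn a filesystem path under $dir into a URL
# under $url. \Q...\E quotes any regex metacharacters in $dir, and ^
# anchors the match to the start of the path.
sub path_to_url {
    my ($path, $dir, $url) = @_;
    $path =~ s/^\Q$dir\E/$url/;
    return $path;
}

# Hypothetical helper: build one line of the index.
sub link_line {
    my ($path, $title, $dir, $url) = @_;
    my $href = escape_html(path_to_url($path, $dir, $url));
    return sprintf qq{<a href="%s">%s</a><br>\n}, $href, escape_html($title);
}

# Example use inside the loop from Step 2:
# print link_line($path, $title, $dir, $url);
```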
Summary Discussion
There's lots of room for improvement with this script. My own version is 82 lines of code and has all sorts of neat features, some of which I'll mention in a moment. There is also room for frustrating errors. It is imperative that you keep track of your trailing "/": append one where it is needed, and leave it off where it is not. If you're having odd problems, place a print statement just before you use one of the "$dir$_/" magics, and print the value to make sure it looks like a directory path or a file path (depending on what you're interested in at the moment).
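One way to sidestep the trailing-slash bookkeeping is to let the standard File::Spec module join the path pieces. This is a sketch of an alternative, not what the script in this article does, and the helper name "index_path" is hypothetical.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: build the path to a project's index file.
# File::Spec->catfile joins the pieces with the platform's separator
# and cleans up doubled slashes, so $dir may or may not end in "/".
sub index_path {
    my ($dir, $project) = @_;
    return File::Spec->catfile($dir, $project, "index.html");
}

# On a Unix-like system both calls give the same path:
# index_path("/usr/local/apache/htdocs/projects/", "Apache");
# index_path("/usr/local/apache/htdocs/projects",  "Apache");
```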
As I mentioned above, there's lots of room for improvement. Here is a list of some of the features that I have implemented in my various indexing scripts. They get more complex as you go down.
* More robust HTML for prettiness
* Making each entry a "list item"
* Exclusion list - an array of folders to skip during indexing
* Case-insensitive alphabetical directory sorting
* More memory-efficient iteration code
* Better error trapping for file/directory opening
* Case-insensitive alphabetical title sorting
* Recurse into sub-directories and "tree" out (like a spider)
* Allow the user to sort the list by clicking on "Sort by Title", "Sort by Age"
* Allow the user to "Sort by Popularity" (based on the number of "hits")
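As an illustration of two of the simpler items on this list, the exclusion list and the case-insensitive directory sort, here is a minimal sketch. It is not taken from my own scripts, and the function name "filter_and_sort" and the sample folder names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch: drop the folders named in an exclusion list,
# then sort what is left alphabetically without regard to case.
# Both arguments are array references.
sub filter_and_sort {
    my ($entries, $exclude) = @_;
    my %skip = map { ($_ => 1) } @$exclude;
    my @kept = grep { !$skip{$_} } @$entries;
    # Compare lowercased names first; fall back to a plain comparison
    # so that names differing only in case keep a stable order.
    return sort { lc($a) cmp lc($b) or $a cmp $b } @kept;
}

# Example:
# my @list = filter_and_sort([qw(zlib Apache perl CVS apache2)], ['CVS']);
# @list is now ("Apache", "apache2", "perl", "zlib")
```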
I'd also like to shamelessly plug Lincoln Stein's "CGI.pm" module, which is available on CPAN, and probably is already in your Perl distribution. I use this module to take a lot of the HTMLizing out of my hands. I wrote this example using no modules, so you could see what was going on, but I highly recommend using CGI.pm to do a lot of the HTML stuff for you.
Full Text of this Example
#!/usr/bin/perl -w

sub Get_Title {
    my $filename=shift;
    unless(-f "$filename") { return("NO INDEX"); }
    open(HTML,"<$filename");
    while(<HTML>){
        if($_ =~ /<title>(.*)<\/title>/i) {
            close HTML;
            return "$1";
        }
    }
    close HTML;
    return "Untitled";
}

my $dir="/usr/local/apache/htdocs/projects/";
my $url="http://mattwork.potsdam.edu/projects/";
opendir(PRJD,"$dir");
my @dirs=readdir PRJD;
closedir(PRJD);
print "Content-Type: text/html\n\n";
print "<html><head><title>Project Index Page</title></head><body>\n";
for(sort @dirs) {
    if($_ =~ /^\./) { next; }
    unless(-d "$dir$_") { next; }
    my $path="$dir$_/index.html";
    my $title=Get_Title("$path");
    $path =~ s/$dir/$url/i;
    print "<a href=\"$path\">$title</a><br>\n";
}