Enhance Usability by Highlighting Search Terms
by Brian Suda, Matt Riggott
Google’s caching system offers several cool features; one of the most useful is that the words you searched for are highlighted in the page. Most web users don’t read pages carefully; they scan text for what they’re looking for. This is why Google’s cached-page highlighting is so useful. When the page is rendered, users don’t need to read the entire page to find what they came for, because the page shows them where it is. As a quick example, the words highlighted above most likely caught your eye before you actually got to reading them.
Usability heuristics state that users should not have to remember information from one site to the next. Wouldn’t it be great if you could extend search-term highlighting to the pages on your own website any time a visitor came from a search engine? How about also highlighting search terms from your own site’s search tool?
We’ve written a script in PHP that you can add to individual pages or entire websites that will automatically highlight words in your page if the user has followed a link from a search engine results page. You can skip the implementation overview and installation instructions and go straight to the script if you like.
Implementation
When someone visits your site from a search engine results page, that results page’s URL is sent on to your site. This is known as the referring URL or referrer (the HTTP specification misspells this as “referer”), and can be accessed via scripting languages such as PHP, Python, and ECMAScript / JavaScript. The referrer contains a query string (assuming the search engine uses the HTTP “get” method, as every search engine we know of does), which holds several keys and values. These look something like search.php?q=SEARCH+TERMS+HERE&l=en. With these keys and values, you can determine what terms were used on the search engine that listed your site as a result.
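As a rough illustration of that step, here is a minimal sketch of pulling the terms out of a referrer. It is not taken from the sehl script: the function name terms_from_referrer is hypothetical, it assumes PHP 7 or later, and it assumes the engine passes the terms in a q key as in the example URL above (other engines may use other keys).

```php
<?php
// Hypothetical helper (not part of sehl): extract search terms from a
// referring URL. Assumes the terms live in the "q" key, as in the
// example URL above; real search engines may use a different key.
function terms_from_referrer(string $referrer, string $key = 'q'): array
{
    // parse_url() returns null when there is no query string and
    // false for a seriously malformed URL, so accept strings only.
    $query = parse_url($referrer, PHP_URL_QUERY);
    if (!is_string($query) || $query === '') {
        return [];
    }

    // parse_str() decodes "+" and %xx escapes into plain text.
    parse_str($query, $params);
    if (!isset($params[$key]) || !is_string($params[$key])) {
        return [];
    }

    // Split on whitespace, lower-case, and drop duplicates.
    $terms = preg_split('/\s+/', trim($params[$key]), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique(array_map('strtolower', $terms)));
}

// In a live page the referrer would come from the request headers.
$referrer = $_SERVER['HTTP_REFERER'] ?? '';

$terms = terms_from_referrer('http://example.com/search.php?q=SEARCH+TERMS+HERE&l=en');
// $terms is now ['search', 'terms', 'here']
```

Lower-casing and de-duplicating the terms is a design choice of this sketch, made so that each distinct term can later be given its own class.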
The next step is to find all words in your page that match those that
the user searched for on the search engine. Once you have a complete list of
terms from the referrer’s query string, you wrap each instance of a term in a span element with a special class. Using your site’s cascading
style sheets, you then highlight these terms using background colors, font
weights, or different voices (depending on the target medium) so that they are more
apparent to the user. We gave each search term a different class so the terms
can be highlighted in different ways (e.g. every mention of “color” is
highlighted in yellow, every mention of “coding” is highlighted blue, and so
on).
This sounds fairly easy but there are complications that need to be
considered. If the visitor searches for “div,” you don’t want to
replace all the <div> tags with <<span
class="highlight">div</span>>.
You also don’t want to add span elements inside any attribute
values, or you’ll end up with something like <img src="example.png"
alt="This is an example <span class="highlight">image</span>"/>. We need to strip
out the tags from the plain text, parse the plain text for search terms and
wrap any instances in span tags, and finally put the plain text
and the tags back together again — without changing the original structure
or rendering of the page.
We accomplished this using regular expressions, a powerful tool that allows you to match patterns of text (see CPAN for a basic tutorial on using regular expressions). If you want to find an HTML tag you could use PHP’s string searching functions to find every possible combination of tags, but that takes a lot of work; with regular expressions you simply search for patterns.
We use a pattern analogous to saying “look for ‘<’ followed by any number of characters that are not ‘>’, followed by ‘>’”. The HTML file is the input string that the regular expression is matched against. Using this pattern we separate the HTML tags from the plain text. We then take the untagged plain text, add the span tags around search terms, and put the HTML tags back in their original positions. This way any semantic meaning and presentation (visual, aural, or otherwise) is preserved, along with the structure and validity of the markup.
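The split-highlight-rejoin idea described above can be sketched as follows. This is an illustration under stated assumptions, not the sehl source: the function name highlight_terms and the class names (highlight, term1, term2, and so on) are hypothetical, and the code assumes PHP 7 or later. It also ignores awkward cases such as a “>” inside an attribute value, the contents of script and style elements, and character entities.

```php
<?php
// Hypothetical sketch (not the sehl source): wrap search terms in span
// elements, touching only the text between tags.
function highlight_terms(string $html, array $terms): string
{
    if ($terms === []) {
        return $html; // nothing to look for
    }

    // Give each term its own class and build one alternation pattern.
    $classes = [];
    $alternatives = [];
    foreach (array_values($terms) as $n => $term) {
        $classes[strtolower($term)] = 'highlight term' . ($n + 1);
        $alternatives[] = preg_quote($term, '/');
    }
    $pattern = '/\b(' . implode('|', $alternatives) . ')\b/i';

    // "<", then any characters that are not ">", then ">". The capture
    // group makes preg_split() keep the tags, so the pieces alternate:
    // even indexes are plain text, odd indexes are tags.
    $pieces = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($pieces as $i => $piece) {
        if ($i % 2 === 1) {
            continue; // a tag: leave it exactly as it was
        }
        $pieces[$i] = preg_replace_callback(
            $pattern,
            function (array $m) use ($classes): string {
                $class = $classes[strtolower($m[1])] ?? 'highlight';
                return '<span class="' . $class . '">' . $m[1] . '</span>';
            },
            $piece
        );
    }

    // Put text and tags back together in their original order.
    return implode('', $pieces);
}

echo highlight_terms('<div class="div">A div here</div>', ['div']), "\n";
// <div class="div">A <span class="highlight term1">div</span> here</div>
```

Matching all terms in a single pass, rather than one pass per term, keeps a later term such as “span” or “class” from matching inside markup that an earlier pass inserted.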
Considerations for dynamically generated pages
So far we have concentrated on static files, and you may be wondering how
the highlighting functionality can be applied to dynamic pages, i.e. those that are not created in full until they are
sent to the user-agent. This problem is solved with PHP’s
output buffering.
By calling a single function, ob_start,
at the top of your PHP
scripts, output is held in a buffer until you choose to output it to the
HTTP stream. The
ob_start function accepts the name of a callback function as its first
argument. Just before the buffer is output, this function is called with
the buffer’s contents passed as a parameter. Whatever the function returns
is sent out into the ether to the user-agent. We can use this to modify the
buffer by adding our highlighting span tags.
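To make the mechanism concrete, here is a small self-contained sketch of an output-buffering callback. The callback name highlight_sawdust and the fixed word it looks for are purely illustrative assumptions; the real script registers its own sehl function instead.

```php
<?php
// Hypothetical callback for illustration only; the article's script
// registers sehl() in the same way.
function highlight_sawdust(string $buffer): string
{
    // Whatever this function returns replaces the buffered output.
    return str_replace(
        'sawdust',
        '<span class="highlight">sawdust</span>',
        $buffer
    );
}

// From here on, output is collected in a buffer rather than sent.
ob_start('highlight_sawdust');

echo "<p>We enjoy the word sawdust.</p>\n";

// Flushing runs the callback on the buffer and sends the result on.
// PHP also does this automatically when the script ends.
ob_end_flush();
// Sent: <p>We enjoy the word <span class="highlight">sawdust</span>.</p>
```

Because the callback sees the whole page as one string, it works the same way whether that page came from a static template or was assembled by other PHP code.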
Blimey. That’s enough techie-talk; time for a demonstration. We’ve rigged up a demo search engine: run a search, follow the result, and the resulting page will highlight your search terms.
Adding it to your website
Whether you run a large or small domain, new technology needs to be easy to deploy and maintain. There are several ways to include the search engine highlighting function in your PHP code. Here are just two.
The first method depends on how trusting your system administrator is, but if you use the Apache web server, you may be able to add a
php_value auto_prepend_file directive to a .htaccess
file. This makes PHP load the named file at the top of each PHP page
the server delivers. So to add the search-engine highlighting functionality to your
site you should add a line like:
php_value auto_prepend_file "/path/to/your/header.inc"
The header.inc file should contain the following code:
<?php
include('/absolute/path/to/sehl.php');
ob_start('sehl');
?>
Notice that the ob_start() function takes one parameter, in
this case a callback function, sehl (an abbreviation
for “search engine highlight”). This is the function that will be called
when the buffer is automatically flushed. The PHP
include statement includes sehl.php, which
contains the sehl function. Once you’ve finished this minor
fiddling you’re good to go. It’s important to note that Apache’s .htaccess file is a complex beastie, so if you want
to know more you should read Apache’s
.htaccess file tutorial.
If you can’t use .htaccess files or you’re
getting server errors, you won’t be able to use php_value auto_prepend_file.
That’s not a big problem because there is another method you can use to include the
highlighting functionality. In each PHP
script that should have search-engine highlighting, simply add a line at the
top of the script that includes the header.inc
file like so:
include('/path/to/your/header.inc');
Notes on efficiencies
There are several points to be aware of before adding the search-engine highlighting script to your site. Matching regular expressions can use a lot of processing time, and the larger the body of text, the more work the system has to do; this can potentially harm performance. Output buffering adds a small overhead as well, because the system has to hold your page in memory, edit it, then send a copy to the user.
Small- to medium-sized sites should not have any need to worry, but large-scale sites
with millions of hits would need to evaluate the best possible way to
implement this function. In an attempt at optimization, the sehl function
will only execute a bare minimum of code if the referrer is not thought to
be a search engine. No regular expressions will be used and no words will
be highlighted.
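One way such an early exit could look is sketched below. This is an assumption about the general shape, not the check sehl actually performs: the function names and the list of host names are hypothetical, and the code assumes PHP 7 or later.

```php
<?php
// Hypothetical sketch of a cheap early exit (not the sehl source).
// The host list passed in is an illustrative assumption.
function looks_like_search_referrer(string $referrer, array $engineHosts): bool
{
    if ($referrer === '') {
        return false; // direct visit or referrer withheld
    }
    $host = parse_url($referrer, PHP_URL_HOST);
    if (!is_string($host)) {
        return false;
    }
    $host = strtolower($host);
    foreach ($engineHosts as $engine) {
        $suffix = '.' . $engine;
        // Accept the host itself or any subdomain of it.
        if ($host === $engine || substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

function highlight_callback(string $buffer): string
{
    $referrer = $_SERVER['HTTP_REFERER'] ?? '';
    if (!looks_like_search_referrer($referrer, ['google.com', 'example.org'])) {
        return $buffer; // cheap path: no regular expressions are run
    }
    // The expensive term extraction and highlighting would happen here.
    return $buffer;
}
```

Only plain string comparisons run on the cheap path, so visitors who did not arrive from a listed search engine pay almost nothing for the feature.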
Customizing the script
In its current state, the sehl function will add a short
explanation to the top of each page it highlights words in, like so:
Why are some words highlighted in this page?
This site’s search-engine highlighting feature marks the words you just searched for easy identification.
A nice extension to this would be to add links to each instance of the highlighted words as demonstrated below:
You have just searched for search terms here; there are 6 instances on this page: 1, 2, 3, 4, 5, and 6.
These numbered links would be anchors that jump through the page to the highlighted words. It would also be possible to integrate this into your own site’s search engine (e.g. Atomz site search). You already know the search terms the users are interested in, so you can pass them on to other services.
You have just searched for search terms here; there are 6 instances on this page: 1, 2, 3, 4, 5, 6. Our own search engine has found 34 additional pages that match your search terms.
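The numbered links suggested above could be produced along these lines. This is a hypothetical sketch of the extension, not existing sehl behavior: the function name number_highlights and the hl1, hl2 id scheme are assumptions, and it assumes every highlighted term was wrapped in exactly <span class="highlight">.

```php
<?php
// Hypothetical extension (not in sehl): give each highlighted span a
// numbered id and build a matching list of in-page links.
function number_highlights(string $html): array
{
    $count = 0;
    $numbered = preg_replace_callback(
        '/<span class="highlight">/',
        function () use (&$count): string {
            $count++;
            return '<span class="highlight" id="hl' . $count . '">';
        },
        $html
    );

    // One link per instance, pointing at the ids added above.
    $links = [];
    for ($n = 1; $n <= $count; $n++) {
        $links[] = '<a href="#hl' . $n . '">' . $n . '</a>';
    }
    return [$numbered, $links];
}

list($page, $links) = number_highlights(
    '<span class="highlight">a</span> b <span class="highlight">a</span>'
);
echo 'There are ', count($links), ' instances on this page: ',
    implode(', ', $links), "\n";
```

The ids would need to be checked against any ids already used in the page so that the markup stays valid.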
The current implementation is clever enough to make sure it does not highlight partial matches; that is, it will not highlight “day” inside “today”. It is also case-insensitive, so a search for “day” will result in “Day”, “DAY”, etc. also being highlighted. Both behaviors can be easily changed, to highlight partial matches or to be case-sensitive, by making small changes to the regular expressions.
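The difference between these behaviors comes down to two regular expression features: the \b word-boundary assertion and the i (case-insensitive) modifier. The patterns below are illustrative and are not copied from the script.

```php
<?php
$text = 'Today is a new day. DAY one!';

$whole   = '/\bday\b/i'; // whole words only, any case
$partial = '/day/i';     // no \b: also matches inside "Today"
$exact   = '/\bday\b/';  // no i modifier: exact case only

// preg_match_all() returns the number of matches found.
$wholeCount   = preg_match_all($whole, $text);   // 2: "day" and "DAY"
$partialCount = preg_match_all($partial, $text); // 3: adds the "day" in "Today"
$exactCount   = preg_match_all($exact, $text);   // 1: only "day"

echo "$wholeCount $partialCount $exactCount\n";
```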
How to get the script
We expect this to be an ongoing project; you will always find the latest version of the search engine highlight code on Brian’s site. Additionally, A List Apart hosts the version used at the time of writing (zip file, 7.2KB).
There are probably a million and one different
ways that the code could be improved (we’ve already started on a fully
object-oriented version ourselves), and any comments are welcome. We’ve
released this code under the GNU
General Public Licence,
so you’re welcome to port the code to other scripting languages and do with
it what you will. Enjoy! 
Brian Suda: SWM informatician WLTM XML-RPC, SOAP, or REST, good HCI a must. Enjoys RDF, XHTML, LAMP, long walks on the beach and the
word sawdust.