While working on my PHP search engine I needed a stemmer to improve the tokenization and the search index. After some research I found the PECL Stem package, which uses the Snowball stemming algorithms and has support for multiple languages. Exactly what I needed!

The PECL Stem package supports the following languages through these functions:

  • Danish => stem_danish()
  • Dutch => stem_dutch()
  • English => stem_english()
  • Finnish => stem_finnish()
  • French => stem_french()
  • German => stem_german()
  • Hungarian => stem_hungarian()
  • Italian => stem_italian()
  • Norwegian => stem_norwegian()
  • Portuguese => stem_portuguese()
  • Romanian => stem_romanian()
  • Russian => stem_russian_unicode()
  • Spanish => stem_spanish()
  • Swedish => stem_swedish()
  • Turkish => stem_turkish_unicode()
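
Since every language has its own function, a search engine that handles several languages may want a single entry point. Below is a minimal sketch of one way to do that. It is not part of the extension: stem_word() and the lowercase language keys are hypothetical names, and the only thing assumed to exist are the stem_*() functions from the list above. If support for a language was not installed, the word is returned unchanged.

```php
<?php

/**
 * Hypothetical helper (not part of PECL Stem): stems $word with the
 * function that the list above gives for $language.
 */
function stem_word(string $word, string $language): string
{
    // Language name => function name, copied from the list above.
    $functions = [
        'danish'     => 'stem_danish',
        'dutch'      => 'stem_dutch',
        'english'    => 'stem_english',
        'finnish'    => 'stem_finnish',
        'french'     => 'stem_french',
        'german'     => 'stem_german',
        'hungarian'  => 'stem_hungarian',
        'italian'    => 'stem_italian',
        'norwegian'  => 'stem_norwegian',
        'portuguese' => 'stem_portuguese',
        'romanian'   => 'stem_romanian',
        'russian'    => 'stem_russian_unicode',
        'spanish'    => 'stem_spanish',
        'swedish'    => 'stem_swedish',
        'turkish'    => 'stem_turkish_unicode',
    ];

    $function = $functions[strtolower($language)] ?? null;

    // Unknown language, or support for it was not installed:
    // return the word unchanged instead of failing.
    if ($function === null || !function_exists($function)) {
        return $word;
    }

    return $function($word);
}

echo stem_word('computer', 'english'), PHP_EOL;
```

Using an explicit map rather than building the function name from user input keeps the set of callable functions fixed.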

To install, run:

$ pecl install stem

During the installation process, you will be asked which languages to install support for. Make sure you answer [yes] for every language you want to use.

After the extension is installed, make sure that your php.ini file contains the line:

extension="stem.so"

Remember to reload Apache, or whichever web server you are using, so that the change takes effect.

To check that the extension has been correctly loaded by PHP, run:

$ php -i | grep 'stem support'

It should output: stem support => enabled
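
The same check can be done from PHP itself, which also shows which languages were selected during installation. This is a minimal sketch that assumes only the function names from the list of supported languages; stem_available() is a hypothetical helper name, not something the extension provides.

```php
<?php

/**
 * Hypothetical helper: reports, for each stem function named in this
 * post, whether it is defined in the running PHP process.
 *
 * @return array<string, bool> function name => available?
 */
function stem_available(): array
{
    $functions = [
        'stem_danish', 'stem_dutch', 'stem_english', 'stem_finnish',
        'stem_french', 'stem_german', 'stem_hungarian', 'stem_italian',
        'stem_norwegian', 'stem_portuguese', 'stem_romanian',
        'stem_russian_unicode', 'stem_spanish', 'stem_swedish',
        'stem_turkish_unicode',
    ];

    $result = [];
    foreach ($functions as $function) {
        $result[$function] = function_exists($function);
    }

    return $result;
}

// Print one line per function, e.g. for a quick check on the command line.
foreach (stem_available() as $function => $available) {
    printf("%-22s %s\n", $function, $available ? 'available' : 'missing');
}
```

Keep in mind that the command line and the web server can use different php.ini files, so it is worth running this in both environments.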

Example usage:

<?php

// PHP_EOL puts each stem on its own line; without it the output runs together.
echo stem_english('computer'), PHP_EOL; // Prints the stem "comput"
echo stem_swedish('datorer'), PHP_EOL;  // Prints the stem "dator"
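
For a search index, the stemmer is usually applied to every token of a text rather than to a single word. The following is a rough sketch of that step, not the tokenizer from my search engine: stemmed_tokens() is a hypothetical helper, the splitting rule is a deliberately simple assumption (lowercase ASCII letters and digits only, which suits English text), and the stemmer is passed in as a callable so that the code also runs when the extension is not loaded.

```php
<?php

/**
 * Hypothetical helper: lowercases $text, splits it into tokens and
 * passes every token through the given stemmer callable.
 *
 * @return string[] stemmed tokens in their original order
 */
function stemmed_tokens(string $text, callable $stemmer): array
{
    // Assumption: anything that is not an ASCII letter or digit separates tokens.
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    return array_map($stemmer, $tokens);
}

// Use stem_english() when the extension is loaded; otherwise keep tokens as they are.
$stemmer = function_exists('stem_english')
    ? 'stem_english'
    : function (string $token): string {
        return $token;
    };

print_r(stemmed_tokens('Computers and computing', $stemmer));
```

The same function should be applied to both the indexed documents and the search query, so that both sides are reduced to the same stems.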