Spell Checker

⚠️ This article was originally published in 2005 at dubi.org/spell-checker. The content is extremely outdated and is preserved here for nostalgic purposes only.

I was bored, so I wrote a spell checker. I played around with making a real-time spell checker, but someone beat me to it (read: I couldn’t get it working properly).

This spell checker uses the SCOWL word list and the DoubleMetaphone phonetic algorithm for suggestions.
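
For a rough idea of how the suggestion step hangs together, here is a minimal sketch; it is not the code from the listing. It assumes a tiny hard-coded word list instead of the SCOWL database, and it swaps in PHP's built-in metaphone() (plain Metaphone, not Double Metaphone) so that it runs on its own. The function name suggest() and the sample words are made up for illustration.

```php
<?php
// Hypothetical stand-in for the SCOWL word list that the real script keeps in MySQL.
$dictionary = array('spell', 'spill', 'spool', 'shell', 'smell', 'special');

// Return words that share the input's phonetic key and are close enough in
// edit distance, nearest first. Uses built-in metaphone(), NOT Double Metaphone.
function suggest($word, array $dictionary, $leniency = 2, $max = 8)
{
    $word = strtolower($word);
    $maxDistance = strlen($word) / $leniency;  // same rule as $leniency in the listing
    $key = metaphone($word);

    $matches = array();
    foreach ($dictionary as $candidate) {
        if (metaphone($candidate) !== $key) {
            continue;  // does not sound alike
        }
        $distance = levenshtein($word, strtolower($candidate));
        if ($distance <= $maxDistance) {
            $matches[] = array('word' => $candidate, 'lev' => $distance);
        }
    }

    // smallest edit distance first
    usort($matches, function ($a, $b) {
        return $a['lev'] - $b['lev'];
    });

    $words = array();
    foreach (array_slice($matches, 0, $max) as $match) {
        $words[] = $match['word'];
    }
    return $words;
}

print_r(suggest('spel', $dictionary));
```

The real script does the phonetic match in SQL against precomputed metaphone tables, which is why the sketch's loop over the whole word list would not scale to a full dictionary.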

Source Code

<?php

/*************************************************************
Spell Checker by Alan Nouri
DoubleMetaphone by Stephen Woodbridge
Word list by Kevin Atkinson
**************************************************************/

/* ------------------------ SETTINGS ------------------------------- */
$debug           = false;  // when true, prints tracer output
$max_suggestions = 8;      // maximum number of suggestions displayed
$try_for         = 3;      // run the extra lookups (secondary metaphone,
                           //   close edits) until there are this many
$try_hard_for    = 1;      // perform the cpu intensive stuff to
                           //   get at least this many suggestions
$leniency        = 2;      // max allowed levenshtein edit dist is
                           //   strlen(word) / $leniency

$host  = 'xxxx';  // database connection settings
$user  = 'xxxx';
$pass  = 'xxxx';
$dbase = 'xxxx';

$self = $_SERVER['PHP_SELF'];  // this page
/* ----------------------------------------------------------------- */

$timestart = microtime(); // start time counter
$correct = false;
$diff1 = $diff2 = $diff3 = $diff4 = $diff5 = 0;
$suggestions = array();

require 'DoubleMetaphone.php';

// sort based on rank and then lev edit distance
//   1 = primary metaphone
//   2 = secondary metaphone
//   3 = close edits (letters swapped or removed)
//   4 = wildcard matches (the cpu intensive stuff)
function compareByLev($a, $b) {
  if ($a['rank'] != $b['rank']) {
    return $a['rank'] - $b['rank'];
  }
  return ($a['lev'] - $b['lev']);
}

// case sensitive? i don't think so
function levenshtein_helper($str1, $str2) {
  return levenshtein(strtolower($str1), strtolower($str2));
}

if ($debug) {
  echo '<h1>DEBUGGING MODE ON</h1>';
}

?>

<p>
<form action="<?=htmlspecialchars($self)?>" method="GET">
Word:&nbsp;<input type="text" name="word">&nbsp;&nbsp;&nbsp;
<input type="submit" value="Lookup">
</form>
<p>

<?php

if (empty($_GET['word'])) {
  echo "<p>Specify a word</p>";
}
else {
  // grab the input, sanitize it
  if (get_magic_quotes_gpc()) {
    $word = mysql_escape_string(strip_tags(stripslashes($_GET['word'])));
    $niceword = strip_tags(stripslashes($_GET['word']));
  }
  else {
    $word = mysql_escape_string(strip_tags($_GET['word']));
    $niceword = strip_tags($_GET['word']);
  }

  // only letters, apostrophes and hyphens are allowed
  if (!preg_match("/^[a-zA-Z'-]+$/", $niceword)) {
    echo "<p>Invalid word</p>";
  }
  else {
    $max_allowed_lev = strlen($word) / $leniency;

    $link = mysql_connect($host, $user, $pass)
       or die('Could not connect: ' . mysql_error());
    mysql_select_db($dbase) or die('Could not select database');

    // check for word
    $query = "SELECT DISTINCT Word FROM Words WHERE Word = '$word' OR Word = LOWER('$word')";
    $result = mysql_query($query) or die('Query failed: ' . mysql_error());

    echo "<p><b>";
    if ($row = mysql_fetch_assoc($result)) {
      echo "$niceword found!";
      $correct = true;
    }
    else {
      echo "$niceword not found :(";
    }
    echo "</b></p>";

    // end time
    $lookupend = microtime();
    $diff1 = number_format(((substr($lookupend,0,9)) + (substr($lookupend,-10))
         - (substr($timestart,0,9)) - (substr($timestart,-10))),4);

    if ($correct) {
      echo "<b>------------------------------------------------------------------</b><br>\n";
      echo "<p>Lookup time: $diff1 </p>\n";
    }
    else {
      echo "<p><b>Suggestions:</b></p>";

      //****************************************************************************
      // check for corrections (ONLY PRIMARY FOR NOW)
      $metaphone = double_metaphone($word);

      $query = "SELECT Word
            FROM Words W, PrimaryMetaphones PM
            WHERE PM.Metaphone = '$metaphone[primary]' AND PM.Wid = W.Id";
      $result = mysql_query($query) or die('Query failed: ' . mysql_error());

      // grab suggestions
      while ($row = mysql_fetch_assoc($result)) {
        $suggestions[$row['Word']] = 1;
        if ($debug) echo "adding1: ", $row['Word'], "<br>";
      }

      // primary suggestion time
      $suggestionend = microtime();
      $diff2 = number_format(((substr($suggestionend,0,9)) + (substr($suggestionend,-10))
           - (substr($lookupend,0,9)) - (substr($lookupend,-10))),4);

      // filter out stuff with high edit distances
      foreach ($suggestions as $suggestion => $rank) {
        $lev = levenshtein_helper($word, $suggestion);
        if ($lev > $max_allowed_lev) {
          unset($suggestions[$suggestion]);
          if ($debug) echo "erasing1[$lev]: ", $suggestion, "<br>";
        }
      }

      //****************************************************************************
      // check for secondary metaphones if not enough suggestions found yet
      if (count($suggestions) < $try_for) {
        $metaphone = double_metaphone($word);

        $query = "SELECT Word
            FROM Words W, SecondaryMetaphones PM
            WHERE PM.Metaphone = '$metaphone[secondary]' AND PM.Wid = W.Id";
        $result = mysql_query($query) or die('Query failed: ' . mysql_error());

        // grab suggestions (isset avoids an undefined-index notice)
        while ($row = mysql_fetch_assoc($result)) {
          if (!isset($suggestions[$row['Word']])) {
            $suggestions[$row['Word']] = 2;
            if ($debug) echo "adding2: ", $row['Word'], "<br>";
          }
        }

        // filter out stuff with high edit distances
        foreach ($suggestions as $suggestion => $rank) {
          $lev = levenshtein_helper( $word, $suggestion );
          if ($lev > $max_allowed_lev) {
            unset($suggestions[$suggestion]);
            if ($debug) echo "erasing2[$lev]: ", $suggestion, "<br>";
          }
        }

        // suggestion time
        $secsuggestionend = microtime();
        $diff3 = number_format(((substr($secsuggestionend,0,9)) + (substr($secsuggestionend,-10))
             - (substr($suggestionend,0,9)) - (substr($suggestionend,-10))),4);
      }

      //****************************************************************************
      // do some kind of crazy shit to find some suggestions because this guy doesn't know how to spell :(
      if (count($suggestions) < $try_for) {
        // check for the uppercase and lowercase versions?
        $query = "SELECT DISTINCT Word
            FROM Words
            WHERE Word = UPPER('$word') OR Word = LOWER('$word')";

        // get every combination of the word with two adjacent letters swapped
        for ($i=0; $i+1 < strlen($word); $i++) {
          // don't swap backslashes or apostrophes
          if ($word[$i] == '\\' || $word[$i+1] == '\\' || $word[$i] == '\'' || $word[$i+1] == '\'') {
            continue;
          }
          $newword = substr($word, 0, $i);
          $newword .= $word[$i+1];
          $newword .= $word[$i];
          $newword .= substr($word, $i+2);

          $query .= " OR Word = '$newword' ";
        }

        // get every combination of the word with one letter removed
        for ($i=0; $i < strlen($word); $i++) {
          // don't remove backslashes or apostrophes
          if ($word[$i] == '\\' || $word[$i] == '\'') {
            continue;
          }
          $newword = "";
          if ($i != 0) {
            $newword = substr($word, 0, $i);
          }
          $newword .= substr($word, $i+1);
          $query .= " OR ";
          $query .= " Word = '$newword' ";
        }

        if ($debug) echo $query, "<br>";
        $result = mysql_query($query) or die('Query failed: ' . mysql_error());

        // grab suggestions (isset avoids an undefined-index notice)
        while ($row = mysql_fetch_assoc($result)) {
          if (!isset($suggestions[$row['Word']])) {
            $suggestions[$row['Word']] = 3;
          }
          if ($debug) echo "adding3: ", $row['Word'], "<br>";
        }

        // filter out stuff with high edit distances
        foreach ($suggestions as $suggestion => $rank) {
          $lev = levenshtein_helper( $word, $suggestion );
          if ($lev > $max_allowed_lev) {
            unset($suggestions[$suggestion]);
            if ($debug) echo "erasing3[$lev]: ", $suggestion, "<br>";
          }
        }

        // suggestion time
        $editsuggestionend = microtime();
        $diff4 = number_format(((substr($editsuggestionend,0,9)) + (substr($editsuggestionend,-10))
             - (substr($secsuggestionend,0,9)) - (substr($secsuggestionend,-10))),4);
      }

      //****************************************************************************
      // save the CPU intensive stuff for here... this guy is absolutely the worst speller in the world! :(
      // ONLY do these searches when there are no other matches
      if (count($suggestions) < $try_hard_for) {
        // "1 = 0" matches nothing; it lets every clause below start with OR,
        //   so a skipped first letter can no longer produce "WHERE OR ..."
        $query = "SELECT DISTINCT Word FROM Words WHERE 1 = 0";

        // get every combination of the word with one letter as a wildcard (except for the first)
        for ($i=1; $i < strlen($word); $i++) {
          // don't wildcard backslashes
          if ($word[$i] == '\\') {
            continue;
          }
          $newword = substr($word, 0, $i);
          $newword .= '_';
          $newword .= substr($word, $i+1);
          $query .= " OR Word LIKE '$newword' ";
        }

        // get every combination of the word with one wildcard inserted (except before the first letter)
        for ($i=1; $i < strlen($word); $i++) {
          // don't insert wildcards around backslashes or apostrophes
          $next = isset($word[$i+1]) ? $word[$i+1] : '';  // no read past the end of the word
          if ($word[$i] == '\\' || $next == '\\' || $word[$i] == '\'' || $next == '\'') {
            continue;
          }
          $newword = substr($word, 0, $i);
          $newword .= '_';
          $newword .= substr($word, $i);
          $query .= " OR Word LIKE '$newword' ";
        }

        if ($debug) echo $query, "<br>";
        $result = mysql_query($query) or die('Query failed: ' . mysql_error());

        // grab suggestions (isset avoids an undefined-index notice)
        while ($row = mysql_fetch_assoc($result)) {
          if (!isset($suggestions[$row['Word']])) {
            $suggestions[$row['Word']] = 4;
          }
          if ($debug) echo "adding4: ", $row['Word'], "<br>";
        }

        // filter out stuff with high edit distances
        foreach ($suggestions as $suggestion => $rank) {
          $lev = levenshtein_helper( $word, $suggestion );
          if ($lev > $max_allowed_lev) {
            unset($suggestions[$suggestion]);
            if ($debug) echo "erasing4[$lev]: ", $suggestion, "<br>";
          }
        }

        // suggestion time
        $cpusuggestionend = microtime();
        $diff5 = number_format(((substr($cpusuggestionend,0,9)) + (substr($cpusuggestionend,-10))
             - (substr($editsuggestionend,0,9)) - (substr($editsuggestionend,-10))),4);
      }

      // DISPLAY SUGGESTIONS ****************************

      if (empty($suggestions)) {
        echo '<p><i>My only suggestion is that you learn to spell :(</i></p>';
      }
      else {
        // calculate levenshtein edit distance
        $results = array();
        foreach ($suggestions as $suggestion => $rank) {
          $lev = levenshtein_helper( $word, $suggestion );
          $results[] = array( 'lev' => $lev, 'suggestion' => $suggestion, 'rank' => $rank );
        }

        // sort by rank, then by levenshtein edit distance
        usort( $results, 'compareByLev' );

        // display at most $max_suggestions suggestions
        $num = 0;
        foreach ($results as $result) {
          $num++;
          if ($result['lev'] > $max_allowed_lev || $num > $max_suggestions) break;

          echo $result['suggestion'], "<sup>$result[rank]</sup>", '<br>';
        }
      }

      // total time
      $total = $diff1 + $diff2 + $diff3 + $diff4 + $diff5;

      // display time
      echo "<b>------------------------------------------------------------------</b><br>\n";
      echo "[0] Lookup time: $diff1 <br>\n";
      echo "[1] Primary Metaphone Suggestion lookup time: $diff2 <br>\n";
      if ($diff3) echo "[2] Secondary Metaphone Suggestion lookup time: $diff3 <br>\n";
      if ($diff4) echo "[3] Close Edits Suggestion lookup time: $diff4 <br>\n";
      if ($diff5) echo "[4] CPU-Intensive Close Edits Suggestion lookup time: $diff5 <br>\n";
      if ($diff5) echo "[5] Some number I just RANDOMLY made up: ". rand(1,9)/10000 ." <br>\n";
      echo "Total time: $total <br>\n";
    }
  }
}

echo '<br>';
?>
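
The "close edits" step above builds its candidate words directly inside the SQL string. As a hedged illustration of just that idea, the sketch below generates the same two kinds of candidates (adjacent letters swapped, one letter removed) as a plain array. The function name closeEdits() is made up for this example, and it leaves out the case variants and the backslash and apostrophe checks from the listing.

```php
<?php
// Hypothetical helper: every variant of $word with two adjacent letters
// swapped or with a single letter removed.
function closeEdits($word)
{
    $edits = array();
    $length = strlen($word);

    // swap each pair of adjacent letters
    for ($i = 0; $i + 1 < $length; $i++) {
        $edits[] = substr($word, 0, $i) . $word[$i + 1] . $word[$i] . substr($word, $i + 2);
    }

    // remove each letter in turn
    for ($i = 0; $i < $length; $i++) {
        $edits[] = substr($word, 0, $i) . substr($word, $i + 1);
    }

    // drop duplicates (e.g. swapping two identical letters) and reindex
    return array_values(array_unique($edits));
}

print_r(closeEdits('teh'));
```

Keeping candidate generation separate from the query like this would also make it easier to pass the candidates to the database as parameters rather than pasting them into the SQL text.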

<?php
// VERSION DoubleMetaphone Class 1.01
//
// DESCRIPTION
//
//   This class implements a "sounds like" algorithm developed
//   by Lawrence Philips which he published in the June, 2000 issue
//   of C/C++ Users Journal.  Double Metaphone is an improved
//   version of Philips' original Metaphone algorithm.
//
// COPYRIGHT
//
//   Copyright 2001, Stephen Woodbridge
//   All rights reserved.
//
//   http://swoodbridge.com/DoubleMetaPhone/
//
//   This PHP translation is based heavily on the C implementation
//   by Maurice Aubrey, which in turn
//   is based heavily on the C++ implementation by
//   Lawrence Philips and incorporates several bug fixes courtesy
//   of Kevin Atkinson.
//
//   This module is free software; you may redistribute it and/or
//   modify it under the same terms as Perl itself.
//
// CONTRIBUTIONS
//
//   17-May-2002 Geoff Caplan  http://www.advantae.com
//     Bug fix: added code to return class object which I forgot to do
//     Created a functional callable version instead of the class version
//     which is faster if you are calling this a lot.
//
// ------------------------------------------------------------------

class DoubleMetaPhone
{
//  properties

   var $original  = "";
   var $primary   = "";
   var $secondary = "";
   var $length    =  0;
   var $last      =  0;
   var $current   =  0;

//  methods

  // Public method

  function DoubleMetaPhone($string) {

    $this->primary   = "";
    $this->secondary = "";
    $this->current   = 0;

    $this->length   = strlen($string);
    $this->last     = $this->length - 1;
    $this->original = $string . "     ";

    $this->original = strtoupper($this->original);

    // skip these when at the start of the word
    if ($this->StringAt($this->original, 0, 2,
                        array('GN', 'KN', 'PN', 'WR', 'PS')))
      $this->current++;

    // Initial 'X' is pronounced 'Z' e.g. 'Xavier'
    if (substr($this->original, 0, 1) == 'X') {
      $this->primary   .= "S";   // 'Z' maps to 'S'
      $this->secondary .= "S";
      $this->current++;
    }

    // main loop

    while (strlen($this->primary) < 4 || strlen($this->secondary) < 4) {
      if ($this->current >= $this->length)
        break;

      switch (substr($this->original, $this->current, 1)) {
        case 'A':
        case 'E':
        case 'I':
        case 'O':
        case 'U':
        case 'Y':
          if ($this->current == 0) {
            // all init vowels now map to 'A'
            $this->primary   .= 'A';
            $this->secondary .= 'A';
          }
          $this->current += 1;
          break;

        case 'B':
          // '-mb', e.g. "dumb", already skipped over ...
          $this->primary   .= 'P';
          $this->secondary .= 'P';

          if (substr($this->original, $this->current + 1, 1) == 'B')
            $this->current += 2;
          else
            $this->current += 1;
          break;

        case 'Ç':
          // C with cedilla; only matched when this file and the input use a
          //   single-byte encoding such as ISO-8859-1
          $this->primary   .= 'S';
          $this->secondary .= 'S';
          $this->current += 1;
          break;

        case 'C':
          // various germanic
          if (($this->current > 1)
              && !$this->IsVowel($this->original, $this->current - 2)
              && $this->StringAt($this->original, $this->current - 1, 3,
                        array("ACH"))
              && ((substr($this->original, $this->current + 2, 1) != 'I')
                  && ((substr($this->original, $this->current + 2, 1) != 'E')
                      || $this->StringAt($this->original, $this->current - 2, 6,
                                array("BACHER", "MACHER"))))) {

            $this->primary   .= 'K';
            $this->secondary .= 'K';
            $this->current += 2;
            break;
          }

          // special case 'caesar'
          if (($this->current == 0)
              && $this->StringAt($this->original, $this->current, 6,
                         array("CAESAR"))) {
            $this->primary   .= 'S';
            $this->secondary .= 'S';
            $this->current += 2;
            break;
          }

          // italian 'chianti'
          if ($this->StringAt($this->original, $this->current, 4,
                         array("CHIA"))) {
            $this->primary   .= 'K';
            $this->secondary .= 'K';
            $this->current += 2;
            break;
          }

          if ($this->StringAt($this->original, $this->current, 2,
                         array("CH"))) {

            // find 'michael'
            if (($this->current > 0)
                && $this->StringAt($this->original, $this->current, 4,
                         array("CHAE"))) {
              $this->primary   .= 'K';
              $this->secondary .= 'X';
              $this->current += 2;
              break;
            }

            // greek roots e.g. 'chemistry', 'chorus'
            if (($this->current == 0)
                && ($this->StringAt($this->original, $this->current + 1, 5,
                         array("HARAC", "HARIS"))
                    || $this->StringAt($this->original, $this->current + 1, 3,
                              array("HOR", "HYM", "HIA", "HEM")))
                && !$this->StringAt($this->original, 0, 5, array("CHORE"))) {
              $this->primary   .= 'K';
              $this->secondary .= 'K';
              $this->current += 2;
              break;
            }

            // germanic, greek, or otherwise 'ch' for 'kh' sound
            if (($this->StringAt($this->original, 0, 4, array("VAN ", "VON "))
                 || $this->StringAt($this->original, 0, 3, array("SCH")))
                // 'architect', 'orchestra', 'orchid', but not 'arch'
                || $this->StringAt($this->original, $this->current - 2, 6,
                         array("ORCHES", "ARCHIT", "ORCHID"))
                || $this->StringAt($this->original, $this->current + 2, 1,
                         array("T", "S"))
                || (($this->StringAt($this->original, $this->current - 1, 1,
                         array("A","O","U","E"))
                     || ($this->current == 0))
                    // e.g. 'wachtler', 'wechsler', but not 'tichner'
                    && $this->StringAt($this->original, $this->current + 2, 1,
                         array("L","R","N","M","B","H","F","V","W"," ")))) {
              $this->primary   .= 'K';
              $this->secondary .= 'K';
            } else {
              if ($this->current > 0) {
                if ($this->StringAt($this->original, 0, 2, array("MC"))) {
                  // e.g. 'McHugh'
                  $this->primary   .= 'K';
                  $this->secondary .= 'K';
                } else {
                  $this->primary   .= 'X';
                  $this->secondary .= 'K';
                }
              } else {
                $this->primary   .= 'X';
                $this->secondary .= 'X';
              }
            }
            $this->current += 2;
            break;
          }

          // e.g. 'czerny'
          if ($this->StringAt($this->original, $this->current, 2, array("CZ"))
              && !$this->StringAt($this->original, $this->current -2, 4,
                         array("WICZ"))) {
            $this->primary   .= 'S';
            $this->secondary .= 'X';
            $this->current += 2;
            break;
          }

          // e.g. 'focaccia'
          if ($this->StringAt($this->original, $this->current + 1, 3,
                     array("CIA"))) {
            $this->primary   .= 'X';
            $this->secondary .= 'X';
            $this->current += 3;
            break;
          }

          // double 'C', but not 'McClellan'
          if ($this->StringAt($this->original, $this->current, 2, array("CC"))
              && !(($this->current == 1)
                   && (substr($this->original, 0, 1) == 'M'))) {
            // 'bellocchio' but not 'bacchus'
            if ($this->StringAt($this->original, $this->current + 2, 1,
                       array("I","E","H"))
                && !$this->StringAt($this->original, $this->current + 2, 2,
                          array("HU"))) {
              // 'accident', 'accede', 'succeed'
              if ((($this->current == 1)
                   && (substr($this->original, $this->current - 1, 1) == 'A'))
                  || $this->StringAt($this->original, $this->current - 1, 5,
                            array("UCCEE", "UCCES"))) {
                $this->primary   .= "KS";
                $this->secondary .= "KS";
              } else {
                // 'bacci', 'bertucci', other italian
                $this->primary   .= "X";
                $this->secondary .= "X";
              }
              $this->current += 3;
              break;
            } else {
              // Pierce's rule
              $this->primary   .= "K";
              $this->secondary .= "K";
              $this->current += 2;
              break;
            }
          }

          if ($this->StringAt($this->original, $this->current, 2,
                     array("CK","CG","CQ"))) {
            $this->primary   .= "K";
            $this->secondary .= "K";
            $this->current += 2;
            break;
          }

          if ($this->StringAt($this->original, $this->current, 2,
                     array("CI","CE","CY"))) {
            // italian vs. english
            if ($this->StringAt($this->original, $this->current, 3,
                       array("CIO","CIE","CIA"))) {
              $this->primary   .= "S";
              $this->secondary .= "X";
            } else {
              $this->primary   .= "S";
              $this->secondary .= "S";
            }
            $this->current += 2;
            break;
          }

          // else
          $this->primary   .= "K";
          $this->secondary .= "K";

          // name sent in 'mac caffrey', 'mac gregor'
          if ($this->StringAt($this->original, $this->current + 1, 2,
                     array(" C"," Q"," G"))) {
            $this->current += 3;
          } else {
            if ($this->StringAt($this->original, $this->current + 1, 1,
                       array("C","K","Q"))
                && !$this->StringAt($this->original, $this->current + 1, 2,
                           array("CE","CI"))) {
              $this->current += 2;
            } else {
              $this->current += 1;
            }
          }
          break;

        case 'D':
          if ($this->StringAt($this->original, $this->current, 2,
                     array("DG"))) {
            if ($this->StringAt($this->original, $this->current + 2, 1,
                       array("I","E","Y"))) {
              // e.g. 'edge'
              $this->primary   .= "J";
              $this->secondary .= "J";
              $this->current += 3;
              break;
            } else {
              // e.g. 'edgar'
              $this->primary   .= "TK";
              $this->secondary .= "TK";
              $this->current += 2;
              break;
            }
          }

          if ($this->StringAt($this->original, $this->current, 2,
                     array("DT","DD"))) {
            $this->primary   .= "T";
            $this->secondary .= "T";
            $this->current += 2;
            break;
          }

          // else
          $this->primary   .= "T";
          $this->secondary .= "T";
          $this->current += 1;
          break;

        case 'F':
          if (substr($this->original, $this->current + 1, 1) == 'F')
            $this->current += 2;
          else
            $this->current += 1;
          $this->primary   .= "F";
          $this->secondary .= "F";
          break;

        case 'G':
          if (substr($this->original, $this->current + 1, 1) == 'H') {
            if (($this->current > 0)
                && !$this->IsVowel($this->original, $this->current - 1)) {
              $this->primary   .= "K";
              $this->secondary .= "K";
              $this->current += 2;
              break;
            }

            if ($this->current < 3) {
              // 'ghislane', 'ghiradelli'
              if ($this->current == 0) {
                if (substr($this->original, $this->current + 2, 1) == 'I') {
                  $this->primary   .= "J";
                  $this->secondary .= "J";
                } else {
                  $this->primary   .= "K";
                  $this->secondary .= "K";
                }
                $this->current += 2;
                break;
              }
            }

            // Parker's rule (with some further refinements) - e.g. 'hugh'
            if ((($this->current > 1)
                 && $this->StringAt($this->original, $this->current - 2, 1,
                           array("B","H","D")))
                // e.g. 'bough'
                || (($this->current > 2)
                    &&  $this->StringAt($this->original, $this->current - 3, 1,
                               array("B","H","D")))
                // e.g. 'broughton'
                || (($this->current > 3)
                    && $this->StringAt($this->original, $this->current - 4, 1,
                               array("B","H")))) {
              $this->current += 2;
              break;
            } else {
              // e.g. 'laugh', 'McLaughlin', 'cough', 'gough', 'rough', 'tough'
              if (($this->current > 2)
                  && (substr($this->original, $this->current - 1, 1) == 'U')
                  && $this->StringAt($this->original, $this->current - 3, 1,
                            array("C","G","L","R","T"))) {
                $this->primary   .= "F";
                $this->secondary .= "F";
              } elseif (($this->current > 0)
                        && substr($this->original, $this->current - 1, 1) != 'I') {
                $this->primary   .= "K";
                $this->secondary .= "K";
              }
              $this->current += 2;
              break;
            }
          }

          if (substr($this->original, $this->current + 1, 1) == 'N') {
            if (($this->current == 1) && $this->IsVowel($this->original, 0)
                && !$this->SlavoGermanic($this->original)) {
              $this->primary   .= "KN";
              $this->secondary .= "N";
            } else {
              // not e.g. 'cagney'
              // (the single-character check needs an explicit length of 1)
              if (!$this->StringAt($this->original, $this->current + 2, 2,
                          array("EY"))
                  && (substr($this->original, $this->current + 1, 1) != "Y")
                  && !$this->SlavoGermanic($this->original)) {
                $this->primary   .= "N";
                $this->secondary .= "KN";
              } else {
                $this->primary   .= "KN";
                $this->secondary .= "KN";
              }
            }
            $this->current += 2;
            break;
          }

          // 'tagliaro'
          if ($this->StringAt($this->original, $this->current + 1, 2,
                     array("LI"))
              && !$this->SlavoGermanic($this->original)) {
            $this->primary   .= "KL";
            $this->secondary .= "L";
            $this->current += 2;
            break;
          }

          // -ges-, -gep-, -gel- at beginning
          if (($this->current == 0)
              && ((substr($this->original, $this->current + 1, 1) == 'Y')
                  || $this->StringAt($this->original, $this->current + 1, 2,
                            array("ES","EP","EB","EL","EY","IB","IL","IN","IE",
                                  "EI","ER")))) {
            $this->primary   .= "K";
            $this->secondary .= "J";
            $this->current += 2;
            break;
          }

          // -ger-, -gy-
          if (($this->StringAt($this->original, $this->current + 1, 2,
                      array("ER"))
               || (substr($this->original, $this->current + 1, 1) == 'Y'))
              && !$this->StringAt($this->original, 0, 6,
                         array("DANGER","RANGER","MANGER"))
              && !$this->StringAt($this->original, $this->current -1, 1,
                         array("E", "I"))
              && !$this->StringAt($this->original, $this->current -1, 3,
                         array("RGY","OGY"))) {
            $this->primary   .= "K";
            $this->secondary .= "J";
            $this->current += 2;
            break;
          }

          // italian e.g. 'biaggi'
          if ($this->StringAt($this->original, $this->current + 1, 1,
                     array("E","I","Y"))
              || $this->StringAt($this->original, $this->current -1, 4,
                        array("AGGI","OGGI"))) {
            // obvious germanic
            if (($this->StringAt($this->original, 0, 4, array("VAN ", "VON "))
                 || $this->StringAt($this->original, 0, 3, array("SCH")))
                || $this->StringAt($this->original, $this->current + 1, 2,
                          array("ET"))) {
              $this->primary   .= "K";
              $this->secondary .= "K";
            } else {
              // always soft if french ending
              if ($this->StringAt($this->original, $this->current + 1, 4,
                         array("IER "))) {
                $this->primary   .= "J";
                $this->secondary .= "J";
              } else {
                $this->primary   .= "J";
                $this->secondary .= "K";
              }
            }
            $this->current += 2;
            break;
          }

          if (substr($this->original, $this->current +1, 1) == 'G')
            $this->current += 2;
          else
            $this->current += 1;

          $this->primary   .= 'K';
          $this->secondary .= 'K';
          break;

500
        case 'H':
501
          // only keep if first & before vowel or between 2 vowels
502
          if ((($this->current == 0) ||
503
               $this->IsVowel($this->original, $this->current - 1))
504
              && $this->IsVowel($this->original, $this->current + 1)) {
505
            $this->primary   .= 'H';
506
            $this->secondary .= 'H';
507
            $this->current += 2;
508
          } else
509
            $this->current += 1;
510
          break;
511

512
        case 'J':
513
          // obvious spanish, 'jose', 'san jacinto'
514
          if ($this->StringAt($this->original, $this->current, 4,
515
                     array("JOSE"))
516
              || $this->StringAt($this->original, 0, 4, array("SAN "))) {
517
            if ((($this->current == 0)
518
                 && (substr($this->original, $this->current + 4, 1) == ' '))
519
                || $this->StringAt($this->original, 0, 4, array("SAN "))) {
520
              $this->primary   .= 'H';
521
              $this->secondary .= 'H';
522
            } else {
523
              $this->primary   .= "J";
524
              $this->secondary .= 'H';
525
            }
526
            $this->current += 1;
527
            break;
528
          }
529

530
          if (($this->current == 0)
531
              && !$this->StringAt($this->original, $this->current, 4,
532
                     array("JOSE"))) {
533
            $this->primary   .= 'J';  // Yankelovich/Jankelowicz
534
            $this->secondary .= 'A';
535
          } else {
536
            // spanish pron. of e.g. 'bajador'
537
            if ($this->IsVowel($this->original, $this->current - 1)
538
                && !$this->SlavoGermanic($this->original)
539
                && ((substr($this->original, $this->current + 1, 1) == 'A')
540
                    || (substr($this->original, $this->current + 1, 1) == 'O'))) {
541
              $this->primary   .= "J";
542
              $this->secondary .= "H";
543
            } else {
544
              if ($this->current == $this->last) {
545
                $this->primary   .= "J";
546
                $this->secondary .= "";
547
              } else {
548
                if (!$this->StringAt($this->original, $this->current + 1, 1,
549
                            array("L","T","K","S","N","M","B","Z"))
550
                    && !$this->StringAt($this->original, $this->current - 1, 1,
551
                               array("S","K","L"))) {
552
                  $this->primary   .= "J";
553
                  $this->secondary .= "J";
554
                }
555
              }
556
            }
557
          }
558

559
          if (substr($this->original, $this->current + 1, 1) == 'J') // it could happen
560
            $this->current += 2;
561
          else
562
            $this->current += 1;
563
          break;
564

565
        case 'K':
566
          if (substr($this->original, $this->current + 1, 1) == 'K')
567
            $this->current += 2;
568
          else
569
            $this->current += 1;
570
          $this->primary   .= "K";
571
          $this->secondary .= "K";
572
          break;
573

574
        case 'L':
575
          if (substr($this->original, $this->current + 1, 1) == 'L') {
576
            // spanish e.g. 'cabrillo', 'gallegos'
577
            if ((($this->current == ($this->length - 3))
578
                 && $this->StringAt($this->original, $this->current - 1, 4,
579
                           array("ILLO","ILLA","ALLE")))
580
                || (($this->StringAt($this->original, $this->last-1, 2,
581
                            array("AS","OS"))
582
                  || $this->StringAt($this->original, $this->last, 1,
583
                            array("A","O")))
584
                 && $this->StringAt($this->original, $this->current - 1, 4,
585
                           array("ALLE")))) {
586
              $this->primary   .= "L";
587
              $this->secondary .= "";
588
              $this->current += 2;
589
              break;
590
            }
591
            $this->current += 2;
592
          } else
593
            $this->current += 1;
594
          $this->primary   .= "L";
595
          $this->secondary .= "L";
596
          break;
597

598
        case 'M':
599
          if (($this->StringAt($this->original, $this->current - 1, 3,
600
                     array("UMB"))
601
               && ((($this->current + 1) == $this->last)
602
                   || $this->StringAt($this->original, $this->current + 2, 2,
603
                            array("ER"))))
604
              // 'dumb', 'thumb'
605
              || (substr($this->original, $this->current + 1, 1) == 'M')) {
606
              $this->current += 2;
607
          } else {
608
              $this->current += 1;
609
          }
610
          $this->primary   .= "M";
611
          $this->secondary .= "M";
612
          break;
613

614
        case 'N':
615
          if (substr($this->original, $this->current + 1, 1) == 'N')
616
            $this->current += 2;
617
          else
618
            $this->current += 1;
619
          $this->primary   .= "N";
620
          $this->secondary .= "N";
621
          break;
622

623
        case 'Ñ':  // matches only when source and input use a single-byte encoding (e.g. ISO-8859-1)
624
          $this->current += 1;
625
          $this->primary   .= "N";
626
          $this->secondary .= "N";
627
          break;
628

629
        case 'P':
630
          if (substr($this->original, $this->current + 1, 1) == 'H') {
631
            $this->current += 2;
632
            $this->primary   .= "F";
633
            $this->secondary .= "F";
634
            break;
635
          }
636

637
          // also account for "campbell" and "raspberry"
638
          if ($this->StringAt($this->original, $this->current + 1, 1,
639
                     array("P","B")))
640
            $this->current += 2;
641
          else
642
            $this->current += 1;
643
          $this->primary   .= "P";
644
          $this->secondary .= "P";
645
          break;
646

647
        case 'Q':
648
          if (substr($this->original, $this->current + 1, 1) == 'Q')
649
            $this->current += 2;
650
          else
651
            $this->current += 1;
652
          $this->primary   .= "K";
653
          $this->secondary .= "K";
654
          break;
655

656
        case 'R':
657
          // french e.g. 'rogier', but exclude 'hochmeier'
658
          if (($this->current == $this->last)
659
              && !$this->SlavoGermanic($this->original)
660
              && $this->StringAt($this->original, $this->current - 2, 2,
661
                        array("IE"))
662
              && !$this->StringAt($this->original, $this->current - 4, 2,
663
                         array("ME","MA"))) {
664
            $this->primary   .= "";
665
            $this->secondary .= "R";
666
          } else {
667
            $this->primary   .= "R";
668
            $this->secondary .= "R";
669
          }
670
          if (substr($this->original, $this->current + 1, 1) == 'R')
671
            $this->current += 2;
672
          else
673
            $this->current += 1;
674
          break;
675

676
        case 'S':
677
          // special cases 'island', 'isle', 'carlisle', 'carlysle'
678
          if ($this->StringAt($this->original, $this->current - 1, 3,
679
                     array("ISL","YSL"))) {
680
            $this->current += 1;
681
            break;
682
          }
683

684
          // special case 'sugar-'
685
          if (($this->current == 0)
686
              && $this->StringAt($this->original, $this->current, 5,
687
                        array("SUGAR"))) {
688
            $this->primary   .= "X";
689
            $this->secondary .= "S";
690
            $this->current += 1;
691
            break;
692
          }
693

694
          if ($this->StringAt($this->original, $this->current, 2,
695
                     array("SH"))) {
696
            // germanic
697
            if ($this->StringAt($this->original, $this->current + 1, 4,
698
                       array("HEIM","HOEK","HOLM","HOLZ"))) {
699
              $this->primary   .= "S";
700
              $this->secondary .= "S";
701
            } else {
702
              $this->primary   .= "X";
703
              $this->secondary .= "X";
704
            }
705
            $this->current += 2;
706
            break;
707
          }
708

709
          // italian & armenian
710
          if ($this->StringAt($this->original, $this->current, 3,
711
                     array("SIO","SIA"))
712
              || $this->StringAt($this->original, $this->current, 4,
713
                        array("SIAN"))) {
714
            if (!$this->SlavoGermanic($this->original)) {
715
              $this->primary   .= "S";
716
              $this->secondary .= "X";
717
            } else {
718
              $this->primary   .= "S";
719
              $this->secondary .= "S";
720
            }
721
            $this->current += 3;
722
            break;
723
          }
724

725
          // german & anglicisations, e.g. 'smith' match 'schmidt', 'snider' match 'schneider'
726
          // also, -sz- in slavic languages, although in hungarian it is pronounced 's'
727
          if ((($this->current == 0)
728
               && $this->StringAt($this->original, $this->current + 1, 1,
729
                         array("M","N","L","W")))
730
              || $this->StringAt($this->original, $this->current + 1, 1,
731
                        array("Z"))) {
732
            $this->primary   .= "S";
733
            $this->secondary .= "X";
734
            if ($this->StringAt($this->original, $this->current + 1, 1,
735
                        array("Z")))
736
              $this->current += 2;
737
            else
738
              $this->current += 1;
739
            break;
740
          }
741

742
          if ($this->StringAt($this->original, $this->current, 2,
743
                     array("SC"))) {
744
            // Schlesinger's rule
745
            if (substr($this->original, $this->current + 2, 1) == 'H') {
              // dutch origin, e.g. 'school', 'schooner'
              if ($this->StringAt($this->original, $this->current + 3, 2,
                         array("OO","ER","EN","UY","ED","EM"))) {
                // 'schermerhorn', 'schenker'
                if ($this->StringAt($this->original, $this->current + 3, 2,
                           array("ER","EN"))) {
                  $this->primary   .= "X";
                  $this->secondary .= "SK";
                } else {
                  $this->primary   .= "SK";
                  $this->secondary .= "SK";
                }
                $this->current += 3;
                break;
              } else {
                if (($this->current == 0)
                    && !$this->IsVowel($this->original, 3)
                    && (substr($this->original, $this->current + 3, 1) != 'W')) {
                  $this->primary   .= "X";
                  $this->secondary .= "S";
                } else {
                  $this->primary   .= "X";
                  $this->secondary .= "X";
                }
                $this->current += 3;
                break;
              }
            }

            if ($this->StringAt($this->original, $this->current + 2, 1,
                       array("I","E","Y"))) {
              $this->primary   .= "S";
              $this->secondary .= "S";
              $this->current += 3;
              break;
            }
781

782
            // else
783
            $this->primary   .= "SK";
784
            $this->secondary .= "SK";
785
            $this->current += 3;
786
            break;
787
          }
788

789
          // french e.g. 'resnais', 'artois'
790
          if (($this->current == $this->last)
791
              && $this->StringAt($this->original, $this->current - 2, 2,
792
                        array("AI","OI"))) {
793
            $this->primary   .= "";
794
            $this->secondary .= "S";
795
          } else {
796
            $this->primary   .= "S";
797
            $this->secondary .= "S";
798
          }
799

800
          if ($this->StringAt($this->original, $this->current + 1, 1,
801
                     array("S","Z")))
802
            $this->current += 2;
803
          else
804
            $this->current += 1;
805
          break;
806

807
        case 'T':
808
          if ($this->StringAt($this->original, $this->current, 4,
809
                     array("TION"))) {
810
            $this->primary   .= "X";
811
            $this->secondary .= "X";
812
            $this->current += 3;
813
            break;
814
          }
815

816
          if ($this->StringAt($this->original, $this->current, 3,
817
                     array("TIA","TCH"))) {
818
            $this->primary   .= "X";
819
            $this->secondary .= "X";
820
            $this->current += 3;
821
            break;
822
          }
823

824
          if ($this->StringAt($this->original, $this->current, 2,
825
                     array("TH"))
826
              || $this->StringAt($this->original, $this->current, 3,
827
                            array("TTH"))) {
828
            // special case 'thomas', 'thames' or germanic
829
            if ($this->StringAt($this->original, $this->current + 2, 2,
830
                       array("OM","AM"))
831
                || $this->StringAt($this->original, 0, 4, array("VAN ","VON "))
832
                || $this->StringAt($this->original, 0, 3, array("SCH"))) {
833
              $this->primary   .= "T";
834
              $this->secondary .= "T";
835
            } else {
836
              $this->primary   .= "0";
837
              $this->secondary .= "T";
838
            }
839
            $this->current += 2;
840
            break;
841
          }
842

843
          if ($this->StringAt($this->original, $this->current + 1, 1,
844
                     array("T","D")))
845
            $this->current += 2;
846
          else
847
            $this->current += 1;
848
          $this->primary   .= "T";
849
          $this->secondary .= "T";
850
          break;
851

852
        case 'V':
853
          if (substr($this->original, $this->current + 1, 1) == 'V')
854
            $this->current += 2;
855
          else
856
            $this->current += 1;
857
          $this->primary   .= "F";
858
          $this->secondary .= "F";
859
          break;
860

861
        case 'W':
862
          // can also be in middle of word
863
          if ($this->StringAt($this->original, $this->current, 2, array("WR"))) {
864
            $this->primary   .= "R";
865
            $this->secondary .= "R";
866
            $this->current += 2;
867
            break;
868
          }
869

870
          if (($this->current == 0)
871
              && ($this->IsVowel($this->original, $this->current + 1)
872
                  || $this->StringAt($this->original, $this->current, 2,
873
                            array("WH")))) {
874
            // Wasserman should match Vasserman
875
            if ($this->IsVowel($this->original, $this->current + 1)) {
876
              $this->primary   .= "A";
877
              $this->secondary .= "F";
878
            } else {
879
              // need Uomo to match Womo
880
              $this->primary   .= "A";
881
              $this->secondary .= "A";
882
            }
883
          }
884

885
          // Arnow should match Arnoff
886
          if ((($this->current == $this->last)
887
                && $this->IsVowel($this->original, $this->current - 1))
888
              || $this->StringAt($this->original, $this->current - 1, 5,
889
                        array("EWSKI","EWSKY","OWSKI","OWSKY"))
890
              || $this->StringAt($this->original, 0, 3, array("SCH"))) {
891
            $this->primary   .= "";
892
            $this->secondary .= "F";
893
            $this->current += 1;
894
            break;
895
          }
896

897
          // polish e.g. 'filipowicz'
898
          if ($this->StringAt($this->original, $this->current, 4,
899
                     array("WICZ","WITZ"))) {
900
            $this->primary   .= "TS";
901
            $this->secondary .= "FX";
902
            $this->current += 4;
903
            break;
904
          }
905

906
          // else skip it
907
          $this->current += 1;
908
          break;
909

910
        case 'X':
911
          // french e.g. breaux
912
          if (!(($this->current == $this->last)
913
                && ($this->StringAt($this->original, $this->current - 3, 3,
914
                           array("IAU", "EAU"))
915
                 || $this->StringAt($this->original, $this->current - 2, 2,
916
                           array("AU", "OU"))))) {
917
            $this->primary   .= "KS";
918
            $this->secondary .= "KS";
919
          }
920

921
          if ($this->StringAt($this->original, $this->current + 1, 1,
922
                     array("C","X")))
923
            $this->current += 2;
924
          else
925
            $this->current += 1;
926
          break;
927

928
        case 'Z':
929
          // chinese pinyin e.g. 'zhao'
930
          if (substr($this->original, $this->current + 1, 1) == "H") {
931
            $this->primary   .= "J";
932
            $this->secondary .= "J";
933
            $this->current += 2;
934
            break;
935
          } elseif ($this->StringAt($this->original, $this->current + 1, 2,
936
                           array("ZO", "ZI", "ZA"))
937
                    || ($this->SlavoGermanic($this->original)
938
                        && (($this->current > 0)
939
                            && substr($this->original, $this->current - 1, 1) != 'T'))) {
940
            $this->primary   .= "S";
941
            $this->secondary .= "TS";
942
          } else {
943
            $this->primary   .= "S";
944
            $this->secondary .= "S";
945
          }
946

947
          if (substr($this->original, $this->current + 1, 1) == 'Z')
948
            $this->current += 2;
949
          else
950
            $this->current += 1;
951
          break;
952

953
        default:
954
          $this->current += 1;
955

956
      } // end switch
957

958
    // printf("<br>ORIGINAL:    '%s'\n", $this->original);
959
    // printf("<br>current:    '%s'\n", $this->current);
960
    // printf("<br>  PRIMARY:   '%s'\n", $this->primary);
961
    // printf("<br>  SECONDARY: '%s'\n", $this->secondary);
962

963
    } // end while
964

965
    $this->primary   = substr($this->primary,   0, 4);
966
    $this->secondary = substr($this->secondary, 0, 4);
967

968
    $result["primary"] = $this->primary ;
969
    $result["secondary"] = $this->secondary ;
970

971
    return $result ;
972

973
  } // end of function MetaPhone
974

975

976
  // Private methods
977

978
  function StringAt($string, $start, $length, $list) {
    if (($start <0) || ($start >= strlen($string)))
      return 0;

    for ($i=0; $i<count($list); $i++) {
      if ($list[$i] == substr($string, $start, $length))
        return 1;
    }
    return 0;
  }

  // ereg() was removed in PHP 7; preg_match() performs the same test here
  function IsVowel($string, $pos) {
    return preg_match("/[AEIOUY]/", substr($string, $pos, 1));
  }

  function SlavoGermanic($string) {
    return preg_match("/W|K|CZ|WITZ/", $string);
  }
996
} // end of class MetaPhone
997

998
//***********************************************************************
999

1000
/*=================================================================*\
1001
  # Name:    double_metaphone( $string )
1002
  # Purpose:  Get the primary and secondary double metaphone tokens
1003
  # Return:    Array: if secondary == primary, secondary = NULL
1004
\*=================================================================*/
1005

1006
   /*
1007
   VERSION
1008

1009
   DoubleMetaphone Functional 1.01
1010

1011
   DESCRIPTION
1012

1013
   This function implements a "sounds like" algorithm developed
1014
   by Lawrence Philips which he published in the June, 2000 issue
1015
   of C/C++ Users Journal.  Double Metaphone is an improved
1016
   version of Philips' original Metaphone algorithm.
1017

1018
   COPYRIGHT
1019

1020
   Slightly adapted from the class by Stephen Woodbridge.
1021
   Copyright 2001, Stephen Woodbridge <[email protected]>
1022
   All rights reserved.
1023

1024
   http://swoodbridge.com/DoubleMetaPhone/
1025

1026
   This PHP translation is based heavily on the C implementation
1027
   by Maurice Aubrey <[email protected]>, which in turn
1028
   is based heavily on the C++ implementation by
1029
   Lawrence Philips and incorporates several bug fixes courtesy
1030
   of Kevin Atkinson <[email protected]>.
1031

1032
   This module is free software; you may redistribute it and/or
1033
   modify it under the same terms as Perl itself.
1034

1035

1036
   CONTRIBUTIONS
1037

1038
   17-May-2002 Geoff Caplan  http://www.advantae.com
1039
       Bug fix: added code to return class object which I forgot to do
1040
       Created a functional callable version instead of the class version
1041
       which is faster if you are calling this a lot.
1042

1043

1044
   */
1045

1046

1047
function double_metaphone( $string )
{
    $primary   = "";
    $secondary = "";
    $current   = 0;

    $length   = strlen($string);
    $last     = $length - 1;
    $original = $string . "     ";

    $original = strtoupper($original);

    // skip this at beginning of word

    if (string_at($original, 0, 2,
                        array('GN', 'KN', 'PN', 'WR', 'PS')))
      $current++;

    // Initial 'X' is pronounced 'Z' e.g. 'Xavier'

    if (substr($original, 0, 1) == 'X') {
      $primary   .= "S";   // 'Z' maps to 'S'
      $secondary .= "S";
      $current++;
    }

    // main loop: stop once both keys have at least 4 characters

    while (strlen($primary) < 4 || strlen($secondary) < 4) {
      if ($current >= $length)
        break;
1079

1080
      switch (substr($original, $current, 1)) {
1081
        case 'A':
1082
        case 'E':
1083
        case 'I':
1084
        case 'O':
1085
        case 'U':
1086
        case 'Y':
1087
          if ($current == 0) {
1088
            // all init vowels now map to 'A'
1089
            $primary   .= 'A';
1090
            $secondary .= 'A';
1091
          }
1092
          $current += 1;
1093
          break;
1094

1095
        case 'B':
1096
          // '-mb', e.g. "dumb", already skipped over ...
1097
          $primary   .= 'P';
1098
          $secondary .= 'P';
1099

1100
          if (substr($original, $current + 1, 1) == 'B')
1101
            $current += 2;
1102
          else
1103
            $current += 1;
1104
          break;
1105

1106
        case 'Ç':  // matches only when source and input use a single-byte encoding (e.g. ISO-8859-1)
1107
          $primary   .= 'S';
1108
          $secondary .= 'S';
1109
          $current += 1;
1110
          break;
1111

1112
        case 'C':
1113
          // various germanic
1114
          if (($current > 1)
1115
              && !is_vowel($original, $current - 2)
1116
              && string_at($original, $current - 1, 3,
1117
                        array("ACH"))
1118
              && ((substr($original, $current + 2, 1) != 'I')
1119
                  && ((substr($original, $current + 2, 1) != 'E')
1120
                      || string_at($original, $current - 2, 6,
1121
                                array("BACHER", "MACHER"))))) {
1122

1123
            $primary   .= 'K';
1124
            $secondary .= 'K';
1125
            $current += 2;
1126
            break;
1127
          }
1128

1129
          // special case 'caesar'
1130
          if (($current == 0)
1131
              && string_at($original, $current, 6,
1132
                         array("CAESAR"))) {
1133
            $primary   .= 'S';
1134
            $secondary .= 'S';
1135
            $current += 2;
1136
            break;
1137
          }
1138

1139
          // italian 'chianti'
1140
          if (string_at($original, $current, 4,
1141
                         array("CHIA"))) {
1142
            $primary   .= 'K';
1143
            $secondary .= 'K';
1144
            $current += 2;
1145
            break;
1146
          }
1147

1148
          if (string_at($original, $current, 2,
1149
                         array("CH"))) {
1150

1151
            // find 'michael'
1152
            if (($current > 0)
1153
                && string_at($original, $current, 4,
1154
                         array("CHAE"))) {
1155
              $primary   .= 'K';
1156
              $secondary .= 'X';
1157
              $current += 2;
1158
              break;
1159
            }
1160

1161
            // greek roots e.g. 'chemistry', 'chorus'
1162
            if (($current == 0)
1163
                && (string_at($original, $current + 1, 5,
1164
                         array("HARAC", "HARIS"))
1165
                    || string_at($original, $current + 1, 3,
1166
                              array("HOR", "HYM", "HIA", "HEM")))
1167
                && !string_at($original, 0, 5, array("CHORE"))) {
1168
              $primary   .= 'K';
1169
              $secondary .= 'K';
1170
              $current += 2;
1171
              break;
1172
            }
1173

1174
            // germanic, greek, or otherwise 'ch' for 'kh' sound
1175
            if ((string_at($original, 0, 4, array("VAN ", "VON "))
1176
                 || string_at($original, 0, 3, array("SCH")))
1177
                // 'architect' but not 'arch', 'orchestra', 'orchid'
1178
                || string_at($original, $current - 2, 6,
1179
                         array("ORCHES", "ARCHIT", "ORCHID"))
1180
                || string_at($original, $current + 2, 1,
1181
                         array("T", "S"))
1182
                || ((string_at($original, $current - 1, 1,
1183
                         array("A","O","U","E"))
1184
                     || ($current == 0))
1185
                    // e.g. 'wachtler', 'wechsler', but not 'tichner'
1186
                    && string_at($original, $current + 2, 1,
1187
                         array("L","R","N","M","B","H","F","V","W"," ")))) {
1188
              $primary   .= 'K';
1189
              $secondary .= 'K';
1190
            } else {
1191
              if ($current > 0) {
1192
                if (string_at($original, 0, 2, array("MC"))) {
1193
                  // e.g. 'McHugh'
1194
                  $primary   .= 'K';
1195
                  $secondary .= 'K';
1196
                } else {
1197
                  $primary   .= 'X';
1198
                  $secondary .= 'K';
1199
                }
1200
              } else {
1201
                $primary   .= 'X';
1202
                $secondary .= 'X';
1203
              }
1204
            }
1205
            $current += 2;
1206
            break;
1207
          }
1208

1209
          // e.g. 'czerny'
1210
          if (string_at($original, $current, 2, array("CZ"))
1211
              && !string_at($original, $current -2, 4,
1212
                         array("WICZ"))) {
1213
            $primary   .= 'S';
1214
            $secondary .= 'X';
1215
            $current += 2;
1216
            break;
1217
          }
1218

1219
          // e.g. 'focaccia'
1220
          if (string_at($original, $current + 1, 3,
1221
                     array("CIA"))) {
1222
            $primary   .= 'X';
1223
            $secondary .= 'X';
1224
            $current += 3;
1225
            break;
1226
          }
1227

1228
          // double 'C', but not 'McClellan'
1229
          if (string_at($original, $current, 2, array("CC"))
1230
              && !(($current == 1)
1231
                   && (substr($original, 0, 1) == 'M'))) {
1232
            // 'bellocchio' but not 'bacchus'
1233
            if (string_at($original, $current + 2, 1,
1234
                       array("I","E","H"))
1235
                && !string_at($original, $current + 2, 2,
1236
                          array("HU"))) {
1237
              // 'accident', 'accede', 'succeed'
1238
              if ((($current == 1)
1239
                   && (substr($original, $current - 1, 1) == 'A'))
1240
                  || string_at($original, $current - 1, 5,
1241
                            array("UCCEE", "UCCES"))) {
1242
                $primary   .= "KS";
1243
                $secondary .= "KS";
1244
                // 'bacci', 'bertucci', other italian
1245
              } else {
1246
                $primary   .= "X";
1247
                $secondary .= "X";
1248
              }
1249
              $current += 3;
1250
              break;
1251
            } else {
1252
              // Pierce's rule
1253
              $primary   .= "K";
1254
              $secondary .= "K";
1255
              $current += 2;
1256
              break;
1257
            }
1258
          }
1259

1260
          if (string_at($original, $current, 2,
1261
                     array("CK","CG","CQ"))) {
1262
            $primary   .= "K";
1263
            $secondary .= "K";
1264
            $current += 2;
1265
            break;
1266
          }
1267

1268
          if (string_at($original, $current, 2,
1269
                     array("CI","CE","CY"))) {
1270
            // italian vs. english
1271
            if (string_at($original, $current, 3,
1272
                       array("CIO","CIE","CIA"))) {
1273
              $primary   .= "S";
1274
              $secondary .= "X";
1275
            } else {
1276
              $primary   .= "S";
1277
              $secondary .= "S";
1278
            }
1279
            $current += 2;
1280
            break;
1281
          }
1282

1283
          // else
1284
          $primary   .= "K";
1285
          $secondary .= "K";
1286

1287
          // name sent in 'mac caffrey', 'mac gregor'
1288
          if (string_at($original, $current + 1, 2,
1289
                     array(" C"," Q"," G"))) {
1290
            $current += 3;
1291
          } else {
1292
            if (string_at($original, $current + 1, 1,
1293
                       array("C","K","Q"))
1294
                && !string_at($original, $current + 1, 2,
1295
                           array("CE","CI"))) {
1296
              $current += 2;
1297
            } else {
1298
              $current += 1;
1299
            }
1300
          }
1301
          break;
1302

1303
        case 'D':
1304
          if (string_at($original, $current, 2,
1305
                     array("DG"))) {
1306
            if (string_at($original, $current + 2, 1,
1307
                       array("I","E","Y"))) {
1308
              // e.g. 'edge'
1309
              $primary   .= "J";
1310
              $secondary .= "J";
1311
              $current += 3;
1312
              break;
1313
            } else {
1314
              // e.g. 'edgar'
1315
              $primary   .= "TK";
1316
              $secondary .= "TK";
1317
              $current += 2;
1318
              break;
1319
            }
1320
          }
1321

1322
          if (string_at($original, $current, 2,
1323
                     array("DT","DD"))) {
1324
            $primary   .= "T";
1325
            $secondary .= "T";
1326
            $current += 2;
1327
            break;
1328
          }
1329

1330
          // else
1331
          $primary   .= "T";
1332
          $secondary .= "T";
1333
          $current += 1;
1334
          break;
1335

1336
        case 'F':
1337
          if (substr($original, $current + 1, 1) == 'F')
1338
            $current += 2;
1339
          else
1340
            $current += 1;
1341
          $primary   .= "F";
1342
          $secondary .= "F";
1343
          break;
1344

1345
        case 'G':
1346
          if (substr($original, $current + 1, 1) == 'H') {
1347
            if (($current > 0)
1348
                && !is_vowel($original, $current - 1)) {
1349
              $primary   .= "K";
1350
              $secondary .= "K";
1351
              $current += 2;
1352
              break;
1353
            }
1354

1355
            if ($current < 3) {
1356
              // 'ghislane', 'ghiradelli'
1357
              if ($current == 0) {
1358
                if (substr($original, $current + 2, 1) == 'I') {
1359
                  $primary   .= "J";
1360
                  $secondary .= "J";
1361
                } else {
1362
                  $primary   .= "K";
1363
                  $secondary .= "K";
1364
                }
1365
                $current += 2;
1366
                break;
1367
              }
1368
            }
1369

1370
            // Parker's rule (with some further refinements) - e.g. 'hugh'
1371
            if ((($current > 1)
1372
                 && string_at($original, $current - 2, 1,
1373
                           array("B","H","D")))
1374
                // e.g. 'bough'
1375
                || (($current > 2)
1376
                    &&  string_at($original, $current - 3, 1,
1377
                               array("B","H","D")))
1378
                // e.g. 'broughton'
1379
                || (($current > 3)
1380
                    && string_at($original, $current - 4, 1,
1381
                               array("B","H")))) {
1382
              $current += 2;
1383
              break;
1384
            } else {
1385
              // e.g. 'laugh', 'McLaughlin', 'cough', 'gough', 'rough', 'tough'
1386
              if (($current > 2)
1387
                  && (substr($original, $current - 1, 1) == 'U')
1388
                  && string_at($original, $current - 3, 1,
1389
                            array("C","G","L","R","T"))) {
1390
                $primary   .= "F";
1391
                $secondary .= "F";
1392
              } elseif (($current > 0)
1393
                        && substr($original, $current - 1, 1) != 'I') {
1394
                $primary   .= "K";
1395
                $secondary .= "K";
1396
              }
1397
              $current += 2;
1398
              break;
1399
            }
1400
          }
1401

1402
          if (substr($original, $current + 1, 1) == 'N') {
1403
            if (($current == 1) && is_vowel($original, 0)
1404
                && !Slavo_Germanic($original)) {
1405
              $primary   .= "KN";
1406
              $secondary .= "N";
1407
            } else {
1408
              // not e.g. 'cagney'
1409
              if (!string_at($original, $current + 2, 2,
1410
                          array("EY"))
1411
                  && (substr($original, $current + 1) != "Y")
1412
                  && !Slavo_Germanic($original)) {
1413
                 $primary   .= "N";
1414
                 $secondary .= "KN";
1415
              } else {
1416
                 $primary   .= "KN";
1417
                 $secondary .= "KN";
1418
              }
1419
            }
1420
            $current += 2;
1421
            break;
1422
          }
1423

1424
          // 'tagliaro'
1425
          if (string_at($original, $current + 1, 2,
1426
                     array("LI"))
1427
              && !Slavo_Germanic($original)) {
1428
            $primary   .= "KL";
1429
            $secondary .= "L";
1430
            $current += 2;
1431
            break;
1432
          }
1433

1434
          // -ges-, -gep-, -gel- at beginning
1435
          if (($current == 0)
1436
              && ((substr($original, $current + 1, 1) == 'Y')
1437
                  || string_at($original, $current + 1, 2,
1438
                            array("ES","EP","EB","EL","EY","IB","IL","IN","IE",
1439
                                  "EI","ER")))) {
1440
            $primary   .= "K";
1441
            $secondary .= "J";
1442
            $current += 2;
1443
            break;
1444
          }
1445

1446
          // -ger-, -gy-
1447
          if ((string_at($original, $current + 1, 2,
1448
                      array("ER"))
1449
               || (substr($original, $current + 1, 1) == 'Y'))
1450
              && !string_at($original, 0, 6,
1451
                         array("DANGER","RANGER","MANGER"))
1452
              && !string_at($original, $current -1, 1,
1453
                         array("E", "I"))
1454
              && !string_at($original, $current -1, 3,
1455
                         array("RGY","OGY"))) {
1456
            $primary   .= "K";
1457
            $secondary .= "J";
1458
            $current += 2;
1459
            break;
1460
          }
1461

1462
          // italian e.g. 'biaggi'
1463
          if (string_at($original, $current + 1, 1,
1464
                     array("E","I","Y"))
1465
              || string_at($original, $current -1, 4,
1466
                        array("AGGI","OGGI"))) {
1467
            // obvious germanic
1468
            if ((string_at($original, 0, 4, array("VAN ", "VON "))
1469
                 || string_at($original, 0, 3, array("SCH")))
1470
                || string_at($original, $current + 1, 2,
1471
                          array("ET"))) {
1472
              $primary   .= "K";
1473
              $secondary .= "K";
1474
            } else {
1475
              // always soft if french ending
1476
              if (string_at($original, $current + 1, 4,
1477
                         array("IER "))) {
1478
                $primary   .= "J";
1479
                $secondary .= "J";
1480
              } else {
1481
                $primary   .= "J";
1482
                $secondary .= "K";
1483
              }
1484
            }
1485
            $current += 2;
1486
            break;
1487
          }
1488

1489
          if (substr($original, $current +1, 1) == 'G')
1490
            $current += 2;
1491
          else
1492
            $current += 1;
1493

1494
          $primary   .= 'K';
1495
          $secondary .= 'K';
1496
          break;
1497

1498
        case 'H':
1499
          // only keep if first & before vowel or btw. 2 vowels
1500
          if ((($current == 0) ||
1501
               is_vowel($original, $current - 1))
1502
              && is_vowel($original, $current + 1)) {
1503
            $primary   .= 'H';
1504
            $secondary .= 'H';
1505
            $current += 2;
1506
          } else
1507
            $current += 1;
1508
          break;
1509

1510
        case 'J':
1511
          // obvious spanish, 'jose', 'san jacinto'
1512
          if (string_at($original, $current, 4,
1513
                     array("JOSE"))
1514
              || string_at($original, 0, 4, array("SAN "))) {
1515
            if ((($current == 0)
1516
                 && (substr($original, $current + 4, 1) == ' '))
1517
                || string_at($original, 0, 4, array("SAN "))) {
1518
              $primary   .= 'H';
1519
              $secondary .= 'H';
1520
            } else {
1521
              $primary   .= "J";
1522
              $secondary .= 'H';
1523
            }
1524
            $current += 1;
1525
            break;
1526
          }
1527

1528
          if (($current == 0)
1529
              && !string_at($original, $current, 4,
1530
                     array("JOSE"))) {
1531
            $primary   .= 'J';  // Yankelovich/Jankelowicz
1532
            $secondary .= 'A';
1533
          } else {
1534
            // spanish pron. of .e.g. 'bajador'
1535
            if (is_vowel($original, $current - 1)
1536
                && !Slavo_Germanic($original)
1537
                && ((substr($original, $current + 1, 1) == 'A')
1538
                    || (substr($original, $current + 1, 1) == 'O'))) {
1539
              $primary   .= "J";
1540
              $secondary .= "H";
1541
            } else {
1542
              if ($current == $last) {
1543
                $primary   .= "J";
1544
                $secondary .= "";
1545
              } else {
1546
                if (!string_at($original, $current + 1, 1,
1547
                            array("L","T","K","S","N","M","B","Z"))
1548
                    && !string_at($original, $current - 1, 1,
1549
                               array("S","K","L"))) {
1550
                  $primary   .= "J";
1551
                  $secondary .= "J";
1552
                }
1553
              }
1554
            }
1555
          }
1556

1557
          if (substr($original, $current + 1, 1) == 'J') // it could happen
1558
            $current += 2;
1559
          else
1560
            $current += 1;
1561
          break;
1562

1563
        case 'K':
1564
          if (substr($original, $current + 1, 1) == 'K')
1565
            $current += 2;
1566
          else
1567
            $current += 1;
1568
          $primary   .= "K";
1569
          $secondary .= "K";
1570
          break;
1571

1572
        case 'L':
1573
          if (substr($original, $current + 1, 1) == 'L') {
1574
            // spanish e.g. 'cabrillo', 'gallegos'
1575
            if ((($current == ($length - 3))
1576
                 && string_at($original, $current - 1, 4,
1577
                           array("ILLO","ILLA","ALLE")))
1578
                || ((string_at($original, $last-1, 2,
1579
                            array("AS","OS"))
1580
                  || string_at($original, $last, 1,
1581
                            array("A","O")))
1582
                 && string_at($original, $current - 1, 4,
1583
                           array("ALLE")))) {
1584
              $primary   .= "L";
1585
              $secondary .= "";
1586
              $current += 2;
1587
              break;
1588
            }
1589
            $current += 2;
1590
          } else
1591
            $current += 1;
1592
          $primary   .= "L";
1593
          $secondary .= "L";
1594
          break;
1595

1596
        case 'M':
1597
          if ((string_at($original, $current - 1, 3,
1598
                     array("UMB"))
1599
               && ((($current + 1) == $last)
1600
                   || string_at($original, $current + 2, 2,
1601
                            array("ER"))))
1602
              // 'dumb', 'thumb'
1603
              || (substr($original, $current + 1, 1) == 'M')) {
1604
              $current += 2;
1605
          } else {
1606
              $current += 1;
1607
          }
1608
          $primary   .= "M";
1609
          $secondary .= "M";
1610
          break;
1611

1612
        case 'N':
1613
          if (substr($original, $current + 1, 1) == 'N')
1614
            $current += 2;
1615
          else
1616
            $current += 1;
1617
          $primary   .= "N";
1618
          $secondary .= "N";
1619
          break;
1620

1621
        case 'Ñ':
1622
          $current += 1;
1623
          $primary   .= "N";
1624
          $secondary .= "N";
1625
          break;
1626

1627
        case 'P':
1628
          if (substr($original, $current + 1, 1) == 'H') {
1629
            $current += 2;
1630
            $primary   .= "F";
1631
            $secondary .= "F";
1632
            break;
1633
          }
1634

1635
          // also account for "campbell" and "raspberry"
1636
          if (string_at($original, $current + 1, 1,
1637
                     array("P","B")))
1638
            $current += 2;
1639
          else
1640
            $current += 1;
1641
          $primary   .= "P";
1642
          $secondary .= "P";
1643
          break;
1644

1645
        case 'Q':
1646
          if (substr($original, $current + 1, 1) == 'Q')
1647
            $current += 2;
1648
          else
1649
            $current += 1;
1650
          $primary   .= "K";
1651
          $secondary .= "K";
1652
          break;
1653

1654
        case 'R':
1655
          // french e.g. 'rogier', but exclude 'hochmeier'
1656
          if (($current == $last)
1657
              && !Slavo_Germanic($original)
1658
              && string_at($original, $current - 2, 2,
1659
                        array("IE"))
1660
              && !string_at($original, $current - 4, 2,
1661
                         array("ME","MA"))) {
1662
            $primary   .= "";
1663
            $secondary .= "R";
1664
          } else {
1665
            $primary   .= "R";
1666
            $secondary .= "R";
1667
          }
1668
          if (substr($original, $current + 1, 1) == 'R')
1669
            $current += 2;
1670
          else
1671
            $current += 1;
1672
          break;
1673

1674
        case 'S':
1675
          // special cases 'island', 'isle', 'carlisle', 'carlysle'
1676
          if (string_at($original, $current - 1, 3,
1677
                     array("ISL","YSL"))) {
1678
            $current += 1;
1679
            break;
1680
          }
1681

1682
          // special case 'sugar-'
1683
          if (($current == 0)
1684
              && string_at($original, $current, 5,
1685
                        array("SUGAR"))) {
1686
            $primary   .= "X";
1687
            $secondary .= "S";
1688
            $current += 1;
1689
            break;
1690
          }
1691

1692
          if (string_at($original, $current, 2,
1693
                     array("SH"))) {
1694
            // germanic
1695
            if (string_at($original, $current + 1, 4,
1696
                       array("HEIM","HOEK","HOLM","HOLZ"))) {
1697
              $primary   .= "S";
1698
              $secondary .= "S";
1699
            } else {
1700
              $primary   .= "X";
1701
              $secondary .= "X";
1702
            }
1703
            $current += 2;
1704
            break;
1705
          }
1706

1707
          // italian & armenian
1708
          if (string_at($original, $current, 3,
1709
                     array("SIO","SIA"))
1710
              || string_at($original, $current, 4,
1711
                        array("SIAN"))) {
1712
            if (!Slavo_Germanic($original)) {
1713
              $primary   .= "S";
1714
              $secondary .= "X";
1715
            } else {
1716
              $primary   .= "S";
1717
              $secondary .= "S";
1718
            }
1719
            $current += 3;
1720
            break;
1721
          }
1722

1723
          // german & anglicisations, e.g. 'smith' match 'schmidt', 'snider' match 'schneider'
1724
          // also, -sz- in slavic language altho in hungarian it is pronounced 's'
1725
          if ((($current == 0)
1726
               && string_at($original, $current + 1, 1,
1727
                         array("M","N","L","W")))
1728
              || string_at($original, $current + 1, 1,
1729
                        array("Z"))) {
1730
            $primary   .= "S";
1731
            $secondary .= "X";
1732
            if (string_at($original, $current + 1, 1,
1733
                        array("Z")))
1734
              $current += 2;
1735
            else
1736
              $current += 1;
1737
            break;
1738
          }
1739

1740
          if (string_at($original, $current, 2,
1741
                     array("SC"))) {
1742
            // Schlesinger's rule
1743
            if (substr($original, $current + 2, 1) == 'H')
1744
              // dutch origin, e.g. 'school', 'schooner'
1745
              if (string_at($original, $current + 3, 2,
1746
                         array("OO","ER","EN","UY","ED","EM"))) {
1747
                // 'schermerhorn', 'schenker'
1748
                if (string_at($original, $current + 3, 2,
1749
                           array("ER","EN"))) {
1750
                  $primary   .= "X";
1751
                  $secondary .= "SK";
1752
                } else {
1753
                  $primary   .= "SK";
1754
                  $secondary .= "SK";
1755
                }
1756
                $current += 3;
1757
                break;
1758
              } else {
1759
                if (($current == 0)
1760
                    && !is_vowel($original, 3)
1761
                    && (substr($original, $current + 3, 1) != 'W')) {
1762
                  $primary   .= "X";
1763
                  $secondary .= "S";
1764
                } else {
1765
                  $primary   .= "X";
1766
                  $secondary .= "X";
1767
                }
1768
                $current += 3;
1769
                break;
1770
              }
1771

1772
              if (string_at($original, $current + 2, 1,
1773
                         array("I","E","Y"))) {
1774
                $primary   .= "S";
1775
                $secondary .= "S";
1776
                $current += 3;
1777
                break;
1778
              }
1779

1780
            // else
1781
            $primary   .= "SK";
1782
            $secondary .= "SK";
1783
            $current += 3;
1784
            break;
1785
          }
1786

1787
          // french e.g. 'resnais', 'artois'
1788
          if (($current == $last)
1789
              && string_at($original, $current - 2, 2,
1790
                        array("AI","OI"))) {
1791
            $primary   .= "";
1792
            $secondary .= "S";
1793
          } else {
1794
            $primary   .= "S";
1795
            $secondary .= "S";
1796
          }
1797

1798
          if (string_at($original, $current + 1, 1,
1799
                     array("S","Z")))
1800
            $current += 2;
1801
          else
1802
            $current += 1;
1803
          break;
1804

1805
        case 'T':
1806
          if (string_at($original, $current, 4,
1807
                     array("TION"))) {
1808
            $primary   .= "X";
1809
            $secondary .= "X";
1810
            $current += 3;
1811
            break;
1812
          }
1813

1814
          if (string_at($original, $current, 3,
1815
                     array("TIA","TCH"))) {
1816
            $primary   .= "X";
1817
            $secondary .= "X";
1818
            $current += 3;
1819
            break;
1820
          }
1821

1822
          if (string_at($original, $current, 2,
1823
                     array("TH"))
1824
              || string_at($original, $current, 3,
1825
                            array("TTH"))) {
1826
            // special case 'thomas', 'thames' or germanic
1827
            if (string_at($original, $current + 2, 2,
1828
                       array("OM","AM"))
1829
                || string_at($original, 0, 4, array("VAN ","VON "))
1830
                || string_at($original, 0, 3, array("SCH"))) {
1831
              $primary   .= "T";
1832
              $secondary .= "T";
1833
            } else {
1834
              $primary   .= "0";
1835
              $secondary .= "T";
1836
            }
1837
            $current += 2;
1838
            break;
1839
          }
1840

1841
          if (string_at($original, $current + 1, 1,
1842
                     array("T","D")))
1843
            $current += 2;
1844
          else
1845
            $current += 1;
1846
          $primary   .= "T";
1847
          $secondary .= "T";
1848
          break;
1849

1850
        case 'V':
1851
          if (substr($original, $current + 1, 1) == 'V')
1852
            $current += 2;
1853
          else
1854
            $current += 1;
1855
          $primary   .= "F";
1856
          $secondary .= "F";
1857
          break;
1858

1859
        case 'W':
1860
          // can also be in middle of word
1861
          if (string_at($original, $current, 2, array("WR"))) {
1862
            $primary   .= "R";
1863
            $secondary .= "R";
1864
            $current += 2;
1865
            break;
1866
          }
1867

1868
          if (($current == 0)
1869
              && (is_vowel($original, $current + 1)
1870
                  || string_at($original, $current, 2,
1871
                            array("WH")))) {
1872
            // Wasserman should match Vasserman
1873
            if (is_vowel($original, $current + 1)) {
1874
              $primary   .= "A";
1875
              $secondary .= "F";
1876
            } else {
1877
              // need Uomo to match Womo
1878
              $primary   .= "A";
1879
              $secondary .= "A";
1880
            }
1881
          }
1882

1883
          // Arnow should match Arnoff
1884
          if ((($current == $last)
1885
                && is_vowel($original, $current - 1))
1886
              || string_at($original, $current - 1, 5,
1887
                        array("EWSKI","EWSKY","OWSKI","OWSKY"))
1888
              || string_at($original, 0, 3, array("SCH"))) {
1889
            $primary   .= "";
1890
            $secondary .= "F";
1891
            $current += 1;
1892
            break;
1893
          }
1894

1895
          // polish e.g. 'filipowicz'
1896
          if (string_at($original, $current, 4,
1897
                     array("WICZ","WITZ"))) {
1898
            $primary   .= "TS";
1899
            $secondary .= "FX";
1900
            $current += 4;
1901
            break;
1902
          }
1903

1904
          // else skip it
1905
          $current += 1;
1906
          break;
1907

1908
        case 'X':
1909
          // french e.g. breaux
1910
          if (!(($current == $last)
1911
                && (string_at($original, $current - 3, 3,
1912
                           array("IAU", "EAU"))
1913
                 || string_at($original, $current - 2, 2,
1914
                           array("AU", "OU"))))) {
1915
            $primary   .= "KS";
1916
            $secondary .= "KS";
1917
          }
1918

1919
          if (string_at($original, $current + 1, 1,
1920
                     array("C","X")))
1921
            $current += 2;
1922
          else
1923
            $current += 1;
1924
          break;
1925

1926
        case 'Z':
1927
          // chinese pinyin e.g. 'zhao'
1928
          if (substr($original, $current + 1, 1) == "H") {
1929
            $primary   .= "J";
1930
            $secondary .= "J";
1931
            $current += 2;
1932
            break;
1933
          } elseif (string_at($original, $current + 1, 2,
1934
                           array("ZO", "ZI", "ZA"))
1935
                    || (Slavo_Germanic($original)
1936
                        && (($current > 0)
1937
                            && substr($original, $current - 1, 1) != 'T'))) {
1938
            $primary   .= "S";
1939
            $secondary .= "TS";
1940
          } else {
1941
            $primary   .= "S";
1942
            $secondary .= "S";
1943
          }
1944

1945
          if (substr($original, $current + 1, 1) == 'Z')
1946
            $current += 2;
1947
          else
1948
            $current += 1;
1949
          break;
1950

1951
        default:
1952
          $current += 1;
1953

1954
      } // end switch
1955

1956
    // printf("<br>ORIGINAL:    '%s'\n", $original);
1957
    // printf("<br>current:    '%s'\n", $current);
1958
    // printf("<br>  PRIMARY:   '%s'\n", $primary);
1959
    // printf("<br>  SECONDARY: '%s'\n", $secondary);
1960

1961
    } // end while
1962

1963
    $primary   = substr($primary,   0, 4);
1964
    $secondary = substr($secondary, 0, 4);
1965

1966
    if( $primary == $secondary )
1967
    {
1968
      $secondary = NULL ;
1969
    }
1970

1971
    $result["primary"] = $primary ;
1972
    $result["secondary"] = $secondary ;
1973

1974
    return $result ;
1975

1976
  } // end of function MetaPhone
1977

1978

1979
/*=================================================================*\
1980
  # Name:    string_at($string, $start, $length, $list)
1981
  # Purpose:  Helper function for double_metaphone( )
1982
  # Return:    Bool
1983
\*=================================================================*/
1984

1985
function string_at($string, $start, $length, $list)
1986
{
1987
    if (($start <0) || ($start >= strlen($string)))
1988
      return 0;
1989

1990
    for ($i=0; $i<count($list); $i++) {
1991
      if ($list[$i] == substr($string, $start, $length))
1992
        return 1;
1993
    }
1994
    return 0;
1995
  }
1996

1997

1998
/*=================================================================*\
1999
  # Name:    is_vowel($string, $pos)
2000
  # Purpose:  Helper function for double_metaphone( )
2001
  # Return:    Bool
2002
\*=================================================================*/
2003

2004
function is_vowel($string, $pos)
2005
{
2006
    // ereg() was removed in PHP 7; preg_match() is the supported equivalent
    return preg_match('/[AEIOUY]/', (string) substr($string, $pos, 1)) === 1;
2007
}
2008

2009

2010
/*=================================================================*\
2011
  # Name:    Slavo_Germanic($string)
2012
  # Purpose:  Helper function for double_metaphone( )
2013
  # Return:    Bool
2014
\*=================================================================*/
2015

2016
function Slavo_Germanic($string)
2017
{
2018
    // ereg() was removed in PHP 7; preg_match() is the supported equivalent
    return preg_match('/W|K|CZ|WITZ/', $string) === 1;
2019
}
2020

2021
?>
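
The listing above returns an array with a "primary" code and a "secondary" code, where "secondary" is NULL when both codes are identical. As a rough sketch of how two such results might be compared, the helper below returns 1 for a primary match, 2 for a match that involves a secondary code, and 0 otherwise. The name metaphone_rank is hypothetical and not part of the original source; the sketch only assumes the array shape shown in the listing.

```php
<?php
// Hypothetical helper, not part of the original spell checker.
// Each argument is an array shaped like the listing's return value:
//   array("primary" => string, "secondary" => string or null)
function metaphone_rank(array $a, array $b)
{
    // Strongest match: both primary codes agree.
    if ($a["primary"] !== "" && $a["primary"] === $b["primary"]) {
        return 1;
    }

    // Weaker match: any pairing that involves a secondary code.
    $pairs = array(
        array($a["primary"],   $b["secondary"]),
        array($a["secondary"], $b["primary"]),
        array($a["secondary"], $b["secondary"]),
    );
    foreach ($pairs as $pair) {
        // Skip missing (null) or empty codes so they never count as a match.
        if ($pair[0] !== null && $pair[0] !== "" && $pair[0] === $pair[1]) {
            return 2;
        }
    }

    return 0; // no phonetic match
}
```

Following the S, M, TH and SC rules in the listing by hand, 'smith' should encode to SM0/XMT and 'schmidt' to XMT/SMT, so passing those two arrays to the sketch yields 2: the words only meet through a secondary code.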

Since I used a DoubleMetaphone function and a word list that I did not
create, and that come with copyrights of their own, I must include those
copyrights with my software.  You are free to modify/redistribute this
software and all its contents as long as you adhere to those copyrights.

- Alan Nouri

=========================================================================

http://swoodbridge.com/DoubleMetaPhone/

  VERSION DoubleMetaphone Class 1.01

  DESCRIPTION

    This class implements a "sounds like" algorithm developed
    by Lawrence Philips which he published in the June, 2000 issue
    of C/C++ Users Journal.  Double Metaphone is an improved
    version of Philips' original Metaphone algorithm.

  COPYRIGHT

    Copyright 2001, Stephen Woodbridge <[email protected]>
    All rights reserved.

    http://swoodbridge.com/DoubleMetaPhone/

    This PHP translation is based heavily on the C implementation
    by Maurice Aubrey <[email protected]>, which in turn
    is based heavily on the C++ implementation by
    Lawrence Philips and incorporates several bug fixes courtesy
    of Kevin Atkinson <[email protected]>.

    This module is free software; you may redistribute it and/or
    modify it under the same terms as Perl itself.

  CONTRIBUTIONS

    17-May-2002 Geoff Caplan  http://www.advantae.com
      Bug fix: added code to return class object which I forgot to do
      Created a functional callable version instead of the class version
      which is faster if you are calling this a lot.

=========================================================================

http://wordlist.sourceforge.net/

  Copyright 2000-2004 by Kevin Atkinson

  Permission to use, copy, modify, distribute and sell these word
  lists, the associated scripts, the output created from the scripts,
  and its documentation for any purpose is hereby granted without fee,
  provided that the above copyright notice appears in all copies and
  that both that copyright notice and this permission notice appear in
  supporting documentation. Kevin Atkinson makes no representations
  about the suitability of this array for any purpose. It is provided
  "as is" without express or implied warranty.

=========================================================================

Also included is a file SCOWL-LICENSE, which contains the individual
licenses for the word lists that are included with SCOWL.  I only used
wordlists that are in the public domain, I think, but since I wrote this
so long ago I'm being safe and including all these copyrights.  If for
some reason you want to re-distribute this spell checker without those
copyrights, feel free to verify that none of the words in the database
are in those wordlists.

Spell Checking Oriented Word Lists (SCOWL)
Revision 6
August 10, 2004
by Kevin Atkinson

The SCOWL is a collection of word lists split up in various sizes, and
other categories, intended to be suitable for use in spell checkers.
However, I am sure it will have numerous other uses as well.

The latest version can be found at http://wordlist.sourceforge.net/

The directory final/ contains the actual word lists broken up into
various sizes and categories.  The r/ directory contains Readmes from
the various sources used to create this package.

The other directories contain the necessary information to recreate the
word lists from the raw data.  Unless you are interested in improving the
word lists you should not need to worry about what's here.  See the
section on recreating the word lists for more information on what's
there.

Except for the special word lists the files follow the following
naming convention:
  <spelling category>-<classification>.<size>
Where the spelling category is one of
  english, american, british, british_z, canadian,
  variant_0, variant_1, variant_2
Classification is one of
  abbreviations, contractions, proper-names, upper, words
And size is one of
  10, 20, 35 (small), 40, 50 (medium), 55, 60, 70 (large),
  80 (huge), 95 (insane)
The special word lists are in the following format:
  special-<description>.<size>
Where description is one of:
  roman-numerals, hacker
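
As an illustration of the naming convention above, the following sketch splits a file name into its parts. The function name parse_scowl_name and the keys of the returned array are made up for this example; only the two name patterns come from the README text.

```php
<?php
// Hypothetical helper: split a SCOWL file name into its parts.
// Returns null when the name follows neither convention.
function parse_scowl_name($name)
{
    // special-<description>.<size>
    // The description is reported under "classification" so both
    // kinds of file share one array shape.
    if (preg_match('/^special-([a-z-]+)\.(\d+)$/', $name, $m)) {
        return array(
            "category"       => "special",
            "classification" => $m[1],
            "size"           => (int) $m[2],
        );
    }

    // <spelling category>-<classification>.<size>
    // Categories may contain "_" or digits (british_z, variant_0);
    // classifications may contain "-" (proper-names).
    if (preg_match('/^([a-z0-9_]+)-([a-z-]+)\.(\d+)$/', $name, $m)) {
        return array(
            "category"       => $m[1],
            "classification" => $m[2],
            "size"           => (int) $m[3],
        );
    }

    return null;
}
```

For example, parse_scowl_name("british_z-proper-names.50") gives the category "british_z", the classification "proper-names" and the size 50. The sketch does not check that the parts are among the values listed above.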

When combining the word lists the "english" spelling category should
be used as well as one of "american", "british", "british_z" (british
with ize spelling), or "canadian".  Great care has been taken so that
only one spelling for any particular word is included in the main
list.  When two variants were considered equal I randomly picked one
for inclusion in the main word list.  Unfortunately this means that my
choice in how to spell a word may not match your choice.  If this is
the case you can try including the "variant_0" spelling category which
includes most variants which are considered almost equal.  The
"variant_1" spelling category includes variants which are also
generally considered acceptable, and "variant_2" contains variants
which are seldom used.
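
The combining rule above can be sketched as a function that lists the file names to merge. This is only an illustration: the function name and parameters are hypothetical, it covers the "words" classification only, and it assumes that each size file holds just the words new at that size, so every size up to the chosen maximum is included. Whether a file exists for every category and size is not checked.

```php
<?php
// Hypothetical helper: list the "words" files to merge for one dialect.
// $variants holds optional variant levels (0, 1, 2) to add on top.
function scowl_files_to_combine($dialect, $max_size, array $variants = array())
{
    // Sizes and dialects as named in the README.
    $sizes    = array(10, 20, 35, 40, 50, 55, 60, 70, 80, 95);
    $dialects = array("american", "british", "british_z", "canadian");

    if (!in_array($dialect, $dialects, true)) {
        return array(); // unknown dialect: nothing to combine
    }

    // "english" is always used, together with exactly one dialect.
    $categories = array("english", $dialect);
    foreach ($variants as $level) {
        $categories[] = "variant_" . (int) $level;
    }

    $files = array();
    foreach ($sizes as $size) {
        if ($size > $max_size) {
            break; // sizes are sorted, so nothing larger is wanted
        }
        foreach ($categories as $category) {
            $files[] = $category . "-words." . $size;
        }
    }
    return $files;
}
```

Calling scowl_files_to_combine("american", 20) returns english-words.10, american-words.10, english-words.20 and american-words.20, in that order.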

The "abbreviations" category includes abbreviations and acronyms which
are not also normal words. The "contractions" category should be self
explanatory. The "upper" category includes upper case words and proper
names which are common enough to appear in a typical dictionary. The
"proper-names" category includes all the additional uppercase words.
Finally, the "words" category contains all the normal English words.

To give you an idea of what the words in the various sizes look like
here is a sample of 25 random words found only in that size:

10: began both buffer cause collection content documenting easiest
    equally examines expecting first firstly hence inclining
    irrelevant justified little logs necessarily ought sadly six
    thing visible

20: chunks commodity contempt contexts cruelty crush dictatorship
    disgusted dose elementary evolved frog god hordes notion overdraft
    overlong overlook phoning poster recordings sand skull substituted
    throughput

35: aliasing blackouts blowout bluntness corroborated derrick
    dredging elopements entrancing excising fellowship flagpole
    germination glimpse gondola guidebook madams minimalism minnows
    partisans petitions shelling swarmed throng welding

40: altercation blender castigation chump coffeehouse determiners
    doggoning exhibitor finders flophouse gazebo lumbering masochism
    mopeds poetically pubic refinance reggae scragglier softhearted
    stubbornness teargassed township underclassman whoosh

50: accumulative adulterant allegorically amorousness astrophysics
    camphor coif dickey elusiveness enviousness fakers fetishistic
    flippantly headsets liefs midyears myna pacification persiflage
    phosphoric pinhole sappy seres unrealistically unworldly

55: becquerel brickie centralist cine conveyancing courgette
    disarmingly garçon gobstopper infilling insipidity
    internationalist kabuki lyrebirds obscurantism rejigged
    revisionist satsuma slapper sozzled sublieutenants teletext vino
    wellness wracking

60: absorber acceptableness adventurousness antifascists arrhythmia
    audiology cartage cruses fontanel forelimbs granter hairlike
    installers jugglery lappets libbers mandrels micrometeorite
    mineshaft reconsecrates saccharides smellable spavined sud timbrel

70: atomisms benedict carven coxa cyanite detraining diazonium
    dogberry dogmatics entresol fatherlessnesses firestone imprecator
    laterality legitimisms maxwell microfloppies nonteaching pelerine
    pentane pestiferousness piscator profascist tusche twirp

80: cotransfers embrangled forkednesses giftwrapped globosity hatpegs
    hepsters hermitess interspecific inurbanities lamiae
    literaehumaniores literatures masulas misbegun plook prerupt
    quaalude rosanilin sabbatism scowder subreptive thumbstalls
    understrata yakows

95: anatropal anientise bakshi brouzes corsie daimiote dhaw dislikened
    ectoretina fortuitisms guardeen hyperlithuria nonanachronistic
    overacceleration pamphletic parma phytolith starvedly
    trophoplasmic ulorrhagia undared undertide unplunderously
    unworkmanly vasoepididymostomy

And here is a rough count on the number of words in the "english"
spelling category for each size:

  Size  Words   Proper Names  Running Total

   10    5,000                    5,000
   20    8,700                   14,000
   35   34,500         200       48,000
   40    6,000         500       55,000
   50   23,200      17,200       95,000
   55    7,500                  103,000
   60   16,000      12,800      132,000
   70   45,100      34,300      211,000
   80  137,000      30,400      379,000
   95  198,000      51,800      628,000

(The "Words" column does not include the proper name count.)

Size 35 is the recommended small size, 50 the medium and 70 the large.
Sizes 70 and below contain words found in most dictionaries while the
80 size contains all the strange and unusual words people like to use
in word games such as Scrabble (TM).  While a lot of the words in
the 80 size are not used very often, they are all generally considered
valid words in the English language.  The 95 size contains just about
every English word in existence and then some.  Many of the words at the
95 level will probably not be considered valid English words by most
people.  I don't recommend anyone use levels above 70 for spell
checking as they contain rarely used words which can hide misspellings
of similar more commonly used words.  For example the word "ort" can
hide a common typo of "or".  No one should need to use a size larger
than 80; the 95 size is labeled insane for a reason.
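The "ort" example can be made concrete with a small Python sketch. The two sets below are toy stand-ins, not real SCOWL data, and the helper name misspelled is made up for illustration.

```python
def misspelled(text, dictionary):
    """Return the words in text that are not in dictionary."""
    return [word for word in text.lower().split() if word not in dictionary]


# Toy stand-ins for real lists; only the rare word "ort" differs.
size_70 = {"tea", "or", "coffee"}
size_80 = size_70 | {"ort"}

sentence = "tea ort coffee"  # the writer meant "tea or coffee"

print(misspelled(sentence, size_70))  # ['ort']  the typo is flagged
print(misspelled(sentence, size_80))  # []       the typo is hidden
```

With the smaller list the typo is reported; with the larger list it passes silently because "ort" is a valid, if rare, word.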

Accents are present on certain words such as café in iso8859-1 format.

CHANGES:

From Revision 5 to 6 (August 10, 2004)

  Updated to version 4.0 of the 12dicts package.

  Included the 3esl, 2of4brif, and 5desk list from the new 12dicts
  package.  The 3esl was included in the 40 size, the 2of4brif in the
  55 size and the 5desk in the 70 size.

  Removed the Ispell word list as it was a source of too many errors.
  This eliminated the 65 size.

  Removed clause 4 from the Ispell copyright with permission of Geoff
  Kuenning.

  Updated to version 4.1 of VarCon.

  Added the "british_z" spelling category which is British using the
  "ize" spelling.

From Revision 4a to 5 (January 3, 2002)

  Added variants that were not really spelling variants (such as
  forwards) back into the main list.

  Fixed a bug which caused variants of words to incorrectly appear in
  the non-variant lists.

  Moved rarely used inflections of a word into higher-numbered lists.

  Added other inflections of a word based on the following criteria:
    If the word is in the base form: only include that word.
    If the word is in a plural form: include the base word and the plural
    If the word is a verb form (other than plural):  include all verb forms
    If the word is an ad* form: include all ad* forms
    If the word is in a possessive form: also include the non-possessive

  Updated to the latest version of many of the source dictionaries.

  Removed the DEC Word List due to the questionable licence and
  because removing it will not seriously decrease the quality of SCOWL
  (there are a few less proper names).

From Revision 4 to 4a (April 4, 2001)

  Reran the scripts on a newer version of AGID (3a) which fixes a bug
  which caused some common words to be improperly marked as variants.

From Revision 3 to 4 (January 28, 2001)

  Split the variant "spelling category" up into 3 different levels.

  Added words in the Ispell word list at the 65 level.

  Other changes due to using more recent versions of various sources,
  including a more accurate version of AGID thanks to the work of
  Alan Beale.

From Revision 2 to 3 (August 18, 2000)

  Renamed special-unix-terms to special-hacker and added a large
  number of commonly used words within the hacker (not cracker)
  community.

  Added a couple more signature words including "newbie".

  Minor changes due to changes in the inflection database.

From Revision 1 to 2 (August 5, 2000)

  Moved the male and female name lists from the mwords package and the
  DEC name lists from the 50 level to the 60 level and moved Alan's
  name list from the 60 level to the 50 level.  Also added the top
  1000 male, female, and last names from the 1990 Census report to the
  50 level.  This reduced the number of names in the 50 level from
  17,000 to 7,000.

  Added a large number of Uppercase words to the 50 level.

  Properly accented the possessive form of some words.

  Minor other changes due to changes in my raw data files which have
  not been released yet.  Email if you are interested in these files.

COPYRIGHT, SOURCES, and CREDITS:

The collective work is Copyright 2000-2004 by Kevin Atkinson as well
as any of the copyrights mentioned below:

  Copyright 2000-2004 by Kevin Atkinson

  Permission to use, copy, modify, distribute and sell these word
  lists, the associated scripts, the output created from the scripts,
  and its documentation for any purpose is hereby granted without fee,
  provided that the above copyright notice appears in all copies and
  that both that copyright notice and this permission notice appear in
  supporting documentation. Kevin Atkinson makes no representations
  about the suitability of this array for any purpose. It is provided
  "as is" without express or implied warranty.

Alan Beale <[email protected]> also deserves special credit as he has,
in addition to providing the 12Dicts package and being a major
contributor to the ENABLE word list, given me an incredible amount of
feedback and created a number of special lists (those found in the
Supplement) in order to help improve the overall quality of SCOWL.

The 10 level includes the 1000 most common English words (according to
the Moby (TM) Words II [MWords] package), a subset of the 1000 most
common words on the Internet (again, according to Moby Words II), and
frequency class 16 from Brian Kelk's "UK English Wordlist
with Frequency Classification".

The MWords package was explicitly placed in the public domain:

    The Moby lexicon project is complete and has
    been place into the public domain. Use, sell,
    rework, excerpt and use in any way on any platform.

    Placing this material on internal or public servers is
    also encouraged. The compiler is not aware of any
    export restrictions so freely distribute world-wide.

    You can verify the public domain status by contacting

    Grady Ward
    3449 Martha Ct.
    Arcata, CA  95521-4884

    [email protected]
    [email protected]

The "UK English Wordlist With Frequency Classification" is also in the
Public Domain:

  Date: Sat, 08 Jul 2000 20:27:21 +0100
  From: Brian Kelk <[email protected]>

  > I was wondering what the copyright status of your "UK English
  > Wordlist With Frequency Classification" word list as it seems to
  > be lacking any copyright notice.

  There were many many sources in total, but any text marked
  "copyright" was avoided. Locally-written documentation was one
  source. An earlier version of the list resided in a filespace called
  PUBLIC on the University mainframe, because it was considered public
  domain.

  Date: Tue, 11 Jul 2000 19:31:34 +0100

  > So are you saying your word list is also in the public domain?

  That is the intention.

The 20 level includes frequency classes 7-15 from Brian's word list.

The 35 level includes frequency classes 2-6 and words appearing in at
least 11 of 12 dictionaries as indicated in the 12Dicts package.  All
words from the 12Dicts package have had likely inflections added via
my inflection database.

The 12Dicts package and Supplement is in the Public Domain.

The WordNet database, which was used in the creation of the
Inflections database, is under the following copyright:

  This software and database is being provided to you, the LICENSEE,
  by Princeton University under the following license.  By obtaining,
  using and/or copying this software and database, you agree that you
  have read, understood, and will comply with these terms and
  conditions.:

  Permission to use, copy, modify and distribute this software and
  database and its documentation for any purpose and without fee or
  royalty is hereby granted, provided that you agree to comply with
  the following copyright notice and statements, including the
  disclaimer, and that the same appear on ALL copies of the software,
  database and documentation, including modifications that you make
  for internal use or for distribution.

  WordNet 1.6 Copyright 1997 by Princeton University.  All rights
  reserved.

  THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
  UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
  IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
  UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
  ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE
  LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY
  THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.

  The name of Princeton University or Princeton may not be used in
  advertising or publicity pertaining to distribution of the software
  and/or database.  Title to copyright in this software, database and
  any associated documentation shall at all times remain with
  Princeton University and LICENSEE agrees to preserve same.

The 40 level includes words from Alan's 3esl list found in version 4.0
of his 12dicts package.  Like his other stuff the 3esl list is also in the
public domain.

The 50 level includes Brian's frequency class 1, words appearing
in at least 5 of 12 of the dictionaries as indicated in the 12Dicts
package, and uppercase words in at least 4 of the previous 12
dictionaries.  A decent number of proper names is also included: The
top 1000 male, female, and last names from the 1990 Census report; a
list of names sent to me by Alan Beale; and a few names that I added
myself.  Finally a small list of abbreviations not commonly found in
other word lists is included.

The name files from the Census report are government documents, which I
don't think can be copyrighted.

The file special-jargon.50 uses common.lst and word.lst from the
"Unofficial Jargon File Word Lists" which is derived from "The Jargon
File".  All of which is in the Public Domain.  This file also contains
a few extra UNIX terms which are found in the file "unix-terms" in the
special/ directory.

The 55 level includes words from Alan's 2of4brif list found in version
4.0 of his 12dicts package.  Like his other stuff the 2of4brif is also
in the public domain.

The 60 level includes Brian's frequency class 0 and all words
appearing in at least 2 of the 12 dictionaries as indicated by the
12Dicts package.  A large number of names are also included: The 4,946
female names and the 3,897 male names from the MWords package.

The 70 level includes the 74,550 common dictionary words and the
21,986 names list from the MWords package.  The common dictionary words,
like those from the 12Dicts package, have had all likely inflections
added.  The 70 level also includes the 5desk list from version 4.0 of
the 12Dicts package, which is in the public domain.

The 80 level includes the ENABLE word list, all the lists in the
ENABLE supplement package (except for ABLE), the "UK Advanced Cryptics
Dictionary" (UKACD), the list of signature words from the YAWL package,
and the 10,196 places list from the MWords package.

The ENABLE package, maintained by M\Cooper <[email protected]>,
is in the Public Domain:

  The ENABLE master word list, WORD.LST, is herewith formally released
  into the Public Domain. Anyone is free to use it or distribute it in
  any manner they see fit. No fee or registration is required for its
  use nor are "contributions" solicited (if you feel you absolutely
  must contribute something for your own peace of mind, the authors of
  the ENABLE list ask that you make a donation on their behalf to your
  favorite charity). This word list is our gift to the Scrabble
  community, as an alternate to "official" word lists. Game designers
  may feel free to incorporate the WORD.LST into their games. Please
  mention the source and credit us as originators of the list. Note
  that if you, as a game designer, use the WORD.LST in your product,
  you may still copyright and protect your product, but you may *not*
  legally copyright or in any way restrict redistribution of the
  WORD.LST portion of your product. This *may* under law restrict your
  rights to restrict your users' rights, but that is only fair.

UKACD, by J Ross Beresford <[email protected]>, is under the
following copyright:

  Copyright (c) J Ross Beresford 1993-1999. All Rights Reserved.

  The following restriction is placed on the use of this publication:
  if The UK Advanced Cryptics Dictionary is used in a software package
  or redistributed in any form, the copyright notice must be
  prominently displayed and the text of this document must be included
  verbatim.

  There are no other restrictions: I would like to see the list
  distributed as widely as possible.

The 95 level includes the 354,984 single words and 256,772 compound
words from the MWords package, ABLE.LST from the ENABLE Supplement,
and some additional words found in my part-of-speech database that
were not found anywhere else.

Accent information was taken from UKACD.

My VARCON package was used to create the American, British, and
Canadian word list.

Since the original word lists used in the VARCON package came
from the Ispell distribution they are under the Ispell copyright:

  Copyright 1993, Geoff Kuenning, Granada Hills, CA
  All rights reserved.

  Redistribution and use in source and binary forms, with or without
  modification, are permitted provided that the following conditions
  are met:

  1. Redistributions of source code must retain the above copyright
     notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright
     notice, this list of conditions and the following disclaimer in the
     documentation and/or other materials provided with the distribution.
  3. All modifications to the source code must be clearly marked as
     such.  Binary redistributions based on modified source code
     must be clearly marked as modified versions in the documentation
     and/or other materials provided with the distribution.
  (clause 4 removed with permission from Geoff Kuenning)
  5. The name of Geoff Kuenning may not be used to endorse or promote
     products derived from this software without specific prior
     written permission.

  THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS
  IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
  FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL GEOFF
  KUENNING OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
  POSSIBILITY OF SUCH DAMAGE.

The variant word lists were created from a list of variants found in
the 12dicts supplement package as well as a list of variants I created
myself.

The Readmes for the various packages used can be found in the
appropriate directory under the r/ directory.

FUTURE PLANS:

There is a very nice frequency analysis of the BNC corpus done by
Adam Kilgarriff.  Unlike Brian's word lists the BNC lists include part
of speech information.  I plan on somehow using these lists as Adam
Kilgarriff has given me the OK to use it in SCOWL.  These lists will
greatly reduce the problem of inflected forms of a word appearing at
different levels due to the part-of-speech information.

I also plan on perhaps putting the data in a database and using SQL
queries to create the wordlists instead of tons of "sort"s, "comm"s,
and Perl scripts.
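To give a feel for what the SQL approach might look like, here is a hedged Python sketch using the standard sqlite3 module. The table layout (word, spelling, category, size), the function names, and the sample rows are all assumptions made for illustration; they are not part of SCOWL.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory database with one row per (word, list) pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE words "
        "(word TEXT, spelling TEXT, category TEXT, size INTEGER)"
    )
    conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?)", rows)
    return conn


def word_list(conn, max_size, spellings):
    """Return the sorted, de-duplicated words up to max_size."""
    placeholders = ",".join("?" * len(spellings))
    query = (
        "SELECT DISTINCT word FROM words "
        f"WHERE size <= ? AND spelling IN ({placeholders}) "
        "ORDER BY word"
    )
    return [row[0] for row in conn.execute(query, [max_size, *spellings])]


# Made-up sample rows; the sizes here are illustrative only.
conn = build_db([
    ("first", "english", "words", 10),
    ("color", "american", "words", 10),
    ("colour", "british", "words", 10),
    ("gazebo", "english", "words", 40),
])
print(word_list(conn, 35, ["english", "american"]))  # ['color', 'first']
```

A single query with DISTINCT and ORDER BY stands in for the chain of "sort" and "comm" invocations; whether it would scale to the full data set is untested here.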

RECREATING THE WORD LISTS:

In order to recreate the word lists you need a modern version of Perl,
bash, the traditional set of shell utilities, a system that supports
symbolic links, and quite possibly GNU Make.  Once you have downloaded
all the necessary raw data in the r/ directory you should be able to
type "rm final/* && make all" and the word lists in the final/
directory should be recreated.  If you have any problems feel free to
contact me; however, unless you are interested in improving the
scripts used, I will likely ignore you as there should be little need
for anyone not interested in improving the word list to do so.

The src/ directory contains the numerous scripts used in the creation
of the final product.

The r/ directory contains the raw data used to
create the final product.  In order for the scripts to work various
word lists and databases need to be created and put into this
directory.  See the README file in the r/ directory for more
information.

The l/ directory contains symbolic links used by the actual scripts.

Finally, the working/ directory is where all the intermediate files go
that are not specific to one source.

Download

You can download and modify the code as you wish as long as you abide by the license requirements.

IMPORTANT: Please read the SCOWL LICENSE before using the word collection.
