
%Text.English

Class %Text.English Extends %Text.Text [ System = 4 ]

See %Text.Text

The %Text.English class implements the English language-specific stemming algorithm and initializes the language-specific list of noise words.

Parameters

DICTIONARY

Parameter DICTIONARY = 2;

SOURCELANGUAGE

Parameter SOURCELANGUAGE = "en";

NOISEWORDS100

Parameter NOISEWORDS100 = "the of and a to in is you that it he for was on are as with his they at be this from I have or by one had not but what all were when we there can an your which their said if do will each about how up out them then she many some so these would other into has more her two like him see time could no make than first been its who now my made over did down only way find use may long little very after called just where most know get through back";

NOISEWORDS200

Parameter NOISEWORDS200 = "much before go good new write our used me man too any day same right look also around another came come work three word must because does part even place well such here take why things help put years different away again off went old number great tell men say small every found still between name should Mr Mrs home big give set own under read last never us left end along while might next below saw something thought both few those always looked show often together asked don going want people water words air line sound large house";

NOISEWORDS300

Parameter NOISEWORDS300 = "world school important until 1 form food keep children feet land side without boy once animals life enough took sometimes four head above kind began almost live page got earth need far hand high year mother light parts country father let night following 2 picture being study second eyes soon times story boys since white days ever paper hard near sentence better best across during today others however sure means knew its try told young miles sun ways thing whole hear example heard several change answer room against top turned 3 learn point city play toward five using himself usually";

NOISEBIGRAMS100

Parameter NOISEBIGRAMS100 = "thousand dollar,last night,twenti five,half hour,five hundr,hundr fifti,next morn,feet high,never heard,sundai school,hundr dollar,never mind,don want,hundr mile,never seen,hundr feet,human be,pretti soon,few dai,four hundr,those dai,those peopl,never saw,hundr thousand,per cent,human race,young ladi,look upon,hundr yard,half dozen,young fellow,ever seen,young girl,yes sir,four hour,twenti four,sever time,ten thousand,ever sinc,don care,five minute,fell upon,don think,ten dai,thousand feet,sure enough,six hundr,ever saw,thirti five,ten minute,should think,didn want,col seller,four five,five thousand,ask question,let alone,thousand mile,five mile,ever mark,whole thing,pilot hous,five six,everi night,differ between,hundr ago,half past,both side,yrs ever,middl ag,ever heard,next letter,don mind,noth els,few minute,without doubt,scienc health,don mean,fifteen minute,anybodi els,week ago,women children,dear sir,anyth els,shall never,left hand,everi thing,sai don,never got,human nature,half mile,don believ,centuri ago,never thought,last year,sort thing,six month,poor thing,next moment";

NOISEBIGRAMS200

Parameter NOISEBIGRAMS200 = "poor fellow,five dollar,sai myself,feet above,worth while,sincere your,four dai,month ago,thou art,mother church,gener grant,letter written,fifti mile,keep still,wait till,someth els,low voic,seven hundr,run across,never anyth,ladi gentlemen,everi year,dai ago,ain got,ain go,ten mile,six feet,hour half,fifti dollar,eight hundr,don don,shook head,own hand,onc twice,never never,mont blanc,feet deep,without know,side side,sever dai,last moment,hour ago,think think,feet wide,don ever,depend upon,twenti minute,thou shalt,thing done,talk talk,rest upon,mile below,left behind,god bless,five feet,face face,six seven,four thousand,five cent,dai later,thousand time,quarter mile,hand upon,found himself,boi girl,read book,quarri farm,last week,gener thing,eye upon,clock morn,noth left,father peter,year year,ten twelv,nobodi ever,hour hour,haven got,four time,fifteen hundr,don rememb,didn anyth,stood still,somebodi els,poor creature,hundr time,forti five,young peopl,yes yes,whole world,twenti seven";

NOISEBIGRAMS300

Parameter NOISEBIGRAMS300 = "four feet,upon head,everybodi els,etc etc,done done,don anyth,thou hast,thing ever,six thousand,set forth,odd end,month later,hundr twenti,hour later,fifti thousand,didn seem,care noth,yet never,till got,ten dollar,own self,never let,minute later,fifti ago,far wide,everi bodi,confer upon,call mind";

Methods

stemWord

ClassMethod stemWord(ByRef b As %String) As %String

The main part of the stemming algorithm starts here. b is a buffer holding the word to be stemmed. The letters are in b[k0], b[k0+1], ..., ending at b[k]. k is adjusted downward as the stemming progresses. Only lowercase sequences are stemmed, so the word should be forced to lowercase before stem(...) is called. See: http://www.tartarus.org/~martin/PorterStemmer/c.txt

step1ab

ClassMethod step1ab(ByRef b As %String, ByRef k As %Integer)

Removes plural endings and the suffixes -ed and -ing.
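
The following is a minimal Python sketch of this step, based on Porter's reference C code rather than on the ObjectScript source of this class. All function names are illustrative, indices are 0-based, and the word is passed and returned as a string instead of being edited in a ByRef buffer.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant; 'y' is a consonant only at the
    start of the word or directly after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of vowel-run -> consonant-run transitions (Porter's m)."""
    kinds = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a == "v" and b == "c")


def has_vowel(stem: str) -> bool:
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(word: str) -> bool:
    return (len(word) >= 2 and word[-1] == word[-2]
            and is_consonant(word, len(word) - 1))


def ends_cvc(word: str) -> bool:
    """consonant-vowel-consonant ending, last consonant not w, x or y."""
    i = len(word) - 1
    if i < 2:
        return False
    if (not is_consonant(word, i) or is_consonant(word, i - 1)
            or not is_consonant(word, i - 2)):
        return False
    return word[i] not in "wxy"


def step1ab(word: str) -> str:
    # Step 1a: plurals (sses -> ss, ies -> i, s -> "" unless preceded by s).
    if word.endswith("s"):
        if word.endswith("sses") or word.endswith("ies"):
            word = word[:-2]
        elif len(word) >= 2 and word[-2] != "s":
            word = word[:-1]
    # Step 1b: -eed, -ed, -ing.
    if word.endswith("eed"):
        if measure(word[:-3]) > 0:
            word = word[:-1]
        return word
    for suffix in ("ed", "ing"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and has_vowel(stem):
            word = stem
            if word.endswith(("at", "bl", "iz")):
                word += "e"                      # e.g. mat -> mate
            elif ends_double_consonant(word):
                if word[-1] not in "lsz":
                    word = word[:-1]             # e.g. matt -> mat
            elif measure(word) == 1 and ends_cvc(word):
                word += "e"                      # e.g. hop -> hope
            break
    return word
```

With this sketch, `step1ab("caresses")` gives `"caress"`, `step1ab("disabled")` gives `"disable"`, and `step1ab("meetings")` gives `"meet"`.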

step1c

ClassMethod step1c(ByRef b As %String, ByRef k As %String)

Turns a terminal y into i when there is another vowel in the stem.
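
As an illustration, here is a small Python sketch of this rule, assuming Porter's reference behaviour (it is not the ObjectScript source, and the function names are hypothetical). It also shows why entries in the noise bigram lists appear in forms such as "dai", "twenti", and "pretti".

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant; 'y' is a consonant only at the
    start of the word or directly after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def has_vowel(stem: str) -> bool:
    """True if any position in the stem holds a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step1c(word: str) -> str:
    """Replace a final 'y' with 'i' if the rest of the word has a vowel."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word
```

Here `step1c("day")` gives `"dai"`, while `step1c("sky")` is left unchanged because "sk" contains no vowel.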

step2

ClassMethod step2(ByRef b As %String, ByRef k As %Integer)

Maps double suffixes to single ones, so -ization (= -ize plus -ation) maps to -ize, and so on. Note that the string before the suffix must give m() > 0.
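
The mapping can be sketched in Python as follows. This is an illustration under stated assumptions: the suffix table is only a subset of the pairs in Porter's reference C code, the function names are hypothetical, and the actual ObjectScript implementation may be organized differently.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant; 'y' is a consonant only at the
    start of the word or directly after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of vowel-run -> consonant-run transitions (Porter's m)."""
    kinds = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a == "v" and b == "c")


# Subset of Porter's step 2 table. Longer suffixes come before the shorter
# suffixes they contain, so "ational" is tried before "tional".
STEP2_SUFFIXES = [
    ("ational", "ate"),
    ("tional", "tion"),
    ("ization", "ize"),
    ("ation", "ate"),
    ("izer", "ize"),
    ("fulness", "ful"),
    ("ousness", "ous"),
    ("iveness", "ive"),
    ("biliti", "ble"),
]


def step2(word: str) -> str:
    for suffix, replacement in STEP2_SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # Only the first matching suffix is considered; it is replaced
            # only when the part before it has m() > 0.
            if measure(stem) > 0:
                return stem + replacement
            return word
    return word
```

For example, `step2("relational")` gives `"relate"`, but `step2("rational")` is unchanged because the stem "r" has m() = 0.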

step3

ClassMethod step3(ByRef b As %String, ByRef k As %Integer)

Replaces -ic-, -full, -ness, etc., using a strategy similar to step2.

step4

ClassMethod step4(ByRef b As %String, ByRef k As %Integer)

Takes off -ant, -ence, etc., in the context <c>vcvc<v> (that is, when m() > 1 for the string before the suffix).

step5

ClassMethod step5(ByRef b As %String, ByRef k As %Integer, hadTrailingY As %Boolean)

Remove a final -e if m() > 1, and change -ll to -l if m() > 1.
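
A Python sketch of this step is shown below. It follows Porter's reference C code, which additionally removes a final -e when m() = 1 and the stem does not end consonant-vowel-consonant; whether the ObjectScript method does the same, and how it uses its hadTrailingY argument, is not stated on this page. Function names are illustrative.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant; 'y' is a consonant only at the
    start of the word or directly after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of vowel-run -> consonant-run transitions (Porter's m)."""
    kinds = ["c" if is_consonant(stem, i) else "v" for i in range(len(stem))]
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a == "v" and b == "c")


def ends_cvc(stem: str) -> bool:
    """consonant-vowel-consonant ending, last consonant not w, x or y."""
    i = len(stem) - 1
    if i < 2:
        return False
    if (not is_consonant(stem, i) or is_consonant(stem, i - 1)
            or not is_consonant(stem, i - 2)):
        return False
    return stem[i] not in "wxy"


def step5(word: str) -> str:
    # Drop a final 'e' when the stem is long enough.
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            word = stem
    # Reduce a final 'll' to 'l' when m() > 1.
    if word.endswith("ll") and measure(word) > 1:
        word = word[:-1]
    return word
```

With this sketch, `step5("probate")` gives `"probat"`, `step5("rate")` stays `"rate"`, and `step5("controll")` gives `"control"`.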

cons

ClassMethod cons(b As %String, pos As %String) As %Boolean

Returns TRUE if the character at position pos is a consonant; otherwise returns FALSE.
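
A minimal Python sketch of the consonant test, assuming the rule used in Porter's reference C code (the ObjectScript source is not shown here). The name `is_consonant` is illustrative, and positions are 0-based in this sketch.

```python
def is_consonant(word: str, pos: int) -> bool:
    """Return True if word[pos] counts as a consonant.

    a, e, i, o, u are always vowels. The letter 'y' is a consonant when it
    starts the word or follows a vowel, and a vowel when it follows a
    consonant.
    """
    ch = word[pos]
    if ch in "aeiou":
        return False
    if ch == "y":
        return pos == 0 or not is_consonant(word, pos - 1)
    return True
```

For example, the y in "toy" is a consonant (it follows the vowel o), while the first y in "syzygy" is a vowel (it follows the consonant s).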

cvc

ClassMethod cvc(b As %String, i As %Integer) As %Boolean

cvc(i) is TRUE <=> positions i-2, i-1, i have the form consonant - vowel - consonant and the second consonant is not w, x, or y. This is used when trying to restore an e at the end of a short word, e.g. cav(e), lov(e), hop(e), crim(e), but not snow, box, tray.
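
The condition can be sketched in Python as follows, assuming Porter's reference logic. The function names are illustrative and positions are 0-based here, unlike the 1-based positions of the ObjectScript method.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant; 'y' is a consonant only at the
    start of the word or directly after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_cvc(word: str, i: int) -> bool:
    """True if word[i-2], word[i-1], word[i] are consonant, vowel,
    consonant and word[i] is not w, x or y."""
    if i < 2:
        return False
    if (not is_consonant(word, i) or is_consonant(word, i - 1)
            or not is_consonant(word, i - 2)):
        return False
    return word[i] not in "wxy"
```

Here `ends_cvc("hop", 2)` is true, so an e may be restored (hope), while `ends_cvc("snow", 3)` is false because the final consonant is w.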

doublec

ClassMethod doublec(b As %String, j As %Integer) As %Boolean

In Porter's reference C implementation, doublec(j) is TRUE <=> positions j-1 and j contain a double consonant.

m

ClassMethod m(b As %String, j As %Integer) As %Integer

m() measures the number of consonant sequences between positions k0=1 and j. If c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence:

<c><v> gives 0
<c>vc<v> gives 1
<c>vcvc<v> gives 2
<c>vcvcvc<v> gives 3
....
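
As a hedged illustration, the measure can be computed in Python by classifying each letter as consonant or vowel and counting how often a vowel is followed by a consonant. The names are hypothetical, and the sketch uses 0-based positions, so `j` is the last index included.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant; 'y' is a consonant only at the
    start of the word or directly after a vowel."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word: str, j: int) -> int:
    """Count the vc sequences in word[0..j] (inclusive).

    Each transition from a vowel to a consonant closes one 'vc' pair, so
    the number of such transitions equals Porter's m.
    """
    kinds = ["c" if is_consonant(word, i) else "v" for i in range(j + 1)]
    return sum(1 for a, b in zip(kinds, kinds[1:]) if a == "v" and b == "c")
```

For example, "tree" and "by" have measure 0, "trouble" and "oats" have measure 1, and "troubles" and "private" have measure 2.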

r

ClassMethod r(ByRef b As %String, s As %String, j As %Integer, ByRef k As %Integer)

In Porter's reference C implementation, r(s) replaces the matched suffix (positions j+1 through k) with the string s, but only if m() > 0.

vowelInStem

ClassMethod vowelInStem(b As %String, j As %Integer) As %Boolean

In Porter's reference C implementation, vowelinstem() is TRUE <=> positions k0 through j contain a vowel.