In natural language processing, there may come a time when you want your program to recognize that the words “ask” and “asked” are just different tenses of the same verb. This is the idea of reducing different forms of a word to a core root. Words that are derived from one another can be mapped to a central word or symbol, especially if they have the same core meaning.

Maybe this is in an information retrieval setting and you want to boost your algorithm’s recall. Or perhaps you are trying to analyze word usage in a corpus and wish to condense related words so that you don’t have as much variability. Either way, this technique of text normalization may be useful to you.

This is where something like stemming or lemmatization comes in, something that you may have heard of before! But what’s the difference between the two? And what do they actually do? These are two questions that we are going to explore today!

So What Are They?

At their core, both of these techniques tackle the same idea: reduce a word to its root or base unit. It’s often a data pre-processing step and is something good to be familiar with. Though they both aim to solve this same problem, they go about it in completely different ways.

Stemming is definitely the simpler of the two approaches. With stemming, words are reduced to their word stems. A word stem need not be the same as a dictionary-based morphological root; it is just a form of the word that is equal to or smaller than the original.

Stemming algorithms are typically rule-based. You can view them as a heuristic process that sort of lops off the ends of words. A word is looked at and run through a series of conditionals that determine how to cut it down.

For example, we may have a suffix rule that, based on a list of known suffixes, cuts them off. In the English language, we have suffixes like “-ed” and “-ing” which may be useful to cut off in order to map the words “cook,” “cooking,” and “cooked” all to the same stem of “cook.”

Overstemming and Understemming

However, because stemming is usually based on heuristics, it is far from perfect. In fact, it commonly suffers from two issues in particular: overstemming and understemming.

Overstemming happens when too much of a word is cut off. This can result in nonsensical stems, where all the meaning of the word is lost or muddled. Or it can result in words being resolved to the same stems, even though they probably should not be.

Take the four words university, universal, universities, and universe. A stemming algorithm that resolves these four words to the stem “univers” has overstemmed. While it might be nice to have universal and universe stemmed together and university and universities stemmed together, all four do not fit. A better resolution might have universal and universe resolve to “univers” and university and universities resolve to “universi.” But enforcing rules that make that so might result in more issues arising.

Understemming, in turn, comes up when we have several words that actually are forms of one another. It would be nice for them to all resolve to the same stem, but unfortunately, they do not.

This can be seen if we have a stemming algorithm that stems the words data and datum to “dat” and “datu.” And you might be thinking, well, just resolve these both to “dat.” However, then what do we do with date? And is there a good general rule? Or are we just enforcing a very specific rule for a very specific example?

Those questions quickly become issues when it comes to stemming. Enforcing new rules and heuristics can quickly get out of hand. Solving one or two over/under stemming issues can result in two more popping up! Making a good stemming algorithm is hard work. Speaking of which…

Stemming Algorithm Examples
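
To make the suffix-rule idea from this post concrete, here is a minimal sketch of a toy stemmer in Python. It is not any published stemming algorithm: the function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions made up for illustration. It simply cuts off the longest known suffix it finds, which is enough to show both the “cook” case and the two failure modes.

```python
# Hypothetical suffix list, chosen only for illustration.
SUFFIXES = ("ities", "ity", "ing", "al", "ed", "e", "s")


def naive_stem(word, suffixes=SUFFIXES, min_stem_len=3):
    """Strip the longest matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    # Try longer suffixes first so "ities" wins over "s".
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule applied: the word is its own stem.
    return word


# The suffix rule working as intended.
print([naive_stem(w) for w in ["cook", "cooking", "cooked"]])
# ['cook', 'cook', 'cook']

# Overstemming: four words with different meanings collapse to one stem.
print({naive_stem(w) for w in ["university", "universal", "universities", "universe"]})
# {'univers'}

# Understemming: two forms of the same word keep different stems.
print(naive_stem("data"), naive_stem("datum"))
# data datum
```

Adding a rule to pull “data” and “datum” together (say, stripping a trailing “a” or “um”) would be easy, but with this suffix list “date” already stems to “dat,” so the fix for one pair would immediately create a new collision.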