Few weeks ago, ZJS was busy with his Graduation Design (GD), after weeks of efforts he finally get it. His GD is on the Chinese Word Segment.
As we know western language is used to write the word separated by a space between words. Chinese does not write words continuously, and because word is the basic unit of any language, so the Word Segment (Separates the word with the word by a space or other markings) has become the necessary first working procedure in Chinese auto analysis.
There is a “the biggest probability algorithm” in ZJS' GD. The algorithm described as follows: Before the Word Segment there has been a dictionary with the word frequency and the dictionary is believable. In the dictionary, each word takes a line, behind each word is its frequency, and these words are stored in the dictionary by lexicographic. At the time of segmenting word we find out all the words according to the dictionary, then find out all the possible cut paths(the word strings),and choose a best (that is, the biggest probability) path as the output, the key of this approach is to find out the best path efficiently. For simplicity, we transform the biggest probability into the minimum cost and to find the best path with which the cost is the smallest.
fee ( word ) = - log ( ( fre ( word ) + 1) / MaxFre ), which fee ( word ) is the cost of the word, fre ( word ) is the frequency of the word which can be found in the dictionary.And we suppose MaxFre = 5196588.If we can not find the word in the dictionary, its frequency is 0.
For example, a part of the dictionary is like this:
Transform the expense into the frequency, the result is as follows:
成 7.50075 成分 10.3821 合 9.8395
合成 12.3725 分 7.86913 分子 9.53926
结 9.95008 结合 7.76322 时 6.46674
“结合成分子时” can be segmented in to “结/合/成/分/子/时/”, “结合/成/分/子/时/”, “结合/成分/子/时/”, “结合/成/分子/时/”, and so on.
Obviously, the best path of “结合成分子时” is “结合/成/分子/时”.Its cost is 31.26997 which is smallest among all the paths. And now your task is to segment Chinese sentences, then output the best path, use “ / ” as separation mark. For simplicity, in the input does not have the punctuation mark and the western languages character.
A integer m first line which is the number of the words in the dictionary, then a dictionary with m lines as described above, and every word contains four Chinese characters at most , then an integer n. Each of the following n lines contains a Chinese sentences, no punctuation, only composed of Chinese characters.
N sentences with separation marks after segmenting. Each sentence a line.