Text normalization is an indispensable stage in processing noncanonical language from natural sources, such as speech, social media or short text messages. Research in this field is very recent and mostly on English. Sequence-to-sequence (seq2seq) models require a large amount of labelled training data to learn the mapping between input and output; to train a seq2seq spelling correction system, a large set of misspelled words together with their corrections is needed. Low-resource languages such as Turkish usually lack such large annotated datasets. Although misspelling-reference pairs can be synthesized with a random procedure, the generated dataset may not match genuine human-made misspellings well, which might degrade performance in realistic test scenarios. In this paper, we propose a novel procedure to automatically introduce human-like misspellings into legitimate Turkish words, and use the generated misspellings to improve the performance of a seq2seq spelling correction system. The proposed system consists of two separate models: a misspelling generator and a spelling corrector. The generator is trained using a relatively small number of human-made misspellings and their manual corrections; reference words and their misspellings are used as its inputs and outputs, respectively. As a result, it learns to add realistic spelling errors to valid words. The training data of the spelling corrector is then augmented with the generator's human-like misspellings. In the experiments, we observe that this data augmentation significantly improves spelling correction performance: the proposed method yields a 5% absolute improvement over state-of-the-art Turkish spelling correction systems on a test set containing human-made misspellings from Twitter messages. Our study thus reveals the top performance achievable with the proposed approach and gives directions for a better future implementation.
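The random synthesis procedure that the abstract contrasts with its learned generator can be sketched as follows. This is a minimal, illustrative baseline: it applies one random character edit (delete, insert, substitute, or swap) to a valid word to produce a misspelling-reference training pair. The alphabet, function names, and edit set are assumptions for illustration, not the paper's actual pipeline.

```python
import random

# Turkish lowercase letters, used as the candidate set for random edits.
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"

def random_misspelling(word, rng):
    """Apply one random character edit to `word` (a toy noise model)."""
    op = rng.choice(["delete", "insert", "substitute", "swap"])
    i = rng.randrange(len(word))
    if op == "delete" and len(word) > 1:
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(ALPHABET) + word[i:]
    if op == "substitute":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    if op == "swap" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word  # edit not applicable at this position; keep word as-is

def synthesize_pairs(vocabulary, n_per_word, seed=0):
    """Build (misspelling, reference) pairs to train a seq2seq corrector."""
    rng = random.Random(seed)
    return [(random_misspelling(w, rng), w)
            for w in vocabulary
            for _ in range(n_per_word)]

pairs = synthesize_pairs(["merhaba", "teşekkür", "günaydın"], n_per_word=2)
```

The abstract's point is that pairs produced this way are distributed differently from genuine human errors (which cluster around keyboard adjacency, phonetics, and deasciification), which is why a generator trained on real human-made misspellings augments the corrector's training data more effectively.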
The spelling correction of morphologically rich languages is hard to solve with traditional approaches since, in these languages, words may have hundreds of different surface forms that do not occur in a dictionary. Turkish is an agglutinative language with a very complex morphology and lacks annotated language resources. In this study, we explore the impact of different spelling correction approaches for Turkish and ways to eliminate the training data scarcity. We test seven different spelling correction approaches, four of which are introduced in this study. As a result of this preliminary work, we propose a new automatic training data collection process in which existing spelling correctors help to develop an error model for a better system. Our best performing model uses a unigram language model together with this error model, and improves performance scores by almost 20 percentage points over the widely used baselines.
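The combination of a unigram language model with an error model is an instance of the classic noisy-channel corrector: pick the candidate w maximizing P(w) x P(x|w). The sketch below is a minimal assumed version, with a toy corpus for the unigram model and a fixed per-edit penalty standing in for the learned error model; none of these values come from the study.

```python
import math
from collections import Counter

# Toy corpus for the unigram language model P(w); illustrative only.
CORPUS = "bir bir bir iki iki üç".split()
COUNTS = Counter(CORPUS)
TOTAL = sum(COUNTS.values())

# Assumed error model: each character edit costs a fixed log-probability.
EDIT_LOGPROB = math.log(0.01)

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(observed):
    """Return the vocabulary word with the best LM + error-model score."""
    def score(w):
        lm = math.log(COUNTS[w] / TOTAL)                      # log P(w)
        channel = edit_distance(observed, w) * EDIT_LOGPROB   # log P(x|w)
        return lm + channel
    return max(COUNTS, key=score)
```

A real system would replace the fixed edit penalty with an error model estimated from misspelling-correction pairs (the automatic data collection process the study proposes), and handle Turkish surface forms that a plain dictionary misses.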