Zeynep YILMAZ – CRYPTTECH AI LAB.
https://github.com/zeynobia/stst
Özetçe—Diller zamanla değişime uğrar. Eski metinlerde kullanılan kelimeler günümüz metinlerinde kullanılmayabilir. Eski metinleri günümüz neslinin anlayabilmesi amacıyla, uzman kişiler tarafından metin sadeleştirmesi işleminin yapılması gerekmektedir. Tarihi dizi ve filmlerdeki diyaloglarda ise, tam tersi bir durum söz konusudur. Uzman kişiler tarafından, diyaloglarda eski dönemlerde kullanılan kelimelerin kullanılması gerekmektedir. Ancak bu metot kaynak ve zaman açısından maliyetlidir. Bu probleme bir çözüm getirebilmek adına bu çalışma yapılmıştır. Çalışmada, problem kelime öbeği tabanlı istatiksel model olarak tanımlanmıştır. İlk olarak paralel veri kümesi oluşturulmuştur. Kelimeleri hizalamak için IBM modeli kullanılmıştır. Dil modeli için de n gram dil modeli kullanılmıştır. Önerilen sistemi değerlendirilmesi için, BLEU, Rouge, Meteor, Wer, Word2vec skorları kullanılarak, sonuç olarak, 87 BLEU puanı elde edilmiştir.
Anahtar Kelimeler — Text Stili Transferi, Gözetimli Makine Öğrenmesi, Yeniden Açıklama, Kelime Öbeği Çevirisi
Abstract: Languages change over time, and words used in old texts may no longer appear in today's texts. For today's generation to understand old texts, experts have to simplify them. Dialogues in historical TV series and films pose the opposite problem: experts have to rewrite them with the vocabulary of earlier periods. This manual work is costly in both resources and time. This study addresses the problem by defining it as a phrase-based statistical model. First, a parallel dataset is created. An IBM model is used for word alignment, and an n-gram model is used as the language model. The proposed system is evaluated with BLEU, ROUGE, METEOR, WER, and Word2vec scores, and it achieves a BLEU score of 87.
Key Words — Text Style Transfer, Supervised Machine Learning, Paraphrasing, Phrase-Based Translation.
I. INTRODUCTION
As societies develop, some words are used less and less, and some disappear completely [1]. Even texts from relatively recent times, such as the Republic period, contain many words and word groups that are no longer used today. Because most readers do not know what these words mean, the texts are hard to understand. Linguists therefore simplify old texts in full and present them as new texts.
In this study, old Turkish texts are converted to modern Turkish, and modern Turkish texts are converted to old Turkish. The aim is to change the style of a text while preserving its meaning.
Many important studies have addressed the problem of text simplification. These studies use a range of methods, including lexical and syntactic methods, statistical Bayes models, artificial neural networks, and hybrid models [2]. Torunoglu-Selamet et al. [3] created syntactic and morphological rules for the simplification task. Zhu et al. [4] created the PWKP text simplification dataset and trained a tree-based statistical Bayes model on it. Xu et al. [5] tried to convert the writing style of Shakespeare's texts into modern English. Jhamtani et al. [6] tried to rewrite modern English texts in Shakespeare's style using sequence-to-sequence neural network models.
The rest of this study is organized as follows. The second section describes the data set. The third section details the steps of the proposed method. The fourth section presents the experiments and their results. The fifth section gives the conclusion.
II. DATA SET
In this study, the 1938 edition of Nutuk, which has not been simplified, is taken as old Turkish. The edition simplified and published by Bedi Yazıcı in 1995 is taken as modern Turkish.
In addition to the Nutuk data set, a parallel data set containing common words was also created.
Large corpora were collected from Wikipedia, news, review texts, and old and modern books, and word, stem, bigram, and phrase statistics were computed from them. The parallel data was then created from the most frequent synonyms together with these statistics.
III. MODEL
In this study, a Bayesian probabilistic model is used for the paraphrasing system. The model is given in Equation 1, where p denotes the paraphrased text and s denotes the source text. P(p) is the probability given by the language model, and P(s|p) is the probability given by the phrase model. By Bayes' rule, P(p|s) = P(s|p) * P(p) / P(s), and P(s) can be dropped because it is constant for a given source text.
argmax_p P(p|s) = argmax_p P(s|p) * P(p)    (1)
A. Preprocessing
In the preprocessing step, sentences are first split into words. Next, unwanted characters and punctuation marks are removed from the text, and uppercase letters are converted to lowercase. Without this step, the same word written in different cases is treated as different words, which lowers the success rate.
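The preprocessing steps can be sketched in Python as follows. This is a minimal illustration, not the author's code: the function name `preprocess` is hypothetical, and the sketch lowercases before splitting for convenience. It maps the Turkish capitals "İ" and "I" explicitly, because Python's default `str.lower()` does not follow Turkish casing rules for these two letters.

```python
import re
from typing import List

# Map the two Turkish capital I letters explicitly: str.lower() turns "İ" into
# "i" plus a combining dot, and "I" into "i" instead of the dotless "ı".
_TR_LOWER = str.maketrans({"İ": "i", "I": "ı"})

def preprocess(sentence: str) -> List[str]:
    """Lowercase a sentence, drop punctuation, and split it into words."""
    lowered = sentence.translate(_TR_LOWER).lower()
    # Replace every character that is neither a word character nor whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    return cleaned.split()

print(preprocess("İstikbalde dahi, seni bu hazineden mahrum etmek isteyecek."))
```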
B. Word Alignment
In the word alignment step, the translation relationships between the words of each sentence pair are estimated. In supervised text style transfer applications, the quality of the word alignment directly affects the accuracy score. Brown et al. [7] introduced five IBM models, and many later studies have extended them. This study uses the fast_align library, which is a reparameterization of IBM Model 2 [8]. Because some words do not change during paraphrasing, the word alignment step gives very successful results.
Example: Word Alignment
Sentence: İstikbalde dahi, seni bu hazineden mahrum etmek isteyecek, dahili ve harici bedhahların olacaktır.
Paraphrasing: Gelecekte bile, seni bu hazineden yoksun bırakmak isteyecek, iç ve dış düşmanların olacaktır.
Table 1: Word Alignment Example Results
Source | Target |
istikbalde | gelecekte |
dahi | bile |
seni | seni |
bu | bu |
hazineden | hazineden |
mahrum etmek | yoksun bırakmak |
isteyecek | isteyecek |
dahili | iç |
ve | ve |
harici | dış |
bedhahların | düşmanların |
olacaktır | olacaktır |
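As a rough illustration of how such a table can be produced, the sketch below formats a sentence pair in the `source ||| target` layout that fast_align reads and parses the `i-j` index pairs it writes. The helper names are hypothetical, and the alignment string is a hand-written monotone alignment standing in for real aligner output.

```python
def to_fast_align_line(source_tokens, target_tokens):
    """Format one sentence pair as 'source ||| target'."""
    return " ".join(source_tokens) + " ||| " + " ".join(target_tokens)

def parse_alignment(line):
    """Parse 'i-j' pairs into (source_index, target_index) tuples."""
    return [tuple(int(k) for k in pair.split("-")) for pair in line.split()]

def aligned_words(source_tokens, target_tokens, alignment_line):
    """Turn index pairs into (source_word, target_word) pairs."""
    return [(source_tokens[i], target_tokens[j])
            for i, j in parse_alignment(alignment_line)]

src = ("istikbalde dahi seni bu hazineden mahrum etmek isteyecek "
       "dahili ve harici bedhahların olacaktır").split()
tgt = ("gelecekte bile seni bu hazineden yoksun bırakmak isteyecek "
       "iç ve dış düşmanların olacaktır").split()

# Assumed aligner output for this pair: word k aligns with word k.
alignment = " ".join(f"{k}-{k}" for k in range(len(src)))

print(to_fast_align_line(src, tgt))
for s, t in aligned_words(src, tgt, alignment):
    print(s, "->", t)
```

Adjacent pairs such as "mahrum -> yoksun" and "etmek -> bırakmak" would then be merged into one phrase pair during phrase extraction.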
C. Language Model
The language model is used in paraphrasing applications to increase the fluency of the generated text. In this study, a probabilistic n-gram language model is used, and KenLM is the preferred implementation. KenLM uses modified Kneser-Ney smoothing, and it is reported to be faster and to use less memory than other n-gram language model toolkits [9]. An n-gram language model predicts the next element of a sequence from the previous elements, and it also gives the probability of a whole sentence. N-gram language models are used in speech recognition, machine translation, and spell checking applications.
Simple N-gram Model
Example: Bigrams of a Simple Corpus
seni bu hazineden yoksun bırakmak isteyecek düşmanların olacaktır
seni bu hazineden mahrum etmek isteyecek bedhahların olacaktır
Table 2: Bigram Model for the Simple Corpus
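A bigram model for the two-sentence corpus above can be sketched as follows. This is an unsmoothed maximum-likelihood estimate written only for illustration; it is not the smoothed model that KenLM trains, and the function name `bigram_prob` is hypothetical.

```python
from collections import Counter

corpus = [
    "seni bu hazineden yoksun bırakmak isteyecek düşmanların olacaktır",
    "seni bu hazineden mahrum etmek isteyecek bedhahların olacaktır",
]

context_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    context_counts.update(tokens[:-1])             # every token that has a successor
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs

def bigram_prob(prev, word):
    """Unsmoothed maximum-likelihood estimate of P(word | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("seni", "bu"))                # both sentences continue "seni" with "bu"
print(bigram_prob("hazineden", "yoksun"))       # one of the two continuations
print(bigram_prob("düşmanların", "isteyecek"))  # never observed, so zero without smoothing
```

The zero in the last line is the case that smoothing and backoff are meant to handle.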
How to Calculate N-gram Probability
Smoothing techniques are used to avoid zero probabilities, so unknown words are also assigned some probability. The <unk> token stands for unknown words, <s> marks the beginning of a sentence, and </s> marks the end of a sentence. In the n-gram file, the actual probabilities are replaced by their logarithms, generally in base 10.
-1.278754 düşmanların -0.2527253
-1.278754 is the log probability of the word “düşmanların”.
-0.2527253 is the backoff weight of the word “düşmanların”.
The backoff weight (BW) is used to calculate the probability when an unseen expression occurs.
Example: the expression “düşmanların isteyecek” does not occur in the corpus. Its probability is calculated as follows. Because the stored values are logarithms, the product becomes a sum.
P(isteyecek|düşmanların) = P(isteyecek) * BW(düşmanların)
log10 P(isteyecek|düşmanların) = -0.9777236 + (-0.2527253) = -1.2304489
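The backoff calculation above can be reproduced with a few lines of Python. The dictionaries below are a toy stand-in for an n-gram file, and the backoff weight listed for “isteyecek” is a placeholder assumption, since the source does not give it.

```python
# Toy n-gram entries: word -> (log10 probability, log10 backoff weight).
unigram_entries = {
    "düşmanların": (-1.278754, -0.2527253),
    "isteyecek": (-0.9777236, 0.0),  # backoff weight is a placeholder, not from the source
}
bigram_entries = {}  # "düşmanların isteyecek" was never observed

def log10_bigram(prev, word):
    """Return log10 P(word | prev), backing off to the unigram when needed."""
    if (prev, word) in bigram_entries:
        return bigram_entries[(prev, word)]
    log_p_word, _ = unigram_entries[word]
    _, backoff_prev = unigram_entries[prev]
    # In log space the product P(word) * BW(prev) becomes a sum.
    return log_p_word + backoff_prev

score = log10_bigram("düşmanların", "isteyecek")
print(score)        # log10 probability
print(10 ** score)  # the same value as a plain probability
```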
D. Training
The data is divided into training and test sets for evaluation. In this step, a phrase table is built from the training data. The phrase table holds the translation probabilities between words and word groups.
Example: Simple Phrase Table
Sentence1: İstikbal göklerdedir
Paraphrase1: Gelecek göklerdedir
Sentence2: İstikbal ne manaya gelir
Paraphrase2: Ati ne anlama gelir
Table 3: Simple Phrase Table Example
Source | Target | Probability |
istikbal | gelecek | 0.50 |
istikbal | ati | 0.50 |
göklerdedir | göklerdedir | 1.0 |
ne | ne | 1.0 |
manaya | anlama | 1.0 |
gelir | gelir | 1.0 |
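The probabilities in this example can be reproduced by relative-frequency counting over aligned pairs, as in the sketch below. The pair list is typed in by hand from the two example sentences, and `build_phrase_table` is a hypothetical helper; real phrase extraction also covers multi-word phrases.

```python
from collections import Counter, defaultdict

# Aligned pairs taken from the two example sentence pairs.
aligned_pairs = [
    ("istikbal", "gelecek"), ("göklerdedir", "göklerdedir"),
    ("istikbal", "ati"), ("ne", "ne"), ("manaya", "anlama"), ("gelir", "gelir"),
]

def build_phrase_table(pairs):
    """Estimate P(target | source) as count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(source for source, _ in pairs)
    table = defaultdict(dict)
    for (source, target), count in pair_counts.items():
        table[source][target] = count / source_counts[source]
    return dict(table)

phrase_table = build_phrase_table(aligned_pairs)
for source, targets in phrase_table.items():
    for target, prob in targets.items():
        print(f"{source} | {target} | {prob:.2f}")
```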
E. Fine Tuning
In this step, the parameters of the paraphrasing model are adjusted to increase accuracy. The adjustments can include adding phrases to or removing phrases from the phrase table, determining the optimum weights between the phrase model and the language model, and using stems instead of words.
F. Decoding
A given sentence usually has more than one possible rendering on the target side. In this step, the input sentence is divided into words or phrases, and candidate sentences are created by paraphrasing these word groups. Finally, the candidate sentence with the highest score is returned as the result. During decoding, words can be deleted from the sentence, and a single word on the source side can be expressed by more than one word on the target side.
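The candidate generation and scoring described here can be sketched as an exhaustive word-by-word search. This is a simplified illustration under stated assumptions, not the decoder used in the study: it handles single words only (no phrase segmentation or deletion), the bigram scores in `toy_lm` are invented, and `lm_weight` stands for the phrase-model versus language-model weight that fine tuning would set.

```python
import math
from itertools import product

phrase_table = {
    "istikbal": {"gelecek": 0.5, "ati": 0.5},
    "ne": {"ne": 1.0},
    "manaya": {"anlama": 1.0},
    "gelir": {"gelir": 1.0},
}
# Invented log10 bigram scores standing in for a trained language model.
toy_lm = {("gelecek", "ne"): -0.5, ("ati", "ne"): -1.5}
UNSEEN = -2.0  # assumed flat score for bigrams missing from the toy model

def lm_score(words):
    """Sum the log10 bigram scores of a candidate sentence."""
    return sum(toy_lm.get(bigram, UNSEEN) for bigram in zip(words, words[1:]))

def decode(source_words, lm_weight=1.0):
    """Return the best-scoring candidate and its score."""
    # Words missing from the phrase table are copied through unchanged.
    options = [list(phrase_table.get(w, {w: 1.0}).items()) for w in source_words]
    best, best_score = None, -math.inf
    for combo in product(*options):
        words = [target for target, _ in combo]
        phrase_score = sum(math.log10(prob) for _, prob in combo)
        score = phrase_score + lm_weight * lm_score(words)
        if score > best_score:
            best, best_score = words, score
    return best, best_score

print(decode("istikbal ne manaya gelir".split()))
```

Both candidates have the same phrase-model score here, so the language model decides between “gelecek” and “ati”. Enumerating every combination grows exponentially with sentence length, which is why practical decoders use beam search instead.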
G. Evaluation
Many different methods have been developed to measure accuracy in text simplification tasks. They rest on the idea that simplifications made by machines should be highly similar to simplifications made by humans.
BLEU
The BLEU metric was first proposed in 2002 to measure machine translation results [10]. It works independently of the language, and in the following years it became the standard metric for evaluating machine translation. BLEU compares the candidate translation with the reference text on the basis of n-grams. The formula for the standard BLEU metric is given in Equation 2.
BLEU = BP * exp( sum_{n=1..N} w_n * log p_n )    (2)
Here p_n is the n-gram precision and w_n is the weight of the n-gram precision. BP is the brevity penalty, which penalizes candidate translations that are shorter than the reference. The output of the metric lies between 0 and 1, but the result is commonly reported multiplied by 10 or 100. In this study, the BLEU score is multiplied by 100. For BLEU4, the weights of the 1-, 2-, 3-, and 4-gram precision values are 0.25 each. For BLEU3, the weights of the 1-, 2-, and 3-gram precision values are 0.333 each.
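The standard BLEU computation can be sketched in Python as below. This is a minimal sentence-level version, assuming a single reference, uniform weights, and no smoothing; the function names are illustrative, and published scores are normally computed at corpus level with an established tool.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and one reference."""
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precision_sum += (1.0 / max_n) * math.log(clipped / total)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)

ref = "birinci görevin türk istiklalini muhafaza etmektir".split()
hyp = "birinci ödevin türk istiklalini muhafaza etmektir".split()
print(100 * bleu(hyp, ref))  # scaled by 100, as in this study
```

A single substituted word lowers every n-gram precision that spans it, which is why one changed word in a six-word sentence costs BLEU much more than it costs word accuracy.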
WER
The WER metric is the word error rate. It is derived from the Levenshtein distance, which was first introduced by Vladimir Levenshtein [11], and it is the most popular metric for speech recognition systems. Because paraphrasing takes place within the same language, the word order of the source and the paraphrase is the same, so WER can also be used to evaluate paraphrasing results.
WER = (S + D + I) / N
S indicates the number of substitutions,
D indicates the number of deletions,
I indicates the number of insertions,
C indicates the number of correct words,
N indicates the number of words in the reference (N = S + D + C).
Ref: Birinci görevin Türk istiklalini muhafaza etmektir
Hyp: Birinci ödevin Türk istiklalini muhafaza etmektir
Substitution: görevin -> ödevin (Count: 1)
Deletion: 0
Insertion: 0
Correct: 5
In this study, the WordAccuracy score is multiplied by 100 and is presented.
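The counts in the example above can be obtained from a word-level Levenshtein alignment, as in the following sketch. The function names are hypothetical, and word accuracy is taken here as C / N, which is an assumption based on the counts listed in this section.

```python
def edit_counts(reference, hypothesis):
    """Return (S, D, I, C) from a minimum-edit alignment of two word lists."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cell = (total_edits, S, D, I, C) for reference[:i] versus hypothesis[:j]
    cost = [[None] * cols for _ in range(rows)]
    cost[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0, 0)  # only deletions
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j, 0)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, ins, c = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diagonal = (e, s, d, ins, c + 1)
            else:
                diagonal = (e + 1, s + 1, d, ins, c)
            e, s, d, ins, c = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, ins, c)
            e, s, d, ins, c = cost[i][j - 1]
            insertion = (e + 1, s, d, ins + 1, c)
            cost[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])
    return cost[-1][-1][1:]

def wer(reference, hypothesis):
    s, d, i, _ = edit_counts(reference, hypothesis)
    return (s + d + i) / len(reference)

def word_accuracy(reference, hypothesis):
    return edit_counts(reference, hypothesis)[3] / len(reference)

ref = "Birinci görevin Türk istiklalini muhafaza etmektir".split()
hyp = "Birinci ödevin Türk istiklalini muhafaza etmektir".split()
print(edit_counts(ref, hyp), wer(ref, hyp), 100 * word_accuracy(ref, hyp))
```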
ROUGE
ROUGE is a popular metric for summarization systems, first introduced in [12], and it can also be used to evaluate paraphrasing systems. ROUGE-N measures the overlap of n-grams between the reference and hypothesis texts. ROUGE-2, for example, measures the overlap of bigrams between them [12].
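As an illustration, ROUGE-N can be computed as n-gram recall against a single reference. The sketch below is a minimal version with a hypothetical function name; the ROUGE package also reports precision and F-measure variants.

```python
from collections import Counter

def rouge_n(reference, hypothesis, n=2):
    """ROUGE-N recall: matched reference n-grams over all reference n-grams."""
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, hyp = grams(reference), grams(hypothesis)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, hyp[gram]) for gram, count in ref.items())
    return overlap / total

ref = "birinci görevin türk istiklalini muhafaza etmektir".split()
hyp = "birinci ödevin türk istiklalini muhafaza etmektir".split()
print(rouge_n(ref, hyp, n=2))  # 3 of the 5 reference bigrams also occur in the hypothesis
```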
METEOR
METEOR was first proposed in [13]. The metric is based on the harmonic mean of unigram precision and recall [13]. In addition to standard exact word matching, it supports features such as stemming and synonym matching. METEOR is implemented in pure Java and requires no installation or dependencies to score output [14].
Word2Vec Semantic Similarity
The Word2vec algorithm, first proposed in [15], uses a neural network to learn word associations from a large corpus of text. The resulting model can detect synonymous or closely related words, and it is also used to calculate the semantic similarity of sentences.
Proposed Evaluation Metric
In the paraphrasing system, word order is not a problem, because the process takes place within the same language. For this reason, the BLEU score is not well suited to evaluating paraphrasing systems [16], and the word accuracy metric evaluates them better than BLEU. However, a word can have more than one synonym. For example, “harika”, “mükemmel”, “kusursuz”, and “muhteşem” are synonyms, but only one of them appears in the reference sentence. A correct paraphrase that uses a different synonym is counted as an error, so the measured success is lower than expected. Word2vec similarity has the opposite problem: although “harika” and “kötü” are not synonyms, they have a certain similarity because both are adjectives, which makes the Word2vec similarity higher than expected. To avoid both problems, only the N most similar words in the Word2vec model are taken into account. N can typically be between 10 and 50; N = 25 is chosen in this study.
First, the words that differ between the reference and the hypothesis text are determined. If a differing word is in the Word2vec top-N list of its counterpart, its Word2vec similarity is added to the score.
Example: ProposedScore, Word2Vec, WordAccuracy
Hyp: Geçen hafta izlediğimiz film harika
Ref1: Geçen hafta izlediğimiz film mükemmel
Ref2: Geçen hafta izlediğimiz film kötü
DifferentWord(Hyp,Ref1) = mükemmel
DifferentWord(Hyp,Ref2) = kötü
Word2VecSim(harika -> mükemmel) = 0.80 (in top-N)
Word2VecSim(harika -> kötü) = 0.40 (not in top-N)
“mükemmel”
Hyp-Ref1: WER=1 WordAcc: 4 /5 =0.80
Hyp-Ref2: WER=1 WordAcc: 4 /5 =0.80
Hyp-Ref1: Word2vecSimilarity: 0.96
Hyp-Ref2: Word2vecSimilarity: 0.88
Hyp-Ref1 ProposedMetric: 0.96
Hyp-Ref2 ProposedMetric: 0.80
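The numbers in this example are consistent with giving each matching word a credit of 1, giving a differing word its Word2vec similarity only when it is in the top-N list, and dividing by the sentence length. The sketch below encodes that reading; it is an interpretation of the example, not the author's implementation. The `neighbours` table is a hand-written stand-in for a top-N lookup in a trained Word2vec model, and the sketch assumes equal-length, position-aligned sentences.

```python
def proposed_score(reference, hypothesis, top_n_neighbours):
    """Word accuracy with partial credit for top-N Word2vec neighbours."""
    if len(reference) != len(hypothesis):
        raise ValueError("this sketch only handles equal-length sentences")
    credit = 0.0
    for ref_word, hyp_word in zip(reference, hypothesis):
        if ref_word == hyp_word:
            credit += 1.0
        else:
            # A differing word earns its similarity only if it is a top-N neighbour.
            credit += top_n_neighbours.get(hyp_word, {}).get(ref_word, 0.0)
    return credit / len(reference)

# Hypothetical top-N table: "kötü" is absent because it falls outside the top N.
neighbours = {"harika": {"mükemmel": 0.80}}

hyp = "geçen hafta izlediğimiz film harika".split()
ref1 = "geçen hafta izlediğimiz film mükemmel".split()
ref2 = "geçen hafta izlediğimiz film kötü".split()

print(proposed_score(ref1, hyp, neighbours))  # (4 + 0.80) / 5
print(proposed_score(ref2, hyp, neighbours))  # 4 / 5
```

With this rule the synonym pair is rewarded while the antonym pair falls back to plain word accuracy, which matches the two ProposedMetric values in the example.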
IV. EXPERIMENTAL RESULTS
Feature Type
Table 4: Effect of Feature Type
Type | Train | Test | W2vec | Wacc | Met. | Rouge2 | Bleu |
Word | 36500 | 4060 | 80.5 | 71.5 | 63.7 | 58.0 | 49.3 |
Stem | 36500 | 4060 | 87.9 | 68.5 | 58.8 | 53.0 | 43.3 |
Words performed better than stems, because stems can take many different suffix forms, which reduces word accuracy. For example, the genitive suffix has many forms, such as ‘n’, ‘ın’, ‘in’, ‘un’, ‘ün’, ‘nın’, ‘nin’, ‘nun’, and ‘nün’. Stemming also increases the sequence length.
Example: Word or Stem Tokens
Word:bu kelimenin anlamı nedir|bu sözcüğün manası nedir
Stem:bu kelime nin anlam ı ne dir|bu sözcüğ ün mana sı ne dir
Data Size
Table 5: Effect of Data Size
Train | Test | Score | W2vec | Wacc | Met. | Rouge | Bleu |
74160 | 8295 | 85.6 | 87.3 | 84.7 | 81.2 | 74.5 | 72.7 |
143011 | 10767 | 88.1 | 89.7 | 87.7 | 83.6 | 79.0 | 75.6 |
182818 | 14694 | 90.0 | 91.0 | 89.4 | 86.3 | 79.9 | 79.4 |
215800 | 17930 | 93.8 | 93.9 | 93.5 | 91.7 | 86.8 | 87.0 |
As the size of the data increases, the success rates increase, because more data leads to better word alignment and fewer rare words. In addition, because paraphrasing takes place within the same language, the word order is the same and some words do not change, so the accuracy of the word alignment process is high.
With a training set of 215800 sentences, a word accuracy of 93.5% and a BLEU score of 87.0 are obtained (Table 5).
Sample Results
Table 6: Sample Paraphrasing Results
Ori: Birinci vazifen Türk istiklalini Türk Cumhuriyetini ilelebet muhafaza ve müdafaa etmektir |
Paraph: Birinci görevin Türk bağımsızlığını Türk Cumhuriyetini sonsuza kadar korumak ve savunmaktır |
Ori: Mevcudiyetinin ve istikbalinin yegane temeli budur |
Paraph: Varlığının ve geleceğinin tek temeli budur. |
Ori: Bu temel senin en kıymetli hazinendir |
Paraph: Bu esas senin en değerli hazinendir |
Ori: İstikbalde dahi seni bu hazineden mahrum etmek isteyecek dahili ve harici bedhahların olacaktır |
Paraph: Gelecekte bile seni bu hazineden yoksun bırakmak isteyecek iç ve dış düşmanların olacaktır |
Ori: Bir gün istiklal ve cumhuriyeti müdafaa mecburiyetine düşersen vazifeye atılmak için içinde bulunacağın vaziyetin imkan ve şeraitini düşünmeyeceksin |
Paraph: Bir gün bağımsızlık ve Cumhuriyeti savunmak zorunluluğuna düşersen, göreve atılmak için, bulunduğun durumun olanak ve şartlarını düşünmeyeceksin |
Ori: Bu imkan ve şerait çok namüsait bir mahiyette tezahür edebilir |
Paraph: Bu olanak ve şartlar, çok elverişsiz bir özellikte ortaya çıkabilir |
Ori: İstiklal ve cumhuriyetine kastedecek düşmanlar bütün dünyada emsali görülmemiş bir galibiyetin mümessili olabilirler |
Paraph: Bağımsızlık ve cumhuriyetini yok etmek isteyecek düşmanlar bütün dünyada eşi görülmemiş bir galibiyetin temsilcisi olabilirler |
V. CONCLUSION
In this study, we present the details of an automatic style transfer system that uses a supervised phrase-based statistical model. When a statistical phrase-based model is used for paraphrasing and sufficient data is available, the success rates are quite high. The main reason is that word alignment is highly accurate in a paraphrasing system: the source and target are in the same language, so the word order is similar and some words remain the same. We apply the system to convert text written in old Turkish to modern Turkish, and also to convert text written in modern Turkish to old Turkish. The importance of the proposed system lies in making old literature accessible to the new generation. At the same time, text in the old style can be generated for historical films and TV series.
REFERENCES
[1] Akay R., “The social and language reasons of the changes in language”, Uluslararası İnsan Bilimleri Dergisi, Vol. 4, No. 1, pp. 1-9, 2007.
[2] Shardlow M., “A survey of automated text simplification”, International Journal of Advanced Computer Science and Applications, Vol. 4, No. 1, pp. 58-70, 2014.
[3] Torunoglu-Selamet D., Pamay T., Eryigit G., “Simplification of Turkish sentences”, In The First International Conference on Turkic Computational Linguistics, pp. 55-59, 2016.
[4] Zhu Z., Bernhard D., Gurevych I., “A monolingual tree-based translation model for sentence simplification”, In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 1353-1361, 2010.
[5] Xu W., Ritter A., Dolan B., Grishman R., Cherry C., “Paraphrasing for style”, In Proceedings of COLING 2012, pp. 2899-2914, 2012.
[6] Jhamtani H., Gangal V., Hovy E., Nyberg E., “Shakespearizing modern language using copy-enriched sequence-to-sequence models”, arXiv preprint arXiv:1707.01161, 2017.
[7] Brown P. F., Della Pietra S. A., Della Pietra V. J., Mercer R. L., “The mathematics of statistical machine translation: Parameter estimation”, Computational Linguistics, Vol. 19, No. 2, pp. 263-311, 1993.
[8] Dyer C., Chahuneau V., Smith N. A., “A simple, fast, and effective reparameterization of IBM Model 2”, In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 644-648, 2013.
[9] Heafield K., “KenLM: Faster and smaller language model queries”, In Proceedings of the Sixth Workshop on Statistical Machine Translation, pp. 187-197, 2011.
[10] Papineni K., Roukos S., Ward T., Zhu W. J., “BLEU: a method for automatic evaluation of machine translation”, In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318, 2002.
[11] Levenshtein V. I., “Binary codes capable of correcting deletions, insertions, and reversals”, Soviet Physics Doklady, Vol. 10, No. 8, pp. 707-710, 1966.
[12] Lin C. Y., “ROUGE: A package for automatic evaluation of summaries”, In Text Summarization Branches Out, pp. 74-81, 2004.
[13] Banerjee S., Lavie A., “METEOR: An automatic metric for MT evaluation with improved correlation with human judgments”, In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65-72, 2005.
[14] https://github.com/cmu-mtlab/meteor (Web Access Time: 14.11.2020)
[15] Mikolov T., Chen K., Corrado G., Dean J., “Efficient estimation of word representations in vector space”, arXiv preprint arXiv:1301.3781, 2013.
[16] Sulem E., Abend O., Rappoport A., “BLEU is not suitable for the evaluation of text simplification”, arXiv preprint arXiv:1810.05995, 2018.