Table of Contents:
1. Introduction
1-1. Introduction
1-1-1. Dictionary-based machine translation
1-1-2. Rule-based machine translation
1-1-3. Knowledge-based machine translation
1-1-4. Corpus-based machine translation
Statistical machine translation
Example-based machine translation
Context-based machine translation
1-2. The need for building parallel corpora
1-3. Research problem: building a parallel corpus
1-4. Research objective: building a parallel corpus from a comparable corpus
1-5. Chapter outline
1-5-1. Chapter 2: Theoretical foundations
1-5-2. Chapter 3: Review of related work
1-5-3. Chapter 4: Proposed model
1-5-4. Chapter 5: Evaluation and conclusion
2. Theoretical foundations
2-1. Corpus
2-1-1. Parallel corpus
2-1-2. Comparable corpus
2-2. Alignment
2-2-1. Document-level alignment
2-2-2. Sentence-level alignment
2-2-3. Word-level alignment (lexical alignment)
Lexical alignment using the IBM models
2-3. Machine translation evaluation
2-3-1. BLEU
2-3-2. NIST metric
2-3-3. Word error rate
2-3-4. Translation error rate (TER)
3. Review of related work
3-1. Introduction
3-2. Building a parallel corpus from mutually translated texts
3-3. Extracting parallel sentences from the web
3-4. Extracting parallel sentences from comparable corpora
3-5. Identifying parallel sentences with a maximum entropy classifier
3-6. Building an English-Persian parallel corpus
4. Proposed model
4-1. Introduction
4-2. Selecting candidate parallel sentence pairs
4-2-1. Common-word filter
Character encoding conversion
Determining sentence and word boundaries
Stemming
Removing high-frequency words
Disambiguation
Looking up meanings in the dictionary
Grouping a sentence's repeated words together with their occurrence counts in the sentence
Algorithm for computing the common-word rate (from the source side)
4-3. Selecting parallel sentence pairs from among the candidate pairs
4-3-1. Maximum entropy classifier
4-3-2. General features
Features based on the lengths of the two sentences
Common-word rate
4-3-3. Features based on the word-level alignment of a sentence pair
Unaligned words
Fertility
Contiguous span
Alignment score
4-4. Improving the precision of the extracted parallel sentence pairs
4-5. Model evaluation method
5. Evaluation and conclusion
5-1. Evaluating the maximum entropy classifier
5-1-1. Feature evaluation
5-1-2. Domain sensitivity
5-2. Settings and experiments for building a parallel corpus from a comparable corpus
5-2-1. Comparable corpora used
University of Tehran Persian-English Comparable Corpus (UTPECC)
Comparable corpus derived from Wikipedia articles
5-2-2. Tuned parameters and tools used
Selecting candidate sentence pairs:
Selecting parallel sentence pairs:
Improving the precision of the extracted sentence pairs:
5-2-3. Evaluating the extracted parallel sentences using a machine translation system
5-3. Conclusion
5-4. Future work