Identification of Thai and transliterated words by N-Gram Models

อัครพล เอกวงศ์อนันต์

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/8413

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	วิโรจน์ อรุณมานะกุล	-
dc.contributor.author	อัครพล เอกวงศ์อนันต์	-
dc.contributor.other	Chulalongkorn University. Faculty of Arts	-
dc.date.accessioned	2008-11-07T01:28:36Z	-
dc.date.available	2008-11-07T01:28:36Z	-
dc.date.issued	2548	-
dc.identifier.isbn	9745323608	-
dc.identifier.uri	http://cuir.car.chula.ac.th/handle/123456789/8413	-
dc.description	Thesis (M.A.)--Chulalongkorn University, 2548 B.E. (2005 C.E.)	en
dc.description.abstract	The objectives of this research are to obtain unique character sequences for identifying the language of a word, using corpora of Thai words and of English, Japanese, and French transliterated words, and to develop systems that identify the language of Thai words and foreign transliterated words using unique character sequences and n-gram models of 1 to 5 grams. The corpora used in this research contain Thai words, English transliterated words, and Japanese transliterated words, 10,000 words per language, and 1,000 French transliterated words. The words were collected from naturally occurring language and may not be transliterated according to the Royal Institute's criteria. 80% of the corpora was used to find unique character sequences and to build an n-gram model for each language, while the other 20% was used to test the various systems. The unique character sequences found reflect the characteristics of each language to some degree. As a result, the system that uses unique character sequences identifies the language correctly for 50.58%, 48.71%, 54.09%, and 20.40% of Thai words and English, Japanese, and French transliterated words, respectively. When n-gram models are used, the system identifies the language of Thai words and English and Japanese transliterated words with more than 90% accuracy, but with only about 60% accuracy for French transliterated words. The results confirm that the size of the training data affects the performance of both language identification systems. In addition, the system using the 3-gram model performs better than the systems using other gram sizes, which leads to the conclusion that n-gram size affects the performance of the language identification system.	en
dc.description.abstractalternative	This research aims to find the unique character sequences of Thai words and of English, Japanese, and French transliterated words, and to implement language identification systems using unique character sequences and n-gram models (1-5 grams). The corpora in this research consist of 10,000 Thai words, 10,000 English transliterated words, 10,000 Japanese transliterated words, and 1,000 French transliterated words. The transliterated words were collected from naturally occurring texts, even though some of them do not conform to the Royal Institute's transliteration guidelines. 80% of the corpora is used to extract unique character sequences and to build an n-gram language model of each language, while the other 20% is used for testing the systems. The unique character sequences reflect some characteristics of the languages. As a result, the system using unique character sequences correctly identifies the language of 50.58%, 48.71%, 54.09%, and 20.40% of Thai words and English, Japanese, and French transliterated words, respectively. When an n-gram language model is used, the system correctly identifies the language of more than 90% of Thai words and English and Japanese transliterated words, but only about 60% of French transliterated words. This confirms that the size of the training corpus affects the performance of both systems. The results also show that the system using the 3-gram model performs better than the systems using other n-gram sizes. We therefore conclude that n-gram size affects the performance of the language identification system.	en
dc.format.extent	2580607 bytes	-
dc.format.mimetype	application/pdf	-
dc.language.iso	th	es
dc.publisher	Chulalongkorn University	en
dc.rights	Chulalongkorn University	en
dc.subject	Thai language -- Transliteration	en
dc.subject	Thai language -- Usage	en
dc.subject	N-gram models	en
dc.title	การระบุคำไทยและคำทับศัพท์ด้วยแบบจำลองเอ็นแกรม	en
dc.title.alternative	Identification of Thai and transliterated words by N-Gram Models	en
dc.type	Thesis	es
dc.degree.name	Master of Arts	es
dc.degree.level	Master's Degree	es
dc.degree.discipline	Linguistics	es
dc.degree.grantor	Chulalongkorn University	en
dc.email.advisor	[email protected]	-
Appears in Collections:	Arts - Theses
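
Note on the abstract: the two approaches it describes (unique character sequences and character n-gram models) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The abstract does not state how words were padded, how probabilities were smoothed, or how scores were compared, so the boundary markers `^` and `$`, the add-one smoothing, and all function names below are assumptions made for illustration. The toy word lists are also illustrative and are not drawn from the thesis corpora.

```python
import math
from collections import Counter


def char_ngrams(word, n):
    """Return the character n-grams of a word, with assumed boundary markers."""
    padded = "^" * (n - 1) + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(words, n):
    """Count n-grams and their (n-1)-character contexts for one language."""
    grams, contexts = Counter(), Counter()
    for word in words:
        for gram in char_ngrams(word, n):
            grams[gram] += 1
            contexts[gram[:-1]] += 1
    return grams, contexts


def log_prob(word, model, n, vocab_size):
    """Log-probability of a word under one model, with add-one smoothing (assumed)."""
    grams, contexts = model
    return sum(
        math.log((grams[g] + 1) / (contexts[g[:-1]] + vocab_size))
        for g in char_ngrams(word, n)
    )


def identify(word, models, n, vocab_size):
    """Pick the language whose n-gram model gives the word the highest score."""
    return max(models, key=lambda lang: log_prob(word, models[lang], n, vocab_size))


def unique_sequences(corpora, n):
    """Return, per language, the n-grams seen in that language's words only."""
    seen = {
        lang: {g for w in words for g in char_ngrams(w, n)}
        for lang, words in corpora.items()
    }
    return {
        lang: grams - set().union(*(s for other, s in seen.items() if other != lang))
        for lang, grams in seen.items()
    }


# Toy training data (illustrative only, not from the thesis corpora).
corpora = {
    "thai": ["บ้าน", "กิน", "นอน"],
    "japanese": ["ซูชิ", "ซาชิมิ", "ซามูไร"],
}
n = 2
# Distinct characters in the training data, plus the end marker "$".
vocab_size = len({c for words in corpora.values() for w in words for c in w}) + 1
models = {lang: train(words, n) for lang, words in corpora.items()}

print(identify("ซาชิ", models, n, vocab_size))
print(sorted(unique_sequences(corpora, 1)["thai"]))
```

With word lists this small, the output only shows the mechanics. The accuracy figures in the abstract come from the 80%/20% split of the full corpora, not from anything this sketch can reproduce.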

Files in This Item:

File	Description	Size	Format
akarapol.pdf		2.52 MB	Adobe PDF
