Thai sentence segmentation using categorial grammar and grammar rules

ณัฐชา ตังศิริรัตน์

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/37617

Title:	การแบ่งประโยคภาษาไทยโดยแคททิกอเรียลแกรมม่าและหลักเกณฑ์ไวยากรณ์
Other Titles:	Thai sentence segmentation using categorial grammar and grammar rules
Authors:	ณัฐชา ตังศิริรัตน์
Advisors:	อติวงศ์ สุชาโต; โปรดปราน บุณยพุกกณะ; ชัย วุฒิวิวัฒน์ชัย
Other author:	Chulalongkorn University. Faculty of Engineering
Advisor's Email:	[email protected]; [email protected]; not available
Subjects:	Thai language -- Sentences; Thai language -- Categorial grammar; Natural language processing (Computer science)
Issue Date:	2555 BE (2012 CE)
Publisher:	Chulalongkorn University
Abstract:	The sentence is a fundamental and very important unit in text processing tasks such as machine translation, information retrieval, and text summarization. The performance of such processing depends on the correctness of the sentences used as input, especially in Thai, which does not mark the end of a sentence explicitly. This thesis therefore proposes using categorial grammar, the number of words between the space under consideration and its neighboring spaces, and the number of words between the space under consideration and the end of the text as features in a statistical method. It also proposes applying a subset of the rules from the punctuation guidelines and the spacing guidelines set by the Royal Institute to improve the accuracy of the results obtained from the statistical learning method, in order to solve the Thai sentence segmentation problem. The experiments used text and annotations from the Thai speech corpus for speech synthesis (TsynC) and gave the following results: sentence-break accuracy (sentence-break-recall) of 84.11%, overall accuracy (space-correct) of 93.54%, and sentence-break error (false-break) of 2.99%.
Other Abstract:	A sentence is a fundamental unit in many text processing tasks, such as machine translation, information retrieval, and text summarization. The performance of these tasks therefore depends on the correctness of the sentences used as input, especially in Thai, which has no explicit sentence boundary marker. This thesis proposes combining a statistical method with a rule-based method to improve the accuracy of Thai sentence breaking. The statistical method uses three features: categorial grammar, the number of words between the space under consideration and the preceding and succeeding spaces, and the number of words between the space under consideration and the previous sentence break. The rule-based method is derived from “Rules for punctuation, space, and abbreviation” compiled by the Royal Institute, and is applied to the statistical method's results to reduce false breaks and increase overall accuracy. The research uses the Thai speech corpus for speech synthesis (TsynC) as training and testing data. The sentence-break-recall, space-correct, and false-break scores are 84.11%, 93.54%, and 2.99%, respectively.
Description:	Thesis (M.Eng.)--Chulalongkorn University, 2555 BE (2012 CE)
Degree Name:	Master of Engineering
Degree Level:	Master's degree
Degree Discipline:	Computer Engineering
URI:	http://cuir.car.chula.ac.th/handle/123456789/37617
URI:	http://doi.org/10.14457/CU.the.2012.1170
DOI:	10.14457/CU.the.2012.1170
Type:	Thesis
Appears in Collections:	Eng - Theses
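
Illustrative note on the abstract: the count-based features and the three reported scores can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the thesis's implementation. The `<space>` token, the function names `space_features` and `break_scores`, and the exact score definitions (recall over true sentence breaks; space-correct and false-break both taken over all spaces) are hypothetical choices made for illustration, since the record does not define them. The categorial grammar feature and the rule-based post-processing are not covered.

```python
from typing import Dict, List, Tuple

SPACE = "<space>"  # hypothetical marker for a space in word-segmented text


def space_features(tokens: List[str], is_break: List[bool]) -> List[Tuple[int, int, int]]:
    """Return, for each space, (words since the previous space,
    words until the next space, words since the previous sentence break).

    is_break[k] says whether the k-th space is a sentence break (assumed known,
    e.g. from annotated training data).
    """
    space_idx = [i for i, t in enumerate(tokens) if t == SPACE]
    feats = []
    last_break = -1  # position of the latest sentence break; -1 means start of text
    for k, i in enumerate(space_idx):
        prev_space = space_idx[k - 1] if k > 0 else -1
        next_space = space_idx[k + 1] if k + 1 < len(space_idx) else len(tokens)
        words_before = i - prev_space - 1
        words_after = next_space - i - 1
        # Count only word tokens between the previous break and this space.
        words_since_break = sum(1 for t in tokens[last_break + 1:i] if t != SPACE)
        feats.append((words_before, words_after, words_since_break))
        if is_break[k]:
            last_break = i
    return feats


def break_scores(gold: List[bool], pred: List[bool]) -> Dict[str, float]:
    """Compute the three scores under the assumed definitions described above."""
    total = len(gold)
    true_breaks = sum(gold)
    hits = sum(g and p for g, p in zip(gold, pred))
    correct = sum(g == p for g, p in zip(gold, pred))
    false_breaks = sum((not g) and p for g, p in zip(gold, pred))
    return {
        "sentence_break_recall": hits / true_breaks if true_breaks else 0.0,
        "space_correct": correct / total if total else 0.0,
        "false_break": false_breaks / total if total else 0.0,
    }
```

Each space yields one feature tuple that a statistical classifier could consume alongside other features, and `break_scores` compares predicted labels against annotated labels space by space.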

Files in This Item:

File	Description	Size	Format
nathacha_ta.pdf		2.75 MB	Adobe PDF