Design of inverted file for Thai-text retrieval

สมชาย ประสิทธิ์จูตระกูล

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/5608

Title:	การออกแบบแฟ้มผกผันเพื่อการค้นคืนข้อความไทย
Other Titles:	Design of inverted file for Thai-text retrieval
Authors:	สมชาย ประสิทธิ์จูตระกูล
Other author:	Chulalongkorn University. Department of Computer Engineering
Subjects:	Index files; Information storage and retrieval systems
Issue Date:	1998 (B.E. 2541)
Publisher:	Chulalongkorn University
Abstract:	This work presents a word-finding algorithm for building the index of a Thai-text retrieval system based on an inverted file structure. Words are separated with the help of a dictionary. Text containing words that are not in the dictionary, such as transliterated or misspelled words, is handled with a set of Thai syllable-separation rules. The algorithm models the problem as a word adjacency-overlapping graph whose vertices represent words and whose edges represent adjacency or overlap between words. A shortest path from the left-most vertex to the right-most vertex of this graph gives the list of base words to be indexed in the inverted file. The running time is O(n^2), where n is the length of the text. The algorithm is applied both when preparing documents before indexing and when processing queries before searching. In experiments, the number of words selected for indexing was approximately 30-50% of all possible words appearing in the test text. This work also presents an algorithm for encoding transliterated words to support cross-language retrieval from English to Thai: the system can retrieve documents containing an English keyword as well as documents containing its Thai transliteration. The encoding adapts the phonetic encoding method and code table of Soundex and runs in time linear in the word length. Experimental results showed recall and precision above 80% when only words whose phonetic codes are longer than four are considered.
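The shortest-path idea in the abstract can be illustrated with a much simpler model than the report's word adjacency-overlapping graph. The sketch below is an assumption-laden simplification, not the author's algorithm: vertices are character positions, each dictionary word is an edge of cost 1, and a single unknown character is an edge of higher cost (a stand-in for the syllable-separation rules, which the abstract does not spell out). The names `segment` and `unknown_cost` are hypothetical.

```python
def segment(text, dictionary, unknown_cost=10):
    """Split text into the fewest dictionary words via a shortest path.

    Illustrative simplification: vertices are character positions,
    a dictionary word spanning text[i:j] is an edge i -> j of cost 1,
    and one unknown character is an edge i -> i+1 of cost unknown_cost.
    """
    n = len(text)
    max_len = max((len(w) for w in dictionary), default=1)
    best = [float("inf")] * (n + 1)  # best[i]: cheapest cost to cover text[:i]
    back = [0] * (n + 1)             # back[j]: start of the last token ending at j
    best[0] = 0

    # Positions are already in topological order, so one left-to-right pass suffices.
    for i in range(n):
        # Edges for dictionary words starting at position i.
        for j in range(i + 1, min(n, i + max_len) + 1):
            if text[i:j] in dictionary and best[i] + 1 < best[j]:
                best[j] = best[i] + 1
                back[j] = i
        # Fallback edge: treat one character as an unknown token.
        if best[i] + unknown_cost < best[i + 1]:
            best[i + 1] = best[i] + unknown_cost
            back[i + 1] = i

    # Walk the back-pointers from the right end to recover the tokens.
    tokens = []
    j = n
    while j > 0:
        i = back[j]
        tokens.append(text[i:j])
        j = i
    return tokens[::-1]


words = {"ไป", "โรง", "เรียน", "โรงเรียน"}
print(segment("ไปโรงเรียน", words))
```

With this toy dictionary the path through "โรงเรียน" uses fewer edges than the path through "โรง" and "เรียน", so the longer word is selected. The nested loop examines at most n end positions for each of n start positions, which is in line with the quadratic bound quoted in the abstract.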
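For reference, the encoding that the report adapts is Soundex. The abstract does not give the modified code table or the Thai-side rules, so the sketch below shows only classic American Soundex for English words, padded or truncated to four characters; it should not be read as the report's encoder. The function name `soundex` and the table `CODES` are illustrative.

```python
# Classic American Soundex letter groups; letters not listed (vowels, H, W, Y) carry no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character classic Soundex code of an English word."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")  # code of the previous significant letter
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (first + "".join(digits) + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))
```

Each letter is visited once, so the running time is linear in the word length, the same property the abstract states for the modified encoder. Similar-sounding spellings such as "Robert" and "Rupert" collapse to one code, which is the behaviour that makes a Soundex-style key usable for matching an English keyword against its transliteration.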
URI:	http://cuir.car.chula.ac.th/handle/123456789/5608
Type:	Technical Report
Appears in Collections:	Eng - Research Reports

Files in This Item:

File	Description	Size	Format
somchaiPra.pdf		5.84 MB	Adobe PDF