An Intrinsic Plagiarism Detection of Thai Academic Texts Using a Support Vector Machine Model

ศิวพร ทวนไธสง

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/43733

Title:	การตรวจเทียบภายในหาการลักลอกงานวิชาการภาษาไทยโดยใช้แบบจำลองซัพพอร์ตเวกเตอร์แมชชีน
Other Titles:	AN INTRINSIC PLAGIARISM DETECTION OF THAI ACADEMIC TEXTS USING A SUPPORT VECTOR MACHINE MODEL
Authors:	ศิวพร ทวนไธสง
Advisors:	วิโรจน์ อรุณมานะกุล
Other author:	Chulalongkorn University. Faculty of Arts
Subjects:	Plagiarism; Support vector machines
Issue Date:	2013 (B.E. 2556)
Publisher:	Chulalongkorn University
Abstract:	This thesis aims to develop an intrinsic plagiarism detection system for Thai academic writing using a Support Vector Machine (SVM) model. It compares the performance of a model that takes words as input with one that takes characters as input, examines how statistical features and linguistic features affect the model, and assesses detection accuracy with respect to the length of the plagiarized text. The research uses a corpus built from 300 Thai-language graduate theses from Chulalongkorn University, totaling 5,155,589 words. It uses the statistical Support Vector Machine model in weka version 3.7.10, with paragraphs as input in both word-based and character-based form. The system is trained by supervised learning to give two kinds of answer: "yes" for a plagiarized paragraph and "no" for a non-plagiarized paragraph. The experiments with statistical features show that the feature set that performs best at detecting plagiarized paragraphs is the set of 7 statistical features computed from word-based input, which correctly detects 318 of 735 plagiarized paragraphs, a recall of 0.43. In the experiment with linguistic features, which compared the average of the highest-frequency words, word choice, and sets of misspelled words, these features could not separate the two classes of paragraph, even though the usage does differ in the data. The model fails because those features occur irregularly in the corpus. As for the effect of plagiarized-paragraph length on intrinsic detection, the results cannot yet establish a relationship between paragraph length and detection accuracy: the plagiarized paragraphs detected most accurately are the medium-length and long ones, with detection error rates of 16.55% and 36.67% respectively, while short paragraphs could not be detected at all, an error rate of 100%.
Other Abstract:	The main purpose of this study is to develop an intrinsic plagiarism detection system for Thai academic writing using a Support Vector Machine (SVM) model, to compare the performance of two kinds of input and two kinds of features, and to analyze whether the length of the input affects accuracy. The study uses 300 Thai-language graduate theses from Chulalongkorn University, comprising 5,155,589 words in total. The Support Vector Machine implementation used is libsvm, as available in weka 3.7.10. To compare word-based and character-based inputs, both types of input are prepared from the same data and use the same set of statistical features in the experiments. Supervised learning is used to train the model with two answers: “yes” for a plagiarized paragraph and “no” for a non-plagiarized paragraph. The word-based input with the set of 7 statistical features gives the best recall on the test data, 0.43, with 318 of 735 plagiarized paragraphs correctly classified. A demonstrative experiment with linguistic features, using spelling variation, fails to identify plagiarized paragraphs correctly, although those features are found in some plagiarized paragraphs. These linguistic features could not be used in the model because they do not occur regularly in plagiarized paragraphs. To examine whether input length affects the model, the results are grouped by paragraph length; however, the analysis could not establish any relation between performance and length, since the wrong-prediction rates are 16.55%, 36.67%, and 100% for medium-length, long, and short plagiarized paragraphs respectively.
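As a quick arithmetic check on the recall figure reported in the abstract, the Python sketch below recomputes it from the stated counts. It assumes the usual definition of recall (correctly detected plagiarized paragraphs divided by all plagiarized paragraphs); the function and variable names are illustrative and do not come from the thesis.

```python
def recall(true_positives: int, actual_positives: int) -> float:
    """Recall = correctly detected positives / all actual positives."""
    if actual_positives <= 0:
        raise ValueError("actual_positives must be positive")
    return true_positives / actual_positives


# Counts reported in the abstract (word-based input, 7 statistical features).
detected = 318            # plagiarized paragraphs correctly classified
total_plagiarized = 735   # all plagiarized paragraphs in the test data
missed = total_plagiarized - detected  # plagiarized paragraphs not detected

score = recall(detected, total_plagiarized)
print(f"recall = {score:.2f}, missed = {missed}")
```

Rounded to two decimals, this agrees with the 0.43 stated in the abstract.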
Description:	Thesis (M.A.)--Chulalongkorn University, 2013
Degree Name:	Master of Arts
Degree Level:	Master's Degree
Degree Discipline:	Linguistics
URI:	http://cuir.car.chula.ac.th/handle/123456789/43733
URI:	http://doi.org/10.14457/CU.the.2013.1193
metadata.dc.identifier.DOI:	10.14457/CU.the.2013.1193
Type:	Thesis
Appears in Collections:	Arts - Theses

Files in This Item:

File	Description	Size	Format
5380173722.pdf		2.27 MB	Adobe PDF
