An Improvement of Acoustic Model for Enhancing Naturalness in the Synthesized Thai Speech

ศุภเดช ฉันจรัสวิชัย

Please use this identifier to cite or link to this item: https://cuir.car.chula.ac.th/handle/123456789/58103

Title:	การปรับปรุงแบบจำลองเสียงเพื่อเพิ่มความเป็นธรรมชาติของเสียงสังเคราะห์ภาษาไทย
Other Titles:	An Improvement of Acoustic Model for Enhancing Naturalness in the Synthesized Thai Speech
Authors:	ศุภเดช ฉันจรัสวิชัย
Advisors:	อติวงศ์ สุชาโต; โปรดปราน บุณยพุกกณะ
Other author:	Chulalongkorn University. Faculty of Engineering
Issue Date:	2560 B.E. (2017 CE)
Publisher:	Chulalongkorn University
Abstract:	Speech synthesis is an alternative technology for accessing text-based information. The intelligibility and naturalness of synthesized speech directly affect how well listeners understand the information carried by the speech signal. This research therefore aims to improve the naturalness and intelligibility of synthesized speech generated from parameters of the STRAIGHT vocoder, where those parameters are produced by hidden Markov models (HMMs) and deep neural networks (DNNs). Three ideas are proposed. 1) The fundamental-frequency feature model and the spectral feature model are separated and trained independently, yielding HMMs that generate the STRAIGHT vocoder parameters corresponding to each model; the thesis also proposes algorithms for time-aligning the parameters generated by the two models. 2) The input features of the DNN used to generate STRAIGHT vocoder parameters are changed from the conventional contextual features to the HMMs that result from the decision trees used for context clustering. 3) The output features of the DNN are normalized using the means and variances of the HMMs that result from the decision trees. Two kinds of evaluation were carried out: 1) an objective evaluation using the mel-cepstral distortion of the mel-cepstral coefficients (MGC_MCD), the mel-cepstral distortion of the band aperiodicity (BAP_MCD), the mismatch in voicing state (LF0_UVU), and the root mean square error of the fundamental frequency (LF0_RMSE); and 2) a subjective evaluation with 9 listeners, who rated the intelligibility and naturalness of the synthesized speech. In the objective evaluation, applying ideas 2 and 3 to the DNN produced STRAIGHT vocoder parameters closer to the original speech than applying idea 1 to the HMM, and closer than both the baseline HMM and the baseline DNN. In the subjective evaluation, applying idea 1 to the HMM produced speech that was more natural and more intelligible than the other ideas and both baseline models.
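The third idea, normalizing DNN outputs with HMM means and variances, can be illustrated with a minimal Python sketch. It is not taken from the thesis: it assumes each frame has already been aligned to one clustered HMM state, that each state carries a diagonal mean and variance, and the function names, the `eps` guard, and the data layout are all hypothetical.

```python
import math

def normalize_by_state(frames, state_ids, means, variances, eps=1e-8):
    """Normalize each frame with the mean/variance of its aligned HMM state.

    frames:     list of T feature vectors (lists of floats)
    state_ids:  list of T indices into `means` / `variances` (assumed alignment)
    means, variances: per-state diagonal statistics (assumed layout)
    """
    out = []
    for frame, s in zip(frames, state_ids):
        out.append([(x - m) / math.sqrt(v + eps)
                    for x, m, v in zip(frame, means[s], variances[s])])
    return out

def denormalize_by_state(frames, state_ids, means, variances, eps=1e-8):
    """Invert normalize_by_state, mapping network outputs back to feature space."""
    out = []
    for frame, s in zip(frames, state_ids):
        out.append([z * math.sqrt(v + eps) + m
                    for z, m, v in zip(frame, means[s], variances[s])])
    return out
```

Under these assumptions the network would be trained on the normalized targets, and its predictions would be passed through `denormalize_by_state` before vocoding. The contrast is with a single global mean and variance shared by all frames.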
Other Abstract:	Speech synthesis converts text to speech signals. The naturalness and intelligibility of synthesized speech affect the listeners' understanding of the content conveyed by the speech signal. This dissertation proposes three approaches to improving the naturalness and intelligibility of synthesized speech generated from STRAIGHT parameters, focusing on parameters generated by either Hidden Markov Models (HMMs) or Deep Neural Networks (DNNs). The first approach separates the spectral-feature models from the fundamental-frequency models. The two types of models are trained independently to obtain HMM parameters optimized for generating their respective STRAIGHT parameters, and algorithms are proposed for time-aligning the parameters generated separately by the two models. The second approach changes the inputs of the DNNs used for generating STRAIGHT parameters from the typical direct phonetic contexts to the HMMs resulting from context-clustering decision trees. The third approach normalizes the DNN outputs using the means and variances of the HMMs resulting from the decision trees. The objective measures were Mel cepstral distortion of the Mel cepstral coefficients of the spectral filter (MGC_MCD), Mel cepstral distortion of the coefficients of the band aperiodicity filter (BAP_MCD), root mean square error of the fundamental frequency (LF0_RMSE), and the count of unmatched voicing conditions between natural and synthesized speech (LF0_UVU). Nine participants were recruited for a subjective evaluation in which they rated the synthesized utterances for naturalness and intelligibility. In the objective test, applying the second and third approaches to DNN-generated STRAIGHT parameters gave results closer to natural speech than applying the first approach to the HMMs, and closer than the baseline HMM and DNN methods. In the subjective test, applying the first approach to the HMMs outperformed the other methods.
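The objective measures named in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code used in the thesis: it uses the common textbook form of Mel cepstral distortion (in dB, skipping the 0th coefficient), assumes reference and synthesized frames are already time-aligned and equal in number, and assumes F0 is given in Hz with 0.0 marking unvoiced frames. The function names are hypothetical.

```python
import math

def mel_cepstral_distortion(ref, syn):
    """Mean MCD in dB over time-aligned frames, skipping coefficient 0."""
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for r, s in zip(ref, syn):
        # Euclidean distance over coefficients 1..D of one frame pair
        total += const * math.sqrt(sum((a - b) ** 2 for a, b in zip(r[1:], s[1:])))
    return total / len(ref)

def f0_metrics(ref_f0, syn_f0):
    """Return (RMSE over frames voiced in both signals, voicing-mismatch count)."""
    sq_err, n_voiced, mismatches = 0.0, 0, 0
    for r, s in zip(ref_f0, syn_f0):
        r_voiced, s_voiced = r > 0.0, s > 0.0   # assumed convention: 0.0 = unvoiced
        if r_voiced != s_voiced:
            mismatches += 1
        elif r_voiced:
            sq_err += (r - s) ** 2
            n_voiced += 1
    rmse = math.sqrt(sq_err / n_voiced) if n_voiced else float("nan")
    return rmse, mismatches
```

In this sketch the same `mel_cepstral_distortion` function would serve for both MGC_MCD and BAP_MCD by passing the corresponding coefficient sequences, while `f0_metrics` returns the quantities corresponding to LF0_RMSE and LF0_UVU.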
Description:	Thesis (D.Eng.)--Chulalongkorn University, 2017
Degree Name:	Doctor of Engineering
Degree Level:	Doctoral Degree
Degree Discipline:	Computer Engineering
URI:	http://cuir.car.chula.ac.th/handle/123456789/58103
URI:	http://doi.org/10.58837/CHULA.THE.2017.1378
metadata.dc.identifier.DOI:	10.58837/CHULA.THE.2017.1378
Type:	Thesis
Appears in Collections:	Eng - Theses

Files in This Item:

File	Description	Size	Format
5571431321.pdf		4.57 MB	Adobe PDF