WinterdrivE's/AyuRox's CV oto Tutorial; Proofread and everything! :D
Topic Started: Dec 6 2011, 07:51 PM (397 Views)
WinterdrivE

Firstly, and this is for otoing in general, you need to be familiar with the types of consonants, as each type of consonant is otoed differently. The types of consonants are as follows:
-plosive/stop (k, t, p, g, d, and b): Formed with a total obstruction of the airflow from the mouth
-fricative (s, sh, f, h, th, z, zh (the 's' in "pleasure"), v, and dh (the 'th' in "they")): Formed by forcing air through a narrow gap by placing 2 articulators close together (articulators including, but not limited to, the tongue, teeth and the roof of the mouth)
-affricate (ch, j, and in Japanese, ts): Start as plosives and release as fricatives
-approximant (l, y, w, and the English r): Not quite vowels, but not quite fricatives, approximants fall somewhere in between
-nasal (n, m, ng): Formed by only allowing air to flow through the nose

*Please note that the above consonants are only examples of each type of consonant. Others do exist.

The Japanese 'r' is classified as a tap consonant, so it's essentially a very fast plosive, but I've noticed that it can lean towards being a fricative or approximant if it's sustained.
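
If you want to keep the consonant types straight while working through a voicebank, a small lookup table can help. This is just a sketch in Python built from the example consonants listed above; the names `CONSONANT_TYPES` and `consonant_type` are made up for illustration, and 'r' is filed as a tap on the assumption that the voicebank is Japanese.

```python
# Consonant classes as listed in the tutorial (examples only, not exhaustive).
CONSONANT_TYPES = {
    "plosive": ["k", "t", "p", "g", "d", "b"],
    "fricative": ["s", "sh", "f", "h", "th", "z", "zh", "v", "dh"],
    "affricate": ["ch", "j", "ts"],
    "approximant": ["l", "y", "w"],
    "nasal": ["n", "m", "ng"],
    "tap": ["r"],  # assumption: Japanese voicebank, where 'r' is a tap
}


def consonant_type(consonant):
    """Return the class name for a romanized consonant, or None if unlisted."""
    for name, members in CONSONANT_TYPES.items():
        if consonant in members:
            return name
    return None


print(consonant_type("ts"))
```

Anything that comes back as None is a consonant the list above doesn't cover, so you'd have to classify it by ear.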

Now, as I said, each type of consonant is otoed slightly differently.

Plosive - Because plosives stop the airflow, there is no overlap between the plosive and whatever comes before it, so when otoing plosives, leave the green overlap bar at 0. Depending on whether the plosives are pronounced in an exaggerated way (e.g. Defoko) or not (e.g. Teto), it can be relatively easy to find the consonant on the waveform. But generally, it's the short quiet part that comes before the vowel. If it's unvoiced (k, t, p), it looks kinda like static, so make sure not to cut it off entirely. If it's voiced (g, d, b), it usually has more of a regular waveform, but it's still easily distinguishable from the vowel. The red Note-On line should go right between the consonant and the vowel, where the vowel starts. If the consonant is very short (and again, Teto is a good example of this), you may want to use a negative overlap or include some extra silence at the beginning of the sample rather than cutting it out. Alternatively, you can do this manually through the note properties in the editor itself for each note that has a plosive.

Fricative - Fricatives kinda glide between vowels, so you're gonna want to have the green overlap line moved over some. Generally I try to move it over about 30 ms. I try to have the red Note-On line at about 80-90 ms, but if the consonant isn't that long, then as far over as the consonant allows. Again, like unvoiced plosives, the consonant generally looks like static, just very loud static, so it's fairly easy to distinguish from the vowel. The red Note-On line needs to be positioned, again, right where the vowel starts.

Affricate - Since they have a stop at the beginning, the overlap doesn't need to be inside the consonant. Actually, I find it best to set the overlap to about -15 to -20 ms (yes, negative). The reason for the negative overlap is so that if the consonant happens to be particularly long, part of it can be cut off and the stop can still be imitated by the negative overlap. I like to have the red Note-On line at about 60-70 ms for affricates and, as always, positioned right where the vowel starts. And like fricatives, the consonant portion will look like static and will be easily distinguishable from the vowel.

Approximant/nasal - These two are otoed pretty much the same way, so I'll explain them both at once. The consonant can be a little harder to find with these, particularly if the consonant is the same volume as the vowel. So it's a bit easier to look at the shape of the waveform itself to find the vowel. The waveform should look one way at the far left, and then as you scan to the right, it'll change, and that's where the vowel starts. For approximants I like to have the overlap set to 30-40 ms and the Note-On at 70-80 ms. For nasal consonants, I like to have the overlap at about 25-35 ms and the Note-On at 60-70 ms. The red Note-On line should be positioned before where the vowel is fully voiced, but after where the consonant is fully voiced. I find it better to have it closer to the left (closer to the consonant).

The Japanese 'r,' like I said, is kind of unique. If it's recorded closer to a plosive (it'll sound more like a 'd'), then you'll want to have less overlap, and if it's recorded more like a fricative/approximant (it'll sound more like an 'l'), then you'll want more overlap.
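
To have the numbers above in one place, here is a rough Python sketch that stores them as ranges and returns the midpoint as a first guess. These are only the starting preferences from this tutorial, not fixed rules, and `STARTING_RANGES` and `starting_values` are names invented for the example. No Note-On number is given above for plosives, so that entry is left as None, and the Japanese 'r' is left out because it depends on how it was recorded.

```python
# (low, high) ranges in milliseconds, taken from the tutorial text.
# None means the tutorial gives no number for that setting.
STARTING_RANGES = {
    "plosive":     {"overlap": (0, 0),     "note_on": None},
    "fricative":   {"overlap": (30, 30),   "note_on": (80, 90)},
    "affricate":   {"overlap": (-20, -15), "note_on": (60, 70)},
    "approximant": {"overlap": (30, 40),   "note_on": (70, 80)},
    "nasal":       {"overlap": (25, 35),   "note_on": (60, 70)},
}


def starting_values(kind):
    """Return the midpoint of each range as a first guess, in ms."""
    result = {}
    for setting, span in STARTING_RANGES[kind].items():
        result[setting] = None if span is None else (span[0] + span[1]) / 2
    return result


print(starting_values("nasal"))
```

Whatever number comes out, the red line still has to end up where the vowel actually starts in the recording, so treat these as a place to begin and adjust by ear.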

And regardless of the type of consonant, the pink unstretched portion needs to be extended to the right until the vowel stabilizes. This is generally when the volume also stabilizes. But be wary of yoon syllables (bye, hye, kya, pyu, etc.) because the short 'y' will look like the vowel, but it should stay inside the unstretched area. If the pink area isn't over far enough on a yoon syllable, it'll sound like "kyyyyyyyyeeeeee" instead of just "kyeeeeee." A general rule of thumb for the pink area is that it's always better to have it too far to the right than too far to the left.

The left blue/purple portion marks where the data starts. It also allows you to control the length of the consonant. So if you have a recording where the consonant is too long for your liking, move the blue/purple portion to the right to cut out part of the consonant.

The end of the data, the right blue/purple section, should be moved to the left until the fade-out of the vowel is cut off (the fade-out of the vowel looks like a ">" in the waveform).
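
Everything set with the colored areas ends up as one line per sample in the voicebank's oto.ini. As far as I understand the format, each line is `file=alias,offset,consonant,cutoff,preutterance,overlap` with the numbers in milliseconds, where offset is the left blue/purple area, consonant is the pink area, cutoff is the right blue/purple area, preutterance is the red Note-On line, and overlap is the green line. Here's a minimal Python sketch that builds such a line; `oto_line` is a made-up helper and the numbers are invented example values, not measurements from a real recording.

```python
def oto_line(wav, alias, offset, consonant, cutoff, preutterance, overlap):
    """Format one oto.ini entry; all numeric values are in milliseconds."""
    numbers = [offset, consonant, cutoff, preutterance, overlap]
    fields = ",".join(f"{n:g}" for n in numbers)
    return f"{wav}={alias},{fields}"


# Hypothetical plosive sample: overlap left at 0, as suggested above.
print(oto_line("ka.wav", "ka", 40, 120, 60, 55, 0))
# prints "ka.wav=ka,40,120,60,55,0"
```

It's worth opening the result in the editor afterwards to check that each line landed on the part of the waveform you expected.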

Also, when using the editor, if you double-click the "P" at the top left of the box, it'll play a sample of the recording, but it'll only play what's in between the 2 blue/purple areas, so it can help you hear and adjust the position of the blue/purple areas. It can also be used to find where the vowel starts by guess and check on the more difficult syllables that involve nasal and approximant consonants. Start by moving the left blue area over so it ends where you think the vowel starts and play the preview. If you no longer hear the consonant, then you know that the vowel starts to the left of where the blue area ends. If you still hear some of the consonant, then you know the vowel starts to the right of where the blue area ends. Ideally you should narrow it down to a point where it's as far to the left as possible without hearing much of the consonant, if at all. After you locate the vowel this way, you can move the blue area back to where it should be and move the red line over to where the blue area ended, which is where the vowel starts.
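
The guess-and-check routine above is basically a bisection: each listen halves the range where the vowel could start. Here's a Python sketch of that idea. The function `consonant_audible` stands in for you moving the blue area, pressing P and listening, so it's a placeholder and not something UTAU provides, and the sketch assumes the consonant is audible at the left end of the range and gone at the right end.

```python
def find_vowel_start(consonant_audible, lo_ms, hi_ms, tolerance_ms=5):
    """Narrow down the vowel start by repeated guessing.

    consonant_audible(t) means: with the left blue area ending at t,
    is any consonant still heard in the preview?
    """
    while hi_ms - lo_ms > tolerance_ms:
        guess = (lo_ms + hi_ms) / 2
        if consonant_audible(guess):
            lo_ms = guess  # still consonant: vowel starts further right
        else:
            hi_ms = guess  # no consonant: vowel starts at or left of here
    return hi_ms


# Simulated listener for a sample whose vowel really starts at 83 ms.
print(find_vowel_start(lambda t: t < 83, 0, 200))
```

Returning the right end of the final range matches the advice above: as far left as possible without hearing the consonant.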

If you prefer not to guess and check, click the s button next to the P button. It changes the sample from a waveform to an acoustic spectrogram. On a spectrogram, higher frequencies are higher up and lower frequencies are lower down. Sound is represented by blue portions on the spectrogram. The lighter it is, the louder the sound is at that frequency. The vowel generally looks like a very light blue or white band going across the lower part of the spectrogram. Because of this, the spectrogram can be extremely useful for finding the consonant and vowel on approximants, yoon syllables, and nasal consonants.

Because approximants and nasal consonants still have an orderly waveform, unlike other consonants, they can be very hard to distinguish from the vowel using the waveform. But on the spectrogram, these consonants almost always look different from the vowel. Often they create a break in the vowel's solid white band, or register somewhere else entirely. Even when they register mostly in the same place as the vowel (they still have a regular pitch, after all), most sounds, including vowels, also have a few fainter bands higher on the spectrogram (called harmonics), and more often than not, the harmonics will be different between sounds.

So by looking for both the solid band the vowel creates and the harmonics higher on the spectrogram, you can locate the vowel for approximants, yoon syllables, and nasal consonants much more easily.
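
To show why the upper bands help, here's a toy Python sketch that measures how strong one frequency is in a short chunk of audio, which is what a single spot on a spectrogram represents. The two "sounds" are made-up sine mixes, not real recordings: both share a 200 Hz pitch, but only the pretend vowel has an extra band at 2000 Hz. Real consonants and vowels are far messier, so this only illustrates the idea.

```python
import math


def band_magnitude(samples, sample_rate, freq_hz):
    """Strength of one frequency in a chunk of audio (single-bin DFT)."""
    step = 2 * math.pi * freq_hz / sample_rate
    re = sum(s * math.cos(step * i) for i, s in enumerate(samples))
    im = sum(s * math.sin(step * i) for i, s in enumerate(samples))
    return math.hypot(re, im) / len(samples)


RATE = 8000  # samples per second (hypothetical)
N = 400      # 50 ms chunk


def tone(freq_hz, amp):
    return [amp * math.sin(2 * math.pi * freq_hz * i / RATE) for i in range(N)]


nasal = tone(200, 1.0)  # pitch only
vowel = [a + b for a, b in zip(tone(200, 1.0), tone(2000, 0.5))]  # pitch + upper band

# Same strength at the pitch, very different strength higher up.
print(band_magnitude(nasal, RATE, 200), band_magnitude(vowel, RATE, 200))
print(band_magnitude(nasal, RATE, 2000), band_magnitude(vowel, RATE, 2000))
```

At 200 Hz the two chunks look alike, which is the "same place as the vowel" problem; the difference only shows up at 2000 Hz, just like the fainter upper bands on the spectrogram.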

Hope this helped. If you have any questions, feel free to post them here or PM me.
Edited by WinterdrivE, Dec 6 2011, 07:58 PM.
aka AyuRox1 on VO