I’ve been exploring ISiS 1.3.0 in some detail over the past few weeks. I have a couple of questions and observations that others might also be interested in. Axel has already answered one of them but I’m pasting it in here so that others can find the answer too…
- Loudness accents
I didn’t initially realise that these values have to be somewhere between 0 and 1 - the documentation at one point talks about “specifying a loudness accent of 2” and the default score only uses 1’s and 0’s. I also got caught out by not allowing for the initial silence in my loud_accents list - whether this is explicitly included in your lyrics/notes/rhythms or not (and therefore added by default), you need to put in a value for it, otherwise your notes and accent values will be out of alignment, even though no error message will be returned.
- Silences at start and end of output files
I wondered how much control there was over these? Can they be omitted - or, by any chance, can the breathing be made even more exaggerated/audible?!
- @ phoneme followed by rest
@ issue.score.cfg (682 Bytes)
Using the EL corpus I get an abrupt and slightly noisy end to the ‘@‘ vowel when it is followed by a silence. This doesn’t happen when ‘@‘ is the last phoneme in the song. See attached score file to (hopefully) reproduce.
Answer to 2)
Concerning the durations: they are controlled in the song cfg. In the rhythm section you have the durations for each note, including silences and the start and end silences.
For example in
# 60 / tempo
rhythm: # _ c’est une chan - son
2, 1.54583333, 1.525, 1.525, 7.25833333,
The “two” is the duration of the silence at the start. You can reduce the duration but not delete the start and end silences completely. The ISiS algorithm needs them for synthesis, so if they are not there, ISiS will insert them, probably with duration 1.
If you want them short, you can simply select a short rhythmic duration. There is no limit besides needing to be > 0, so there should exist a point when they perceptually disappear.
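As a rough illustration of the point above, here is a small Python sketch (not part of ISiS) that shortens the start and end silences of a rhythm list before it is written back into the cfg. It assumes the first and last entries are the start and end silences; the function name and the final end-silence value of 2 are made up for the example, since the excerpt above does not show the end of the rhythm line.

```python
def shorten_edge_silences(rhythm, edge_duration=0.1):
    """Return a copy of rhythm with the first and last durations replaced.

    Assumption: rhythm[0] is the start silence and rhythm[-1] the end silence.
    """
    if edge_duration <= 0:
        # The silences cannot be removed, only made short, so 0 is rejected.
        raise ValueError("silence duration must be > 0")
    if len(rhythm) < 2:
        raise ValueError("rhythm needs at least a start and an end silence")
    return [edge_duration] + list(rhythm[1:-1]) + [edge_duration]


# Durations from the excerpt above, plus a hypothetical end silence of 2.
rhythm = [2, 1.54583333, 1.525, 1.525, 7.25833333, 2]
short = shorten_edge_silences(rhythm, edge_duration=0.05)
print(", ".join(str(d) for d in short))
```

How short the silences can get before they perceptually disappear is something to find by ear; the 0.05 here is only a starting guess.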
Concerning the breathing: ISiS is currently a concatenation model. In the recordings we have silence annotation but no breathing annotation, so whether you get breathing or not depends on which of the many silences is selected. As we record words, all silences will necessarily be start and end silences of the recordings. We currently don’t have any breathing detection, so it would be difficult to get control over the breathing.
Answer to 1)
Loudness accents are scaling factors that are applied to the loudness contour. Factor 1 is neutral and is the default. I don’t see any reason why you think they should be between 0 and 1; you can make them > 1, but given that these factors are applied to the original recordings, you may run into saturation.
There is one loudness contour for each note, and so there is one loudness factor per note. Previous versions of ISiS did not handle this very explicitly: you could give as many loudness factors as desired and ISiS would fill in as many 1s as needed. In 1.3.0 this has become more verbose. You can provide no loudness accents, or a single one, and as before ISiS will quietly set all loudness factors to 1. If you provide a list of loudness accents, ISiS will expect one loudness factor per note and tell you if there is a mismatch. If you have too many, you will receive an error. If you have too few, you will receive a comment and the list is filled with 1s.
The input needs to be formatted with one loudness factor per note. Just as you have to write frequency 0 for silences in the note frequencies, you have to specify a loudness accent value for silences as well. That value will be ignored, though.
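To make the counting rule concrete, here is a minimal Python sketch that mimics the behaviour described above. It is not ISiS code; the function name, the example frequencies and the warning text are invented for illustration. Silences are the entries with frequency 0, and they still need a slot in the accent list.

```python
import warnings


def align_accents(note_freqs, accents, neutral=1.0):
    """Return one loudness factor per note, following the rules described above."""
    n = len(note_freqs)
    if len(accents) <= 1:
        # No accents, or a single one: everything is quietly set to neutral.
        return [neutral] * n
    if len(accents) > n:
        raise ValueError(f"{len(accents)} accents given for {n} notes")
    if len(accents) < n:
        # Too few: report it and fill the remainder with neutral factors.
        warnings.warn(f"{len(accents)} accents given for {n} notes, padding with 1s")
    return list(accents) + [neutral] * (n - len(accents))


# Hypothetical score: start silence, two notes, end silence.
note_freqs = [0, 220.0, 246.9, 0]
# The leading 1.0 belongs to the start silence; the end silence is missing.
aligned = align_accents(note_freqs, [1.0, 0.8, 1.0])
print(aligned)
```

Leaving out the value for the start silence would shift every accent one note to the left, which is the misalignment described in the original question.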
Generally speaking, given the somewhat shaky loudness model we have implemented in ISiS, loudness factors have never been in the foreground of any communication. On the other hand, they provide an additional means to control the synthesis, so it seemed a pity to drop them. If our plans for developing a DNN-based ISiS synthesis backend succeed, loudness control should become more advanced, and so these accents have been kept.
you wrote: otherwise your notes and accent values will be out of alignment, even though no error message will be returned.
Given that ISiS has always completed missing accents with ones, it was impossible to know what was desired. As I said above, as far as I can see in the code, you should get a warning in 1.3.0-rc7 if you do not have the correct number of loudness accents.
Answer to 3)
Since the model is concatenative, joining parametric representations of existing recordings, it is likely that one of the @_ diphones, or its parametric representation, has an artifact. I cannot hunt them down like this. The log files that ISiS generates list exactly which diphones are selected. So if you attach the log file and tell me which of the @ in the score produces this behaviour, I should be able to find the @_ diphone in the voice files. As far as I remember, we have a (currently unused) feature that supports excluding diphones from the synthesis. So if my hypothesis is correct, I may send you the exclusion file, which may fix the problem simply by forbidding the offending diphone, without distributing a new voice file. There are a few ‘ifs’ here; we’ll see.
Thanks as ever Axel for all this useful and interesting detail.
Re 1)
Say we take the cfg file I attached before, and change the loudness accent for the second note from 0.8 to 2.0. This is what comes up when I try to render it:
“accent factor shall be within [0, 1] (2.0).”
Log file attached.
@ issue loud.log (26.5 KB)
Re 3), here is the log for this. The first two @s have the artefact, the final one doesn’t.
@ issue.log (36.2 KB)
Ok, sorry I missed that. It appears I introduced this check to reduce the likelihood of saturation.
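Until that check changes, a score can be screened before rendering. The following is a small Python sketch, not ISiS code, with an invented function name. It assumes the accepted range is the closed interval [0, 1] quoted in the error message above.

```python
def out_of_range_accents(accents, low=0.0, high=1.0):
    """Return (index, value) pairs for accents outside [low, high]."""
    return [(i, a) for i, a in enumerate(accents) if not low <= a <= high]


# Hypothetical accent list with the second value raised to 2.0.
accents = [1.0, 2.0, 0.8]
bad = out_of_range_accents(accents)
print(bad)  # [(1, 2.0)]
```

The index counts every entry in the list, so the initial silence is index 0.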
Concerning 3), the problem here is the loudness accents. If I set them all to 1 (the default setup), the problem disappears. The noises remain the same, but because you reduced the loudness of the vowel before the fade, the fade becomes audible and appears as an artifact. I will probably need to look into the handling of the loudness contours across silences.