File format to keep accents and ' within TEXTFILE

pasquet · June 18, 2021, 5:04pm

Hello !

Am I right that accents and some other characters are lost when using any Unicode encoded text file with TEXTFILE ?
It seems to work with the “Western ISO Latin 1” encoding.

Thanks very much !

O.

haddad · June 18, 2021, 5:47pm

Yes, correct!

Best
K

pasquet · June 18, 2021, 5:58pm

Hello Karim !!

Thank you !!

Dumaiu · January 21, 2023, 11:39pm

M @haddad–what’s the reasoning behind its being this way? I could maybe see restricting character sets in file|directory names, but wouldn’t, for example, UTF-8 be a better encoding for file contents? I know LispWorks uses :latin-1 by default, but IMHO that’s not a good reason alone.

-Jonathan

haddad · January 22, 2023, 11:42am

Dear Jonathan,

The main reason is that, historically, OM code was written for MCL back in the day (cf. Macintosh Common Lisp on Wikipedia), and as I remember utf8 was not supported in MCL. But you are right, we should upgrade OM for utf8. However, this will be a somewhat delicate matter, because we want to keep compatibility with old files, workspaces, etc., and we do not want to break this. So, when we have some time, I will look into it.

Best
Karim

Dumaiu · January 24, 2023, 1:43am

Thanks for the explanation, Karim. I didn’t realize that OM had ever existed in a form not dependent on LispWorks.

You might consider the following idiom, from the LW manual at §26.6.3.5:
For example, the following will cause LispWorks to use UTF-8 if the file begins with valid UTF-8 bytes:

(pushnew :utf-8 system:*specific-valid-file-encodings*)

I think they meant it for situations like this.
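For comparison, here is a minimal sketch of the more explicit route: passing the encoding directly when opening a file. It is only an illustration, not OM code. `read-text-file` is a hypothetical helper name, and I am assuming the `:utf-8` and `:latin-1` external-format keywords, which LispWorks and SBCL accept but which are implementation-defined in standard Common Lisp.

```lisp
;; Sketch only: READ-TEXT-FILE is a hypothetical helper, not an OM function.
(defun read-text-file (path &optional (external-format :default))
  "Return the contents of PATH as a list of lines, decoded with EXTERNAL-FORMAT."
  (with-open-file (in path :direction :input
                           :external-format external-format)
    ;; READ-LINE returns NIL at end of file, which ends the loop.
    (loop for line = (read-line in nil nil)
          while line
          collect line)))

;; With :default, the implementation chooses the encoding; in LispWorks this
;; is where the detection extended by the PUSHNEW above comes into play.
;; Passing :utf-8 or :latin-1 explicitly skips any detection.
```

A call such as `(read-text-file "notes.txt" :utf-8)` (the file name is made up) would then decode the file as UTF-8 regardless of the default.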

-J.

haddad · January 24, 2023, 11:36am

Dear Jonathan,

Thank you again for the tip. I will try this. But the problem with utf is that latin encoding is all over the place in the code. I have to change all of these in order to test it quietly and make sure it works, particularly with regard to compatibility.

Will keep you informed.
Best
K

Dumaiu · January 25, 2023, 6:07am

You’re welcome. This issue isn’t a big deal to me personally, but it seems like it’d be of long-term benefit for the program.
-J.

anders · January 25, 2023, 2:14pm

utf-8 shouldn’t be a big problem. I’ve pushed a utf-8 branch of OM to the repo. Things seem OK with relatively normal patch files and workspaces, but there may be (probably are) issues with stranger old encodings, also across OSes.

Karim will need to test everything and say ‘go’ before this can be put out in the wild.

haddad · January 25, 2023, 2:16pm

Thanks a lot Anders.

Testing right now!
Will keep you informed.

Best
K

haddad · January 27, 2023, 2:20pm

Hi,

Just to say that, thanks to Jonathan and Anders, OM now supports utf8. It is coming in the new 7.2 version, to be released soon, in March.

Best
K

anders · January 27, 2023, 2:59pm

Perhaps worth a note: OM defaults to utf-8 now, meaning that if you open any latin-1 encoded file in OM and save it, it will end up as utf-8.
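To make the effect concrete, here is a rough sketch of the same kind of conversion done by hand, outside OM: read with one encoding, write with another. This is not how OM does it internally; `transcode-file` is a hypothetical helper, and the `:latin-1` and `:utf-8` keywords are assumed to be supported by your Lisp (they are in LispWorks and SBCL).

```lisp
;; Sketch only: TRANSCODE-FILE is a hypothetical helper, not part of OM.
(defun transcode-file (source target &key (from :latin-1) (to :utf-8))
  "Copy SOURCE to TARGET, decoding with FROM and encoding with TO."
  (with-open-file (in source :direction :input :external-format from)
    (with-open-file (out target :direction :output
                                :if-exists :supersede
                                :external-format to)
      ;; Copy character by character so the text itself is unchanged;
      ;; only the byte representation on disk differs.
      (loop for ch = (read-char in nil nil)
            while ch
            do (write-char ch out))))
  target)
```

Something like this could also be used to keep a latin-1 copy of a file before opening it in a utf-8 build, by writing to a different target path.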

anders · January 27, 2023, 3:01pm

As well, utf8 should support most practical uses. But with this change it would be quite easy to add support for other encodings for very special needs.

haddad · January 27, 2023, 3:13pm

Thanks, Anders, for these notes.

I will add one:
WARNING: if you name patches in utf8 (using non-latin-1 characters), you will not be able to load your workspace correctly with older versions of OM.
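As a rough illustration of what "non latin-1" means here, a name can be checked character by character. This is only a sketch, assuming a Lisp whose character codes are Unicode code points (as in LispWorks and SBCL), so that latin-1 corresponds to codes 0 to 255. `latin-1-safe-p` is a hypothetical helper, not something OM provides.

```lisp
;; Sketch only: LATIN-1-SAFE-P is a hypothetical helper, not part of OM.
(defun latin-1-safe-p (name)
  "True if every character of NAME has a code point below 256,
i.e. the name can be represented in latin-1."
  (every (lambda (ch) (< (char-code ch) 256)) name))
```

Accented Western European letters such as é pass this check, while characters outside latin-1 (for example ō, or any non-Latin script) do not.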

Best
K

apoorbaugh · February 2, 2023, 6:39am

This is great to hear! It will finally clear up the issue I was having with imported literature texts, as detailed here: “Non ‘Base-Char’ Makes Patch Impossible to Save”.

Looking forward to the March release!

haddad · February 2, 2023, 10:57am

Dear Austin,

Unfortunately, as I anticipated, there are still a lot of issues with the utf encoding. So we will keep this under testing. The reason is compatibility. Indeed, making OM utf-8 capable breaks a lot of old patches for the time being. So, thanks to Anders, we are testing it in a separate branch. And we are still dealing with issues. So it will have to wait.

Best
K

Dumaiu · February 4, 2023, 10:04pm

My philosophy in such situations is the Emacs one: Add a configuration option! What about including an option in default-prefs.lisp for changing the default encoding, to let users beta-test UTF-8 for you?

-Jonathan

haddad · February 5, 2023, 3:14pm

Dear Jonathan,

Unfortunately, I don’t think it is as simple as that, since adding utf-8 support involves many code changes, meaning adding switches in many locations in the code.
Just to let you know, we had some compatibility issues with utf-8 using old patches. That is why we will, for the time being, keep it as a development branch under heavy testing. However, if you have a LW environment, please feel free to test this branch.

Best
K

Dumaiu · February 6, 2023, 7:00am

I’m sorry you’re encountering back-compatibility failures, Karim. Is there a publicly-available test suite you use to check these things? I doubt any patches of my own are sufficiently “old.”

I will test the branch when I can come up with a way to do it using only a Personal Edition. A topic for yet another day.

-J.