Hello,

I was looking into bugs 178278, 119723, 183743 (and other similar ones), as I fixed a similar bug in Amarok 1.4 (138482). Specifically, I was wondering why case-insensitive search for non-Latin characters doesn't work again. After spending some time researching the issue, I identified the problem. But I am not sure of the best way to fix it, so I am writing this email :) (I'm sorry it has become a bit long.)

First, some background:

When Amarok starts the embedded MySQL server, it does not set the default character encoding, so MySQL falls back to its default 'latin1' encoding for all textual fields. Before putting data into the tables, Amarok first (correctly) converts it to UTF-8. For ASCII characters this is no problem, as the two encodings are equivalent there. But if the data contains a 'real' UTF-8-encoded string, it gets stored just as a sequence of bytes. When we ask the database to match another Unicode string against it, it can only do so if there is an exact byte-for-byte match with the stored string. Had it known what the characters mean, it could perform case-insensitive search, as well as some other collation fixes (like recognising the similarity of 'A' and 'Ä').

After trying several options, I identified two problems to be solved:

1) The user's DB is not yet created

In this case, all we have to do is tell the server that it will be dealing with UTF-8 data, so it knows how to compare it correctly. This can be done by passing the parameter "--default-character-set=utf8" to the server and, once the connection is established, issuing a "SET NAMES 'utf8'" SQL command.
The first parameter specifies that, by default, all newly created text fields should use UTF-8 encoding, while the second specifies that we want to use UTF-8 encoding when talking to the server.

2) The user already has UTF8 data in Latin1 text fields

This is the case I was not sure how to handle. It is possible to convert a field to another encoding. In our case, we should first convert the fields to the BINARY encoding, as we'll get broken characters otherwise. This can be done with the following two SQL commands:

"ALTER TABLE artists MODIFY name VARCHAR(255) CHARACTER SET binary;"
"ALTER IGNORE TABLE artists MODIFY name VARCHAR(255) CHARACTER SET utf8 NOT NULL;"

Please note that we need to do this conversion for each text field, because doing it on the table or the database only sets the defaults.

So, my uncertainty with this approach is when and how to perform the conversion. Maybe we should check the current encoding and perform the conversion only if necessary. Or we should use a different db schema version and convert all old dbs to the new encoding. Another problem here is the possibility that two keys that were previously different become identical ("Die Arzte" and "Die Ärzte"). In that case the command either fails because of the UNIQUE_KEY constraint, or deletes the second identical key if we use the IGNORE option. I don't know if this is a problem, or whether the missing key will be regenerated.

So, thank you for reading through this quite long email. If anyone is interested in fixing the problem, please give me hints on how this could be done in an acceptable way.

Best wishes,
Stanislav Nikolov
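To make the mismatch concrete, here is a minimal Python sketch of the first problem (Python only for illustration, Amarok itself is of course not involved; 'Ärzte' stands in for any non-Latin data):

```python
# Why UTF-8 bytes sitting in a latin1-labelled column defeat
# case-insensitive matching. Python's 'latin1' codec stands in for
# MySQL's latin1 interpretation of the stored bytes.
word = "Ärzte"

# Amarok correctly encodes to UTF-8 before insertion...
stored_bytes = word.encode("utf-8")        # b'\xc3\x84rzte'

# ...but a latin1 column sees those two bytes as two separate characters.
as_latin1 = stored_bytes.decode("latin1")  # 'Ã' + '\x84' + 'rzte'

# Byte-level case folding therefore cannot match the real string:
print(as_latin1.lower() == "ärzte")        # False

# Had the server known the bytes were UTF-8, folding would work:
print(stored_bytes.decode("utf-8").lower() == "ärzte")  # True
```

This is exactly the "exact match only" behaviour described above: the bytes are fine, but the server's idea of what characters they represent is wrong.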
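And a sketch of the second problem: why the two-step relabelling recovers the characters intact, and why two previously distinct keys can then collide under an accent-insensitive collation. (The NFKD-stripping function below is only a rough approximation of MySQL's utf8_general_ci, not its actual algorithm.)

```python
import unicodedata

# Bytes Amarok stored: real UTF-8, but labelled latin1 by the column.
stored_bytes = "Die Ärzte".encode("utf-8")

# Step 1 (CHARACTER SET binary): drop the wrong latin1 label, keep bytes.
# Step 2 (CHARACTER SET utf8): reinterpret the same bytes as UTF-8.
recovered = stored_bytes.decode("utf-8")
print(recovered)  # 'Die Ärzte' -- no mojibake

# Collision risk: an accent-insensitive collation treats 'A' and 'Ä' as
# equal, roughly like stripping combining marks after NFKD decomposition.
def collation_key(s: str) -> str:
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

print(collation_key("Die Arzte") == collation_key("Die Ärzte"))  # True
# Under such a collation the two rows compare equal, so the UNIQUE
# constraint rejects the conversion -- or, with IGNORE, silently drops
# the second row.
```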

_______________________________________________
Amarok mailing list
Amarok@kde.org
https://mail.kde.org/mailman/listinfo/amarok