Re: [IPT] GBIF Case 1773: UTF8
On Aug 12, 2011, at 2:59 PM, Burke Chih-Jen Ko (GBIF) wrote:
And have you tried Tim's suggestion?
Could you try issuing the \s command in the mysql client shell to show its character set settings? The output looks like this:
Connection id: 37141
Current database:
Current user: root@localhost
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 5.1.56 Source distribution
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: utf8
Db characterset: utf8
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /var/lib/mysql/mysql.sock
Uptime: 41 days 22 hours 43 min 50 sec
Threads: 1 Questions: 597738 Slow queries: 20 Opens: 2328 Flush tables: 1 Open tables: 64 Queries per second avg: 0.164
--------------
The four "characterset" lines show the character set settings.
Cheers,
Burke
On Aug 12, 2011, at 2:30 PM, Mickael Graf wrote:
That's possible. But the very same data is displayed correctly with TapirLink.
Yes, you can copy to the mailing list later.
Cheers, Mickaël
________________________________________
From: Burke Chih-Jen Ko (GBIF) [bko@gbif.org] Sent: Friday, August 12, 2011 2:08 PM To: Mickael Graf Cc: helpdesk@gbif.org Subject: Re: GBIF Case 1773: UTF8
Hi Mickael,
I can see that the accented characters are already wrong in the script. If you generated the script from the SQL client, perhaps the problem is on the DB side?
Judging from the script, I am sure the IPT won't read it correctly if the source already contains "NÃ¤rke".
Would you mind if I copy the thread to the IPT mailing list later?
Burke
On Aug 12, 2011, at 1:51 PM, Mickael Graf wrote:
Hi Burke,
I am using a view. Here come some scripts for checking the data/IPT.
The statement in IPT is then simply 'select * from rcDwCIPT'.
Cheers, Mickaël
From: Burke Chih-Jen Ko (GBIF) [bko@gbif.org] Sent: Thursday, August 11, 2011 9:24 AM To: Mickael Graf Cc: helpdesk@gbif.org Subject: GBIF Case 1773: UTF8
Hi Mickaël,
Do you use a SQL view or a text file as the source for IPT? May I have some sample records to test and reproduce your issue?
Thanks!
Burke
On Aug 10, 2011, at 4:25 PM, Mickael Graf wrote:
Hi,
I am testing IPT 2 and NRM RingedBirds is the guinea pig.
Well, I have an issue with the encoding: while Swedish characters work fine with TapirLink (see http://www.gbif.se/tapir/tapir_client.php, choose NRM-RingedBirds and make an inventory over StateProvince), it's a mess with IPT, both in the preview and in the zipped file. For instance, 'Närke' is displayed as 'NÃ¤rke'. This happens regardless of the character encoding chosen under /source.do. The original data is UTF8, but I don't know whether any settings in Tomcat need to be changed.
Do you have any knowledge about this issue? I am very bad at Java, so I don't know where to look (and issue 418 doesn't help).
Cheers, Mickaël
<View_RC_DwC_IPT.sql><rc_test.sql>
Good, we're going further now. Here is what I get:
Connection id: 26948
Current database: nrm
Current user: root@localhost
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 5.0.77 Source distribution
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: latin1
Conn. characterset: latin1
UNIX socket: /var/lib/mysql/mysql.sock
Uptime: 50 days 2 hours 7 min 54 sec
Threads: 5 Questions: 3266242 Slow queries: 266 Opens: 3577 Flush tables: 1 Open tables: 64 Queries per second avg: 0.755
I haven't tried Tim's suggestion. I simply don't know where to look/test.
Cheers, Mickaël
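The comparison between the two \s outputs in this thread can also be done mechanically. The following is a minimal Python sketch, not part of the original exchange: the function name `charset_settings` is hypothetical, and the sample text is an excerpt of the status output quoted above. It pulls out the four "characterset" values and lists the ones that are not utf8.

```python
import re

def charset_settings(status_text):
    """Collect the four 'characterset' values from the mysql client's status output."""
    # Matches e.g. "Server characterset: latin1" or "Conn. characterset: utf8".
    pattern = re.compile(r"(Server|Db|Client|Conn\.)\s+characterset:\s*(\S+)")
    return {name: value for name, value in pattern.findall(status_text)}

# Excerpt of the status output reported in the thread.
status = """\
Server version: 5.0.77 Source distribution
Server characterset: latin1
Db characterset: latin1
Client characterset: latin1
Conn. characterset: latin1
"""

settings = charset_settings(status)
# Names of the settings that are not some flavour of utf8.
non_utf8 = sorted(name for name, value in settings.items()
                  if not value.lower().startswith("utf8"))
print(settings)
print(non_utf8)
```

Here all four settings come back as latin1, which matches the mismatch discussed in the rest of the thread.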
On Aug 12, 2011, at 4:30 PM, Burke Chih-Jen Ko (GBIF) wrote:
Hi Mickaël,
I can see from your script that the database was created using UTF-8, but it could be that the connection character set interprets the UTF-8 data as ISO-8859-1. Force-opening a UTF-8 text file containing Närke with the latin1 charset does indeed render the text as NÃ¤rke.
In the [mysqld] section of /etc/my.cnf, you can instruct the server to start with a preferred character set and collation:
character_set_server=utf8
default-character-set=utf8
character_set_client=utf8
collation_server=utf8_general_ci
skip-character-set-client-handshake
The last line forces the connection charset to be the one specified for the server.
So I suggest the following steps:
1. Add the lines above to your my.cnf.
2. Restart MySQL.
3. First see if things still look the same on TapirLink.
4. Try exporting the same sample script you gave us earlier. If Närke shows as it should, then it should be fine in IPT.
Let me know if this setting works.
But if this change breaks TapirLink and others, you'll need to decide whether to reconfigure all the others or see if Tim's suggestion works (jdbc:mysql://localhost:3306/specimen_collections?autoReconnect=true&useUnicode=true&characterEncoding=UTF8&characterSetResults=UTF8).
Cheers,
Burke
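The effect Burke describes, UTF-8 bytes being read as latin1, can be reproduced without MySQL. This is a small Python sketch for illustration only; it uses Python's "latin-1" codec as a stand-in for MySQL's latin1 and does not involve IPT or JDBC.

```python
original = "Närke"

# UTF-8 stores "ä" as two bytes, 0xC3 0xA4.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as latin1 turns each byte into its own character.
misread = utf8_bytes.decode("latin-1")

print(len(original), len(utf8_bytes), len(misread))  # 5 6 6
print(misread)                                       # NÃ¤rke
```

The two bytes of "ä" show up as "Ã" and "¤", which is the typical signature of UTF-8 data passing through a latin1 step.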
On Aug 15, 2011, at 3:19 PM, Mickael Graf wrote:
Hi Burke,
Changing my.cnf breaks everything, so I reverted it. I need to study how to correctly migrate my data to a fully utf8 system. MySQL still comes with latin1 as the default (I just checked on my Ubuntu 11.04!).
How can I test Tim's suggestion?
Cheers, Mickaël
On Aug 15, 2011, at 4:24 PM, Burke Chih-Jen Ko (GBIF) wrote:
Hi Mickaël,
Have you tried using Latin 1 as the character encoding in the source data editing page of IPT?
Burke
On Aug 16, 2011, at 9:26 AM, Mickael Graf wrote:
Hi Burke,
I tried UTF-8, Latin1 and Windows 1252, but the result always looks the same. It looks like this setting has no influence on the final result, at least here.
/Mickaël
On Aug 16, 2011, at 10:07 AM, Burke Chih-Jen Ko (GBIF) wrote:
Hi Mickaël,
I'd like to learn some details about the encoding settings on your side.
1. Did you use mysqldump to create the script that you sent me earlier?
2. From the script, I can see the database stores data in UTF-8. Is that correct?
3. Since the characters in the SQL dump are already broken, could you temporarily change the connection charset to utf-8 and check whether the same dump then contains the correct accented letters? Or please try --default-character-set=latin1 as one of your dump options.
4. Since all charset settings on your side appear to be latin1, are all databases hosted on the MySQL server using UTF8 as the encoding, including the one that serves TapirLink?
I am trying to reproduce your environment here.
Cheers,
Burke
- Did you use mysqldump to create the script that you sent me earlier?
Yes, for the table definition and the data. The script for the view is handwritten.
- From the script, I can see the database stores data in UTF-8. Is that correct?
Unfortunately not. The default settings for the server are latin1, and so is the database. The data itself is (well, should be...) utf8, and the table definition has utf8 as its character set.
- Since the characters in the SQL dump are already broken, could you temporarily change the connection charset to utf-8 and check whether the same dump then contains the correct accented letters? Or please try --default-character-set=latin1 as one of your dump options.
I just did a dump with --default-character-set=latin1 and I can read the accented letters in less, emacs and firefox (where I need to specify that it's utf8).
- Since all charset settings on your side appear to be latin1, are all databases hosted on the MySQL server using UTF8 as the encoding, including the one that serves TapirLink?
Most of my databases use latin1, but I have two that were created using utf8. For all of them, though, the accented letters are displayed correctly with TapirLink.
Cheers, Mickaël
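One possible reading of these answers is that UTF-8 bytes were written through a latin1 connection into a utf8 table, so a latin1 dump hands the original bytes back while a utf8 read returns them double-encoded. The Python sketch below is a toy model under that assumption, not MySQL's actual code: `store` and `fetch` are hypothetical helpers, and Python's "latin-1" codec stands in for MySQL's latin1.

```python
def store(client_bytes, connection_charset, column_charset):
    """Toy model: the server trusts the connection charset, then converts to the column's."""
    return client_bytes.decode(connection_charset).encode(column_charset)

def fetch(stored_bytes, column_charset, connection_charset):
    """Toy model: the server converts from the column charset to the connection's."""
    return stored_bytes.decode(column_charset).encode(connection_charset)

# The client sends UTF-8 bytes, but the connection is declared as latin1.
sent = "Närke".encode("utf-8")
stored = store(sent, "latin-1", "utf-8")

# Reading back over a latin1 connection undoes the conversion exactly.
via_latin1 = fetch(stored, "utf-8", "latin-1")
# Reading over a utf8 connection returns the double-encoded form.
via_utf8 = fetch(stored, "utf-8", "utf-8")

print(via_latin1.decode("utf-8"))  # Närke
print(via_utf8.decode("utf-8"))    # NÃ¤rke
```

Under this model, a client that reads over a latin1 connection and passes the bytes through as UTF-8 would look fine, while a client that converts properly over a utf8 connection would show the broken form. Whether that is what separates TapirLink from IPT here is not confirmed in the thread.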
On Aug 24, 2011, at 2:59 PM, Mickael Graf wrote:
Hi,
I am still stuck with this issue. I tested different encodings with the data, and I am pretty sure the issue lies in the IPT/Tomcat area (TapirLink/Apache is fine).
Burke, did you manage to get a correct result with the data I sent you? If yes, what did you do and what is your configuration?
Cheers, Mickaël
On Aug 24, 2011, at 5:07 PM, Burke Chih-Jen Ko (GBIF) wrote:
Hi Mickaël,
No, I couldn't. Since the file is encoded in UTF-8 but the values in it are already wrong, I couldn't go on to recreate a setup like yours. Could you perhaps dump one in latin1 for me, please?
For IPT, we have identified a possible addition that would make it capable of setting the connection charset. We have started investigating that.
Cheers,
Burke
Hi Burke,
Here comes some data from mysqldump, in two different versions (one "as is" and one using --default-character-set=latin1).
I hope it helps.
Cheers, Mickaël
Hi MIchaël,
Yes this time I have correct data to test. I changed from your script is to re-save the file in latin1, and add drop-table script from your previous dump. It was saved in UTF-8 despite the sql settings in the script are all latin1. The refined file is attached.
Then I reproduce the environment as the steps here:
1) Adjust the MySQL server and client encoding settings to match yours. The server and client connection show:
Server characterset: latin1
Db characterset: latin1
Client characterset: latin1
Conn. characterset: latin1
2) Create a database from the attached script; the database encoding is latin1.
3) Follow the normal procedure to create a SQL source in IPT. See the "settings" screenshot.
4) Since the JDBC driver detects the source encoding automatically, the encoding setting in the bottom-left doesn't matter for a SQL source. However, we're thinking about forcing the encoding as instructed. Please refer to the MySQL JDBC connector page [1].
5) The preview result on my side is attached as the result.png image. Närke is rendered correctly, whether your browser encoding is latin1 or UTF-8.
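The check in step 1 can be automated by reading the four "characterset" entries out of the text that the mysql client prints for \s. This is a rough Python sketch; the function names are invented here, and the regular expression assumes the labels appear exactly as in the status output quoted in this thread.

```python
import re

# Matches entries such as "Server characterset: latin1" or "Conn. characterset: utf8".
CHARSET_ENTRY = re.compile(r"\b(Server|Db|Client|Conn\.)\s+characterset:\s*(\S+)")


def charset_settings(status_text: str) -> dict:
    """Return the four characterset values found in mysql status output."""
    return dict(CHARSET_ENTRY.findall(status_text))


def is_consistent(settings: dict) -> bool:
    """True when all four entries are present and share one character set."""
    return len(settings) == 4 and len(set(settings.values())) == 1


if __name__ == "__main__":
    status = (
        "Server characterset: latin1 Db characterset: latin1 "
        "Client characterset: latin1 Conn. characterset: latin1"
    )
    found = charset_settings(status)
    print(found)
    print(is_consistent(found))
```

A mixed result, for example a utf8 server with a latin1 connection, is the situation in which accented characters tend to be misread.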
Since we assume everything on your side is latin1, if it still doesn't work you can change a line in the jdbc.properties file of a *deployed* IPT to force the JDBC encoding:
6) In [Tomcat root]/webapps/ipt/WEB-INF/classes you will find jdbc.properties; at line 7, you have
mysql.url=jdbc:mysql://{host}/{database}
7) Add the encoding setting to the connection, so that it reads
mysql.url=jdbc:mysql://{host}/{database}?characterEncoding=Cp1252
The encoding name used by the JDBC driver is slightly different from the MySQL one [1].
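To illustrate the naming difference: MySQL's latin1 corresponds to Windows-1252 rather than strict ISO-8859-1, which is why the Java-style name is Cp1252. The Python sketch below is only an illustration; the two-entry mapping is an assumed subset of the table on the connector page [1], and `force_encoding` is a hypothetical helper, not something IPT provides.

```python
# Assumed subset of the MySQL-to-Java charset names listed on the connector page [1].
MYSQL_TO_JAVA = {"latin1": "Cp1252", "utf8": "UTF-8"}


def force_encoding(jdbc_url: str, mysql_charset: str) -> str:
    """Append characterEncoding=<Java name> to a JDBC URL or URL template."""
    java_name = MYSQL_TO_JAVA[mysql_charset]
    separator = "&" if "?" in jdbc_url else "?"
    return f"{jdbc_url}{separator}characterEncoding={java_name}"


if __name__ == "__main__":
    print(force_encoding("jdbc:mysql://{host}/{database}", "latin1"))
    # Byte 0xE4 is "ä" in both encodings; byte 0x80 is a printable
    # character only in Cp1252.
    print(b"\xe4".decode("cp1252"), b"\xe4".decode("latin-1"))
    print(b"\x80".decode("cp1252"))
```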
Let me know whether this gives you a corrected result. Otherwise I suspect that UTF-8 encoding was involved at some step while you were setting up the database, so you might want to consider a clean start, using a small set of data or the script you gave me:
1. Export your whole database as a SQL script, making sure the client you use (phpMyAdmin?) also honours latin1 in every step.
2. Check that the file encoding and the contents are exported correctly.
3. Import the SQL and try from IPT again.
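For step 2, one rough way to check the file encoding is to test whether the bytes of the dump are valid UTF-8. This is a heuristic Python sketch with an invented function name: bytes that fail UTF-8 decoding are merely assumed to be latin1, and the check cannot detect text that was double-encoded.

```python
def guess_dump_encoding(data: bytes) -> str:
    """Guess whether a SQL dump is plain ASCII, UTF-8, or (by assumption) latin1."""
    if all(byte < 0x80 for byte in data):
        return "ascii"  # no accented characters, so the question does not arise
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "latin1"  # not valid UTF-8; assumed to be a single-byte encoding
    return "utf-8"


if __name__ == "__main__":
    print(guess_dump_encoding("INSERT INTO t VALUES ('Närke');".encode("utf-8")))
    print(guess_dump_encoding("INSERT INTO t VALUES ('Närke');".encode("latin-1")))
```

To check a real export, read the file in binary mode and pass the bytes to the function.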
Hope this helps. Do let us know if your problem is resolved.
Thanks,
Burke
[1] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html
Hi Burke,
Thank you for your help. It helped me narrow down the problem, and I think I am now close to publishing the resource.
For the record, I did a mysqldump of the table using --default-character-set=latin1 (dumping into two files, one for the structure, the other for the data), converted the files with the forceUTF8 library, and replaced occurrences of "latin1" with "utf8".
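The same conversion can be sketched in Python. This is not the forceUTF8 library mentioned above, only a stand-in under the assumption that the dump bytes really are latin1 (decoded here as cp1252, which is what MySQL's latin1 corresponds to); `convert_dump` is an invented name.

```python
def convert_dump(latin1_dump: bytes) -> bytes:
    """Re-encode a latin1 SQL dump as UTF-8 and switch its charset declarations."""
    text = latin1_dump.decode("cp1252")
    # Blunt replacement: it would also alter any data that contains "latin1".
    text = text.replace("latin1", "utf8")
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = (
        "CREATE TABLE t (name varchar(50)) DEFAULT CHARSET=latin1;\n"
        "INSERT INTO t VALUES ('Närke');\n"
    ).encode("cp1252")
    print(convert_dump(sample).decode("utf-8"))
```

A dump that is already UTF-8 should not be passed through this function, because decoding it as cp1252 would produce the garbled characters this thread is about.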
The resulting data is displayed correctly both with IPT and with TapirLink, and with both latin1 and utf8 as the character set for the MySQL server. But this is on my own computer; testing on the server with MySQL/latin1 gives me errors with TapirLink. Never mind, I'll create a temporary database for the duration of the migration.
Again, thanks a lot.
Cheers, Mickaël
From: Burke Chih-Jen Ko (GBIF) [bko@gbif.org] Sent: Friday, August 26, 2011 11:06 AM To: Mickael Graf Cc: Johan Dunfalk; GBIF IPT mailing list; GBIF Helpdesk Subject: Re: GBIF Case 1773: UTF8
Hi Mickaël,
Yes, this time I have correct data to test. What I changed in your script was to re-save the file in latin1 and to add the drop-table statements from your previous dump. The file had been saved in UTF-8 even though the SQL settings in the script are all latin1. The refined file is attached.
Then I reproduced your environment with the following steps:
1) Adjust the MySQL server and client encoding settings to match yours. The server and client connection show:
Server characterset: latin1
Db characterset: latin1
Client characterset: latin1
Conn. characterset: latin1
2) Create a database from the attached script; the database encoding is latin1.
3) Follow the normal procedure to create a SQL source in IPT. See the "settings" screenshot.
4) Since the JDBC driver detects the source encoding automatically, the encoding setting in the bottom-left doesn't matter for a SQL source. However, we're thinking about forcing the encoding as instructed. Please refer to the MySQL JDBC connector page [1].
5) The preview result on my side is attached as the result.png image. Närke is rendered correctly, whether your browser encoding is latin1 or UTF-8.
Since we assume everything on your side is latin1, if it still doesn't work you can change a line in the jdbc.properties file of a *deployed* IPT to force the JDBC encoding:
6) In [Tomcat root]/webapps/ipt/WEB-INF/classes you will find jdbc.properties; at line 7, you have
mysql.url=jdbc:mysql://{host}/{database}
7) Add the encoding setting to the connection, so that it reads
mysql.url=jdbc:mysql://{host}/{database}?characterEncoding=Cp1252
The encoding name used by the JDBC driver is slightly different from the MySQL one [1].
Let me know whether this gives you a corrected result. Otherwise I suspect that UTF-8 encoding was involved at some step while you were setting up the database, so you might want to consider a clean start, using a small set of data or the script you gave me:
1. Export your whole database as a SQL script, making sure the client you use (phpMyAdmin?) also honours latin1 in every step.
2. Check that the file encoding and the contents are exported correctly.
3. Import the SQL and try from IPT again.
Hope this helps. Do let us know if your problem is resolved.
Thanks,
Burke
[1] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-charsets.html
Hi Mickaël,
Glad to learn you're making progress, and sorry for the delayed response.
For the record, I did a mysqldump of the table using --default-character-set=latin1 (dumping into two files, one for the structure, the other for the data), converted the files with the forceUTF8 library, and replaced occurrences of "latin1" with "utf8".
I suppose the file was also encoded in UTF-8 when you saved the result?
The resulting data is displayed correctly both with IPT and with TapirLink, and with both latin1 and utf8 as the character set for the MySQL server. But this is on my own computer; testing on the server with MySQL/latin1 gives me errors with TapirLink. Never mind, I'll create a temporary database for the duration of the migration.
How's the situation now? Did you sort out why TapirLink had errors?
Thanks - we also learned from the problem-solving.
Burke
participants (2)
- Burke Chih-Jen Ko (GBIF)
- Mickael Graf