Re: [curn-users] Translating character encodings.
On 6/26/07 5:56 AM, Nuno Leitao wrote:
> Hi Brian,
>
> Thanks very much for your help. Here's my config file:
>
> [curn]
> CacheFile:${system:user.home}/k2/contentfetch/curn/data/store/curn.cache
> AllowEmbeddedHTML:false
> CommonXMLFixups:true
> DaysToCache:60
> GzipDownload:true
> MaxSummarySize:65535
> MaxThreads:4
> ReplaceEmptySummaryWith:title
> ShowDates:true
>
> [Feed.Publico.Geral]
> TitleOverride:Publico,Geral
> URL:http://www.publico.clix.pt/rss.asp?idCanal=10
> SaveAs:${system:user.home}/k2/contentfetch/curn/data/store/publico.geral.xml
>
> SaveAsEncoding:utf-8
>
> [Feed.DN]
> TitleOverride:Diario de Noticias
> URL:http://rss.sapo.pt/dn/
> SaveAs:${system:user.home}/k2/contentfetch/curn/data/store/dn.xml
> SaveAsEncoding:utf-8
>
> [OutputHandler]
> Class:org.clapper.curn.output.freemarker.FreeMarkerOutputHandler
> Disabled:false
> TemplateFile:file ${system:user.home}/k2/contentfetch/curn/data/etc/rssout.ftl
> SaveAs:${system:user.home}/k2/contentfetch/curn/data/store/forindexing.xml
> SaveAsEncoding:utf-8
> MimeType:application/xml
>
> You will notice that the "Publico.Geral" feed is an ISO-8859-1 feed,
> while the DN feed is UTF-8. I have set SaveAsEncoding on both the feed
> config and the OutputHandler (a FreeMarker template), yet what I get is:
>
> * publico.geral.xml seems to be written in the original encoding,
> * dn.xml seems to be written in the original encoding,
> * forindexing.xml claims to be UTF-8 ('$ file forindexing.xml') but it
> has ISO-8859-1 characters (or what seem to be) in the actual file.
>
> Any help will be very much appreciated.
Nuno,
There are, indeed, problems.
1. The "publico" feed is served up by a Microsoft IIS/5.0 web server, which
   specifies the character set in a Content-Type header that looks like
   this:

       Content-Type: text/xml; Charset=ISO-8859-1

   Most web servers use a lower-case "charset" parameter, and curn only
   recognizes lower case. I've fixed the bug in curn 3.2; the fix is in
   3.2-alpha-3, which is unofficially available here:

       http://www.clapper.org/software/java/curn/download/tmp/install-curn-3.2-alpha-3.jar

   Alternatively, you can add a ForceEncoding parameter to the config:

       [Feed.Publico.Geral]
       TitleOverride:Publico,Geral
       URL:http://www.publico.clix.pt/rss.asp?idCanal=10
       ForceEncoding: ISO-8859-1
2. The DN feed is UTF-8, but the web server doesn't advertise the encoding,
   so curn will use whatever the default is for your Java VM. That causes
   problems here on my Mac, where the default encoding is not UTF-8. Adding
   "ForceEncoding: UTF-8" to the [Feed.DN] section fixes this problem.
3. My version of forindexing.xml, using my FreeMarker template, looks (to
   me) to be proper UTF-8. I converted the output to UTF-16 and then back
   to UTF-8, which should weed out any invalid UTF-8 sequences. The original
   UTF-8 version and the one converted from UTF-16 are identical.
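
As an illustration of the three points above, here is a minimal Python sketch (it is not curn's actual Java code). It reads the charset parameter from a Content-Type header without regard to case, falls back to the platform default when the header names no charset, and repeats the UTF-8 -> UTF-16 -> UTF-8 round trip. The function names, and the rule that a forced encoding takes precedence over the header, are assumptions made for the example.

```python
import locale
import re

# Match the charset parameter regardless of case ("charset=", "Charset=", ...).
_CHARSET_RE = re.compile(r';\s*charset\s*=\s*"?([^\s;"]+)"?', re.IGNORECASE)


def charset_from_content_type(header):
    """Return the charset named in a Content-Type header, or None."""
    match = _CHARSET_RE.search(header or "")
    return match.group(1) if match else None


def pick_encoding(header, forced=None):
    """Assumed precedence: forced value, then header, then platform default."""
    return (forced
            or charset_from_content_type(header)
            or locale.getpreferredencoding(False))


def round_trips_as_utf8(data):
    """True if data is valid UTF-8 and survives UTF-8 -> UTF-16 -> UTF-8."""
    try:
        text = data.decode("utf-8")  # strict: raises on invalid sequences
    except UnicodeDecodeError:
        return False
    return text.encode("utf-16").decode("utf-16").encode("utf-8") == data
```

With the IIS header shown in point 1, charset_from_content_type returns "ISO-8859-1". The round-trip check only catches byte sequences that are not valid UTF-8; text that was decoded with the wrong charset and then re-encoded as UTF-8 would still pass.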
Also note that in curn 3.2, the syntax of the SaveAs parameter changes, and
the SaveAsEncoding parameter is deprecated. The old syntax is still
supported, but using it will generate runtime warning messages. The new
syntax is:

    SaveAs: [--backups total_backups] [--encoding encoding] path

or

    SaveAs: [-b total_backups] [-e encoding] path
So, in your configuration file, you'd replace:

    [Feed.DN]
    TitleOverride:Diario de Noticias
    URL:http://rss.sapo.pt/dn/
    SaveAs:${system:user.home}/k2/...
    SaveAsEncoding:utf-8

with this:

    [Feed.DN]
    TitleOverride:Diario de Noticias
    URL:http://rss.sapo.pt/dn/
    SaveAs: -e utf-8 ${system:user.home}/k2/...

(I've trimmed the paths to prevent wrapping in the mail message.)
Note that the "SaveAs" parameter in the [OutputHandler] section does NOT
use that same syntax. The new syntax applies only to the "SaveAs"
parameter in a [Feed] section.
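
The new [Feed] SaveAs value reads like a small command line. The following Python sketch shows one way such a value could be split into its parts. It is an illustration only, not curn's parser; the function name and the None defaults for missing options are assumptions.

```python
import argparse
import shlex


def parse_feed_save_as(value):
    """Split a [Feed] SaveAs value into backups, encoding, and path."""
    parser = argparse.ArgumentParser(prog="SaveAs", add_help=False)
    # Both options are optional; None here just means "not given".
    parser.add_argument("-b", "--backups", type=int, default=None)
    parser.add_argument("-e", "--encoding", default=None)
    parser.add_argument("path")
    # shlex.split leaves ${...} substitutions untouched as plain text.
    return parser.parse_args(shlex.split(value))


opts = parse_feed_save_as(
    "-e utf-8 ${system:user.home}/k2/contentfetch/curn/data/store/dn.xml"
)
print(opts.encoding, opts.backups, opts.path)
```

The sketch does no variable substitution, so the ${system:user.home} reference comes back unchanged in opts.path.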
This is all spelled out in the User's Guide that's installed with the alpha-3
release.
I hope to have 3.2 officially out within the next month. I've been swamped
lately...
Hope this helps.
--
-Brian
Brian Clapper, http://www.clapper.org/bmc/
Go soothingly in the grease mud, as there lurks the skid demon.
---
*** Posted to the curn-users mailing list (curn-users@xxxxxxxxxxx).