
[curn-users] Java UTF-8 byte order mark issues



I'm sending this reply to the entire list, since it might be of use to
someone else. Jerry sent me a message indicating that curn was puking on
XML from a particular site. Here's the result of my diagnosis.

On 2/29/08 2:48 PM, Donoghue, Jerry wrote:
> Brian,
>
> Here you go:
>
> http://www.breakingviews.com/Rss/Feed.aspx?s=f-jl8742aw&g=3319&t=full

Jerry,

This is actually a Java problem. The file is UTF-8, but with a leading
3-byte byte order mark (BOM). The UTF-8 specification allows for a BOM,
even though it's kind of pointless with a byte-oriented encoding like
UTF-8.

See http://en.wikipedia.org/wiki/Byte_Order_Mark

Unfortunately, Java's UTF-8 decoder does not strip a leading byte order
mark. So the Java I/O routines pass the BOM straight along to the XML
parser, which sees it and pukes, because it (the XML parser) expects the
document to start with an "<?xml ...?>" declaration, not those three stray
bytes. (Download that URL, and do a hexdump on it. You'll see what I mean.
The leading bytes are 0xef 0xbb 0xbf, which is the UTF-8 BOM.)
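
For anyone hitting the same thing in their own Java code, here is a minimal
sketch of one way to check for those three bytes and skip them before the
stream reaches a parser. The class and method names (BomSkipper,
skipUtf8Bom) are made up for illustration; this is not how curn handles it,
just one way to do it with the standard java.io classes.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical helper for illustration; not part of curn or the JDK.
class BomSkipper {

    // Returns a stream positioned just past a leading UTF-8 BOM (EF BB BF).
    // If there is no BOM, the bytes that were peeked at are pushed back.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        // Read up to three bytes; a single read() may return fewer.
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) break;          // end of stream
            n += r;
        }
        boolean isBom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        if (!isBom && n > 0) {
            pin.unread(head, 0, n);    // not a BOM: put the bytes back
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample input: a BOM followed by the start of an XML declaration.
        byte[] feed = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(feed));
        System.out.println((char) clean.read());
    }
}
```

The pushback buffer is only three bytes, so a stream without a BOM comes
back unchanged.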

Add the following to the section for that feed:

    ForceEncoding: utf-8
    PreparseEdit0:  's|^.*<\?xml version|<?xml version|'

This should fix the problem.
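
To see what that edit does, here is a rough Java equivalent of the
substitution. Once the bytes are decoded as UTF-8, the BOM shows up as the
single character U+FEFF in front of the XML declaration, and the pattern
throws away everything on that line before "<?xml version". The class and
method names are made up, and this only illustrates the effect of the
regular expression; I'm not claiming it's how curn applies PreparseEdit
internally.

```java
// Hypothetical demo class; illustrates the regex only, not curn's internals.
class PreparseEditDemo {

    // Drops anything that precedes "<?xml version" on the given line,
    // mirroring s|^.*<\?xml version|<?xml version|
    static String stripBeforeXmlDecl(String line) {
        return line.replaceFirst("^.*<\\?xml version", "<?xml version");
    }

    public static void main(String[] args) {
        // Made-up sample: U+FEFF is the decoded form of the bytes EF BB BF.
        String line = "\uFEFF<?xml version=\"1.0\" encoding=\"utf-8\"?>";
        String fixed = stripBeforeXmlDecl(line);
        System.out.println(fixed.startsWith("<?xml"));
    }
}
```

A line that doesn't contain "<?xml version" is left alone, so the edit is
harmless if the site ever stops sending the BOM.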

Here's the configuration I used to test the fix:

[FeedTest_BN]

URL: http://www.breakingviews.com/Rss/Feed.aspx?s=f-jl8742aw&g=3319&t=full
Disabled: false
ForceEncoding: utf-8
PreparseEdit0:  's|^.*<\?xml version|<?xml version|'

Hope that helps.
--
-Brian

Brian Clapper, http://www.clapper.org/bmc/
Anything is possible if you don't know what you're talking about.
---
*** Posted to the curn-users mailing list (curn-users@xxxxxxxxxxx).


