
Re: [curn-users] issue with IgnoreDuplicateTitles plug-in



On 8/6/07 2:53 AM, Bharath Prathipati wrote:

> I was testing the duplicate article issue on my server, and not on
> Slashdot feeds. I should have made it clear beforehand. And I’m
> attaching the output file created from my rss feed using curn, which has
> duplicate articles. If you search for “cure for oil” you’ll find 2
> articles.

Okay, I think I may understand what's happening here.

Are you expecting curn to remove duplicates as part of the SaveAs
processing? If so, that's not going to work. SaveAs saves the raw XML,
downloaded from the remote site, BEFORE the file is parsed. curn does no
processing of the XML feed before saving it to the file specified by
SaveAs, so if there are duplicates in the feed, there will be duplicates in
the file. (That's what I meant by the term "raw" in the description of the
SaveAs parameter.)

Is that what you're trying to do?

The IgnoreDuplicateTitles plug-in operates on the parsed feed data, which
means it logically runs AFTER the SaveAs plug-in. Thus, by the time the
IgnoreDuplicateTitles plug-in runs, SaveAs has already done its work.

In curn 3.2 (which is not yet released, but see below), there's a new
SaveAsRSS parameter. That parameter instructs curn to save the PARSED
(i.e., not raw) feed data in an RSS feed format. Since the new SaveAsRSS
plug-in runs AFTER the data is parsed, it can (and does) honor what the
IgnoreDuplicateTitles plug-in does.
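
To make the ordering concrete, here's a rough sketch in Python. It is not
curn's code (curn is written in Java), and every name in it is made up for
illustration. It also assumes that duplicate titles are detected by exact
string comparison, which may not match what the plug-in really does. It only
models the order of the steps described above.

```python
# Illustrative model only: none of these names are curn APIs.

def run_feed(raw_items, ignore_duplicate_titles=True):
    """Model the order of operations for a single feed.

    raw_items stands in for the downloaded XML; each item is a dict
    with at least a "title" key.
    """
    # Step 1: SaveAs copies the download verbatim, duplicates and all.
    save_as_output = list(raw_items)

    # Step 2: the feed is parsed (modeled here as a plain copy).
    parsed = list(raw_items)

    # Step 3: IgnoreDuplicateTitles drops items whose title was already
    # seen. Exact string comparison is an assumption of this sketch.
    if ignore_duplicate_titles:
        seen = set()
        unique = []
        for item in parsed:
            if item["title"] not in seen:
                seen.add(item["title"])
                unique.append(item)
        parsed = unique

    # Step 4: SaveAsRSS writes whatever is left after the plug-ins ran.
    save_as_rss_output = parsed
    return save_as_output, save_as_rss_output


feed = [
    {"title": "Cure for oil", "link": "http://example.com/a"},
    {"title": "Cure for oil", "link": "http://example.com/b"},
    {"title": "Something else", "link": "http://example.com/c"},
]
raw, rss = run_feed(feed)
print(len(raw), len(rss))  # prints "3 2"
```

The SaveAs copy still has both "Cure for oil" items, while the SaveAsRSS
copy has only the first one.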

Even though curn 3.2 isn't released, you can play with an alpha release.
It's located here:

http://www.clapper.org/software/java/curn/download/tmp/install-curn-3.2-alpha-9.jar

The SaveAsRSS parameter is described in the User's Guide in there, but
here's a brief run-down. It's honored within a Feed section.

----------
SaveAsRSS: If set, this parameter specifies that the parsed feed data
           should be rewritten in the specified RSS format and saved to
           the specified file. This configuration item takes a command
           line-style value:

           [--backups total_backups] [--type rsstype] [--encoding enc] path

           or

           [-b total_backups] [-t rsstype] [-e enc] path

           where:

           - <total_backups> specifies how many backups (i.e., previous
             versions) of the generated RSS file to keep. For instance, a
             value of 5 means "keep 5 previous versions of the file, plus
             the one from the current run." This is the best way to keep
             RSS files from previous curn runs. The backup files have
             version numbers preceding their extensions. For instance, if
             the output file is foo.xml, and total_backups is 2, curn
             will keep foo.0.xml and foo.1.xml. The file with the largest
             version number is the oldest one. If not specified, this
             parameter defaults to 0, which means "no backups".

           - <rsstype> is the type of RSS output to generate. Currently,
             "rss1", "rss2" and "atom" are the supported values.

           - <enc> is optional and specifies the desired encoding of
             the file. It defaults to "utf-8".

           - <path> is the path to the file where the RSS output should
             be written.

           Note that only the new data in the feed is converted to RSS.

EXAMPLES:

           SaveAsRSS: -b 1 -t rss2 -e Cp1252 ${system:user.home}/feed-rss2.xml
           SaveAsRSS: -t atom -e UTF8 ${system:user.home}/feed-atom.xml

OPTIONAL. Default: none
----------
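
If it helps to see how the pieces of that value fit together, here's a small
Python sketch that splits a SaveAsRSS value the way a command line would be
split, and lists the backup file names described above. This is not how curn
parses the value internally; the function names are hypothetical. The
description above does not say what the default for the type option is, so
the sketch leaves it unset.

```python
import argparse
import os
import shlex


def parse_save_as_rss(value):
    """Split a SaveAsRSS value into its options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="SaveAsRSS", add_help=False)
    # 0 means "no backups", per the description above.
    parser.add_argument("-b", "--backups", type=int, default=0)
    # No default is documented for the type, so leave it as None here.
    parser.add_argument("-t", "--type", dest="rsstype",
                        choices=["rss1", "rss2", "atom"], default=None)
    parser.add_argument("-e", "--encoding", default="utf-8")
    parser.add_argument("path")
    return parser.parse_args(shlex.split(value))


def backup_names(path, total_backups):
    """Put the version number in front of the extension: foo.0.xml, ..."""
    stem, ext = os.path.splitext(path)
    return [f"{stem}.{n}{ext}" for n in range(total_backups)]


opts = parse_save_as_rss(
    "-b 1 -t rss2 -e Cp1252 ${system:user.home}/feed-rss2.xml")
print(opts.backups, opts.rsstype, opts.encoding)  # prints "1 rss2 Cp1252"
print(backup_names("foo.xml", 2))  # prints "['foo.0.xml', 'foo.1.xml']"
```

Note that the sketch leaves ${system:user.home} untouched; expanding that
variable is curn's job, not something the sketch attempts.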

Regards,

-Brian

Brian Clapper, http://www.clapper.org/bmc/
Why is it that there are so many more horses' asses than there are horses?
	-- G. Gordon Liddy
---
*** Posted to the curn-users mailing list (curn-users@xxxxxxxxxxx).


