
RE: [curn-users] issue with IgnoreDuplicateTitles plug-in



Thanks Brian.. that was a prompt reply!

 

Here's my config file:

 

[var]
# "feedDir" dumps to a directory that's accessible internally via URL
feedDir: .
# curnDir: where this file and the cache live
curnDir: .

[curn]
CacheFile: ${var:curnDir}/common.cache
MaxThreads: 15
ParserClass: org.clapper.curn.parser.rome.RSSParserAdapter
GzipDownload: true

[Feed_slashdot]
# Slashdot
URL: http://rss.slashdot.org/Slashdot/slashdot
SaveAs: ${var:feedDir}/slashdot.xml
IgnoreDuplicateTitles: true

 

And the command line:

 

(to enable logging..)

set CURN_JAVA_VM_ARGS=-Djava.util.logging.config.file=./logging.properties

 

(to invoke curn)

curn --logging -C curn.cfg

 

and this is the "logging.properties" (in case you want to have a look at it)

 

log4j.rootLogger=debug, File
log4j.appender.File=org.apache.log4j.FileAppender
log4j.appender.File.layout=org.apache.log4j.PatternLayout
log4j.appender.File.file=./log.out
# Overwrite the file each time
log4j.appender.File.append=false
# Print the date in ISO 8601 format
log4j.appender.File.layout.ConversionPattern=%d %-5p (%c{1}): %m%n
log4j.logger.org.clapper.curn=debug

 

The issue here is not to enable logging, though :) but to get the duplicate plugin working.

I also have a question about IgnoreDuplicateArticlesPlugIn.java. I suppose this is the underlying class that's called when IgnoreDuplicateTitles is set to true. I forced some articles to have duplicate titles to see how they're handled. They have exactly the same title, but curn didn't suppress the duplicates. I was hoping to modify that plugin to make it more sophisticated, once this works.

 

Thanks again for taking the time on a Sunday evening!

 

- Bharath

 

-----Original Message-----
From: Brian Clapper [mailto:bmc@xxxxxxxxxxx]
Sent: Sunday, August 05, 2007 5:56 PM
To: Bharath Prathipati
Cc: curn-users@xxxxxxxxxxx
Subject: Re: [curn-users] issue with IgnoreDuplicateTitles plug-in

 

On 8/5/07 8:34 PM, Bharath Prathipati wrote:

> hi there!
>
> I was trying out the Windows version of curn and was trying this
> IgnoreDuplicateTitles plugin, but could never get that to work. And even
> my attempt to enable logging was not successful.
>
> For the plugin.. I used “IgnoreDuplicateTitles: true” for each feed. Is
> there something else to be done?
>
> Maybe my question/request is too abstract! Please let me know if
> anything else is to be provided.

 

Bharath,

 

Logging does work, but getting it configured properly can be a challenge

the first time you try it. (That has a lot to do with how the underlying

logging APIs work.)

 

Please send me:

 

a) Your curn configuration file.

b) The command line you used to invoke curn.

 

Note that the IgnoreDuplicateArticles plug-in is rather simplistic. It

simply compares the article titles in each feed to see if there are

duplicates. It attempts to normalize the titles slightly, but only

slightly:

 

- It converts all adjacent white space into a single space.

- It converts the title to lower case.

 

Thus, these two titles will compare as equal (and the second one will be

suppressed):

 

       Dog drags owner from well

       Dog    drags Owner  from well

 

The first one will be converted to "dog drags owner from well" and saved.

When curn sees the second title, it will remove the extra spaces and convert

it to lower case; the result will match the first title, and the second

article will be suppressed.

 

It doesn't do anything fancier than that, though.
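
For anyone who wants to experiment with the comparison described above, here is a minimal Java sketch of it. It is not the source of curn's IgnoreDuplicateArticles plug-in: the class name TitleDedup and the methods normalize and suppressDuplicates are hypothetical names chosen for illustration, and the sketch assumes nothing beyond the two normalization steps listed (collapse runs of white space, convert to lower case). It does not trim leading or trailing white space, since the description above does not mention that.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; NOT curn's actual plug-in code.
final class TitleDedup {
    // Collapse each run of white space into a single space, then lower-case.
    static String normalize(String title) {
        return title.replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    // Keep the first article seen for each normalized title; drop later matches.
    static List<String> suppressDuplicates(List<String> titles) {
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            // Set.add returns false when the normalized title was already seen.
            if (seen.add(normalize(title))) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> titles = List.of(
            "Dog drags owner from well",
            "Dog    drags Owner  from well");
        // Only the first of the two titles should survive.
        System.out.println(suppressDuplicates(titles));
    }
}
```

A comparison this simple treats titles that differ in punctuation or wording as distinct, so anything more sophisticated would start by changing normalize.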

 

Send me your config file. I'll take a look.

--

-Brian

 

Brian Clapper, http://www.clapper.org/bmc/

A day without sunshine is like night.


