anl.aida.reader
Class ChicagoTribuneHTMLReader

java.lang.Object
  extended by anl.aida.reader.ChicagoTribuneHTMLReader
All Implemented Interfaces:
ContentReader

public class ChicagoTribuneHTMLReader
extends java.lang.Object
implements ContentReader

Scrapes article content from archived Chicago Tribune pages.


Nested Class Summary
private  class ChicagoTribuneHTMLReader.Processor
           
 
Field Summary
private  java.lang.StringBuilder content
           
private  StringExtractor extractor
           
private  DocumentProcessor processor
           
private  ReaderResult result
           
 
Constructor Summary
ChicagoTribuneHTMLReader()
           
 
Method Summary
 ReaderResult read(java.lang.String url, java.lang.String title, java.util.Date date, java.lang.String author)
          Reads the text from the specified URL.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

extractor

private StringExtractor extractor

processor

private DocumentProcessor processor

content

private java.lang.StringBuilder content

result

private ReaderResult result

Constructor Detail

ChicagoTribuneHTMLReader

public ChicagoTribuneHTMLReader()

Method Detail

read

public ReaderResult read(java.lang.String url,
                         java.lang.String title,
                         java.util.Date date,
                         java.lang.String author)
                  throws java.io.IOException
Reads the text from the specified URL. The title and date are expected to be supplied from another source, e.g. an RSS feed entry or an archive index.

Specified by:
read in interface ContentReader
Parameters:
url - the URL to read from
title - the article title
date - the date of the article
author - the article author (may be an empty string)
Returns:
the result of reading and parsing the text at the URL
Throws:
java.io.IOException - if an I/O error occurs while reading
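
Example:
The call pattern described above can be sketched as follows. This is a minimal illustration, not code from the library. The nested ReaderResult and ContentReader types are hypothetical stand-ins that mirror only the read signature documented here (the text field is an assumption, since the members of the real ReaderResult are not shown on this page). The fetch helper and the stub reader are likewise invented so the sketch runs without network access. In real use the reader would presumably be a new ChicagoTribuneHTMLReader().

```java
import java.io.IOException;
import java.util.Date;

public class ReadUsageSketch {

    // Hypothetical stand-in for anl.aida.reader.ReaderResult; the "text"
    // field is an assumption made for this illustration only.
    static class ReaderResult {
        final String text;
        ReaderResult(String text) { this.text = text; }
    }

    // Hypothetical stand-in for anl.aida.reader.ContentReader, mirroring
    // the documented read(url, title, date, author) signature.
    interface ContentReader {
        ReaderResult read(String url, String title, Date date, String author)
                throws IOException;
    }

    // Passes metadata obtained elsewhere (e.g. from a feed entry) to the
    // reader, substituting "" when no author is known, since the parameter
    // documentation allows an empty string.
    static ReaderResult fetch(ContentReader reader, String url, String title,
                              Date date, String author) throws IOException {
        String safeAuthor = (author == null) ? "" : author;
        return reader.read(url, title, date, safeAuthor);
    }

    public static void main(String[] args) {
        // Stub reader used in place of a real ContentReader implementation.
        ContentReader stub = (url, title, date, author) ->
                new ReaderResult(title + " by [" + author + "] from " + url);
        try {
            ReaderResult r = fetch(stub, "http://example.com/article.html",
                    "Sample headline", new Date(0L), null);
            System.out.println(r.text);
        } catch (IOException e) {
            // read declares IOException, so callers must handle or propagate it.
            System.err.println("Could not read article: " + e.getMessage());
        }
    }
}
```

Because read declares java.io.IOException, a caller that processes many URLs may prefer to catch the exception per article, as above, so that one failed page does not stop the whole batch.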