anl.aida.reader
Class IndependentHTMLReader

java.lang.Object
  extended by anl.aida.reader.IndependentHTMLReader
All Implemented Interfaces:
ContentReader

public class IndependentHTMLReader
extends java.lang.Object
implements ContentReader

Scrapes article content from The Independent.


Nested Class Summary
private  class IndependentHTMLReader.Processor
           
 
Field Summary
private  java.lang.StringBuilder content
           
private  StringExtractor extractor
           
private  DocumentProcessor processor
           
private  ReaderResult result
           
 
Constructor Summary
IndependentHTMLReader()
           
 
Method Summary
 ReaderResult read(java.lang.String url, java.lang.String title, java.util.Date date, java.lang.String author)
          Reads the text from the specified URL.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

extractor

private StringExtractor extractor

processor

private DocumentProcessor processor

content

private java.lang.StringBuilder content

result

private ReaderResult result
Constructor Detail

IndependentHTMLReader

public IndependentHTMLReader()
Method Detail

read

public ReaderResult read(java.lang.String url,
                         java.lang.String title,
                         java.util.Date date,
                         java.lang.String author)
                  throws java.io.IOException
Reads the text from the specified URL. The title and date are expected to come from another source, e.g. an RSS feed entry or an archive index.

Specified by:
read in interface ContentReader
Parameters:
url - the URL to read from
title - the article title
date - the date of the article
author - the article author (may be an empty string)
Returns:
the result of reading and parsing the text at the URL
Throws:
java.io.IOException - if an error occurs while reading from the URL
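
Example:
The following is a minimal sketch of how a caller might invoke read with metadata taken from elsewhere and handle the checked IOException. The ReaderResult and ContentReader types below are local stand-ins, not the real anl.aida.reader classes: the members of ReaderResult are not documented on this page, so its text field is an assumption, and StubReader is a hypothetical replacement for IndependentHTMLReader so that the sketch runs without network access. The helper name readOrNull is also illustrative only.

```java
import java.io.IOException;
import java.util.Date;

public class ReadUsageSketch {

    // Stand-in for anl.aida.reader.ReaderResult. Its real members are not
    // documented here; the single text field is an assumption.
    static class ReaderResult {
        final String text;
        ReaderResult(String text) { this.text = text; }
    }

    // Stand-in that mirrors the documented ContentReader.read signature.
    interface ContentReader {
        ReaderResult read(String url, String title, Date date, String author)
                throws IOException;
    }

    // Hypothetical stub used in place of IndependentHTMLReader. It performs
    // no network access and fails on an empty URL to exercise the
    // IOException path.
    static class StubReader implements ContentReader {
        @Override
        public ReaderResult read(String url, String title, Date date, String author)
                throws IOException {
            if (url == null || url.isEmpty()) {
                throw new IOException("no URL given");
            }
            return new ReaderResult(title + " (" + url + ")");
        }
    }

    // Calls read with externally supplied metadata (for example from a feed
    // entry) and returns null if reading fails.
    static ReaderResult readOrNull(ContentReader reader, String url, String title,
                                   Date date, String author) {
        try {
            return reader.read(url, title, date, author);
        } catch (IOException e) {
            System.err.println("Could not read " + url + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        // With the real library this would be: new IndependentHTMLReader()
        ContentReader reader = new StubReader();
        ReaderResult result = readOrNull(reader, "http://example.com/article",
                "Example title", new Date(0L), ""); // author may be empty
        System.out.println(result == null ? "read failed" : result.text);
    }
}
```

Returning null on failure is only one possible policy; a caller that needs to distinguish failure causes may prefer to let the IOException propagate.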