anl.aida.reader
Class NTARCHtmlReader
java.lang.Object
anl.aida.reader.NTARCHtmlReader
- All Implemented Interfaces:
- ContentReader
public class NTARCHtmlReader
- extends java.lang.Object
- implements ContentReader
Reads the HTML from an NTARC article page. The stored content is the lines
of text between the "Thanks for Visiting" message and an http link containing
"reblog". When a page is requested without a real cookie-accepting browser,
the "Thanks for Visiting" message should always appear.
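The marker-based extraction described above can be sketched roughly as follows. This is a minimal illustration, not the class's actual implementation: the class name MarkerExtractSketch, the method extractBetweenMarkers, the case-insensitive matching, and the choice to exclude both marker lines are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class MarkerExtractSketch {
    // Marker strings taken from the class description; matching them
    // case-insensitively is an assumption of this sketch.
    static final String START_MARKER = "thanks for visiting";
    static final String END_MARKER = "reblog";

    /**
     * Collects the lines after the first line containing the start marker
     * and before the first later line that contains an http link with
     * "reblog" in it. Both marker lines are excluded. If the end marker
     * never appears, everything after the start marker is kept; if the
     * start marker never appears, the result is empty.
     */
    static String extractBetweenMarkers(List<String> lines) {
        StringBuilder content = new StringBuilder();
        boolean inside = false;
        for (String line : lines) {
            String lower = line.toLowerCase(Locale.ROOT);
            if (!inside) {
                if (lower.contains(START_MARKER)) {
                    inside = true;
                }
                continue;
            }
            if (lower.contains("http") && lower.contains(END_MARKER)) {
                break;
            }
            content.append(line).append('\n');
        }
        return content.toString();
    }

    public static void main(String[] args) {
        // Hypothetical page lines, invented for illustration only.
        List<String> page = Arrays.asList(
                "<div>navigation</div>",
                "Thanks for Visiting",
                "First paragraph of the article.",
                "Second paragraph of the article.",
                "<a href=\"http://example.org/reblog?id=1\">Reblog</a>",
                "<div>footer</div>");
        System.out.print(extractBetweenMarkers(page));
    }
}
```

Accumulating into a StringBuilder mirrors the private content field listed for this class, though how the real reader fills that field is not documented here.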
Method Summary
ReaderResult
read(java.lang.String url,
java.lang.String title,
java.util.Date date,
java.lang.String author)
Parses the text from the specified URL.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
extractor
private StringExtractor extractor
processor
private DocumentProcessor processor
content
private java.lang.StringBuilder content
result
private ReaderResult result
NTARCHtmlReader
public NTARCHtmlReader()
read
public ReaderResult read(java.lang.String url,
java.lang.String title,
java.util.Date date,
java.lang.String author)
throws java.io.IOException
- Parses the text from the specified URL. The title and date are expected to
be supplied by another source, e.g. an RSS feed entry or an "index".
- Specified by:
read
in interface ContentReader
- Parameters:
url
- the URL to read from
title
- the article title
date
- the date of the article
author
- the author (can be an empty string)
- Returns:
- the result of reading and parsing the text at the url
- Throws:
java.io.IOException
- if there is an error while reading
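A caller might wrap read along the following lines. This is a hedged sketch only: the nested ContentReader interface is a stand-in that copies the documented signature, ReaderResult is a placeholder (its real members are not documented on this page), and ReadCallSketch, tryRead, and the example URL are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.util.Date;
import java.util.Optional;

public class ReadCallSketch {
    /** Placeholder type; the real ReaderResult is not documented here. */
    static class ReaderResult {
        final String text;
        ReaderResult(String text) { this.text = text; }
    }

    /** Stand-in interface mirroring the documented read signature. */
    interface ContentReader {
        ReaderResult read(String url, String title, Date date, String author)
                throws IOException;
    }

    /**
     * Calls the reader, substituting an empty string when no author is
     * known, and turns an IOException into an empty Optional.
     */
    static Optional<ReaderResult> tryRead(ContentReader reader, String url,
                                          String title, Date date, String author) {
        try {
            String safeAuthor = (author == null) ? "" : author;
            return Optional.ofNullable(reader.read(url, title, date, safeAuthor));
        } catch (IOException e) {
            System.err.println("Could not read " + url + ": " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // A stub reader used in place of a real NTARCHtmlReader instance.
        ContentReader stub = (url, title, date, author) ->
                new ReaderResult(title + " by [" + author + "]");
        Optional<ReaderResult> result =
                tryRead(stub, "http://example.org/article", "Sample title", new Date(), null);
        System.out.println(result.isPresent());
    }
}
```

With the real class, the stub would be replaced by new NTARCHtmlReader(), and the title, date, and author would come from the feed entry or index that supplied the URL.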