anl.aida.reader
Class NYTimesHTMLReader
java.lang.Object
anl.aida.reader.NYTimesHTMLReader
- All Implemented Interfaces:
- ContentReader
public class NYTimesHTMLReader
- extends java.lang.Object
- implements ContentReader
Scrapes article content from NYTimes pages. It searches the page at the given
URL for the print link and follows that link to the printable page. The
printable page is much cleaner, so the article text is scraped from it instead.
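The print-link lookup described above can be sketched roughly as follows. This is only an illustration, not the class's actual code: the real implementation uses its private printLinkExtractor and NYTPrintLinkBean, whose matching rules are not documented on this page. The class name PrintLinkFinder, the method findPrintLink, and the assumption that the print link's href contains "pagewanted=print" are all hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: locates a "print" link in already-downloaded HTML. */
public class PrintLinkFinder {

    // Assumption: the printable version is linked with an href that
    // contains "pagewanted=print". The real marker may differ.
    private static final Pattern PRINT_HREF = Pattern.compile(
            "href\\s*=\\s*\"([^\"]*pagewanted=print[^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /**
     * Returns the first href value that looks like a print link,
     * or an empty Optional if the page has none.
     */
    public static Optional<String> findPrintLink(String html) {
        Matcher m = PRINT_HREF.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

The returned value may be a relative link, so a caller would still need to resolve it against the article URL before fetching the printable page.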
Method Summary
private java.lang.String fixLink(java.lang.String link, java.lang.String url)
ReaderResult read(java.lang.String url, java.lang.String title, java.util.Date date, java.lang.String author)
- Reads the text from the specified URL.
Methods inherited from class java.lang.Object
- clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
DOC_START
private static final java.lang.String DOC_START
- See Also:
- Constant Field Values
DOC_END
private static final java.lang.String DOC_END
- See Also:
- Constant Field Values
extractor
private StringExtractor extractor
printLinkExtractor
private StringExtractor printLinkExtractor
processor
private DocumentProcessor processor
nullProc
private DocumentProcessor nullProc
content
private java.lang.StringBuilder content
result
private ReaderResult result
printLink
private java.lang.String printLink
singlePageLink
private java.lang.String singlePageLink
bean
private NYTimesHTMLReader.NYTStringBean bean
printBean
private NYTimesHTMLReader.NYTPrintLinkBean printBean
NYTimesHTMLReader
public NYTimesHTMLReader()
read
public ReaderResult read(java.lang.String url,
java.lang.String title,
java.util.Date date,
java.lang.String author)
throws java.io.IOException
- Reads the text from the specified URL. The title, date and author are
expected to be provided from another source, e.g. an RSS feed entry or an
archive "index."
- Specified by:
read
in interface ContentReader
- Parameters:
url - the url to read from
title - the article title
date - the date of the article
author - the author (can be empty string)
- Returns:
- the result of reading and parsing the text at the url
- Throws:
java.io.IOException
- if there is an error while reading
fixLink
private java.lang.String fixLink(java.lang.String link,
java.lang.String url)
throws java.net.MalformedURLException
- Throws:
java.net.MalformedURLException
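This page gives no description for fixLink. Its signature (a link, the page url, and a MalformedURLException) suggests that it turns a possibly relative link into an absolute one, but that is an assumption. Under that assumption, a minimal sketch could look like the following; the class name LinkResolver and the method resolve are hypothetical and not part of anl.aida.reader.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper: resolves a link found on a page against the page's URL. */
public class LinkResolver {

    /**
     * Resolves link relative to pageUrl. An absolute link is returned
     * unchanged; a relative one inherits the protocol and host of pageUrl.
     *
     * @throws MalformedURLException if either argument cannot be parsed
     */
    public static String resolve(String link, String pageUrl)
            throws MalformedURLException {
        URL base = new URL(pageUrl);
        // URL(URL context, String spec) applies standard relative-reference resolution.
        return new URL(base, link).toString();
    }
}
```

For example, resolving "/2009/01/01/story.html?pagewanted=print" against "http://www.nytimes.com/2009/01/01/world/story.html" would yield "http://www.nytimes.com/2009/01/01/story.html?pagewanted=print".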