anl.aida.reader
Class NYTimesHTMLReader

java.lang.Object
  extended by anl.aida.reader.NYTimesHTMLReader
All Implemented Interfaces:
ContentReader

public class NYTimesHTMLReader
extends java.lang.Object
implements ContentReader

Scrapes article content from NYTimes pages. It searches the page at the given URL for the print link, then follows that link to fetch the printable page. The printable page is much cleaner, so the article text is scraped from it rather than from the original page.
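The first step described above, locating the print link in the fetched page, can be sketched roughly as follows. This is an illustration only: the class name PrintLinkSketch, the helper findPrintLink, the regular expression, and the sample markup are all hypothetical. The real class relies on its private StringExtractor and bean fields, whose behavior this page does not document.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrintLinkSketch {
    // Hypothetical pattern: an anchor whose visible text starts with "Print".
    // The markup of real article pages may differ.
    private static final Pattern PRINT_LINK = Pattern.compile(
            "<a\\s[^>]*href=\"([^\"]+)\"[^>]*>\\s*Print[^<]*</a>",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href of the first print-style link in the HTML, if any. */
    public static Optional<String> findPrintLink(String html) {
        Matcher m = PRINT_LINK.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample markup, not taken from a real page.
        String page = "<p><a href=\"/story.html?view=print\">Print</a></p>";
        System.out.println(findPrintLink(page).orElse("(no print link found)"));
    }
}
```

A caller would then fetch the returned link and extract the article text from that cleaner page; when no print link is found, falling back to the original page is one possible choice, though this page does not say what the class actually does in that case.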


Nested Class Summary
private static class NYTimesHTMLReader.NullProcessor
           
private  class NYTimesHTMLReader.NYTPrintLinkBean
           
private  class NYTimesHTMLReader.NYTStringBean
          Looks for the article body tag and places start and end tags in the output to mark the start and end of the text.
private  class NYTimesHTMLReader.Processor
           
 
Field Summary
private  NYTimesHTMLReader.NYTStringBean bean
           
private  java.lang.StringBuilder content
           
private static java.lang.String DOC_END
           
private static java.lang.String DOC_START
           
private  StringExtractor extractor
           
private  DocumentProcessor nullProc
           
private  NYTimesHTMLReader.NYTPrintLinkBean printBean
           
private  java.lang.String printLink
           
private  StringExtractor printLinkExtractor
           
private  DocumentProcessor processor
           
private  ReaderResult result
           
private  java.lang.String singlePageLink
           
 
Constructor Summary
NYTimesHTMLReader()
           
 
Method Summary
private  java.lang.String fixLink(java.lang.String link, java.lang.String url)
           
 ReaderResult read(java.lang.String url, java.lang.String title, java.util.Date date, java.lang.String author)
          Reads the text from the specified URL.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

DOC_START

private static final java.lang.String DOC_START
See Also:
Constant Field Values

DOC_END

private static final java.lang.String DOC_END
See Also:
Constant Field Values

extractor

private StringExtractor extractor

printLinkExtractor

private StringExtractor printLinkExtractor

processor

private DocumentProcessor processor

nullProc

private DocumentProcessor nullProc

content

private java.lang.StringBuilder content

result

private ReaderResult result

printLink

private java.lang.String printLink

singlePageLink

private java.lang.String singlePageLink

bean

private NYTimesHTMLReader.NYTStringBean bean

printBean

private NYTimesHTMLReader.NYTPrintLinkBean printBean
Constructor Detail

NYTimesHTMLReader

public NYTimesHTMLReader()
Method Detail

read

public ReaderResult read(java.lang.String url,
                         java.lang.String title,
                         java.util.Date date,
                         java.lang.String author)
                  throws java.io.IOException
Reads the text from the specified URL. The title, date, and author are expected to come from another source, e.g. an RSS feed entry or an archive index.

Specified by:
read in interface ContentReader
Parameters:
url - the URL to read from
title - the article title
date - the date of the article
author - the author (can be an empty string)
Returns:
the result of reading and parsing the text at the url
Throws:
java.io.IOException - if there is an error while reading
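
As a rough illustration of the calling convention described above, the sketch below passes feed-supplied metadata to a reader and substitutes an empty string for a missing author. Everything here is a stand-in: the nested ReaderResult and ContentReader types only mirror the names and the read signature shown on this page, the summary field and the fetch helper are hypothetical, and the stub reader performs no network access.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Date;

public class ReadUsageSketch {
    /** Stand-in for ReaderResult; its real members are not listed on this page. */
    public static class ReaderResult {
        public final String summary; // hypothetical field, for demonstration only
        public ReaderResult(String summary) { this.summary = summary; }
    }

    /** Stand-in mirroring the read signature documented above. */
    public interface ContentReader {
        ReaderResult read(String url, String title, Date date, String author) throws IOException;
    }

    /** Calls read with feed metadata, using "" when the feed gives no author. */
    public static ReaderResult fetch(ContentReader reader, String url,
                                     String title, Date date, String author) {
        try {
            return reader.read(url, title, date, author == null ? "" : author);
        } catch (IOException e) {
            // One possible policy: surface read failures as unchecked exceptions.
            throw new UncheckedIOException("Could not read " + url, e);
        }
    }

    public static void main(String[] args) {
        // Stub reader that echoes its metadata instead of fetching anything.
        ContentReader stub = (url, title, date, author) ->
                new ReaderResult(title + " by [" + author + "]");
        ReaderResult r = fetch(stub, "http://www.example.com/story.html",
                "Sample title", new Date(0L), null);
        System.out.println(r.summary);
    }
}
```

With the real class, the stub would be replaced by a NYTimesHTMLReader instance, and the caller would decide how to handle the declared IOException.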

fixLink

private java.lang.String fixLink(java.lang.String link,
                                 java.lang.String url)
                          throws java.net.MalformedURLException
Throws:
java.net.MalformedURLException
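
This page gives no description for fixLink. Its signature suggests that it turns a link found on a page into an absolute URL, using the page's own URL as the base, but that is an assumption. Under that assumption, a minimal sketch of such a resolution step might look like the following. Note the swapped-in technique: the sketch uses java.net.URI.resolve, which reports malformed input with an unchecked IllegalArgumentException rather than the MalformedURLException declared above. The class name FixLinkSketch and the sample URLs are illustrative only.

```java
import java.net.URI;

public class FixLinkSketch {
    /**
     * Resolves a possibly relative link against the URL of the page it came from.
     * This is an assumed behavior, not the documented contract of fixLink.
     */
    public static String fixLink(String link, String url) {
        URI base = URI.create(url);
        // Absolute links are returned unchanged; relative ones are resolved
        // against the base URL.
        return base.resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/news/story.html";
        System.out.println(fixLink("/print/story.html", page));
        System.out.println(fixLink("print.html", page));
        System.out.println(fixLink("http://other.example.com/a.html", page));
    }
}
```

Links scraped from real pages may contain characters such as spaces that URI.create rejects, so a production version would probably need to escape or sanitize the link first.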