Selected Papers from Text Mining Workshop at the 2012 SIAM International Conference on Data Mining

A special issue of Algorithms (ISSN 1999-4893).

Deadline for manuscript submissions: closed (30 June 2012) | Viewed by 49731

Special Issue Editors

Department of Electrical Engineering and Computer Science, The University of Tennessee, Min H. Kao Building, Suite 401, 1520 Middle Drive, Knoxville, TN 37996, USA
Interests: information retrieval, data and text mining, computational science, bioinformatics, and parallel computing
Department of Mathematics and Statistics, University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA
Interests: optimal control theory; finite dimensional optimization; robust stability of control systems; computational information retrieval

Special Issue Information

Dear Colleagues,

The proliferation of digital computing devices and their use in communication continues to result in an increased demand for systems and algorithms capable of mining textual data. Thus, the development of techniques for mining unstructured, semi-structured, and fully structured textual data has become quite important in both academia and industry. As a result, a one-day workshop on text mining was held on April 28, 2012, in conjunction with the SIAM Twelfth International Conference on Data Mining to bring together researchers from a variety of disciplines to present their current approaches and results in text mining. The workshop surveyed the emerging field of text mining: the application of machine learning techniques in conjunction with natural language processing, information extraction, and algebraic/mathematical approaches to computational information retrieval. Many issues are being addressed in this field, ranging from the development of new document classification and clustering models to novel approaches for topic detection, tracking, and visualization.

Prof. Dr. Michael W. Berry
Dr. Jacob Kogan
Guest Editors

Keywords

  • document ranking and representation
  • document classification and clustering
  • text summarization and anomaly detection

Published Papers (6 papers)

Research

1199 KiB  
Article
Extracting Hierarchies from Data Clusters for Better Classification
by German Sapozhnikov and Alexander Ulanov
Algorithms 2012, 5(4), 506-520; https://doi.org/10.3390/a5040506 - 23 Oct 2012
Cited by 2 | Viewed by 5919
Abstract
In this paper we present the PHOCS-2 algorithm, which extracts a “Predicted Hierarchy Of ClassifierS”. The extracted hierarchy helps us to enhance performance of flat classification. Nodes in the hierarchy contain classifiers. Each intermediate node corresponds to a set of classes and each leaf node corresponds to a single class. In the PHOCS-2 we make estimation for each node and achieve more precise computation of false positives, true positives and false negatives. Stopping criteria are based on the results of the flat classification. The proposed algorithm is validated against nine datasets.

1308 KiB  
Article
The Effects of Tabular-Based Content Extraction on Patent Document Clustering
by Denise R. Koessler, Benjamin W. Martin, Bruce E. Kiefer and Michael W. Berry
Algorithms 2012, 5(4), 490-505; https://doi.org/10.3390/a5040490 - 22 Oct 2012
Viewed by 6407
Abstract
Data can be represented in many different ways within a particular document or set of documents. Hence, attempts to automatically process the relationships between documents or determine the relevance of certain document objects can be problematic. In this study, we have developed software to automatically catalog objects contained in HTML files for patents granted by the United States Patent and Trademark Office (USPTO). Once these objects are recognized, the software creates metadata that assigns a data type to each document object. Such metadata can be easily processed and analyzed for subsequent text mining tasks. Specifically, document similarity and clustering techniques were applied to a subset of the USPTO document collection. Although our preliminary results demonstrate that tables and numerical data do not provide quantifiable value to a document’s content, the stage for future work in measuring the importance of document objects within a large corpus has been set.

3733 KiB  
Article
Contextual Anomaly Detection in Text Data
by Amogh Mahapatra, Nisheeth Srivastava and Jaideep Srivastava
Algorithms 2012, 5(4), 469-489; https://doi.org/10.3390/a5040469 - 19 Oct 2012
Cited by 22 | Viewed by 14627
Abstract
We propose using side information to further inform anomaly detection algorithms of the semantic context of the text data they are analyzing, thereby considering both divergence from the statistical pattern seen in particular datasets and divergence seen from more general semantic expectations. Computational experiments show that our algorithm performs as expected on data that reflect real-world events with contextual ambiguity, while replicating conventional clustering on data that are either too specialized or generic to result in contextual information being actionable. These results suggest that our algorithm could potentially reduce false positive rates in existing anomaly detection systems.

401 KiB  
Article
Better Metrics to Automatically Predict the Quality of a Text Summary
by Peter A. Rankel, John M. Conroy and Judith D. Schlesinger
Algorithms 2012, 5(4), 398-420; https://doi.org/10.3390/a5040398 - 26 Sep 2012
Cited by 13 | Viewed by 7304
Abstract
In this paper we demonstrate a family of metrics for estimating the quality of a text summary relative to one or more human-generated summaries. The improved metrics are based on features automatically computed from the summaries to measure content and linguistic quality. The features are combined using one of three methods: robust regression, non-negative least squares, or canonical correlation, an eigenvalue method. The new metrics significantly outperform the previous standard for automatic text summarization evaluation, ROUGE.

683 KiB  
Article
Monitoring Threshold Functions over Distributed Data Streams with Node Dependent Constraints
by Yaakov Malinovsky and Jacob Kogan
Algorithms 2012, 5(3), 379-397; https://doi.org/10.3390/a5030379 - 18 Sep 2012
Viewed by 5923
Abstract
Monitoring data streams in a distributed system has attracted considerable interest in recent years. The task of feature selection (e.g., by monitoring the information gain of various features) requires a very high communication overhead when addressed using straightforward centralized algorithms. While most of the existing algorithms deal with monitoring simple aggregated values such as frequency of occurrence of stream items, motivated by recent contributions based on geometric ideas we present an alternative approach. The proposed approach enables monitoring values of an arbitrary threshold function over distributed data streams through stream dependent constraints applied separately on each stream. We report numerical experiments on a real-world data that detect instances where communication between nodes is required, and compare the approach and the results to those recently reported in the literature.

224 KiB  
Article
Incremental Clustering of News Reports
by Joel Azzopardi and Christopher Staff
Algorithms 2012, 5(3), 364-378; https://doi.org/10.3390/a5030364 - 24 Aug 2012
Cited by 29 | Viewed by 9024
Abstract
When an event occurs in the real world, numerous news reports describing this event start to appear on different news sites within a few minutes of the event occurrence. This may result in a huge amount of information for users, and automated processes may be required to help manage this information. In this paper, we describe a clustering system that can cluster news reports from disparate sources into event-centric clusters, i.e., clusters of news reports describing the same event. A user can identify any RSS feed as a source of news he/she would like to receive and our clustering system can cluster reports received from the separate RSS feeds as they arrive without knowing the number of clusters in advance. Our clustering system was designed to function well in an online incremental environment. In evaluating our system, we found that our system is very good in performing fine-grained clustering, but performs rather poorly when performing coarser-grained clustering.
