21 Heng Mui Keng Terrace, Singapore 119613
Homepage: http://textmining.krdl.org.sg
Text mining, also known as knowledge discovery from text and document information mining, refers to the process of extracting interesting patterns from very large text corpora for the purpose of discovering knowledge. Text mining is an interdisciplinary field involving information retrieval, text understanding, information extraction, clustering, categorization, visualization, database technology, machine learning, and data mining. Regarded by many as the next wave of knowledge discovery, text mining has very high commercial value. This talk presents a general framework for text mining, consisting of two stages: text refining, which transforms unstructured text documents into an intermediate form; and knowledge distillation, which deduces patterns or knowledge from the intermediate form. We then survey the state-of-the-art text mining approaches, products, and applications by aligning them based on the text refining and knowledge distillation functions as well as the intermediate form that they adopt. In conclusion, we highlight the upcoming challenges of text mining and the opportunities it offers.
1. INTRODUCTION
Text mining, also known as text data mining (Hearst, 1997) or knowledge discovery from textual databases (Feldman & Dagan, 1995), is an emerging technology for analyzing large collections of unstructured documents for the purposes of extracting interesting and non-trivial patterns or knowledge. It can be envisaged as a leap from data mining or knowledge discovery from (structured) databases (Fayyad et al., 1996; Simoudis, 1996).
As the most natural form of storing and exchanging information is written words, text mining has a very high commercial potential. In fact, a recent study indicated that 80% of a company's information was contained in text documents, such as emails, memos, customer correspondence, and reports. The ability to distil this untapped source of information provides substantial competitive advantages for a company to succeed in the era of a knowledge-based economy. There are many possible applications of text mining technology. We briefly highlight a few below.
1. Customer profile analysis, e.g., mining incoming emails for customers' complaint and
2. Patent analysis, e.g., analyzing patent databases for major technology players, trends,
3. Information dissemination, e.g., organizing and summarizing trade news and reports
for personalized information services.
4. Company resource planning, e.g., mining a company's reports and correspondence
for activities, status, and problems reported.
Text mining is a challenging task as it involves dealing with text data that are inherently unstructured and fuzzy. The field is interdisciplinary, involving information retrieval, text analysis, information extraction, clustering, categorization, visualization, database technology, machine learning, and data mining. To facilitate discussion, this article presents a general framework for text mining consisting of two components: text refining, which transforms free-form text documents into an intermediate form; and knowledge distillation, which deduces patterns or knowledge from the intermediate form. We then use the proposed framework to study and align the state-of-the-art text mining products and applications based on the text refining and knowledge distillation functions as well as the intermediate form that they adopt.
The rest of this paper is organized as follows. Section 2 presents the proposed text mining framework that bridges the gap between text mining and data mining. Section 3 gives an overview of the current text mining products and applications in the light of the proposed framework. The final section discusses open problems and research directions.
2. A TEXT MINING FRAMEWORK
Text mining can be visualized as consisting of two phases: text refining, which transforms free-form text documents into a chosen intermediate form, and knowledge distillation, which deduces patterns or knowledge from the intermediate form (Tan, 1999).
An intermediate form (IF) can be semi-structured, such as the conceptual graph representation, or structured, such as the relational data representation. An intermediate form can be document-based, wherein each entity represents a document, or concept-based, wherein each entity represents an object or concept of interest in a specific domain. Mining a document-based IF deduces patterns and relationships across documents. Document clustering/visualization and categorization are examples of mining from a document-based IF. Mining a concept-based IF derives patterns and relationships across objects or concepts. Data mining operations, such as predictive modeling and associative discovery, fall into this category. A document-based IF can be transformed into a concept-based IF by realigning or extracting the relevant information according to the objects of interest in a specific domain. It follows that a document-based IF could be domain-independent whereas a concept-based IF is always domain-dependent.
[Figure 1 diagram labels: Text; Text Refining; Document-based intermediate form; Concept-based intermediate form; Knowledge Distillation; Knowledge]
Figure 1: A text mining framework. Text refining converts unstructured text documents into an intermediate form (IF). IF can be document-based or concept-based. Knowledge distillation from a document-based IF deduces patterns or knowledge across documents. A document-based IF can be projected onto a concept-based IF by extracting object information relevant to a domain. Knowledge distillation from a concept-based IF deduces patterns or knowledge across objects or concepts.
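As a minimal sketch of the two-stage framework, the Python fragment below treats a bag of words as the document-based IF and a per-company record as the concept-based IF. The function names `refine` and `project`, the sample articles, and the company names are hypothetical and chosen only for illustration; real text refining would involve far richer linguistic analysis than word counting.

```python
import re
from collections import Counter


def refine(text):
    """Text refining (assumed form): free-form text -> bag of words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def project(doc_ifs, concepts):
    """Project document-based IFs onto a concept-based IF.

    Produces one record per concept, noting how often and in which
    documents the concept is mentioned.
    """
    records = {c: {"mentions": 0, "documents": []} for c in concepts}
    for doc_id, bag in doc_ifs.items():
        for c in concepts:
            if bag[c] > 0:
                records[c]["mentions"] += bag[c]
                records[c]["documents"].append(doc_id)
    return records


# Made-up news snippets standing in for a document collection.
articles = {
    "a1": "Acme posts record profit. Acme shares rise.",
    "a2": "Globex recalls product after complaints.",
    "a3": "Acme and Globex announce merger talks.",
}

# Stage 1: text refining into a document-based IF.
doc_ifs = {doc_id: refine(text) for doc_id, text in articles.items()}

# Projection onto a concept-based ("company") IF.
company_if = project(doc_ifs, ["acme", "globex"])
print(company_if)
```

Here `company_if["acme"]` records three mentions across articles a1 and a3. Knowledge distillation on a concept-based IF would then operate on such records rather than on the raw text.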
For example, given a set of news articles, text refining first converts each document into a document-based IF. One can then perform knowledge distillation on the document-based IF for the purpose of organizing the articles, according to their content, for visualization and navigation purposes. For knowledge discovery in a specific domain, the document-based IF of the news articles can be projected onto a concept-based IF depending on the task requirement. For example, one can extract information related to “company” from the document-based IF and form a company database. Knowledge distillation can then be performed on the company database (company-based IF) to derive company-related knowledge.
3. PRODUCTS AND APPLICATIONS
Table 1 shows an illustrative list of text mining products and applications based on the text refining and knowledge distillation functions as well as the intermediate form adopted. The text mining products/applications can be roughly organized into two groups. One group focuses on document organization, visualization, and navigation. The other group focuses on text analysis functions, notably, information retrieval, information extraction, categorization, and summarization. While we see that most text mining systems provide natural language processing (NLP) functions, few, if any, have integrated data mining functions for knowledge distillation across concepts or objects.
[Table 1 columns: Company/Organization; Product/Application; Text Refining Functions; Intermediate Form; Knowledge Distillation Functions]
Table 1: A list of selected text mining products and applications based on the text refining and knowledge distillation functions as well as the intermediate form adopted.
3.1. Document exploration tools
Document exploration tools organize documents based on their content and provide an environment for a user to navigate and browse in a document or concept space. A popular approach is to perform clustering on the documents based on their similarities in content and present the groups or clusters of documents in a graphical representation. There are a good number of text mining products that fall into this category. The following list is by no means exhaustive but should be sufficient to illustrate the variety of the representation schemes available.
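The clustering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses bag-of-words vectors, cosine similarity, and a greedy single-pass grouping with an arbitrary threshold of 0.3. None of this is claimed to be the method used by any product discussed in this section, and the helper names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a document as term counts (a simple document-based IF)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering.

    Each document joins the first cluster whose seed document is similar
    enough; otherwise it starts a new cluster.
    """
    clusters = []  # list of (seed_vector, [doc_ids])
    for doc_id, text in docs.items():
        vec = bag_of_words(text)
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append((vec, [doc_id]))
    return [members for _, members in clusters]


docs = {
    "d1": "patent filing for battery technology",
    "d2": "new battery technology patent granted",
    "d3": "customer complaint about late delivery",
    "d4": "late delivery triggers customer complaint",
}
groups = cluster(docs)
print(groups)
```

With these sample documents the two patent texts and the two complaint texts end up in separate groups, which a visualization front end could then lay out as a map or tree.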
Cartia's ThemeScape is an enterprise information mapping application that presents clusters of documents in a landscape representation. Canis's cMap is a document clustering and visualization tool based on the Self-Organizing Map. IBM’s Technology Watch, developed jointly with Synthema in Italy, is a text mining application in the scientific domain. It performs document clustering plus visualization in the form of maps for patent databases and technical publications. Inxight also offers a visualization tool, known as VizControls, that performs value-added post-processing of search results by clustering the documents into groups and displaying them in a hyperbolic tree representation. Semio Corp's SemioMap employs a three-dimensional graphical interface that maps the links between concepts in the document collection. Note that SemioMap is concept-based in the sense that it explores the relationships between concepts, whereas most other visualization tools are document-based.
3.2. Document analysis tools
Document analysis tools analyze the content of the documents and discover relationships between concepts or entities described in the documents. They are mainly based on natural language processing techniques, including text analysis, text categorization, information extraction, and summarization.
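To illustrate what information extraction means in its simplest form, the sketch below pulls relational records out of free text with a single hand-written pattern. The pattern, the field names, and the sample sentences are all hypothetical; practical extraction tools rely on much deeper linguistic analysis than one regular expression.

```python
import re

# Hypothetical pattern: "<Buyer> acquired <Target> for $<amount> million"
PATTERN = re.compile(
    r"(?P<buyer>[A-Z][A-Za-z]+) acquired (?P<target>[A-Z][A-Za-z]+) "
    r"for \$(?P<amount>\d+(?:\.\d+)?) million"
)


def extract_acquisitions(text):
    """Return one relational record (a dict) per matching statement."""
    return [
        {
            "buyer": m["buyer"],
            "target": m["target"],
            "amount_musd": float(m["amount"]),
        }
        for m in PATTERN.finditer(text)
    ]


text = (
    "Acme acquired Globex for $120 million. "
    "Analysts were surprised. "
    "Initech acquired Hooli for $45.5 million."
)
records = extract_acquisitions(text)
for record in records:
    print(record)
```

Each record has the same attributes, so the output can be loaded directly into a relational table and mined with conventional data mining operations.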
Knowledge Discovery System's Concept Explorer is a visual search tool that helps to find precisely related content on the web. It "learns" relationships between words and phrases automatically from sample documents and visually guides the user in constructing searches. Inxight's LinguistX is another document retrieval tool with some text analysis and summarization capabilities. IBM’s Intelligent Miner is probably one of the most comprehensive text mining products around. It offers a set of text analysis tools, including a feature extraction tool, a set of clustering tools, a summarization tool, and a categorization tool. Also incorporated are IBM’s text search engine, NetQuestion Solution, and the IBM web crawler package. TextWise, an R&D company based at Syracuse University, offers various text mining products. DR-LINK is an information retrieval system based on automatic concept expansion. CINDOR is its cross-lingual version. CHESS is a text analysis and information extraction tool. Another information extraction tool is Data Junction's Cambio, which extracts data in the form of relational attributes from text. Megaputer's TextAnalyst uses a semantic net representation of documents and performs automated indexing, topic assignment, text abstraction, and semantic search.
4. OPEN PROBLEMS AND FUTURE DIRECTIONS
Despite the great potential and the mushrooming of text mining products, there are technical issues to be overcome before text mining becomes a mainstream technology.
4.1. Intermediate form
Intermediate forms with varying degrees of complexity are suitable for different mining purposes. For a fine-grained domain-specific knowledge discovery task, it is necessary to perform semantic analysis to derive a sufficiently rich representation to capture the relationships between the objects or concepts described in the documents. However, semantic analysis methods are computationally expensive and often operate at a rate on the order of a few words per second. It remains a challenge to see how semantic analysis can be made much more efficient and scalable for very large text corpora.
4.2. Multilingual text refining
Whereas data mining is largely language independent, text mining involves a significant language component. Multilingual text mining is an area in which we expect to see a lot of activity in the next few years, owing to the substantial competitive advantages and the huge commercial potential that one can obtain through mining in languages other than English. Languages that are of particular interest include European languages and Asian languages, in particular Japanese and Chinese. As each language has a different syntactic structure and requires specialized semantic interpretation, a systematic approach for bringing in language modeling is inevitable and will form an essential part of multilingual text mining.
4.3. Domain knowledge integration
Current text mining systems do not make use of domain knowledge. We expect it to be an integral component of future text mining tools. Domain knowledge is useful in orientating and focusing attention so as to improve text parsing efficiency and to help derive a more compact representation. Domain knowledge also plays a major role in knowledge distillation tasks. In a classification or predictive modeling task, for example, domain knowledge helps to improve learning/mining efficiency as well as the quality of the learned model (or mined knowledge) (Tan, 1997). It is also interesting to explore how a user’s knowledge can be used to initialize a system’s knowledge structure and make the discovered knowledge more interpretable.
4.4. Personalized autonomous mining
Another important dimension of research is to make text mining tools more user friendly. Current text mining products/applications are designed for trained knowledge specialists. Future text mining tools, as part of knowledge management systems, should be readily usable by technical users as well as management executives. There have been some efforts in developing systems that interpret natural language queries and perform appropriate mining operations automatically. Text mining tools could also be embedded in intelligent personal assistants (Tan & Teo, 1998). Under the agent paradigm, a personal miner would learn a user’s profile, conduct text mining operations automatically, and forward information without requiring an explicit request from the user.
REFERENCES
Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P. (1996), “From data mining to knowledge discovery: An overview”, in U. Fayyad et al. (eds.) Advances in Knowledge Discovery and Data Mining, MIT Press, Cambridge, Mass., 1-36.
Feldman, R. & Dagan, I. (1995), “Knowledge discovery in textual databases (KDT)”, in Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), Montreal, Canada, August 20-21, AAAI Press, 112-117.
Hearst, M.A. (1997), “Text data mining: Issues, techniques, and the relationship to information access”, Presentation notes for UW/MS workshop on data mining, July 1997.
Simoudis, E. (1996), “Reality check for data mining”, IEEE Expert, 11(5).
Tan, A.-H. (1997), “Cascade ARTMAP: Integrating neural computation and symbolic knowledge processing”, IEEE Transactions on Neural Networks, 8(2), 237-250.
Tan, A.-H. & Teo, C. (1998), “Learning user profiles for personalized information dissemination”, in Proceedings, International Joint Conference on Neural Networks (IJCNN'98), Alaska, 183-188.
Tan, A.-H. (1999), “Text Mining: The state of the art and the challenges”, in Proceedings, PAKDD’99 workshop on Knowledge Discovery from Advanced Databases, Beijing, April, 1999.
© Ah-Hwee Tan 1999. The authors hereby grant a non-exclusive licence to SEARCC to publish this document in full on the World Wide Web and on CD-ROM and in printed form as part of the SEARCC’99 conference proceedings. The authors also grant permission to educational and non-profit institutions to use this document in courses of instruction provided that the article is used in full and this copyright statement is reproduced. Any other usage is prohibited without the express permission of the authors.