Textual data analysis using a nonhierarchical neural network approach

Introduction

Artificial neural networks excel at recognizing patterns in textual data. This pattern recognition capability allows a neural network engine to assign weights representing the multiple connections among concepts. These weights can then be used to create dendrograms or otherwise categorize concepts in a hierarchical manner. However, this approach has its limitations.

Words often have different meanings depending on the context in which they occur. Consider the word “mustang,” which can describe a car, a horse, or an airplane. A hierarchical clustering method cannot fully describe such multiple relationships and meanings because it can show concepts connected in only one way. Each concept is assigned to only one “best” cluster in the output, suggesting that the concept has only one meaning in the data analyzed. A nonhierarchical approach addresses this limitation because it allows the researcher to interact with the neural network to explore all possible meanings of a concept. Thus, in the resulting output a concept may appear in as many clusters as are appropriate.
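The contrast between a single “best” cluster and multiple memberships can be illustrated with a toy sketch. The association strengths, the threshold, and the function names below are invented for illustration; this is not how CATPAC or ORESME compute clusters, only a minimal picture of the difference between keeping one assignment and keeping several.

```python
# Hypothetical association strengths between "mustang" and three contexts.
# These numbers are invented for illustration; they are not from the study.
association = {"car": 0.62, "horse": 0.58, "airplane": 0.41}


def best_cluster(strengths):
    """Single-assignment view: keep only the strongest context."""
    return max(strengths, key=strengths.get)


def all_clusters(strengths, threshold):
    """Overlapping view: keep every context at or above the threshold."""
    return sorted(name for name, s in strengths.items() if s >= threshold)


hard = best_cluster(association)
soft = all_clusters(association, threshold=0.4)
print(hard)
print(soft)
```

Under the single-assignment view the horse and airplane senses disappear from the output even though their strengths are close to the winner's; the overlapping view retains them.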

Method

Opinions about the terrorist attacks of September 11, 2001 were of particular interest during the five-year anniversary of the event in September 2006. To gauge that opinion, all editorials, opinion pieces, and letters to the editor published during September 2006 in U.S. newspapers indexed in the FACTIVA™ database were retrieved.

The resulting 3.2 MB text file was analyzed using the CATPAC™ text analysis program. CATPAC produced a scalar products matrix, which was used to generate an artificial neural network (ANN). The output consisted of a weighted input network (WIN) file and the hierarchical clusters, represented in both dendrogram and three-dimensional coordinate files. The ORESME™ software was then used for nonhierarchical analysis of the CATPAC results.

An input (activation) value was assigned to one or more terms and the resulting clusters of “activated concepts” in the ANN were compared. The nodes of an ANN are connected to one another by weights which represent their relative “closeness” in the network. They communicate with each other by a simple linear threshold rule:

The signal sent from any node i to any node j is equal to the product of the activation value of i and the strength of the connection between i and j. Thus the total signal received by any node j is the sum of the signals received from all the other nodes, or

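$$\mathrm{net}_j = \sum_{i \neq j} w_{ij}\, a_i$$

where $\mathrm{net}_j$ is the total signal received by node j, $a_i$ is the activation value of node i, and $w_{ij}$ is the strength of the connection between nodes i and j.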
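The rule described above, in which each node's total signal is the sum over the other nodes of activation value times connection strength, can be sketched in a few lines. The three-node network, its weights, and the function name `net_input` are assumptions made for illustration and are not taken from the study or from the ORESME software.

```python
def net_input(weights, activations, j):
    """Total signal received by node j from all other nodes.

    weights[i][j] is the connection strength between nodes i and j;
    activations[i] is the activation value of node i.
    """
    return sum(
        weights[i][j] * activations[i]
        for i in range(len(activations))
        if i != j
    )


# Hypothetical 3-node network with symmetric weights (illustration only).
weights = [
    [0.0, 0.5, -0.2],
    [0.5, 0.0, 0.8],
    [-0.2, 0.8, 0.0],
]
activations = [1.0, 0.0, 0.5]

signals = [net_input(weights, activations, j) for j in range(3)]
print(signals)
```

In this toy network, node 1 receives the largest signal because it is positively connected to both active nodes, while nodes 0 and 2 each receive a small negative signal through their mutual negative weight.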

Unlike traditional feed-forward, back-propagation neural networks, ORESME is an interactive activation and competition network in which any neuron can serve as an input, hidden, or output neuron.

Results

Concepts generated with regard to a single, commonly experienced event produced some predictable results. In the hierarchical cluster analysis, many concepts were grouped as anticipated: SEPTELEVENTH clustered with the concepts ATTACK, BUSH, UNITEDSTATES, IRAQ and US. Additionally, there was a TORTURE cluster, a NEWS cluster and, surprisingly, a CLINTON cluster. However, these clusters represent only part of people's thinking about 9/11 and its fifth anniversary.

The disadvantage of hierarchical cluster analysis is that we see only part of the picture. Concepts are placed in one “overall best fit” cluster when, in reality, they can belong to one or many clusters depending on such variables as context, time, and place. A nonhierarchical approach, on the other hand, allows us to see some of the relationships that may not have been statistical “best fits” but are nonetheless important in finding meaning in the text. Consider the concept SEPTELEVENTH, which has a clearly defined cluster comprising BUSH, FIRST, AGAINST, US, ATTACK and IRAQ. When it is paired with another term, NEWS, the concepts with which it was originally clustered become less important, and terms to which it was seemingly unrelated move much closer. A different cluster emerges: SEPTELEVENTH, NEWS, IRAQ, ATTACK, NEW, YEAR. SEPTELEVENTH thus has multidimensional meaning.
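The effect of pairing a concept with a second term can be illustrated with a small sketch. It gives each chosen cue an activation of 1.0 and ranks the remaining concepts by the total signal they receive in a single linear step. The weights below are invented, the function names are hypothetical, and the single-step ranking is a simplification rather than ORESME's actual procedure; only the concept labels are borrowed from the text.

```python
# Invented symmetric connection weights between concept pairs.
# Illustration only: these values are NOT results from the study.
PAIR_WEIGHTS = {
    ("SEPTELEVENTH", "BUSH"): 0.8,
    ("SEPTELEVENTH", "ATTACK"): 0.7,
    ("SEPTELEVENTH", "YEAR"): 0.3,
    ("SEPTELEVENTH", "NEW"): 0.2,
    ("SEPTELEVENTH", "NEWS"): 0.1,
    ("NEWS", "YEAR"): 0.7,
    ("NEWS", "NEW"): 0.6,
    ("NEWS", "ATTACK"): 0.2,
    ("NEWS", "BUSH"): -0.3,
}

CONCEPTS = sorted({name for pair in PAIR_WEIGHTS for name in pair})


def weight(a, b):
    """Symmetric lookup; pairs that are not listed count as 0."""
    return PAIR_WEIGHTS.get((a, b), PAIR_WEIGHTS.get((b, a), 0.0))


def top_associates(cues, top_n=2):
    """Activate each cue at 1.0 and rank the other concepts by received signal."""
    signals = {
        concept: sum(weight(cue, concept) for cue in cues)
        for concept in CONCEPTS
        if concept not in cues
    }
    return sorted(signals, key=signals.get, reverse=True)[:top_n]


alone = top_associates(["SEPTELEVENTH"])
paired = top_associates(["SEPTELEVENTH", "NEWS"])
print(alone)
print(paired)
```

With these made-up weights, activating SEPTELEVENTH alone brings BUSH and ATTACK to the top, while activating it together with NEWS pushes BUSH down and raises YEAR, which mirrors the kind of shift described above.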

This study provides evidence that software such as ORESME™ can be used to analyze text in a more meaningful way. The software has quirks, and its output could be more user-friendly, but these are minor concerns given the overall results of this research, which support the need for a nonhierarchical approach. Many confounding variables affect the study of human communication. That these variables cannot be controlled only reinforces the importance of using nonhierarchical analysis to discover meaning.
