Text Miner Data Source Dialog

The following options appear across the five Text Miner (tabbed) dialogs: Data Source, Models, Pre-Processing, Representation, and Output Options.

Data

Analytic Solver Data Mining:  Text Miner Data tab

Variables

Variables in this list are drawn from the data set on the worksheet, which must include at least one column containing free-form text (or file paths to documents containing free-form text); other columns may contain traditional structured data.

First Row Contains Headers

This option is selected by default. Select this option if the first row of your data set contains column headers.

Selected Text Variables

Variables contained in this list have been selected from the Variables list as inputs to Text Miner.

Text variables contain file paths

Select this option if the text variables within your data set contain pointers or paths to a text document or collection of text documents. If your data set contains pointers, this option is selected by default.

Selected Non-Text variables

Variables contained in this list have been selected from the Variables list as non-text inputs to Text Miner (i.e., numeric variables).

Pre-Processing

Analyze all terms

When this option is selected, Analytic Solver examines all terms in the document. A term is defined as an individual entity in the text, which may or may not be an English word; a term can be a word, number, email address, or URL. This option is selected by default.
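As a rough illustration of what counts as a term, here is a minimal Python sketch (not Analytic Solver's actual tokenizer) that splits raw text into words, numbers, email addresses, and URLs using regular expressions:

```python
import re

# Illustrative only -- not Analytic Solver's tokenizer. Order matters:
# URLs and email addresses are matched before bare numbers and words.
TERM = re.compile(
    r"https?://\S+"               # URLs
    r"|[\w.+-]+@[\w-]+\.[\w.-]+"  # email addresses
    r"|\d+(?:\.\d+)?"             # numbers
    r"|[A-Za-z]+"                 # words
)

def terms(text):
    return TERM.findall(text)

print(terms("Email sales@example.com or visit https://example.com by May 5"))
# ['Email', 'sales@example.com', 'or', 'visit', 'https://example.com',
#  'by', 'May', '5']
```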

Analyze specified terms only

When this option is selected, Analytic Solver examines only the terms you specify; all other terms are disregarded. Selecting this option enables the Edit Terms button, which opens the Edit Exclusive Terms dialog. Use this dialog to add and remove the terms to be considered for text mining. For example, to mine each document for a specific part name such as alternator, click Add Term on the Edit Exclusive Terms dialog, replace New term with alternator, and click Done to return to the Pre-Processing dialog. During the text mining process, Analytic Solver analyzes each document for the term alternator, excluding all other terms.

Start term/phrase

If this option is used, text appearing before the first occurrence of the Start Phrase is disregarded; similarly, text appearing after the End Phrase (if used) is disregarded. For example, when text mining the transcripts from a Live Chat service, you would not be particularly interested in any text appearing before the heading Chat Transcript or after the heading End of Chat Transcript. Thus, you would enter Chat Transcript into the Start Phrase field and End of Chat Transcript into the End Phrase field.

End term/phrase

If this option is used, text appearing after the first occurrence of the End Phrase is disregarded, just as text before the Start Phrase (if used) is disregarded. For example, when text mining the transcripts from a Live Chat service, text appearing before the heading Chat Transcript or after the heading End of Chat Transcript would be disregarded if those phrases were entered into the Start Phrase and End Phrase fields.

Stopword removal

If selected (default), over 300 commonly used words/terms (e.g., a, to, the, and) are removed from the document collection during preprocessing. To view the list of terms, click Edit. To remove a word from the Stopwords List, highlight the word, then click Remove Stopword. To add a new word to the list, click Add Stopword; double-click the word to edit it.

Analytic Solver also allows additional stopwords to be added, or existing stopwords removed, via a text document (*.txt); use Browse to navigate to the file. Terms in the text document can be separated by a space, a comma, or both. For example, to add the three terms subject, emailterm, and from via a text document rather than the Edit Stopwords dialog, list them as: subject emailterm from or subject,emailterm,from or subject, emailterm, from. With a large list of additional stopwords, a text document is the preferred way to enter the terms.

Click Done to close the Edit Stopwords dialog, and return to Pre-Processing.
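Because the file format accepts spaces, commas, or both as separators, loading such a list is a simple split operation. A minimal Python sketch, assuming a hypothetical file named stopwords.txt:

```python
import re

# Sketch: load additional stopwords from a *.txt file whose terms are
# separated by spaces, commas, or both (e.g. "subject, emailterm, from").
def load_stopwords(path):
    with open(path, encoding="utf-8") as f:
        return {t.lower() for t in re.split(r"[,\s]+", f.read()) if t}

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t.lower() not in stopwords]

extra = load_stopwords("stopwords.txt")  # hypothetical file name
print(remove_stopwords(["subject", "alternator", "from"], extra))
# ['alternator'], if the file contains "subject, emailterm, from"
```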

Exclusion List

If selected, terms entered into the Exclusion List are removed from the document collection. This is beneficial if a large number of documents in the collection contain the same terms, for example, from, to, and subject in a collection of emails. If all documents contain the same terms, including those terms in the analysis could bias the results. To enter terms to be excluded, click Edit to open the Edit Exclusion List dialog. Click Add Exclusion Term to add and edit a new term. Analytic Solver removes the terms listed in this dialog from the document collection during pre-processing. To remove a term from the exclusion list, highlight the term and click Remove Exclusion Term.

To add terms all at once from a text document (*.txt), click Browse to navigate to the file and import the list. Terms in the text document can be separated by a space, a comma, or both; for example: subject emailtoken from or subject,emailtoken,from or subject, emailtoken, from. With a large list of terms to be excluded, a text document is the preferred way to enter the terms.

Edit Exclusion List Dialog 

Click Done to close the dialog and return to Pre-Processing.

Vocabulary Reduction - Advanced

Analytic Solver also allows the combining of synonyms and full phrases.

Synonym Reduction

From the Pre-Processing tab, under Vocabulary Reduction, select Advanced to display the Vocabulary Reduction - Advanced dialog. Use synonym reduction to replace synonyms such as car, automobile, convertible, vehicle, sedan, coupe, subcompact, and jeep with a single term such as auto. Click Add Synonym, replace rootterm with the replacement term (auto), then replace synonym list with the list of synonyms (car, automobile, convertible, vehicle, sedan, coupe, subcompact, jeep). During preprocessing, Analytic Solver replaces each of these terms with the term auto. To remove a synonym from the list, highlight the term and click Remove Synonym.

Vocabulary Reduction - Advanced Dialog 

If adding synonyms from a text file, each line must be of the form rootterm:synonymlist; using our example, auto:car automobile convertible vehicle sedan coupe or auto:car,automobile,convertible,vehicle,sedan,coupe. Terms in the synonym list must be separated by a space, a comma, or both. For a large list of synonyms, a text file is the preferred way to enter the terms.
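A short Python sketch of parsing this rootterm:synonymlist format and applying the substitution to a token list (illustrative only, not Analytic Solver's parser):

```python
import re

# Sketch: parse lines of the form "rootterm:synonymlist", e.g.
# "auto:car,automobile,convertible,vehicle,sedan,coupe".
def load_synonyms(path):
    mapping = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if ":" not in line:
                continue
            root, _, syns = line.partition(":")
            for syn in re.split(r"[,\s]+", syns.strip()):
                if syn:
                    mapping[syn] = root.strip()
    return mapping

def apply_synonyms(tokens, mapping):
    return [mapping.get(t, t) for t in tokens]

print(apply_synonyms(["my", "sedan"], {"sedan": "auto", "coupe": "auto"}))
# ['my', 'auto']
```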

Phrase Reduction

Analytic Solver allows the combining of words into phrases that carry a single meaning, such as station wagon, which refers to a specific type of car rather than two distinct tokens, station and wagon. To add a phrase in the Vocabulary Reduction - Advanced dialog, select Phrase reduction and click Add Phrase. The term phrasetoken appears. Click the term to edit it, and enter the token that replaces the phrase (e.g., wagon). Click the phrase to edit it, and enter the phrase to be substituted (e.g., station wagon). If supplying phrases through a text file (*.txt), each line of the file must be of the form phrasetoken:phrase or, using our example, wagon:station wagon. For a large list of phrases, a text file is the preferred way to enter the terms (see the sketch below).

Click Done to return to Pre-Processing.
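Phrase reduction must run on the raw text, before tokenization splits a phrase such as station wagon into two tokens. A minimal Python sketch, assuming the phrasetoken:phrase file format described above:

```python
# Sketch: replace each phrase with its single token before tokenization.
# Longer phrases are applied first so overlapping phrases resolve sanely.
def load_phrases(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if ":" in line:
                token, _, phrase = line.strip().partition(":")
                pairs.append((phrase, token))
    return sorted(pairs, key=lambda p: -len(p[0]))

def apply_phrases(text, pairs):
    for phrase, token in pairs:
        text = text.replace(phrase, token)
    return text

print(apply_phrases("a station wagon and a sedan", [("station wagon", "wagon")]))
# 'a wagon and a sedan'
```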

Maximum vocabulary size

Analytic Solver reduces the final vocabulary to the terms occurring most frequently in the collection, keeping at most the number of terms specified here. The default is 1000.
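The effect is equivalent to keeping only the N most frequent terms across the collection, as in this short Python sketch:

```python
from collections import Counter

# Sketch: keep the n most frequent terms over all documents
# (documents is a list of token lists).
def top_vocabulary(documents, n=1000):
    counts = Counter(term for doc in documents for term in doc)
    return {term for term, _ in counts.most_common(n)}

docs = [["brake", "brake", "pads"], ["brake", "alternator"]]
print(top_vocabulary(docs, n=2))  # {'brake', 'pads'}; 'brake' is most frequent
```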

Perform stemming

Stemming is the practice of stripping words down to their stems or roots. For example, stemming the terms argue, argued, argues, arguing, and argus results in the stem argu, while argument and arguments stem to argument. The stemming algorithm used in Analytic Solver is smart in the sense that while the term running is stemmed to run, runner is not. Analytic Solver uses the Porter Stemmer 2 algorithm for the English language.
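For reference, NLTK's English Snowball stemmer implements the Porter 2 algorithm, so the examples above can be reproduced outside Analytic Solver:

```python
from nltk.stem.snowball import SnowballStemmer  # pip install nltk

stemmer = SnowballStemmer("english")  # Porter 2 for English

for word in ["argue", "argued", "argues", "arguing", "argus",
             "argument", "arguments", "running", "runner"]:
    print(word, "->", stemmer.stem(word))
# argue, argued, argues, arguing, argus -> argu
# argument, arguments -> argument
# running -> run, but runner -> runner
```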

Normalize case

When this option is selected, Analytic Solver converts all text to lowercase.

Term Normalization Advanced

From the Pre-Processing dialog, under Term Normalization, click Advanced to open the Term Normalization - Advanced dialog. Use this dialog to replace or remove noisy, computer-generated content such as HTML tags, URLs, or email addresses in the document collection. To remove normalized terms completely, include the normalized term (e.g., emailtoken) in the Exclusion List in the Edit Exclusion List dialog.

Term Normalization - Advanced Dialog 

Minimum stemmed term length

If stemming reduces a term's length to two or fewer characters, Text Miner disregards the term. This option is selected by default.

Remove HTML tags

If selected, HTML tags are removed from the document collection. HTML tags and text contained inside these tags contain technical, computer-generated information that is not typically relevant to the goal of the Text Mining application. This option is not selected by default.

Normalize URLs

If selected, URLs appearing in the document collection are replaced with the term urltoken. URLs do not normally add any meaning, but it is sometimes interesting to know how many URLs are included in a document. This option is not selected by default.

Normalize email addresses

If selected, email addresses appearing in the document collection are replaced with the term emailtoken. This option is not selected by default.

Normalize numbers

If selected, numbers appearing in the document collection are replaced with the term numbertoken. This option is not selected by default.

Normalize monetary amounts

If selected, monetary amounts are substituted with the term moneytoken. This option is not selected by default.

When finished with selections, click Done to return to the Pre-Processing dialog.
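Conceptually, these normalizations amount to regular-expression substitutions followed by case folding. A rough Python sketch; the token names match the dialog, but the patterns here are simplified approximations, not Analytic Solver's own:

```python
import re

def normalize(text):
    text = re.sub(r"<[^>]+>", " ", text)                            # Remove HTML tags
    text = re.sub(r"https?://\S+", "urltoken", text)                # Normalize URLs
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "emailtoken", text)  # Normalize emails
    text = re.sub(r"\$\d+(?:,\d{3})*(?:\.\d+)?", "moneytoken", text)  # Monetary amounts
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "numbertoken", text)        # Normalize numbers
    return text.lower()                                             # Normalize case

print(normalize("<b>Sale!</b> Visit https://example.com, parts from $35.99"))
# ' sale!  visit urltoken, parts from moneytoken'
```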

Remove terms occurring in less than __% of documents

If selected, Text Miner removes terms that appear in less than the percentage of documents specified. For most text mining applications, rarely occurring terms do not typically offer any added information or meaning to the document in relation to the collection. The default percentage is 2%.

Remove terms occurring in more than __% of documents

If selected, Text Miner removes terms that appear in more than the percentage of documents specified. For many text mining applications, the goal is identifying terms that have discriminative power or terms that will differentiate between a number of documents. The default percentage is 98%.
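Both options are document-frequency filters: together they keep only the terms whose document frequency falls between the two percentages. A minimal Python sketch using the default thresholds:

```python
# Sketch: keep terms appearing in at least low*n and at most high*n of the
# n documents (defaults mirror the dialog: 2% and 98%).
def filter_by_document_frequency(documents, low=0.02, high=0.98):
    n = len(documents)
    df = {}
    for doc in documents:
        for term in set(doc):          # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {t for t, k in df.items() if low * n <= k <= high * n}
```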

Maximum term length

If selected, Text Miner removes terms that exceed the specified number of characters. This option can be extremely useful for removing parts of text that are not actual English words (e.g., URLs or computer-generated tokens), or for excluding very rare terms such as Latin species or disease names (e.g., Pneumonoultramicroscopicsilicovolcanoconiosis).

When finished with selections, click Finish, then click the Representation tab to open the Representation dialog.

Representation

Term-Document Matrix Scheme

A term-document matrix is a matrix that displays frequency-based information about the terms occurring in a document or collection of documents. Each column is assigned a term and each row a document. If a term appears in a document, a weight is placed in the corresponding cell indicating the term's importance or contribution. Analytic Solver offers four commonly used weighting schemes to represent each value in the matrix: Presence/Absence, Term Frequency, TF-IDF (the default), and Scaled term frequency.

If Presence/Absence is selected, Analytic Solver enters a 1 in the corresponding row/column if the term appears in the document, or a 0 if it does not. This matrix scheme does not take into account the number of times the term occurs in each document.

If Term Frequency is selected, Analytic Solver counts the number of times the term appears in the document and enters this value into the corresponding row/column in the matrix.

The default setting, Term Frequency-Inverse Document Frequency (TF-IDF), is the product of scaled term frequency and inverse document frequency. Inverse document frequency is calculated by taking the logarithm of the total number of documents divided by the number of documents that contain the term. A high TF-IDF value indicates that a term appearing infrequently in the collection of documents taken as a whole occurs quite frequently in the specified document. A TF-IDF value close to 0 indicates that the term appears frequently throughout the collection or rarely in a specific document.

If Scaled term frequency is selected, Analytic Solver normalizes (brings to the same scale) the number of occurrences of a term in the documents (see the table below).
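A compact Python sketch of the four schemes. Two assumptions are made here: scaled term frequency is shown as the common 1 + log(count) scaling, and IDF as log(N/df); the Advanced dialog's table below defines the exact variants Analytic Solver offers.

```python
import math
from collections import Counter

# Sketch of the four weighting schemes; documents is a list of token lists.
# Assumptions: scaled tf = 1 + log(count); idf = log(n_docs / df).
def weight(documents, scheme="tfidf"):
    n = len(documents)
    df = Counter(t for doc in documents for t in set(doc))
    rows = []
    for doc in documents:
        row = {}
        for term, count in Counter(doc).items():
            if scheme == "presence":
                row[term] = 1
            elif scheme == "tf":
                row[term] = count
            elif scheme == "scaled":
                row[term] = 1 + math.log(count)
            else:  # tfidf = scaled term frequency * inverse document frequency
                row[term] = (1 + math.log(count)) * math.log(n / df[term])
        rows.append(row)
    return rows

docs = [["brake", "brake", "pads"], ["alternator", "battery"]]
print(round(weight(docs)[0]["brake"], 4))  # (1 + ln 2) * ln(2/1) = 1.1736
```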

To create a scheme, click the Advanced button to open the Term Document Matrix - Advanced dialog. Use this dialog to select choices for local weighting, global weighting, and normalization. See the table below for definitions regarding options for Term Frequency, Document Frequency, and Normalization.

Term Document Matrix - Advanced Options

Perform latent semantic indexing

When this option is selected, Analytic Solver uses Latent Semantic Indexing (LSI) to detect patterns in the associations between terms and concepts to discover the meaning of the document.

The statistics produced and displayed in the Term-Document Matrix contain basic information on the frequency of terms appearing in the document collection. With this information, we can rank the significance or importance of these terms relative to the collection and to a particular document. LSI uses singular value decomposition (SVD) to map the terms and documents into a common space in order to find patterns and relationships.

For example, when inspecting the document collection, we might find that each time the term alternator appeared in an automobile document, the document also included the terms battery and headlights; and each time the term brake appeared in an automobile document, the terms pads and squeaky also appeared. However, there is no detectable pattern regarding the use of the terms alternator and brake together: documents including alternator might not include brake, and documents including brake might not include alternator. The four terms battery, headlights, pads, and squeaky describe two different automobile repair issues: a bad alternator and failing brakes.

LSI will attempt to: 1) distinguish between these two topics; 2) identify the documents that deal with faulty brakes, alternator problems, or both; and 3) map the terms into a common semantic space using singular value decomposition. SVD is a tool used by Text Miner to extract the concepts that explain the main dimensions of meaning of the documents in the collection. The results of LSA are usually hard to examine directly, because the decomposition cannot fully explain how the concept representation was constructed.
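In matrix terms, this mapping is a truncated singular value decomposition of the weighted term-document matrix. A minimal NumPy sketch (illustrative; Analytic Solver's internal implementation is not exposed):

```python
import numpy as np

# Sketch: decompose a (documents x terms) weight matrix X = U S Vt and
# keep the top k singular triplets as "concepts".
def lsi(X, k=2):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    concept_document = U[:, :k] * s[:k]  # document coordinates per concept
    term_concept = Vt[:k].T              # term loadings per concept
    return concept_document, term_concept, s

# Toy weights: row 0 is an alternator/battery document, row 1 brake/pads.
X = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 2.0, 1.0]])
docs_2d, terms_2d, singular_values = lsi(X, k=2)
```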

Concept Extraction - Latent Semantic Indexing

In the Representation dialog, under Concept Extraction - Latent Semantic Indexing, select Automatic, Maximum number of concepts, or Minimum percentage explained. The following describes these selections.

If Automatic is selected, Analytic Solver calculates the importance of each concept, takes the difference between successive importance values, and keeps the concepts ranked above the largest difference. For example, if three concepts (Concept1, Concept2, and Concept3) are identified with importance factors of 10, 8, and 2, respectively, Analytic Solver keeps Concept1 and Concept2, since the difference between Concept2 and Concept3 (8 - 2 = 6) is larger than the difference between Concept1 and Concept2 (10 - 8 = 2).

If Maximum number of concepts is selected, Analytic Solver retains the top concepts according to the value entered. The default is 2 concepts.

If Minimum percentage explained is selected, Analytic Solver identifies the concepts whose singular values, taken together, account for the minimum percentage explained (90% is the default).
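Both selection rules can be stated concisely over the list of concept importances (for example, singular values in descending order). This Python sketch reproduces the example above:

```python
# Sketch of the two selection rules described above.
def automatic_cutoff(importances):
    # Keep every concept ranked before the largest drop in importance.
    gaps = [a - b for a, b in zip(importances, importances[1:])]
    return importances[:gaps.index(max(gaps)) + 1]

def min_percentage_explained(importances, pct=0.90):
    # Keep the smallest prefix whose share of the total reaches pct.
    total, running, keep = sum(importances), 0.0, []
    for s in importances:
        running += s
        keep.append(s)
        if running / total >= pct:
            break
    return keep

print(automatic_cutoff([10, 8, 2]))          # [10, 8]; gap 8-2=6 is largest
print(min_percentage_explained([10, 8, 2]))  # [10, 8]; (10+8)/20 = 90%
```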

When finished making selections, click Finish, then select the Output Options tab. The following options are available in the Output Options dialog.

Output Options

Term-Document Matrix

Under Matrices, the Term-Document matrix displays the most frequently occurring terms across the top of the matrix and the document IDs down the left. If a term appears in a document, a weight is placed in the corresponding column indicating the term's importance, computed using the weighting scheme selected on the Representation dialog (TF-IDF by default). The number of terms contained in the matrix is controlled by the Maximum vocabulary size option on the Pre-Processing tab, and the number of documents equals the number of documents in the sample. Analytic Solver offers four commonly used weighting schemes on the Representation tab: Presence/Absence, Term Frequency, TF-IDF (the default), and Scaled term frequency.

Concept-Document Matrix

The Concept-Document Matrix is enabled when Perform latent semantic indexing is selected on the Representation tab. The most important concepts are listed across the top of the matrix and the documents are listed down the left side of the matrix. The number of concepts is controlled by the setting for Concept Extraction-Latent Semantic indexing on the Representation tab: Automatic, Maximum number of concepts, or Minimum percentage explained. If a concept appears in a document, the singular value decomposition weight is placed in the corresponding column indicating the importance of the concept in the document. If Perform latent semantic indexing is selected, this option will also be selected by default.

Term-Concept Matrix

The Term-Concept matrix is enabled when Perform latent semantic indexing is selected on the Representation tab. The most important concepts are listed across the top of the matrix, and the most frequently occurring terms are listed down the left. The number of concepts is controlled by the Concept Extraction - Latent Semantic Indexing setting on the Representation tab: Automatic, Maximum number of concepts, or Minimum percentage explained. The number of terms in the matrix is controlled by the Maximum vocabulary size on the Pre-Processing tab. If a term appears in a concept, the singular value decomposition weight is placed in the corresponding column, indicating the importance of the term in the concept.

Term frequency table

The Term frequency table displays the most frequently occurring terms in the document collection according to the value entered under Vocabulary for Most frequent terms. The first column of the table, Collection Frequency, displays the number of times the term appears in the collection. The second column of the table, Document Frequency, displays the number of documents that include the term. The third column in the table, Top Documents, displays the top documents where the corresponding term appears the most frequently according to the Maximum corresponding documents setting (see below). This option is selected by default.

Most frequent terms

This option is enabled only when Term frequency table is selected. This option controls the number of terms displayed in the Term frequency table and Zipf's plot. This option is selected by default with a value of ten terms.

Full vocabulary

This option is enabled only when Term frequency table is selected. If selected, the full vocabulary list is displayed in the term frequency table.

Maximum corresponding documents

This option is enabled only when Most frequent terms is selected. This option controls the number of documents displayed in the third column of the Term frequency table.

Zipf's plot

The Zipf Plot graphs document frequency against term rank (terms ranked in order of importance). Term frequencies typically follow Zipf's law, which states that a term's frequency in free-form text is roughly inversely proportional to its rank. In other words (pun intended), when we speak, we tend to use a few words a lot, but most words very rarely. Hover over each point in the plot to see the most frequently occurring terms in the document collection.
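As a quick numeric illustration (the constant and counts here are made up): if the most frequent term occurs C times, Zipf's law predicts that the rank-r term occurs roughly C/r times:

```python
# Zipf's law sketch: expected frequency of the rank-r term ~ C / r,
# so frequency vs. rank is close to a straight line on a log-log plot.
C = 1200  # hypothetical frequency of the top-ranked term
for rank in [1, 2, 3, 10, 100]:
    print(f"rank {rank:>3}: expected frequency ~ {C / rank:.0f}")
# rank 1: 1200, rank 2: 600, rank 3: 400, rank 10: 120, rank 100: 12
```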

Show documents summary

If selected, Analytic Solver produces a Documents table displaying the document ID, length of the document, and number of terms included in each document.

Keep a short excerpt. Number of characters

If selected, Analytic Solver produces a fourth column in the Documents table displaying the first N number of characters in the document. This option is not selected by default, but if selected, the default number of characters is 20.

Scree Plot

This plot gives a graphical representation of the contribution or importance of each concept according to the setting for Maximum number of concepts. Find the largest drop or elbow in the plot to discover the leading topics in the document collection. When moving from left to right on the x-axis, the importance of each concept will diminish. This information may be used to limit the number of concepts (as variables) used as inputs into a classification model. This option is not selected by default.

Maximum number of concepts

If Scree Plot is enabled, Maximum number of concepts is enabled. Enter the number of concepts to be displayed in the Scree Plot.

Terms scatter plot

This graph is a visual representation of the Term-Concept matrix. It displays all terms from the final vocabulary in terms of two concepts. Like the Document scatter plot, the Terms scatter plot visualizes the distribution of vocabulary terms in the semantic space of meaning extracted with LSA. The coordinates are normalized, so the range of each axis is always [-1, 1], where extreme values (close to +/-1) highlight the importance, or load, of a term for a particular concept. Terms appearing in a zero-neighborhood of a concept's range do not contribute much to that concept's definition. In our example, we might identify a concept whose terms divide into two groups: one related to Autos and the other to Electronics. These groups would be distant from each other on the axis corresponding to this concept, which would provide evidence that this particular concept caught a pattern in the text collection capable of discriminating the topic of an article. The Terms scatter plot is therefore an extremely valuable tool for examining the main topics in the collection of documents, finding similar words that indicate a similar concept, or finding terms that explain a concept from opposite sides (e.g., term1 might relate to cheap, affordable electronics and term2 to expensive luxury electronics).

Document scatter plot

This graph is a visual representation of the Concept-Document matrix. Note that Analytic Solver normalizes each document representation so that it lies on a unit hypersphere. Documents that appear in the middle of the plot, with concept coordinates near 0, are not explained well by either displayed concept. The further a coordinate's magnitude is from zero, the more effect the corresponding concept has on that document. Two documents placed at the extremes of a concept (one close to -1, the other close to +1) indicate strong differentiation between the documents in terms of the extracted concept. This provides a means for understanding the actual meaning of the concept and for investigating which concepts have the largest discriminative power when used to represent the documents in the text collection.
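The normalization the plot relies on can be sketched directly: each document's concept coordinates are divided by their Euclidean length, so every document lies on the unit hypersphere and each coordinate falls within [-1, 1]:

```python
import numpy as np

# Sketch: scale each row (one document's concept coordinates) to unit length.
def to_unit_sphere(concept_document):
    norms = np.linalg.norm(concept_document, axis=1, keepdims=True)
    return concept_document / np.where(norms == 0.0, 1.0, norms)

coords = np.array([[3.0, 4.0], [0.5, -0.5]])
print(to_unit_sphere(coords))
# [[ 0.6        0.8       ]
#  [ 0.7071... -0.7071...]]
```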

Concept importance

This table displays the total number of concepts extracted, the Singular Value for each, the Cumulative Singular Value, and the % of Singular Value Explained, which is used when Minimum percentage explained is selected for Concept Extraction - Latent Semantic Indexing on the Representation tab. This option is not selected by default.

Term importance

This table displays each term along with its importance as calculated by singular value decomposition. This option is not selected by default.

Write text mining model

Select this option to write the baseline, or base corpus, model to a worksheet. The base corpus model can be used to process new documents based on the existing text mining model.