Note: The following functionality is currently not supported in the Analytic Solver Data Science Cloud app.
The following example illustrates how to import approximately 1,000 text files saved in a single file folder. Browse to C:\Program Files\Frontline Systems\Analytic Solver Platform\Datasets and open the Text Mining Example Documents.zip archive file. Unzip the contents of this file to C:\Program Files\Frontline Systems\Analytic Solver Platform\Datasets\Text Mining Example Documents\ (or a location of your choice). Four folders are created beneath Text Mining Example Documents: Autos, Electronics, Additional Autos and Additional Electronics. In total, around 1,200 short text files are extracted to the chosen location. This example is based on a text data set of 20,000 messages collected from 20 different netnews newsgroups. From these postings, two interest groups were selected: one related to Autos and one related to Electronics (about 50% in each).
Open a new Excel worksheet. On the Analytic Solver Data Science ribbon, select Get Data - File Folder to open the Import From File System dialog. At the top of the dialog, click Browse... and navigate to the Autos subfolder (C:\Program Files\Frontline Systems\Analytic Solver Platform\Datasets\Text Mining Example Documents\Autos). Set the File Type to All Files (*.*), then select all files in the folder and click Open. The files are displayed under the Files list. Click the >> button to move the files from the Files list to the Selected Files list. Repeat these steps for the Electronics subfolder. When these steps are completed, 985 files are displayed under Selected Files.
Select Sample from selected files to enable the Sampling Options. Analytic Solver Data Science will perform sampling from the files in the Selected Files field. Enter 300 for Desired sample size while leaving the default settings for Simple random sampling and Set Seed.
Note: If you are using the educational version of Analytic Solver Data Science, enter "100" for Desired sample size. This is the upper limit on the number of files supported when sampling from a file system in the educational version. For a complete list of the capabilities of Analytic Solver Data Science and Analytic Solver Data Science for Education, see the product documentation.
Analytic Solver Data Science will select 300 files using Simple random sampling with a seed value of 12345. Under Output, leave the default setting of Write file paths. Rather than writing out the file contents into the report, Analytic Solver Data Science will include the file paths.
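For readers who want to see what simple random sampling with a fixed seed means outside the dialog, the following is a minimal Python sketch. It is not Analytic Solver Data Science's implementation and will not reproduce the same 300 files as the tool. The function name sample_file_paths and its parameters are hypothetical, chosen only for illustration.

```python
import random
from pathlib import Path


def sample_file_paths(folders, sample_size, seed=12345):
    """Return a simple random sample (without replacement) of file paths.

    Hypothetical helper for illustration only; it does not mirror the
    tool's internal random number generator.
    """
    # Collect every regular file directly inside each folder. Sorting makes
    # the candidate order, and therefore the seeded sample, reproducible.
    candidates = sorted(
        str(p)
        for folder in folders
        for p in Path(folder).iterdir()
        if p.is_file()
    )
    # Never request more files than are available.
    k = min(sample_size, len(candidates))
    # A dedicated generator keeps the seed local to this call.
    rng = random.Random(seed)
    return rng.sample(candidates, k)
```

Calling the function with the Autos and Electronics folder paths and a sample size of 300 would return 300 distinct paths. Repeating the call with the same seed returns the same paths in the same order, which is the purpose of setting a seed.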
Note: Currently, Analytic Solver Data Science only supports the import of delimited text files. A delimited text file is one in which data values are separated by characters such as quotation marks, commas or tabs. These characters define the beginning and end of each string of text.
Click OK. The FileSampling worksheet, containing the sampling results, will be inserted into the current workbook.
The Inputs portion of the report displays the selections we made on the Import From File System dialog. Here we see the path of the directories, the number of files written, our choice to write the paths or contents (File Paths), the sampling method, the desired sample size and the seed value (12345).
Under Text Data are the paths to the 300 text files sampled by Analytic Solver Data Science, listed in random order. If Write file contents had been selected rather than Write file paths, the report would contain the RowID, the File Path, and the first 32,767 characters of each document.
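The difference between the two output options can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: the function build_text_rows, the "Text" column name, and the UTF-8 encoding are hypothetical. Only the RowID and File Path columns and the 32,767-character limit come from the description above.

```python
from pathlib import Path

# Truncation length taken from the report description above.
MAX_CHARS = 32_767


def build_text_rows(paths, write_contents=False, max_chars=MAX_CHARS):
    """Build report-like rows holding file paths or truncated file contents.

    Hypothetical helper; the "Text" column name is an assumption.
    """
    rows = []
    for row_id, path in enumerate(paths, start=1):
        row = {"RowID": row_id, "File Path": str(path)}
        if write_contents:
            # Assumes UTF-8 text; undecodable bytes are replaced, not fatal.
            text = Path(path).read_text(encoding="utf-8", errors="replace")
            # Keep only the first max_chars characters of the document.
            row["Text"] = text[:max_chars]
        rows.append(row)
    return rows
```

With write_contents left at False, each row carries only the RowID and the path, which corresponds to the Write file paths setting used in this example.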
From here, one could use Excel’s sort features to group the paths by “Autos” and “Electronics” for use with the Text Mining tool. See the subsequent Text Mining chapter for an example of how to use this feature.
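The same grouping can be done programmatically once the paths have been copied out of the worksheet. The sketch below is a minimal Python illustration, assuming each sampled path ends in ...\Autos\<file> or ...\Electronics\<file>. The function name categorize_by_folder and the file names in the example are hypothetical.

```python
from pathlib import PureWindowsPath


def categorize_by_folder(paths, categories=("Autos", "Electronics")):
    """Group file paths by the name of the folder that directly contains them."""
    groups = {category: [] for category in categories}
    for path in paths:
        # PureWindowsPath parses backslash-separated paths on any platform.
        parent = PureWindowsPath(path).parent.name
        if parent in groups:
            groups[parent].append(path)
    return groups


# Hypothetical file names, for illustration only.
example_paths = [
    r"C:\Datasets\Text Mining Example Documents\Autos\doc_a.txt",
    r"C:\Datasets\Text Mining Example Documents\Electronics\doc_b.txt",
    r"C:\Datasets\Text Mining Example Documents\Autos\doc_c.txt",
]
grouped = categorize_by_folder(example_paths)
```

Matching on the immediate parent folder, rather than searching the whole path for the word "Autos", avoids misclassifying a file whose name or higher-level folder happens to contain a category name.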