Last Modified 3 October 2008

Statistical Semantic Analysis

These are instructions for using the code Avri Bilovich wrote for his undergraduate dissertation with Joanna Bryson at the University of Bath.  Some of his results have since been published:
In addition to the instructions below, you will also want


This initial version of the code assumes that all target/baseword files (those given in the "basewords and useful stimuli" folder) are in the C:/ directory, as in C:/targetwords.txt etc...

The semalUI should only be loaded into the Lisp environment when using the LispWorks environment, as it relies on the CAPI library. It is only used for loading and querying already-analysed texts (i.e. texts which have already been loaded into a hash table).

For non-UI interaction:

All the interactions that follow assume that all the defined functions have been loaded into the working memory of the current Lisp session. To do this, it is sufficient to load the `semal.lisp' file with the load function.
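For example, assuming `semal.lisp' sits in the current directory (adjust the path to wherever you have put the code):

(load "semal.lisp")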

A.1 Analysing texts and saving the results

The different text analysing functions can appear quite overwhelming to the first-time user, but each of them has a specific use. A user with a little knowledge of the Lisp programming language can easily create a new function with the exact functionality required. The simplest text analysing function is the analyse function. It analyses a single stream and populates the *word-count* and *target* global hash tables, which are used for the similarity calculation. This function also saves the whole target table in a file ready to be loaded at a future date. The only minor difficulty that this function presents is that the user must specify a stream as an input and not a file. A simpler version which accepts a file as an input, or even either a file or a stream, can easily be written using this function:

(defun easy-analyse (file &optional (winsize 10)
                                    (basewords (get-basewords))
                                    (targetwords (get-targetwords))
                                    (pathname "foo.txt")
                                    (gutenberg nil))
  (with-open-file (in file :direction :input)
    (analyse in winsize basewords targetwords pathname gutenberg)))
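With this helper defined, a whole file can be analysed in a single call (a sketch of the intended usage, using the same example file as the interactions below):

(easy-analyse "c:/ws.txt")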

The with-open-file command makes a stream from a file and binds the stream to the name specified (in this example `in') within its body. A simple analysis of a file called `c:/ws.txt', using the analyse function with its default parameters, is achieved by this interaction:

CL-USER 1 > (with-open-file (in "c:/ws.txt" :direction :input)
              (analyse in))
;;; Compiling file foo.txt ...
;;; Safety = 3, Speed = 1, Space = 1, Float = 1, Interruptible = 0
;;; Compilation speed = 1, Debug = 2, Fixnum safety = 3
;;; Source level debugging is on
;;; Source file recording is on
;;; Cross referencing is on
; (TOP-LEVEL-FORM 1)
; (TOP-LEVEL-FORM 2)
; (TOP-LEVEL-FORM 3)
;; ** Automatic Clean Down
#P"C:/Documents and Settings/Avri/foo.fsl"
NIL
NIL

The output shows that the target table has been saved in the file C:/Documents and Settings/Avri/foo.fsl, and can be loaded from there. If one does not wish to use the default parameters for the semantic space construction, the different components can be specified to the analyse function:

CL-USER 5 > (with-open-file (in "c:/ws.txt" :direction :input)
              (analyse in
                       8
                       '("black" "army" "death")
                       '("king" "queen" "fool")
                       "c:/testing.txt"
                       t))
;;; Compiling file c:/testing.txt ...
;;; Safety = 3, Speed = 1, Space = 1, Float = 1, Interruptible = 0
;;; Compilation speed = 1, Debug = 2, Fixnum safety = 3
;;; Source level debugging is on
;;; Source file recording is on
;;; Cross referencing is on
; (TOP-LEVEL-FORM 1)
; (TOP-LEVEL-FORM 2)
; (TOP-LEVEL-FORM 3)
#P"c:/testing.fsl"
NIL
NIL

The first optional input is the window size, which defaults to 10. The next two inputs are respectively the list of base words (also called context words) and the list of target words (i.e. the words to be analysed). The user is not expected to write out these lists of words each time they are needed: the default list of context words, for example, contains 536 elements. Instead, the user should keep a file containing all the words for one list and use the read-file function to get the list, as in the following example, where the output has been omitted:

CL-USER 6 > (with-open-file (in "c:/ws.txt" :direction :input)
              (analyse in
                       8
                       (read-file "c:/basewords.txt")
                       (read-file "c:/targetwords.txt")
                       "c:/testing.txt"
                       t))

The final two inputs to this function are the file in which to store the *target* table and the Gutenberg boolean. It is important to note that the target-words table is not saved under the exact file name entered, but under the .fsl file with the same name, at the same location. In the above example, the table will be saved in `c:/testing.fsl'. This anomaly will be removed from the program in the next version. Finally, the Gutenberg boolean specifies whether the file analysed has been taken from the Gutenberg Project, and hence whether the header that is present in all Gutenberg Project files must be stripped.
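For reference, the effective save path can be computed with the standard Common Lisp pathname functions (a minimal sketch, not part of the original code):

;; Replace the file's type component with "fsl", keeping the name
;; and directory: the table lands next to the named file.
(merge-pathnames (make-pathname :type "fsl") "c:/testing.txt")
;; => #P"c:/testing.fsl"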

It is possible to use one of the other functions to analyse texts. If these functions do not save all the global variables, or the target table (depending on what is needed), the user can do so manually using the functions save-target-table and save-all-variables. When all the variables are saved, the target table can be normalised upon loading the file; when only the target table is saved, one must make sure it is already normalised (as the information required to normalise it at a later date is not saved). Both functions only require the pathname to save to, as the hash table is a global variable:

CL-USER 7 > (save-all-variables "c:/testing.txt")
;;; Compiling file c:/testing.txt ...
;;; Safety = 3, Speed = 1, Space = 1, Float = 1, Interruptible = 0
;;; Compilation speed = 1, Debug = 2, Fixnum safety = 3
;;; Source level debugging is on
;;; Source file recording is on
;;; Cross referencing is on
; (TOP-LEVEL-FORM 1)
; (TOP-LEVEL-FORM 2)
; (TOP-LEVEL-FORM 3)
; (TOP-LEVEL-FORM 4)
; (TOP-LEVEL-FORM 5)
; (TOP-LEVEL-FORM 6)
#P"c:/testing.fsl"
NIL
NIL
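save-target-table is called in the same way (assuming, per the above, that it takes just the pathname and saves only the already-normalised *target* table):

(save-target-table "c:/testing.txt")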

A detailed explanation of how the other analysing functions work is not given in this document. The code and example programs (such as the analyse-local-BNC function) give general pointers as to how they are used. Once the *target* table is populated, or saved into a file, one can start observing the results of the analysis.

A.2 Querying the *target* table: observing the analysis results

If the *target* table has not been populated during the current session, the user must first load such a table from a file before observing similarity between words. This operation is done using the load command in Lisp (or alternatively the load-target-table function, which is a simple renaming of load), as follows:

CL-USER 8 > (load "c:/testing.fsl")
; Loading fasl file c:\testing.fsl
#P"c:/testing.fsl"

If the *target* table has been saved with all the global variables and needs to be normalised, this can also be done; the function finalize-target-table takes care of it. The default normalisation function provided is the log odds-ratio.
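For example (assuming finalize-target-table, like the save functions above, operates on the global table and takes no required arguments):

(finalize-target-table)   ; normalise *target* in place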

Once the *target* table has been normalised, similarity values for pairs of words can be calculated using the similarity function as shown in the following examples:

CL-USER 9 > (similarity "black" "white")
0.7966215831073316

CL-USER 10 > (similarity "black" "dog")
0.35816217445260423

CL-USER 11 > (similarity "good" "yellow")
NIL

CL-USER 12 > (similarity "man" "man")
1.0000000000000004

These examples are also a good illustration of what happens in special cases. When one of the inputs to the function is not in the *target* table, the function returns nil (as in interaction number 11). Finally, the precision of the calculations is shown by interaction 12. The similarity between a word and itself should be exactly 1, so that interaction carries a floating-point error of about 0.0000000000000004. As a rule of thumb, it is useful to take into account only the first few digits when comparing similarity values. For example, we can say that the similarity between "black" and "white" is about 0.800 and the similarity between "black" and "dog" is about 0.358. In order to replicate the different experiments, other functions to query the *target* table in different ways have been written. Most of these use the similarity function seen above, or optimise it for specific uses (such as not calculating the vector form of each word more than once...).
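A small convenience wrapper implementing this rule of thumb might look as follows (a sketch, not part of the original code; rounded-similarity is a hypothetical name):

;; Round a similarity score to three decimal places, passing nil
;; through when either word is missing from the *target* table.
(defun rounded-similarity (word1 word2)
  (let ((s (similarity word1 word2)))
    (when s
      (/ (round (* s 1000)) 1000.0))))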


page author: Avri Bilovich
Back to AmonI Software