This initial version of the
code assumes that all target/basewords files (those given in the
"basewords and useful stimuli" folder) are in the C:/ directory, as in:
C:/targetwords.txt etc.
The semalUI should only be
loaded into the Lisp environment when using the LispWorks environment,
as it relies on the CAPI library. It is only used for loading and
querying already analysed texts (i.e. texts which have already been
loaded into a hash table).
For non-UI interaction:
All the interactions that follow assume that all the defined functions have been loaded into the working memory of the running Lisp session. To do this, it is sufficient to load the `semal.lisp' file with the load function, as shown below.
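For example, assuming semal.lisp sits at the root of the C:/ drive (adjust the path to wherever the file actually resides):

(load "c:/semal.lisp")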
The different text-analysing functions can appear quite overwhelming to the first-time user, but each of them has a specific use. A user with a little knowledge of the Lisp programming language can easily create a new function with the exact functionality required. The simplest text-analysing function is the analyse function. It analyses a single stream and populates the *word-count* and *target* global hash tables, which are used for the similarity calculation. This function also saves the complete target table in a file, ready to be loaded at a future date. The only minor difficulty this function presents is that the user must specify a stream as input, not a file. A simpler version which accepts a file as input (or even either a file or a stream) can easily be written around it:
(defun easy-analyse (file &optional (winsize 10)
                                    (basewords (get-basewords))
                                    (targetwords (get-targetwords))
                                    (pathname "foo.txt")
                                    (gutenberg nil))
  (with-open-file (in file :direction :input)
    (analyse in winsize basewords targetwords pathname gutenberg)))
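With this wrapper in place, a whole-file analysis with the default parameters reduces to a single call, for example:

(easy-analyse "c:/ws.txt")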
The with-open-file macro makes a stream from a file and binds the stream to the name specified (in this example `in') within its body. The simple analysis of a file called `c:/ws.txt', using the analyse function with its default parameters, is achieved by the following interaction:
CL-USER 1 > (with-open-file (in "c:/ws.txt" :direction :input)
              (analyse in))
;;; Compiling file foo.txt ...
;;; Safety = 3, Speed = 1, Space = 1, Float = 1, Interruptible = 0
;;; Compilation speed = 1, Debug = 2, Fixnum safety = 3
;;; Source level debugging is on
;;; Source file recording is on
;;; Cross referencing is on
; (TOP-LEVEL-FORM 1)
; (TOP-LEVEL-FORM 2)
; (TOP-LEVEL-FORM 3)
;; ** Automatic Clean Down
#P"C:/Documents and Settings/Avri/foo.fsl"
NIL
NIL
The output shows that the target table has been saved in the file C:/Documents and Settings/Avri/foo.fsl, from where it can be loaded later. If one does not wish to use the default parameters for the semantic space construction, the different components can be specified explicitly when calling the analyse function:
CL-USER 5 > (with-open-file (in "c:/ws.txt" :direction :input)
              (analyse in
                       8
                       '("black" "army" "death")
                       '("king" "queen" "fool")
                       "c:/testing.txt"
                       t))
;;; Compiling file c:/testing.txt ...
;;; Safety = 3, Speed = 1, Space = 1, Float = 1, Interruptible = 0
;;; Compilation speed = 1, Debug = 2, Fixnum safety = 3
;;; Source level debugging is on
;;; Source file recording is on
;;; Cross referencing is on
; (TOP-LEVEL-FORM 1)
; (TOP-LEVEL-FORM 2)
; (TOP-LEVEL-FORM 3)
#P"c:/testing.fsl"
NIL
NIL
The first optional input is
the window size, which defaults to 10. The next two inputs are,
respectively, the list of base words (also called context words) and
the list of target words (i.e. words to be analysed). The user is not
expected to write these lists of words directly each time they are
needed; for example, the default list of context words contains 536
elements. Instead, the user should keep a file containing all the
words of one list, and use the read-file function to get the list, as
in the following example, where the output has been omitted:
CL-USER 6 > (with-open-file (in "c:/ws.txt" :direction :input)
              (analyse in
                       8
                       (read-file "c:/basewords.txt")
                       (read-file "c:/targetwords.txt")
                       "c:/testing.txt"
                       t))
The final two inputs to this function are the file in which to store the *target* table and the Gutenberg boolean. It is important to note that the target-words table is not saved under the exact file name entered, but under a .fsl file with the same name, at the same location. In the above example, the table will be saved in "c:/testing.fsl". This anomaly will be removed from the program in the next version. Finally, the Gutenberg boolean specifies whether the file analysed has been taken from the Gutenberg Project, which in turn determines whether the header present in all Gutenberg Project files must be stripped.
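Until that change is made, the actual save location can be predicted from the entered name using standard Common Lisp pathname functions; this snippet is purely illustrative and not part of semal itself:

(make-pathname :type "fsl" :defaults (pathname "c:/testing.txt"))
;; => #P"c:/testing.fsl"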
It is possible to use one of
the other functions in order to analyse texts. If these functions do
not save all the global variables, or the target table (depending on
what is needed), the user can do this manually using the functions
save-target-table and save-all-variables. When saving all the
variables, the target table can be normalised upon loading the file,
whereas when saving only the target table one must make sure it is
already normalised (as the information required to normalise it at a
later date is not saved). Both these functions require only the
pathname to save the variables to, as the hash table itself is a
global variable:
CL-USER 7 > (save-all-variables "c:/testing.txt")
;;; Compiling file c:/testing.txt ...
;;; Safety = 3, Speed = 1, Space = 1, Float = 1, Interruptible = 0
;;; Compilation speed = 1, Debug = 2, Fixnum safety = 3
;;; Source level debugging is on
;;; Source file recording is on
;;; Cross referencing is on
; (TOP-LEVEL-FORM 1)
; (TOP-LEVEL-FORM 2)
; (TOP-LEVEL-FORM 3)
; (TOP-LEVEL-FORM 4)
; (TOP-LEVEL-FORM 5)
; (TOP-LEVEL-FORM 6)
#P"c:/testing.fsl"
NIL
NIL
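Saving only the (already normalised) target table works in the same way; a hypothetical call, with an illustrative file name, would be:

(save-target-table "c:/target-only.txt")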
A detailed explanation of how
the other analysing functions work is not given in this document. The
code and example programs (such as the analyse-local-BNC function) give
general pointers as to how they are used. Once the *target* table has
been populated, or saved into a file, one can start observing the
results of the analysis.
If the *target* table has not been populated during the current session, the user must first load such a table from a file before observing similarities between words. This is done using the load command in Lisp (or alternatively the load-target-table function, which is a simple renaming of load), as follows:
CL-USER 8 > (load "c:/testing.fsl")
; Loading fasl file c:\testing.fsl
#P"c:/testing.fsl"
If the *target* table has been saved with all the global variables and still needs to be normalised, this can also be done. The default normalisation function provided is the log odds-ratio, and the function finalize-target-table takes care of this step.
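Assuming finalize-target-table operates on the global *target* table and takes no required arguments (check its definition in semal.lisp for optional parameters such as the normalisation function), the step is a single call:

(finalize-target-table)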
Once the *target* table has been normalised, similarity values for pairs of words can be calculated using the similarity function as shown in the following examples:
CL-USER 9 > (similarity "black" "white")
0.7966215831073316
CL-USER 10 > (similarity "black" "dog")
0.35816217445260423
CL-USER 11 > (similarity "good" "yellow")
NIL
CL-USER 12 > (similarity "man" "man")
1.0000000000000004
These examples are also a good illustration of what happens in special cases. When one of the inputs to the function is not in the *target* table, the function returns nil (as in interaction number 11). Finally, the precision of the calculations is shown by interaction 12. The similarity between a word and itself should be exactly 1, so there is a floating-point error of about 0.0000000000000004 in that interaction. As a rule of thumb, it is useful to take into account only the first few digits when comparing similarity values. For example, we can say that the similarity between "black" and "white" is about 0.800, and the similarity between "black" and "dog" is about 0.358. In order to replicate the different experiments, other functions to query the *target* table in different ways have been written. Most of these use the similarity function seen above, or optimise it for specific uses (such as not calculating the vector form of each word more than once).
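As an illustration of that kind of optimisation, the sketch below caches each word's vector once before ranking a list of candidate words against a probe word. The helpers word-vector and vector-similarity are hypothetical names standing in for whatever semal uses internally to build a word's vector from the *target* table and to compare two vectors:

(defun rank-by-similarity (probe candidates)
  "Rank CANDIDATES by decreasing similarity to PROBE, computing each
word's vector only once.  Illustrative sketch: WORD-VECTOR and
VECTOR-SIMILARITY are hypothetical stand-ins for semal's internal
vector construction and comparison."
  (let ((cache (make-hash-table :test #'equal)))
    (flet ((vec (word)
             (or (gethash word cache)
                 (setf (gethash word cache) (word-vector word)))))
      (sort (loop for c in candidates
                  for v = (vec c)
                  when v
                    collect (cons c (vector-similarity (vec probe) v)))
            #'> :key #'cdr))))

The point of the cache is the one named in the text: when many pairs share a word (here, the probe), its vector is looked up once instead of being recomputed for every similarity call.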