In this tutorial we shall use real data to test the perceptron algorithm. In order to do this, we need to be able to read files from disk in a Haskell program.
Try the following two evaluations in ghci
"Hello World!"
putStr "Hello World!"
What is the difference between the two? Why is there a difference?
What are the types of the two expressions above? Do you know? Try it out and see if it matches what you think.
:type "Hello World!"
:type putStr "Hello World!"
The IO () type is an example of a monad, a concept which will take some time to get used to. For the time being, we will only be concerned with the IO monad and how to use it to control I/O. We will learn more about monads later.
IO is a type constructor, so it wraps another type. In the case above, we had IO (), with () as the inner type. This is the singleton type; i.e. the type () has only one possible value, namely ().
What use can we have of a singleton type? A value of type IO () can be viewed as an action. Thus the type stores an action which can be subject to calculations and used to construct other actions. When the program runs, the action will eventually be performed.
Output actions, such as the one returned by putStr, will typically have type IO (). They are interesting because of the output they generate, not because of the data they contain. An input function, in contrast, could have type (say) IO String, where the type wraps the data (a string) read from input.
A program, typically, is a sequence of actions. Such sequences can be constructed in several ways. The easiest way to get started is to use the syntactic sugar of the do notation. That will do for now. We will dig deeper next week.
We are going to use two IO functions, putStr and getLine.
What types do putStr and getLine have? (Use :type in GHCi.)
Create a module called Main for this exercise, and add the following definition:
hello :: IO ()
hello = do
  n <- getLine
  putStr ("Hello, " ++ n ++ "\n")
You will see in the next step why it has to be called Main.
Load the Main module in GHCi and evaluate hello. If nothing happens and you don't get a prompt, it is waiting for your input.
The interpreter (ghci) is great for testing individual functions, but at the end of the project you will probably want to produce a stand-alone program. This requires a compiler, namely ghc.
A stand-alone program is a module called Main with a function main :: IO a for some type a.
main = hello
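For reference, a complete Main.hs along these lines might look roughly as follows (here main :: IO () is one particular instance of main :: IO a, and hello is the action defined earlier):
module Main where

-- A minimal standalone program: main simply runs the hello action.
main :: IO ()
main = hello

-- The greeting action from the previous step.
hello :: IO ()
hello = do
  n <- getLine
  putStr ("Hello, " ++ n ++ "\n")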
ghc Main.hs
ls
Which new files have been created?
./Main
It is possible to get GHC to make programs with names other than Main, but let's cross that bridge when we need it.
We want to test our machine learning algorithm on real data. University of California, Irvine hosts the machine learning repository which provides a large collection of real data for testing. We will use some breast cancer data from Wisconsin.
Comma separated values (CSV) is a common format to store data. Each row is a record, and each item of the record is separated by commas. We need to figure out how to read such files in Haskell.
In the previous step we downloaded a file with comma-separated values (CSV), which we want to use with our perceptron. Hence we need to write the necessary functions to load and parse such a file. We start with loading, and return to parsing in the next step.
Make sure you have the data file wdbc.data in your current directory, and test the following in GHCi.
readFile "wdbc.data"
What do you get?
To parse the CSV file, we will use a library which is not installed by default. Hackage is a rich database of libraries for Haskell, and you are likely to consult it frequently for new libraries, which are easily installed with the cabal tool.
If you google for «haskell csv», you will find a number of hits. When I did it, the top three were different libraries in hackage. In this tutorial I will use the simplest of these libraries. It may be rather crude, but it will get the job done quickly. Feel free to take a more mature approach if you are up to it.
Find the Text.CSV module on Hackage. This gives the API documentation for the module. Which types and functions can you use? (Don't spend too much time on this if you don't see the answer. We walk you through it later.)
cabal install csv
As you see in the API documentation, the CSV library has several functions to parse CSV data. Since we have already learnt how to read the file into a String, we will use the function parseCSVTest, which parses a String.
Open ghci.
import Text.CSV
let s = "1,2,3\n4,5,6"
The parseCSVTest function takes one argument, namely the CSV formatted string. Try this:
parseCSVTest s
Look at the output. What data type is returned?
What is the type of parseCSVTest? You can check the documentation or use GHCi with the following command.
:type parseCSVTest
Comments?
The parseCSVTest function is a test function which prints the data on the terminal. It does not actually return the data. To be able to use the data for further computation, we will use parseCSV.
What is the return type of parseCSV?
There are two `kinds' of objects of this type. What do you get from the following in GHCi?
:type Left 'a'
:type Right 2
So the return type of parseCSV is either a `Left', which means it is a ParseError, or a `Right', which means it is a valid CSV object. In real software you have to take care of ParseError to do error handling. However, for now, we will be rather crude, and try to get on with it.
We can use the following function to unpack the Either type:
-- Unwrap the Either: abort on a parse error, return the CSV otherwise.
stripError :: Either a b -> b
stripError (Left _) = error "Parser error!"
stripError (Right csv) = csv
Test the function in GHCi.
stripError (Left "foobar")
stripError (Right 3.14)
The first argument to parseCSV is a file name, which is only used when reporting parse errors. We won't need it for now, so let's just write a very simple wrapper for parseCSV:
parseCSVsimple :: String -> CSV
parseCSVsimple s = stripError (parseCSV "/dev/null" s)
Here, /dev/null is a special file which discards all data written to it.
Create a new module to handle CSV data for neural networks, adding the definitions above. Maybe ANNData is a suitable name.
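If you want a starting point, a sketch of such a module (using the suggested name ANNData) could look like this:
module ANNData where

import Text.CSV

-- Unwrap the Either returned by the parser, aborting on a parse error.
stripError :: Either a b -> b
stripError (Left _) = error "Parser error!"
stripError (Right csv) = csv

-- Parse a CSV string, ignoring the possibility of parse errors.
parseCSVsimple :: String -> CSV
parseCSVsimple s = stripError (parseCSV "/dev/null" s)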
Test parseCSVsimple in the same way as you tested parseCSVTest.
We have learnt to read a file into a string, and to parse a string for CSV data. Let's put the two operations together, to parse the real data set. We will make a function with the following type:
getRawData' :: String -> IO [[String]]
The input argument is the filename, used as an argument to readFile. The output is a list of lists, where each constituent list is one row from the CSV file, and each string in the inner list is one value from the comma-separated line.
Extend the ANNData module from Step 5 by adding the following definitions.
Write getRawData'. You will need to use the readFile and parseCSVsimple functions.
Test getRawData' on the Wisconsin Breast Cancer Data file. You can use the structure from the Main program in Problem 1.
When you have the return value from parseCSVsimple, of type [[String]], you can use the return function on it, to get a return value of type IO [[String]].
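As a sketch, one possible definition along these lines (assuming the parseCSVsimple wrapper from Step 5) is:
-- Read the named file and parse it; CSV is a synonym for [[String]].
getRawData' :: String -> IO [[String]]
getRawData' fname = do
  s <- readFile fname
  return (parseCSVsimple s)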
Note: There is a slightly simpler way to do this. You can make a wrapper similar to parseCSVsimple, using parseCSVFromFile instead of parseCSV. Try it out for yourself if you have time.
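For example, a sketch of that variant could be (getRawDataFile is just an illustrative name; it reuses stripError from Step 5):
-- Alternative: let the library read the file as well.
getRawDataFile :: String -> IO [[String]]
getRawDataFile fname = do
  r <- parseCSVFromFile fname
  return (stripError r)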
It is possible that the data from parseCSVsimple includes an empty row, [""].
Write a function dropEmpty which takes a list of lists, as returned by getRawData', and drops any list containing just the empty string, keeping all others. Remember to give a type declaration as well as the function definition.
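One possible definition is a one-liner with filter (a sketch; the row to drop is exactly [""]):
-- Keep every row except those consisting of a single empty string.
dropEmpty :: [[String]] -> [[String]]
dropEmpty = filter (/= [""])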
getRawData :: String -> IO [[String]]
getRawData fname = do
  d <- getRawData' fname
  return (dropEmpty d)
Alternatively, the same function can be written point-free, getRawData = fmap dropEmpty . getRawData', where the dot denotes function composition.
So far we have read and parsed the data set to obtain a list of lists of strings. However, the data are numerical, so String is not an appropriate data type. We need to clean it up, and parse the strings containing numbers into a numeric data type.
Each row in the CSV file includes several values which would form the input vector to a perceptron, plus a class which determines the correct output.
Cleaning up the data is a multi-step process, which we consider in the next problem.
The data set (CSV) file consists of rows. Each row consists of an ID, a class label, and a feature vector. The feature vector is in turn made up of individual features.
The raw data that you have read is [[String]], so each row is a list of strings, where one string is the class label, some strings may be ignored (the ID), and the rest form the feature vector.
We want to reformat the data set so that it has type [(Double,[Double])]. Thus each row is a pair, where the first element is the class label (Double) and the other is the feature vector ([Double]). Thus, we need the function
processDataSet :: [[String]] -> [(Double,[Double])]
It is easiest to work bottom up. So we will do processDataSet last, and start with the class label and individual features.
The class label is a string "M" or "B", while it
should be numeric, typically -1 or +1.
Let's map "M" to +1 and "B" to -1.
We need a function numericLabel to do the conversion.
Write the type declaration for numericLabel.
Define numericLabel.
Test it on the following inputs:
numericLabel "M"
numericLabel "B"
numericLabel "q"
numericLabel "Bonnie"
For the time being, it is ok if the last two tests cause an error. In a production system we would have to handle such errors appropriately. Our time, in contrast, is better spent on exploring the learning algorithm, than handling input which we do not want to handle.
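A sketch of one possible definition, using the mapping above ("M" to +1, "B" to -1) and a runtime error for anything else:
-- Convert the class label to a number: "M" (malignant) -> 1, "B" (benign) -> -1.
numericLabel :: String -> Double
numericLabel "M" = 1
numericLabel "B" = -1
numericLabel s = error ("Unknown class label: " ++ s)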
The features are strings representing numeric data. We have to parse them to get floating point data. We need a function numericFeatures to do the conversion.
We can use the read function to do the conversion. Open ghci and get familiar with it. Try the following:
read "6.12"
What happens?
read "6" :: Integer
read "6" :: Double
read "6.12" :: Double
Write the type declaration for numericFeatures.
Define numericFeatures, using map and read.
Note that the type declaration of numericFeatures makes sure that the right version of read is used.
numericFeatures ["6.12","8.11","0","2"]
numericFeatures ["B","6.12","8.11","0","2"]
For the time being, it is ok if the last test causes an error. As before, a production system would require adequate error handling.
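A minimal sketch; the type signature pins down which instance of read is used:
-- Parse each feature string as a Double.
numericFeatures :: [String] -> [Double]
numericFeatures = map read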
Using the helper functions from Steps 1-2, we are ready to write a function processItem taking a row ([String]) from the parsed CSV data and returning a pair with the class label and feature vector for the perceptron.
Write the type declaration for processItem.
Define processItem, using the helper functions from Steps 1-2.
Test the function, e.g.
processItem ["9898","M","6.12","8.11","0","2"]
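A sketch, assuming each row starts with the ID, followed by the class label and then the features (as in the Wisconsin data set):
-- Turn one row (ID, class label, features...) into a (label, features) pair,
-- discarding the ID.
processItem :: [String] -> (Double, [Double])
processItem (_ : label : features) = (numericLabel label, numericFeatures features)
processItem _ = error "Malformed row"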
Now we need a function formatData (this is the processDataSet function described in the introduction) taking [[String]] as input and applying processItem to each row. The output should be a list of class label/feature vector pairs. This is an obvious case for map.
Write the type declaration for formatData.
Define formatData.
Test the function on data from the getRawData function.
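A minimal sketch, simply mapping processItem over the rows:
-- Apply processItem to every row of the raw data.
formatData :: [[String]] -> [(Double, [Double])]
formatData = map processItem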
Write the function getData which takes a file name as input, reads the file, parses the CSV data, and formats it properly using formatData.
Note that you have written all the functionality already. You can use function composition (see Problem 2/Step 6) to combine previous functions and define getData.
Test the getData function in GHCi.
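A sketch combining the earlier functions (a do-block version; the point-free alternative is shown in the comment):
-- Read a CSV file and return (class label, feature vector) pairs.
getData :: String -> IO [(Double, [Double])]
getData fname = do
  raw <- getRawData fname
  return (formatData raw)
-- Equivalently, using function composition: getData = fmap formatData . getRawData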
As you see in the API documentation, the CSV library has several functions to parse CSV data. The one we used is very simple and provides no error handling.
Revise the functions above to use parseCSV, and handle error values properly.
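As a rough sketch of one possible revision (getRawDataSafe is just an illustrative name; assuming Text.CSV is imported, a parse error is reported on the terminal and an empty data set returned):
-- A variant of getRawData' with simple error handling.
getRawDataSafe :: String -> IO [[String]]
getRawDataSafe fname = do
  s <- readFile fname
  case parseCSV fname s of
    Left err -> do
      putStr ("Parse error: " ++ show err ++ "\n")
      return []
    Right csv -> return (dropEmpty csv)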