Tuesday, September 8, 2015

Update: Machine Learning Projects for .NET Developers

I've been working my way through Machine Learning Projects for .NET Developers by Mathias Brandewinder (Amazon link). Apparently, I'm going pretty slowly since I got the book back in July (and was very excited about what I found). I've made it through Chapter 3 now.

Yes, I've been going slowly. But a lot of the information is really sinking in -- both about machine learning and about the F# language.

Very Approachable
One thing I really like about this book is that it is *very* approachable. When I looked at the first project on recognizing hand-written digits, I was expecting to get lost pretty quickly. And that's why I didn't undertake the challenge earlier (I made up my own challenge instead).

But Mathias made this task seem almost trivial. When he talks about the first step of doing the simplest thing possible (comparing light/dark pixel values from the data with the known values of the training set), it seemed blindingly obvious. Why didn't I think of that?

By seeing such an easy first step, I knew that the task was doable -- even by me.

And this was a hurdle that I needed to get over.

It doesn't stop there, though. That technique gets us positive results, but far from perfect. He then goes on to describe other machine learning algorithms that are a bit more complex (and a little bit outside of my competence with mathematics). And he also talks about the importance of comparing our "tweaked" algorithm to our baseline. We need to make sure we make things better and not worse.
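
To make that "simplest thing possible" concrete, here is a minimal sketch of my own (not the code from the book). It assumes each image is an array of pixel values, uses the sum of absolute pixel differences as the distance, and predicts the label of the closest training example. The Observation type and the tiny 4-pixel data set are made up for illustration. The accuracy function gives the baseline number that any "tweaked" version has to beat.

```fsharp
// Hypothetical type and data for illustration; not the book's code.
type Observation = { Label : string; Pixels : int[] }

// Distance between two images: sum of absolute pixel differences.
let distance (a : int[]) (b : int[]) =
    Array.map2 (fun x y -> abs (x - y)) a b
    |> Array.sum

// Predict the label of the closest training example.
let classify (training : Observation[]) (pixels : int[]) =
    training
    |> Array.minBy (fun obs -> distance obs.Pixels pixels)
    |> fun closest -> closest.Label

// Fraction of validation examples classified correctly (the baseline).
let accuracy (training : Observation[]) (validation : Observation[]) =
    validation
    |> Array.averageBy (fun obs ->
        if classify training obs.Pixels = obs.Label then 1.0 else 0.0)

let training =
    [| { Label = "0"; Pixels = [| 0; 255; 255; 0 |] }
       { Label = "1"; Pixels = [| 255; 0; 0; 255 |] } |]

let validation =
    [| { Label = "0"; Pixels = [| 10; 240; 250; 5 |] }
       { Label = "1"; Pixels = [| 250; 20; 10; 245 |] } |]

printfn "Accuracy: %.2f" (accuracy training validation)
```

Swapping in a different distance function and re-running accuracy is all it takes to check whether a change made things better or worse.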

Interesting Problems
Another thing that has kept my attention is the set of problems to be solved. I've already talked about how digit recognition had caught my eye, but it didn't stop there.

In Chapter 3, Mathias goes on to show how to build a simple DSL to query StackOverflow data. Does this sound familiar? If not, just go back to see how Barry Stahl (blog, twitter) blew my mind in a presentation on building a DSL to get StackOverflow data two years ago.

Getting into the F# Flow
The examples have also helped me "get into the flow" of F#. This has to do with really embracing the "pipe forward" operator |> and setting up parameters for functions so that they can easily be chained together.

Back to the StackOverflow DSL, I'm surprised at how easy this was to build. F# type providers are really awesome.

Here's a quick run-through of the DSL code (which is available on GitHub). The first few lines set up the type provider:


Then there are a few functions to build the DSL:


The way that we interact with StackOverflow is by building a URL with a query string. So each of these small functions (tagged, page, pageSize) is designed to add one piece to the query string. That is all they do, but wrapping the string manipulation this way makes the queries much easier to write and read.
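
As a rough sketch of the idea (my own reconstruction, not necessarily the book's exact code), the query-building functions can look like this. The base URL is an assumption, and I've left out extractQuestions since it needs the type provider and a network call. The important detail is that each function takes the query as its last parameter, which is what makes it work with |>.

```fsharp
// Assumed base URL for the StackExchange questions endpoint.
let questionQuery = "https://api.stackexchange.com/2.2/questions?site=stackoverflow"

// Each function takes the query last, so it can be chained with |>.
let tagged (tags : string list) (query : string) =
    // Multiple tags are joined into one ';'-separated value.
    let joinedTags = tags |> String.concat ";"
    sprintf "%s&tagged=%s" query joinedTags

let page (p : int) (query : string) = sprintf "%s&page=%i" query p
let pageSize (s : int) (query : string) = sprintf "%s&pagesize=%i" query s

// Build (but do not execute) a query for 100 questions tagged F#.
let fsSampleUrl =
    questionQuery
    |> tagged [ "F%23" ]   // "%23" is the URL-encoded "#"
    |> pageSize 100

printfn "%s" fsSampleUrl
```

Nothing here touches the network; it only builds the string that would be handed to the function that executes the query.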

The last function (extractQuestions) executes the query.

And with this in place, we can start using it:


The first couple of lines create aliases between "C#" and "F#" and their URL-encoded counterparts.

Then we see the awesomeness of having these small functions that we can pipe together. What "fsSample" ultimately gets us is the first 100 StackOverflow questions that are tagged with "F#". (The StackOverflow API is paged, and it returns the first page if no page number is specified.)

What I've really grown to like is the syntax that we see in "analyzeTags". This summarizes the results, and it does it in very small pieces: first it gets the tags from the questions, then it counts the number of times each tag appears, then it limits the results to tags that appear more than once, then it sorts them in descending order by count, and finally it prints out the results.
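
Here is a sketch of that pipeline that runs without calling the API. The Question record and the sample data are hypothetical stand-ins for what the type provider returns, so treat this as an illustration of the steps rather than the actual code.

```fsharp
// Hypothetical stand-in for a question returned by the API.
type Question = { Title : string; Tags : string[] }

let analyzeTags (questions : Question seq) =
    questions
    |> Seq.collect (fun q -> q.Tags)             // all tags from all questions
    |> Seq.countBy id                            // (tag, number of occurrences)
    |> Seq.filter (fun (_, count) -> count > 1)  // keep tags seen more than once
    |> Seq.sortBy (fun (_, count) -> -count)     // highest count first
    |> Seq.iter (fun (tag, count) -> printfn "%s,%i" tag count)

// Made-up sample data.
let sample =
    [ { Title = "Q1"; Tags = [| "f#"; "type-providers" |] }
      { Title = "Q2"; Tags = [| "f#"; "c#" |] }
      { Title = "Q3"; Tags = [| "f#"; "type-providers"; "json" |] } ]

analyzeTags sample
```

With this sample, only "f#" and "type-providers" survive the filter, since the other tags appear just once.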

Here's the output of "analyzeTags fsSample":


Working with data this way really makes me like F# more and more.

Hey Look What I Can Do!
The best thing about Machine Learning Projects for .NET Developers is that it lets me do things that I never thought I could do. I had made these tasks out to be much more complicated in my mind.

But here's something I did today (by following the example in the book):


This is a map showing the population density (people per square km) of each country. The data comes from the World Bank, with F# type providers giving easy access to the data and Deedle creating data frames that are easily consumable by R and the "rworldmap" package.

It sounds complicated, but there are only about 40 lines in the script file to do this. Really cool stuff.

More To Come
I'm less than halfway through Machine Learning Projects for .NET Developers, so there is much more to come. I think this is the catalyst that I need to travel much deeper into the functional world and the F# language.

Happy Coding!
