Tuesday, March 30, 2010

Datamining is the MVP of the future

Data Mining is the process of extracting patterns from data (Go, Go, Wikipedia!) and I firmly believe that being a Data Miner is the Most Valuable Profession of the future.

I bring this up because I watched a TED presentation from 2006 last night by Hans Rosling on the topic of Social Health Statistics. This TED talk is the best presentation I've seen in forever. I'm still puzzling over whatever gorgeous program he used for the graphics that did such an excellent job conveying information instantaneously with very little explanation. The two points of the talk that are absolutely required viewing are at 3:50 and 11:30. Rosling stands in front of the graph while it is animated and gestures at the data points, explaining why they are moving in the direction they are moving. Later in the video, he states that his interpretation of the data is made possible because the data has been aggregated and formatted in a way that is easy for humans to grok.

What is marvelous about this talk is that, in 2006, Rosling speaks of Data Mining without referencing it by name. He ends his talk by saying we need a 'garden' of interfaces to the vast amount of data we've been hoarding like misers against a time when we will actually be able to use and understand it. Our recent history is boiled down to data points, numbers and keywords, and stored in banks within organizations who, as Rosling says, have an attitude like the Head of UN Statistics he mentions. These organizations say 'we can't do it', at least as of this talk in 2006, but the option is there for others to try.

Data Miners are already digging fast and well, trying their hand at these banks of information, and it is only getting more prevalent as 2010 flows past. Data Mining, the type that shoots for human readability, combines art and statistics while providing relevance and suggesting relationships. This budding profession requires someone who has a notion of presenting information in understandable ways as well as someone who can wrangle the analysis required. Most of the data representation ends up in a graphic, if not a graph, and if it moves or is interactive, so much the better. Data Mining leads to Chart Porn. (Totally Safe-For-Work despite the name!)

Sites like Data Mining, Strange Maps, Information is Beautiful, and Weather Sealed present data in visual form, extracting meaning (or at least interesting relationships) out of meaningless heaps we've saved just in case we might need them. Additionally, ever more specific datasets can provide very specific information about just how broad or narrow certain trends can be. For WoW geeks like me, there is Armory Data Mining, which mines through the huge database called the WoW Armory that Blizzard provides for public access.

To rephrase the earlier definition, Data Mining (and subsequent Chart Porn ^^) is the process by which we take large quantities of data and poke at it until it makes a picture.

Science Fiction has been pointing at this idea and jumping up and down about it for years. The books/series that I recall off the top of my head that mention the idea of Data Mining and the representation of data for easy consumption are Otherland by Tad Williams, and Gun With Occasional Music by Jonathan Lethem. Each of these books contains the seed of an idea in which data (news, in this case) is aggregated in a visual or aural fashion, through what I assume is a process very like Data Mining, and presented to a character.

Otherland is especially interesting because Williams uses the same metaphor that Rosling does in his TED presentation, a garden. This garden has roses and weeds, and each plant and flower is weighted so that color, health and other variables correspond to frequency, reliability and whether or not the observer considers it a positive or negative trait.

Gun With Occasional Music takes a different route, as a science fiction dystopia that extrapolates from a point where visual media never really took off, and a morning symphony provides the daily news. A low, ominous tremolo suggests violence and percussion suggests murder, which sparks the main character to pick up an investigation. Both of these representations utilize Data Mining for information presentation, except they pull from a constantly changing dataset.

With these representations comes the idea that any representation of sufficient complexity could be utilized, especially to assist in keeping the differently-abled up to date using sound, visuals, numbers, and patterns. Humans are wild awesome at finding patterns and connections even where none exist.

Designing weighted patterns to put value and meaning in context, Data Miners provide one of the most valuable tools that our future has. With the volume of information at our fingertips getting more and more overwhelming, another layer of abstraction is absolutely necessary to be able to see the big picture. The 'big picture' requires us to zoom out, and the bigger the picture the further out we must go.

The Wikipedia article on Data Mining does not mention this sort of 'social' application and sticks mostly to business and science. However, Google proves that the social applications are already being explored. Google's algorithms operate on the basic assumption that you can mine almost any dataset for relationships. Outside of their intense search algorithms, Google Reader has a feature where you ask it to find interesting things based on what you've already said you liked. Similarly, most music databases can stream radio for you based on a 'seed' song or artist because they have gathered data from others who like the same things that you do. Data Mining as a concept has been around since the 80s and is being applied everywhere in small ways now that the internet has provided access to data to work with.
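For the curious, the 'seed' radio idea can be sketched in a few lines of Python. This is only a toy illustration of "people who like X also like Y" counting, not how any real music service works. The listener names, the artist names, and the `recommend_from_seed` function are all made up for the example.

```python
from collections import Counter

# Hypothetical data: which artists each (made-up) listener has said they like.
LISTENERS = {
    "ana": {"Artist A", "Artist B", "Artist C"},
    "ben": {"Artist A", "Artist B"},
    "cho": {"Artist A", "Artist D"},
    "dev": {"Artist E"},
}


def recommend_from_seed(seed, likes_by_user, top_n=3):
    """Rank artists by how many fans of the seed artist also like them."""
    counts = Counter()
    for liked in likes_by_user.values():
        if seed in liked:
            # Count everything this listener likes, except the seed itself.
            counts.update(liked - {seed})
    # Sort by count (highest first), then by name so ties are stable.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [artist for artist, _ in ranked[:top_n]]


stations = recommend_from_seed("Artist A", LISTENERS)
```

Here "Artist B" comes out on top because two of the three "Artist A" fans also like it. A real service would weigh far more signals, but the underlying relationship being mined is the same.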

But what is it GOOD for? How is this useful beyond 'information is fascinating' and beyond the 'big picture'?

One limited example is that Data Mining can be coupled with sufficiently complex Artificial Intelligence to look for flagged relationships within datasets. Credit card companies and other large corporations that deal with fraud in billing look for things like 'Purchase in Houston', with an immediate 'Purchase in Anchorage', and then again a 'Purchase in Houston'. The AI needs to be able to discern whether or not that is suspicious behavior (maybe they're a local business selling through the internet?) and if someone's card needs to be frozen. My mother once got a call asking whether she'd used her credit card for gas. She had the previous week, for the very first time, and it was atypical enough that the credit card company gave her a courtesy call to make sure her card hadn't been stolen. Even ten years ago it was not usually people flagging these instances. AI Data Mining of this sort is worth hundreds of thousands of dollars each year as it cuts down fraud.
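To make the Houston/Anchorage/Houston pattern concrete, here is a rough Python sketch of one very simple rule: flag two consecutive purchases if getting from one city to the other in the time between them would require travelling faster than an airliner. This is my own toy rule, not what any credit card company actually runs, and it is nowhere near real AI. The city coordinates are approximate, and the 900 km/h speed limit, the `Purchase` class, and the `flag_impossible_travel` function are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

# Approximate (latitude, longitude) pairs; assumed values for illustration.
CITY_COORDS = {"Houston": (29.76, -95.37), "Anchorage": (61.22, -149.90)}
MAX_SPEED_KMH = 900.0  # assumed ceiling, roughly an airliner's cruising speed
EARTH_RADIUS_KM = 6371.0


@dataclass
class Purchase:
    when: datetime
    city: str


def distance_km(city_a, city_b):
    """Great-circle (haversine) distance between two known cities."""
    lat1, lon1 = map(radians, CITY_COORDS[city_a])
    lat2, lon2 = map(radians, CITY_COORDS[city_b])
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def flag_impossible_travel(purchases):
    """Return consecutive purchase pairs that imply implausibly fast travel."""
    ordered = sorted(purchases, key=lambda p: p.when)
    flagged = []
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.city == cur.city:
            continue  # same city, nothing to check
        hours = (cur.when - prev.when).total_seconds() / 3600
        km = distance_km(prev.city, cur.city)
        if hours == 0 or km / hours > MAX_SPEED_KMH:
            flagged.append((prev, cur))
    return flagged


purchases = [
    Purchase(datetime(2010, 3, 30, 9, 0), "Houston"),
    Purchase(datetime(2010, 3, 30, 9, 30), "Anchorage"),
    Purchase(datetime(2010, 3, 30, 10, 0), "Houston"),
]
suspicious = flag_impossible_travel(purchases)
```

A rule this blunt would also flag the local business selling through the internet, which is exactly why the real systems need something smarter layered on top.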

Another example of what Data Mining is useful for comes about through Facebook, Myspace, and other high-volume social networks which have huge marketable databases. However, how ethical is it for the company that owns this database full of such an enormous volume of personal information to sell it to marketing firms or other places who might want to peek at all that delicious data? Not only does Data Mining present opportunities for marketing, but it also brings up huge glaring questions on how this information can be used while still protecting the individuals who provided the information.

I'm of the opinion that some data is meant to be mined, as if someone brought a dump truck of ore to your foundry and said, "find me something useful!" As a caveat, however, I think that the data should have been collected specifically to be mined, like the US Census and the daily news. With consent being the biggest factor, I believe that personal information should be protected as a type of media.

Professor Lev Manovich of the University of California, San Diego, suggested in 2001 (Nearly a decade ago!) that Databases are a Symbolic Form. In other words, he posits that databases, because of their unique structure without a beginning or an end, should be considered a new form of media. In this, I use the word database rather than dataset, primarily because database includes some element of structure and meaning to the collection of data and can include objects instead of datapoints, while dataset is a more mathematical term where the data within the set must have meaning applied.

Traversing a database, then, according to Manovich, is a non-linear way of navigating our shared experience. Where a story has a narrative and a photograph encourages movement of the eye to each point of interest, a database uses a webbed approach to convey information driven more by the user than the artist who created it. As anyone who has lost hours on TvTropes, Youtube, or Wikipedia could tell you, this type of presentation - a media full of other media, a meta-media - can be either edifying or a total waste of time. Manovich ends his essay by stating that there is nothing inherent within a database that 'fosters a narrative', but there are hundreds of thousands of narratives - linear paths compatible with human interpretation - lurking within any one database. These narratives, which he also refers to as interfaces, are the 'garden' of Rosling's TED talk. Manovich is talking about how the seeds from which the flowers grow are an essentially different type of new media.

With the idea that databases are a new media comes the idea that they can be treated ethically as media. In this vein, databases and datasets should be subject to a new form of copyright law incorporating basic rights of privacy wherein the individuals providing their information have the right to withhold it from the repository. If by simply living we are constantly creating and generating information for the consumption of others, then we should be able to hold the copyright on those creations like we do other types of created media.

With these concerns in mind, the internet and its ever-growing network of fascinating information - mostly unintentional and full of meaningful implied spaces - now gives us the capability of taking a stab at something very much like Asimov's Psychohistory to predict the general trend of events. We're approaching the point where everyone's personal data on the web, tracking their online lives and the gaps representing their offline lives, could feasibly be used to predict the future the same way we predict the weather. Google searches can already anticipate outbreaks of the flu and it is only a matter of time before judicious Data Mining turns up other trends and behavior we can identify as important.

We desperately need this type of data acquisition and make-sensey-ness in every field under the sun. The social sciences, government, business, medicine, education, computer science, news, the internet, and I need data mining. And, really, how can you say no to the internet?

This will be a baseline skill required everywhere once people figure out that this IS a skill, an art, and a profession. It is a field that requires new, developing structures and a way to compartmentalize so that people who do not know HOW this works still know they can hire a Data Miner to perform crucial functions. Even with an official title, Data Miners will still be wedging themselves into other professions that require similar skillsets: they will be hired to the AI/Fraud department of credit card companies, to high-profile research projects that need code monkeys (for science!), to marketing companies hoping to pinpoint their demographics by prodding Twitter, and to news outlets to produce pretty graphics for multimedia news presentation.

We need this. It is a skill and a profession that should be taught as such. Just as there are programmers, researchers, artists, and statisticians, there should be data miners who incorporate elements of each of these other professions and combine them into a field the future will find essential.

1 comment:

Andrew Jennings said...

I remember reading a book as a child about a group of scientists who set up the greatest data mining program ever. They then fed the information into a supercomputer hoping to make the computer self aware. It was pretty funny.

I think google has some really interesting data mining algorithms. They do some really brilliant data tracking/site crawling stuff for their webstats service.