What YOU Need to Know About Big Data and Artificial Intelligence

Now that I have a section on my blog for AI and machine learning, I wanted to repost one of my first essays about it to serve as a general introduction, so people can understand some of the terms I use.

For those who are not practitioners of AI, Data Science, Big Data, and all the other names the new technology goes by, it’s useful to have enough understanding to make good decisions. Vendors are all too happy to leave you with the impression that what’s behind the curtain will be able to predict the future better than the magic and arcane rituals of soothsayers of old, but this time it will work because – science! If you ask to see what’s behind the curtain, they will oblige you with a dumpster full of polysyllabic minutiae. In what follows I try to find a middle path.

There are three foundational methods used in artificial intelligence and the exploitation of big data: neural networks, ontological reasoning, and statistical investigation. All three techniques are applied to the same big data sets and compete in the same market spaces, each with advantages for different types of big data exploitation. Combining them can be particularly fruitful. The three approaches are characterized below.

Neural network reasoning, or machine learning, is like a newborn learning about the world: it requires feedback to indicate “correct” and “incorrect.” This technique is powerful and quick but can be subject to the human flaw of spurious correlations, since the “reasoning” is generally opaque and the problem set is inevitably filled with noise. A neural net trained to find dogs in pictures might key in on the presence of fur and floppy ears and miss dogs with pointy ears (false negatives) or mischaracterize rabbits as dogs (false positives). Therefore the training sets must be very carefully chosen, and the results validated with human judgement. The ability to quickly digest and characterize absurdly large data sets makes neural networks too valuable not to use, and false positives can usually be spotted quickly in the training phase, allowing the software to be retrained. Failure to find correlating data even though it exists presents a more severe problem, but given that human observers could not possibly process all the mountains of data in an actionable amount of time, it is a risk that seems reasonable to take. The user can also set the threshold for scoring data as a positive, depending on their tolerance for false negatives and, correspondingly, their capacity to sort through false positives.
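
For readers who want to see what that threshold knob looks like in practice, here is a minimal sketch in Python (using the scikit-learn library; the data is synthetic and the thresholds are arbitrary, purely for illustration). Lowering the threshold catches more positives at the cost of more false positives; raising it does the reverse.

    # Minimal sketch: a supervised classifier where the decision threshold
    # controls the trade-off between false positives and false negatives.
    # The data here is synthetic and purely illustrative.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X, y)

    scores = model.predict_proba(X)[:, 1]   # probability of the positive ("dog") class

    for threshold in (0.3, 0.5, 0.7):
        predicted = scores >= threshold
        false_positives = int((predicted & (y == 0)).sum())
        false_negatives = int((~predicted & (y == 1)).sum())
        print(threshold, false_positives, false_negatives)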

Neural networks can also be designed to use “unsupervised learning.” In this case the human or algorithm does not declare a choice right or wrong; rather, the software is given a series of “vectors” to use to sort the data it ingests into clusters, which are then evaluated (and usually named) by the users.
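
Here is a minimal sketch of the unsupervised case, again in Python with scikit-learn and invented data: no labels are provided, the algorithm only groups similar feature vectors, and a human still has to look at the clusters and decide what they mean.

    # Minimal sketch of unsupervised learning: no labels are given, the
    # algorithm only groups similar feature vectors into clusters that a
    # human then inspects and names. Data is synthetic for illustration.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Three unlabeled blobs of points standing in for raw feature vectors.
    data = np.vstack([
        rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
        rng.normal(loc=5.0, scale=1.0, size=(100, 2)),
        rng.normal(loc=10.0, scale=1.0, size=(100, 2)),
    ])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data)
    for cluster_id in range(3):
        print(cluster_id, int((labels == cluster_id).sum()), "items")  # humans name these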

The biggest challenge with neural networks, aside from finding an experienced practitioner, is assembling the right data. The training set must be large enough, diverse enough, similar enough to the targeted data, and properly labeled. A separate, properly labeled test set is also needed, so the trained neural net can be run against it to see how accurate its results prove to be. Another challenge is that labeling a large data set is labor intensive.
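
Here is a rough sketch of that train-and-then-test discipline, assuming scikit-learn and synthetic data; the small multi-layer perceptron stands in for whatever network would actually be used.

    # Minimal sketch of holding out a separate, labeled test set to check
    # how well the trained network generalizes. Synthetic data, and a plain
    # MLP classifier stands in for the real network.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
    net.fit(X_train, y_train)                            # learn only from the training set
    print(accuracy_score(y_test, net.predict(X_test)))   # judge on data it never saw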

Ontological reasoning is a way of categorizing information in a more portable and universal way than a typical relational database (RDBMS) or Extensible Markup Language (XML) schema. Ontological reasoning is like algebra, using logic and proofs based on given axioms, or like deductive reasoning à la Sherlock Holmes. If Cyberpuss is my cat, she is genetically related to all other cats, is neither a gorilla nor a bird, and, given her relationship to me, is probably a house cat rather than a feral cat. Ontologies specify the hierarchy of the items in question and their relations to each other, so that all that needs to be exchanged is that Cyberpuss is my cat. For example, a neural network can quickly identify objects in a video, characterize them as ships, and even read hull numbers. More elaborate number-crunching can note that a found ship has the hull number 575. Ontological reasoning can then connect disparate pieces of information to show that a 1-3 digit hull number indicates a navy ship rather than a freighter or tanker, that the number itself might identify one of several candidates from different navies, and that the dimensions of the ship indicate it is definitely the Chinese guided missile frigate Yueyang, enabling the system to show all the data collected from other sources about that ship.
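
To make the idea concrete, here is a toy sketch in plain Python. The “ontology” is just a handful of invented (subject, relation, object) facts about the ship example, and a simple rule walks the is-a hierarchy to draw conclusions; real ontology languages are far richer, but the flavor is the same.

    # Toy illustration of ontological reasoning: facts are stored as
    # (subject, relation, object) triples, and a simple rule follows the
    # is_a hierarchy to draw conclusions. The ship facts are invented.
    facts = {
        ("Yueyang", "is_a", "guided_missile_frigate"),
        ("guided_missile_frigate", "is_a", "navy_ship"),
        ("navy_ship", "is_a", "ship"),
        ("Yueyang", "hull_number", "575"),
        ("Yueyang", "operated_by", "Chinese_navy"),
    }

    def is_a(thing, category):
        """Follow is_a links transitively, e.g. Yueyang -> frigate -> navy ship."""
        if (thing, "is_a", category) in facts:
            return True
        parents = [o for (s, r, o) in facts if s == thing and r == "is_a"]
        return any(is_a(parent, category) for parent in parents)

    # A sensor reports hull number 575; the ontology connects the dots.
    candidates = [s for (s, r, o) in facts if r == "hull_number" and o == "575"]
    for ship in candidates:
        print(ship, "is a navy ship:", is_a(ship, "navy_ship"))
        print(ship, "operated by:", [o for (s, r, o) in facts if s == ship and r == "operated_by"])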

Building useable ontologies is one of the “dark arts” of AI. It requires comprehensive knowledge of the domain being modeled as well as the general skills of a philosopher, to distinguish between the essence of something and the traits that attach to it without being part of that essence. Training in general knowledge management or library science seems to be good preparation for the latter. Ontologies include taxonomies but differ from them in important ways. Taxonomies alone can lead the user astray, as biologists found, much to their despair, with the platypus, and as anyone who has tried to keep a complex file or email hierarchy organized can attest. Some things seem to fit in more than one category, and ontologies can help diminish that ambiguity by adding relationships that are not simply straight inheritance.

Statistical investigation is finding trends (and, sometimes even more usefully, exceptions) that are invisible to normal human perception: which drug has the best track record in treating liver cancer, what is different about the three patients who went into remission on the drug that is normally least successful, or whether there is unusual activity around a ship’s berth in the harbor or among a set of airplane passengers. Of course, the first rule of statistics is that correlation does not mean causation, so statistics are best used as an investigative tool to find what data to look at.
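
As a tiny illustration, here is what flagging a lead with a correlation looks like in Python (the numbers are made up): a strong correlation says “look here,” not “this causes that.”

    # Minimal sketch: computing a correlation to flag something worth a
    # closer look. The figures are invented; the point is that a strong
    # correlation is a lead for investigation, not proof of causation.
    import numpy as np

    dose = np.array([10, 20, 30, 40, 50, 60, 70, 80])                    # hypothetical drug dose
    remission_rate = np.array([0.11, 0.15, 0.22, 0.24, 0.31, 0.33, 0.38, 0.45])

    r = np.corrcoef(dose, remission_rate)[0, 1]
    print(f"correlation: {r:.2f}")   # high r means "investigate", not "cause"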

One frequent use of statistical reasoning is market segmentation (sometimes called clustering), which identifies groups to market to, as in the dreaded “demographics” of the television business. A set of easily identifiable characteristics, or sometimes a single identifier such as age, is used to segment the group, and then statistical reasoning is used to find the commonalities among the group that shares the same identifier. Segmentation leads advertisers to pitch cars to one age group and scooters to another after identifying what kind of entertainment attracts each age segment. Sophisticated market segmentation relies on a set of identifiers rather than a single one, but risks cutting the segments into groups too small to be useful.
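
Here is a minimal sketch of single-identifier segmentation, in Python with the pandas library and an invented toy customer table: cut the population into age bands, then summarize each band to see what it tends to buy.

    # Minimal sketch of segmentation on a single identifier (age band) and
    # then looking for commonalities within each segment. The toy data and
    # column names are invented for illustration.
    import pandas as pd

    customers = pd.DataFrame({
        "age":          [19, 23, 34, 41, 52, 67, 71, 29],
        "buys_scooter": [1, 1, 0, 0, 0, 0, 0, 1],
        "buys_car":     [0, 0, 1, 1, 1, 1, 0, 0],
    })

    # Cut the population into age bands, then summarize each segment.
    customers["segment"] = pd.cut(customers["age"], bins=[0, 30, 55, 120],
                                  labels=["young", "middle", "older"])
    print(customers.groupby("segment", observed=True)[["buys_scooter", "buys_car"]].mean())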

As the name implies, executing statistical investigation requires users skilled in the advantages and limits of the various statistical techniques. The calculation is easy these days, but choosing which technique to use and interpreting the results require a very talented and experienced statistician.

All three techniques are valuable individually and especially when used in conjunction. Statistical investigation can be used to find the “vectors” that are fed into unsupervised or, in some cases, even supervised learning systems. Neural networks of either supervised or unsupervised types can find and correlate items from the giant cloud of data, and then ontological reasoning can classify and connect the found items to other data about them and push the data out as actionable intelligence.
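
Stitched together, that pipeline might look roughly like the following sketch (Python, with invented data and a two-fact stand-in for an ontology): clustering finds groups in the raw data, an analyst decides which cluster is interesting, and an ontology-style lookup attaches what else is known about each detection.

    # Rough sketch of the combined pipeline described above. Everything
    # here (data, the "ships" cluster, the knowledge triples) is invented.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    observations = rng.normal(size=(300, 4))        # raw, unlabeled sensor features
    clusters = KMeans(n_clusters=5, n_init=10, random_state=1).fit_predict(observations)

    # Pretend an analyst inspected the clusters and decided cluster 2 is "ships".
    ship_detections = np.where(clusters == 2)[0]

    # A two-fact stand-in for the ontology from the earlier example.
    knowledge = {("575", "ship_name", "Yueyang"),
                 ("Yueyang", "class", "guided_missile_frigate")}

    def lookup(subject, relation):
        return [obj for (subj, rel, obj) in knowledge if subj == subject and rel == relation]

    # Imagine a hull-number reader returned "575" for each of these detections.
    for idx in ship_detections[:3]:
        for name in lookup("575", "ship_name"):
            print(f"detection {idx}: hull 575 -> {name} ({lookup(name, 'class')})")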
