A guest contribution by Christian Dugast.
Machine Learning is about to reshape digital services like never before: digital services are stepping out of the binary, rigid era. The era of adaptive, contextual, self-learning services has begun.
The business potential behind this shift is huge. Not only can new services be defined and provided that were too expensive to deliver with traditional tools, but the quality of already existing automated services will also increase dramatically. The equation seems marvellous: Machine Learning (ML) = lower cost + increased quality.
However, as always with a paradigm shift, its ubiquity is not immediate. Transforming technology value into business value always follows a learning curve. So the question is: what is the path to follow, and what are the main ingredients you need to have in your cupboard to make a Machine Learning project succeed?
Machine Learning is about … Learning
First it is important to understand the basis of a Machine Learning system. It is about learning. But what can be learned by a machine? Quite a lot on one hand, but not that much on the other. Quite a lot, as anything that can be recorded through a digital sensor can be managed by a Machine Learning system. Not that much, as the system needs a tremendous number of repetitions of similar examples from which it will learn. Learning is about abstracting and generalising from those examples, as the goal is to be able to make a decision on new, unseen data.
Let’s take the example of X-Ray or CT analysis. A human being needs some 25 to 30 years to be trained and finally be able to work as a radiologist.
[25 years to learn being a radiologist is a long time, sure. But within these 25 to 30 training years, the radiologist also trained a speech recognition system (he/she learned to listen and understand as a child) as well as a dialogue system (exchange information with other human beings) and probably also a translation system. If this radiologist likes to cook, he/she also trained a cooking system. And if this radiologist developed complex hobbies like Renaissance art or rock climbing he/she trained a multitude of other “systems”, too.]
Essentially, the radiologist has developed means to collect and structure knowledge: he/she has learned to abstract each picture seen so that he/she can produce a diagnosis in the form of “this part of the body has this anomaly”.
The Machine Learning (ML) system trying to be a radiologist (and such systems are beginning to be very good) will try to identify the relation between a picture and a diagnosis. The difficulty for ML is to generalise from the pictures it has seen so that it can correctly interpret a new, unseen picture.
How many pictures does the radiologist need to analyse and learn from? Can a radiologist be sure to make a good cancer diagnosis after having seen two examples of tumours? If that were the case, nearly everybody could be a radiologist. Thus it takes more than just a few examples to learn from. The more pictures (let's call it data) the radiologist has seen, the more experienced he/she is, and the more precise the diagnosis will be. To illustrate how important experience is (or how important the data volume is): a radiology department in Germany can claim to be specialised, for example in mammography, only if it performs at least 5,000 examinations per year.
Ingredient 1: Data
Without data, an ML system cannot learn anything. And it needs very large data sets so that it can generalise (learn) from this data. Is any data good? What type of data is needed? What is the amount of data needed? Let's take the example of the self-driving car. Such a system is made of two components that need to communicate with each other: an ML part that learns to find its way in a moving jungle of objects, and a rule-based part that contains the traffic rules.
The ML part receives data in the form of examples: examples of cars (red, green, white, van, compact, SUV …) and of movement (driving in the opposite direction, in the same direction, crossing lanes, slower, faster …). And in order to be able to generalise from this data, it needs hundreds of thousands of examples of cars seen from different angles, in different lighting conditions, at different speeds … of human beings on the road, off the road … of electric poles … of road signs …
The traffic rule part is made of rules defining in a deterministic way what to do. Here you need only one example per rule. The rule itself tells everything: at a red light, stop the car. There is nothing to generalise from this; the decision is binary.
Machine Learning learns from examples, not from rules. Machine Learning needs what is called exemplifying data. In your Machine Learning project, separate your data into two types: data that exemplifies/illustrates, and data that defines (based on rules). The type of data available and its amount will tell you how to build your solution. Perhaps most of your problem can be solved by a more traditional rule-based system and you do not need to learn from examples.
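The contrast between the two components can be sketched in a few lines of code. This is a toy illustration with made-up objects and features, not a real self-driving stack: the traffic rule needs exactly one "example" (the rule itself), while the ML-style part must generalise from a set of labelled examples.

```python
# A deterministic traffic rule needs one example -- the rule itself --
# while the ML part must generalise from many labelled examples.
# All objects and features below are hypothetical.

def traffic_rule(light: str) -> str:
    """Rule-based part: binary decision, nothing to learn."""
    return "stop" if light == "red" else "go"

def nearest_neighbour(examples, query):
    """ML-style part: classify a new observation by its closest
    known example (1-nearest-neighbour on simple numeric features)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(examples, key=lambda ex: dist(ex[0], query))[1]

# Labelled examples: (length_m, height_m) -> object type
examples = [
    ((4.5, 1.4), "car"),
    ((5.1, 1.9), "SUV"),
    ((1.8, 1.7), "pedestrian"),
]

print(traffic_rule("red"))                      # stop
print(nearest_neighbour(examples, (4.8, 1.5)))  # car
```

Note how the classifier gives a sensible answer for a vehicle it has never seen, while the rule covers only the case it explicitly encodes.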
But data alone is not enough. The ML system has to be told what it needs to learn. To make it simple, it cannot decide for itself what is good and what is bad. It needs a teacher, a supervisor.
[For the sake of completeness we should mention unsupervised methods that try to structure the data without making use of any annotation or external knowledge. These are typically clustering methods that, when the data is appropriate, can produce good results. However, these methods are not the ones we are talking about here, the ones that have given rise to the paradigm shift. Still, this brings up a good question: does the data always need to be annotated/curated by a human being? Google has built a huge "brain" (16,000 processors with one billion connections) which, after viewing 10 million videos on YouTube (not annotated), has defined a specific cluster for cats, an animal quite often found in YouTube videos. Other researchers are working on unsupervised learning models for machine translation, i.e. how to translate a sentence in one language into a sentence in another language. These unsupervised learning systems just have texts in different languages that are not necessarily correlated with each other. We could call such a system a universal deciphering machine. Is this possible? Probably!]
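What such clustering methods do can be illustrated with a toy k-means on one-dimensional data, a minimal sketch of the idea only (real systems like the YouTube experiment use vastly larger data and models): the algorithm groups similar values into clusters without ever seeing a label.

```python
# Toy k-means clustering: group similar values without any labels,
# illustrating how unsupervised methods structure data on their own.
# Data and starting centres are made up for the sketch.

def kmeans_1d(points, centres, iterations=10):
    """Assign each point to its nearest centre, then move each
    centre to the mean of its assigned points; repeat."""
    for _ in range(iterations):
        clusters = {c: [] for c in centres}
        for p in points:
            nearest = min(centres, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centres = [sum(pts) / len(pts) for pts in clusters.values() if pts]
    return sorted(centres)

# Two obvious groups, around 2 and around 10 -- no annotation needed.
points = [1.0, 2.0, 3.0, 9.0, 10.0, 11.0]
print(kmeans_1d(points, centres=[0.0, 5.0]))  # [2.0, 10.0]
```

The algorithm discovers the two groups purely from the shape of the data, which is exactly what "structuring data without annotation" means.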
In the case of the radiologist, we need images COMBINED with the diagnosis e.g. “this part of the body has this anomaly”. We call this kind of data „annotated data“ (annotated with a semantic label for example).
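In code, annotated data is simply raw input paired with a human-provided label. A minimal sketch, in which every file name, region and finding is invented for illustration:

```python
# Annotated data pairs each raw input with a human-provided label.
# Here, hypothetical scan identifiers carry a diagnosis of the form
# "this part of the body has this anomaly".

annotated = [
    {"image": "scan_001.png", "region": "left lung",  "finding": "nodule"},
    {"image": "scan_002.png", "region": "right lung", "finding": "normal"},
    {"image": "scan_003.png", "region": "left lung",  "finding": "nodule"},
]

# Supervised learning consumes (input, label) pairs:
pairs = [(ex["image"], (ex["region"], ex["finding"])) for ex in annotated]

# A typical curation check: no example may be missing its label.
assert all(ex["finding"] for ex in annotated)
print(len(pairs))  # 3
```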
Domain Specific Data or any Data?
Examples of successful Machine Learning are often related to image processing, speech recognition, or machine translation, services for which the ML research community had, and has, plenty of data to learn from. Over long years of dealing with data and learning from data, the ML community arrived at two rules that seem contradictory at first sight:
- There is no better data than more data.
- There is no better data than domain data.
To understand the subtle difference between these two rules, one needs to know what domain data is. In the field of image processing, an example of a domain would be „pictures of animals“. Another domain in image processing would be „pictures of cars“. In speech recognition, for example, one domain would be medical vocabulary and another would be legal vocabulary.
Thus, the question is, do we take data only from the specific domain we want our system to learn from? Or does data from other domains help?
The second rule states, in essence, that you should be careful while selecting data: not all data is good; data specific to your domain (e.g. medical vs. legal) is better. The first rule, on the other hand, tells you to take all the data you can find, including, say, text from a radiologist even if you are building a speech recognition system for lawyers.
Both rules are correct, but one needs to go a little deeper into what data is. For example, adding gigabytes of legal data to a speech recognition system that deals with medical content will not help much. But if you do not have access to much medical data, some legal data added to your medical corpus may provide your system with some necessary basic structuring knowledge.
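One common way to reconcile the two rules is to interpolate statistics from the scarce in-domain data with statistics from abundant out-of-domain data. A toy unigram sketch (the corpora and the 0.8/0.2 weights are made up; real systems interpolate far richer models):

```python
from collections import Counter

# A word-frequency model built from a tiny "medical" corpus is
# interpolated with a "general" corpus, so that out-of-domain data
# fills gaps the in-domain data does not cover.

medical = "the tumour in the left lung shows a nodule".split()
general = "the court shows the evidence in the case".split()

def unigram(words):
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

p_med, p_gen = unigram(medical), unigram(general)

def interpolated(word, lam=0.8):
    """P(word) = lam * P_medical(word) + (1 - lam) * P_general(word)."""
    return lam * p_med.get(word, 0.0) + (1 - lam) * p_gen.get(word, 0.0)

# "court" never appears in the medical corpus, but the general
# corpus still gives it some probability mass:
print(interpolated("tumour") > 0)  # True
print(interpolated("court") > 0)   # True
```

The in-domain corpus dominates (weight 0.8), yet words it has never seen are no longer impossible, which is exactly the "basic structuring knowledge" the out-of-domain data contributes.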
Artificial Data or Real World Data?
Understanding how artificial data (e.g. recorded in a lab) differs from real data (recorded in real life, e.g. on the street) is very important. For example, noise, luminosity, hesitations, speed, repetitions and user heterogeneity have a tremendous impact on how the system will learn. A good approach, for example, would be to bootstrap a project with artificial data in order to be able to build an initial prototype. Then use this prototype in the real world to record the real-world data needed to learn from and build the system. This brings us to the second ingredient for a successful ML project:
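The bootstrap loop just described can be sketched in miniature. The "model" here is only a decision threshold at the midpoint of two class means, and all the numbers are invented; the point is the three-step shape: fit on lab data, deploy, then refit on lab plus real-world data.

```python
# Sketch of the bootstrap loop: train an initial model on artificial
# (lab) data, use the prototype to collect real-world data, then
# retrain on the combination. Purely illustrative numbers.

def fit_threshold(negatives, positives):
    """A trivial 'model': the midpoint between the two class means."""
    mean_neg = sum(negatives) / len(negatives)
    mean_pos = sum(positives) / len(positives)
    return (mean_neg + mean_pos) / 2

# Step 1: bootstrap on clean lab recordings.
lab_neg, lab_pos = [0.1, 0.2, 0.3], [0.8, 0.9, 1.0]
prototype = fit_threshold(lab_neg, lab_pos)

# Step 2: deploy the prototype and collect noisier real-world data.
real_neg, real_pos = [0.2, 0.4, 0.5], [0.6, 0.7, 1.1]

# Step 3: retrain on artificial + real data combined.
final = fit_threshold(lab_neg + real_neg, lab_pos + real_pos)
print(prototype, final)
```

The retrained threshold shifts once the messier real-world examples are added, which is precisely why the prototype alone is not the finished system.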
Ingredient 2: Data Scientist
To secure the success of your project, bring on board someone who can work with tremendous amounts of data, who knows what data is: A data scientist.
Being able to work with data is more than managing the learning software and the huge data volume. It is about being able to digest, analyse and synthesise this data, to understand its essence, to have excellent skills in statistics, data models, data distributions and their intrinsic properties, to be able to draw conclusions from experiments and to decide on next steps.
So you have the data, you know what type of data is needed, and you know the kind of person needed to crunch your data. However, not all data is business relevant. And a purely technical view of the data, and hence of the solution you want to build, can be very dangerous to your business. Aspects like usability, conceptual similarities, smartness, need coverage, understandability and ubiquity cannot be sacrificed on the data altar. Differences that are technically important to take care of but bring confusion at the business level have to be discarded or abstracted. So the next question is: who takes care of defining what needs to be abstracted and learned by your ML engine? Which gives us the third ingredient:
Ingredient 3: Business Comprehension of Your Data
Your Machine Learning project also needs a business-minded person: your ML project wants to learn from the data that makes sense to your business, not from all your data. If your data scientist understands your business, he/she can do the mapping alone. Best practice is to have a data scientist and a business person (e.g. product marketing) working alongside one another to ensure the system learns what is business relevant.
Ingredient 4: Time
As one can see, working with data, understanding data and extracting the business value of data is at the centre of a Machine Learning project. Data preparation (data collection, feature extraction, data filtering, data verification, data validation and data analysis) is crucial. It is a long, iterative process (e.g. after analysis, one may need to rethink feature extraction or filtering) that will take time.
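The stages listed above can be sketched as a chain of small functions. Every stage name and filtering rule here is a hypothetical placeholder; in a real project each step hides weeks of work.

```python
# Sketch of a data-preparation pipeline matching the stages above:
# collection, feature extraction, filtering/verification.
# All records and rules are made up for illustration.

def collect():
    # In reality: sensors, logs, crawls; here, hard-coded raw records.
    return ["  Good sample ", "", "bad", "  Another good sample  "]

def extract_features(record):
    text = record.strip()
    return {"text": text, "length": len(text)}

def keep(features):
    # Filtering rule: discard empty or implausibly short records.
    return features["length"] >= 4

def prepare():
    features = [extract_features(r) for r in collect()]
    return [f for f in features if keep(f)]

dataset = prepare()
print(len(dataset))  # 2
```

The iterative nature of the process means that, after analysing `dataset`, one typically goes back and revises `extract_features` or `keep`, then re-runs the whole chain.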
When you are done with data preparation, you are not done with being patient. The learning process of your ML software may take a few weeks of processing on a cluster of graphics processors (GPUs) before the learning curve reaches an asymptote and training can stop. And it takes weeks for one run on one set of ML parameters. For each change to your parameter settings, you will need to wait weeks to see the result. Sure, a parameter setting that is catastrophic can be stopped very quickly. But only at the end of a learning cycle will you know whether a promising idea is indeed better. So time is an important factor here.
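The "stop at the asymptote" criterion can be sketched as follows. The loss values and the improvement threshold are simulated, purely to show the shape of the logic: training stops once the gain between checkpoints falls below some minimum.

```python
# Sketch of stopping a long training run once the learning curve
# flattens: stop when the improvement between checkpoints drops
# below a (hypothetical) threshold. Loss values are simulated.

def train_until_asymptote(losses, min_improvement=0.01):
    """Return the checkpoint index at which improvement stalls."""
    for i in range(1, len(losses)):
        if losses[i - 1] - losses[i] < min_improvement:
            return i
    return len(losses) - 1

# Simulated per-checkpoint validation loss; in practice each
# checkpoint may represent days of processing on a GPU cluster.
losses = [1.0, 0.6, 0.4, 0.31, 0.305, 0.304]
print(train_until_asymptote(losses))  # 4
```

A catastrophic setting (loss not falling at all) is caught at the very first checkpoint, whereas a promising one keeps you waiting until the curve finally flattens, which is exactly the asymmetry described above.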
Before going on to select a good IT infrastructure for your ML project, be sure to have these four crucial ingredients available. They are must-have ingredients. Missing one of them will not only put your ML project in danger; the project just won’t succeed. Having them will help you start in good shape, with a fair chance of obtaining great results.
Dr. Christian Dugast is a seasoned marketer, business developer and software developer in Machine Learning, with a strong focus on Automatic Speech Recognition, Information Extraction and Machine Translation. Since 2006, the year he founded Tech2biz, he has helped technology start-ups define and implement their market strategy. Based on his large network and in-depth knowledge of the Language Technology market, Christian also advises solution providers in various verticals on finding the best fit between their offering and the required Language Technology. In some cases, he will dive technically into a project and crunch data, for which he will write bespoke Java code or push Excel to its limits. Christian lives with and from data.
Christian received his Ph.D. degree in Computer Science from the University of Toulouse (France) in 1987. Having started his career as a research scientist in Automatic Speech Recognition (ASR), he was the first to propose a hybrid approach for ASR combining Neural Networks and Hidden Markov Models (IEEE paper, 1994). Christian had the honour of presenting the first commercially available continuous speech dictation product worldwide (Eurospeech ’93 in Berlin).
In 1995, he left research to enter the business world. Building up the European subsidiary of Nuance Communications, he gained extensive experience in introducing complex technologies and became recognized as one of the main drivers of the European ASR market. He then went up the value chain, always concentrating on what is needed to accelerate the penetration of new technologies and their related solutions.
Christian, a regular speaker at European conferences, believes that business success needs more than a good technology. Or, to put it in his words: „It takes more than technology to make things happen.“