Descriptive & Predictive Modeling
Data Science has seen tremendous growth and support across industries in recent years. Bringing together methods from many fields, Data Science comes with a plethora of confusing terms and fuzzy definitions. A glimpse at the history of data and a clear differentiation of terms will help kickstart your venture into this fascinating field.
Essentially, Data Science is simply the collection of methods that use data to understand the world. At its core, it is as old as humankind: we observe, deduce, and test our hypotheses. One famous example is Eratosthenes measuring the earth's circumference around 200 BC through observation and geometry - in other words, with a descriptive model. Descriptive models use historical data to learn the relationship between one or several inputs and one outcome. Let's look at a use case:
Joe is a supply chain manager at a screw manufacturer and wants to minimize production rejects. Joe is an expert in his field and has observed over the years that factors such as temperature, input metals, machine speed or the machine grease used influence production quality. These independent variables are just a few of potentially dozens. Even the most experienced person cannot assess their effects on each other, or on the desired outcome, without the right toolbox. When Joe thinks about the problem, he realises that it can be defined in several ways. He could define the output as a continuous variable, for instance a reject rate. He could also define the output as a categorical variable with two classes, "sellable" and "reject". The definition of the problem matters, because it determines the method: classification methods are suited for discrete outcomes with clear bounds, while regression methods are suited for continuous scales.
Descriptive models have their advantages: they are easy to understand and to interpret. However, they are limited to situations covered by their input data. Also, the relationships among the input variables, as well as their distributions, should be well understood at the time of method selection.
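As a rough sketch of the two framings, assuming a batch-level table with an illustrative reject_rate column and an invented 5% quality threshold, the same data could feed either type of method:

```python
import pandas as pd

# Hypothetical batch records; the column names and values are illustrative only.
df = pd.DataFrame({
    "temperature":   [18.2, 21.5, 24.1, 19.8],
    "machine_speed": [1200, 1350, 1400, 1250],
    "reject_rate":   [0.02, 0.04, 0.09, 0.03],   # continuous outcome
})

# Framing 1: regression - predict the reject rate directly.
y_regression = df["reject_rate"]

# Framing 2: classification - label each batch "sellable" or "reject",
# using a quality threshold Joe would have to define (5% is an assumed value).
y_classification = (df["reject_rate"] < 0.05).map({True: "sellable", False: "reject"})
```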
Since in Joe's case the criteria for sellable and reject are clearly defined, he decides to go with a logistic regression and categorises his dependent variable into the two named classes. According to his model, one very influential factor is temperature. As a result, a climate control system has been installed to keep temperatures within an optimal range for the production process.
Half of the machines were due for replacement last week. Joe recorded in his data which machine produced which results, but he has not yet gathered enough data on the new machines. His existing model therefore has little predictive power for the outcomes of the new production lines. What now?
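A minimal sketch of that step with scikit-learn, assuming the same illustrative features and threshold as above (none of these names come from Joe's actual data), could look like this:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative batch records with a made-up 5% quality threshold.
df = pd.DataFrame({
    "temperature":   [18.2, 21.5, 24.1, 19.8, 23.0, 20.4],
    "machine_speed": [1200, 1350, 1400, 1250, 1380, 1300],
    "reject_rate":   [0.02, 0.04, 0.09, 0.03, 0.07, 0.01],
})
X = df[["temperature", "machine_speed"]]
y = (df["reject_rate"] < 0.05).map({True: "sellable", False: "reject"})

# Standardise the inputs, then fit the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# The coefficients show how strongly each scaled input shifts the odds of a
# batch being sellable - this is where temperature stood out for Joe.
coefs = model.named_steps["logisticregression"].coef_[0]
print(dict(zip(X.columns, coefs)))
```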
Advancements in storage capacity and processing power have driven the development of new algorithms under the umbrella of machine learning. While descriptive models may have limited predictive power, machine learning addresses this shortcoming. Machine learning models are more flexible in the way they "learn" relationships. They learn in a more abstract way and can therefore apply what they have learned to combinations of input variables that were not part of the training data. But perhaps the most significant distinction between descriptive and predictive modeling is the entity in focus: while descriptive modeling aims to give an overall picture that informs future decision making, predictive modeling focuses on a single instance and tries to predict its outcome based on its properties.
Joe has been applying descriptive modeling for a while and has been able to take control of some of the influencing factors. But when it comes to predicting the specific outcome of a batch, the predictive power of his model depends heavily on whether the given circumstances were part of the training data. He decides to venture into machine learning. He could stick to logistic regression, but instead he decides to try a random forest, an algorithm that handles noisy data well and still offers some insight into which factors matter, for instance through feature importances. He trains his model on a large amount of data that he has gathered over the past year. For today's batch, he predicts the percentage of sellable screws given today's circumstances.
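A sketch of what this could look like with scikit-learn, using synthetic stand-in data (the feature names, the labelling rule, and today's readings are all invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-in for a year of recorded production data: inputs plus the observed label.
X_train = pd.DataFrame({
    "temperature":   rng.normal(21, 3, 500),
    "machine_speed": rng.normal(1300, 80, 500),
})
y_train = np.where(X_train["temperature"] + X_train["machine_speed"] / 100 < 35,
                   "sellable", "reject")

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Today's batch: predict the probability of "sellable" for each record and
# aggregate it into an expected percentage of sellable screws.
X_today = X_train.sample(20, random_state=1)   # placeholder for today's readings
sellable_col = list(forest.classes_).index("sellable")
expected_pct = 100 * forest.predict_proba(X_today)[:, sellable_col].mean()
print(f"Expected sellable share: {expected_pct:.1f}%")

# Feature importances give a rough, global view of which factors matter most.
print(dict(zip(X_train.columns, forest.feature_importances_)))
```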
Data collected to teach an algorithm the relationships between variables is called training data. You could say that data is to machines what experience is to humans. Sometimes the relationships discovered between variables are not easily interpretable, because they are so abstract.
At the end of the day, Joe compares his prediction to the actual outcome and makes a few observations. Today's circumstances were somewhat special, since two out of four production lines were updated with new machines a few weeks ago. While a descriptive model would likely fail in this scenario, a predictive model can handle it, provided enough training data has been produced and injected in the right proportions. As long as enough combinations of attribute values are included in the training, a machine learning model can still derive the relationship between variables, even if it has not seen this exact combination of values before. Concluding that 86% of the production will run on the new machinery, Joe decides to weight the training data so that it reflects the same proportions of old and new machinery. The new machinery has not been tested with today's input metals; however, Joe is confident that he will still get suitable results.
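One way to sketch such a reweighting, assuming a hypothetical machine_generation column in the training records and taking the 86% figure from Joe's conclusion above:

```python
import pandas as pd

# Hypothetical training records tagged with the machine generation that produced them.
train = pd.DataFrame({
    "machine_generation": ["old"] * 700 + ["new"] * 300,
    # ... the other input columns and the outcome label would sit here ...
})

# Target mix for today's production: 86% new machinery, 14% old.
target_share = {"new": 0.86, "old": 0.14}
current_share = train["machine_generation"].value_counts(normalize=True)

# Per-row weights that up-weight the under-represented new-machine records so
# the weighted training set mirrors today's proportions.
train["sample_weight"] = train["machine_generation"].map(
    lambda g: target_share[g] / current_share[g]
)

# Most scikit-learn estimators accept these weights via fit(..., sample_weight=...).
print(train.groupby("machine_generation")["sample_weight"].first())
```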
We’re on the verge of yet another great technical advancement. Deep learning is considered a key part of true artificial intelligence. Neural networks have been around since the 1980s; increases in processing power, cost-effective storage solutions, and cloud computing have brought about their breakthrough in the last decade. Machines are gaining the ability to learn, judge the accuracy of their learning, decide what information is missing in order to improve, search for that information, and retrain themselves to correct previous mistakes. Even machine creativity is being explored heavily - and all of this zero-touch, that is, without any human intervention.
After years of trial and error, Joe has become a machine learning enthusiast. Last week, a company installed his new, fully automated industrial control unit. This morning he’s trying to understand the decisions the unit is taking - specifically, why this week’s output is expected to be lower than last week’s while costs are projected to rise slightly. In the logs he can see that the machine speed is set to be reduced by midday and that the unit has pulled the weather forecast. It’s going to be very hot this week: the use of special-purpose oil causes a slight rise in cost, while the reduction in machine speed limits the output.
The more complex the logic, the harder it is to interpret the results. Model interpretability and the ability to explain individual decisions are fields that don’t yet receive enough attention, but will become very important in the future.
Artificial intelligence touches many fields of engineering, maths, and data science. But the underlying principles remain: the input data needs to be clean, well understood, and free of unintended biases. Good data scientists know how to check for biases and irregularities, relying on an expert’s domain knowledge. Excellent data scientists ask the right questions to assess client needs and avoid potentially very expensive failures. Considering the entire process - data collection, cleaning, and storage in combination with pipeline design, reporting, and access control - from day one enables a cost-effective entry and keeps you flexible a couple of years down the road.
Our recommendations:
Start simple!
Add complexity when the data and problems are fully understood.
Don't fall into the lock-in trap of large vendors. Invest instead in people, open-source technology, and thus flexibility.