Data Mining

Quality in Manufacturing Data

Best Practices Approach To The Manufacturing Industry

Data Mining: Quality in Manufacturing Data was written by Ken Collier KPMG Consulting and Gerhard Held SAS Consultants Curt Marjaniemi Don Sautter KPMG Consulting and Mohan Namboodiri SAS Technical Reviewers Knowledge Management Solutions Group KPMG Consulting and Development Group Business Solutions Division, Knowledge SAS

October 2000

Table of Contents
List of Exhibits
1. Manufacturing in a Rapidly Changing Market
2. The Role of Quality
3. Enterprise Quality and Data Mining
4. Case Study 1: Printing Process Out of Control
5. Case Study 2: Failures in Hard Disk Drives
6. Summary
7. References
8. Companion Document
9. Recommended Reading
Credits

List of Exhibits
Figure 1. Quality Data Warehouse
Figure 2. Three-Level Quality Strategy
Figure 3. P-chart for the Proportion of Banding
Figure 4. Pareto Diagram of Press Type
Figure 5. Data Mining Flow for Band Data
Figure 6. Data Replacement Node
Figure 7. Variable Selection Node
Figure 8. Lift Chart for Banding Models
Figure 9. Tree for Banding Data
Figure 10. Data Mining Flow for Hard Drive Failure Analysis
Figure 11. Lift Curves for the Best Neural Net, Decision Tree, and Regression Model

Table 1. Potential Causes for Bands in the Printing Process

1. Manufacturing in a Rapidly Changing Market
The manufacturing industry has become more and more complex and has grown to include a variety of sub-sectors. Some of the sub-sectors in the manufacturing industry are the following:
• exploitation of mineral resources (coal, oil, gas)
• high technology equipment such as computers
• consumer-oriented mass production such as food processing
• highly specialized capital goods production such as turbines
• process-oriented production such as chemicals
• discrete consumables such as cars.

Although these sub-sectors are very diverse internally, they all have in common production, research, and development facilities, which often can be large. Most if not all of them also face very similar production issues such as new product development, quality management, capacity planning and tracking, preventive maintenance of production facilities, health, safety, and environmental protection, inventory optimization, and supply chain management.

The marketplace for manufacturing companies has changed drastically over the past 10 to 15 years. Production has become more complex. Most companies face increased competition both at home and abroad. Mergers and acquisitions create a series of opportunities and threats through economies of scale, integration, and globalization of operations. For example, in the engineering industry, the value of all cross-border business leaped from US$43 billion in 1997 to $78 billion in 1998, while many companies merged: Daimler/Chrysler, Ford/Volvo (both automotive), Honeywell/Allied Signal (automation systems/power generation), Veba/Viag, and Alsthom/ABB (power generation) to name just a few.1 In addition, manufacturing efficiency, quality control, and faster time-to-market influence competitive advantage.

Manufacturing industries have reacted with increased investment in information technology to streamline production processes and to assemble data about their customers. These investments include spending on company staff as well as software. In some companies, such as ABB (mechanical and electronic engineering) and Cummins Engine (diesel engines), up to one-third of their development engineers are software engineers. Enterprise resource planning (ERP) systems from companies such as SAP and Baan have been installed as a standard to run the everyday business; however, such systems provide little help in adjusting to changes in customer demand.

Often going hand in hand with the drive for economies of scale is the move to restructure companies into smaller and more efficient sub-units with a large service component capable of reacting more quickly to ever-increasing customer expectations. Information delivery systems have been or are being introduced to track customer loyalty and market trends as early as possible.2

Meanwhile the e-commerce revolution has already reached manufacturing. For example, volume car manufacturers such as General Motors and Ford are planning to coordinate their relationships with suppliers through online systems.3

On the production side, processes have been automated to the finest detail, and production is surveyed by measurement systems that collect a huge amount of data. Usually these data are only collected to signal if something has gone out of control so that operators are informed immediately to stop production and identify the potential root cause of the failure.



This best practices paper discusses quality-related aspects of the enterprise and explains some of the ways in which information technology can help solve quality problems in manufacturing data. These solutions are set in the context of developing quality efforts over time. The quality issue is discussed as one requiring management attention and an enterprisewide solution approach. This paper focuses on the contribution that modern analytical techniques such as data mining can make to this approach and is substantiated with two case studies, one of a comparatively simple printing process and another from a more complex hard disk drive production process.

2. The Role of Quality
The interest in quality as a business topic was inspired by the success of Japanese production techniques in the 1960s and 1970s and in later years in other East Asian countries. Notable contributors such as W. Edwards Deming, Kaoru Ishikawa, and Joe Juran helped along the Quality Movement.4 Inspection of crucial quality characteristics of manufactured goods became a widespread practice. Given mass production and the lack of automated measurement systems, inspections were initially done through acceptance sampling, that is, the inspection of a random sample from which conclusions were drawn about the underlying production lot or batch.

With the introduction of automatic measuring devices, this early phase of quality measurement led to continuous quality control online (during the production process). Widespread use of statistical process control (SPC) systems became standard, and control charts could be found everywhere on factory floors. When errors in production exceeded control limits, the root causes of failures were identified. The next step was to identify factors for problems in production through experiments designed offline. Often, production had to be stopped to run these experiments, so great care was taken to minimize the number of experimental settings (or runs) to restrict the cost of out-time.

The heavy emphasis on statistical quality control is often referred to as the first generation quality initiative. The first generation quality initiative is still in current practice but only as a baseline in the manufacturing industry. As Lori Silverman has noted, "The basic tools of quality are no longer sufficient to achieve the performance levels that today's organizations are seeking to maintain market leadership and competitive advantage."5 Measuring quality continuously is a requirement, but the drawback is that SPC/experimental design only concentrates on individual processes.
In modern production settings, manufacturing consists of numerous interrelated steps. For example, semiconductor manufacturing involves treating wafers of silicon in more than 100 steps. Typically, 100,000 of these are produced per day, which means a gigantic amount of process data is produced by manufacturing execution systems. Other sources of data cover other aspects of quality. SPC systems calculate quality-related metadata6 such as statistics of subgroup samples or capability indices from process data. Laboratory information management systems (LIMS) include research and testing data, while ERP systems might add data about material resource planning or non-production-related data.

Modern quality implementation is therefore no longer restricted to controlling individual processes but has moved into a second generation, which considers the quality management of the whole enterprise. In this second generation, quality has become a top management issue. From an IT perspective, the quality initiative is now required to build up quality data warehouses,7 which cover the whole production process including other quality-related data in an analysis-ready form. A quality data warehouse links supplier data, process data, data from other manufacturing plants, and human resource data to address questions such as comparing quality across products, manufacturing lines, or plants, linking warranty problems to internal process data, or predicting product quality before the product reaches the customer (Figure 1).

6 Metadata are descriptive data or other information about data entities, such as field names and types, and are typically stored in data dictionaries and data warehouses.

Figure 1. Quality Data Warehouse

Quality solutions that take into consideration the entire enterprise need to enable decision makers at various levels of the organization to make effective decisions that impact the quality of a process or product. For example, the introduction of an enterprise quality system at Gerber Products, the baby food company, has enabled floor operators to know exactly when to adjust the process and, just as importantly, when to leave the process alone.8 At the same time, plant managers can track quality performance through a process flow reporting feature and identify immediately a root cause failure in the production upstream, while corporate management will receive standardized reports about key quality metrics on a regular basis.9


3. Enterprise Quality and Data Mining
Data warehouses populated with historical quality data serve to address questions of a more predictive nature, such as when a particular machine component is likely to break, and what combination of causes tends to lead to a malfunction in the production process. Questions of this nature require analytical modelling and/or data mining, which is a third generation of quality initiatives. Data mining is defined as the process of selecting, exploring, and modelling potentially large amounts of data to uncover previously unknown patterns for business advantage.10 In contrast, more traditional decision support techniques like online analytical processing (OLAP) usually provide descriptive answers to complex queries and assume some explicit knowledge about the factors causing the quality problem.


10 SAS, "From Data Mining to Business Advantage: Data Mining, The SEMMA Methodology and SAS Software," 1998

Analytical modelling can range from descriptive modelling using statistical analysis or OLAP to predictive modelling using advanced regression techniques and data mining methods. While data mining can generate high returns, it requires a substantial investment. Effective data mining requires well-defined objectives, high quality data in a form ready to be mined, and generally some amount of data pre-processing and manipulation. This technology is not a fully automated process. Data mining assumes a combination of knowledge about the business/production processes and the advanced analytical skills required to ask the right questions and interpret the validity of the answers. Typically data mining is done as a team effort to assemble the necessary skills. A feedback loop to deploy data mining results into the production system ensures that a return on investment can be realized together with some clues on how to repeat this exercise for the next problem to be addressed.

Thus, a three-level quality strategy can be employed in which each level serves as a precursor to the next, and each new level generates increased knowledge about the production process and additional return on investment (Figure 2).

Figure 2. Three-Level Quality Strategy

Fortunately, manufacturing data lends itself well to advanced analytics and data mining. There is an abundance of data that are usually of high quality because their acquisition is automated. What is required is to establish a habit of storing historic data for mining analysis. In the first generation of quality management, the quality control approach, data are typically only used for online SPC and then discarded or else archived but never analyzed. In the second generation of quality management, the enterprise quality solution approach, data are also generated about research, suppliers, customers, and complaints. Such data are vital if the production data are to be enriched and exploited intelligently.


Decision support or data mining has been used successfully to streamline processes in manufacturing. The following are a few examples:
• Honda Motor Company in the United States is using Weibull analyses to predict at what age or mileage various components of cars are likely to fail. The resulting information allows engineers to plan maintenance schedules and design cars that will last longer. This careful analysis and the feedback of its findings into production have enabled Honda to achieve some of the highest resale values for cars in the United States.11
• A major South African power generating station experienced problems with tube failures in a re-heater. Tube failures are very costly; the material costs to replace the damaged tubes, the labor cost to perform the scope of work, the cost of lost production, and the external costs required to replace the lost production all add up. The company sought a method that would enable it to predict the potential tube failures to plan maintenance. Data mining and multidimensional visualization techniques showed that the problem was due to a high local wear rate of a certain tube. Further investigation revealed that the inlet header disturbed the airflow, which caused the high local wear rate. A different setting of the inlet header reduced the wear significantly. Increasing the tube life by just one year delivers an estimated return on investment of 480 percent, an estimate that considers only the tube itself. Taking into account the damage and costs incurred for the re-heater and the wider effects for the entire plant, the return on investment is considerably greater.
• Data mining is used in semiconductor manufacturing to predict the likelihood that a microprocessor die will fail after packaging. It is often more cost-effective to discard defective die packages than to rework them. By pre-classifying each die with a probability of failure, the manufacturer can discard those with high probabilities very early in the assembly cycle. This analytics-based selection process eliminates unnecessary manufacturing costs and increases the percentage of good parts exiting the assembly/test process.
• Computer hard disks are produced in mass quantities (100,000 parts per day) with a current failure rate of 1 percent. With a cost of $25 for each failure, even an improvement of 0.25 percent in the failure rate results in cost savings of $2,281,250 per year.12 Case Study 2 covers the details of this example.

There are many more examples where data mining has proven to be extremely useful for process control applications, maintenance interval prediction, and production and research process optimization. These examples include reducing inventory by 50 percent without any loss in service levels, optimizing filling operations in the food industry, optimizing yield in car engine testing, predicting peak loads in telecommunication networks, forecasting utility demand (water, gas, electricity), reducing energy consumption at power stations, and identifying fault patterns in gas drilling.

Data mining is also widely employed in sales and marketing operations, for example, to calculate the profitability of customers or to find out which customers are most likely to leave for the competition. Forrester Research reported in a recent study of Fortune 1000 companies comparing current (1999) and planned (2001) usage of data mining that while marketing, customer service, and sales will remain as the major business application areas for data mining, process improvement applications will experience the highest relative increase from 2 percent in 1999 to 22 percent of all data mining application areas in 2001 (multiple responses accepted).13
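The savings figure quoted for the hard disk example can be checked with a short calculation. The sketch below is illustrative only: the function name and the assumption of 365 production days per year are not stated in the text, although that assumption reproduces the quoted amount.

```python
def annual_savings(parts_per_day, cost_per_failure, rate_improvement, days_per_year=365):
    """Yearly savings from lowering the failure rate by `rate_improvement` (a fraction).

    days_per_year=365 is an assumption; the text does not state the production calendar.
    """
    failures_avoided = parts_per_day * days_per_year * rate_improvement
    return failures_avoided * cost_per_failure


# Figures from the text: 100,000 parts per day, $25 per failure,
# and a 0.25 percentage point improvement in the failure rate.
savings = annual_savings(100_000, 25.0, 0.0025)
print(f"${savings:,.0f}")  # $2,281,250
```

Under these assumptions, the improvement avoids 91,250 failures per year, each costing $25.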


4. Case Study 1: Printing Process Out of Control

This case study was one of the earliest published examples of the use of data mining techniques to address a process-related problem. Bob Evans and Doug Fisher14 discussed the problem of "banding" in rotogravure printing occurring at R.R. Donnelly, America's largest printer of catalogues, retail brochures, consumer and trade magazines, directories, and books. Rotogravure printing involves rotating a chrome-plated, engraved, copper cylinder in a bath of ink and pressing a continuous supply of paper against the inked image with a rubber roller. Sometimes a series of grooves, called a band, appears in the cylinder during printing and ruins the finished product. Once a band is discovered, the printing press needs to be shut down, and the band needs to be removed by polishing the copper cylinder and re-plating the chrome finish. This process causes considerable downtime, delaying time-critical printing processes, which wastes time, money, and resources.

Banding became a considerable cost factor at R.R. Donnelly, and a task force was appointed to address the problem. In brainstorming sessions, the task force discussed possible reasons for banding in order to prevent the problem in the first place, and came up with a long list of factors that could potentially have contributed to it. In all, 37 factors were selected, some of which are listed in Table 1.

Table 1. Potential Causes for Bands in the Printing Process


The task force studied conditions under which bands occurred and the settings of the potential causes (inputs) at that time. For control purposes, a number of settings were also recorded when the printing process was in control (no bands). For organizational reasons, the task force was not in a position to assemble a lot of data, and data were recorded when it was convenient, not in a controlled environment setting. In total, the data consisted of 541 records with 255 in 1990, 223 in 1991, 37 in 1992, and 16 in 1993. Bands occurred in 227 cases (about 42 percent). There were no bands in 312 cases (57.7 percent), and there were missing values for band in two cases.15


The task force tried to analyze the data through a series of graphs but failed. In this re-analysis of the data, a P-chart (proportion of defectives) was applied to the target variable BAND/NO BAND (Figure 3). The chart was restricted to data points from April 1990 to November 1991, and data were summarized by month to have an acceptable number of points per month to calculate the proportion of defectives.

Figure 3. P-chart for the Proportion of Banding
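As a rough illustration of how the limits in a P-chart are obtained, the following sketch computes the conventional three-sigma limits for subgroups of varying size. The monthly counts are hypothetical placeholders, not the banding data, and the function name is an assumption made for this example.

```python
import math


def p_chart_limits(defectives, sample_sizes, sigmas=3.0):
    """Return (p_bar, [(lcl, ucl), ...]) for a P-chart with varying subgroup sizes."""
    p_bar = sum(defectives) / sum(sample_sizes)
    limits = []
    for n in sample_sizes:
        half_width = sigmas * math.sqrt(p_bar * (1.0 - p_bar) / n)
        # A proportion cannot leave [0, 1], so the limits are clipped to that range.
        limits.append((max(0.0, p_bar - half_width), min(1.0, p_bar + half_width)))
    return p_bar, limits


# Hypothetical monthly summaries: number of records and number of records with bands.
records = [30, 12, 4]
bands = [12, 5, 2]

p_bar, limits = p_chart_limits(bands, records)
for n, d, (lcl, ucl) in zip(records, bands, limits):
    status = "in control" if lcl <= d / n <= ucl else "out of control"
    print(f"n={n:2d} p={d / n:.3f} limits=({lcl:.3f}, {ucl:.3f}) {status}")
```

With only four records in a month, the clipped limits span the whole range from 0 to 1, so a chart built on such small subgroups cannot flag any month as out of control.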

At first sight, the printing process seems to be in control. The chart shows the proportion of defectives to be in the admissible range (Figure 3, shown in blue). However, because the data were not generated in a controlled experiment, the mean proportion of bands is artificially high (mean proportion for banding 0.395), and the admissible range covers the whole data range from 0 to 1 or no band to 100 percent banding. Figure 3 also shows a slightly upward trend of bands occurring, so the printing problem gets worse over the time frame considered.

One way to identify potential causes is with Pareto diagrams. Figure 4 shows a Pareto diagram of press types (four different types were used) against banding problems in percent. The first two printing machines with highest occurrence of banding already accounted for 80 percent of all banding problems, a clear indication that press type is an important influential factor for bands.

Figure 4. Pareto Diagram of Press Type
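The numbers behind a Pareto diagram are simply category counts sorted in descending order with a running cumulative share. A minimal sketch follows; the press names and counts are hypothetical placeholders chosen so that the first two categories reach 80 percent, and they are not the figures from the study.

```python
def pareto_table(counts):
    """Sort categories by count (descending) and add share and cumulative share in percent."""
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for name, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        share = 100.0 * count / total
        cumulative += share
        rows.append((name, count, share, cumulative))
    return rows


# Hypothetical number of banding incidents per press type.
band_counts = {"press C": 15, "press A": 50, "press D": 5, "press B": 30}

rows = pareto_table(band_counts)
for name, count, share, cumulative in rows:
    print(f"{name}: {count:3d} bands, {share:5.1f}% of total, {cumulative:5.1f}% cumulative")
```

A table like this covers one factor at a time, which is why it says nothing about interactions between factors.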

The value of Pareto charts is limited when numerous potential factors need to be considered. Moreover, Pareto charts are not suited to explore potential interactions between factors. In fact, the task force at R.R. Donnelly discovered the impact of press type early on; however, although this factor was taken into account, the banding problem continued to exist at a lower rate. Clearly, control charts and Pareto diagrams (first generation of quality implementation) were not adequate to explain the banding problem fully. Evans and Fisher therefore decided to use specific data mining methods (a decision tree algorithm) available at that time.

For this paper, the data were re-analyzed using SAS's data mining solution, Enterprise Miner™.16 Enterprise Miner implements data mining analysis as a process. Data mining tasks are represented as icons, which can be dragged and dropped onto a workspace, arranged as nodes in sequence, and connected to form process flow diagrams. Data mining tasks are grouped according to an underlying data mining methodology called SEMMA, which stands for Sample, Explore, Modify, Model, and Assess.17

Figure 5 shows a data mining flow using Enterprise Miner for the Donnelly banding data in the Diagram Workspace (the right side region of the graphical user interface). The Input node reads the data and automatically assembles a number of statistical metadata. These describe each variable, for example the role of the variable in the modelling process (input, target, identification, or a few other choices) and the variable's measurement level (nominal, binary, ordinal, or interval).


16 For more details about Enterprise Miner, see the "Companion Document" section.

17 SAS, "From Data Mining to Business Advantage: Data Mining, The SEMMA Methodology and SAS Software," 1998

Figure 5. Data Mining Flow for Band Data

The Data Replacement node is one of the icons for data modification. As the name indicates, it allows replacing invalid data through user-selected values or the imputation of missing values using a wide range of imputation methods. Figure 6 displays an opened Data Replacement node with the Class Variables tab selected. As it appears, there were a number of inconsistencies (mainly between upper and lower case letters) in the original data, which blurred the analysis. A user-specified replacement value eliminates this problem with the data.

There were also a number of missing values in the original data (up to 12 percent for a given variable). A typical strategy would be either to disregard cases with missing values for analysis or to replace missing values with a representative value, such as the mean or the mode (the most frequent value). Instead, a strategy to impute missing values through tree imputation was chosen. Tree imputation uses all available information in the specific record (that is, all user-selected variables except the one being imputed) as input to a tree algorithm that calculates the value of the imputed variable. This approach ensures that a maximum of information is used for imputation and that the imputation itself is done in a very flexible way.
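The two simpler clean-up steps mentioned above, harmonizing inconsistent letter case and replacing missing values with the mode, can be sketched in a few lines. This illustrates the "typical strategy" only, not the tree imputation that was actually chosen; the function names, the sample values, and the use of "?" as a missing-value marker are assumptions made for the example.

```python
from collections import Counter


def normalize(value):
    """Lower-case and trim a class value; treat None, '' and '?' as missing (assumed markers)."""
    if value is None:
        return None
    text = str(value).strip().lower()
    return None if text in ("", "?") else text


def impute_mode(values):
    """Harmonize class values, then fill missing entries with the most frequent value."""
    cleaned = [normalize(v) for v in values]
    observed = [v for v in cleaned if v is not None]
    if not observed:
        return cleaned  # nothing to learn a mode from
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in cleaned]


# Hypothetical class variable with inconsistent case and missing entries.
raw = ["Coated", "COATED", "coated ", "uncoated", None, "?"]
print(impute_mode(raw))
```

Mode replacement ignores everything else known about a record, which is the limitation that tree imputation is meant to address.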

Figure 6. Data Replacement Node

Partitioning of the data is the next task in the data mining flow. Partitioning provides mutually exclusive data sets for training (that is, calculating the model that explains bands) and for the subsequent assessment (comparison) of the models. As a result, assessment of the models is done on data independent of those used for model generation. From data partitioning, the flow divides into a branch pointing to a Tree node and another branch that points into a Variable Selection node first and then into a Regression and a Neural Network model (see Figure 5).

Variable selection is a very useful tool with a large number of model inputs. It assists analysts by dropping those variables that are unrelated to the target and retaining those that are useful for predicting the target (bands) based on a linear framework. The remaining significant variables are then passed to one of the modelling nodes such as the Regression or Neural Network nodes for more detailed evaluation. Figure 7 shows results from the Variable Selection node. A number of variables are rejected because of their low relationship with the target. In this process, the Variable Selection node always tries to reduce the number of levels of each class variable to groups to test if that strengthens relationships with the target. If that is the case, then the original variable is rejected, and the grouped variable is selected instead.

Assuming for the moment that modelling has been completed, assessing all of the models would reveal which model explains banding most effectively. For example, double clicking on the Assessment node enables the analyst to select each of the three models and display a lift chart (Figure 8). For each model, records of the validation data are scored with the result (the formula) from the model, ordered from highest to lowest score, and then separated into deciles. In its first decile, the tree model classified 95.5 percent of records correctly as bands. The baseline (Figure 8 in dark blue) shows the average of bands in the assessment data (about 36 percent). The larger the distance between a model and the baseline, the better the model explains the data. As is apparent from the lift chart, the tree model outperforms the regression and neural network models.
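The decile calculation behind a lift chart can be sketched as follows. The scores and outcomes are hypothetical, and the function is an illustrative assumption rather than the procedure implemented in the Assessment node.

```python
def decile_response_rates(scores, outcomes, bins=10):
    """Order records by descending score and return (share of positives per bin, baseline)."""
    ranked = [y for _, y in sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    rates = []
    for b in range(bins):
        chunk = ranked[b * n // bins:(b + 1) * n // bins]
        rates.append(sum(chunk) / len(chunk) if chunk else float("nan"))
    baseline = sum(ranked) / n  # overall share of positives, the flat baseline of the chart
    return rates, baseline


# Hypothetical validation records: model score and actual outcome (1 = band, 0 = no band).
scores = [i / 20 for i in range(20, 0, -1)]
outcomes = [1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

rates, baseline = decile_response_rates(scores, outcomes)
lift_first_decile = rates[0] / baseline
print(rates, baseline, lift_first_decile)
```

A model that ranks records well concentrates the positives in the first deciles, so its curve starts far above the baseline and falls toward it.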

Figure 7. Variable Selection Node


Figure 8. Lift Chart for Banding Models

Given its performance, it makes sense to take a closer look at the tree model results (Figure 9). A tree (also called a decision tree) is so called because the predictive model for banding can be represented in a tree-like structure. A decision tree is read from top down starting in the root node. Each internal node represents a split based on the values of one of the inputs with the goal of maximizing the relationship with the target. Consequently, nodes get purer (more or fewer bands depending on the split) the further down the tree. As is apparent, the percentage of solvent explains the most variation for bands. Whereas bands for the whole training data occur in 46.1 percent of cases, bands always result when the solvent percent is less than 31.35. Where the solvent percentage is equal to or greater than 31.35, press type is the next most important classifier. As noted earlier with the Pareto diagram (Figure 4), the press type Woodhoe 70 would generate the most bands, but the analysis has shown that this relationship would only be of interest for solvent values larger than 31.35 percent. The terminal levels of a tree are the "leaves." For the branch Woodhoe 70, the humidity 31.35 and press type is Woodhoe 70 and humidity

