I once told a manufacturing statistician that a typical chip manufacturing fab had several hundred statistical process control (SPC) charts for a single product. They told me I was crazy and that was overkill. You can probably imagine what happened next when they learned of dozens of different products running (each with its own set of charts), thousands of tool qualification charts, and tens of thousands of electrical test measurements. It was something along the lines of total disbelief. Next came the good part: the number of charts coming from manufacturing equipment sensor traces (aka #FaultDetection and Classification or #FDC). Keep in mind there are hundreds of sensors per piece of equipment, and hundreds of different equipment types in a factory, monitoring upwards of a thousand individual operations. To get accurate control of a single sensor, you track upwards of 50 different summary statistics on different segments of that sensor's data during an equipment process run. For those keeping score, that puts the number of control charts for equipment sensors into the hundreds of thousands. A typical giga-fab in Asia can have over 10 million FDC control charts. Missing a single one can result in multi-million dollar losses. The response from my stats friend was a blank stare.
When W. Edwards Deming popularized the principles for manufacturing SPC, he revolutionized manufacturing quality. Waste, scrap, and rework rates all went down. Final yields went up. People started to truly believe that manufacturing quality had a positive return on investment. SPC even became a sort of religion in manufacturing, with every sort of devotee you would find anywhere else: those that follow rigid and strict dogma, those that pay lip service but are not really invested, those that do what they’re told with a limited understanding of the actual principles, and those that evangelize the beauty of seeing the world through the lens of probability. The problem here is that SPC is a tool, not a belief system. Like every tool, there are right and wrong ways to use it.
Something went wrong when we scaled these principles to a manufacturing environment as complex as semiconductors. Please keep in mind that tolerances on some features in the chips in our phones today can be counted in hundreds or even tens of atoms (or fewer…). There is not enough money to pay enough people to monitor 10 million SPC charts every minute of every day of every year in a single factory in the way that Deming taught us. We’ve learned to ride our bicycle and now think we can take it out on ‘95’ or ‘the 405’ (pick your U.S. coast) during rush hour traffic and believe we’re going to be just fine if we follow the rules our mom and dad taught us. What is the result in a fab? Millions of dollars in wasted effort, zero-value interruptions to equipment uptime, professional and unprofessional conflict, blown opportunity cost, and a lower quality of life for the workforce (no one likes doing tasks they think are meaningless).
Until we come up with a way out of this mess, there are a few key principles to keep it sane.
Keep the rigor on the critical few - When the severity of a failure mode is high, the detection and response need to be robust. That is the time to make sure you’re using the SPC tool to its fullest. Try to do this on too many charts and you’ll never be able to sustain the effort.
3 sigma is not a federal law - We’re not only optimizing alpha and beta (the likelihoods of false alarms and missed signals), we’re also optimizing the cost of responding against the risk and cost of not responding. When the chart is not critical, don’t be afraid to use looser limits. You still catch the catastrophic fails, but ignore the nuisance alarms.
Don’t ever use all the Western Electric rules on a single chart - Ever seen a factory grind to a halt from SPC violations? I have, over a weekend. Usually a single trend rule does the job, and the rules are not one-size-fits-all across processes.
Embrace automation - Chart violations, alarms, notifications, immediate reactions, and even troubleshooting guides can all be automated. There are readily available software tools. Make sure you have them and are using them right.
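To put rough numbers on the "3 sigma is not a federal law" point, here is a minimal back-of-the-envelope sketch in Python. It assumes every chart is in control and normally distributed, that charts are independent, and that only a single "point outside the limits" rule is active; none of that is guaranteed in a real fab. The function names are made up for illustration, and the 10 million chart count is simply the giga-fab figure mentioned earlier.

```python
import math


def false_alarm_rate(k_sigma: float) -> float:
    """Probability that one in-control, normally distributed point
    falls outside +/- k_sigma limits (two-sided tail area)."""
    return math.erfc(k_sigma / math.sqrt(2))


def expected_false_alarms(n_charts: int, k_sigma: float) -> float:
    """Expected number of nuisance alarms when every chart receives
    one new in-control point (assumes identical, independent charts)."""
    return n_charts * false_alarm_rate(k_sigma)


if __name__ == "__main__":
    n_charts = 10_000_000  # assumed: the giga-fab chart count quoted above
    for k in (3.0, 4.0, 5.0):
        alarms = expected_false_alarms(n_charts, k)
        print(f"{k:.0f}-sigma limits: about {alarms:,.0f} false alarms per round of points")
```

Under these assumptions, 3-sigma limits produce on the order of tens of thousands of nuisance alarms every time each chart gets one new point, even when nothing is wrong anywhere in the factory. Widening the limits on non-critical charts cuts that count by orders of magnitude, while a catastrophic excursion still lands far outside the limits.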
In the end, we had better come up with some new way to tackle this problem soon or we’re going to get run over. Really, I’m pretty sure this is not what Deming had in mind.