Dealing adequately with technical uncertainties
Statistics, RAMS & Quality Management
Principles of MTBF calculations, basic assumptions and consequences
As said earlier, MTBF means Mean Time Between Failures, which is the
average time between two consecutive failures of an item. MTBF values
are usually given in hours.
MTBF and the so-called Failure Rate have a reciprocal relationship:
MTBF = 1/Failure Rate, and Failure Rate = 1/MTBF.
While MTBF seems to be more
intuitive, it is quite difficult to handle in calculations. Therefore,
the
Failure Rate is the preferred metric used in MTBF calculations.
The following explanations use both
MTBF and Failure Rate.
MTBF calculations are the cornerstone of almost all quantitative
reliability and safety analyses. Without MTBF calculations, these
analyses could not exist at all. However, MTBF values are sometimes
also used as a mere sales pitch.
MTBF calculation of a system, in simple words, is just determining the
failure rates of every single component and finally adding all these
failure rates up in order to obtain the system failure rate (= the
reciprocal of the system MTBF).
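As a minimal sketch of this summation, assuming three hypothetical component failure rates (the values and the function name `system_mtbf` are illustrative and not taken from any standard):

```python
def system_mtbf(failure_rates_per_hour):
    """Sum component failure rates and return (system failure rate, system MTBF)."""
    system_rate = sum(failure_rates_per_hour)
    return system_rate, 1.0 / system_rate


# Hypothetical component failure rates in failures per hour (illustrative only).
component_rates = [2e-7, 5e-7, 3e-7]
system_rate, mtbf_hours = system_mtbf(component_rates)
print(f"system failure rate: {system_rate:.2e} per hour")
print(f"system MTBF: {mtbf_hours:,.0f} hours")
```

With these assumed values the system failure rate is about one failure per million hours, so the system MTBF is about one million hours. Note that the MTBF values themselves are never added, only the failure rates.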
MTBF calculation requires both component specific parameters and global
parameters. Component specific parameters would be the resistance of a
resistor, the viscosity of the grease of a gear, the rate of actions of
a switch, and so on, while global parameters would be ambient
temperature and environment type.
While there are virtually no standards available for mechanical
components, the reliability analyst can choose from at least 6
international standards for electronic components. Therefore,
MTBF calculation almost always means the calculation of the MTBF of electronic
systems.
Depending on the standard selected for calculation, MTBF results can be
very different. Differences of a factor of 3 are quite usual on PCB
level, and even a factor of 10 is not uncommon. The reason for this lies
not only in the different approaches of the standards, but also in the
uncertainty of the assumptions made in these approaches.
Despite the lack of standards for mechanical components, MTBF
calculations are sometimes performed for mechanical equipment. These
calculations however are even more uncertain than those for electronic
equipment. They are very often based upon rough estimations,
comparisons
with similar equipment, engineering judgment, parametric approaches,
etc., or on the so-called NPRD-1995 catalog (Nonelectronic Parts
Reliability Data), published by RiAC. While the successor NPRD-2011 is
obviously newer, its coverage (the number of different component types
addressed) is significantly lower than that of NPRD-1995.
Now let's focus on electronic equipment
Most MTBF calculations are performed using just bills of materials
(BOMs) as they are exported from ERP systems. Such BOMs usually contain
sufficient information in order to assess the required parameters. If
not, manufacturer part numbers (also contained in BOMs) can be used
for internet research in order to obtain those parameters that cannot
directly be derived from the BOMs.
The wording "using just bills of materials" has a specific connotation:
it means that electrical schematics are not used at all for most MTBF
calculations.
An important contributor to component failure rates is the relative
electrical stress of components in comparison to their ratings.
Electrical stress strongly depends on the electrical context, and
therefore electrical schematics would be the preferred means for
assessing component stress.
However, practical experience over many years shows that using average
electrical stress values for all components makes almost no difference
in MTBF on system level in comparison to assessing every component
individually.
Since the system failure rate is just the sum of all component failure
rates, and provided that reasonable average stress levels are used, the
MTBF analyst doesn't even need to understand the electrical schematics
in order to calculate a valid system MTBF. The analyst only needs to
know which components, and how many of each, are built into the system.
Furthermore, this approach (average stress for all components) is a
significant time saver in MTBF calculation.
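The BOM-based approach can be sketched as follows. This is only an illustration: the part types, quantities, and generic failure rates (given here in FIT, i.e. failures per 10^9 hours, at an assumed average stress level) are hypothetical and not taken from any standard.

```python
# Hypothetical generic failure rates at an assumed average stress level,
# in FIT (failures per 10^9 hours). Values are illustrative, not from a standard.
GENERIC_FIT = {
    "resistor": 1.0,
    "ceramic_capacitor": 2.0,
    "microcontroller": 50.0,
    "connector": 10.0,
}

# Simplified BOM: part type -> quantity (as might be derived from an ERP export).
bom = {"resistor": 120, "ceramic_capacitor": 80, "microcontroller": 1, "connector": 4}


def bom_failure_rate_fit(bom, generic_fit):
    """System failure rate in FIT: quantity times generic rate, summed over the BOM."""
    return sum(qty * generic_fit[part] for part, qty in bom.items())


total_fit = bom_failure_rate_fit(bom, GENERIC_FIT)
mtbf_hours = 1e9 / total_fit
print(f"{total_fit:.0f} FIT, MTBF of about {mtbf_hours:,.0f} hours")
```

No schematic information enters this calculation; only part types and quantities are needed, which is why the approach saves so much time.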
Safety analyses distinguish between dangerous failure modes and safe
failure modes. Therefore, it would be important to know the exact
failure rates of individual components, and as a consequence, it would
be necessary to assess individual stress levels for every component.
However, even safety standards like ISO 13849 suggest that it
can be acceptable to use average stress values even in safety
analyses.
The fact that using average stress values instead of individual stress
values yields almost the same result on system level has many reasons:
- For some component types, failure
rate models don't even ask for electrical stress. This is true for all
electronic standards.
- The depth of failure rate models strongly depends on the component
type. Some models are quite detailed and use many parameters, while
others are quite simplistic and use perhaps only one parameter. The
simplistic models, which do not include electrical stress as a
parameter, tend to yield higher failure rates, while the detailed
models, which do include electrical stress, tend to yield lower
failure rates.
- As a result of the two preceding points, the overall system failure
rate (and MTBF) is mainly influenced by those components whose failure
rate models do not take electrical stress into account. This is true
for all electronic standards.
- When many component failure rates are added, their independent
estimation errors partly cancel (an effect usually explained with the
so-called central limit theorem), so the sum has a relatively small
total error.
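The last point can be illustrated with a small simulation. It is a sketch under simplifying assumptions that are not part of any standard: every component has the same true failure rate, and every estimate is off by an independent, unbiased error of up to plus or minus 50 %. The function name and all parameters are hypothetical.

```python
import random


def mean_relative_error_of_sum(n_components, max_rel_error=0.5, trials=2000, seed=1):
    """Average absolute relative error of a summed failure rate.

    Every component has the same true rate (a simplification) and its estimate
    is off by an independent, unbiased error of up to +/- max_rel_error.
    """
    rng = random.Random(seed)
    true_total = float(n_components)  # true rate of 1.0 per component
    accumulated = 0.0
    for _ in range(trials):
        estimated_total = sum(
            1.0 + rng.uniform(-max_rel_error, max_rel_error)
            for _ in range(n_components)
        )
        accumulated += abs(estimated_total - true_total) / true_total
    return accumulated / trials


for n in (1, 10, 100):
    print(n, round(mean_relative_error_of_sum(n), 3))
```

The relative error of the sum shrinks as the number of components grows. This only holds for independent, unbiased errors; a systematic bias that affects all components in the same direction would not cancel out.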
While MTBF calculations seem to be straightforward, the theory behind
these calculations is not.
It starts with the so-called bathtub curve, in particular the middle
section of that curve.
Constant Failure Rate
The bathtub curve is an idealized sketch showing the failure rate of a
product over time. The middle section of that curve has constant
failure rate (and therefore constant MTBF) and represents the
useful product life phase.
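A constant failure rate corresponds to exponentially distributed times to failure, so the probability of surviving an operating time t without failure is exp(-t/MTBF). The following minimal sketch evaluates this for a hypothetical MTBF value; the function name is illustrative.

```python
import math


def survival_probability(t_hours, mtbf_hours):
    """Probability that a unit runs t_hours without failure at constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


mtbf_hours = 100_000.0  # hypothetical value
print(survival_probability(8_760.0, mtbf_hours))     # one year of continuous operation
print(survival_probability(mtbf_hours, mtbf_hours))  # operating time equal to the MTBF
```

For an operating time equal to the MTBF the survival probability is exp(-1), i.e. roughly 37 %. An MTBF is therefore not a guaranteed lifetime.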
A constant failure rate is far more than just a simplification of some
more complex behavior: the mathematical term "constant failure rate" is
equivalent to the term "random failures", and this in turn means the
same as "this is a mature and perfectly designed product without any
systematic failures".
Every systematic failure mode can, at least theoretically, be eliminated
by design, but there is no way at all to eliminate random failures.
Random failures can be compared with basic noise: It is always present
and cannot be avoided. Random failures are caused by acts of nature
beyond any control.
Due to their random nature, individual random failures are generally
unpredictable. In practice this means that there is no way to predict
which unit will fail at what point in time.
However, what can be predicted is the number of failures during a
period of time.
Summarizing the above:
It is generally predictable how many units will fail during a given
period of time, but it is impossible to predict which units will fail
and when they will fail.
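As a sketch of such a prediction, the expected number of failures in a fleet is the accumulated operating time divided by the MTBF, and the number actually observed scatters around that value according to a Poisson distribution. The fleet size, operating time, and MTBF below are hypothetical, and the sketch assumes that failed units are repaired or replaced so that the fleet size stays constant.

```python
import math


def expected_failures(units, hours_per_unit, mtbf_hours):
    """Expected number of failures in a fleet, assuming a constant failure rate."""
    return units * hours_per_unit / mtbf_hours


def poisson_probability(k, mean):
    """Probability of observing exactly k failures when `mean` are expected."""
    return math.exp(-mean) * mean**k / math.factorial(k)


# Hypothetical fleet: 1000 units, one year of continuous operation each.
mean = expected_failures(units=1000, hours_per_unit=8760, mtbf_hours=1_000_000)
print(f"expected failures: {mean:.2f}")
print(f"probability of no failure at all: {poisson_probability(0, mean):.5f}")
```

The calculation says nothing about which of the 1000 units will fail, only how many failures are to be expected.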
On the one hand this is a restriction, but on the other hand it makes
field data evaluation easy:
The only thing that needs to be known is the number of units failed
within a period of time.
From a practical statistical viewpoint the random failure approach
needs only very few data points (at least 3) in order to yield a valid
model.
These two circumstances, (1) not needing to know which units failed
when and (2) needing only a few data points, are what make it possible
at all for many companies to evaluate their field data.
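A field data evaluation under the random failure model can be as simple as the following sketch. The function name and the field data are hypothetical, and the sketch assumes that all units operate for the whole period and that failed units are repaired or replaced promptly.

```python
def estimate_mtbf(units_in_field, period_hours, failures):
    """Point estimate of MTBF from field data under the constant failure rate model.

    Assumes all units operate for the whole period and failed units are
    repaired or replaced promptly (both are simplifications).
    """
    if failures == 0:
        raise ValueError("no failures observed; a point estimate is not possible")
    total_operating_hours = units_in_field * period_hours
    return total_operating_hours / failures


# Hypothetical field data: 500 units, 4380 hours each, 5 failures reported.
print(estimate_mtbf(500, 4380, 5))
```

Only the number of units, the length of the period, and the failure count enter the estimate; individual operating times at failure are not needed.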
More sophisticated failure rate models would require both more data
points (= more failures) and the exact knowledge of individual
operating times.
Consequences of constant failure rate
- If failures occur only
randomly,
preventive maintenance makes no sense at all because preventive
maintenance addresses predictable failures.
- A further consequence of constant failure rate is that products
don't age; in effect they are always as good as new. Random failures
not only mean that future failures are unpredictable, they also mean
that there is no way to determine how long units have already been
running without failure. In other words, in the random failure model
there is no way to distinguish between older units and new units.
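The second consequence is the so-called memoryless property of the exponential distribution, and it can be checked numerically. In the following sketch the MTBF, the unit ages, and the function name are hypothetical.

```python
import math


def conditional_survival(age_hours, mission_hours, mtbf_hours):
    """Probability of surviving a further mission_hours, given that the unit
    has already survived age_hours, at constant failure rate."""
    def reliability(t):
        return math.exp(-t / mtbf_hours)

    return reliability(age_hours + mission_hours) / reliability(age_hours)


mtbf_hours = 50_000.0  # hypothetical value
print(conditional_survival(0.0, 1_000.0, mtbf_hours))       # brand-new unit
print(conditional_survival(40_000.0, 1_000.0, mtbf_hours))  # unit with 40,000 hours
```

Both calls return the same probability, apart from floating-point rounding: within this model, a unit that has already run 40,000 hours is exactly as likely to survive the next 1,000 hours as a brand-new one.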
Serial Model
The so-called serial model is a further basic assumption on which MTBF
calculations rely. This means that all components of a system are
assumed to work in a series chain. If any component fails, the whole
system is assumed to fail. This is of course very often not realistic,
because
- the system may be redundant or
diversified and therefore can tolerate failures while still being
functional
- even in a series chain, many component failure modes may be
irrelevant for MTBF (e.g. drift of most 100 nF capacitors, drift of
many resistors, drift of most electrolytic capacitors, some failure
modes of ESD protection devices, diagnostic circuits, etc.)
- even without redundancy/diversification, some failure modes still
don't have any effect:
  - failures in diagnostic circuits may be relevant for maintenance
    technicians, but may have no effect for the regular end user
  - the loss of a 100 nF capacitor usually doesn't have any effect
  - drifts in digital applications very often don't have any effect
  - the loss of suppressor diodes may be tolerable as long as there
    are no external events
  - a decrease in capacitance of electrolytic capacitors often doesn't
    have any effect
  - .........
The consequence of the series chain assumption is that MTBF
calculations tend to be pessimistic.
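How pessimistic the series assumption can be is shown by the following sketch, which compares two identical units treated as a series chain with the same two units treated as a redundant pair. The failure rate is hypothetical, and the factor 1.5 is the textbook result for two identical units in parallel with constant failure rate, where one working unit is sufficient and nothing is repaired.

```python
def series_mtbf(failure_rates_per_hour):
    """MTBF under the series assumption: any single failure fails the system."""
    return 1.0 / sum(failure_rates_per_hour)


def redundant_pair_mttf(failure_rate_per_hour):
    """Mean time to system failure of two identical units in parallel,
    where one working unit is sufficient and nothing is repaired."""
    return 1.5 / failure_rate_per_hour


rate = 1e-5  # hypothetical failure rate of one unit, per hour
print(series_mtbf([rate, rate]))
print(redundant_pair_mttf(rate))
```

If the two units are in fact redundant, the series calculation (about 50,000 hours) underestimates the mean time to system failure (about 150,000 hours) by a factor of three.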
A more realistic characterization of system behavior would require
additional and deeper analysis methods like FMEA, Fault Tree, Markov, or
Reliability Block Diagrams.
These methods are not used instead of MTBF calculations, but in
addition to them. They cannot replace MTBF calculations because they
themselves are based upon component failure rates.