Friday, September 15, 2006

Overdesign of Overcurrent Protection?

(Preface: To tell you the truth, the previous two entries were actually written nearly a week ago, on another diary site. I had wanted to move them over here sooner, but only yesterday did that other diary site come back online so that I could grab the entries from there. In other words, a few days have passed between the last entry and this one.)

As more and more boards come in, we're discovering a seemingly endless stream of issues that need to be confronted and resolved before the next stage of development is possible. One of the most persistent and obstructive is the +1.5V and +1.2V power supply.

From my diagnosis, the issue shows up as the following symptoms.

  • System No Boot.
  • ALL_SYS_PWRGD (All Systems Power Good) LED shows power malfunction.
  • +1.5V power rail measures just under 1/2 of its correct value.
  • Power-On Sequence signals up to that point are all normal.

(NOTE: On different boards, it measures out to different values, but all the malfunctioning boards have this voltage measuring out to be within the range of +0.6V to +0.8V. The same thing would happen with the +1.2V voltage -- it would measure out to be around half of its intended level.)

These voltages power vital components on the system:

  • Parts of the CPU
  • Parts of the Southbridge
  • Parts of the Northbridge
  • Wireless Ethernet
  • PCIE
  • the Graphics Processor
... to name a few.

We found no apparent issues with the circuit design of those areas. Below is the schematic for the power circuitry for the +1.5V power rail.


Figure 1: The +1.5V Power Supply Circuit Schematic.


Over the course of a week or so, the Power engineer from our customer (our own Power engineer was busy with another project) did his own investigation and found the following possible cause.


Figure 2: Bulk Capacitors for the Northbridge. You can tell that the power line (=PP1V5_S0_NB_3G) is going to the Northbridge by the "NB" circled in green.



Figure 3: Bulk Capacitors for the Southbridge. You can tell that the power line (=PP1V5_S0_SB) is going to the Southbridge by the "SB" circled in green.


Notice the bulk capacitors (220uF) in both screenshots. Their purpose here is to store a reserve of energy. If the ICs (NB or SB) suddenly need a large surge of current while the machine is operating, they can draw it from these capacitors instead of pulling so much, so quickly, from the power ICs (such as the Uxx00) that those overheat and burn out.
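To put rough numbers on that, here is a small back-of-the-envelope sketch in Python. Only the 220uF value and the +1.5V rail come from the schematics above. The surge size and duration (1 A for 10 microseconds) are made-up illustration values, and the function names are my own.

```python
def stored_charge_and_energy(cap_f, rail_v):
    """Charge (coulombs) and energy (joules) held by a capacitor at rail_v."""
    charge_c = cap_f * rail_v
    energy_j = 0.5 * cap_f * rail_v ** 2
    return charge_c, energy_j


def droop_v(cap_f, surge_a, surge_s):
    """Voltage drop if the capacitor alone supplies surge_a for surge_s."""
    return surge_a * surge_s / cap_f


CAP_F = 220e-6   # one bulk capacitor, from the schematic
RAIL_V = 1.5     # the +1.5V rail

charge, energy = stored_charge_and_energy(CAP_F, RAIL_V)
print(f"charge: {charge * 1e6:.0f} uC, energy: {energy * 1e6:.1f} uJ")
# charge: 330 uC, energy: 247.5 uJ

# Hypothetical surge: 1 A for 10 us, supplied entirely by the capacitor.
print(f"droop: {droop_v(CAP_F, 1.0, 10e-6) * 1e3:.1f} mV")
# droop: 45.5 mV
```

So a single capacitor can cover a short surge with only a small dip in the rail, but that same 330 uC has to be delivered by the power IC every time the capacitor starts out empty.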

But these protective devices are causing problems, too. During boot-up the capacitors are initially empty, so the demand for current to charge them really is that great. The power ICs are supposed to be able to handle this kind of punishment for a while, but apparently they can't. So this is what's happening:

  1. The power ICs are forced to supply a large amount of current, more than they detect they should be handling.
  2. The power ICs shut themselves off because they think that if they kept going, it'd get a bit toasty.
  3. The Northbridge and Southbridge drain the partially charged bulk capacitors of their stored energy.
  4. The power ICs finally wake up and see that they need to get to work. Go back to Step 1.
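
The four steps above can be played out in a toy simulation. This is only a sketch under loud assumptions, not a model of the real power IC: the regulator is an ideal current source, the load is a plain resistor, and every number except the 1.5V target (charging current, trip level, shutdown time, 440uF for two bulk capacitors) is made up for illustration.

```python
def simulate_rail(blanking_s, target_v=1.5, cap_f=440e-6, load_ohm=1.0,
                  charge_a=3.0, trip_a=2.0, off_s=400e-6,
                  dt=1e-6, total_s=20e-3):
    """Average rail voltage over the second half of the run.

    blanking_s is how long the (hypothetical) regulator tolerates an
    overcurrent condition before shutting itself off for off_s seconds.
    """
    v = 0.0          # bulk capacitors start out empty
    over_s = 0.0     # how long the overcurrent condition has lasted
    off_left = 0.0   # remaining shutdown time
    samples = []
    steps = int(round(total_s / dt))
    for step in range(steps):
        load_a = v / load_ohm
        if off_left > 0.0:
            supply_a = 0.0          # Step 2/3: regulator off, load drains caps
            off_left -= dt
        elif v < target_v:
            supply_a = charge_a     # Step 1: charging at full current
        else:
            supply_a = load_a       # in regulation: supply only the load
        if supply_a > trip_a:
            over_s += dt
            if over_s >= blanking_s:
                off_left = off_s    # overcurrent lasted too long: shut down
                over_s = 0.0
        else:
            over_s = 0.0
        v += (supply_a - load_a) * dt / cap_f
        v = min(max(v, 0.0), target_v)
        if step >= steps // 2:
            samples.append(v)
    return sum(samples) / len(samples)


print(f"short blanking (100 us): {simulate_rail(100e-6):.2f} V average")
print(f"long blanking (1 ms):    {simulate_rail(1e-3):.2f} V average")
```

With these made-up values, the short blanking time never lets the rail reach its target, so the average settles well below 1.5V, while the long blanking time lets the capacitors fill and the rail hold at 1.5V.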


This is why the voltage output from each power IC is always a fraction of its correct value. Just to confirm it, I put a 1-Ohm resistor in series with the Inductor L7800, at pin 1, on a problematic board. The problem went away immediately. I've attached a photo of it below.


Figure 4: The 1-Ohm Current-Limiting Resistor Experiment.


This is how it looked before:


Figure 5: Before the 1-Ohm Current-Limiting Resistor Experiment.


You could say that my soldering/rework skills have improved, eh? :D

So in any case, the current work-around for this stage of prototyping is to unstuff (remove) those bulk capacitors before testing the boards. (That is for the +1.5V issue; for the +1.2V and +3.3V there are similar solutions.) Inserting the 1-Ohm resistor takes longer than simply removing the capacitors.

After working with the makers of the power ICs, it was found that the power IC itself was not "ignoring" the overcurrent condition for long enough at startup. In other words, the power IC is supposed to go ahead and supply that large current for some time, until the bulk capacitors fill up and the system becomes stable, but it wasn't supplying it for long enough before deciding it was in danger. So the power IC company has promised to adjust their ICs so that in a future version, the time before shutdown is longer. That version would be in full production only after the next phase of our development.
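
For a feel of how long "long enough" might be, here is a rough estimate in Python, assuming the capacitors charge at a constant current. The formula is just t = C * V / I. The 440uF total, the 3 A charging current, and the 1 A load are all assumed values of mine, not figures from the power IC's datasheet.

```python
def min_blanking_time_s(cap_f, target_v, charge_a, load_a=0.0):
    """Time to charge cap_f from 0 V to target_v at a constant net current."""
    net_a = charge_a - load_a
    if net_a <= 0:
        raise ValueError("charging current must exceed the load current")
    return cap_f * target_v / net_a


# Assumed: two 220uF bulk capacitors, 3 A available, 1 A taken by the load.
t = min_blanking_time_s(440e-6, 1.5, 3.0, 1.0)
print(f"overcurrent must be tolerated for at least {t * 1e6:.0f} us")
# overcurrent must be tolerated for at least 330 us
```

Under these assumptions, any shutdown timer shorter than a few hundred microseconds would trip before the rail comes up, which matches the kind of behavior described above.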

Our customer has decided, however, that that would be much too long to wait, so their Power engineer has found another usable power IC from another company.

So was this an overdesign of overcurrent protection? I don't know, because perhaps the system would never need that much current anyway. What I learned from this experience, though, is to always look beyond the immediate region where the problem is supposed to occur, and to investigate everything that region touches. I believe that if I had done that, I could easily have been the one to discover those bulk capacitors, and helped solve the problem that much more quickly.

Friday, September 01, 2006

"Insulation" & "Mad Soldering 5k!|z"

"Insulation"


The prototypes just keep comin' in.

They arrive in the form of boards with components freshly placed on them via the SMT process. Part of my job was to build them into "Stealth" units, which conceal the final cosmetic look of the unit by containing the prototype in a bare-bones casing.

It isn't difficult, even though this was my first time, and I had only 4 units to build. So I was merrily following the directions and putting on heatsinks and driving screws, when my boss came around and said:

"You forgot to put the tape around the big chips."

I was like, "Huh?" And he pointed out the following to me.


Figure 1: nVidia GPU (Graphics Processing Unit) and unprotected on-chip components.


To put it simply, the green square with the sprinkles of rectangles and the reflective center chip, that whole thing, is the nVidia GPU of our new prototype. Unfortunately, I do not have a picture that shows how the little grains of rectangles (resistors and capacitors) are actually the same height as the central reflective square (the chip's die).

Check out the following picture.


Figure 2: Side view of the nVidia chip with heatsink attached.


The heatsink is the grey thing in the middle of the picture with the fingers sticking downwards. There is double-sided tape between the bottom of the heatsink and the GPU's die, holding the heatsink in place.

If you can make it out, there is very little clearance between the flat part (the bottom) of the heatsink and the packaging (the green part) of the GPU. Any little bit of tilting of the heatsink would cause it to touch the resistors and capacitors on the packaging, causing shorts which could hang the system at best and be a system-level catastrophe at worst.

So this is what my boss suggested doing.


Figure 3: nVidia chip with its on-chip components protected by Kapton tape.


I wanted to put this down because I think this was an important lesson. In manufacturing, boards are handled by people all the time, especially during testing when the boards are bare and there is no casing. In these situations, it is very easy to accidentally tilt the heatsink, and thereby cause a short if there were nothing insulating the components. I gotta try and think this way in the future, try to consider all the conceivable sources of dangers. And not just in the immediate time frame and place, but far down the line, too!





"Mad Soldering 5k!lz"


I was asked to solder on a few clock crystals this afternoon, which had first been desoldered from other boards. First of all, this is how the result looks when done via the SMT process:


Figure 4: SMT-soldered clock crystal.


Now, of course, this was done by a machine (a few, in fact), and a human being could never hope to match the precision. But, what I did was... atrocious:


Figure 5: The crystal soldered by yours truly.
(Picture forthcoming)


Clearly, I have a lot to learn even about basic skills like soldering. Everything worked fine, despite my best efforts. But what I learned was this.

The first thing to do is to make sure that all the contact pads for the crystal on the board are flat. This means removing any solder already on the pads. Then, you solder one corner contact of the crystal onto its corresponding pad on the board and watch the placement!!!

If the part is placed correctly, centered and all, then enough of each pad is left exposed that applying flux and then solder causes the solder to be sucked right onto the metal, making the hand-soldering process really simple.

By the way, here's a size comparison with a dime.


Figure 6: Size comparison of a dime and the clock crystal.


This is something to keep in mind, especially since I am supposed to help our customer's R&D team with tests and measurements and whatnot, which would require this skill.

Besides, it's fun! :)

Introduction

This is the first entry, and I figured I'd give a little introduction on my background and the big little things that marked my way to where I am right now.

I was born in a hospital in Taipei, Taiwan, on March 19th, 1979. My mom was in labor for something like 3 days. From what my dad and other relatives told me, it was a painful process, so I'm very appreciative of my mother.

Elementary school was a mixture of Physical Education and academics. I was trained for competitive swimming from second grade to fourth. Back in those days, they still used a nice stick on the buttocks and palms as encouragement.

The entire family moved to Kuala Lumpur, Malaysia, during my fourth grade, due to my dad's relocation. I went to the International School of Kuala Lumpur and learned to speak English and to appreciate foreign girls.

After I graduated from high school there, I went to the Worcester Polytechnic Institute to major in Electrical Engineering and minor in Computer Science. But I think the most unforgettable experience was staying for four years in a place where there was 1 girl for every 5 guys.

Then I went to the University of Southern California to learn more about Engineering. The male-to-female ratio improved drastically at this institution. I majored in VLSI System Design, but also enjoyed classes in Adaptive Signal Processing, Statistics for Engineers, Fuzzy Systems and Neural Networks, and Public Speaking.

I tried to find a job after I graduated. But the post-9/11 economic conditions, my lack of industry experience, and my citizenship status (I was in the US on a student visa) combined to keep me from finding one.

I did help out at a marketing position which involved representing the Child Protection & Education of America in their campaign for helping find missing children as well as increasing public awareness of the dangers to children.

In May 2005, I went back to Taiwan to find a job. I interviewed with many companies, and decided on the one that seemed to have the most potential. The position put me in China, where I lived the life of an assembly line worker for three months.

Then I became an actual engineer. I used my language skills to help translate in conversations between my coworkers and our American customer's representatives. I contributed my creativity in helping build posters for our visiting customers. I learned Cadence Concept and taught it to my fellow new workers. I studied Intel chipsets and system-level interconnects and specifications and standards. I re-learned to use the oscilloscope when I had to measure Power-On Sequences and do Signal Integrity tests for a new system. And I learned soldering skills while learning to repair and rework computer motherboards.

And found a girlfriend, among other things.

And that is how that story ended, and where this new story begins.