How to Construct a Stem and Leaf Diagram
A stem and leaf diagram contains all of your data in full detail: you can look at a stem and leaf and recover every data value in the dataset. These diagrams are useful because they make it easier to see the range of values and the relative frequency of each (they also help you determine the best grouping level for a frequency histogram). All of this is accomplished without sacrificing any detail. A frequency histogram, by contrast, often does not allow you to recover every data value, because many histograms aggregate or group values in order to produce a nice-looking display.
To construct a stem and leaf diagram, first get a feel for the range and distribution of the data. The intent is to make a good guess at what level of stem is appropriate for your data set. It's usually best to start with the most significant digit(s) (i.e., the leftmost digits) as a first approximation.
For example, take the following data set:
85 69 69 74 51 85 81 96 64 84 77 78 117 73 73 85 87 84 95 84
The appropriate stem level is the 10s place (so 117 gets a stem of 11). Write down the stems in order from top to bottom. If any stem has no values (e.g., the 10 stem here, since no value falls between 100 and 109), write it down anyway because it lets us see gaps in the data.
The next step is easy. Here, we write down the "leaves" of our data on each of the stems. The leaves constitute the less significant digits.
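The two steps above can be sketched in Python. This is a minimal illustration rather than a standard routine: the function name `stem_and_leaf` is made up for this example, and it assumes non-negative whole numbers with the 10s place and above as the stem.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group whole numbers by stem (10s and up), keeping leaves in data order."""
    leaves_by_stem = defaultdict(list)
    for value in values:
        stem, leaf = divmod(value, 10)  # 117 -> stem 11, leaf 7
        leaves_by_stem[stem].append(leaf)
    # List every stem from lowest to highest, including empty ones,
    # so that gaps in the data stay visible.
    lowest, highest = min(leaves_by_stem), max(leaves_by_stem)
    return [(stem, leaves_by_stem.get(stem, [])) for stem in range(lowest, highest + 1)]

data = [85, 69, 69, 74, 51, 85, 81, 96, 64, 84,
        77, 78, 117, 73, 73, 85, 87, 84, 95, 84]

rows = stem_and_leaf(data)
for stem, leaves in rows:
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

For the data set above, this prints one row per stem from 5 to 11. The 8 stem carries the most leaves (55145744) and the 10 stem is left empty. The leaves are deliberately left unsorted, as they would be when working by hand.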
Note that each data value is easily extracted from the diagram. The highest value is 117 (11 stem in 10s place = 110 plus 7 in 1s place) and the lowest is 51. The data also looks like it is roughly normally distributed. By the way, when you do this by hand, don't bother sorting the leaves on each stem - this is supposed to be a quick and dirty method.
What do you do with data that don't have any leaves?
This situation arises most often when one is using single-digit ratings data. The stems are easy to identify (e.g., 1 through 9). The leaf is then the next digit (the one after the decimal point), which is always 0. Your stem and leaf would thus consist of a bunch of 0s as leaves. For example:
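A minimal sketch of the all-zero-leaf case, assuming ratings on a 1-to-9 scale. The ratings below are invented purely for illustration.

```python
from collections import Counter

# Hypothetical ratings on a 1-to-9 scale, invented for this sketch.
ratings = [7, 5, 8, 7, 6, 9, 7, 4, 6, 7, 8, 5]

counts = Counter(ratings)
# Each rating is its own stem; every leaf is the 0 after the decimal point.
lines = [f"{stem} | {'0' * counts[stem]}" for stem in range(1, 10)]
print("\n".join(lines))
```

Each row is just a run of 0s whose length is the number of times that rating occurred, so the diagram reads like a sideways bar chart.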
What if I have decimal points (e.g., GPA or batting averages)?
You just insert the decimal point where it's needed to increase clarity. For GPA, your stems would be 0., 1., 2., 3., and 4.; for batting averages, your stems would be .1, .2, .3, .4, .5, .6, ... (well, maybe for high school batting averages - no one in the pros hits above .400 for very long).
What if I have more than one less-significant digit (e.g., batting averages conventionally have two and state populations have a bunch)?
You include all of the digits in the leaves but separate each leaf by a comma. For example, if you had three batting averages in the .400s (.405, .425, .437), you would have three leaves on the .4 stem separated by commas (05, 25, 37). If you have lots of digits (e.g., with state populations), you should probably drop the less significant digits (e.g., the 100s, 10s, and 1s place in state populations) to maximize clarity.
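The comma-separated leaves can be sketched as follows. This assumes the averages are stored as whole thousandths (405 for .405) to avoid floating-point rounding. The helper name `multi_digit_leaves` is hypothetical.

```python
def multi_digit_leaves(thousandths):
    """Map each .X stem to its two-digit leaves, given averages as whole thousandths."""
    rows = {}
    for value in thousandths:
        stem, leaf = divmod(value, 100)  # 405 -> stem 4, leaf 5
        rows.setdefault(stem, []).append(f"{leaf:02d}")  # keep the leading zero: "05"
    return rows

averages = [405, 425, 437]  # .405, .425, .437 from the text, stored as thousandths
rows = multi_digit_leaves(averages)
lines = [f".{stem} | {', '.join(leaves)}" for stem, leaves in sorted(rows.items())]
print("\n".join(lines))  # prints: .4 | 05, 25, 37
```

For data with many digits, the same idea applies after dropping the low-order places first, for example with integer division by 1000.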
My data is too spread out or too scrunched up on my stem and leaf diagram - it doesn't really show the distribution of the data very effectively.
It would be great if the above methods for choosing stems were sufficient for all situations, but they're not. The intent of using a stem and leaf is to develop a quick and easy graphical representation of the data distribution; if the above methods create too few stems (e.g., only 2 or 3), then you can't get a feel for the distribution because all of the leaves will be bunched up. If the methods create too many stems, then your data would be too spread out with perhaps one or two data values on each stem - this latter problem is subjective, though, because when you have lots of data, more stems may be advantageous.
So, you might need either more or fewer stems than the above methods allow. The following methods allow you to split stems apart. There are two widely used methods to change the level of the stems:
Note that these same techniques can be used to compress data onto fewer stems: use significant digits at a higher level (e.g., 100s rather than 10s) and then split. For example, using the 10s digits for body weight will spread out your data too much (generating stems like 9, 10, 11, 12, 13, ..., 22, 23, 24). You might have thought that the 100s digit was too much compression (only 0, 1, and 2 stems), but by splitting them up (e.g., 0., 1*, 1t, 1f, 1s, 1., 2*, 2t) you can generate a better stem and leaf.
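Here is a rough sketch of stem splitting. It assumes the labels in the text follow the common convention in which `*` marks the first part of a stem and `.` the last, with `t`, `f`, and `s` covering leaves 2-3, 4-5, and 6-7 when a stem is split five ways. That reading of the labels, the function name `split_stem`, and the body weights are all assumptions made for illustration.

```python
# Suffixes assumed from the labels in the text: '*' starts a stem and '.' ends it;
# 't', 'f', 's' cover leaves 2-3, 4-5, and 6-7 in the five-way split.
SPLIT_SUFFIXES = {2: ["*", "."], 5: ["*", "t", "f", "s", "."]}

def split_stem(value, stem_unit, parts):
    """Return (stem label, leaf digit) for a non-negative whole number."""
    stem, remainder = divmod(value, stem_unit)
    leaf = remainder * 10 // stem_unit  # first digit below the stem; the rest is dropped
    suffix = SPLIT_SUFFIXES[parts][leaf * parts // 10]
    return f"{stem}{suffix}", leaf

# Hypothetical body weights in pounds, invented for this sketch.
weights = [95, 112, 118, 127, 131, 135, 144, 150, 158, 163, 171, 186, 204, 231]

diagram = {}
for weight in weights:
    label, leaf = split_stem(weight, stem_unit=100, parts=5)
    diagram.setdefault(label, []).append(leaf)

for label, leaves in diagram.items():  # weights are sorted, so labels appear in order
    print(f"{label:>2} | {''.join(map(str, leaves))}")
```

With these made-up weights, the rows come out as 0., 1*, 1t, 1f, 1s, 1., 2*, 2t, matching the stems listed above. Passing `parts=2` instead splits each stem into a low half (leaves 0-4) and a high half (leaves 5-9).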
The best way to find a good stem level is to use trial and error. If you choose a stem level that generates a lot of stems with 1 or 2 leaves each, then you need to crunch the data. If you find that you have too few stems with lots of leaves on each, then you need to spread it out.
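The trial-and-error check can be made a little more systematic by counting how many stems a candidate level produces. The helper below is a hypothetical sketch, not an established rule. It assumes non-negative whole numbers and counts empty stems too.

```python
def summarize_stem_level(values, stem_unit):
    """Return (number of stems including empty ones, average leaves per stem)."""
    stems = [value // stem_unit for value in values]
    stem_count = max(stems) - min(stems) + 1
    return stem_count, len(values) / stem_count

data = [85, 69, 69, 74, 51, 85, 81, 96, 64, 84,
        77, 78, 117, 73, 73, 85, 87, 84, 95, 84]

for stem_unit in (10, 100):
    stem_count, per_stem = summarize_stem_level(data, stem_unit)
    print(f"stem unit {stem_unit:>3}: {stem_count} stems, {per_stem:.1f} leaves per stem")
```

For the earlier data set, the 10s level gives 7 stems while the 100s level gives only 2, which is too few to show the shape of the distribution.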
If you have two groups that you wish to compare, it can be especially informative to use a back-to-back stem and leaf diagram. To do so, you put the leaves for one group on the right of the stems (as above) and the leaves for the other group on the left. As an example, the following stem and leaf shows the distributions for American League and National League team batting averages at the end of the 2000 season (in case you're curious, the team hitting .294 was the Colorado Rockies). The averages for AL teams tend to be higher because of their use of the "designated hitter," which keeps the typically poor-hitting pitchers away from the bat.
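A back-to-back layout can be sketched as below. The function name `back_to_back` is hypothetical, and the two groups of averages are invented for illustration. They are not the 2000 season figures. Values are stored as whole thousandths (278 for .278), with the last digit as the leaf.

```python
def back_to_back(left_values, right_values):
    """Build back-to-back rows from whole numbers, using the 10s place and up as the stem."""
    def group(values):
        grouped = {}
        for value in sorted(values):
            stem, leaf = divmod(value, 10)
            grouped.setdefault(stem, []).append(str(leaf))
        return grouped

    left, right = group(left_values), group(right_values)
    stems = range(min(min(left), min(right)), max(max(left), max(right)) + 1)
    width = max(len(leaves) for leaves in left.values())
    lines = []
    for stem in stems:
        # Left leaves are reversed so the smallest leaf sits next to the stem.
        left_leaves = "".join(reversed(left.get(stem, [])))
        right_leaves = "".join(right.get(stem, []))
        lines.append(f"{left_leaves:>{width}} | {stem} | {right_leaves}")
    return lines

# Hypothetical team batting averages in thousandths, invented for this sketch.
group_a = [262, 267, 270, 274, 275, 278, 282, 283, 288]
group_b = [255, 259, 263, 266, 268, 271, 272, 279, 291]

lines = back_to_back(group_a, group_b)
print("\n".join(lines))
```

The leaves are sorted here because the computer does it for free. Both groups share the same column of stems, so differences in center and spread show up as leaves piling up on different rows.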