How genetically similar are two random people?
Understanding the similarities and differences among people occupies psychologists, anthropologists, artists, doctors and, of course, many biologists. Even when zooming in on only the genetic differences among people there is a dazzling range of issues to discuss. The day that DNA extracted at a crime scene can lead to a mug shot portrait seems to have already arrived, at least according to a recent publication on modeling 3D facial shape from DNA (P. Claes et al., PLOS Genetics, 10:e1004224, 2014). In the spirit of cell biology by the numbers, can we get some basic intuition by logically analyzing the implications of a few key numbers that pertain to the question of genetic diversity in humans?
We begin by focusing on single base pair differences, known as single nucleotide polymorphisms (SNPs). Other components of variation such as insertions and deletions, varying numbers of gene repeats (part of what are known as copy number variations, or CNVs) and transposable elements will be touched upon below. How many single base pair variations would you expect between yourself and a randomly selected person from a street corner? Sequencing efforts such as the 1000 Genomes Project give us a rule of thumb: they find about one SNP per 1000 bases. That, other components set aside, is the basis for the claim that people are 99.9% genetically similar. But this genetic similarity raises the question: how come we feel so different from that person we run into on the street? Well, keep on reading to learn of other genetic differences, but one should also appreciate how our brains are tuned to notice and amplify differences and to disregard the unifying properties such as all of us having two hands, one nose, a big brain and so forth. To an alien we would probably all look identical, just as two mice with the same fur coat may seem like clones to you even if one is the Richard Feynman of his clan and the other the Winston Churchill.
Back to the numbers. Let’s check on the accuracy and implications of the rule of thumb of one SNP per 1000 bases. The human genome is about 3 Gbp long. This suggests about 3 million SNPs between two random people. This is indeed the reported value to within 10%, which is no surprise as this is the origin of the rule of thumb (BNID 110117). What else can we say about this number? With about 20,000 genes, each having a coding sequence (exons) about 1.5 kb long (i.e. a protein about 500 amino acids long on average), the human coding sequence covers 30 Mbp, or about 1 percent of the genome. If SNPs were randomly distributed along the genome, that would suggest about 30,000 SNPs across the coding sequence of the genome, or about 1.5 per gene coding sequence. The measured value is about 20,000 SNPs, which gives a sense of how wrong we were in our assumption that the SNPs are distributed randomly. So we are statistically wrong, as any statistical test would give an impressively low probability for this lower value to appear by chance. This is probably an indication of stronger purifying selection on coding regions. At the same time, for practical purposes this less than 2-fold difference suggests that the bias is not very strong and that 1 SNP per gene is a reasonable rule of thumb.
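The back-of-the-envelope estimate above can be reproduced in a few lines. The sketch below uses only the round numbers quoted in the text; the variable names are ours, and the Poisson model behind the z-score is our own simplifying assumption, meant only to illustrate why the shortfall is "statistically wrong".

```python
import math

# Round numbers quoted in the text.
GENOME_BP = 3e9                 # human genome length, ~3 Gbp
SNP_RATE = 1 / 1000             # rule of thumb: one SNP per 1000 bases
N_GENES = 20_000                # protein-coding genes
CODING_BP_PER_GENE = 1_500      # ~1.5 kb of coding sequence per gene
OBSERVED_CODING_SNPS = 20_000   # measured value quoted in the text

# Genome-wide SNPs between two random people.
total_snps = GENOME_BP * SNP_RATE

# Coding sequence length and its share of the genome.
coding_bp = N_GENES * CODING_BP_PER_GENE
coding_fraction = coding_bp / GENOME_BP

# Expected coding SNPs if SNPs were spread uniformly along the genome.
expected_coding_snps = total_snps * coding_fraction
expected_per_gene = expected_coding_snps / N_GENES

# Observed relative to expected (values below 1 mean depletion).
observed_over_expected = OBSERVED_CODING_SNPS / expected_coding_snps

# Assumption (ours): SNP counts are Poisson, so std dev = sqrt(mean).
z_score = (OBSERVED_CODING_SNPS - expected_coding_snps) / math.sqrt(expected_coding_snps)

print(f"total SNPs: {total_snps:.3g}")
print(f"coding fraction: {coding_fraction:.1%}")
print(f"expected coding SNPs: {expected_coding_snps:.0f} ({expected_per_gene:.1f} per gene)")
print(f"observed / expected: {observed_over_expected:.2f}")
print(f"z-score under Poisson assumption: {z_score:.1f}")
```

Under these assumptions the observed count sits dozens of standard deviations below the uniform expectation, even though the fold difference itself is modest.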
How does this distribution of SNPs translate into amino acid changes in proteins? Let’s again assume a homogeneous distribution among mutations that change an amino acid (non-synonymous) and those that do not affect the amino acid identity (synonymous). From the genetic code, the number of non-synonymous changes when there is no selection or bias of any sort should be about four times that of synonymous mutations (i.e. synonymous mutations are about 20% of the possible mutations, BNID 111167). That is because there are more base substitutions that change an amino acid than ones that keep the amino acid identity the same. What does one find in reality? About 10,000 mutations of each type are actually found (BNID 110117), showing that there is indeed a bias towards underrepresentation of non-synonymous mutations, but in our order-of-magnitude world view it is not a major one.
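The claim that most base substitutions change the amino acid can be checked by brute force over the standard genetic code. The sketch below is our own illustration, not the calculation behind the BNID entry: it weights every sense codon and every single-base substitution equally, ignoring codon usage and any mutational bias. With that unweighted counting the synonymous share comes out at roughly a quarter, in the same range as, though somewhat higher than, the roughly 20% quoted above.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering
# (third base varies fastest); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitutions():
    """Count single-base substitutions out of every sense codon.

    Assumption (ours): all codons and all substitutions are equally likely.
    """
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # start only from codons that encode an amino acid
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODON_TABLE[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = classify_substitutions()
total = sum(counts.values())
synonymous_fraction = counts["synonymous"] / total
nonsyn_to_syn = (counts["missense"] + counts["nonsense"]) / counts["synonymous"]

print(counts)
print(f"synonymous fraction: {synonymous_fraction:.1%}")
print(f"non-synonymous : synonymous = {nonsyn_to_syn:.1f} : 1")
```

Either way, the neutral expectation is several non-synonymous changes for every synonymous one, against the roughly 1:1 ratio actually observed.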
One type of mutation that can be especially important, though, is the nonsense mutation, which creates a stop codon that terminates translation early. How often might we naively expect to find such mutations given the overall load of SNPs? Three of the 64 codons are stop codons, so we would crudely expect 20,000 × 3/64 ≈ 1000 early stop mutations. Observations show about 100 such nonsense mutations, indicating a strong selective bias against such mutations. Still, we find it interesting to look at the person next to us and wonder which 100 proteins in our genomes are differentially truncated. Thanks to the diploid nature of our genomes, there is usually another fully intact copy of the gene (a situation known as heterozygosity) that can serve as backup.
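The crude stop-codon estimate above, spelled out as arithmetic. This is a minimal sketch using the text's own numbers and its own simplification that a coding SNP lands on a stop codon with probability 3/64; the variable names are ours.

```python
# Numbers quoted in the text.
CODING_SNPS = 20_000      # SNPs in coding sequence between two people
STOP_CODONS = 3           # TAA, TAG, TGA
TOTAL_CODONS = 64
OBSERVED_NONSENSE = 100   # nonsense mutations actually observed

# Naive expectation: a fraction 3/64 of coding SNPs create a stop codon.
expected_nonsense = CODING_SNPS * STOP_CODONS / TOTAL_CODONS

# How strongly nonsense mutations are depleted relative to that guess.
fold_depletion = expected_nonsense / OBSERVED_NONSENSE

print(f"expected nonsense mutations: {expected_nonsense:.0f}")
print(f"observed: {OBSERVED_NONSENSE}, depletion: ~{fold_depletion:.0f}-fold")
```

The roughly tenfold shortfall is much larger than the less than twofold depletion seen for coding SNPs overall, which is the sense in which selection against nonsense mutations is strong.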
How different is your genotype from that of each of your parents? Assuming they have unrelated genotypes, the values above should be cut in half, as you share half of your genome with your father and half with your mother. So still quite a few truncated genes and substituted amino acids. The situation with your brother or sister is quantitatively similar as you again share, on average, half of your genomes (assuming you are not identical twins…). Actually, for about 1/4 of your genome, you and your sibling are like identical twins, i.e. you have the same two parental copies of the DNA.
Insertions and deletions (nicknamed indels) of up to about 100 bases are harder to enumerate, but on the order of 1 million per genome are observed, about 3000 of them in coding regions (so an underrepresentation of about half an order of magnitude). Larger variations spanning longer stretches, including copy number variations, number in the tens of thousands per genome, but because they cover such long stretches the total number of bases they affect may exceed the number of bases in SNPs.
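Both the sibling argument and the indel estimate above can be checked with a short calculation. The sketch below is our own illustration: at a single locus it enumerates the 16 equally likely ways two siblings can each inherit one of the father's two copies and one of the mother's two copies (assuming unrelated parents), and for indels it reuses the roughly 1% coding fraction estimated earlier in the text.

```python
import math
from fractions import Fraction
from itertools import product


def sibling_sharing():
    """Probability that two siblings share 0, 1 or 2 parental copies at a locus.

    Each sibling independently inherits copy 0 or 1 from the father and
    copy 0 or 1 from the mother, giving 2**4 = 16 equally likely outcomes.
    """
    outcomes = {0: 0, 1: 0, 2: 0}
    for father_a, mother_a, father_b, mother_b in product((0, 1), repeat=4):
        shared = int(father_a == father_b) + int(mother_a == mother_b)
        outcomes[shared] += 1
    total = sum(outcomes.values())
    return {k: Fraction(v, total) for k, v in outcomes.items()}


sharing = sibling_sharing()
# Average fraction of the two copies at a locus that the siblings share.
mean_shared = sum(k * p for k, p in sharing.items()) / 2

print(f"share both copies (like identical twins): {sharing[2]}")
print(f"share one copy: {sharing[1]}, share none: {sharing[0]}")
print(f"average fraction shared: {mean_shared}")

# Indels: expected count in coding regions if spread uniformly.
TOTAL_INDELS = 1_000_000        # order of magnitude quoted in the text
CODING_FRACTION = 0.01          # ~1% of the genome, from the earlier estimate
OBSERVED_CODING_INDELS = 3_000

expected_coding_indels = TOTAL_INDELS * CODING_FRACTION
orders_of_magnitude = math.log10(expected_coding_indels / OBSERVED_CODING_INDELS)

print(f"expected coding indels: {expected_coding_indels:.0f}")
print(f"underrepresentation: {orders_of_magnitude:.2f} orders of magnitude")
```

The enumeration treats one locus in isolation; averaged over the whole genome the same fractions hold in expectation, with recombination determining how the shared stretches are distributed.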
The ability to comprehensively characterize these variations is a very recent scientific achievement, starting only in the third millennium with the memorable race between the Human Genome Project consortium and the group led by Craig Venter. Comparing the genome of Craig Venter to the consensus human genome reference sequence, one finds a difference of about 1.2% when indels and CNVs are considered, 0.1% when SNPs are considered and ≈0.3% when inversions are considered, for a grand total of 1.6% (BNID 110248). In the decade that followed the sequencing of the human genome, technologies moved forward extremely rapidly, leading to the 1000 Genomes Project, which might seem like a rotation project to some of our readers by the time they read these words. Who knows how soon readers will be able to check our quoted numbers by loading their own genome from their medical report and comparing it to that of some random friend?