How many genes in our genome?


The metaphysics of genomics

“None shall pass,” declaimed the Black Knight when confronting King Arthur (and Patsy with his coconuts) in ‘Monty Python and the Holy Grail’. Facing Arthur, King of the Britons, in single combat, the Black Knight was careful to sidestep the debate over whether “none” is singular or plural. His statement, “None shall pass,” is safely ambiguous in this regard. However, if he had been able to defeat Arthur (he didn’t), would he have clarified his “none” back at the castle that night with the singular, “No one was able to pass me!”, or the plural, “Not any were able to pass me”? Since he was completely dismasted in the battle, it’s a moot point for him, but the question still stands.

Back in the early, heady days of the genome project, a similar question that often got booted about was, “How many genes are there in our genome?” No lopping off of limbs was involved, but much argument was had, and heavy betting was on a number around twenty-five to thirty thousand, though some factions advocated for a much higher number, nearer to one hundred thousand. I was in the latter camp. Resolving the disagreement depended on agreeing upon what, exactly, a gene is, for which there can be many possible definitions, and whether “gene” is singular or plural. The 30K faction’s “gene” was not the same as the 100+K faction’s “gene”. Reporters learned to ask other, easier questions, or just to leave geneticists alone entirely.

This isn’t just a problem in genetics, but a basic one of philosophy. That is, “How do we know what anything is?” This dog may look different than that dog, but we can agree this dog is a dog. Why, or how?

As one former president voiced it, it depends upon what the meaning of the word ‘is’ is. John Locke, philosopher, political theorist, dog racing enthusiast (I like to imagine so, at least) even had an entire chapter of his “Essay Concerning Human Understanding” on the word “is” (Chapter 7 of book III, which also has a section solely devoted to the word “but” [really]). Locke believed that we are not born with a shared set of ideas about the world, but that our innocent minds are blank, and experience is all that fills them up. This can lead to problems in communication. If we are not aware that my knowledge might be different from your knowledge, it’s easy to talk without either side understanding what the other is trying to say. If you’ve ever had a discussion with a die-hard member of the opposite political party, you’ll know what this is like. Unless we acknowledge this problem our wording will be imprecise; we have no way of knowing that when I say “gene” I mean the same “gene” as you do.

Locke concluded that the only sure path out of this ambiguity was to have every single item have its own word, and do away with words being able to define entire classes of object. Every single dog, leaf, grain of sand, every RNA transcript and every cloud in the sky would have its own name. Fortunately he didn’t mean this as a serious proposal. For one, we’d need more letters in the alphabet, or alternatively have to deal with really long words. Dictionaries would be really, really big. More importantly, he said it would make communication impossible, which is the whole point of language, at least for the non-politicians.

Locke was an empiricist, one who believes that we can only know what our senses tell us, and that all our knowledge comes from experience. This is in contrast to the rationalists, who believe that we can use reason to understand our world. Of course, no one would deny that reason is useful, but a rationalist believes that solely through the use of reason we can understand an objective reality. René Descartes was a rationalist, and his famous “I think, therefore I am” was his declaration that by using reason he could affirm that his existence was real. For me, looking down and saying, “Yep, I’m still here” is enough, but real philosophers take this stuff much more seriously.

About one hundred years after Locke, Immanuel Kant wrote his “Critique of Pure Reason”, which was in some regards devoted to that one key word of Locke’s, “is”. How can we know that this dog is a dog, or a gene is a gene? Actually, Kant took one step back from Locke’s problem; he asked what we know about how we try to know what something is. His answer was that we might know that we don’t know what something truly is, but we can know that we know some things about it.

Kant was probably a difficult conversationalist. One may not be surprised that he never married.

Unlike Locke, who worried that ambiguity about meaning was an impediment to clear communication, Kant never got too worked up about the language aspect of the problem, and seemed to view the nexus between word and thought as too tight to leave room for much of anything interesting in the middle. Nonetheless, Kant saw this uncertainty about our knowledge as a worrying problem: is there not a path between those negative skeptics like Locke and the sometimes Pollyanna-ish rationalists? Skeptics, who reject our ability to know much, if anything, about the true nature of whatever, offer a limiting and somewhat depressing outlook on our relationship to the universe. The rationalists, however, with their confidence in the power of reason, happily ignore the notion that our difficulty in agreeing on the meaning of things likely suggests something fundamentally limited about our ability to know the truth. As St Augustine, a fellow rationalist, blithely worded his take on the matter, “If we both see that what you say is true, and if we both see that what I say is true, where do we see this, I pray? Neither do I see it in you, nor do you see it in me: but we both see it in the unchangeable truth which is above our minds.” The meter and positivity of this quote are somewhat similar to Barney the dinosaur’s upbeat (and to anyone who had a toddler, maddening) theme song.

Kant sailed between this Scylla and Charybdis with his “Critique of Pure Reason”. Because of the ways our minds understand the world, he decided, there are a limited number of concepts that we all use to define a thing, such as whether the thing is one or many, whether it’s made of matter or not, etc. We all use these concepts to filter what our senses tell us about the thing in question to form a “schema”, an image of the thing. These schemas, he emphasized, though not false, are not really “the thing in itself”; they are not a vision of a completely true reality. They are, however, very useful, in that they let us interact with the universe in reasonable confidence that we’re not making complete fools of ourselves. Thus, he concluded, the skeptics are wrong in declaring that we have no way of agreeing about the objective or fundamental nature of anything, and those jaunty rationalists are put in their place as well. Kant likely set aside his nightly glass of barley water for a weak beer to celebrate.

So what does this have to do with the number of genes in our genome? Can’t we just sequence the genome, and Bob’s your uncle (or not, as the sequencing may reveal)? No, sequencing just gives us something like this:

CCCTACTTATAACATCTGGCCTAACTATATGGTTCCACTACCACTCTGTAGTTCTCCTATTTTTAGGATT

Genes, by any definition, are the regions of our genomes that can be transcribed into RNA, which is often then translated into protein, and just by looking at that sequence we can’t tell if it’s part of a gene or not. We can begin to identify genes by sequencing or detecting the RNAs found in a cell and then mapping those bits to the genome. How they map to the genomic DNA can be very complicated:

[Figure: complex genes]

Genes can overlap, be stitched together in a variety of ways, and contain vast regions of untranscribed genomic DNA that may be important for the overall architecture or packaging of that region of the chromosome. Multiple genes can also share regions of DNA and chromosomal architectural features critical for their regulation. Which of these bits of the genomic DNA are transcribed into RNA at any one time can be impossible to predict, as this depends on the action of other genes, whose regulation and transcription depend in turn on yet other genes, and sometimes on the very gene they are regulating, in a process known as feedback.

One way out of this mess is Locke’s super-dictionary approach, and to simply name every possible RNA, and be damned with trying to define what is or is not a gene. This was, to some degree, the approach of a group I was with, who were decidedly in the “big number” camp. Instead of using the term ‘gene’ we identified and catalogued “transcriptional units”, which we defined as clusters of transcripts that contain common cores of genetic information. These units could usually be associated with multiple different transcripts. We estimated that the total number of unique transcripts, once fully accounted for, would be greater than 75,000.
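The idea of a “transcriptional unit” can be illustrated with a minimal sketch. This is not the method the group actually used; it assumes, purely for illustration, that a “common core” means two transcripts on the same chromosome share at least one overlapping exon, and it ignores strand. The transcript names and coordinates are invented toy data.

```python
from collections import defaultdict

# Hypothetical toy data: transcript name -> (chromosome, list of exon (start, end)).
# Intervals are half-open: start is included, end is not.
transcripts = {
    "tx1": ("chr1", [(100, 200), (300, 400)]),
    "tx2": ("chr1", [(150, 200), (300, 450)]),
    "tx3": ("chr1", [(380, 420), (600, 700)]),
    "tx4": ("chr1", [(900, 1000)]),
    "tx5": ("chr2", [(100, 200)]),
}


def exons_overlap(a, b):
    """Return True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def share_core(t1, t2):
    """Assumed rule: same chromosome and at least one pair of overlapping exons."""
    chrom1, exons1 = t1
    chrom2, exons2 = t2
    if chrom1 != chrom2:
        return False
    return any(exons_overlap(a, b) for a in exons1 for b in exons2)


def transcriptional_units(transcripts):
    """Group transcripts into units with a simple union-find over shared cores."""
    parent = {name: name for name in transcripts}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = sorted(transcripts)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if share_core(transcripts[a], transcripts[b]):
                parent[find(a)] = find(b)

    units = defaultdict(list)
    for name in names:
        units[find(name)].append(name)
    return sorted(sorted(members) for members in units.values())


units = transcriptional_units(transcripts)
print(units)  # [['tx1', 'tx2', 'tx3'], ['tx4'], ['tx5']]
```

Five transcripts collapse into three units here. Because the grouping is transitive, two transcripts can land in the same unit through a third even if they do not overlap each other directly, which is one reason counts of this kind depend so heavily on the rule chosen.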

However, the “low number” faction ultimately won out. In contrast to the Lockean approach of the group I was with, our current line is one more attuned to the rationalists. We believe that we can discern an underlying logical reality in our genome. Given a new RNA transcript we won’t doubt that it should be assigned a gene name, and that this assignation has meaning. Two RNA transcripts might be a little different, one missing an exon that the other contains, or one having a longer end, but we can agree that they both belong to the Cox2 gene. With this rationalist approach we don’t need wishy-washy new terms like “transcriptional units”. Looking at the multiple transcripts in the diagram above, we have confidence that our understanding of genetic structure allows us to squash them all into a few genes. With this approach, we have fewer than 25,000 genes.

Both methods work, but each has problems. The Lockean approach, with its plethora of transcriptional units, ignores that there really does seem to be a logic to the structure of the genome, and that transcripts arising from a common section of the genomic DNA (i.e., a “gene”) have functionally similar roles. However the rationalist approach makes assumptions about genes and chromosomal structure that remain untested, and as we’ll see below, seems to engender over-confidence about our knowledge of the genes.

Another approach to classifying genes might give a nod to Kant. Kant’s insight was that we must first acknowledge that what we know is defined by how we know. In terms of genes, we know genomic sequence, we know where RNAs have been mapped to the chromosome, we know other regions of the genome with similar sequence (many genes have similar cousins elsewhere on our genomic DNA), and we have some understanding of how genes are regulated. What we can’t know (at least yet) is the complete regulatory story of a gene, and its complete set of functions (most genes have multiple roles). With these sets of concepts it’s reasonable to expect that we can identify a region of a chromosome that has the properties we attribute to a gene, and can assume with some confidence that others will also see this gene in a similar light. We may never know the gene “in itself”, as Kant would say, but we can assume that by declaring something to be a gene we are producing an idea that others will likely be able to recognize in a similar manner.

Therefore this Kantian approach might not change our current ~20K tabulation of genes, at least if we stick to protein-coding genes. If we include those sites that produce RNA transcripts of uncertain function, then the number could go up significantly. However, this perspective should certainly clarify our continuing confusion about gene names. Most genes have multiple names. The Human Genome Organisation has a committee (the HUGO Gene Nomenclature Committee) to decide “official” names, but it’s an uphill battle to get many of them used. Whenever I’m given a list of gene names, say for a gene expression study on a cancer cohort, the first task is to convert this list into HUGO names, which always involves some puzzling over a few cryptically named genes.

For example, the gene “officially” known as prostaglandin-endoperoxide synthase 2 is also (and more frequently) called cyclooxygenase 2 (or usually just “Cox2”, which is a lot shorter than the official name), as well as prostaglandin G/H synthase 2, glucocorticoid-regulated inflammatory prostaglandin G/H synthase, and prostaglandin H2 synthase 2. This is an enzyme with multiple enzymatic functions and many, many roles in the cell. By naming the gene after just one of its functions we are presuming a teleological perspective, that is, we think we know what this gene’s purpose is.

This confidence is a typical rationalist perspective. Of course we don’t know the gene’s purpose; in fact, most would argue that a gene doesn’t really have a purpose, per se, it simply functions. By naming it in this manner we are ignoring Kant’s insight. “Purpose” is not one of Kant’s basic concepts, like “quantity” and “substance”, therefore our innate understanding of genes does not encompass a teleological perspective. Any declarations about a gene’s ultimate function are therefore likely not to be universally shared among other scientists. Thus the rationalist approach gives us a host of confusing gene names, put forward by researchers with different assumptions and interests, and the resulting disorder wastes about an hour of a new bioinformatics project for me.
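That hour of tidying can be sketched as a lookup from whatever name appears in a list to one agreed symbol. This is only a sketch under stated assumptions: the alias table is hand-built and hypothetical, seeded with the names listed above for a single gene, whose official symbol is PTGS2. A real project would load the committee’s current table rather than typing names in by hand. The function names are invented for illustration.

```python
# Hypothetical, hand-built alias table: official symbol -> other names in use.
ALIASES = {
    "PTGS2": [
        "prostaglandin-endoperoxide synthase 2",
        "cyclooxygenase 2",
        "Cox2",
        "prostaglandin G/H synthase 2",
        "prostaglandin H2 synthase 2",
    ],
}


def normalize(name):
    """Lower-case a name and drop spaces, hyphens and other punctuation."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())


def build_lookup(aliases):
    """Map every normalized name (including the symbol itself) to its symbol."""
    lookup = {}
    for symbol, names in aliases.items():
        for name in [symbol, *names]:
            lookup[normalize(name)] = symbol
    return lookup


def to_official(names, lookup):
    """Split names into those resolved to a symbol and those left to puzzle over."""
    resolved, unresolved = {}, []
    for name in names:
        symbol = lookup.get(normalize(name))
        if symbol is None:
            unresolved.append(name)
        else:
            resolved[name] = symbol
    return resolved, unresolved


lookup = build_lookup(ALIASES)
resolved, unresolved = to_official(["COX-2", "Cyclooxygenase 2", "mystery1"], lookup)
print(resolved)    # {'COX-2': 'PTGS2', 'Cyclooxygenase 2': 'PTGS2'}
print(unresolved)  # ['mystery1']
```

The unresolved list is where the puzzling happens. A short alias can also belong to more than one gene, which a one-name-to-one-symbol table like this cannot represent.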

Grammarians will tell us that “none” can be either singular or plural, depending on how we want to use it, so the Black Knight needn’t have been worried about his wording. Our current definition of “gene” is similar: a gene can produce many different transcripts which can have many different functions. But the total number of genes, and what they should all be named? First decide whether you align yourself with Locke, Descartes, or Kant.
