Some 15 years ago, ontologies were the big thing: financing an EU project was easy if ontologies and semantics were mentioned among its primary goals.
That time is now gone, except in biology, where ontologies are still used, often in a way quite different from what they were originally intended for in the good old days of the “Semantic Web”.
More specifically, a common research activity in biology is to measure the expression of proteins in two situations, for example in healthy people and in patients. The difference between the two sets of measurements is then assessed, and the proteins, and the genes encoding them, that are activated in the disease condition are suspected to be possible targets for a new drug.
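The comparison above can be sketched as a per-gene two-sample test. This is only an illustration, not any specific pipeline: the gene names and expression values are invented, and real analyses use dedicated packages rather than hand-rolled statistics.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t-statistic for two independent samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

# Toy expression measurements (arbitrary units); gene names are made up.
healthy = {"GENE_A": [5.1, 4.9, 5.0, 5.2], "GENE_B": [2.0, 2.1, 1.9, 2.2]}
patients = {"GENE_A": [5.0, 5.1, 4.8, 5.3], "GENE_B": [4.0, 4.2, 3.9, 4.1]}

# Rank genes by absolute t-statistic: the larger it is, the more
# "differentially expressed" the gene is between the two conditions.
scores = {g: welch_t(patients[g], healthy[g]) for g in healthy}
for gene, t in sorted(scores.items(), key=lambda kv: -abs(kv[1])):
    print(gene, round(t, 2))
```

Here GENE_B comes out on top because its mean expression clearly shifts between the two groups, while GENE_A barely moves.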
Gene differential expression is the biological counterpart of machine learning in CS: a one-size-fits-all solving methodology.
Indeed those differentially expressed genes are rarely usable targets for a new drug, as each protein and gene is implicated in so many pathways. So instead of refining the experiment to find genes implicated in fewer pathways, a gene “enrichment” step is launched. “Enrichment” involves querying an ontology database to obtain a list of genes/proteins related to the differentially expressed ones, which are hopefully easier targets for putative drugs.
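A minimal sketch of this “enrichment” lookup, assuming a toy in-memory annotation table rather than a real ontology service (real pipelines query Gene Ontology or UniProt through their APIs); all gene and pathway names are invented:

```python
# Toy annotation table: gene -> set of pathway identifiers.
# In practice this comes from an ontology such as Gene Ontology or UniProt.
annotations = {
    "GENE_A": {"PATH_1", "PATH_2"},
    "GENE_B": {"PATH_2"},
    "GENE_C": {"PATH_2", "PATH_3"},
    "GENE_D": {"PATH_3"},
}

def enrich(de_genes, annotations):
    """Return genes sharing at least one pathway with a differentially
    expressed gene -- the candidates an "enrichment" step surfaces."""
    hit_paths = set().union(*(annotations[g] for g in de_genes))
    return {g for g, paths in annotations.items()
            if paths & hit_paths} - set(de_genes)

related = enrich({"GENE_B"}, annotations)
print(sorted(related))  # the other genes annotated to PATH_2
```

Note how the output grows with the richness of the annotation table: the more pathways each gene is annotated with, the more “related” genes come back, which is exactly the incentive problem discussed below.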
Here there are two problems.
* The first is the choice of the ontology: there is, for example, an excellent one named UniProt, but awful choices such as Gene Ontology are often preferred, because Gene Ontology gives dozens of results where UniProt gives one. Indeed, if you are in a hurry and get only one result after “enrichment”, you are not happy, so the incentive to switch to Gene Ontology is strong.
* The second problem arises when the result set comprises several hundred genes/proteins. Obviously this is useless, but instead of trying to design a better experiment, people thought that some statistical criterion would sort the result set and extract a credible list of genes. This led to the design of parameter-free tools such as GSEA. Very roughly, these tools compare the statistical properties of the result set with those of a random distribution of genes; if the two are very different, the conclusion is that the result set is not random, which does not tell much more than that. This is similar and related to the criticism of Fisher’s test, p-values and the null hypothesis, a complicated domain of knowledge.
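The “compare against random gene sets” idea can be sketched as a simple permutation test. This is a toy illustration of the principle, not GSEA itself (which uses a rank-based enrichment statistic); all scores and set memberships are simulated:

```python
import random
import statistics

random.seed(0)

# Per-gene scores from a (simulated) differential-expression analysis.
scores = {f"G{i}": random.gauss(0, 1) for i in range(1000)}
# Give an invented "result set" an artificially shifted score.
result_set = [f"G{i}" for i in range(50)]
for g in result_set:
    scores[g] += 1.0

observed = statistics.mean(scores[g] for g in result_set)

# Null distribution: mean score of random gene sets of the same size.
genes = list(scores)
null = [statistics.mean(scores[g] for g in random.sample(genes, len(result_set)))
        for _ in range(2000)]

# Empirical p-value: how often a random set scores at least as high.
p = sum(n >= observed for n in null) / len(null)
print(f"observed={observed:.2f}, p={p:.4f}")
```

A tiny p-value here only says the result set does not look like a random draw; as the text points out, it says nothing about which genes in the set are biologically meaningful.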
These tools are very smart, but the best tool cannot provide meaningful answers from garbage, so disputes soon arose about the choice of the parameter-free methodology, instead of questioning the dubious practices that made such tools needed in the first place.