TY  - JOUR
A1  - Casel, Katrin
A1  - Fernau, Henning
A1  - Gaspers, Serge
A1  - Gras, Benjamin
A1  - Schmid, Markus L.
T1  - On the complexity of the smallest grammar problem over fixed alphabets
JF  - Theory of computing systems
N2  - In the smallest grammar problem, we are given a word w and we want to compute a preferably small context-free grammar G for the singleton language {w} (where the size of a grammar is the sum of the sizes of its rules, and the size of a rule is measured by the length of its right side). It is known that, for unbounded alphabets, the decision variant of this problem is NP-hard and the optimisation variant does not allow a polynomial-time approximation scheme, unless P = NP. We settle the long-standing open problem whether these hardness results also hold for the more realistic case of a constant-size alphabet. More precisely, it is shown that the smallest grammar problem remains NP-complete (and its optimisation version is APX-hard), even if the alphabet is fixed and has size at least 17. The corresponding reduction is robust in the sense that it also works for an alternative size measure of grammars that is commonly used in the literature (i.e., a size measure also taking the number of rules into account), and it also allows us to conclude that even computing the number of rules required by a smallest grammar is a hard problem. On the other hand, if the number of nonterminals (or, equivalently, the number of rules) is bounded by a constant, then the smallest grammar problem can be solved in polynomial time, which is shown by encoding it as a problem on graphs with interval structure. However, treating the number of rules as a parameter (in terms of parameterised complexity) yields W[1]-hardness. Furthermore, we present an O(3^{|w|}) exact exponential-time algorithm, based on dynamic programming. These three main questions are also investigated for 1-level grammars, i.e., grammars for which only the start rule contains nonterminals on the right side; thus, investigating the impact of the "hierarchical depth" of grammars on the complexity of the smallest grammar problem. In this regard, we obtain for 1-level grammars similar, but slightly stronger results.
KW  - grammar-based compression
KW  - smallest grammar problem
KW  - straight-line programs
KW  - NP-completeness
KW  - exact exponential-time algorithms
Y1  - 2020
U6  - https://doi.org/10.1007/s00224-020-10013-w
SN  - 1432-4350
SN  - 1433-0490
VL  - 65
IS  - 2
SP  - 344
EP  - 409
PB  - Springer
CY  - New York
ER  -

TY  - JOUR
A1  - Klie, Sebastian
A1  - Nikoloski, Zoran
A1  - Selbig, Joachim
T1  - Biological cluster evaluation for gene function prediction
JF  - Journal of computational biology
N2  - Recent advances in high-throughput omics techniques render it possible to decode the function of genes by using the "guilt-by-association" principle on biologically meaningful clusters of gene expression data. However, the existing frameworks for biological evaluation of gene clusters are hindered by two bottleneck issues: (1) the choice of the number of clusters, and (2) the external measures which do not take into consideration the structure of the analyzed data and the ontology of the existing biological knowledge. Here, we address the identified bottlenecks by developing a novel framework that allows not only for biological evaluation of gene expression clusters based on existing structured knowledge, but also for prediction of putative gene functions. The proposed framework facilitates propagation of statistical significance at each of the following steps: (1) estimating the number of clusters, (2) evaluating the clusters in terms of novel external structural measures, (3) selecting an optimal clustering algorithm, and (4) predicting gene functions. The framework also includes a method for evaluation of gene clusters based on the structure of the employed ontology. Moreover, our method for obtaining a probabilistic range for the number of clusters is shown to be valid on synthetic data and available gene expression profiles from Saccharomyces cerevisiae. Finally, we propose a network-based approach for gene function prediction which relies on the clustering of optimal score and the employed ontology. Our approach effectively predicts gene function on the Saccharomyces cerevisiae data set and is also employed to obtain putative gene functions for an Arabidopsis thaliana data set.
KW  - algorithms
KW  - biochemical networks
KW  - combinatorics
KW  - computational molecular biology
KW  - databases
KW  - functional genomics
KW  - gene expression
KW  - NP-completeness
Y1  - 2014
U6  - https://doi.org/10.1089/cmb.2009.0129
SN  - 1066-5277
SN  - 1557-8666
VL  - 21
IS  - 6
SP  - 428
EP  - 445
PB  - Liebert
CY  - New Rochelle
ER  -