View Optimizing Queries
These suggestions are written for a technical audience and covers in detail what types of database queries are optimized for Infobright.
Infobright is a data storage engine with proprietary compression, storage, and retrieval methods. It leverages many features of MySQL including a support for variety widely-used APIs (e.g. SQL, ODBC, JDBC).
There are specific SQL commands that have been optimized for Infobright, meaning they use the Knowledge Grid to resolve queries. For other MySQL functions, Infobright will resolve them by defaulting to the MySQL optimizer in order to resolve the query. This can be considerably slower as the MySQL optimizer is row-oriented, is not able to use the Knowledge Grid and must decompress significantly more data. In these cases the query will return with a warning. Use SHOW WARNINGS; to see the message:
Query syntax not implemented in Infobright, executed by MySQL engine.
There are several things that can be done to optimize queries within Infobright:
Use More Efficient Data Types
The OR consistently has poorer performance because it essentially creates two separate intermediate structures and then gathers results together, omitting repetitions. The latter operation is very time-consuming. If possible, OR functions should be replaced by IN or UNION ALL (if there are no repetitions possible):
SELECT … WHERE a=5 OR a=10 --->; SELECT … WHERE a IN (5,10)
SELECT … WHERE a=5 OR a>10 --->; (SELECT … WHERE a=5) UNION ALL (SELECT … WHERE<pre> a>10)
UNION is slower than UNION ALL because it must detect and omit repetitions in the result (for a very large output, it uses operations like sorting etc.). If repetitions are not harmful, or if it is already known that the result does not contain repetitions, UNION ALL should be used.
Leveraging Columnar Architecture
Infobright is a columnar database, which means that access to every column of data is independent. As a consequence, it is important how many fields of one row of data are being accessed. For optimal performance, you should avoid accessing columns that not being used in the query results. Instead of SELECT * FROM… you should rather select only the necessary columns.
Limiting Tables Used in Results
The Infobright engine may optimize join execution if there is a possibility that one of the involved tables is not used in further phases of execution, like SELECT list, GROUP BY, ORDER BY. E.g., assuming t2 is a large fact table:
SELECT t1.a, sum(t2.b) FROM t1 JOIN t2 ON t1.key=t2.key
WHERE t1.x > 0 AND t2.y = 5
GROUP BY t1.a;
This query may be sometimes transformed into:
SELECT t1copy.a, sum(temp_tab.sum2) FROM
( SELECT t2.key AS k2, sum(t2.b) AS sum2 FROM t1 JOIN t2 ON t1.key=t2.key
WHERE t1.x > 0 AND t2.y = 5
GROUP BY t2.key ) temp_tab, t1 t1copy
WHERE temp_tab.k2 = t1copy.key
GROUP BY t1copy.a;
In such cases the inner query produces the grouping, which is nearly what is needed, but it is defined in terms of t2 only. Then the grouping is refined by externally transforming it into grouping by t1. Sometimes such transformations may improve performance by an order of magnitude, but it works only for aggregations which may be combined from parts (like sum).
A simpler example of transformation:
SELECT t1.a, t2.val FROM t1 JOIN t2 ON t1.a=t2.b WHERE t1.c = 5;
and a faster equivalent (using only t2 on SELECT list):
SELECT t2.b, t2.val FROM t1 JOIN t2 ON t1.a=t2.b WHERE t1.c = 5;