About 2 weeks ago Oracle published the MySQL 5.7.7-labs-json version which includes a very interesting feature called “Generated columns” (also know as Virtual or Computed columns). MariaDB has a similar feature as well: Virtual (Computed) Columns.
The idea is very simple: if we store a column
`FlightDate` date
in our table we may want to filter or group by year(FlightDate), month(FlightDate) or even dayofweek(FlightDate). The “brute-force” approach: use the above Date and Time MySQL functions in the query; however it will prevent MySQL from using an index (see below). Generated columns will allow you to declare a “Virtual”, non-stored column which is computed based on the existing field; you can then add index on that virtual column, so the query will use that index.
Here is the original example:
CREATE TABLE `ontime` ( `id` int(11) NOT NULL AUTO_INCREMENT, `FlightDate` date DEFAULT NULL, `Carrier` char(2) DEFAULT NULL, `OriginAirportID` int(11) DEFAULT NULL, `OriginCityName` varchar(100) DEFAULT NULL, `OriginState` char(2) DEFAULT NULL, `DestAirportID` int(11) DEFAULT NULL, `DestCityName` varchar(100) DEFAULT NULL, `DestState` char(2) DEFAULT NULL, `DepDelayMinutes` int(11) DEFAULT NULL, `ArrDelayMinutes` int(11) DEFAULT NULL, `Cancelled` tinyint(4) DEFAULT NULL, `CancellationCode` char(1) DEFAULT NULL, `Diverted` tinyint(4) DEFAULT NULL, PRIMARY KEY (`id`), KEY `FlightDate` (`FlightDate`) ) ENGINE=InnoDB
Now I want to find all flights on Sundays (in 2013) and group by airline.
mysql> EXPLAIN SELECT carrier, count(*) FROM ontime_sm WHERE dayofweek(FlightDate) = 7 group by carrier *************************** 1. row *************************** id: 1 select_type: SIMPLE table: ontime_sm type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 151253427 Extra: Using where; Using temporary; Using filesort Results: 32 rows in set (1 min 57.93 sec)
The problem here is: MySQL will not be able to use index when you use a function which will “extract” something from the column. The standard approach is to “materialize” the column:
ALTER TABLE ontime_sm ADD Flight_dayofweek tinyint NOT NULL;
Then we will need to load data into that by running “UPDATE ontime_sm SET Flight_dayofweek = dayofweek(flight_date)”. After that we will also need to change the application to support that additional column or use a trigger to update the column. Here is the trigger example:
CREATE DEFINER = CURRENT_USER TRIGGER ontime_insert BEFORE INSERT ON ontime_sm_triggers FOR EACH ROW SET NEW.Flight_dayofweek = dayofweek(NEW.FlightDate);
One problem with the trigger is that it is slow. In my simple example it took almost 2x slower to “copy” the table using “insert into ontime_sm_copy select * from ontime_sm” when the trigger was on.
The Generated Columns from MySQL 5.7.7-labs-json version (only this version supports it on the time of writing) solves this problem. Here is the example which demonstrate its use:
CREATE TABLE `ontime_sm_virtual` ( `id` int(11) NOT NULL AUTO_INCREMENT, `FlightDate` date DEFAULT NULL, `Carrier` char(2) DEFAULT NULL, `OriginAirportID` int(11) DEFAULT NULL, `OriginCityName` varchar(100) DEFAULT NULL, `OriginState` char(2) DEFAULT NULL, `DestAirportID` int(11) DEFAULT NULL, `DestCityName` varchar(100) DEFAULT NULL, `DestState` char(2) DEFAULT NULL, `DepDelayMinutes` int(11) DEFAULT NULL, `ArrDelayMinutes` int(11) DEFAULT NULL, `Cancelled` tinyint(4) DEFAULT NULL, `CancellationCode` char(1) DEFAULT NULL, `Diverted` tinyint(4) DEFAULT NULL, `CRSElapsedTime` int(11) DEFAULT NULL, `ActualElapsedTime` int(11) DEFAULT NULL, `AirTime` int(11) DEFAULT NULL, `Flights` int(11) DEFAULT NULL, `Distance` int(11) DEFAULT NULL, `Flight_dayofweek` tinyint(4) GENERATED ALWAYS AS (dayofweek(FlightDate)) VIRTUAL, PRIMARY KEY (`id`), KEY `Flight_dayofweek` (`Flight_dayofweek`), ) ENGINE=InnoDB
Here we add Flight_dayofweek
tinyint(4) GENERATED ALWAYS AS (dayofweek(FlightDate)) VIRTUAL column and index it.
Now MySQL can use this index:
mysql> EXPLAIN SELECT carrier, count(*) FROM ontime_sm_virtual WHERE Flight_dayofweek = 7 group by carrier *************************** 1. row *************************** id: 1 select_type: SIMPLE table: ontime_sm_virtual partitions: NULL type: ref possible_keys: Flight_dayofweek key: Flight_dayofweek key_len: 2 ref: const rows: 165409 filtered: 100.00 Extra: Using where; Using temporary; Using filesort
To further increase performance of this query we want to add a combined index on (Flight_dayofweek, carrier) so MySQL will avoid creating temporary table. However it is not currently supported:
mysql> alter table ontime_sm_virtual add key comb(Flight_dayofweek, carrier); ERROR 3105 (HY000): 'Virtual generated column combines with other columns to be indexed together' is not supported for generated columns.
We can add an index on 2 generated columns thou, which is good. So a trick here will be to create a “dummy” virtual column on “carrier” and index 2 of those columns:
mysql> alter table ontime_sm_virtual add Carrier_virtual char(2) GENERATED ALWAYS AS (Carrier) VIRTUAL; Query OK, 0 rows affected (0.43 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> alter table ontime_sm_virtual add key comb(Flight_dayofweek, Carrier_virtual); Query OK, 999999 rows affected (36.79 sec) Records: 999999 Duplicates: 0 Warnings: 0 mysql> EXPLAIN SELECT Carrier_virtual, count(*) FROM ontime_sm_virtual WHERE Flight_dayofweek = 7 group by Carrier_virtual *************************** 1. row *************************** id: 1 select_type: SIMPLE table: ontime_sm_virtual partitions: NULL type: ref possible_keys: Flight_dayofweek,comb key: comb key_len: 2 ref: const rows: 141223 filtered: 100.00 Extra: Using where; Using index
Now MySQL will use an index and completely avoid the filesort.
The last, but not the least: loading data to the table with generated columns is significantly faster compared to loading it into the same table with triggers:
mysql> insert into ontime_sm_triggers (id, YearD, FlightDate, Carrier, OriginAirportID, OriginCityName, OriginState, DestAirportID, DestCityName, DestState, DepDelayMinutes, ArrDelayMinutes, Cancelled, CancellationCode,Diverted, CRSElapsedTime, ActualElapsedTime, AirTime, Flights, Distance) select * from ontime_sm; Query OK, 999999 rows affected (27.86 sec) Records: 999999 Duplicates: 0 Warnings: 0 mysql> insert into ontime_sm_virtual (id, YearD, FlightDate, Carrier, OriginAirportID, OriginCityName, OriginState, DestAirportID, DestCityName, DestState, DepDelayMinutes, ArrDelayMinutes, Cancelled, CancellationCode,Diverted, CRSElapsedTime, ActualElapsedTime, AirTime, Flights, Distance) select * from ontime_sm; Query OK, 999999 rows affected (16.29 sec) Records: 999999 Duplicates: 0 Warnings: 0
Now the big disappointment: all operations with generated columns are not online right now.
mysql> alter table ontime_sm_virtual add Flight_year year GENERATED ALWAYS AS (year(FlightDate)) VIRTUAL, add key (Flight_year), lock=NONE; ERROR 1846 (0A000): LOCK=NONE is not supported. Reason: '%s' is not supported for generated columns.. Try LOCK=SHARED. mysql> alter table ontime_sm_virtual add key (Flight_year), lock=NONE; ERROR 1846 (0A000): LOCK=NONE is not supported. Reason: '%s' is not supported for generated columns.. Try LOCK=SHARED.
I hope it will be fixed in the future releases.
Conclusion
Generated columns feature is very useful. Imagine an ability to add a column + index for any “logical” piece of data without actually duplicating the data. And this can be any function: date/time/calendar, text (extract(), reverse(), metaphone()) or anything else. I hope this feature will be available in MySQL 5.7 GA. Finally, I wish adding a generated column and index can be online (it is not right now).
More information:
- Original blog post (older implementation) by MySQL server team about Generated Columns in MySQL 5.7.5
- New implementation of the virtual column: WL#8114: Don’t store virtual generated columns in database
The post Generated (Virtual) Columns in MySQL 5.7 (labs) appeared first on MySQL Performance Blog.