One of the biggest advantages of utilizing BigQuery for analytics instead of using a third-party off-the-shelf analytics tool (such as Google Analytics, Mixpanel, Localytics, etc.) is the ability to join multiple data sources. For example: joining analytic events coming from your mobile app with user data stored in your MySQL database. This allows you to answer deeper questions like “Which of my customers are spending the most per order?” or “What user behavior do site visitors exhibit before making their first purchase?”
We will begin with a high-level overview and then lean toward the technical side, since this is a topic of somewhat technical complexity.
At a glance, this article covers:
- Why companies replicate their MySQL databases to Google BigQuery
- The different approaches for MySQL replication
- A detailed real-world example implementation of MySQL to Google BigQuery replication
- Common pitfalls to avoid when building your replication solution
Why replicate MySQL to Google BigQuery?
It’s very common these days to have multiple replicas of your data in several different types of data stores. That’s because different parts of your application or business use data in different ways. By replicating your data to multiple data stores, each optimized for a different use case, you are able to optimize the performance of your system.
Perhaps the most common and straightforward example is read-replicas for your transactional database. Many applications also store their transactional data in ElasticSearch, for example, to enjoy search speed and the rich features of ElasticSearch. Others store the raw data in S3 for backup, or Memcached for caching, etc.
Back to our use case, data teams tend to replicate their MySQL data to BigQuery for three main reasons:
- Joining multiple data sources - The main motivation to move to BigQuery is to be able to join and intersect multiple data sources. For example, joining billing data and customer service data with user behavior data. This kind of analysis allows you to ask questions such as: “Which features lead to higher retention rates and more conversions to paid plans?”, or “What user behaviors increase churn?”, and to perform predictions such as: “Which users are likely to become paying customers?”
- Analytical queries performance - While MySQL is optimized for Online Transaction Processing (OLTP) workloads, BigQuery is geared towards aggregating and analyzing large amounts of data by taking advantage of its columnar data store and massive parallel processing architecture. These kinds of analytical queries tend to perform very poorly in MySQL, and risk overloading and blocking your operational transactional database. By replicating MySQL to Google BigQuery, developers are able to improve their analytical query speed by many orders of magnitude. In addition, separating the operational database from the analytical database also means that analytical queries will never affect the performance of the production database.
- Near real-time access - Another key advantage of replicating MySQL production data into BigQuery is the near-real time aspect of it. By using the Google BigQuery streaming API, new data that’s written to the binary log in MySQL typically will be available for querying in BigQuery within several minutes. This is great for enabling data analysis on BigQuery data in virtually real time.
Ok, so now that we know why you would want to replicate MySQL data to Google BigQuery, let’s talk about how.
Different approaches for MySQL replication
There are several main approaches for database replication. In this section we’ll describe the three most common approaches and compare the advantages and drawbacks of each of them.
Full dump and load
In this approach, periodically, the MySQL tables are fully dumped, the corresponding Google BigQuery tables are dropped and recreated, and the full dumps of the MySQL tables are loaded to BigQuery.
The advantage of this approach is that it is very simple and straightforward. The disadvantage is that dumps are very resource intensive, so that it can slow down the MySQL database during the table dumps. To have a consistent dump of your table you might even need to lock your database during the dump. In addition, implementing a full dump on very large tables incurs high latency.
Therefore, this approach is mostly recommended if you have very small tables.
If you do choose to use this approach, it would be best to use a replica instead of your master database, since using the latter might block your database and interfere with your production application.
Incremental dump and load
In this approach, the MySQL table is periodically queried for updates since the last query. The updates are then loaded into Google BigQuery and a consolidation query reconstructs the original table.
The main advantage of this approach over a full dump and load is that in each iteration, only the updates are extracted and loaded. Typically this significantly reduces the load, since normally only a small portion of the database’s rows are updated between two iterations.
Perhaps the most significant disadvantage of this approach is that it cannot capture row deletions, since deleted rows will never be returned in a query. Similarly, table alterations (such as schema changes: added or removed columns, etc.) are also not captured unless actively queried in each iteration. Another minor disadvantage is that it requires the replicated tables to contain an "updated at" column that the periodical query can use. The necessary consolidation step also adds some complexity to the implementation.
Similarly to the full dump and load approach, it’s best to use a replica and not the master database for the incremental dump and load to avoid blocking the master database.
This approach (sometimes referred to as change data capture - CDC) utilizes MySQL’s binlog. MySQL’s binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation, as well as Data Definition Language (DDL) data that was performed by the database. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.
In our opinion, this is the optimal approach. It is the only method that allows for near real-time replication at scale. Its main advantages are that it doesn’t lock or affect the performance of the database, it supports both deletions and table alterations (and therefore enables exactly one-to-one replication), it doesn’t have any requirements over the structure of the tables, and it is very coherent with the stream processing paradigm that allows transformations and near real-time performance. Not surprisingly, this is the most complex approach to implement, and this is its lone disadvantage.
Implementing binlog replication
In this section we provide a detailed real-world example of how to implement MySQL to Google BigQuery replication using MySQL’s binlog.
We assume that you already created your Google BigQuery cluster and that you know how to load data to Google BigQuery. If not, check out Google BigQuery’s Quick Start guide.
We will use an example table to demonstrate the different steps of the suggested implementation. Let’s imagine that we have a
products MySQL table that we would like to replicate in Google BigQuery and it has the following structure:
Each of the products in our example
products table has an
name and a
price. We would like to reconstruct this table in our BigQuery data warehouse on a continuous basis.
General overview of the process
- Step 1 - Creating the BigQuery tables
- Step 2 - Initial dump and load - Captures the initial state of the table
- Step 3 - Continuous binlog streaming - Captures the updates to the table continuously
- Step 4 - Consolidation - Updates the replicated table using the binlog and the previous version of the replicated table
Step 1 - Creating the BigQuery tables
Before pulling data from MySQL and loading it into Google BigQuery, you will need to create the target tables in BigQuery, since some properties such as column types and table keys can only be defined one time.
Two tables will need to be created: the first will hold the latest reconstructed state of the original MySQL table; the second will hold all the updates to the original MySQL table since the last consolidation. Both tables will have the same structure so that they can be joined for the consolidation step. It also means that the reconstructed table might hold a few columns that don’t exist in the original table, but these should be disregarded in analysis.
There are two main types of discrepancies between the MySQL and BigQuery tables that you will have to consider: column type discrepancies and key discrepancies. Since column types and keys cannot be changed after table creation in Google BigQuery, you need to define column types and table keys in advance upon table creation.
Column type conversion
MySQL and BigQuery have slightly different column types. BigQuery has an equivalent for the most common column types, and most other columns can be converted to a BigQuery column type.
For example: MySQL’s
FLOAT are direct equivalents of
FLOAT in Google BigQuery.
You can find a full list of suggested column type conversions here: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
The main discrepancy between MySQL and Google BigQuery regarding the primary key, is that in BigQuery, the primary key constraint is not enforced. Thus, two rows can have an identical primary key. Primary keys are only used as a hint by the Google BigQuery query planner to optimize your queries. Primary keys should be enforced by your ETL process.
Step 2- Initial dump and load
After you created the destination BigQuery table, you need to perform the initial dump from MySQL and load it into your BigQuery table. All updates to the table thereafter will be executed using data coming from the binlog stream.
First, you need to make sure that your database is configured for row-based binlog (we’ll talk about the different binlog formats and explain why we chose row-based later in this post). It’s important to define the binlog format before the initial dump because in order to avoid data loss or inconsistencies you would want to know exactly at what moment you took the table’s "snapshot" and have all of the updates that followed.
You can find detailed instructions for configuring your MySQL for row-based binlog in Alooma’s MySQL replication documentation.
Backup the relevant tables and get the binlog position. This is required, because the binlog replication sends events from a specific point in time. This step requires a user with the following permissions:
SELECT, LOCK TABLES.
Lock tables. Important: In order to be able to release the lock, DO NOT close the session when locking the tables
mysql> FLUSH TABLES WITH READ LOCK;
In a different session, dump your table to disk, e.g. using mysqldump, and get the current binlog position by executing
mysql> SHOW MASTER STATUS;
Save the output, which should look like:
+------------------+----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+----------+--------------+------------------+-------------------+ | mysql-bin.000003 | 1507 | | | | +------------------+----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec)
- From the session where you locked the tables, unlock tables
mysql> UNLOCK TABLES;
Now that you have your initial dump, you load it to Google BigQuery.
Step 3 - Continuous binlog streaming
The binlog holds an ordered log of all the updates of the MySQL table. You can stream it continuously to a staging table in BigQuery and use it to recreate the original MySQL table in BigQuery.
MySQL has three binlog formats:
STATEMENT- every query that refers to the table is documented in a binlog line. Even if multiple table rows are affected by a query, only one line will be written to the binlog. For example, a single query that updates all the rows of the table will appear as a single entry in the binlog.
ROW- every update to a table row is documented in a binlog line. If multiple rows are affected by a single query, a line will be written to the binlog for every affected table row. For example: a single query that updates all the rows of the table will create an entry for each row in the binlog.
MIXED- both statements and row updates are written to the binlog.
We recommend using the
ROW format since it allows you to directly stream and load the binlog to the staging table in BigQuery. The reason is that if you want to reconstruct the state of a certain row with statement-based binlog, you have to keep track of the state of the entire table. With a row-based binlog you only need to track the state of individual rows.
For example the staging table for our
products table could look like this:
Step 4 - Consolidation
As we mentioned above, the consolidation table stores the latest reconstructed state of your MySQL table, while the staging table stores all the updates ever since. Every few minutes (or hours, depending on your specific needs and the size of the table) Alooma runs an upsert query on BigQuery that updates the reconstructed table using the staging table, and clears the staging table.
The consolidation process first joins the consolidated table together with the staging table, and then extracts the latest state of each row. The result is a reconstructed one-to-one version of the original MySQL table, since every row in the new table represents the most recently updated row of the original table in MySQL.
Alooma will also build near real-time views that are comprised of the consolidation queries.
Common pitfalls to avoid
There are many pitfalls to avoid when building a MySQL to Google BigQuery replication pipeline and we’ve already touched on some of them above. Below is a more exhaustive list we gathered.
- Conversion between MySQL and BigQuery Type - As mentioned above, MySQL and Google BigQuery have slightly different column types. Column types should be chosen wisely to avoid data loss.
- Update of a primary key column of a row in MySQL - If an update query modifies the primary key column of a certain row, the consolidation process described above will not work, as it will not identify that the update relates to the same row (since we partition by the primary key). We might discuss how we handle this case at a later post.
- Table structure modifications - When columns are added, deleted, or modified in the original MySQL table, the BigQuery queries need to adapt accordingly to avoid breaking the pipeline and causing data loss.
- Logs arriving out of order - Keeping the order of the logs is not always easy at large scale, but is critical in order to correctly replicate the table. Your data pipeline should be built to keep ordering as much as possible but also handle cases where logs come out of order. For example, a row update might arrive after a delete (even though the original order is update and then delete). If not handled correctly it might cause the deleted row to reappear in the replicated table in BigQuery.
- Downtime when consolidating - The consolidation process can take a couple of minutes (or even more depending on the size of the table and the binlog updates). How can you minimize the downtime for the consumers who query the BigQuery table?
- Deleting the binlog - While performing consolidation, records continue being loaded to the staging table. When you clear the staging table, you need to make sure to not delete records that haven’t been consolidated yet.
- Error handling - Certain components of your pipelines will have errors from time to time. Your pipeline needs to be able to handle network errors, BigQuery loading errors, and more without data loss. You should also set up monitoring that alerts you when such errors occur.
Should I build my own solution or use a third-party solution?
It is definitely possible to build your own MySQL to BigQuery solution. Especially, it makes sense to do so when starting out with small data volumes in a full dump and load approach. However, if your solution requires either high throughput, high availability, low latency or frequent schema changes, then you’re probably better off using a third-party solution.
As a rule of thumb, once you have tables that are bigger than 1GB or once you have more than a couple million daily updates to your tables, third party solutions start to become the more attractive alternative. For instance, when dealing with tons of data flowing through your system, the Alooma Data Pipeline Restream Queue feature ensures that you never lose an event and all errors are caught in order to be re-streamed, saving valuable engineering time.