BigQuery, Google’s data warehouse as a service, is growing in popularity as an alternative to Amazon Redshift. If you’re considering working with BigQuery, you’ll find that accessing the data is quite straightforward. You can easily query huge amounts of data by running SQL queries in a number of ways: via BigQuery’s Web UI, CLI, or by integrating with your favorite BI tool. But to do that, you’ll need to move your data into BigQuery. How hard can that be?
Let’s take a simple case, a report in one of BigQuery’s recommended formats - CSV, which is generated by a backend service once per day. Looking at BigQuery’s documentation, loading the data would appear to be quite easy. Using the CLI, all you need to do is:
bq loadand specify your local file and the name of the table you want to create on BigQuery
bq lsand see that the table appears in the dataset, should look like this:
tableId Type ----------- ------- names2010 TABLE
bq showto see the resulting schema in BigQuery:
Last modified Schema Total Rows Total Bytes Expiration ----------------- ------------------- ------------ ------------- ------------ 13 Mar 15:31:00 |- name: string 34041 653855 - |- gender: string - |- count: integer
Repeat this for your other data sources (some of which you might connect using the API, with the help of a developer). Now invite your data scientists to BigQuery, sit back and wait for the insights.
But is that really all there is to it?
In our experience, no. "Naively" loading your data as per the above instructions might get you into serious trouble: don’t try it at home.
It’s not BigQuery’s fault - Google have done an amazing job at making it easy to load and analyze data on their platform. It’s the quirky nature of data flows which makes things much more tricky, even in the simplest of cases.
A Very Simple Case and How Things Can Go Wrong
Let’s take the simple use case we outlined above. You work with a backend system which generates customer data in CSV files. Every day the system generates a CSV showing new customers added on that day. Your job is to load these CSVs into BigQuery to facilitate analysis.
Stage Zero: Loading CSVs Manually
At first, you might load the CSVs manually, as we showed in the documentation snippet above.
What can go wrong: This becomes outdated every few days. You’ll want to automate this repetitive task...
Stage One: Automating Data Loading
To automate loading, you need a server that connects to the backend database, queries it for the latest customer data, creates a CSV and loads to BigQuery. An easy way would be using a post request. That will take a few days of a developer’s time.
You’ll also need to guarantee data quality - as soon as things are running automatically, you’ll need monitoring and an automatic way to handle errors (or at least stop the process and allow an operator to fix what went wrong). In BigQuery, errors are returned for an entire CSV - BigQuery reports that the CSV failed, but typically the problem is with a specific data row. For example, there are 999 rows with integers, which complies with the schema, but one row which contains strings. A smart automatic process should be able to remove that one row that caused the problem, and load the rest of the CSV, which would then succeed.
There are three levels (at least) at which you could handle possible errors to ensure data quality:
Level 1 - recognize that a CSV failed to load, get the error message and alert the operator. This allows the operator to manually examine the CSV, see what caused the failure, fix it and load it manually into BigQuery.
Level 2 - try to separate the erroneous rows from the good rows in the same CSV. Allow the good data to flow into BigQuery, and leave the problematic rows on the side, allowing the operator to handle them manually. This prevents frequent halting of the data pipeline and also makes the manual work much easier.
Level 3 - for certain errors, fix them automatically. For example, if the file failed because the timestamp had only hours and not minutes and seconds, it’s possible to append ":00:00" to the end and pass it through. Certain errors will repeat themselves in your data and over time you can predict and handle them without manual intervention.
What can go wrong: There are additional failure scenarios. What happens if the server fails to read the source data from the backend database; receives the wrong data or incomplete data; or encounters a failure reading the data from its local disk? These things will happen and the data pipeline will grind to a halt. Or worse, they can go unnoticed, and later on you’ll find gaps in your data.
Even a robust script will not handle all possible error scenarios, and it will often be necessary to go back and adapt the script to handle a new issue, or troubleshoot some issues manually. It’s a given that unexpected errors will occur and your script - now your data pipeline - will require ongoing monitoring and maintenance to ensure the data flows as expected.
Stage Two: Updating vs. Inserting
Let’s say our customer data includes purchases. Customers will come back from time to time to purchase more. This means the same customer row in your table will be updated over time with different numbers, such as total purchase amount or items purchased.
If so, the original query won’t be adequate. Instead of asking for new customers added today, you need to query for all customers modified today, including older customers that you already saved to BigQuery. BigQuery does not allow modifying of an existing table. Since BigQuery does not support deletes or upserts, you’ll need to do the following:
Load all modified customer data to BigQuery in a new table.
Rework your script to check which customers in the old table do not exist in the new table, and copy them over (so you don’t lose old customers which weren’t modified).
If you made changes to the data within BigQuery, the reconciliation gets much more complex, as there could be multiple changes both to the existing data and the newly ingested data, with conflicts. You need logic to correctly merge all changes.
Delete the old table.
What can go wrong: Anyone would be nervous copying and deleting entire data sets over and over again every day, as with any destructive operation. But this is commonly how things work in BigQuery and similar products. Any hiccup in the way the data is copied over or correlated with old records can lead to serious errors in the data, or even loss of the entire dataset.
Stage Three: Schema Change
Inevitably, one day or another, the schema will change. For example, a new data field might be added with the customer’s credit score. Or an existing field with the time of last purchase might be changed from just the date, to a full timestamp including minutes and seconds.
When any schema change occurs you need to:
Know that it happened. This requires monitoring and picking up even the smallest change to any field in the source data.
Change your script to handle it.
Transform the existing table in BigQuery to the new format - in the above example, by adding the new credit score field and converting the purchase time to the new format. As above, this will require building a completely new table and copying the data into it in the correct format. You can either do this manually every time there is a schema change, or write code that will do it automatically - but the latter will be complex and error prone.
What can go wrong:
When there are major schema changes, the data may not load to BigQuery at all, since it will be incompatible with the existing table. That will require urgent manual intervention, but at least you can fix the problem as it happens.
The more dangerous case is subtle changes to the schema, e.g. changes to the exact definition of the data field. These small changes may go unnoticed, until you realize that your data is not what you expect. Reports will start to show strange trends, numbers won’t make sense, and tracking down the problem will be difficult; not to mention going in and fixing the data after some time has gone by.
Stage Four: Scaling Up
We’re still in the simple case of a back-end system with CSVs (now generated automatically by querying the database).
Let’s say the business grows and there are now millions of customers. The data is still in the same simple format but there is a lot more of it. It might take hours for your script to execute the query, formulate the CSV and load it into BigQuery, and it will be more difficult to handle the data within BigQuery.
To handle much bigger data, you might need to pull the data from your back-end database in chunks. This requires modifying the script to specify how to break down the data.
What can go wrong: Any of the above, multiplied! Input/output failures, reconciling updates of the data, and schema changes will be now much more difficult to catch and fix, now that any change can happen in multiple files, and loaded to BigQuery in separate chunks.
Taking it One Step Further: Streaming Data in Real Time
What is interesting about the preceding discussion is that we didn’t do anything complex - only very simple table-based data format ingested one file at a time into BigQuery. Accounting for some very common real-world factors, clearly it takes serious work to build a data pipeline, that still carries significant risks of failure.
Loading files to data warehouses is becoming a thing of the past. More and more businesses are considering real-time streaming of data, either to improve the scalability and integrity of their ETL process, or to extract real-time insights from large amounts of data, without having to wait for the next data load.
How much more difficult would it be to do streaming ETL into BigQuery as opposed to simple CSV loading?
If you look at BigQuery’s documentation for data streaming, the first part is straightforward - all you need to do is break your data into micro-batches, and stream them into BigQuery one record at a time using the
However, stream loading is much more error prone than loading of entire data files. When you stream data into BigQuery, each row of data might fail, whereas an entire CSV simply fails or succeeds in its entirety. In our experience, it’s common to see errors such as "service unavailable". In some cases the errors are consistent and you’ll need to put aside the data and retry at another time. Additionally, according to Google, “You might have to retry an insert because there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery.”
Much of the 2,000-word documentation page discusses the problem of consistency and deduplication. Google suggests supplying an
insertId for each inserted row, and then in case of an error, when you retry inserting the row, "BigQuery uses the
insertId property to de-duplicate your data on a best effort basis." There is no guarantee that deduplication will be successful in all cases.
This is a complex issue because, on the one hand, streaming inserts are not guaranteed to succeed and you’ll need logic to retry inserts on a regular basis. On the other hand, when you inevitably create duplicates, BigQuery will "try" to resolve them but does not guarantee that either.
It is possible to handle duplicates manually. However, to make this possible, it’s necessary to set up your data streaming in the following way, which will typically require changes to your data. The following is quoted from the same documentation page:
Create two tables with an identical schema. The first table is for the reconciled data, and the second table is for the real-time, unreconciled data.
On the client side, maintain a transactional data store for records.
insertAll()requests for these records. The
insertAll()request should specify the real-time, unreconciled table as the destination table.
You can then manually remove duplicates, and combine the new table containing the streamed data with the older reconciled data. The procedure for manually removing duplicates involves running the following query to check for duplicates (assuming you provided an
insertId for each row):
SELECT MAX(count) FROM( SELECT [ID_COLUMN], count(*) as count FROM `[TABLE_NAME]` GROUP BY [ID_COLUMN])
If the result is greater than 1, indicating that duplicates exist, you’ll specify a destination table and perform the following query:
SELECT * EXCEPT(row_number) FROM ( SELECT *, ROW_NUMBER() OVER (PARTITION BY [ID_COLUMN]) row_number FROM `[TABLE_NAME]`) WHERE row_number = 1
This yields a new table that is deduplicated, and you should be able to merge it with the original reconciled data.
None of these steps is extremely difficult on its own, but managing this on an ongoing basis for large quantities of streaming data can be a nightmare.
Especially in cases where data integrity is important and each row of data has significance, building a streaming data pipeline yourself can be daunting, far beyond the first case we discussed of a simple CSV.
Other Complex Cases
Beyond streaming, there are other use cases that will make things more difficult than our simple CSV case:
Transforming data from non-tabular formats to BigQuery-friendly formats like CSV, TSV or JSON - see our post on Building a Professional Grade Data Pipeline, which uses the example of log files, that need to be parsed and converted to a format BigQuery can understand.
Multiple data sources - in most organizations, there will be many very different data sources loaded into the data warehouse, each with its own schema, data peculiarities, frequency of change, etc. Not to mention the work involved in integrating with each of the data sources using an API or other connector. The simple data pipeline we illustrated above multiplies in complexity with each additional data source - especially when there is a requirement to combine, correlate or compare different data sources on the fly.
ETL - what if you need to transform the data as you’re loading it? Check out the BigQuery tutorial on ETL to see what is involved. If you set it up with Google’s Cloud Dataflow platform, this is what the data pipeline will look like:
Just to illustrate that it can get much harder to keep the data flowing.
Get a Streaming Data Pipeline Out of the Box with Alooma
Recognizing how difficult it can be to set up even simple data pipelines, we created a world-class data pipeline that anyone could use. It includes:
Existing integrations with all popular data sources and data warehouses, including BigQuery. Letting you easily support large, multiple data sources.
Architected for "exactly once processing", with our innovative Restream Queue that catches errors and allows you to view, fix them and load them into BigQuery- never losing an event.
Managed schema changes - automatic handling of schema changes. Alooma can either re-structure tables in real time to make most schema changes completely transparent, or notify an operator to make the required changes manually.
Built-in streaming - data is streamed to BigQuery by default, with robust handling of errors and duplication.
Built-in ETL - provide your own Python code and we’ll execute it to rationalize and transform the data on the fly. No need to set up complex ETL flows in a tool like Google Cloud Dataflow.
Give us a try before starting building your own pipeline to load data into BigQuery. We guarantee you will save yourself some sleepless nights, and move onto to the juicy insights much faster.