How do I go about building a business intelligence app in Python?
General outline of the process:
Extract data from your sources
Transform data (cleanup, normalize schemas, etc)
Load to a data warehouse or data lake
Visualize the data, or otherwise act upon it
Let’s dive into each of these.
This part depends a lot on where your data is currently stored. For example, if it is currently stored in Amazon S3, you can use Boto to connect to it, list the files and download the new ones to insert them into the flow. If your data is stored in a database, use appropriate packages to connect to it (e.g., mysql-replication, psycopg2, etc). To import Salesforce data, use simple-salesforce. Python is great for this task, because it is flexible, and has lots of open source packages that do some of the work for you.
Again, this depends a lot on your specific use-case, but Python can do a great job here. You’ll have to maintain code that handles each input type (database table, salesforce object, type of files in S3, etc) and fix data errors you encounter in that input. Obviously, this part is something that evolves over time.
Depending on where you want to store the data, write a program that loads the transformed and cleaned data to the data warehouse. Usually, for SQL and SQL-like data stores that would mean formatting your data to CSV and then telling your data warehouse to load it (e.g., for Amazon Redshift, upload the CSV to S3, and then use the COPY command; For MySQL use the LOAD DATA command).
Use visualization packages like Matplotlib or Seaborn to visualize the data. You’ll probably want to provide some Web-based interface for your users to interact with these visualizations, so you can use a web framework like Django or Flask.
or Act upon it
Instead of visualizing the data, you may want to write programs that automatically respond to the data. Once all of your data is in a central data warehouse this is fairly simple. Just break up your business logic into small and simple blocks that query the data and then send Slack notifications, Pagerduty alerts, or trigger some other business workflow’s view webhooks.
Of course, you will also have to think about monitoring (each of these steps will fail, sooner or later. How do you know that it failed?), fault-tolerance (when things break - do you lose data?), scalability (because hopefully, your business will succeed and have more data), and flexibility (When the schema of your events changes, how easy is it to handle it? When you have to integrate a new data-source, how long does it take?)
And another word of advice…
I strongly advise to at least consider the possibility of using pre-built services to do parts of this for you. For example, you can use Alooma for the ETL part (Disclosure: I work at Alooma), and Redash or Periscope for the visualization part. It will let you focus on your business and on analyzing the data, instead of on building the infrastructure itself.
Published at Quora. See Original Question here