How do I handle unstructured data?
When I think about handling unstructured data collected from many sources, such as emails, videos, audio files, web pages, and social media messages, three stages come to mind: collecting, storing, and analyzing.
Your first step is setting up your data sources across all domains. Whether you are using clickstream data, advertising data, usage data, operational feeds, a CRM, or anything else, think about which client is best suited to collecting from each source.
Persisting your data in the cloud or on premises allows you to query and analyze it. Fortunately, we’re living in the future, and we have some really great tools that don’t require a strict schema, like MongoDB, Elasticsearch, or Cassandra. In MongoDB, for example, you can store documents under a unique identifier and access them later using that ID.
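To make the "store by unique identifier, fetch by ID" idea concrete, here is a minimal sketch in Python. It uses a small in-memory class, `DocumentStore`, which is a hypothetical stand-in written for this illustration and not MongoDB's API. It only shows the access pattern: documents of different shapes sit side by side, and each one is retrieved by its ID.

```python
import uuid


class DocumentStore:
    """In-memory stand-in for a schema-less document store (hypothetical)."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc):
        # Use the caller's "_id" if present, otherwise generate a unique one.
        doc_id = doc.get("_id") or uuid.uuid4().hex
        # Store a copy so later changes to the caller's dict don't leak in.
        self._docs[doc_id] = {**doc, "_id": doc_id}
        return doc_id

    def get(self, doc_id):
        # Returns None when no document has this ID.
        return self._docs.get(doc_id)


store = DocumentStore()

# Two documents with completely different fields: no schema is declared.
email_id = store.insert({"type": "email", "subject": "Hello", "body": "Free text..."})
tweet_id = store.insert({"type": "tweet", "text": "No schema needed", "hashtags": ["data"]})

email = store.get(email_id)
tweet = store.get(tweet_id)
```

With a real MongoDB deployment and the PyMongo driver, the analogous calls would be `insert_one` and `find_one` on a collection; the sketch above avoids them only so that it runs without a database server.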
After processing and storing your data, you will usually want to access it again, and perhaps learn something from it.
To do that, you’ll typically have to organize your data a bit and adjust it for analytics.
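One common way of "organizing" document-style data for analytics is flattening nested fields into column-like keys that a warehouse table can hold. The sketch below assumes that approach; the function name `flatten`, the `_` separator, and the sample event are all choices made for this illustration.

```python
def flatten(doc, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict of column-like keys."""
    flat = {}
    for key, value in doc.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the prefix along.
            flat.update(flatten(value, column, sep))
        else:
            # Scalars (and lists) are kept as they are.
            flat[column] = value
    return flat


event = {"user": {"id": 42, "geo": {"country": "US"}}, "action": "click"}
row = flatten(event)
# row == {"user_id": 42, "user_geo_country": "US", "action": "click"}
```

Lists are left untouched here; whether to explode them into rows or serialize them depends on the warehouse and the questions you want to ask.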
At Alooma, where I work, for example, we translate our customers’ data (structured or unstructured) into JSON objects that we stream to a cloud data warehouse in real time. As data passes through the stream, Alooma lets you provide Python code that organizes and prepares the data so it is available for analytics on top of the data warehouse.
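The general shape of such a pipeline can be sketched in plain Python. This is not Alooma's actual interface: the names `to_events`, `transform`, and `run_stream`, the envelope fields `source` and `payload`, and the sample input are assumptions made for this illustration. The sketch wraps each raw record in a JSON object, passes it through a user-supplied function, and emits JSON strings ready for loading.

```python
import json


def to_events(raw_lines, source):
    """Wrap each raw line in a JSON-serializable envelope."""
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue  # skip empty records
        try:
            payload = json.loads(line)
        except json.JSONDecodeError:
            # Not JSON: keep the unstructured text under a "raw" key.
            payload = {"raw": line}
        if not isinstance(payload, dict):
            payload = {"value": payload}
        yield {"source": source, "payload": payload}


def transform(event):
    """User-supplied preparation step (hypothetical example)."""
    payload = event["payload"]
    if "email" in payload:
        # Normalize email addresses so they group cleanly in analytics.
        payload["email"] = payload["email"].strip().lower()
    return event


def run_stream(raw_lines, source, transform_fn):
    """Apply transform_fn to every event and emit JSON strings."""
    for event in to_events(raw_lines, source):
        result = transform_fn(event)
        if result is not None:  # returning None drops the event
            yield json.dumps(result, sort_keys=True)


lines = ['{"email": " Ada@Example.COM "}', "plain text note", ""]
output = list(run_stream(lines, "crm", transform))
```

Because everything is a generator, records are processed one at a time as they arrive, which is the property a real-time stream needs.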
This end-to-end process gives a sense of “structure” to your “unstructured” data and lets you both store your data and extract value from it.
Published at Quora.