What is Spark and how do I use MongoDB?

Yaara Gazit

Spark has several advantages compared to other big data and MapReduce technologies like Hadoop and Storm: It gives us a comprehensive, unified framework for big data processing across data sets that are diverse in nature (text data, graph data, etc.) and in source (batch vs. real-time streaming data).

Spark enables applications in Hadoop clusters to run up to 100 times faster than Hadoop MapReduce in memory, and 10 times faster even when running on disk.

Spark lets you quickly write applications in Java, Scala, or Python. It comes with a built-in set of over 80 high-level operators. And you can use it interactively to query data within the shell.

In addition to Map and Reduce operations, it supports SQL queries, streaming data, machine learning, and graph data processing. Developers can use these capabilities stand-alone or combine them in a single data pipeline.
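As a rough illustration, here is a minimal PySpark sketch that mixes a high-level DataFrame operator with a SQL query over the same data. The "events.json" file and the "user_id" field are hypothetical, and pyspark is assumed to be installed:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("spark-demo").getOrCreate()

# Load semi-structured JSON data into a DataFrame (hypothetical file).
events = spark.read.json("events.json")

# One of Spark's high-level operators: group and count.
events.groupBy("user_id").count().show()

# The same aggregation expressed as a SQL query over the same data.
events.createOrReplaceTempView("events")
spark.sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id").show()

spark.stop()
```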

MongoDB usage - I recommend reading the docs; they will help you get started.

Some things worth mentioning regarding MongoDB: It’s a document store - in other words, a huge, distributed key-value map, where the values are JSON-like objects (though Mongo stores them in a binary format called BSON).
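To make that concrete, here is a minimal sketch using the pymongo driver. The connection string, database, and collection names are made up for illustration, and a MongoDB server is assumed to be running locally:

```python
from pymongo import MongoClient

# Connect to a (hypothetical) local MongoDB server.
client = MongoClient("mongodb://localhost:27017")
users = client["blog"]["users"]

# Documents are just JSON-like dicts; MongoDB stores them as BSON.
users.insert_one({"_id": 1, "name": "Ada", "email": "ada@example.com"})

# A later document can carry different fields - no schema migration needed.
users.insert_one({"_id": 2, "name": "Alan", "tags": ["spark", "mongodb"]})

# Look a document up by key, just like a key-value store.
print(users.find_one({"_id": 2}))
```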

It’s great for reading and writing thousands of documents per second, and for rapidly developing an application when you can’t predict what your schema will look like one, two, or three months from now.

It’s not so great for querying across documents. This is fairly reasonable, as there is no guarantee that the documents are stored sequentially in any order (just like a hash map), and there's no guarantee that all documents in the collection have the same schema. Having said that, MongoDB does implement an indexing mechanism, so you can search the index, and retrieve large amounts of documents via a query. It will still be slower than any transactional, sorted, relational database.
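As a rough sketch of that indexing mechanism, again with pymongo and the same hypothetical collection as above:

```python
from pymongo import MongoClient, ASCENDING

users = MongoClient("mongodb://localhost:27017")["blog"]["users"]

# Create an index on "name"; without it, a query on that field would
# have to scan every document in the collection.
users.create_index([("name", ASCENDING)])

# With the index in place, this query walks the index instead of the
# whole collection, so matching documents come back much faster.
for doc in users.find({"name": "Ada"}).limit(10):
    print(doc)
```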

So bottom line - MongoDB is great if you are rapidly developing an app, you don't plan for large batch retrieval queries or large joins, but you do plan for a high volume of transactions. If you know your schema, your code is fairly stable, and you want somewhat reasonable analytics capabilities (on a data set that doesn't exceed about 1 TB), go for Postgres.

And if you still want to go for MongoDB, I would point out that while unstructured data may be appealing now, it can become difficult to maintain over time, and you would probably eventually want to load MongoDB to BigQuery or MongoDB to Redshift to analyze it.

(Full disclosure - I’m a Software Developer at Alooma, which offers data pipelines as a service.)

View Original Question On Quora