Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.
Questions tagged [bigdata]
456 questions
12 votes, 7 answers
What is an 'old name' of data scientist?
Terms like 'data science' and 'data scientist' are increasingly used these days.
Many companies are hiring 'data scientists'. But I don't think it's a completely new job.
Data has always existed, and someone had to deal with it.
I guess the…

user67275
6 votes, 2 answers
What's an efficient way to compare and group millions of store names?
I'm a total amateur as far as data science goes, and I'm trying to figure out a way to do some string comparison on a large dataset.
I have a Google BigQuery table storing merchant transactions, but the store names are all over the place. For…

TerryMatula
4 votes, 2 answers
Simple Explanation of Apache Flume
Can anybody explain Apache Flume to me in plain language? I'd appreciate an explanation with a practical example rather than abstract theoretical definitions, so that I can understand it better.
What is it used for? At which stage of a BigData analysis…

DanielWelke
4 votes, 3 answers
Learning resources for data science to win political campaigns?
Does anyone know where I can learn about applying data science to win a political campaign? I know the Obama campaign had 12 data scientists in 2008 and 165 data scientists in 2012. In 2012, they ran over 65,000 simulations every night, for 14…

Tyrion Lannister
4 votes, 2 answers
Amazon S3 vs Google Drive
The majority of people use S3. However, Google Drive seems a promising alternative solution for storing large amounts of data. Are there specific reasons why one is better than the other?

iliasfl
4 votes, 1 answer
How to get Big Data Sets?
It sounds like a dumb question, but as a beginner I'm really getting confused.
For my academic thesis, I chose a conference paper on Big Data in the healthcare field. Now the problem is getting the data sets.
I can't find any resources to download the data…

Innat
3 votes, 1 answer
Reducing search iteration over millions of data
The problem goes like this, with a story.
You have an application that contains a search field. When you type some input, an auto-complete component pops up showing results similar to your input.
Each result is a location in the…

Ben Beri
3 votes, 8 answers
What is the best Big-Data framework for stream processing?
I found that Apache Storm, Apache Spark, Apache Flink and TIBCO StreamBase are some powerful frameworks for stream processing, but I don't know which one of them has the best performance and capabilities.
I know Apache Spark and Apache Flink are two…

Omid Ebrahimi
3 votes, 2 answers
What are the differences between Apache Spark and Apache Flink?
Both the Apache Spark and Apache Flink projects claim very similar capabilities.
What is the difference between these projects? Is there any advantage to either Spark or Flink?
Thanks

Omid Ebrahimi
3 votes, 1 answer
SAP HANA vs Exasol
I am interested in knowing the differences in functionality between SAP HANA and Exasol. Since this is a bit of an open-ended question, let me be clear. I am not interested in people debating which is "better" or faster. I am only interested in what…

Keith
2 votes, 3 answers
Examples of the Three V's of Big Data?
What are some examples of the Three V's of Big Data? The three V's stand for: volume, velocity, variety.
Reference:
Three V's of Big Data, provided by Norwegian University of Science and Technology.
https://www.ntnu.edu/ime/bigdata/what-is

Mike Stratton
2 votes, 2 answers
Privacy through fake data?
With companies and governments hungry for all the data about people, I was wondering if it was possible to gain some privacy by drowning relevant information in a sea of random data. For example, a browser extension which keeps searching for random…

ZoltanE
2 votes, 1 answer
Which Big Data frameworks have the simplest interfaces?
I found that Apache Spark has a fairly simple interface and is easy to use. But I want to know about other interfaces.
Can anyone give me a ranking of Big Data frameworks based on the simplicity of their interfaces? Also, this is useful to express most…

Omid Ebrahimi
2 votes, 2 answers
What is the advantage of using Dryad instead of Spark?
I found that Apache Spark is very powerful for Big Data processing, but I want to know about the benefits of Dryad (Microsoft). Does this framework have any advantage over Spark?
Why would we use Dryad instead of Spark?

Omid Ebrahimi
2 votes, 1 answer
Is a data warehouse considered a data lake in a big data environment?
Suppose I have a data warehouse (DWH), and now I would like to add many other big data sources of information, most of which are unstructured. I still keep the DWH with no architectural change. The only thing I do is to enrich the bigdata with the data…

Avi