Questions tagged [bigdata]

Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization.

456 questions
12
votes
7 answers

What is an 'old name' of data scientist?

Terms like 'data science' and 'data scientist' are increasingly used these days. Many companies are hiring 'data scientists'. But I don't think it's a completely new job. Data has always existed, and someone had to deal with it. I guess the…
user67275
  • 263
  • 1
  • 3
  • 15
6
votes
2 answers

What's an efficient way to compare and group millions of store names?

I'm a total amateur as far as data science goes, and I'm trying to figure out a way to do some string comparison on a large dataset. I've a Google BigQuery table storing merchant transactions, but the store names are all over the board. For…
TerryMatula
  • 163
  • 4
4
votes
2 answers

Simple Explanation of Apache Flume

Can anybody explain Apache Flume to me in plain language? I'd appreciate an explanation with a practical example instead of abstract theoretical definitions, so that I can understand it better. What is it used for? At which stage of a BigData analysis…
DanielWelke
  • 163
  • 1
  • 10
4
votes
3 answers

Learning resources for data science to win political campaigns?

Does anyone know where I can learn about applying data science to win a political campaign? I know the Obama campaign had 12 data scientists in 2008 and 165 data scientists in 2012. In 2012, they ran over 65,000 simulations every night, for 14…
4
votes
2 answers

Amazon S3 vs Google Drive

The majority of people use S3. However, Google Drive seems a promising alternative solution for storing large amounts of data. Are there specific reasons why one is better than the other?
iliasfl
  • 609
  • 5
  • 16
4
votes
1 answer

How to get Big Data Sets?

It sounds like a dumb question, but as a beginner I'm really getting confused. For my academic thesis, I chose a conference paper on Big Data in the healthcare field. Now the problem is getting the data sets. I can't find any resources to download the data…
Innat
  • 181
  • 8
3
votes
1 answer

Reducing search iteration over millions of data

The problem goes like this, with a story. You have an application that contains a search field. When you type some input, an auto-complete component pops up showing results similar to your input. Each result is a location in the…
Ben Beri
  • 131
  • 2
3
votes
8 answers

What is the best Big-Data framework for stream processing?

I found that Apache-Storm, Apache-Spark, Apache-Flink and TIBCO StreamBase are some powerful frameworks for stream processing, but I don't know which one of them has the best performance and capabilities. I know Apache-Spark and Apache-Flink are two…
Omid Ebrahimi
  • 249
  • 4
  • 10
3
votes
2 answers

What are the differences between Apache Spark and Apache Flink?

Both the Apache-Spark and Apache-Flink projects claim very similar capabilities. What is the difference between these projects? Is there any advantage to either Spark or Flink? Thanks
Omid Ebrahimi
  • 249
  • 4
  • 10
3
votes
1 answer

SAP HANA vs Exasol

I am interested in knowing the differences in functionality between SAP HANA and Exasol. Since this is a bit of an open-ended question, let me be clear. I am not interested in people debating which is "better" or faster. I am only interested in what…
Keith
  • 326
  • 2
  • 14
2
votes
3 answers

Examples of the Three V's of Big Data?

What are some examples of the Three V's of Big Data? The three V's stand for: volume, velocity, variety. Reference: Three V's of Big Data, provided by Norwegian University of Science and Technology. https://www.ntnu.edu/ime/bigdata/what-is
Mike Stratton
  • 131
  • 1
  • 7
2
votes
2 answers

Privacy through fake data?

With companies and governments hungry for all the data about people, I was wondering if it was possible to gain some privacy by drowning relevant information in a sea of random data. For example a browser extension which keeps searching for random…
ZoltanE
  • 23
  • 2
2
votes
1 answer

Which Big-Data frameworks have the simplest interfaces?

I found that Apache-Spark has a fairly simple interface and is easy to use. But I want to know about other interfaces. Can anyone give me a ranking of Big-Data frameworks based on the simplicity of their interfaces? Also this is useful to express most…
Omid Ebrahimi
  • 249
  • 4
  • 10
2
votes
2 answers

What is the advantage of using Dryad instead of Spark?

I found that Apache-Spark is very powerful for Big-Data processing, but I want to know about the benefits of Dryad (Microsoft). Does this framework have any advantage over Spark? Why would we use Dryad instead of Spark?
Omid Ebrahimi
  • 249
  • 4
  • 10
2
votes
1 answer

Is a data warehouse considered a data lake in a big data environment?

Suppose I have a data warehouse (DWH) and now I would like to add many other big data sources of information, most of which are not structured. I still keep the DWH with no architectural change. The only thing I do is to enrich the big data with the data…
Avi
  • 135
  • 7