Big data:
Big data refers to data sets that are too large or complex for traditional data-processing application software to handle adequately. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source. Big data was originally associated with three key concepts: volume, variety, and velocity. A concept later attributed to big data is veracity.
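To make the point about false discoveries concrete, here is a small arithmetic sketch. It assumes a simplified setting that is not from the text above: every attribute is actually unrelated to the outcome, each one is tested independently, and the significance level is 0.05. The function names are made up for illustration.

```python
# Illustrative arithmetic only: the significance level and column counts
# are assumed values, and the tests are assumed to be independent.

def expected_false_positives(num_attributes: int, alpha: float = 0.05) -> float:
    """Expected number of spurious findings when no attribute has a real effect
    and each attribute is tested at significance level alpha."""
    return num_attributes * alpha


def chance_of_any_false_positive(num_attributes: int, alpha: float = 0.05) -> float:
    """Probability that at least one of the independent tests is a false positive."""
    return 1.0 - (1.0 - alpha) ** num_attributes


# More columns means more tests, and so more chances for a spurious result.
for columns in (10, 100, 1000):
    print(
        columns,
        expected_false_positives(columns),
        round(chance_of_any_false_positive(columns), 3),
    )
```

Under these assumptions the expected number of spurious findings grows in proportion to the number of columns, which is one way to see why wide data sets need extra care.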
Tools Of Big Data
- Apache Hadoop
- Apache Spark
- Apache Storm
- Apache Cassandra
- MongoDB
- R programming
- Neo4j
- Apache SAMOA
Hadoop:
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Originally designed for computer clusters built from commodity hardware, which is still the common use, it has also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common occurrences and should be handled automatically by the framework.
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop YARN: introduced in 2012, a platform responsible for managing computing resources in clusters and using them for scheduling users' applications.
- Hadoop MapReduce: an implementation of the MapReduce programming model for large-scale data processing.
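
The MapReduce programming model mentioned above can be sketched in a few lines of plain Python. This is only a single-machine illustration of the map, shuffle, and reduce steps using a word count; it does not use the Hadoop API, and the function names and sample documents are invented for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input.
documents = ["big data needs big clusters", "data moves fast"]
print(word_count(documents))
```

In a real cluster the map and reduce steps would run on many machines at once and the framework would handle the grouping and any failed tasks; here everything runs in one process to keep the idea visible.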
Scala:
Scala is a strongly, statically typed general-purpose programming language that supports both object-oriented programming and functional programming. Designed to be concise, Scala makes many design decisions that aim to address criticisms of Java.
Apache Spark:
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
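
As a rough picture of what data parallelism means, the sketch below splits a list into partitions, applies the same function to each partition concurrently, and combines the partial results. It uses only the Python standard library as a stand-in and is not the Spark API; the function names and partition count are assumptions made for illustration, and no fault tolerance is modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal the records out into num_partitions roughly equal slices."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def sum_of_squares(partition):
    """Work done independently on each partition."""
    return sum(x * x for x in partition)


def parallel_sum_of_squares(records, num_partitions=4):
    """Apply the same function to every partition, then combine the results."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_results = list(pool.map(sum_of_squares, partitions))
    return sum(partial_results)


print(parallel_sum_of_squares(list(range(1, 11))))
```

The caller only describes the per-partition work and the final combination; how the slices are scheduled is left to the executor. Threads are used here just to show the structure, whereas an engine such as Spark distributes the partitions across the machines of a cluster.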