Hunting, cloud depth, Hadoop 】 : stable, efficient and flexible data processing platform

if you talk about big data with others, then you’ll set the conversation turned to the only yellow elephants – Hadoop (a sign of it is an elephant in the yellow). The open source software platform is launched by the Apache foundation, the value of it is able to handle the very large data is simple and efficient.

but, what is it? In a nutshell, is a large amounts of data can be distributed processing software framework. First of all, it will save a large amount of data sets in a distributed server in the cluster, then it will run on each server in the cluster “distributed” data analysis applications.

what’s special about it? First of all, it is very reliable, even if one or a set of server goes down, you big data applications can still function. Second, it is very efficient, in the process, do not need to back and forth from the entire network of large amounts of data to be obtained.

for, Apache given official description:

the Apache Hadoop software framework using simple programming model to distributed processing large data. Its design concept is from a server extension to tens of thousands of machines, every single individual to provide local computing and storage capacity. Similar with other rely on hardware to ensure the reliability of the software platform, Hadoop bug detection and processing capabilities in the application layer. As a result, it only through a computer cluster can provide practical services.

what is more, Hadoop is almost a modular, which means that you can use any other software tools to replace every part of it. Therefore, Hadoop is not only reliable and efficient, and more flexible organization structure.

the Hadoop distributed file system (HDFS)

well, if until now you still don’t understand what is Hadoop (in fact, I feel a little abstract), so, you only need to remember the Hadoop two of the most important part of it is enough. A framework is its data processing, the other one is it is used to store the data of the distributed file system (distributed filesystem). Of course, the complexity of the internal composition of Hadoop, however, as long as there are more than the two components, it can be the basic running.

is the distributed file system mentioned above we distributed storage cluster. To put it bluntly, it is the “master” in the Hadoop that part of the actual data. Although Hadoop can also use other file system for data storage, but by default, it will use the distributed file system (HDFS) to complete the “task”.

for example, HDFS is like a big bucket, your data can be placed on it. Until you call them, these lovely little guys “are obedient to stay in the safe big bucket. Whether you are analyzed in Hadoop, or want to obtain data from Hadoop, use other tools for processing, HDFS are can provide you with the most smooth and reliable operation experience.

data processing framework & amp; Graphs

“LTR” (translator note: graphs is Google development tool of c + + programming, parallel computing for large-scale data sets)

data processing framework, coordination with data is actually running a tool. By default, it is called, is a system based on Java. Compared with the HDFS, you may be more familiar with graphs. Causes of the case, may be the two main reasons:

1. Its graphs) is a tool for real process data

2. When people use it to data processing, feel is good

in the so-called “normal” relational database, data search and analysis, to use the standard “structured query language (SQL). A relational database using a query language, but is not limited to SQL. It can help other query language data needed to be obtained from the data store. Therefore, also some people call this query language become “unstructured check need language” ().

in fact, Hadoop is not a database. Although it can help you to store data, and you to draw on, but it does not involve any query language (neither SQL is different also other languages). Because it is not only a data storage system, thus the Hadoop needs a system, similar to the graphs to help it deal with data.

takes that a series of graphs. In fact, every job is a single Java application, after entering the database, it can according to need to bring up the different information. Use the graphs instead of the traditional query language, can make the lookup of the data more precise and flexible. However, on the other hand, in so doing, also increased the complexity of data processing.

however, there are some tools can simplify the process. For example, Hadoop also includes an application called Hive (and development led by Apache). It can be query language into graphs of the work. Process appears to be simplified, but the batch mode can only process one job at a time. As a result, the trial Hadoop is more like the use of a data warehouse, rather than a data analysis tool.

scattered in the cluster

Hadoop another unique is that all of the above functions are distributed system, and this in the middle of the traditional database system has a very big difference.

support on multiple machines in a need of database, the “work” is often assigned well: all the data is stored in one or more of the machine, and the data processing software are controlled by another or a group of servers.

in the group of Hadoop, HDFS and graphs systems exist in each machine in the group. Doing so has two advantages: one, even if the group in a server outage, the system can still run normally; Second, it put together data storage and data processing system, can improve the speed of data retrieval.

all in all, as I said, Hadoop is reliable and efficient.

when graphs received a information inquiry instruction, it will immediately start the two “tool” for processing. One is “tracking” (the JobTracker), located in the Hadoop master node; The other was a “task tracker” (TaskTrackers), located in the Hadoop on each network node.

the whole process is linear: first of all, by running the Map function, work “trackers” will work operation into a well-defined modules, and send it to the “task tracker” (” task tracker “located in group machine in entity data needed). Second, by running the Reduced function, with the previously defined the accurate data and the corresponding other machines in the group to find the required data, is passed back to the Hadoop group central node.

HDFS distribution method and described above are very similar to the way. A single master node in the group to rent the location of the tracking data in the server. The master node is also known as data node, the data is stored on a block of data in a data node. HDFS to copy of data blocks, the general is a replication units with 128 MB, then scatter the replicated data into the group of each node.

this distributed “style” makes the Hadoop has another advantage: because of its data and processing system are in the same server in the group, so that when your group for every machine, you of the whole system hardware space and computing power are promoted a grade.

equipped with your Hadoop

as I mentioned before, even the users don’t need to struggle with HDFS or graphs. Because, Hadoop also provides “elastic compute cloud solutions”. This scheme is mainly refers to the amazon with simple cloud storage service, and the (system) based on Apache Cassandra, it replaces the HDFS, bear the Hadoop distributed file system.

in order to avoid the graphs of “linear” approach, the limitation of Hadoop and a called Cascading application, it can provide a convenient tool for developers, and increase the flexibility of the work.

objectively speaking, the current Hadoop can not be regarded as a perfect, very simple data processing tool. As I mentioned, due to the graphs in processing data limitations, many developers still used to use it only as a simple and efficient storage. As a result, the Hadoop data can’t reflect the advantages of processing.

however, Hadoop is indeed the best, the most versatile big data processing platform. Especially when you don’t have the time and money to use relational database to store huge data. Therefore, Hadoop is always in the “house” of big data on an elephant, waiting for data processing.