APACHE HADOOP

Apache Hadoop is a freely licensed software framework developed by the Apache Software Foundation and used to build data-intensive, distributed computing applications.

1.What is the Hadoop framework?

Hadoop is an open-source framework written in Java by the Apache Software Foundation. The framework is used to write software applications that need to process vast amounts of data (it can handle multiple terabytes of data). It works in parallel on large clusters, which can have thousands of computers (nodes), and it processes data reliably and in a fault-tolerant manner.

2.What is Hadoop MapReduce?

The Hadoop MapReduce framework is used for processing large data sets in parallel across a Hadoop cluster. Data analysis uses a two-step process: map and reduce.

3.How does Hadoop MapReduce work?

Take word counting as an example: during the map phase MapReduce counts the words in each document, while in the reduce phase it aggregates those per-document counts across the entire collection. During the map phase the input data is divided into splits, which are analysed by map tasks running in parallel across the Hadoop cluster.
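
To make the word-count example concrete, here is a minimal sketch using the older org.apache.hadoop.mapred API that the rest of this document refers to; the class names and the choice of word count are illustrative, not a prescribed implementation:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Map phase: emit (word, 1) for every word in the input line.
    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // Reduce phase: sum the per-document counts for each word across the collection.
    class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }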

4.Explain what is shuffling in MapReduce ?

The process by which the system sorts the map outputs and transfers them to the reducers as input is known as the shuffle.

5.Explain what is distributed Cache in MapReduce Framework?

Distributed Cache is an important feature provided by the MapReduce framework. When you want to share some files across all nodes in a Hadoop cluster, the DistributedCache is used. The files could be executable JAR files or simple properties files.

6.Explain what is NameNode in Hadoop?

NameNode in Hadoop is the node where Hadoop stores all the file location information for HDFS (Hadoop Distributed File System). In other words, the NameNode is the centrepiece of an HDFS file system. It keeps a record of all the files in the file system and tracks the file data across the cluster or multiple machines.

7.Explain what is JobTracker in Hadoop? What are the actions followed by Hadoop?

In Hadoop, the JobTracker is used for submitting and tracking MapReduce jobs. The JobTracker runs in its own JVM process.

Hadoop performs the following actions:

- The client application submits jobs to the JobTracker.

- The JobTracker communicates with the NameNode to determine the data location.

- The JobTracker locates TaskTracker nodes near the data or with available slots.

- It submits the work to the chosen TaskTracker nodes.

- When a task fails, the JobTracker is notified and decides what to do next.

- The TaskTracker nodes are monitored by the JobTracker.

8.Explain what is heartbeat in HDFS?

A heartbeat is a signal sent between a DataNode and the NameNode, and between a TaskTracker and the JobTracker. If the NameNode or JobTracker does not receive the heartbeat signal, it is assumed that there is some issue with the DataNode or TaskTracker.

9.Explain what combiners are and when you should use a combiner in a MapReduce job?

Combiners are used to increase the efficiency of a MapReduce program. A combiner reduces the amount of data that needs to be transferred across to the reducers. If the operation performed is commutative and associative, you can use your reducer code as a combiner. The execution of the combiner is not guaranteed in Hadoop.
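
As a sketch of how a combiner is wired in (assuming the hypothetical WordCountMapper and WordCountReducer classes from the earlier word-count sketch), the reducer class can double as the combiner because integer addition is commutative and associative:

    import org.apache.hadoop.mapred.JobConf;

    public class CombinerSetup {
        // The combiner runs on the map side and pre-aggregates counts;
        // Hadoop may run it zero, one or several times, so the job must be
        // correct even without it.
        public static void configure(JobConf conf) {
            conf.setMapperClass(WordCountMapper.class);
            conf.setCombinerClass(WordCountReducer.class);
            conf.setReducerClass(WordCountReducer.class);
        }
    }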

10.What happens when a datanode fails ?

When a DataNode fails:

- The JobTracker and NameNode detect the failure.

- All tasks on the failed node are re-scheduled.

- The NameNode replicates the user's data to another node.

11.Explain what is Speculative Execution?

In Hadoop, during speculative execution, a certain number of duplicate tasks are launched. Multiple copies of the same map or reduce task can be executed on different slave nodes using speculative execution. In simple words, if a particular node is taking a long time to complete a task, Hadoop will create a duplicate task on another node. The copy that finishes the task first is retained, and the copies that do not finish first are killed.

12.Explain what are the basic parameters of a Mapper?

The basic parameters of a Mapper are its input and output key/value types, typically:

  • LongWritable and Text (input key and value)
  • Text and IntWritable (output key and value)

These appear as the type parameters of the Mapper class, as in the sketch below.
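
A sketch of how these parameters appear as the type arguments of a Mapper (old mapred API; the class name MyMapper is hypothetical):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Type parameters, in order: input key (LongWritable), input value (Text),
    // output key (Text), output value (IntWritable).
    public class MyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // key is the byte offset of the line, value is the line itself
            output.collect(value, new IntWritable(1));
        }
    }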

13.Explain what is the function of the MapReduce partitioner?

The function of the MapReduce partitioner is to make sure that all values for a single key go to the same reducer, which eventually helps distribute the map output evenly over the reducers.

14.Explain what is the difference between an Input Split and an HDFS Block?

The logical division of data is known as an Input Split, while the physical division of data is known as an HDFS Block.

15.Explain what happens in TextInputFormat?

In TextInputFormat, each line of the text file is a record. The value is the content of the line, while the key is the byte offset of the line. For instance, Key: LongWritable, Value: Text.

16.Mention the main configuration parameters that the user needs to specify to run a MapReduce job?

The user of the MapReduce framework needs to specify:

  • Job’s input locations in the distributed file system
  • Job’s output location in the distributed file system
  • Input format
  • Output format
  • Class containing the map function
  • Class containing the reduce function
  • JAR file containing the mapper, reducer and driver classes

All of these are set on the job configuration, as in the driver sketch below.
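
A hypothetical driver sketch (old mapred API) showing where each of these parameters is specified; the class and job names are illustrative:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);       // JAR containing the classes
            conf.setJobName("wordcount");

            FileInputFormat.setInputPaths(conf, new Path(args[0]));  // input location in HDFS
            FileOutputFormat.setOutputPath(conf, new Path(args[1])); // output location in HDFS

            conf.setInputFormat(TextInputFormat.class);              // input format
            conf.setOutputFormat(TextOutputFormat.class);            // output format

            conf.setMapperClass(WordCountMapper.class);              // class with the map function
            conf.setReducerClass(WordCountReducer.class);            // class with the reduce function

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            JobClient.runJob(conf);                                  // submit and wait
        }
    }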

17.Explain what is WebDAV in Hadoop?

WebDAV is a set of extensions to HTTP that supports editing and updating files. On most operating systems WebDAV shares can be mounted as filesystems, so it is possible to access HDFS as a standard filesystem by exposing HDFS over WebDAV.

18.Explain what is sqoop in Hadoop ?

Sqoop is a tool used to transfer data between relational database management systems (RDBMS) and Hadoop HDFS. Using Sqoop, data can be imported from an RDBMS such as MySQL or Oracle into HDFS, as well as exported from HDFS back to an RDBMS.

19.Explain how JobTracker schedules a task ?

The TaskTrackers send out heartbeat messages to the JobTracker, usually every few seconds, to assure the JobTracker that they are alive and functioning. The messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

20.Explain what is SequenceFileInputFormat?

SequenceFileInputFormat is used for reading sequence files. It is a specific compressed binary file format optimized for passing data from the output of one MapReduce job to the input of another MapReduce job.

21.Explain what conf.setMapperClass does?

conf.setMapperClass sets the mapper class and hence everything related to the map job, such as reading the data and generating key-value pairs out of the mapper.

22.Name the most common Input Formats defined in Hadoop? Which one is the default?

The most common Input Formats defined in Hadoop are:

  • TextInputFormat
  • KeyValueTextInputFormat
  • SequenceFileInputFormat

TextInputFormat is the Hadoop default.

23.What is the difference between the TextInputFormat and KeyValueTextInputFormat classes?

TextInputFormat: It reads lines of text files and provides the byte offset of the line as the key and the actual line as the value to the Mapper.

KeyValueTextInputFormat: It reads text files and parses lines into key, value pairs. Everything up to the first tab character is sent as the key and the remainder of the line as the value to the Mapper.

24.What is InputSplit in Hadoop?

When a Hadoop job is run, it splits the input files into chunks and assigns each split to a mapper to process. Such a chunk is called an InputSplit.

25.How is the splitting of file invoked in Hadoop framework?

It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (such as FileInputFormat) used by the job.

26.Consider this case scenario: In an M/R system, the HDFS block size is 64MB, the input format is FileInputFormat, and we have 3 files of size 64KB, 65MB and 127MB. How many input splits will be made by the Hadoop framework?

Hadoop will make 5 splits as follows:

- 1 split for the 64KB file

- 2 splits for the 65MB file (64MB + 1MB)

- 2 splits for the 127MB file (64MB + 63MB)

27.What is the purpose of RecordReader in Hadoop?

The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

28.After the Map phase finishes, the Hadoop framework does “Partitioning, Shuffle and sort”. Explain what happens in this phase?

Partitioning: It is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.

Shuffle: After the first map tasks have completed, the nodes may still be performing several more map tasks
each. But they also begin exchanging the intermediate outputs from the map tasks to where they are
required by the reducers. This process of moving map outputs to the reducers is known as shuffling.

Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The
set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to
the Reducer.

29.If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?

The default partitioner computes a hash value for the key and assigns the partition based on this result.
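
A sketch of that default hash-based behaviour (mirroring what Hadoop's HashPartitioner does); the class name here is illustrative:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Mask off the sign bit of the key's hash code, then take the remainder
    // modulo the number of reducers, so every occurrence of a key maps to
    // the same partition.
    public class HashStylePartitioner<K, V> implements Partitioner<K, V> {
        public void configure(JobConf job) { }

        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }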

30.What is a Combiner?

The Combiner is a ‘mini-reduce’ process which operates only on data generated by a mapper. The
Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from
the Combiner is then sent to the Reducers, instead of the output from the Mappers.

31.What is JobTracker?

JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.

32.What are some typical functions of Job Tracker?

The following are some typical tasks of JobTracker:-

- Accepts jobs from clients

- It talks to the NameNode to determine the location of the data.

- It locates TaskTracker nodes with available slots at or near the data.

- It submits the work to the chosen TaskTracker nodes and monitors progress of each task by receiving
heartbeat signals from Task tracker.

33.What is TaskTracker?

A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

34.What is the relationship between Jobs and Tasks in Hadoop?

One job is broken down into one or many tasks in Hadoop.

35.Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?

It will restart the task again on some other TaskTracker and only if the task fails more than four (default
setting and can be changed) times will it kill the job.

36.Hadoop achieves parallelism by dividing the tasks across many nodes; it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?

Speculative Execution.

37.How does speculative execution work in Hadoop?

The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon those tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.

38.Using command line in Linux, how will you - See all jobs running in the Hadoop cluster - Kill a job?

hadoop job -list

hadoop job -kill <jobID>

39.What is Hadoop Streaming?

Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop
Mapper and Reducer implementations.

40.What is the characteristic of the Streaming API that makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, Awk etc.?

Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
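
For illustration, a hypothetical streaming mapper: any executable that reads records on stdin and writes tab-separated (key, value) pairs on stdout can play this role, so the same program could equally be written in Perl, Ruby, Python or Awk:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.StringTokenizer;

    public class StreamingWordCountMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                StringTokenizer itr = new StringTokenizer(line);
                while (itr.hasMoreTokens()) {
                    // one "key<TAB>value" pair per line on stdout
                    System.out.println(itr.nextToken() + "\t" + 1);
                }
            }
        }
    }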

41.What is Distributed Cache in Hadoop?

Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars
and so on) needed by applications during execution of the job. The framework will copy the necessary files
to the slave node before any tasks for the job are executed on that node.
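
A minimal sketch of adding a file to the Distributed Cache with the old mapred API; the HDFS path and symlink name are made up for illustration:

    import java.net.URI;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheSetup {
        public static void addLookupFile(JobConf conf) throws Exception {
            // Cache an HDFS file on every task node; "#lookup" creates a symlink name.
            DistributedCache.addCacheFile(new URI("/user/demo/lookup.properties#lookup"), conf);
            // Inside a task, DistributedCache.getLocalCacheFiles(conf) returns the
            // local paths of the cached files.
        }
    }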

42.What is the benefit of the Distributed Cache? Why can't we just have the file in HDFS and have the application read it?

This is because the distributed cache is much faster: it copies the file to every task node once, at the start of the job. Now, if a TaskTracker runs 10 or 100 mappers or reducers, they all use the same local copy from the distributed cache. On the other hand, if you write code in the MR job to read the file from HDFS, then every mapper will try to access it from HDFS, so if a TaskTracker runs 100 map tasks it will try to read this file 100 times from HDFS. HDFS is also not very efficient when used like this.

43.What mechanism does the Hadoop framework provide to synchronise changes made in the Distributed Cache during runtime of the application?

This is a tricky question. There is no such mechanism. Distributed Cache by design is read only during the
time of Job execution.

44.Have you ever used Counters in Hadoop? Give us an example scenario.

Anybody who claims to have worked on a Hadoop project is expected to have used counters; a typical scenario is using a counter to keep track of bad or malformed records that the job skips.
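
A sketch of one such scenario, counting malformed records with a custom counter (old mapred API; the class name, the enum, and the tab-separated record check are hypothetical):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class RecordCleaningMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, NullWritable> {
        enum RecordQuality { MALFORMED }

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, NullWritable> output, Reporter reporter)
                throws IOException {
            if (value.toString().split("\t").length < 3) {
                reporter.incrCounter(RecordQuality.MALFORMED, 1); // shows up in the job UI
                return;                                           // skip the bad record
            }
            output.collect(value, NullWritable.get());
        }
    }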

45.Is it possible to provide multiple inputs to Hadoop? If yes, then how can you give multiple directories as input to the Hadoop job?

Yes, the input format class provides methods to add multiple directories as input to a Hadoop job.
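
A sketch using FileInputFormat (old mapred API); both directory paths are made up for illustration:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class MultiInputSetup {
        public static void configure(JobConf conf) {
            // Add each directory individually...
            FileInputFormat.addInputPath(conf, new Path("/data/logs/2014"));
            FileInputFormat.addInputPath(conf, new Path("/data/logs/2015"));
            // ...or pass a comma-separated list in one call:
            // FileInputFormat.setInputPaths(conf, "/data/logs/2014,/data/logs/2015");
        }
    }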

46.Is it possible to have Hadoop job output in multiple directories? If yes, how?

Yes, by using the MultipleOutputs class.
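
A sketch of registering named outputs with the old mapred API; the output names "valid" and "errors" are hypothetical, and inside the reducer MultipleOutputs.getCollector("errors", reporter) would be used to write to the corresponding output:

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextOutputFormat;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class MultiOutputSetup {
        public static void configure(JobConf conf) {
            // Each named output gets its own format and key/value classes,
            // and is written under its own file name prefix in the output directory.
            MultipleOutputs.addNamedOutput(conf, "valid", TextOutputFormat.class,
                    Text.class, NullWritable.class);
            MultipleOutputs.addNamedOutput(conf, "errors", TextOutputFormat.class,
                    Text.class, NullWritable.class);
        }
    }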

47.What will a Hadoop job do if you try to run it with an output directory that is already present? Will it - Overwrite it - Warn you and continue - Throw an exception and exit

The Hadoop job will throw an exception and exit.

48.How can you set an arbitrary number of mappers to be created for a job in Hadoop?

You cannot set it directly; the number of mappers is determined by the number of input splits.

49.How can you set an arbitrary number of Reducers to be created for a job in Hadoop?

You can either do it programmatically by using the setNumReduceTasks method of the JobConf class, or set it up as a configuration setting.
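
A short sketch of both options (the number 10 is arbitrary):

    import org.apache.hadoop.mapred.JobConf;

    public class ReducerCountSetup {
        public static void configure(JobConf conf) {
            conf.setNumReduceTasks(10);   // programmatic
            // or as a configuration setting, e.g. passing
            // -D mapred.reduce.tasks=10 on the command line.
        }
    }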

50.How will you write a custom partitioner for a Hadoop job?

To have Hadoop use a custom partitioner you will have to do, at minimum, the following three things:

- Create a new class that extends the Partitioner class.

- Override the getPartition method.

- In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using the setPartitionerClass method, or add the custom partitioner to the job as a config setting (if your wrapper reads from a config file or Oozie).
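
A sketch of such a partitioner. Note that in the older mapred API shown here, Partitioner is an interface you implement, while in the newer mapreduce API it is an abstract class you extend; the class name and the first-letter routing rule are hypothetical:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Routes keys by their first character, so all keys starting with the same
    // letter go to the same reducer.
    public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
        public void configure(JobConf job) { }

        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            if (numReduceTasks == 0) {
                return 0;
            }
            String s = key.toString();
            char first = s.isEmpty() ? ' ' : Character.toLowerCase(s.charAt(0));
            return first % numReduceTasks;
        }
    }

    // Wiring it into the job: conf.setPartitionerClass(FirstLetterPartitioner.class);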

51.How did you debug your Hadoop code?

There can be several ways of doing this, but the most common ways are:-

- By using counters.

- The web interface provided by Hadoop framework.

52.Did you ever build a production process in Hadoop? If yes, what was the process when your Hadoop job failed for any reason?

It is an open-ended question, but most candidates, if they have written a production job, should talk about some type of alert mechanism, for example an email being sent or their monitoring system raising an alert. Since Hadoop works on unstructured data, it is very important to have a good alerting system for errors, since unexpected data can very easily break the job.

53.What are compute and storage nodes?

Compute Node: This is the computer or machine where your actual business logic is executed.

Storage Node: This is the computer or machine where your file system resides to store the data being processed.

In most cases the compute node and the storage node are the same machine.

54.How does the master-slave architecture work in Hadoop?

The MapReduce framework consists of a single master JobTracker and multiple slaves; each cluster node has one TaskTracker. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

55.What does a Hadoop application look like, i.e. what are its basic components?

Minimally, a Hadoop application would have the following components:

  • Input location of the data
  • Output location of the processed data
  • A map task
  • A reduce task
  • Job configuration

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which
then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks
and monitoring them, providing status and diagnostic information to the job-client.

56.Explain the input and output data format of the Hadoop framework?

The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types. See the flow mentioned below:

(input) <k1, v1> -> map -> <k2, v2> -> combine/sorting -> <k2, v2> -> reduce -> <k3, v3> (output)

57.What are the restrictions on the key and value classes?

The key and value classes have to be serializable by the framework. To make them serializable, Hadoop provides the Writable interface. As you know from Java itself, the key of a map should be comparable, hence the key class also has to implement the WritableComparable interface.
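
A minimal sketch of a custom key type; the YearMonthKey name and its fields are made up for illustration:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    // Serializable via write/readFields, and comparable so the framework can
    // sort keys between the map and reduce phases.
    public class YearMonthKey implements WritableComparable<YearMonthKey> {
        private int year;
        private int month;

        public YearMonthKey() { }                    // required no-arg constructor

        public YearMonthKey(int year, int month) {
            this.year = year;
            this.month = month;
        }

        public void write(DataOutput out) throws IOException {
            out.writeInt(year);
            out.writeInt(month);
        }

        public void readFields(DataInput in) throws IOException {
            year = in.readInt();
            month = in.readInt();
        }

        public int compareTo(YearMonthKey other) {
            int cmp = Integer.compare(year, other.year);
            return cmp != 0 ? cmp : Integer.compare(month, other.month);
        }

        @Override
        public int hashCode() {                      // used by the default partitioner
            return 31 * year + month;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof YearMonthKey)) return false;
            YearMonthKey k = (YearMonthKey) o;
            return year == k.year && month == k.month;
        }
    }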

58.Can Reducers talk to each other?

No, each Reducer runs in isolation.

59.Where is the Mapper’s intermediate data stored?

The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the config by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.