Thursday, November 13, 2014

Decommissioning DataNodes on HDP 2.0

Issues & Adventures with Decommissioning Data Nodes on HDP 2.0

Use the Ambari UI to decommission...

  1. Decommission the HBase RegionServer
  2. Decommission the NodeManager
  3. Finally Decommission the DataNode
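If you'd rather script step 3 than click through the UI, roughly the same DataNode decommission can be requested through the Ambari REST API. This is only a sketch of the DECOMMISSION custom command; the Ambari host, cluster name, credentials, and DataNode hostname below are placeholders, so substitute your own...

> curl -s --user admin:admin -H "X-Requested-By: ambari" -X POST \
    -d '{
          "RequestInfo": {
            "context": "Decommission DataNode",
            "command": "DECOMMISSION",
            "parameters": { "slave_type": "DATANODE", "excluded_hosts": "dn-host.example.com" }
          },
          "Requests/resource_filters": [
            { "service_name": "HDFS", "component_name": "NAMENODE" }
          ]
        }' \
    http://ambari-host.example.com:8080/api/v1/clusters/MYCLUSTER/requests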


The HBase RegionServer and NodeManager statuses will go to "Decommissioning" and eventually to "Decommissioned".

You have to use the Ambari REST API to see that the DataNode is on its way to being Decommissioned...

> curl -s --user admin:40rt0n http://prdslsldsafht25.myfamilysouth.com:8080/api/v1/clusters/prdslsldsafht/hosts/prdslsldsafht12.myfamilysouth.com/host_components/DATANODE | less

Look for "desired_admin_state" : "DECOMMISSIONED".
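To avoid scrolling through the whole JSON payload, the API's fields query parameter can narrow the response; I believe the attribute lives under HostRoles, so something like this should print just the state (same host and credentials as the call above)...

> curl -s --user admin:40rt0n 'http://prdslsldsafht25.myfamilysouth.com:8080/api/v1/clusters/prdslsldsafht/hosts/prdslsldsafht12.myfamilysouth.com/host_components/DATANODE?fields=HostRoles/desired_admin_state'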

What seem to be the correct steps...
One would think that decommissioning should simply work, but there appear to be bugs that require the procedure to be specific...

  1. First make NN1 the Active NameNode (see the haadmin sketch after this list).
  2. Perform the decomm of the DN (via the Ambari UI).
    1. You will notice that NN2 does not recognize the decomm of the DN.
    2. There is a bug report indicating that the decomm action might be taken by a random NN, which may invalidate my hypothesis...
      1. https://issues.apache.org/jira/browse/AMBARI-4927
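A minimal sketch for checking, and if necessary switching, which NameNode is active. It assumes the HA service IDs are nn1 and nn2; use whatever dfs.ha.namenodes.<nameservice> defines in your hdfs-site.xml...

> hdfs haadmin -getServiceState nn1
> hdfs haadmin -getServiceState nn2

If NN1 reports "standby", fail over from NN2 to NN1 before starting the decomm...

> hdfs haadmin -failover nn2 nn1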

What goes wrong if the decomm occurs with NN2 as the Active NN...

  • If the Active is NN2 and Standby is NN1...
    • The decommissioning of a node is only recognized by NN1; NN2 will continue to write to the DataNode while NN1 tries to ensure that all blocks are replicated off of it. NN2 is at fault, as no new blocks should be sent to a decommissioning node.
    • Thus it is possible for Standby-NN1 to be decommissioning a DN while Active-NN2 is writing to the DN.
    • It may be that HA decommissioning tests are always running with Active-NN1 and Standby-NN2 and never the other way around.


The decommissioning nodes would not fully decommission because a few blocks continued to be reported as under-replicated, but I could not find those blocks and their files via fsck in order to handle them.

NN UI...
| Decommissioning Datanodes : 2
| Node             Transferring Address  Last Contact  Under Replicated  Blocks With No   Under Replicated Blocks      Time Since
|                                                      Blocks            Live Replicas    In Files Under Construction  Decommissioning Started
| prdslsldsafht12  10.211.25.122:50010   2             232143            0                4                            0 hrs 8 mins
| prdslsldsafht13  10.211.25.123:50010   2             261441            0                3                            0 hrs 8 mins
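The same progress can also be watched from the command line. hdfs dfsadmin -report prints a per-DataNode "Decommission Status" line that should read "Decommission in progress" for these two hosts and eventually "Decommissioned"; the grep is just a convenience and the exact label can vary a bit by Hadoop version...

> hdfs dfsadmin -report | egrep "^(Name|Decommission Status)"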

hdfs fsck...
|  Total dirs:    140574
|  Total files:   5367953
|  Total symlinks:                0 (Files currently being written: 5353)
|  Total blocks (validated):      5683249 (avg. block size 40621873 B) (Total open file blocks (not validated): 1796)
|  Minimally replicated blocks:   5683249 (100.00001 %)
|  Over-replicated blocks:        385172 (6.7773204 %)
|  Under-replicated blocks:       0 (0.0 %)
|  Mis-replicated blocks:         0 (0.0 %)
|  Default replication factor:    3
|  Average block replication:     3.0742598
|  Corrupt blocks:                0
|  Missing replicas:              0 (0.0 %)
|  Number of data-nodes:          75
|  Number of racks:               5
| FSCK ended at Thu Nov 13 21:26:31 MST 2014 in 1680353 milliseconds


I simply pull the nodes, making them dead nodes by turning off the DN process.
Strangely, Ambari shows that the nodes are in a decommissioned state.
But I still have the option to stop the DataNode service, so I do that.
The NN UI continues to indicate that the 2 nodes are still in the process of decommissioning.
So we need to make the NN think the nodes are dead by removing the hostnames from dfs.exclude and running -refreshNodes.
After running -refreshNodes, the nodes no longer display in the Decommissioning nodes list,
and the main NN UI page shows that 2 nodes are Decommissioned.
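A rough sketch of that last step. It assumes the exclude file configured by dfs.hosts.exclude lives at /etc/hadoop/conf/dfs.exclude (the usual HDP layout, but check your own hdfs-site.xml) and that the edit is made on both NameNode hosts. Stop the DataNode process on each host, delete prdslsldsafht12 and prdslsldsafht13 from the exclude file, then tell the NameNodes to re-read it...

> sudo -u hdfs hdfs dfsadmin -refreshNodes

Depending on the Hadoop version this may only hit the active NN, in which case run it against each NameNode explicitly with -fs hdfs://<nn-host>:8020.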

Thus, we can bypass the decommissioning process if it can't resolve the last few under-replicated blocks.

The following link was helpful in that it indicated that the decommission process can completely stall on a final few under-replicated blocks. One should be able to find the files associated with those under-replicated blocks and handle them appropriately. But in our case, fsck was not finding any under-replicated blocks at all.
http://stackoverflow.com/questions/17789196/hadoop-node-taking-a-long-time-to-decommission
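For the record, this is roughly how one would normally hunt those files down; a sketch only, and /path/to/stuck/file is a placeholder...

> hdfs fsck / -files -blocks -locations | grep -i "Under replicated"

Any file that turns up can then be nudged along, for example by temporarily raising its replication so the NN schedules fresh copies...

> hdfs dfs -setrep -w 4 /path/to/stuck/file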