Monday, June 20, 2016

Upgrading Hadoop 2.0 to 2.4.2 (Avoiding a broken Ambari UI)


Hortonworks Hadoop Upgrade

HDP 2.0 to 2.4.2

This blog is a bit of info regarding a Hadoop cluster upgrade from HDP 2.0 to 2.4.2.

Given the current state of Hadoop at the company, we are mainly focused on an upgrade of HDFS, as current production systems mainly make use of HDFS, YARN, MapReduce, Hive, and Pig. Currently, we also run a stand-alone Spark 1.4.0 cluster that is set up in HA mode, along with Tachyon to improve disk access among other optimizations. Spark and Tachyon are built to use HDFS by default. We run 20 Spark Workers directly on Data Nodes of the cluster. We don't run YARN NodeManagers on nodes that run Spark Workers, so those nodes are mainly dedicated to Spark. One Tachyon worker also runs on each of the Spark Worker nodes.

We have been working out the details of the Hadoop upgrade on a small 5-node Dev cluster, where we start with HDP 2.0 and test the upgrade steps to HDP 2.4.2.
The initial plan was to start with Ambari 1.5 and HDP 2.0: upgrade Ambari to 1.7 and then to 2.0, upgrade HDP to 2.2, upgrade Ambari to 2.2, and finally upgrade HDP to 2.4.2 via Ambari.
The problem with this approach is that experience has shown that upgrades to Ambari fail nine times out of ten. In most cases, after incremental updates, Ambari gets into a state where certain pulldowns, dialogs, and/or tabbed pages are missing, making it impossible to perform some typical Ambari procedure, such as adding or monitoring a given service. True to form, the incremental upgrade of Ambari failed after the upgrade from 1.7 to 2.0.
First, we could not get WebHCat up and running via Ambari, and second, the Admin "Manage Ambari" pulldown was missing.
So I requested that we try the upgrade by short-cutting past the incremental upgrade steps, moving directly from 2.0 to 2.4.2. Luckily, Hortonworks' latest HDP core release (2.4.2) includes an upgrade path from 2.0 to 2.4.2, at least from a manual-upgrade angle. This was very good news to me when I heard and read about it. So here is what the "shortcut" plan entails...
  1. On all of the local drives of the NameNodes, JournalNodes, and DataNodes, we move the Hadoop data dirs to a read-only backup dir. All of our data in Hadoop is located on JBOD drives. We have 10 drives per node and we mount these drives as /grid/01 through /grid/10. Within each /grid/[n]/ dir, Hadoop creates its data/metadata directory "hadoop". We will move these hadoop/ dirs to hadoop_2.0.bk/ and make the dirs read-only.
  2. We make note of all config data that contains paths to /grid/[n]/hadoop.
  3. We make note of the location of all services by node.
  4. Next, we back up all configs in the current PRD Ambari DB (Postgres) and in /etc//conf, as well as the current Hive metadata (MySQL).
  5. We then remove all prior Hadoop and HDP packages from each node.
  6. We install the latest version of Ambari (2.2).
  7. We install all new HDP 2.4.2 services to the cluster, ensuring that services are placed on the same nodes as in the old cluster, as mapped in step 3.
  8. We bring up Hadoop 2.4.2 and test it.
  9. We enable HA following the same service map as the old cluster.
  10. We shut down all of HDFS.
  11. We run a script that renames all of the /grid/[n]/hadoop/ dirs to /grid/[n]/hadoop_2.4.2.init.
  12. We run a script that renames all of the /grid/[n]/hadoop_2.0.bk/ dirs to /grid/[n]/hadoop/ (the rename and startup commands are sketched after this list).
  13. We start up the journal nodes.
  14. We start up NN1 manually with the -upgrade option. NN1 will first upgrade the Journal Nodes and the Name Node data structures. Then NN1 will tell the data nodes to perform data structure upgrades.
  15. We start up all of the Data Nodes (which will receive a command from the NameNode to perform an upgrade). Thus, Data Nodes will upgrade in parallel.
  16. We start up the Standby NameNode (NN2) with the options "-bootstrapStandby -force".
  17. The name node should become active once all data nodes have upgraded.
  18. If the NN does not become active, and if it makes sense (based on the NameNode logs) to force the NN out of safe mode... We run: sudo -u hdfs hdfs dfsadmin -safemode forceExit
  19. Once the active NN is out of safe mode, run smoke tests.
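For reference, here is a rough sketch of steps 11 through 18 as shell commands. The /usr/hdp/current daemon-script paths and the 10-drive /grid layout are assumptions based on our environment and on the stock HDP layout; adapt them to your own cluster and run-book, and fan the rename loop out to every node however you normally do (pdsh, Ansible, etc.).

    # Steps 11-12: on every NameNode, JournalNode, and DataNode, swap the freshly
    # initialized 2.4.2 dirs out and the preserved 2.0 dirs back in.
    for n in $(seq -w 1 10); do
      mv /grid/${n}/hadoop        /grid/${n}/hadoop_2.4.2.init
      mv /grid/${n}/hadoop_2.0.bk /grid/${n}/hadoop
    done

    # Step 13: start the JournalNodes (on each journal host).
    su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-journalnode/../hadoop/sbin/hadoop-daemon.sh start journalnode"

    # Step 14: start NN1 with -upgrade; it upgrades the JournalNode and NameNode
    # metadata, then instructs the DataNodes to upgrade their local structures.
    su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start namenode -upgrade"

    # Step 15: start every DataNode; the data-structure upgrades run in parallel across nodes.
    su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-datanode/../hadoop/sbin/hadoop-daemon.sh start datanode"

    # Step 16: re-seed the standby NameNode (NN2) from NN1, then start it.
    su -l hdfs -c "hdfs namenode -bootstrapStandby -force"
    su -l hdfs -c "/usr/hdp/current/hadoop-hdfs-namenode/../hadoop/sbin/hadoop-daemon.sh start namenode"

    # Step 18: only if the NameNode logs show the active NN is stuck in safe mode.
    sudo -u hdfs hdfs dfsadmin -safemode forceExit
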
This upgrade process has been tested on our Dev Hadoop cluster. It is a good solution for our production cluster, as we are mainly concerned with upgrading HDP to the latest version. Using this process, our original Hadoop configurations are lost, but we don't mind, as we understand how to optimize HDFS and MapReduce, and it will be easy to do given the latest version of the Ambari UI. Our stand-alone Spark cluster will be replaced with Spark on YARN, since the latest version of HDP also brings the latest version of Spark (1.6) and allows us to make use of our cluster resources more effectively.

I also have to add here the steps that we used to upgrade our Hive MetaStore to the latest version.
Initially, we attempted the following steps (which failed)...

  1. Shut down Hive and the Hive MetaStore.
  2. Restore the old Hive MetaStore SQL DB (containing our original metadata).
  3. Execute the Hive MetaStore upgrade scripts to upgrade the MySQL Hive schema from 0.12 to 1.2.1.2.4 (a sketch of this step follows the list).
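For what it's worth, here is roughly what step 3 looks like. The schema upgrade can be driven either by running the bundled upgrade-*.mysql.sql scripts directly against MySQL or by using Hive's schematool; the sketch below uses schematool, and the paths, credentials, and starting schema version are placeholders for your own environment.

    # Rough sketch of step 3 (the in-place schema upgrade we later abandoned).
    # schematool reads the metastore JDBC settings from hive-site.xml; the
    # password below is a placeholder.
    export HIVE_CONF_DIR=/etc/hive/conf
    /usr/hdp/current/hive-metastore/bin/schematool -dbType mysql \
        -upgradeSchemaFrom 0.12.0 \
        -userName hive -passWord 'hive-db-password' -verbose
    # schematool walks the bundled upgrade scripts step by step until the schema
    # matches the installed Hive version (1.2.1.2.4 in our case).
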
Everything looked good until we tried to perform a group-by select on a partition field. A group-by on a partition field would result in an error during the map-reduce process, where map-reduce would complain that existing partitions were ambiguous. So it seemed that some strange change to the Hive MetaStore tables was causing subtle and bad effects. In fact, the deletion of a partition resulted in all of /apps/hive/warehouse/ getting deleted. This was really bad... After wrestling with, and failing at, attempts to review and patch the MetaStore schema in the MySQL database, we decided to take a different approach... something that starts us off in a clean, pristine state with regard to the Hive MetaStore...
  1. Start with the latest pristine version of the Hive MetaStore schema for 1.2.1.2.4.
  2. Capture the create-table output from the original HDP 2.0 production system, using the "show create table" Hive command.
  3. Run these create-table commands on the new system (HDP 2.4.2) to recreate our tables.
  4. Capture the "show partitions" output from the old cluster (HDP 2.0).
  5. Generate "Alter Table Add Partition" commands based on that output.
  6. Run the generated "Alter Table Add Partition" commands on the new cluster (HDP 2.4.2) to regenerate the table partitions (see the sketch after this list).
  7. Now, this worked with no issues, but you have to understand what you are doing when your Hive tables are "managed tables". With managed tables, your tables cannot be created if the directory already exists within /apps/hive/warehouse, so we also have a step where we rename /apps/hive/warehouse to /apps/hive/warehouse.save (and make it read-only) and create an empty /apps/hive/warehouse/ dir before running all of the create-table scripts, which restore the Hive MetaStore metadata for our tables as they were on our HDP 2.0 install. After all of the tables are created, we do a directory rename on the /apps/hive/warehouse dirs to flip our data dir (the one that contains the actual data) back into position, and we restore write access to it.
  8. Finally, we run a set of Hive queries to ensure that all of our data can be accessed via Hive again in the HDP 2.4.2 install.
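Here is a rough sketch of steps 2 through 7 for a single, hypothetical table in the default database (events is a made-up name). In practice you would loop this over your full table list, and a multi-column partition key needs a slightly smarter transform than the one-liner shown.

    # On the old HDP 2.0 cluster: capture the DDL and the partition list.
    hive -e "SHOW CREATE TABLE events;" > events.ddl
    echo ";" >> events.ddl    # hive -f needs a statement terminator
    # Note: you may need to strip or adjust the LOCATION clause in the captured DDL.
    hive -e "SHOW PARTITIONS events;" > events.partitions

    # Turn each partition line (e.g. "ds=2016-06-01") into an ADD PARTITION statement.
    awk -F= '{printf "ALTER TABLE events ADD IF NOT EXISTS PARTITION (%s='\''%s'\'');\n", $1, $2}' \
        events.partitions > events.add_partitions.hql

    # On the new HDP 2.4.2 cluster: park the real warehouse data, create an empty
    # warehouse dir, recreate the tables and partitions, then flip the data back.
    sudo -u hdfs hdfs dfs -mv /apps/hive/warehouse /apps/hive/warehouse.save
    sudo -u hdfs hdfs dfs -chmod -R a-w /apps/hive/warehouse.save   # protect the real data
    sudo -u hdfs hdfs dfs -mkdir /apps/hive/warehouse
    hive -f events.ddl
    hive -f events.add_partitions.hql
    sudo -u hdfs hdfs dfs -mv /apps/hive/warehouse /apps/hive/warehouse.empty
    sudo -u hdfs hdfs dfs -mv /apps/hive/warehouse.save /apps/hive/warehouse
    sudo -u hdfs hdfs dfs -chmod -R 775 /apps/hive/warehouse        # restore write access (match your own perms)
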
These steps are merely an outline of what needs to be done in general. There will be more details in your particular run-book, based on your cluster environment and on what you discover during the upgrade dry-runs that you will be doing on a development cluster... You will be doing dry-runs... won't you?

The bottom line is that a clean install of a particular version of Ambari and its corresponding database, followed by a clean install of the HDP version of choice, is the best path to a cleanly running Ambari and HDP install, given the current state of the install/upgrade process.
My experience has been that the most important aspect of the upgrade process, preserving and upgrading HDFS, "just works" (well, it had better work, or you can forget about it...). So we leverage that aspect of the upgrade process to bypass all of the headaches associated with incremental upgrades to Ambari.

We will exercise this upgrade process on our dev cluster a few more times, then we will do it on the production cluster at the end of the week, outside of production hours.
So stay tuned... I will update this post after the production system upgrade.

I'm hoping that this data helps you to have a safe, pleasant upgrade experience.
Hadoop upgrades can be tricky, so exercise your theories and create a run-book of what works on a test cluster. Then exercise your run-book a few times on your dev cluster to get all of the mysteries and kinks out.

2016-06-20
-Sidlo

Successful Production HDP 2.0 to 2.4.2 Upgrade Completed

We started the upgrade process on Friday 23-Jun, and had a completely successful upgrade with no data loss as of 25-Jun. The upgrade steps and process took a total of 28 hours. Besides my team, two awesome ThinkBig consultants, Serhiy Blazhiyevskyy and Cedric Barnett, were a huge help in ensuring the success of the upgrade.

One thing that we noticed was that installs of clients were taking a long time, and then later, restarts of services were also taking a long time. After some debugging, we found that there is a cluster-level variable called fetch_nonlocal_groups, which is set to True by default. The problem with this default is that on client installs and on service startups, Hadoop's group names are looked up non-locally, i.e. over LDAP. So, if you have LDAP-based authorization set up on your cluster, and you don't have those Hadoop groups set up in your LDAP servers, LDAP will perform an exhaustive search which can time out and cause the startup or install process to fail. Before we discovered that the fetch_nonlocal_groups variable can be set to False on your cluster's blueprint via Ambari's configs.sh script, we had to suffer through a slow and tedious install which we had to retry about 4 times before it was finally successful. Then we had to suffer through slow service startups before we finally figured out how to correct the situation. If you have a similar experience or environment, use the configs.sh script (as sketched below) to change the fetch_nonlocal_groups variable for your cluster blueprint to False. During a cluster install, it may be possible to perform the variable update before services are pushed to nodes by Ambari, but I am not an expert in Ambari's internals, so I am just guessing.
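If it helps, here is roughly what the configs.sh call looks like. The Ambari host, cluster name, and admin credentials below are placeholders, and the property is usually found in the cluster-env config type; double-check with a "get" before you "set".

    cd /var/lib/ambari-server/resources/scripts

    # Inspect the current value first.
    ./configs.sh -u admin -p admin get ambari.example.com MyCluster cluster-env

    # Flip fetch_nonlocal_groups to false, then restart the affected services from Ambari.
    ./configs.sh -u admin -p admin set ambari.example.com MyCluster cluster-env fetch_nonlocal_groups false
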

One useful tip, with regard to something unexpected that can come up during this type of production upgrade: expect to find that hard drives have failed, or are failing badly enough for some services to fail. We had 8 drives fail across 8 nodes, and on one other node, we had 8 out of 10 drives fail. In my experience, a DataNode will not start up if even one drive has failed on the node (even if the node's failed-volume tolerance has been set to 2 or 3). Now, to ensure that we had no data loss, we needed to start up with all good drives attached to running DataNodes and upgrade with all of the good drives that were available. So, we needed the DataNodes that have bad drives to start up by having them disregard any drives that are bad. Luckily, this can be done through Ambari's Configuration Groups. Being able to use Configuration Groups to get DataNodes to start up even when they have bad drives was a real life... data-saver.
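As a rough illustration of what the Configuration Group override buys you, suppose drives /grid/03 and /grid/07 have died on one DataNode (a made-up example). The /grid/[n]/hadoop/hdfs/data sub-path below is an assumption based on our mount layout and the stock HDP defaults; check your own dfs.datanode.data.dir value first.

    # Crude check for dead JBOD mounts on a DataNode (failed reads show up as I/O errors).
    for n in $(seq -w 1 10); do
      ls /grid/${n}/hadoop/hdfs/data >/dev/null 2>&1 || echo "/grid/${n} looks bad"
    done

    # In Ambari: Manage Config Groups -> create an HDFS group containing just the
    # affected host(s), then override dfs.datanode.data.dir for that group so it
    # lists only the healthy mounts, e.g. (drives 03 and 07 dropped):
    #   /grid/01/hadoop/hdfs/data,/grid/02/hadoop/hdfs/data,/grid/04/hadoop/hdfs/data,
    #   /grid/05/hadoop/hdfs/data,/grid/06/hadoop/hdfs/data,/grid/08/hadoop/hdfs/data,
    #   /grid/09/hadoop/hdfs/data,/grid/10/hadoop/hdfs/data
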

2016-06-25
-Sidlo