Total Pageviews

Tuesday, December 16, 2014

Trace Investigation with HDInsight Hive query



Pre-requisites

This article is to parallelize  trace investigations using Hadoop Map Reduce and thereby reduce time and effort in investigations. 

Current Problem

1.       During on call events there is a time and effort investment in recognizing the exact issues. There is no mechanism which can give a prediction of probable issues.
2.       Traces are big data. Huge GB files are scanned to filter out exact traces. Currently we scan through traces which we download as buffer locally. Here Parallel execution of multiple filters is putting more load to the system.
3.       We are having 10-15 filter strings but executing all of them in one go is not possible in current scenario.

Proposed Solution
We propose a solution of using HDInsight to use Hive Query and do parallel execution using map reduce . Map Reduce is a technology to divide the large data in multiple chunks and send it to mappers. Mappers are executors which work on small data size and provide output. Outputs from different mappers are combined and reduced to a cumulative output using Reduceer.

We can execute Hive query language implementation on the winfab traces to accomplish multiple known filters consequently on the big data and filter out the traces in text format.

Before going forward here are few terminology that will help us the technologies used for this parsing.

What is Hive
Hive is a data warehousing infrastructure based on Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing (using the map-reduce programming paradigm) on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides a simple query language called Hive QL, which is based on SQL and which enables users familiar with SQL to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive QL also allows traditional map/reduce programmers to be able to plug in their custom mappers and reducers to do more sophisticated analysis that may not be supported by the built-in capabilities of the language.


What is HDInsight

HDInsight is a Hadoop distribution powered by the cloud. This means HDInsight was architected to handle any amount of data, scaling from terabytes to petabytes on demand. You can spin up any number of nodes at anytime.


Powershell Script

In powershell execute Get-AzurePublishSettingsFile.
Currently we have a hdinsight created in a different Azure account.



1.       Save the publishsetting file in some location
2.       Execute Import-AzurePublishSettingsFile -PublishSettingsFile <<filepath where you saved the publishsettingfile>>
3.       Verify connection using Use-AzureHDInsightCluster -Name wftraceanalyzer
Note: There is an issue in connecting to the cluster, it might be because of incompatible powershell and Hadoop SDK. Please read this link
4.       After verification, execute the following powershell script

$clusterName = "<<hdinsight cluster name>>"
$subscriptionName ="<<subscription name>>"
$storageAccountName="<<storage account name>>"
$storageAccountKey = "<<storage key>>"
$containerName="wftraceanalyzer"
$destfolder ="hive/warehouse/traces/"
$logoutputfolder = "F:\Hadoop\Trace"
$inputfilepath="E:\investigation\failoverapi12_12\Failover_ApiFaults\log\Traces"

$destContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

$files = Get-ChildItem $inputfilepath\*.* -Include *.trace | Where-Object {$_.Name -match "fabric_trace"}
foreach($file in $files)
{
    $fileName = $file.Name   
    $blobName = "$destfolder$fileName"
    Set-AzureStorageBlobContent -File $file -Container $containerName -Blob $blobName -Context $destContext -Force
}

#Hive query to create the table on the file structure foundation and load the trace data in hive table
$querystring="DROP TABLE trace; CREATE TABLE trace (date STRING, level STRING, process INT, thread INT, type STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','; LOAD DATA INPATH '/hive/warehouse/traces/' INTO TABLE trace;"
$jobDef = New-AzureHDInsightHiveJobDefinition -Query $querystring
$hivejob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef

#Troubleshooting 1:type~RE.PrimaryReplicate@<PartitionId>
$querystring= "SELECT * from trace where type like '%RE.PrimaryReplicate%'"
$response = Invoke-Hive $querystring | Out-File $logoutputfolder\REPrimaryReplicate.txt


#Troubleshooting 2:type~FM.*Upgrade
$querystring= "SELECT * from trace where type like '%FM%Upgrade%'"
Invoke-Hive $querystring | Out-File $logoutputfolder\FMUpgrade.txt

#Troubleshooting 3:type~FM.AppUpgrade
$querystring="SELECT * from trace where type like '%FM%AppUpgrade%'"
Invoke-Hive $querystring | Out-File $logoutputfolder\FMAppUpgrade.txt

#Troubleshooting 4:type~FabricNode.NodeOpening
$querystring="SELECT * from trace where type like '%FabricNode%NodeOpening%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\FabricNodeOpening.txt

#Troubleshooting 5:type~CM.*Upgrade && level~Warning
$querystring="SELECT * from trace where type like '%CM%Upgrade%' and level like '%arning%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\CMUpgrade.txt

#Troubleshooting 6:type~FM.*Upgrade && ~pending
$querystring="SELECT * from trace where type like '%FM%Upgrade%' and text like '%pending%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\FMUpgradePending.txt

#Troubleshooting 7:type~FM.FabricUpgrade && ~pending
$querystring="SELECT * from trace where type like '%FM%FabricUpgrade%' and text like '%pending%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\FMFabricUpgradePending.txt

#Troubleshooting 8:type~RA.Upgrade
$querystring="SELECT * from trace where type like '%RA%Upgrade%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\RAUpgrade.txt

#Troubleshooting 9:type~RA.FT|RAP.FT|RAP.Api && text ~ "on node <node_id>"
$querystring="SELECT * from trace where type like '%RA%FT' or type like '%RAP%FT%' or type like '%RAP%Api' and text like '%on node%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\RAFT.txt

#Troubleshooting 9:type~RE.PrimaryCatchUpQuorum
$querystring="SELECT * from trace where type like '%RE%PrimaryCatchUpQuorum'"
Invoke-Hive $querystring| Out-File $logoutputfolder\REPrimaryCatchUpQuorum.txt

#Troubleshooting 10:type~Lease.SlowSendCompletion
$querystring="SELECT * from trace where type like '%Lease%SlowSendCompletion%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\LeaseSlowSend.txt

#Troubleshooting 11:type~Config && text~value
$querystring="SELECT * from trace where type like '%Config%' and text like '%value%'"
Invoke-Hive $querystring| Out-File $logoutputfolder\Config.txt


5.       There are bunch of filters which are taken and sent using Hive query and corresponding logs are stored in output folder

Input folder: where the traces are kept



Output folder:


The execution takes 5-15 min based on the data size. Eventually it gives all the traces.


Nabarun

Sunday, March 17, 2013

3 days to create a data visualizer tool from scratch


In this blog I will like to pen down my 3days experience in learning a technology called NodeXL from scratch and creating an app out of that which can be downloaded for usage at http://netme.cloudapp.net/Downloads.aspx.
This could have been done in a shorter span, but considering my laziness, the delay was bound to happen J
Following is the story line best to my knowledge.

March 11, 2013 6:00pm


One of the member of the Microsoft’s Most Valued Professional group,  saw the Data Visualization section in NetMe and asked me about my experience in NodeXL.
NodeXL is a free open source set of libraries which helps you create a graph out of a GraphML or GEXF file format. You can also create your custom graphs using the libraries. This is an easy to learn feature and very effective in Data Visualization.
 The objective was that NodeXL comes with an Microsoft Office Excel dependency, he was looking for an approach how we can remove this dependency and create a windows form app which will fetch the twitter related nodes and its children and show it in a graphical format.
NodeXL was totally new to me and creating an app on that seems to be challenging. I thought to give a try and see how it goes.

March 11, 2013 9:00 pm


I got to know the basic features of what NodeXL do. Downloaded the source code of the same, but diving into 18 projects in the source code repository was not only time consuming but also not feasible for me. So I posted the discussion in NodeXl community forum at http://nodexl.codeplex.com/discussions/436250
To my query, I was really satisfied by the quick response given by the moderator. At around 10 pm IST I received a response from him that it was possible and there was a library set that we can download and start using.
 

March 13, 2013


Till now I was confident enough about the usage in NodeXL. I prepared a small POC wherein I created a graph using a sample graphml file. GraphML file format are used to create directed, undirected and mixed graphs. It is based on XML based file format. A sample GraphML format could be created as likewise
<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
     http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <graph id="G" edgedefault="undirected">
    <node id="n0"/>
    <node id="n1"/>
    <node id="n2"/>
    <node id="n3"/>
    <node id="n4"/>
    <node id="n5"/>
    <node id="n6"/>
    <node id="n7"/>
    <node id="n8"/>
    <node id="n9"/>
    <node id="n10"/>
    <edge source="n0" target="n2"/>
    <edge source="n0" target="n3"/>
    <edge source="n1" target="n2"/>
    <edge source="n2" target="n3"/>
    <edge source="n3" target="n5"/>
    <edge source="n3" target="n4"/>
    <edge source="n4" target="n6"/>
    <edge source="n6" target="n5"/>
    <edge source="n5" target="n7"/>
    <edge source="n6" target="n8"/>
    <edge source="n8" target="n7"/>
    <edge source="n8" target="n9"/>
    <edge source="n8" target="n10"/>
  </graph>
</graphml>
 

Previous NodeXL community discussions helped me a lot in this. I am not highlighting the code as that can be easily searched in the NodeXL community discussion forum. Rather let me focus on the major problem statement of how to interact NodeXL with Twitter API.
At first I thought to take help of community member on this so posted the second thread on the forum

March 14, 2013


To this I got the whole step by step approach by the moderator on how we can use the NodeXL libraries to fetch twitter data. This was indeed of great help, but at this moment the Azure Mobile script which I wrote for NetMe flashed up. I thought to reuse the code.
The code was to fetch top 18-20 tweets with their username for any particular searched keyword.
Following is the AMS Node.JS code that is fulfilling the requirement
function read(query, user, request) {
    console.log(request.parameters.type);
    readExpert(request);
}
function readExpert(request)
{    console.log(request.parameters.topic);
    var res="<xml>";  
   var req = require('request');     
    req.get({            uri:"http://search.twitter.com/search.json?q="+request.parameters.topic+"&result_type=mixed",            headers: {'content-type': 'application/x-www-form-urlencoded'},      
},

    function(error,response,body){
        if (!error && response.statusCode == 200) { 
               var resp = JSON.parse(body);
               for(var attribute in resp)
               {                  
                   if(attribute == "results")
                   {
                       console.log(resp[attribute]);
                       var len = resp["results"].length;                      
                       for(var loop=0; loop< resp[attribute].length; loop++)
                        {
                                res += "<Tweet>"
                                res += "<UserName>" + resp[attribute][loop].from_user_name +"</UserName>";
                                res += "<UserStatus>" + resp[attribute][loop].text+ "</UserStatus>"
                                res += "</Tweet>";                         
                        }
                   }
               }
               res += "</xml>";
               request.respond(200, res);
                                }   
                                else   
                                {
            console.log(error);
            request.respond(statusCodes.BAD_REQUEST, error);
                                }
    });
}

Now the next step was to use this data and convert it into graph using NodeXL libraries. Here let me make a note for all the libraries I used
1.       Smrf.NodeXL.Adapters
2.       Smrf.NodeXL.Control.WPF
3.       Smrf.NodeXL.Core
4.       Smrf.NodeXL.Layouts
5.       Smrf.NodeXL.Visualization.Wpf
The code for creating a graph is as follows

XmlDocument graphML = new XmlDocument();
 
//Helper class is having the logic of sending the request to Azure Mobile Service which has been mentioned in my previous blog

                HelperClass _helperClass = new HelperClass("https://datavisualizer.azure-mobile.net/tables/datavisualizer", <<Azure Mobile Service Key>>, Guid.NewGuid().ToString());
                string response = _helperClass.ReadExperts(topic);
                string tweet = String.Empty;
                XmlDocument doc = new XmlDocument(); 

//Define NodeXL control
                NodeXLControl nodexl_Graph = new NodeXLControl();
if (response != "")
                {
                    doc.LoadXml(response);
                    IVertex headVertex = nodexl_Graph.Graph.Vertices.Add();
                    headVertex.SetValue(ReservedMetadataKeys.PerVertexShape, VertexShape.Circle);
                    headVertex.SetValue(ReservedMetadataKeys.PerVertexLabel, topic);
                    headVertex.SetValue(ReservedMetadataKeys.PerVertexLabelFillColor, System.Drawing.Color.Blue);
                    foreach (XmlNode node in doc.SelectNodes("//xml//Tweet"))
                    {
                        IVertex vertex = nodexl_Graph.Graph.Vertices.Add();
                        foreach (System.Xml.XmlElement item in node)
                        {
                            if (item.Name == "UserName")
                            {
                                vertex.SetValue(ReservedMetadataKeys.PerVertexShape, VertexShape.Circle);
                                vertex.SetValue(ReservedMetadataKeys.PerVertexLabel, item.InnerText);
                            }
                            if (item.Name == "UserImage")
                            {
                                var webClient = new WebClient();
                                byte[] imageBytes = webClient.DownloadData(item.InnerText);
                                BitmapImage imageSource = new BitmapImage();
                                using (MemoryStream stream = new MemoryStream(imageBytes))
                                {
                                    stream.Seek(0, SeekOrigin.Begin);
                                    imageSource.BeginInit();
                                    imageSource.StreamSource = stream;
                                    imageSource.CacheOption = BitmapCacheOption.OnLoad;
                                    imageSource.EndInit();
                                }
                                vertex.SetValue(ReservedMetadataKeys.PerVertexImage, imageSource);
                                vertex.SetValue(ReservedMetadataKeys.PerVertexLabelFillColor, Colors.Transparent);

                            }
                            if (item.Name == "UserStatus")
                            {
                                IVertex childvertex = nodexl_Graph.Graph.Vertices.Add();
                                childvertex.SetValue(ReservedMetadataKeys.PerVertexShape, VertexShape.Square);
                                childvertex.SetValue(ReservedMetadataKeys.PerVertexLabel, item.InnerText);
                                childvertex.SetValue(ReservedMetadataKeys.PerColor, System.Drawing.Color.Green);
                                IEdge childEdge = nodexl_Graph.Graph.Edges.Add(vertex, childvertex);
                                childEdge.SetValue(ReservedMetadataKeys.PerColor, System.Drawing.Color.Red);
                            }
                            IEdge edge = nodexl_Graph.Graph.Edges.Add(headVertex, vertex);
                        }
                    }
                }
                nodexl_Graph.DrawGraph(true);
 
// NodeXL is a WPF control. If you need to host this control in Windows form you need to create an element host control and assign the control to that.
                elementHost_Graph.Child = nodexl_Graph;

POC

 
This app has been uploaded under the downloads page in http://netme.cloudapp.net/ .  To try this app you need to download the zip folder, extract it and run the msi file. It will get installed in your machine and you will get to see the NetMe shortcut in your desktop. As this is a POC, I have not focused much on the look and feel of the app.
Once the app gets installed you will be able to search for any keyword and see the corresponding graph. You can shift the nodes accordingly.
The major intent of this blog was to highlight how fast and easy it is to code using NodeXL libraries and create your own data visualizer tool within a short time span. Hope you will like it  J