Learn and shine

Sunday, February 28, 2016

Apache Hive Basics

Hive Back ground

1. Hive Started at Facebook.
2. Data was collected by cron jobs every night into Oracle DB.
3. ETL via hand-coded python
4. Grew from 10s of GBs(2006) to 1TB/day new data in 2007 , now 10x that

Facebook usecase
1. Facebook uses more than 1000 million users
2. Data is more than 500 TB per day
3. More than 80k queries for day
4. More than 500 million photos per day.

5. Traditional RDBS will not the right solution, to do the above activities.
6. Hadoop Map Reduce is the one to solve this.
7. But Facebook developers having lack of java knowledge to code in Java.
8. They know only SQL well.
So They introduced Hive
Hive
1. Tables can be partitioned and bucketed.
Partitioned and bucketed are used for performance
2. Schema flexibility and evolution
3. Easy to plugin custom mapper reducer code
4. JDBC/ODBC Drivers are available.
5. Hive tables can be directly defined on HDFS
6. Extensible : Types , formats, Functions and scripts.
What do we mean by Hive
1. Data warehousing package built on top of hadoop.
2. Used for Data Analytics
3. Targeted for users comfortable with SQL.
4. It is same as SQL , and it will be called as HiveQL.
5. It is used for managing and querying for structured data.
6. It will hide the complexity of Hadoop
7. No need to learn java and Hadoop API’s
8. Developed by Facebook and contributed to community.
9. Facebook analyse Tera bytes of data using Hive.

Hive Can be defined as below
• Hive Defines SQL like Query language called QL
• Data warehouse infrastructure
• Allows programmers to plugin custom mappers and reducers.
• Provides tools to enable easy to data ETL
Where to use Hive or Hive Applications?
1. Log processing
2. Data Mining
3. Document Indexing
4. Customer facing business intelligence
5. Predective Modeling and hypothesis testing
Why we go for Hive
1. It is SQL like types and if we provide explicit schema and types.
2. By using Hive we can partition the data
3. It has own Thrift sever, we can access data from other places.
4. Hive will support serialization and deserialization
5. DFS access can be accessed implicitly.
6. It supports Joining , Ordering and Sorting
7. It will support own Shell hive script
8. It is having web interface
Hive Architecture

1. Hive data will be stored in Hadoop File System.
2. All Hive meta data like schema name, table structure,view name all the details will be stored in Metastore
3. We will Hive Driver, it will take the request and compile and convert into hadoop understanding language and execute the same.
4. Thrift server is will access hive and fetch data from DFS.

Hive Components

Hive Limitations
1. Not designed for online transaction processing.
2. Does not offer real time queries and row level updates
3. Latency for Hive query’s is high(It will take minutes to process)
4. Provides acceptable latency for interactive data browsing
5. It is not suitable for OLTP type applications.
Hive Query Language Abilities

What is the traditional RDBMS and Hive differences
1. Hive will not verify the data when it is loaded, but it is do at the time of query issued.
2. Schema on read makes very fast initial load. The file operation is just a file copy or move.
3. No updates , Transactions and indexes.
Hive support data types

Hive Complex types:
Complex types can be built up from primitive types and other composite types using the below operators.

Operators
1. Structs: It can be accessed using DOT(.) notation
2. Maps: (Kye-value tuples), it can be accessed using [element-name] as notation
3. Arrays: (Indexable lists) Elements can be accessed using the [n] notation, where n is an index (zero –based) into the array.
Hive Data Models
1. Data Bases
Namespaces – ex: finance and inventory database having Employee table 2 different databases
2. Tables
Schema in namespaces
3. Partitions
How data is stored in HDFS
Grouping databases on some columns
Can have one or more columns
4. Buckets and Clusters
Partitions divided further into buckets on some other column
Use for data sampling

Hive Data in the order of granularity

Buckets
Buckets give extra structure to the data that may be used for more efficient queries
A join of two tables that are bucketed on the same columns – including the join column can be implemented as Map Side Join
Bucketing by user ID means we can quickly evaluate a user based query by running it on a randomized sample of the total set of users.

These are the basics about Hive.

Thank you for viewing the post.

Thursday, February 25, 2016

Clickjacking prevention using X Frame Options and J2EE Filter

1. What is Clickjacking.
It is also known as User Interface redress attack, UI redress attack, UI redressing
It is a malicious technique of tricking a Web user into clicking on something different from what the user perceives they are clicking on, thus potentially revealing confidential information or taking control of their computer while clicking on seemingly innocuous web pages. It is a browser security issue that is a vulnerability across a variety of browsers and platforms
2. How to prevent Clickjacking using Filter in java
Below example shows how Clickjacking will happens and how we can prevent the same.

Here I have created a Simple LoginServlet , after successful login, page will be redirected to success page.
Everyone knows how to create servlet and deploy the same. But still I am writing here to understand who have no idea how to create.
Step 1: Start eclipse
Step2: create a Dynamic Web Project -> clickjacking_prevention
Step3: first we need to create a login.jsp page, under Webcontent of the project

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>




Login page


              User Name                   
          Password

Step 4: Need to create a success page

<%@ page language="java" contentType="text/html; charset=ISO-8859-1"
    pageEncoding="ISO-8859-1"%>




Login Success


                 Login Successful        
You can construct page as you like

Step 5: Now we need to create a LoginServlet

package com.siva;

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class LoginServlet extends HttpServlet{

 /**
  * 
  */
 private static final long serialVersionUID = 1L;

 public void doPost(HttpServletRequest request, HttpServletResponse response)
   throws ServletException, IOException {

  String username = request.getParameter("username");
  String password = request.getParameter("password");
  if("siva".equalsIgnoreCase(username)&& "raju".equalsIgnoreCase(password)){
   System.out.println("inside if condition");
   response.sendRedirect("loginSuccess.jsp");
  }
 }
}

Step 6: Now we need to do Configuration in web.xml for LoginServlet



  clickjacking_prevention
  
    login.jsp
   
  
    
  
    LoginServlet
    com.siva.LoginServlet
  
  
   LoginServlet
   /loginServlet

Step 7: Once this configuration done, Now we can run the project using any of the servers like Apache tomcat or Jboss.
You can use the http://localhost:8080/clickjacking_prevention/

It will open page like above and you can enter username as siva and password as raju, then submit,
You can redirected to loginSuccess page

Create a html file and provide name as you like and paste the below code.



  click jaking

Once we run this html file we can see the same data which is showed in the loginSuccess page

Step 10 : Now we can see the difference between above two images. One is url page and one is iframe constructed page, both are same.
So hacker can use this , and patch in your actual site and steal the data.
Now How to prevent this.
We need to add this code in our filter or jsp page.
response.addHeader("X-FRAME-OPTIONS", “DENY” );
Here I have written Filter to overcome clickjacking

package com.siva;

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;



public class ClickjackingPreventionFilter implements Filter 
{
  private String mode = "DENY";
  
// Add X-FRAME-OPTIONS response header to tell any other browsers who   not to display this //content in a frame.
     public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
         HttpServletResponse res = (HttpServletResponse)response;
         res.addHeader("X-FRAME-OPTIONS", mode );   
         chain.doFilter(request, response);
     }
     public void destroy() {
     }
     
     public void init(FilterConfig filterConfig) {
         String configMode = filterConfig.getInitParameter("mode");
         if ( configMode != null ) {
             mode = configMode;
         }
     }
}

Step 11: Once Filter has completed now we need to add same filter configuration in web.xml file


        ClickjackPreventionFilterDeny
        com.siva.ClickjackingPreventionFilter
        
            modeDENY
    
    
    
     
        ClickjackPreventionFilterDeny
        /*

Once we have done configuration , you can run the same Iframe example again, you can see the below page without any content, it will show warning in IE and it will not show any details in other browser.

This is how we can prevent the clickjacking attacks.
Thank you for viewing the post.

Sunday, February 14, 2016

Getting started Hadoop with oracle or vmware virtual box and Ubuntu

Hadoop installation with Single DataNode( VMware or Oracle virtual box)
Download latest version VM ware from the below link
http://www.traffictool.net/vmware/
Download Oracle virtual box from the below site and install the same in local system.
http://www.oracle.com/technetwork/server-storage/virtualbox/downloads/index.html
Run the Virtual box(VirtualBox.exe) Application
click on new ->

And click on Next->Next-And create virtual box
Once that’s done virtual box will look like this. Select the Ubuntu downloaded package.

Start the Virtual box, then provide password from which user you want to start.

Once virtual box started then screeb will look like this

Open the terminal, by right click on the screen or search for terminal and open the same.

Command:to update the ubuntu
1. sudo apt-get update
Once update is complete

Command: install openssh server
2. sudo apt-get install openssh–server
Command: create a hadoop directory
3. mkdir /usr/local/hadoop
Download the hadoop latest version from below link
http://hadoop.apache.org/releases.html
copy to virtual box and extract the tar file
Here I extracted under /usr/local/hadoop/
Command: to extract the tar file
4. tar -xvf .tar.gz
After extracting enter this command ls –lrt , you can see the list of folders related to hadoop
Command: To add hadoop to the group
5. sudo addgroup hadoop
Command: create new user called hduser
6. sudo adduser --ingroup hadoop hduser

Command: assign hduser to sudo
7. sudo adduser hduser sudo
Command: change the owner for hadoop as hduser
8. sudo chown –R hduser:hadoop /usr/local/hadoop
Command: switch to hduser
9. su – hduser

Command: install ssh
10. sudo apt-get install ssh
Command: generate a ssh key
11. ssh-keygen -t rsa –P ""
/home/hduser/.ssh/id_rsa
Command: copy id_rsa.pub key to authorized_keys
12. cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Command: install vim editor
13. sudo apt-get install vim
Command: Edit the sysctl.conf file to dispable few of the ipv6 realted configuration
14. sudo gedit /etc/sysctl.conf or sudo vi /etc/sysctl.conf
Add below lines

net.ipv6.conf.all.disable_ipv6=1
                   net.ipv6.conf.default_ipv6=1
                  net.ipv6.conf.io.disable_ipv6=1

Command:Start the ssh
15. ssh localhost
Command: get the updates
16. sudo apt-get update
Command: edit the bashrc file to add the path of java and hadoop
17. sudo vi ./bashrc or sudo gedit ./bashrc

export  HADOOP_HOME = /usr/local/hadoop
         export  JAVA_HOME=/usr   [or] where ever your java installed location

Command: Source the bashrc file
18. source .bashrc

Command: Now check the version of java and hadoop
19. java –version
20. hadoop version
Command:Create a data directory inside /usr/local/hadoop
21. mkdir /usr/loca/hadoop/data
Command: edit the hadoop_env.sh file to add the configuration
22. sudo gedit /usr/loca/hadoop/etc/hadoop/hadoop_env.sh

export JAVA_HOME=/usr 
        export HADOOP_OPTS=”$HADOOP_OPTS –Djava.net.preferIPv4Stack= true  -Djava.library.path=$HADOOP_PREFIX/lib”

Command: edit the yarn_env.sh file to add the configuration
23. sudo gedit /usr/loca/hadoop/etc/hadoop/yarn_env.sh

export HADOOP_CONF_LIB_NATIVE_DIR=${HADOOP_PREFIX:-“lib/native”}
        export HADOOP_OPTS=” Djava.library.path=$HADOOP_PREFIX/lib”

Now we need to edit the some of the hadoop related files, to start the single node
Go to /usr/local/hadoop/etc/hadoop$
Command: Edit the existing file and add the below configuration
24. sudo gedit core-site.xml


fs.default.name
hdfs://localhost:9000


hadoop.tmp.dir
/usr/local/hadoop/data

Command: Rename mapred-site.xml.template to mapred-site.xml
Go to /usr/local/hadoop/etc/hadoop
25. mv mapred-site.xml.template mapred-site.xml
26. sudo gedit mapred-site.xml


    
       mapreduce.framework.name
       yarn

Then close this file
Edit the hdfs-site.xml,
Command: to edit the hdfs-site.xml
27. sudo gedit hdfs-site.xml


dfs.replication
3

Command:Edit the yarn.xml
28. sudo gedit yarn.xml


   
       yarn.nodemanager.aux-services 
       mapreduce_shuffle
  
 
        yarn.nodemanager.aux-services.mapreduce_shuffle.class
       org.apache.hadoop.mapred.ShuffleHandler
  


       yarn.resourcemanager.resource-tracker.address
       localhost:8025
  


        yarn.resourcemanager.scheduler.address
       localhost:8030
  


        yarn.resourcemanager.address
       localhost:8050

Command: Need to format the namenode
29. /usr/local/hadoop/bin/hadoop namenode –format
After this format done then we need to start the dfs and yarn
30. /usr/local/hadoop/sbin/start-dfs.sh
31. /usr/local/hadoop/sbin/start-yan.sh
Command: to display all the running datanodes and namemodes
32. jps

This is how we can setup the hadoop using oracle/vmware virtual box.

Thank you for viewing this post.

Saturday, February 6, 2016

Getting started with web2py using pythonanywhere web hosting service (cloud)

This post tells you, how to deploy and execute python code written using web2py framework in
pythonanywhere web hosting service kind of cloud for python.
First write any sample code using web2py framework. Sample codes you can check my previous posts like getting started with web2py , blog app using web2py

Once you have completed the simple project, then we need to deploy the same using the cloud.
Web2py can be deployed any web hosting services, now we will look how we can deploy using https://www.pythonanywhere.com/

Click on Signup here! (if you are not sign up yet)

Click on Create a Beginner account and provide the required details, it is enough to post any details in internet.

After successful sign up and Login then, you can redirected to pythonanywhere

Click on DashBoard Then Click on Web

Now we need to create new web app (Click on Add anew web app).

Pythonanywhere will support so many python frameworks, Select the web2py , since we are implementing application using web2py.

Provide the password , which is required to access the application. It will created directory (/home/siva82k/web2py/) with my username.
Click on Next, which will create url for you

Now you can check your application through internet, usually welcome application will be copied to your account.
My case my url will be http://siva82k.pythonanywhere.com, if you try to click on this you can redirected to your application.

Now it’s time to deploy our existing code into pythonanywhere site. First go to our application, where we have written our code, click on pack all as shown in below image.

Then save the code in local system.Once it is completed then go to your pythonanywhere site.
Click on Administrative Interface, and the provide the password, which you have given while creating web2py account.

After successful Login, it will redirected to below page

We need to upload the file into python anywhere site . Provide the details under upload and install packed application
I am providing Application name as sivaweb2py
Upload a package from your local system, Earlier where you have downloaded.
Then click on install, our application got installed on pythonanywhere machine.
Now what ever we did in local host machine same thing available in pythonanywhere internet.
You can check my previous posts related to web2py examples.
Earlier we checked Role based access, same thing we will check in pythonanywhere
Click on https://siva82k.pythonanywhere.com/sivaweb2py/blog/view

It will ask the username and password , in my case I have provided user only (siva82k@gmail.com), have access to post the blog.
https://siva82k.pythonanywhere.com/sivaweb2py/blog/post

If we provide correct user name and password, it will take us to post the blog.

If we provide other details, other than post access then it will say you are not authorized.

One more example which we have worked earlier, basics to add the 2 numbers
https://siva82k.pythonanywhere.com/sivaweb2py/basics/request_args/10/20

We can test whatever we did in our previous examples in local system, same thing available in internet.
This is how we can deploy our web2py code using pythonanywhere.
Thanks for viewing this post