Sunday, 24 March 2013

IT BAL # Session 9

 What is Google Refine:

Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

Now the features are described below:

Data Loaded in Google Refine:

                               






One of the most exciting features of Google refine is faceting.

Facet is created on a particular column. The facet summarizes the cells in that column to give a big picture on that column, and allows to filter to some subset of rows for which their cells in that column satisfy some constraint.

                                     



The Clustering feature can be accessed in 2 different ways. If you have already created a default text facet on a column, the text facet will show a "Cluster" button near its top right corner. If you haven't, you can invoke the column's drop-down menu and pick Edit cells > Cluster and edit..

                                        



Google Refine supports "expressions" mostly to transform existing data or to create new data based on existing data













One can use Google Refine to perform reconciliation of names in your data against any database that exposes a web service following this Reconciliation Service API specification. One such database is Freebase.



                                           




Friday, 15 March 2013

IT LAB ; Session #8

Perform Panel Data Analysis of "Produc" data

Solution:
Three types of models are:
      Pooled affect model
      Fixed affect model
      Random affect model 

We will be determining the best model by using functions:
       pFtest : for determining between fixed and pooled
       plmtest : for determining between pooled and random
       phtest: for determining between random and fixed

The data can be loaded using the following command
data(Produc , package ="plm")
head(Produc)





Pooled Affect Model 

pool <-plm( log(pcap) ~log(hwy)+ log(water)+ log(util) + log(pc) + log(gsp) + log(emp) + log(unemp), data=Produc,model=("pooling"),index =c("state","year"))
summary(pool)








Fixed Affect Model:

fixed<-plm( log(pcap) ~log(hwy)+ log(water)+ log(util) + log(pc) + log(gsp) + log(emp) + log(unemp), data=Produc,model=("within"),index =c("state","year"))
summary(fixed)






Random Affect Model:

random <-plm( log(pcap) ~log(hwy)+ log(water)+ log(util) + log(pc) + log(gsp) + log(emp) + log(unemp), data=Produc,model=("random"),index =c("state","year"))
> summary(random)




Testing of Model

This can be done through Hypothesis testing between the models as follows:

H0: Null Hypothesis: the individual index and time based params are all zero
H1: Alternate Hypothesis: atleast one of the index and time based params is non zero

Pooled vs Fixed

Null Hypothesis: Pooled Affect Model
Alternate Hypothesis : Fixed Affect Model

Command:

> pFtest(fixed,pool)


Result:
data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
F = 56.6361, df1 = 47, df2 = 761, p-value < 2.2e-16
alternative hypothesis: significant effects
Since the p value is negligible so we reject the Null Hypothesis and hence Alternate hypothesis is accepted which is to accept Fixed Affect Model.

Pooled vs Random

Null Hypothesis: Pooled Affect Model
Alternate Hypothesis: Random Affect Model

Command :
> plmtest(pool)

Result:

  Lagrange Multiplier Test - (Honda)
data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
normal = 57.1686, p-value < 2.2e-16
alternative hypothesis: significant effects

Since the p value is negligible so we reject the Null Hypothesis and hence Alternate hypothesis is accepted which is to accept Random Affect Model.

Random vs Fixed

Null Hypothesis: No Correlation . Random Affect Model
Alternate Hypothesis: Fixed Affect Model

Command:
 > phtest(fixed,random)

Result:

 Hausman Test
data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
chisq = 93.546, df = 7, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent

Since the p value is negligible so we reject the Null Hypothesis and hence Alternate hypothesis is accepted which is to accept Fixed Affect Model.

Conclusion: 

So after making all the tests we come to the conclusion that Fixed Affect Model is best suited to do the panel data analysis for "Produc" data set.

Hence , we conclude that within the same id i.e. within same "state" there is no variation.