Sunday, 24 March 2013

IT LAB : Session #9

What is Google Refine?

Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

Its main features are described below:

Data Loaded in Google Refine:

[Screenshot: data loaded into Google Refine]






One of the most exciting features of Google Refine is faceting.

A facet is created on a particular column. The facet summarizes the cells in that column to give a big-picture view of that column, and allows you to filter down to the subset of rows whose cells in that column satisfy some constraint.

[Screenshot: a text facet on a column]



The clustering feature can be accessed in two different ways. If you have already created a default text facet on a column, the text facet will show a "Cluster" button near its top right corner. If you haven't, you can invoke the column's drop-down menu and pick Edit cells > Cluster and edit...

[Screenshot: the Cluster and edit dialog]



Google Refine supports "expressions", written in the Google Refine Expression Language (GREL), mostly to transform existing data or to create new data based on existing data. For example, the expression value.trim() removes leading and trailing whitespace from every cell in a column.













You can use Google Refine to reconcile names in your data against any database that exposes a web service following the Reconciliation Service API specification. One such database is Freebase.



[Screenshot: reconciling a column against Freebase]




Friday, 15 March 2013

IT LAB : Session #8

Perform Panel Data Analysis of "Produc" data

Solution:
The three types of panel data models are:
      Pooled effects model
      Fixed effects model
      Random effects model

We will determine the best model using the following functions:
       pFtest : to choose between the fixed and pooled models
       plmtest : to choose between the pooled and random models
       phtest : to choose between the random and fixed models

The data can be loaded using the following commands:
library(plm)
data(Produc, package = "plm")
head(Produc)





Pooled Effects Model

pool <- plm(log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc, model = "pooling", index = c("state", "year"))
summary(pool)








Fixed Effects Model:

fixed <- plm(log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc, model = "within", index = c("state", "year"))
summary(fixed)






Random Effects Model:

random <- plm(log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc, model = "random", index = c("state", "year"))
summary(random)




Testing the Models

The best model is chosen through hypothesis testing between pairs of models, as follows:

H0: Null hypothesis: the individual (state) and time effects are all zero
H1: Alternate hypothesis: at least one of the individual or time effects is non-zero

Pooled vs Fixed

Null Hypothesis: Pooled effects model
Alternate Hypothesis: Fixed effects model

Command:

> pFtest(fixed,pool)


Result:
data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
F = 56.6361, df1 = 47, df2 = 761, p-value < 2.2e-16
alternative hypothesis: significant effects
Since the p-value is negligible, we reject the null hypothesis and accept the alternate hypothesis, i.e. the fixed effects model.

Pooled vs Random

Null Hypothesis: Pooled effects model
Alternate Hypothesis: Random effects model

Command :
> plmtest(pool)

Result:

  Lagrange Multiplier Test - (Honda)
data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
normal = 57.1686, p-value < 2.2e-16
alternative hypothesis: significant effects

Since the p-value is negligible, we reject the null hypothesis and accept the alternate hypothesis, i.e. the random effects model.

Random vs Fixed

Null Hypothesis: No correlation between the regressors and the individual effects (the random effects model is consistent)
Alternate Hypothesis: Fixed effects model

Command:
 > phtest(fixed,random)

Result:

 Hausman Test
data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)
chisq = 93.546, df = 7, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent

Since the p-value is negligible, we reject the null hypothesis and accept the alternate hypothesis, i.e. the fixed effects model.

Conclusion:

After performing all the tests, we conclude that the fixed effects model is best suited for panel data analysis of the "Produc" data set.

Hence, each id (i.e. each "state") has its own individual effect that remains constant over time.

Thursday, 14 February 2013

IT LAB : Session #6

Assignment 1:

Create log of returns for NIFTY data from 01 Jan 2012 to 31 Jan 2013 and calculate the historical volatility

clprice <- read.csv(file.choose(), header = TRUE)
head(clprice)
closingprice <- clprice[, 5]                          # 5th column holds the closing price
closingprice.ts <- ts(closingprice, frequency = 252)  # 252 trading days per year
st <- log(closingprice.ts)
stlag <- log(lag(closingprice.ts, k = -1))            # previous day's log price
log.returns <- st - stlag                             # log return = log(P_t) - log(P_t-1)
plot(log.returns)
ann.factor <- sqrt(252)                               # annualisation factor
historicalvolatility <- sd(log.returns) * ann.factor
historicalvolatility



Assignment 2:

Create ACF plot for the above log of returns data and perform the adf test and comment on it

The ACF plot can be created using the following command:

acf(log.returns)

The plot shows that the autocorrelations lie within the 95% confidence band, which suggests the series is most likely stationary.

The ADF test (from the tseries package) is performed using the following commands:

library(tseries)
adf.test(log.returns)


Wednesday, 23 January 2013

IT Lab : Jan 22nd

Assignment_1 : Using the mileage-groove data, fit a linear model (lm) and comment on its applicability.
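A minimal sketch of the approach (the column names and values below are invented for illustration; the real exercise uses the mileage-groove data file):

```r
# Illustrative made-up data: tread groove depth measured at increasing mileage.
tyres <- data.frame(
  mileage = c(0, 4, 8, 12, 16, 20, 24, 28, 32),
  groove  = c(395, 330, 291, 255, 229, 205, 179, 164, 150)
)
fit <- lm(groove ~ mileage, data = tyres)   # fit the linear model
summary(fit)                                # slope, intercept, R-squared, p-values
plot(tyres$mileage, tyres$groove, xlab = "Mileage", ylab = "Groove depth")
abline(fit)                                 # overlay the fitted regression line
```

A high R-squared and a significant slope support the model's applicability; a curved pattern in the residuals would argue against a purely linear fit.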





Assignment_2 :

Residual value :




qqplot :

 qqline :
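The steps above can be sketched as follows (the data is made up; any fitted lm object works, and note that the normal Q-Q plot in R is drawn with qqnorm):

```r
# Illustrative made-up data and a simple fitted model.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9)
fit <- lm(y ~ x)
res <- residuals(fit)  # residual values
qqnorm(res)            # normal Q-Q plot of the residuals
qqline(res)            # reference line through the quartiles
```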




Assignment_3 : Justify Null hypothesis using ANOVA.


Here, the p-value is greater than 5%, so we cannot reject the null hypothesis.
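A minimal sketch of such an ANOVA in R (the group names and values are invented for illustration):

```r
# One-way ANOVA: do the three groups have the same mean?
scores <- c(5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9)
group  <- factor(rep(c("A", "B", "C"), each = 3))
result <- aov(scores ~ group)
summary(result)  # the Pr(>F) column is the p-value
```

If the p-value exceeds 0.05, we cannot reject the null hypothesis that all group means are equal.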



Wednesday, 16 January 2013

IT Lab ( 15th Jan )


Assignment_1 : cbind.
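A minimal sketch of cbind, which binds vectors together as the columns of a matrix:

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)
m <- cbind(x, y)   # a 3 x 2 matrix with columns named x and y
m
dim(m)             # 3 rows, 2 columns
```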



Assignment_2 : Multiplication of matrices
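A minimal sketch of matrix multiplication using the %*% operator (note that * alone is element-wise, not matrix, multiplication):

```r
a <- matrix(1:4, nrow = 2)   # 2 x 2 matrix, filled column-wise
i <- diag(2)                 # 2 x 2 identity matrix
a %*% i                      # matrix product: multiplying by I returns a
a * a                        # element-wise product, for comparison
```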




Assignment_3 : Regression
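A minimal sketch using R's built-in cars data set (speed vs stopping distance):

```r
fit <- lm(dist ~ speed, data = cars)  # simple linear regression
coef(fit)                             # intercept and slope
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")
abline(fit)                           # fitted line over the scatter
```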


Assignment_4 : Normal curve


Normal curve plot.
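A minimal sketch of plotting the standard normal density curve:

```r
x <- seq(-4, 4, length.out = 200)  # grid of points on the x-axis
plot(x, dnorm(x), type = "l",      # dnorm gives the N(0, 1) density
     main = "Standard Normal Curve", xlab = "x", ylab = "Density")
```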




Tuesday, 8 January 2013

IT Lab ( 8th Jan )


Points and Lines with title
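A minimal sketch of a plot drawing both points and lines, with a title (the data is made up):

```r
x <- 1:10
y <- x^2
plot(x, y, type = "b", main = "Points and Lines")  # type = "b" draws both
```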


Scatter graph with Labeling
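A minimal sketch of a scatter plot with axis labels and per-point labels (values invented for illustration):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
plot(x, y, xlab = "X values", ylab = "Y values", main = "Scatter Plot")
text(x, y, labels = paste0("P", 1:5), pos = 3)  # label each point above itself
```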


Histogram
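A minimal sketch of a histogram of randomly generated normal data:

```r
set.seed(1)            # for reproducibility
values <- rnorm(100)   # 100 draws from N(0, 1)
hist(values, main = "Histogram", xlab = "Value")
```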



Merge
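A minimal sketch of merge(), which joins two data frames on a common column:

```r
df1 <- data.frame(id = c(1, 2, 3), name  = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), score = c(10, 20, 30))
merge(df1, df2, by = "id")              # inner join: only ids 2 and 3
merge(df1, df2, by = "id", all = TRUE)  # full outer join: ids 1 to 4, with NAs
```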