My first introduction to using R is described here.
I attended a workshop hosted by the UVa Library statistical consulting group. It was useful for me because they provided detailed descriptions and explanations of each piece of code; my first experience with R did not include any step-by-step explanation of how to use it. This workshop annotated each command in the SCRIPTS window with “#”, and by pressing Ctrl+Enter the results of running the code were shown in the CONSOLE, so I could easily follow what I was doing in R.
I have summarized my notes below:
- Ctrl+Enter: sends code from the script to the console
- write commands in the R script (upper left) and press Ctrl+Enter to see the results in the Console
- # the rm function removes objects from memory
- rm(x, y, z)
- or click the broom icon in the upper-right pane where x, y, z are shown
- Setting the working directory
- [menu] Session –> Set Working Directory
- the Tab key will list folders to fill in between “ ”
- names: names of the columns
- nrow: number of rows
- ncol: number of columns
- $ selects a single column
- can calculate frequencies or numeric summaries
- to see the numbers, put the object name inside the ( )
- subsetting: use [ ]
- c( ): select columns by name # for example, c(company, sales)
- save the data as an .Rda file by selecting the sections you want and saving them under a new name
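A minimal sketch of these commands on a toy data frame (the `album` data and its columns are hypothetical stand-ins for the workshop's dataset):

```r
# Hypothetical toy data frame for illustration only
album <- data.frame(
  company = c("A", "B", "C", "D"),
  sales   = c(120, 95, 143, 88)
)

names(album)   # column names: "company" "sales"
nrow(album)    # number of rows: 4
ncol(album)    # number of columns: 2
album$sales    # $ pulls out one column as a vector

big <- album[album$sales > 100, ]    # [ ] subsets rows meeting a condition
album[, c("company", "sales")]       # c() selects columns by name
# save(album, file = "album.Rda")    # write the object to an .Rda file
rm(album)                            # remove the object from memory
```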
- F-ratio: the most important part of the table is the F-ratio and its associated significance value. For these data, F is 99.59, which is significant at p < .001. This result tells us that there is less than a 0.1% chance that an F-ratio this large would occur if the null hypothesis were true. Therefore, we can conclude that our regression model predicts record sales significantly better than simply using the mean value of record sales. In short, the regression model overall predicts record sales significantly well.
- Regression: makes a prediction about the future when the outcome is a continuous variable
- Logistic regression: used when the outcome is categorical (e.g., fireman, doctor, or pimp)
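As a sketch, `lm()` reports the F-ratio described above, and `glm()` with a binomial family fits a logistic regression. The simulated data below is a hypothetical stand-in for the book's record-sales example:

```r
set.seed(1)

# Simulated stand-in for the record-sales data: a continuous outcome
adverts <- runif(200, 0, 1000)
sales   <- 50 + 0.1 * adverts + rnorm(200, sd = 20)

model <- lm(sales ~ adverts)
summary(model)   # the F-statistic and its p-value appear at the bottom
fstat <- summary(model)$fstatistic[["value"]]

# Logistic regression: the outcome is categorical (here, a binary indicator)
hit <- as.numeric(sales > median(sales))
logit_model <- glm(hit ~ adverts, family = binomial)
coef(logit_model)
```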
A small standard error tells us that most pairs of samples from a population will have very similar means (i.e. the difference between sample means should normally be very small). A large standard error tells us that sample means can deviate quite a lot from the population mean and so differences between pairs of samples can be quite large by chance alone.
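This can be checked with a quick simulation (the population mean, sd, and sample size below are arbitrary choices for illustration):

```r
set.seed(42)

# Draw many samples of size 30 from a population with mean 100, sd 15;
# the spread (sd) of the resulting sample means is the standard error
sample_means <- replicate(5000, mean(rnorm(30, mean = 100, sd = 15)))

sd(sample_means)   # empirical standard error
15 / sqrt(30)      # theoretical standard error: sigma / sqrt(n), about 2.74
```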
I used MANOVA (Multivariate Analysis of Variance) for a quantitative data analysis of my dissertation.
So, when to use MANOVA?
- When you have several dependent variables (DV)
- When there is only one independent variable, or when there are several
What are benefits of using MANOVA?
- we can look at interactions between independent variables
- we can even do contrasts to see which groups differ from each other
- MANOVA can tell us about the relationships among the DVs (outcome variables).
Compared to ANOVA, what are the advantages of MANOVA? Why use MANOVA instead of multiple ANOVAs?
- The more tests we conduct on the same data, the more we inflate the familywise error rate; the more dependent variables we measure, the more ANOVAs we would need to run and the greater the chance of making a Type I error.
- MANOVA has greater power to detect an effect: it can detect whether groups differ along a combination of variables (dimensions), whereas ANOVA can detect only whether groups differ along a single variable.
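A minimal sketch using R's built-in `manova()` on the bundled `iris` data (my dissertation data are not shown here; `iris` just illustrates the mechanics of several DVs and one IV):

```r
# Two DVs (sepal and petal length) and one IV (species) in a single model
fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)

summary(fit)       # multivariate test (Pillai's trace by default)
summary.aov(fit)   # follow-up univariate ANOVAs, one per DV
```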
I attended a two-day Educational Data Mining (EDM) workshop by Dr. April Galyardt, offered by the College of Education, University of Georgia, from June 9 to 10, 2014. I had several questions before the workshop began, and my goal in taking it was to get clear answers to them.
We worked from her handout.
1. What is EDM?
- EDM is an emerging discipline concerned with developing methods for exploring the unique types of data that come from educational settings, and using those methods to better understand students and the settings in which they learn.
- It is similar to learning analytics and knowledge (LAK). Learning analytics is the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs.
- EDM vs. LAK
- LAK and EDM share the goal of improving education by improving assessment, how problems in education are understood, and how interventions are planned and selected. EDM is more focused on generalizability, while LAK researchers have placed greater focus on addressing the needs of multiple stakeholders with information drawn from data. (see p. 3 of the handout for details)
2. When may EDM be useful for my research based on my program of inquiry?
- when I need detailed formative assessment
- Compared to Regression, I may use EDM when I need more interpretability for my data.
- useful for design based research (DBR)
- For my research: when my regression results do not meet my needs, when it looks like something more complicated is going on among my variables, or when I want to do more interpretation of my data.
3. What types of data can I use for EDM design?
- the data do not necessarily have to be big data; I can use almost anything I want.
4. What tools may I use for EDM?
- R or Rapidminer (but R is more common for EDM)
So far, EDM starts from the concept of regression, but the way of finding the best-fitting model is different. For example, standard regression typically keeps only significant variables. In EDM, with the lasso, significant variables are still critical, but non-significant variables can also be used as predictors, because the lasso shows when adding a given variable produces a better model. Even if a variable is not significant, we can still interpret how it affects the model based on the graph.
- EDM can do both inference and prediction.
- Inference: explanatory; testing causal theory, similar to regression; finding causality (e.g., what predicts graduation?)
- Prediction: predicting new/future observations; data mining (e.g., who will graduate?)
- Before choosing an EDM method, I need to make sure whether I mainly need inference or prediction.
- For example, if the question is “What are the most important variables for predicting Y?”, then the lasso is a great tool.
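In practice the lasso is usually fit with the `glmnet` package (e.g., `glmnet::cv.glmnet` to choose the penalty). As a package-free sketch, the core idea is soft-thresholding: coefficients smaller than the penalty are set exactly to zero, which is how non-significant predictors drop in and out as the penalty changes (the coefficient values below are made-up numbers):

```r
# Soft-thresholding, the operation at the heart of the lasso penalty:
# shrink every coefficient toward zero; zero out those smaller than lambda
soft_threshold <- function(beta, lambda) {
  sign(beta) * pmax(abs(beta) - lambda, 0)
}

ols_coefs <- c(strong = 2.5, weak = 0.3, tiny = -0.1)
soft_threshold(ols_coefs, lambda = 0.5)
# the strong coefficient survives (shrunk to 2.0); weak and tiny become exactly 0
```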
A taxonomy of methods, from April Galyardt's handout:
- Regression: linear regression, non-parametric regression, lasso, ridge regression
- Latent variable models / dimension reduction: principal components, independent components, factor analysis, IRT
- Classification: logistic regression, linear discriminant analysis, k-nearest neighbors, decision trees, support vector machines
- Clustering: k-means, mixture models (Gaussian mixtures are the most commonly used), hierarchical clustering, spectral clustering, diagnostic classification models
R works through commands
- A command in R has two parts, objects and functions, separated by ‘ <- ‘, which means ‘is created from’.
- Objects: can be a single variable, a collection of variables, or a statistical model.
- Functions: the things I do in R to create my objects.
- For example, object <- function means the object is created from the function.
- the concatenate function c(): groups values together
- Importing data: the most commonly used R-friendly formats are tab-delimited text (.txt from Excel and .dat from SPSS)
- I can put more than one command on a single line if I prefer.
- R is case sensitive: capital and lowercase letters are treated as different names
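Putting those pieces together in a short, self-contained sketch (names like `scores` are arbitrary):

```r
# object <- function: the object on the left is created from the function call
scores <- c(4, 7, 9)   # c() concatenates values into a single vector object
avg <- mean(scores)    # a function applied to one object creates another
avg

x <- 10; X <- 20       # more than one command on a line; R is case sensitive
x + X                  # x and X are different objects

# Importing a tab-delimited file (the file name here is hypothetical):
# dat <- read.delim("mydata.txt")
```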
Here are some useful links for R
Support vector machines (April recommended during the WS)
Support vector machine in R (useful article-April recommended)
pslcdatashop.web.cmu.edu (can share data-April recommended)
How to read scatterplot matrix in R?
MOOC (Coursera): R programming (4 weeks): Jul/7-Aug/4, 2014
Field, A., Miles, J., & Field, Z. (2012). Discovering Statistics Using R. Sage.
Independent-means t-test:
used when there are two experimental conditions and different participants were assigned to each condition; also referred to as the independent-measures or independent-samples t-test
Dependent-means t-test:
used when there are two experimental conditions and the same participants took part in both conditions of the experiment; also referred to as the matched-pairs or paired-samples t-test
calculated with the standard error*: a small standard error means most samples have very similar means
Assumptions of the t-test
- The sampling distribution is normally distributed. In the dependent t-test this means that the sampling distribution of the differences between scores should be normal, not the scores themselves.
- The independent t-test, because it is used to test different groups of people, also assumes homogeneity of variance (the variances in these populations are roughly equal) and independence of scores (because they come from different people).
* Standard error: the standard deviation of sample means. It is a measure of how representative a sample is likely to be of the population. As samples get large (greater than 30), the sampling distribution approaches a normal distribution with a mean equal to the population mean (the central limit theorem).
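Both tests are a single call to `t.test()` in R; the data below are simulated just to show the mechanics:

```r
set.seed(7)

# Independent-means t-test: different participants in each condition
group_a <- rnorm(30, mean = 10, sd = 2)
group_b <- rnorm(30, mean = 13, sd = 2)
ind <- t.test(group_a, group_b)   # Welch's test by default

# Dependent-means t-test: the same participants in both conditions
before <- rnorm(30, mean = 10, sd = 2)
after  <- before + rnorm(30, mean = 1, sd = 1)
dep <- t.test(before, after, paired = TRUE)

ind$p.value
dep$p.value
```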