
Longitudinal Data Analysis

Dates: 
February 20, 2012 - February 24, 2012
Instructor: 
Dr Gary Marks

The basic purpose of the course is to provide students with the ability to use SAS to analyse data from complex (including longitudinal) surveys, using a range of statistical techniques appropriate to the research question and the nature of the response variable.
 
There will be an emphasis on proposing plausible research questions and analysing data to investigate these questions. The course will comprise presentations and practical computer-based lab sessions. Students will analyse two major Australian longitudinal studies: the Longitudinal Surveys of Australian Youth (LSAY), including test data from the OECD’s 2003 Programme for International Student Assessment (PISA), and the Household, Income and Labour Dynamics in Australia (HILDA) survey.
 
SAS is the only software package used in this course. The course does not involve high-level mathematics.
 

Prerequisites: 

It is expected that students have a strong grounding in introductory statistics (for the analysis of survey data) up to and including ordinary least squares (OLS) regression. Familiarity with logistic regression would be helpful. It will be assumed that students are familiar with SAS or sufficiently adept at using other statistical packages that writing SAS syntax for data manipulation and statistical analysis will not cause too many problems. Note that SAS command files will be supplied.

It is not necessary for students to be familiar with the Longitudinal Surveys of Australian Youth (LSAY) or the Household, Income and Labour Dynamics in Australia (HILDA) survey, but it would help to know something about them or about similar youth or household panel studies.

This course is designed for social science researchers who wish to address research questions using appropriate (including advanced) statistical procedures on cross-sectional or longitudinal survey data.
 

Learning Objectives: 

The course will provide students with a range of advanced skills for the analysis of cross-sectional data, extending to longitudinal and other types of clustered data. The statistical techniques taught include logistic, multinomial and ordinal regression, fixed and random effects models, repeated (random) effect models and event history analysis. Advanced statistical techniques such as logistic, multinomial and ordinal regression are variants of OLS regression. The nature of the dependent variable (dichotomous, polychotomous or ordinal), and thus of the error term, requires transforming the dependent variable into a linear form akin to OLS regression. Clustered data is also an issue for the estimation of standard errors, since sampling theory assumes that the sampled units are independent of each other. Students within the same school are not independent, and an individual’s annual earnings are highly correlated from one year to the next.
 
Longitudinal and Clustered Data
The essence of longitudinal data is that the same people (or sample units) are surveyed on a regular basis over time. By surveying the same people it is possible to analyse trajectories and pathways, which can only be inferred from cross-sectional studies. Longitudinal data is just one type of clustered data, in which there is more than one observation per sample unit; other examples are students in schools and patients in hospitals.
 
The main advantages of longitudinal data are:

  1. Analysis of unit-level change. Enables analysis of individual level changes such as entering/exiting unemployment, increasing earnings, getting married, etc.
  2. Analyses of group level changes. For example, longitudinal data analyses can show if the labour market situation of early school leavers improves or deteriorates with time elapsed since leaving school. (What is your guess?)
  3. A handle on dynamics. For example, cross-sectional surveys estimate that 12% of adults live in poverty (defined a particular way). The impression is that 12% are in permanent poverty but longitudinal data analysis shows considerable movement in and out of poverty and provides the ability to model what influences movement in and out of poverty. The same argument can be made for unemployment, income, wealth or any other potentially variable social outcome.
  4. Enables time sequencing. In cross-sectional surveys all information is collected at the same time. Longitudinal data enables the researcher to specify exactly the sequence in which events took place, e.g. was it job-marriage-children or marriage-job-children? In cross-sectional data the timing of events can be ascertained by recall (‘When did you get married/have your first child?’), but recall error increases with time elapsed since the event.
  5. Allows sorting out causal pathways. For example, there is a strong correlation between students’ satisfaction with school and their performance. Is this because more satisfied students perform better or because higher performing students tend to be more satisfied?
  6. Aggregation of multiple observations. For example, household income over a five-year period may be a better indicator of a household’s financial situation than income at one point in time.
  7. Allows for sophisticated analyses, such as fixed effects models which control for all (stable) exogenous influences (see below).
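
The point about dynamics (item 3) can be illustrated with a small sketch. The panel below is entirely hypothetical (eight people, four waves, 1 = in poverty) and is not drawn from LSAY or HILDA; the function names are illustrative only. It shows how a constant cross-sectional poverty rate can coexist with considerable movement in and out of poverty.

```python
# Hypothetical panel: person id -> poverty status (1 = poor) at each of four waves.
panel = {
    "p1": [1, 1, 1, 1],
    "p2": [1, 0, 0, 0],
    "p3": [0, 1, 0, 0],
    "p4": [0, 0, 1, 0],
    "p5": [0, 0, 0, 1],
    "p6": [0, 0, 0, 0],
    "p7": [0, 0, 0, 0],
    "p8": [0, 0, 0, 0],
}


def cross_sectional_rates(data):
    """Share of people in poverty at each wave (what repeated cross-sections show)."""
    n_waves = len(next(iter(data.values())))
    return [sum(seq[w] for seq in data.values()) / len(data) for w in range(n_waves)]


def ever_poor_share(data):
    """Share of people poor in at least one wave (needs longitudinal data)."""
    return sum(1 for seq in data.values() if any(seq)) / len(data)


def always_poor_share(data):
    """Share of people poor in every wave (needs longitudinal data)."""
    return sum(1 for seq in data.values() if all(seq)) / len(data)


def transition_counts(data):
    """Count wave-to-wave transitions keyed by (previous status, current status)."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for seq in data.values():
        for prev, cur in zip(seq, seq[1:]):
            counts[(prev, cur)] += 1
    return counts


rates = cross_sectional_rates(panel)
ever = ever_poor_share(panel)
always = always_poor_share(panel)
transitions = transition_counts(panel)
```

In this made-up panel the poverty rate is the same at every wave, yet only one person is poor throughout, and the transition counts record as many entries into poverty as exits from it.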

Fixed and Random Effect Models
Fixed effect models are useful if the analyst wishes to control for unobserved time-invariant influences on the response variable. Unobserved (unmeasured) characteristics that may be relevant to, say, employment outcomes, but are not collected in surveys, include cognitive ability, physical appearance, personality, interests, motivation and financial literacy. Because such stable characteristics are controlled for, fixed effects models allow for the estimation of (largely) unbiased effects.
 
The ability to use fixed effects models is an important advantage of longitudinal data where there are several observations per subject. Examples are earnings among individuals over several years, test scores of students over their school careers, and psychological wellbeing among individuals at multiple time points.
 
Random effects models are useful for the analysis of hierarchical data in which first level units/observations are clustered within higher level units, e.g. students within schools, residents within communities, employees within companies and patients within hospitals. The assumption is that the higher level grouping has some influence on lower level units.
 
Random effect models have several names: hierarchical linear models, multilevel models and variance component models, the last because they separate between-group and within-group variance and assess how much of the variation at each level is explained by the predictor variables. The term “mixed” models is used because they include both random and fixed effects.
 
Statistically, random effect models assume that the effects of higher-level units (clusters) are independent of the effects of lower-level units, whereas fixed effects models allow them to be correlated. In fixed effect models, it is not possible to obtain estimates for stable characteristics (sex, ethnicity), since estimation relies on changes in status (from being single to married, from being employed to unemployed, etc.).
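
Why stable characteristics cannot be estimated in a fixed effects model can be seen from the within-person (demeaning) transformation, sketched below. This is a minimal illustration with one time-varying predictor, not the course's SAS code; the person-year records, variable names and helper functions are all hypothetical.

```python
from statistics import mean

# Hypothetical person-year records: (person, married, female, log_earnings).
# Earnings were constructed as a person-specific level plus 0.3 when married.
rows = [
    ("a", 0, 1, 10.0), ("a", 1, 1, 10.3), ("a", 1, 1, 10.3),
    ("b", 0, 0, 11.0), ("b", 0, 0, 11.0), ("b", 1, 0, 11.3),
    ("c", 0, 1, 9.0), ("c", 0, 1, 9.0), ("c", 0, 1, 9.0),
]


def demean_within(ids, values):
    """Subtract each person's own mean from that person's observations."""
    groups = {}
    for pid, value in zip(ids, values):
        groups.setdefault(pid, []).append(value)
    person_means = {pid: mean(vals) for pid, vals in groups.items()}
    return [value - person_means[pid] for pid, value in zip(ids, values)]


def slope_through_origin(x, y):
    """OLS slope for a single predictor when both variables are already demeaned."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


ids = [r[0] for r in rows]
married_w = demean_within(ids, [r[1] for r in rows])
female_w = demean_within(ids, [r[2] for r in rows])
earnings_w = demean_within(ids, [r[3] for r in rows])

# The within estimator uses only changes over time within each person.
married_effect = slope_through_origin(married_w, earnings_w)
```

After demeaning, `female_w` is zero for every record, so there is no variation left from which to estimate its effect, while the estimate for `married` comes only from persons "a" and "b", whose status changed. Person "c", who never marries, contributes nothing to it.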
 
Logic of Course
The logic of the course is as follows. Using cross-sectional data the general linear model (OLS regression) is revised for normally distributed dependent variables (e.g. test scores) and skewed continuous dependent variables (earnings and wellbeing) which are transformed to make them near-normal. The generalized linear model is introduced for dependent variables that are not continuous or normally distributed. The essence of the generalized linear model is that the linear model is related to the response variable via a link function (e.g. the logit in logistic regression). Examples are dichotomous dependent variables (e.g. poverty, unemployment), polychotomous (labour market status, marital status), ordinal (happiness, ideology) and counts (number of bouts of unemployment).
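
As a minimal sketch of the link-function idea, the logit maps a probability onto the whole real line, where a linear model can be fitted, and its inverse maps the linear predictor back to a probability. The intercept and slope below are hypothetical values chosen for illustration, not estimates from any data set.

```python
import math


def logit(p):
    """Link function: log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inv_logit(eta):
    """Inverse link: turn a linear predictor into a probability."""
    return 1 / (1 + math.exp(-eta))


# Hypothetical coefficients for P(unemployed) as a function of years of education.
intercept = 1.0
slope_per_year = -0.25


def predicted_probability(years_of_education):
    """Linear on the logit scale, bounded between 0 and 1 on the probability scale."""
    return inv_logit(intercept + slope_per_year * years_of_education)


probabilities = [predicted_probability(years) for years in (8, 12, 16)]
```

However large or small the linear predictor becomes, the predicted value stays strictly between 0 and 1, which an OLS regression fitted directly to a 0/1 outcome does not guarantee.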
 
The sequence of normal continuous, dichotomous, polychotomous and ordinal dependent variables is repeated for clustered data. Multilevel or hierarchical linear models are an extension of the general linear model with lower-level units clustered within higher level units, e.g. students within schools, annual earnings among individuals. They (among other things) estimate the between-cluster variance as a proportion of the total variance. This is straightforward for continuous dependent variables but can be extended to other types of dependent variables. Random effect models using longitudinal data can also be extended to take into account the correlations of the dependent variable across waves.
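
The between-cluster share of the total variance (often called the intraclass correlation) can be sketched for a continuous outcome as follows. This uses a simple one-way ANOVA (method of moments) estimator for equal-sized clusters, which is an assumption made for illustration; a multilevel model fitted by maximum likelihood would normally be used in practice. The school test scores are hypothetical.

```python
from statistics import mean


def between_cluster_share(clusters):
    """ANOVA estimate of between-cluster variance / total variance.

    Assumes every cluster has the same number of observations.
    """
    k = len(clusters)
    n = len(clusters[0])
    if any(len(c) != n for c in clusters):
        raise ValueError("this sketch only handles equal-sized clusters")
    grand_mean = mean([v for c in clusters for v in c])
    # Mean squares between and within clusters.
    ms_between = n * sum((mean(c) - grand_mean) ** 2 for c in clusters) / (k - 1)
    ms_within = sum((v - mean(c)) ** 2 for c in clusters for v in c) / (k * (n - 1))
    # Method-of-moments variance component, truncated at zero.
    var_between = max(0.0, (ms_between - ms_within) / n)
    return var_between / (var_between + ms_within)


# Hypothetical test scores for three students in each of three schools.
schools = [[50, 52, 54], [60, 62, 64], [70, 72, 74]]
share = between_cluster_share(schools)
```

Here the schools differ far more from each other than students do within a school, so nearly all of the variance lies between clusters.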
 
The final topic is event history models, a set of statistical procedures for analysing whether and when a discrete event occurs. These are often undesirable events (death, poverty, accidents, unemployment) but can include leaving home, partnering and marriage, gaining a full-time job, exiting unemployment, etc. Longitudinal data is well suited to event history analysis, although cross-sectional studies can collect information (by recall) on the times and dates events occurred.
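
As a descriptive starting point for event history data, the sketch below computes a Kaplan-Meier estimate of the probability of still being unemployed after a given number of months. This is a simpler, non-parametric technique than the accelerated failure time and proportional hazards models covered in the course, and the durations are hypothetical. An observation flagged 0 is censored: the person was still unemployed when last observed.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each time an event occurred.

    durations: time until the event or until censoring
    observed:  1 if the event (e.g. exiting unemployment) was seen, 0 if censored
    """
    event_times = sorted({t for t, d in zip(durations, observed) if d})
    survival = 1.0
    curve = []
    for t in event_times:
        # Everyone whose spell lasted at least t is still at risk at t.
        at_risk = sum(1 for u in durations if u >= t)
        events = sum(1 for u, d in zip(durations, observed) if u == t and d)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical unemployment spells in months; two of the five are censored.
months = [2, 3, 3, 5, 8]
exited = [1, 1, 0, 1, 0]
curve = kaplan_meier(months, exited)
```

Censored spells stay in the risk set until the month they were last observed, so they contribute information without being treated as exits.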
 
Statistical Topics covered

  1. Basic statistical concepts (revision)
    1. Normal Distribution
    2. Mean, Standard deviation and variance
    3. Positively and Negatively Skewed distributions
    4. Sampling, Populations and Statistical Inference
    5. Covariance and Correlations
  2. Bivariate Ordinary Least Squares (OLS) regression (revision)
  3. Multivariate OLS regression (for normally distributed interval outcomes)
  4. Logistic regression (for dichotomous outcomes)
  5. Multinomial regression (for nominal outcomes)
  6. Ordinal regression (for ordinal outcomes)
  7. Counts and Rare Events
  8. Person-year data
  9. Fixed effects models for continuous outcomes
  10. Fixed effects models for dichotomous outcomes
  11. Random effects (repeated measures) models for continuous outcomes
  12. Random effects (repeated measures) models for dichotomous outcomes
  13. Random effects (repeated measures) models for nominal (polychotomous) outcomes
  14. Event history analysis.
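
Person-year data (topic 8) has one record per person per year at risk, with an indicator that switches to 1 in the year the event occurs. The sketch below shows one way such a file might be built from spell-level records; the function name, field names and example years are hypothetical.

```python
def to_person_years(person_id, start_year, end_year, event_occurred):
    """Expand one spell into yearly records.

    The event indicator is 1 only in the final year, and only if the event
    was observed; a censored spell has 0 in every year.
    """
    records = []
    for year in range(start_year, end_year + 1):
        is_final_year = year == end_year
        records.append({
            "id": person_id,
            "year": year,
            "event": int(is_final_year and event_occurred),
        })
    return records


# Hypothetical spells: "a" leaves home in 2006; "b" is still at home when last observed in 2005.
person_years = (
    to_person_years("a", 2004, 2006, True)
    + to_person_years("b", 2004, 2005, False)
)
```

A file in this shape can be analysed with the models for dichotomous outcomes listed above, since each row is simply a yes/no outcome for one person in one year.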

Substantive areas analysed

  1. PISA test scores
  2. University entrance performance
  3. Entrance to university
  4. Post-School Study
  5. Earnings
  6. Life Satisfaction
  7. Poverty
  8. Financial Stress
  9. Unemployment
  10. Transition to adulthood (leaving home, marriage)

Students will be given some time to do their own analyses using the techniques learnt on these outcomes (e.g. earnings, poverty) or similar outcomes (e.g. income, financial stress) or other outcomes available in the data.
 
SAS Statistical Procedures used

  • Proc Corr – Correlations
  • Proc Reg – General linear models, OLS regression
  • Proc Logistic – Generalized linear model (logistic, multinomial & ordinal regression)
  • Proc Glm – General Linear model, fixed effects models
  • Proc Mixed – Random effects models, hierarchical/multilevel models
  • Proc Genmod – Generalized Linear Model (repeated effects)
  • Proc Surveylogistic – Logistic regression for complex survey data (clustered samples)
  • Proc Lifereg – Event history analysis (Accelerated Failure Time Models)
  • Proc Phreg – Event history analysis (Proportional hazard models, or Cox regression)
Course Text: 

None, but course notes will be supplied.
 

References: 

Allison, P. 2005. Fixed Effects Regression Methods for Longitudinal Data Using SAS.
