Home     About     Archive     Links

Reproducible Research - PA2

Analysis on U.S. NOAA storm database

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. The events in the database start in the year 1950 and end in November 2011.

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

# cleanup
rm(list=ls())
echo = TRUE  # Make code visible
source("multiplot.R")  # multiplot
suppressWarnings(library(plyr))
library(knitr)
suppressWarnings(library(ggplot2))

system.time(df <- read.csv(bzfile("repdata_data_StormData.csv.bz2"), 
                           header = TRUE, 
                           #quote = "", 
                           strip.white=TRUE,
                           stringsAsFactors = FALSE))
##    user  system elapsed 
##   94.81    0.59   95.44
#   user  system elapsed 
# 469.36    3.18  656.79 << pc1
#str(df)
dim(df)  # 902297     37
## [1] 902297     37
#colnames(df)

Results

  • In this study, it’s assumed that harmful events with respect to population health comes from variables FATALITIES and INJURIES.

  • Select useful data

#
df <- df[ , c("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
#str(df)
#dim(df) # 902297      8
#length(unique(df$EVTYPE)) # 985

#head(df$BGN_DATE)
df$BGN_DATE <- as.POSIXct(df$BGN_DATE,format="%m/%d/%Y %H:%M:%S")
#head(df$BGN_DATE)
#unique(df$PROPDMGEXP)
#unique(df$CROPDMGEXP)
#
  • Create new variables: TOTAL_PROPDMG, TOTAL_CROPDMG and TOTALDMG with: TOTALDMG = (TOTAL_PROPDMG + TOTAL_CROPDMG)
tmpPROPDMG <- mapvalues(df$PROPDMGEXP,
                         c("K","M","", "B","m","+","0","5","6","?","4","2","3","h","7","H","-","1","8"), 
                         c(1e3,1e6, 1, 1e9,1e6,  1,  1,1e5,1e6,  1,1e4,1e2,1e3,  1,1e7,1e2,  1, 10,1e8))

tmpCROPDMG <- mapvalues(df$CROPDMGEXP,
                         c("","M","K","m","B","?","0","k","2"),
                         c( 1,1e6,1e3,1e6,1e9,1,1,1e3,1e2))
#colnames(df)
df$TOTAL_PROPDMG <- as.numeric(tmpPROPDMG) * df$PROPDMG
df$TOTAL_CROPDMG <- as.numeric(tmpCROPDMG) * df$CROPDMG
colnames(df)
##  [1] "EVTYPE"        "BGN_DATE"      "FATALITIES"    "INJURIES"     
##  [5] "PROPDMG"       "PROPDMGEXP"    "CROPDMG"       "CROPDMGEXP"   
##  [9] "TOTAL_PROPDMG" "TOTAL_CROPDMG"
remove(tmpPROPDMG)
remove(tmpCROPDMG)

df$TOTALDMG <- df$TOTAL_PROPDMG + df$TOTAL_CROPDMG
##
head(unique(df$EVTYPE))
## [1] "TORNADO"               "TSTM WIND"             "HAIL"                 
## [4] "FREEZING RAIN"         "SNOW"                  "ICE STORM/FLASH FLOOD"

Population health impact

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
summary1 <- ddply(df,"EVTYPE", summarize, propdamage = sum(TOTALDMG), injuries= sum(INJURIES), fatalities = sum(FATALITIES), persdamage = sum(INJURIES)+sum(FATALITIES))

summary1 <- summary1[order(summary1$propdamage, decreasing = TRUE),]
#head(summary1,10)
#tmp = head(summary1,10)

summary2 <- summary1[order(summary1$persdamage, decreasing = TRUE),]
#head(summary2,10)

plot2 <- ggplot(data=head(summary2,10), aes(x=EVTYPE, y=persdamage, fill=persdamage)) + 
  geom_bar(stat="identity",position=position_dodge()) +
  labs(x = "event type", y = "personal damage (injuries and fatalities)") + 
  scale_fill_gradient("personal damage", low = "lightblue", high = "darkblue") + 
  ggtitle("Effect of Severe Weather on Public Health") +
  theme(axis.text.x = element_text(angle=90, hjust=1))
print(plot2)

center </br>

  • From the above figure we can see that TORNADOES have the most significant impact on public health.

</br>

  1. Across the United States, which types of events have the greatest economic consequences?
##

plot1 <- ggplot(data=head(summary1,10), aes(x=EVTYPE, y=propdamage, fill=propdamage)) + 
  geom_bar(stat="identity",position=position_dodge()) +
  labs(x = "event type", y = "property damage (in $USD)")  +
  scale_fill_gradient("$USD", low = "lightblue", high = "darkblue") +
  ggtitle("Effect of Severe Weather on the U.S. Economy") +
  theme(axis.text.x = element_text(angle=90, hjust=1))
print(plot1)

center </br>

  • The FLOODS, HURRICANES/TYPHOONES and TORNADOES are the events with the greatest economic consequences.
summary3 <- summary1[order(summary1$"injuries", decreasing = TRUE),]
#head(summary3,10)

plot3 <- ggplot(data=head(summary3,10), aes(x=EVTYPE, y=injuries, fill=injuries)) + 
  geom_bar(stat="identity",position=position_dodge()) +
  #labs(x = "event type", y = "personal damage (injuries and fatalities)") + 
  labs(x = "event type", y = "personal damage (injuries)") + 
  scale_fill_gradient("injuries", low = "yellow", high = "red") +
  theme(axis.text.x = element_text(angle=90, hjust=0.8))
  #scale_fill_gradient("personal damage", low = "yellow", high = "red") + 
  #theme(axis.text.x = element_text(angle=60, hjust=1))

summary4 <- summary1[order(summary1$"fatalities", decreasing = TRUE),]
#head(summary4,10)

plot4 <- ggplot(data=head(summary4,10), aes(x=EVTYPE, y=fatalities, fill=fatalities)) + 
  geom_bar(stat="identity",position=position_dodge()) +
  labs(x = "event type", y = "personal damage (fatalities)") +
  scale_fill_gradient("fatalities", low = "yellow", high = "red") + 
  theme(axis.text.x = element_text(angle=90, hjust=0.8))
#print(plot3)
#print(plot4)
multiplot(plot3, plot4, cols=2)

center

#

</br>

  • TORNADO is the harmful event with respect to population health, and
  • FLOOD is the event which have the greatest economic consequences.

Statistical Inference

Introduction

This is the project for the statistical inference class. In it, you will use simulation to explore inference and do some simple inferential data analysis. The project consists of two parts:

  1. A simulation exercise.
  2. Basic inferential data analysis.

Simulation exercises

The exponential distribution can be simulated in R with rexp(n, lambda) where lambda is the rate parameter. The mean of exponential distribution is 1/lambda and the standard deviation is also 1/lambda. Set lambda = 0.2 for all of the simulations. You will investigate the distribution of averages of 40 exponentials. Note that you will need to do a thousand simulations.

set.seed(33)
lambda <- 0.2
# We perform 1000 simulations with 40 samples 
sample_size <- 40
simulations <- 1000

# Lets do 1000 simulations
simulated_exponentials <- replicate(simulations, mean(rexp(sample_size,lambda)))
# 
simulated_means  <- mean(simulated_exponentials)
simulated_median <- median(simulated_exponentials)
simulated_sd     <- sd(simulated_exponentials)

Results

1. Show the sample mean and compare it to the theoretical mean of the distribution.

The theoretical mean and sample mean for the exponential distribution.

tm <- 1/lambda                     # calculate theoretical mean
sm <- mean(simulated_exponentials) # calculate avg sample mean

Theoretical mean: 5

Sampling mean: 4.9644306

The sample mean is very close to the theoretical mean at 5.

hist(simulated_exponentials,  freq=TRUE, breaks=50,
     main="Sample Means of Exponentials (lambda 0.2)",
     xlab="Sample Means from 1000 Simulations")
abline(v=5, col="blue", lwd=2)

center

2. Show how variable the sample is (via variance) and compare it to the theoretical variance of the distribution.

# Calculation of the theoretical sd
theor_sd <- (1/lambda)/sqrt(40)
# Calculation of the theoretical variance for sampling data
theor_var <- theor_sd^2

simulated_var    <- var(simulated_exponentials)

The variance for the sample data is : 0.6456619 and the theoretical variance is : 0.625 Both these values are quite close to each other.

3. Show that the distribution is approximately normal.

qqnorm(simulated_exponentials, ylab = "Sample Means of Exponentials (lambda 0.2)")
qqline(simulated_exponentials, col = 2)

center

we see that the distribution is approximately normal as the straight line is closer to the points.

You're up and running!

Next you can update your site name, avatar and other options using the _config.yml file in the root of your repository (shown below).

_config.yml

The easiest way to make your first post is to edit this one. Go into /_posts/ and update the Hello World markdown file. For more instructions head over to the Jekyll Now repository on GitHub.

<<<<<<< HEAD