clear
set more off
cd "C:\Documents and Settings\Andrew Leigh\My publications\Aust - forecasting elections 2004\"

use centrebet.dta, clear
sort dateno

* Merging the polling data
for any morganpolls newspolls acnpolls galaxypolls: merge dateno using X.dta \ drop _merge \ sort dateno
drop if dateno==.
* Setting sample size to 1000 for Morgan and 1200 for Newspoll (1700 for the Newspoll dated 38265 in Excel terms, i.e. 5 Oct 2004)
gen morg_samp=1000 if coalition_morg~=.
gen news_samp=1200 if coalition_news~=.
replace news_samp=1700 if dateno==38265
for var coalition_* alp_*: replace X=X/100

* Merging data for 3 other bookies
sort dateno
merge dateno using "C:\Documents and Settings\Andrew Leigh\My publications\Aust - forecasting elections 2004\otherbookies.dta"
drop _merge
la var coalition_sb "SportingBet"
la var coalition_ias "International All Sports"
la var coalition_sa "SportsAcumen"

* Date format: subtracting 21916 (the Excel serial number for 1 Jan 1960) from dateno converts Excel dates into Stata dates
replace dateno=dateno-21916
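* Optional sanity check, added for illustration (not part of the original analysis): Stata counts days from 1 Jan 1960, so
* election day (9 Oct 2004) should equal 16353, the value used below, and the Excel date 38265 should map to 5 Oct 2004.
assert mdy(1,1,1960)==0
assert mdy(10,9,2004)==16353
assert 38265-21916==mdy(10,5,2004)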

* Merging Betfair data
ren date datetemp
ren dateno date
sort date
merge date using "C:\Documents and Settings\Andrew Leigh\My publications\Aust - forecasting elections 2004\temp2.dta"
drop _merge
la var coalition_bf "Betfair"
ren date dateno
ren datetemp date

* Now, dropping election day data (Oct 9, 2004 = 16353)
drop if dateno==16353

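* Building a 7-day sample-weighted average of ALP poll shares (today plus the previous 6 days), after filling in missing dates
tsset dateno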
tsfill
for any morg news acn galaxy: replace X_samp=. if alp_X==. \ gen temp_X=X_samp*alp_X
egen r_alp=rsum(temp_morg temp_news temp_acn temp_galaxy)
egen r_samp=rsum(morg_samp news_samp acn_samp galaxy_samp)
for any r_alp r_samp: recode X .=0
gen alp_polls=(r_alp+l.r_alp+l2.r_alp+l3.r_alp+l4.r_alp+l5.r_alp+l6.r_alp)/(r_samp+l.r_samp+l2.r_samp+l3.r_samp+l4.r_samp+l5.r_samp+l6.r_samp)
gen coalition_polls=1-alp_polls
ren r_samp polls_samp
drop temp* r_alp
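* Worked example of the sample-weighted average, using hypothetical numbers (not taken from the data): if a 1000-person poll
* puts the ALP on 48% and a 1700-person poll in the same 7-day window puts it on 50%, the pooled share is total ALP
* respondents over total respondents, which sits closer to the larger poll:
di (1000*.48+1700*.50)/(1000+1700)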

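* Polls into probabilities: z=(share-.5)/SE with the binomial SE sqrt(share*(1-share)/n), and pc_X=norm(z) is the implied chance of a Coalition win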
for any acn galaxy morg news polls: gen sd_X=((coalition_X*(1-coalition_X))/X_samp)^.5 \ gen z_X=(coalition_X-.5)/sd_X \ gen pc_X=norm(z_X) \ drop z_X sd_X
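* Worked example of the conversion above, using hypothetical numbers: a Coalition share of 52% in a 1200-person poll has a
* binomial standard error of about 1.4 points, so the implied chance of the Coalition being ahead is roughly 0.92.
di norm((.52-.5)/((.52*(1-.52))/1200)^.5)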
sort dateno
list dateno pc_acn pc_galaxy pc_morg pc_news
* Damn, look at the spread on those poll probability numbers...
sum pc_polls, d
sum pc_polls if dateno>=16312, d
list dateno pc_polls if dateno>=16312

* Now, to make graphs less jagged, we make each poll equal its last value (up to 2 weeks back) if missing
for X in any morg news acn galaxy polls: for Y in num 1/14: replace coalition_X=lY.coalition_X if coalition_X==. 
for X in any morg news acn galaxy polls: for Y in num 1/14: replace pc_X=lY.pc_X if pc_X==. 

for any c: la var pX_galaxy "Galaxy" \ la var pX_acn "AC Nielsen" \ la var pX_morg "Morgan" \ la var pX_news "Newspoll" \ la var pX_polls "7-day sample-weighted poll average"
for any coalition: la var X_galaxy "Galaxy" \ la var X_acn "AC Nielsen" \ la var X_morg "Morgan" \ la var X_news "Newspoll" \ la var X_polls "7-day sample-weighted poll average"
la var pc_centrebet "Centrebet"

* Pumping up the standard error of polls: fixing SE at .1 (10 percentage points) in place of the binomial SE
for any polls: gen sd_X=.1 \ gen z_X=(coalition_polls-.5)/sd_X \ gen pc_polls2=norm(z_X) \ drop z_X sd_X
sum pc_centrebet pc_poll* 
la var pc_polls2 "7-day sample-weighted poll average (adjusted variance)"
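* Worked example of the adjustment, using a hypothetical share: with the SE fixed at .1, a Coalition share of 52% maps to
* z=.2 and a win probability of roughly 0.58, far less extreme than the binomial SE implies for typical poll sizes.
di norm((.52-.5)/.1)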

* What's the average Coalition probability from polls & betting markets? (on days when we have data from both)
sum pc_polls pc_centrebet if pc_polls~=. & pc_centrebet~=.

* Filling gaps in Centrebet data with the previous day's value (since we have all price changes, this is kosher)
replace pc_centrebet=l.pc_centrebet if pc_centrebet==. 

format dateno %dDmY

* Getting predictions from Betfair & Centrebet at various horizons (16353 is polling day)
for num 1 90 365: list dateno pc_centrebet coalition_bf if dateno==16353-X

/*
* Figures 1, 4 & 5
sort dateno
set scheme s1mono
twoway (line coalition_bf dateno) (line pc_centrebet dateno) (line coalition_ias dateno) (line coalition_sb dateno) (line coalition_sa dateno) if dateno>=15872, xtitle("") ytitle("Chance of Coalition win") subtitle("Figure 1: Comparing Betting Markets Over the 16 Months Before the Election")
twoway (line coalition_acn dateno, clstyle(p2)) (line coalition_galaxy dateno, clstyle(p3)) (line coalition_morg dateno, clstyle(p4)) (line coalition_news dateno, clstyle(p5)) (line coalition_polls dateno, clstyle(p1) clwidth(medthick)), xtitle("") ytitle("Coalition 2PP Vote") ti("Figure 4: Pollsters Over the Election Cycle") yline(.5274) 
twoway (line pc_polls dateno) (line pc_polls2 dateno) (line pc_centrebet dateno), xtitle("") ytitle("Chance of Coalition win") subtitle("Figure 5: Comparing Polls and Betting Markets Over the Election Cycle") legend(col(1)) yline(.5)
*/

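* Arbitrage opportunities? Taking pairwise differences between the bookies' Coalition probabilities and listing days where the gap exceeds 9 percentage points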
for X in any coalition_bf pc_centrebet coalition_ias coalition_sb coalition_sa: for Y in any coalition_bf pc_centrebet coalition_ias coalition_sb coalition_sa: gen XY=X-Y
for var coalition_bfc* pc_centrebetc* coalition_iasc* coalition_sbc* coalition_sac*: list X dateno if (X>.09 | X<-.09) & X~=.

* Dickey-Fuller & KPSS tests (kpss is a user-written command; if missing, install with: ssc install kpss)
tsset dateno
for any pc_centrebet: dfuller X, regress \ kpss X if dateno>=16311 \ gen temp=d.X \ reg temp l.temp l2.temp l3.temp, r \ test l.temp l2.temp l3.temp \ twoway scatter temp l3.temp, mlabel(dateno) \ drop temp
for any pc_centrebet: gen temp=d.coalition_polls \ reg d.X l.temp l2.temp l3.temp,r \ test l.temp l2.temp l3.temp \ drop temp
gen betfair_temp=coalition_bf if betfair_noninterp==1
drop if betfair_temp==.
sort dateno
gen n=_n
tsset n
for any betfair_temp: dfuller X, regress \ kpss X \ gen temp=d.X \ reg temp l.temp l2.temp l3.temp, r \ test l.temp l2.temp l3.temp \ drop temp
for any betfair_temp: gen temp=d.coalition_polls \ reg d.X l.temp l2.temp l3.temp,r \ test l.temp l2.temp l3.temp \ drop temp

* GRAPHING MARGINALS DATA (Figure 3)
set scheme s1mono
use "C:\Documents and Settings\Andrew Leigh\My publications\Aust - forecasting elections 2004\marginals_betting.dta", clear
for var centrebet*: la var X "Probability of winning (Centrebet)"
la var voteshare "Share of two-party preferred vote"
twoway scatter voteshare centrebet_21_9, mlabel(seatname) ti("3 Weeks Before Poll") yline(50) xline(50) ytitle("Share of two-party preferred vote", margin(5)) graphregion(margin(r+10)) saving(threeweeks,replace)
twoway scatter voteshare centrebet_8_10, mlabel(seatname) ti("Election Eve") yline(50) xline(50) ytitle("") graphregion(margin(r+10)) saving(electioneve,replace)
graph combine threeweeks.gph electioneve.gph, fysize(70) ycommon xcommon ti("Figure 3: Seat-by-Seat Betting")

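* Inferring sample size of betting markets in marginal seats: if the market probability equals norm((voteshare-.5)/sd),
* then regressing (voteshare-.5) on invnorm(probability) without a constant estimates sd, and the implied binomial
* sample size is voteshare*(1-voteshare)/sd^2. Note that _b[temp2] below comes from the second (nocons) regression.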
for any voteshare centrebet_8_10: replace X=X/100
gen temp1=voteshare-.5
gen temp2=invnorm(centrebet_8_10)
reg temp1 temp2, r
reg temp1 temp2, nocons r
gen sampsize=voteshare*(1-voteshare)/((_b[temp2])^2)
sum sampsize
drop temp* sampsize
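* Worked example of the implied sample size, using hypothetical numbers: a no-constant slope of .05 (an implied SD of 5
* points) for a seat with a 50% vote share corresponds to a binomial sample of p*(1-p)/sd^2 = .25/.0025 = 100 voters.
di .5*(1-.5)/(.05^2)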

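* Calculating forecast error of marginal seat polls: mean absolute error first, then RMSE
use "C:\Documents and Settings\Andrew Leigh\My publications\Aust - forecasting elections 2004\polls_bookies_marginals.dta", clear
gen error=abs(result-poll)
sum error
* RMSE is the square root of the mean squared error (sum leaves the mean in r(mean))
gen sqerror=error^2
sum sqerror
di "RMSE of marginal seat polls = " sqrt(r(mean))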
reg result poll, nocons
reg result poll
reg result bookie
reg result poll bookie
