Scrape ski hill snow reports with XPaths

This was a fun little project to see if I could build a web page that would gather snow reports from ski hills every day.

The first issue is that most of the web pages need JavaScript to run in order to fill in the data, which makes them a bit of a pain to scrape. I chose headless Chrome to get the complete, rendered page, and the code is pretty simple:

import subprocess

# USER_AGENT is whatever browser user agent string you want to send
page_source = subprocess.check_output(['google-chrome',
                                       '--no-sandbox',
                                       '--headless',
                                       '--user-agent=' + USER_AGENT,
                                       '--disable-gpu',
                                       '--blink-settings=imagesEnabled=false',
                                       '--dump-dom',
                                       '--virtual-time-budget=10000',
                                       url])
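
A couple of notes on the flags: --dump-dom makes Chrome print the serialized DOM to stdout once the page has loaded, which is what check_output captures, and --virtual-time-budget=10000 gives the page's JavaScript up to 10 seconds of virtual time to finish rendering before the dump happens. The output comes back as bytes, which lxml is happy to parse directly.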

Once you have the HTML, you can convert it to an lxml element tree and extract elements like this:

from lxml import html

tree = html.fromstring(page_source)
elements = tree.xpath(xpath)  # xpath() returns a list of matching elements
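
Since xpath() hands back a list, it's worth wrapping the lookup so a hill whose markup has changed just yields nothing instead of crashing the whole scrape. Something like this sketch (extract_text is my own helper name, not anything from lxml):

def extract_text(tree, xpath):
    # Return the first match's text, or None if the XPath found nothing
    matches = tree.xpath(xpath)
    if not matches:
        return None
    return matches[0].text_content().strip()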

Now we need an actual XPath. XPaths are pretty flexible in that you can find an element containing some text, go up to its parent, and extract values from another child. Using www.whistlerblackcomb.com as an example, the XPaths look like this:

Overnight
//*[contains(text(),"12 Hour")]/parent::div/h5

Last 24 hours
//*[contains(text(),"24 Hour")]/parent::div/h5

Last 48 hours
//*[contains(text(),"48 Hour")]/parent::div/h5

Last 7 days
//*[contains(text(),"7 Day")]/parent::div/h5

Total
//*[contains(text(),"Current")]/parent::div/h5

Base
//*[contains(text(),"Base")]/parent::div/h5
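
Put those in a dict and the whole report for one hill is just a loop. The names below are mine, and this assumes tree is the parsed Whistler Blackcomb page from above:

whistler_xpaths = {
    'Overnight':     '//*[contains(text(),"12 Hour")]/parent::div/h5',
    'Last 24 hours': '//*[contains(text(),"24 Hour")]/parent::div/h5',
    'Last 48 hours': '//*[contains(text(),"48 Hour")]/parent::div/h5',
    'Last 7 days':   '//*[contains(text(),"7 Day")]/parent::div/h5',
    'Total':         '//*[contains(text(),"Current")]/parent::div/h5',
    'Base':          '//*[contains(text(),"Base")]/parent::div/h5',
}

report = {}
for name, xp in whistler_xpaths.items():
    matches = tree.xpath(xp)
    # Keep None when a site redesign breaks an XPath so the page can show a gap
    report[name] = matches[0].text_content().strip() if matches else None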

Do similar things for a bunch of ski hills and you end up with something that looks like this: