Crawler

For static websites, you can fetch the page's HTML with an HTTP request and then use JavaScript to extract the information. Many websites, however, build their HTML dynamically with JavaScript, and this approach does not work on those sites. This integration uses Google's Puppeteer library for Node.js, which drives a headless browser with full JavaScript support.
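
As a point of comparison, the static approach mentioned above could look roughly like the sketch below. The helper names `extractTitle` and `fetchTitle` are hypothetical and not part of this integration, the global `fetch` assumes Node.js 18 or later, and the regular expression is only a stand-in for a proper HTML parser.

```javascript
// Hypothetical helper: pull the <title> text out of a static HTML string.
// A regular expression is fragile for real-world HTML; it only illustrates the idea.
function extractTitle(html) {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html)
  return match ? match[1].trim() : null
}

// Fetch a static page and extract its title.
// Assumes Node.js 18+ where fetch is available globally.
async function fetchTitle(url) {
  const response = await fetch(url)
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status)
  }
  return extractTitle(await response.text())
}
```

On a page that fills in its content with JavaScript after loading, the HTML returned by `fetch` does not yet contain the rendered data, which is why the crawler uses a headless browser instead.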

Usage: Extract data from a web site

Web service

To install the web service on your local cluster, run:

helm install crawler spacetime/crawler

Script

In this example, we will scrape the contents of a weather forecast website.

The script below is a standard Puppeteer script, so you can develop and test it in Node.js first. Create a new Script and add the following to the definition:

// Assumption: puppeteer is provided by the script environment.
// When running in plain Node.js, load it first: const puppeteer = require('puppeteer')
const get = async (url) => {
  const browser = await puppeteer.launch({ args: ['--no-sandbox'] })
  try {
    const page = await browser.newPage()
    const navigationPromise = page.waitForNavigation()

    await page.setViewport({ width: 1386, height: 878 })
    await page.goto(url)
    await navigationPromise

    // Result object; the initial values are placeholders that are overwritten below.
    const forecast = {
      precipitation: { next_hour: -1, tomorrow: -1, day_after_tomorrow: -1 },
      temperature: {
        min: { tomorrow: -30, day_after_tomorrow: -30 },
        max: { tomorrow: -30, day_after_tomorrow: -30 }
      }
    }

    // Wait until the first forecast table has been rendered.
    await page.waitForSelector('.forecast:nth-child(1) tr > td:nth-child(8)')

    // Precipitation: column 8 of forecast table i (tables 1, 2 and 3).
    let i = 1
    for (const field of Object.keys(forecast.precipitation)) {
      const selector = '.forecast:nth-child(' + i + ') tr > td:nth-child(8)'
      // Text such as "0,3 mm" becomes the number 0.3.
      const rainPerHour = await page.$$eval(selector, cells =>
        cells.map(cell => parseFloat(cell.textContent.split(' ')[0].replace(',', '.')))
      )
      console.log(rainPerHour)
      if (i === 1) {
        // next_hour: only the first row of the first table is used.
        forecast.precipitation[field] = rainPerHour[0]
      } else {
        // Other fields: sum of all hourly values, rounded to one decimal.
        let total = 0
        rainPerHour.forEach(p => { total += p })
        forecast.precipitation[field] = Math.round(10 * total) / 10
      }
      ++i
    }

    // Temperature: column 3 of forecast tables 2 and 3.
    i = 2
    for (const field of Object.keys(forecast.temperature.min)) {
      const selector = '.forecast:nth-child(' + i + ') tr > td:nth-child(3)'
      // Text such as "12°" becomes the integer 12.
      const tempPerHour = await page.$$eval(selector, cells =>
        cells.map(cell => parseInt(cell.textContent.split('°')[0], 10))
      )
      console.log(tempPerHour)
      // Start values outside the expected range so any reading replaces them.
      let max = -30
      let min = 50
      tempPerHour.forEach(t => {
        max = Math.max(max, t)
        min = Math.min(min, t)
      })
      forecast.temperature.max[field] = Math.round(max)
      forecast.temperature.min[field] = Math.round(min)
      ++i
    }

    return forecast
  } finally {
    // Always close the browser, even when scraping fails.
    await browser.close()
  }
}
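
The text-to-number conversions used in the script can also be written as small standalone functions, which makes them easy to check in Node.js without opening a browser. The sketch below is illustrative only: the function names are hypothetical, and the input formats (`"0,3 mm"`, `"12°"`) are assumed from the parsing logic in the script rather than taken from the website itself.

```javascript
// Convert text such as "0,3 mm" to the number 0.3.
// Returns NaN when the text does not start with a number.
function parsePrecipitation(text) {
  return parseFloat(text.trim().split(' ')[0].replace(',', '.'))
}

// Convert text such as "12°" to the integer 12.
function parseTemperature(text) {
  return parseInt(text.split('°')[0], 10)
}

// Sum hourly values, skipping NaN entries, rounded to one decimal.
function totalPrecipitation(values) {
  const total = values
    .filter(value => !Number.isNaN(value))
    .reduce((sum, value) => sum + value, 0)
  return Math.round(10 * total) / 10
}
```

Callbacks passed to `page.$$eval` run inside the page, not in Node.js, so helpers like these cannot be referenced from within them. One option is to return the raw cell text from `$$eval` and apply the helpers to the resulting array afterwards.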

To test the script, enter the code below in the test section and select Run.

var forecast = await get('https://www.buienradar.nl/weer/eersel/nl/2756342/5daagse')
print("Forecast:" + JSON.stringify(forecast))

Secret

Create a new secret to set this script as the data source.

Importer

Create a new importer and set the Datasource to the name of the secret. In the object mapping, you can link the topic to the URL that is passed to the script. The key mapping extracts the data from the returned JSON objects.
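
To illustrate what extracting data from the returned JSON amounts to, the sketch below looks up nested values by a dotted key path. This is only a conceptual aid: the importer's actual key mapping syntax is not described here, so the `getByPath` helper and the dotted-path notation are assumptions, and the sample values are made up.

```javascript
// Hypothetical helper: read a nested value using a dotted path.
// Returns undefined when any part of the path is missing.
function getByPath(object, path) {
  return path.split('.').reduce(
    (value, key) => (value !== null && value !== undefined ? value[key] : undefined),
    object
  )
}

// Sample object with the same shape as the script's return value (values are made up).
const forecast = {
  precipitation: { next_hour: 0, tomorrow: 1.2, day_after_tomorrow: 0.4 },
  temperature: {
    min: { tomorrow: 7, day_after_tomorrow: 9 },
    max: { tomorrow: 14, day_after_tomorrow: 16 }
  }
}

const maxTomorrow = getByPath(forecast, 'temperature.max.tomorrow')
const rainTomorrow = getByPath(forecast, 'precipitation.tomorrow')
```

Each mapped key would then correspond to one such path into the object returned by `get`.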