Crawler
For static websites you can fetch the page's HTML with a plain HTTP request and then use JavaScript to extract the information from it. Many websites, however, use JavaScript to build their HTML dynamically in the browser, and on those sites this approach does not work. This integration uses Google's Puppeteer library for Node.js, which drives a headless Chromium browser, so that pages are fully rendered before the data is extracted.
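The difference can be illustrated with a minimal sketch of the static approach. The helper `extractText` and both HTML snippets are hypothetical and exist only for this illustration; in practice the HTML would come from an HTTP request and would normally be parsed with a proper HTML parser rather than a regular expression.

```javascript
// Hypothetical helper: return the text of the first element with the given id.
// A regular expression is enough for this sketch; prefer an HTML parser in practice.
function extractText(html, id) {
  const pattern = new RegExp('<[^>]*\\bid="' + id + '"[^>]*>([^<]*)<')
  const match = html.match(pattern)
  return match ? match[1].trim() : null
}

// Static page: the value is already present in the HTML sent by the server.
const staticHtml = '<p id="temp">12°</p>'

// Dynamic page: the server sends an empty placeholder that a script fills in
// later, but only when the page runs in a browser.
const dynamicHtml =
  '<p id="temp"></p><script>document.getElementById("temp").innerText = "12°"</script>'

const staticResult = extractText(staticHtml, 'temp')
const dynamicResult = extractText(dynamicHtml, 'temp')

console.log(JSON.stringify(staticResult))
console.log(JSON.stringify(dynamicResult))
```

For the static snippet the value is found, while for the dynamic snippet the placeholder is still empty because no script has run. That gap is what a browser-based tool such as Puppeteer closes.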
Usage: Extract data from a web site
Web service
To install the web service on your local cluster run:
helm install crawler spacetime/crawler
Script
In this example we will scrape the contents of a weather forecast web site.
The script below is a standard Puppeteer script, so you can develop and test it in Node.js first. Create a new Script and add the following to the definition:
const get = async (url) => {
  // --no-sandbox disables the Chromium sandbox, which is commonly required
  // when the browser runs as root inside a container.
  const browser = await puppeteer.launch({ args: ['--no-sandbox'] })
  try {
    const page = await browser.newPage()
    const navigationPromise = page.waitForNavigation()
    await page.setViewport({ width: 1386, height: 878 })
    await page.goto(url)
    await navigationPromise

    // Placeholder values; every field is overwritten by the loops below.
    const forecast = {
      precipitation: { next_hour: -1, tomorrow: -1, day_after_tomorrow: -1 },
      temperature: {
        min: { tomorrow: -30, day_after_tomorrow: -30 },
        max: { tomorrow: -30, day_after_tomorrow: -30 }
      }
    }

    // Wait until the first forecast table has been rendered.
    let selector = '.forecast:nth-child(1) tr > td:nth-child(8)'
    await page.waitForSelector(selector)

    // Precipitation: table 1 is today, table 2 tomorrow, table 3 the day after.
    let i = 1
    for (const field of Object.keys(forecast.precipitation)) {
      selector = '.forecast:nth-child(' + i + ') tr > td:nth-child(8)'
      const rainPerHour = await page.$$eval(selector, cells => {
        // Take the leading number of each cell and convert the decimal comma.
        return cells.map(cell => parseFloat(cell.textContent.split(' ')[0].replace(',', '.')))
      })
      console.log(rainPerHour)
      if (i === 1) {
        // next_hour: only the first hourly value of today's table.
        forecast.precipitation[field] = rainPerHour[0]
      } else {
        // Daily total, rounded to one decimal.
        let total = 0
        rainPerHour.forEach(p => { total += p })
        forecast.precipitation[field] = Math.round(10 * total) / 10
      }
      ++i
    }

    // Temperature: start at table 2 (tomorrow).
    i = 2
    for (const field of Object.keys(forecast.temperature.min)) {
      selector = '.forecast:nth-child(' + i + ') tr > td:nth-child(3)'
      const tempPerHour = await page.$$eval(selector, cells => {
        return cells.map(cell => parseInt(cell.textContent.split('°')[0], 10))
      })
      let max = -30
      let min = 50
      console.log(tempPerHour)
      tempPerHour.forEach(p => {
        max = Math.max(max, p)
        min = Math.min(min, p)
      })
      forecast.temperature.max[field] = Math.round(max)
      forecast.temperature.min[field] = Math.round(min)
      ++i
    }
    return forecast
  } finally {
    // Close the browser even when a selector times out or navigation fails.
    await browser.close()
  }
}
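The text handling inside the script can be tried without a browser. The sketch below isolates the parsing and rounding steps as plain functions. The function names and the sample cell contents (`'0,3 mm'`, `'14°'`) are assumptions inferred from how the script splits the text; they are not taken from the actual page.

```javascript
// Parse a precipitation cell such as "0,3 mm" (decimal comma) into a number.
function parsePrecipitation(text) {
  return parseFloat(text.split(' ')[0].replace(',', '.'))
}

// Parse a temperature cell such as "14°" into an integer.
function parseTemperature(text) {
  return parseInt(text.split('°')[0], 10)
}

// Sum hourly values and round to one decimal, as the script does per day.
function dailyTotal(values) {
  const total = values.reduce((sum, v) => sum + v, 0)
  return Math.round(10 * total) / 10
}

// Made-up sample cells, not real forecast data.
const rainCells = ['0,3 mm', '1,2 mm', '0 mm']
const tempCells = ['14°', '9°', '17°']

const rain = rainCells.map(parsePrecipitation)
const temps = tempCells.map(parseTemperature)

console.log(dailyTotal(rain))
console.log(Math.min(...temps), Math.max(...temps))
```

Keeping these steps as small pure functions makes it easier to check the number handling separately from the selectors, which tend to break first when the site layout changes.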
To test the script, enter the code below in the test section and select Run.
var forecast = await get('https://www.buienradar.nl/weer/eersel/nl/2756342/5daagse')
print("Forecast:" + JSON.stringify(forecast))
Secret
Create a new secret to set this script as the data source.
Importer
Create a new importer and set the Datasource to the name of the secret. In the object mapping you can link the topic to the URL that is passed to the script. The key mapping extracts the data from the returned JSON object.
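The importer's mapping syntax is not shown here, so the following is only a sketch of the idea behind a key mapping: picking single values out of the nested JSON object that the script returns. The helper `resolvePath`, the dot-separated path notation, and the sample numbers are all hypothetical.

```javascript
// Hypothetical helper: look up a dot-separated path in a nested object.
// Returns undefined when any part of the path is missing.
function resolvePath(obj, path) {
  return path.split('.').reduce(
    (value, key) => (value == null ? undefined : value[key]),
    obj
  )
}

// Example result in the shape returned by get(); the numbers are made up.
const sampleForecast = {
  precipitation: { next_hour: 0, tomorrow: 1.5, day_after_tomorrow: 0.2 },
  temperature: {
    min: { tomorrow: 9, day_after_tomorrow: 11 },
    max: { tomorrow: 17, day_after_tomorrow: 19 }
  }
}

console.log(resolvePath(sampleForecast, 'temperature.max.tomorrow'))
console.log(resolvePath(sampleForecast, 'precipitation.next_hour'))
```

Each leaf of the returned object, such as the maximum temperature for tomorrow, would then correspond to one key in the mapping.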