We need 10 Ruby developers with experience scraping websites to join our scraper team. The monthly salary will be up to 1200 USD, depending on your results.
We currently have a setup which scrapes 300 websites every hour.
We need to add a couple of thousand new spiders to our setup, and that would be your first job.
Afterwards there will be a continuous need for your help in maintaining all the spiders - fixing them when a website changes, etc.
We use the Kimura Framework (Ruby) for writing the spiders. Basic knowledge of XPath is needed, and some experience in scraping is definitely required. We only use Mechanize for our spiders - no headless Chrome or other solutions for JavaScript rendering (too slow), so a bit of trickery is often needed.
Writing the spiders is relatively easy, but it is quite time-consuming because of the scale we need. I've built the setup and the spiders we currently have, and will be available for assistance in the beginning.
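To give you an idea of the day-to-day work, here is a minimal sketch of a spider in this style, assuming the Kimura Framework refers to the Kimurai gem with its Mechanize engine. The site, class name, and XPath selectors below are made up for illustration only.

require 'kimurai'

# Hypothetical spider: the URL and selectors are placeholders, not a real target.
class ExampleSpider < Kimurai::Base
  @name = "example_spider"
  @engine = :mechanize              # plain HTTP requests, no JavaScript rendering
  @start_urls = ["https://example.com/listings"]

  def parse(response, url:, data: {})
    # With the Mechanize engine, response is a Nokogiri document, so XPath works directly.
    response.xpath("//div[@class='listing']").each do |listing|
      item = {}
      item[:title] = listing.xpath(".//h2").text.strip
      item[:price] = listing.xpath(".//span[@class='price']").text.strip
      save_to "results.json", item, format: :pretty_json
    end
  end
end

ExampleSpider.crawl!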
Questions to answer in your application:
You've got an HTML table with rows on a website. The first column of a row contains the text "Room size". The next column in that table row contains the value "20 m2". Build an XPath expression for extracting the value "20 m2" based on the information provided here.
You need to scrape a website that, you realise, does not render its content on page load. It fetches the data from an API with an XHR request and then renders it with Angular. Which steps would you take to scrape this website?
After having successfully written a spider, you realise that it has started being blocked by the website. The WordPress website uses Wordfence, which blocks our traffic because we perform too many requests. How would you go about circumventing this?
We've got a website which protects a phone number we want by printing it on a simple white-background image. How do you go about extracting the phone number?