How to Stop or Profit from Robots Scraping Your Data

Boca Chamber Member Update:

What Is Data Scraping?

Data scraping refers to extracting data or content from a website or series of websites, a database, an enterprise application, or a legacy system. The extracted data is exported into a file or program, where it is either used for a specific purpose or integrated or migrated into a new system.
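To make the definition concrete, here is a minimal sketch of a scraper that extracts email addresses from a page and exports them as CSV text. The sample HTML, the choice of `mailto:` links as the target, and the names `EmailLinkParser` and `scrape_emails_to_csv` are all assumptions made for illustration. Real scrapers are usually more elaborate, but the extract-then-export shape is the same.

```python
import csv
import io
from html.parser import HTMLParser


class EmailLinkParser(HTMLParser):
    """Hypothetical helper: collects addresses from <a href="mailto:..."> links."""

    def __init__(self):
        super().__init__()
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # attrs is a list of (name, value) pairs; value can be None.
            if name == "href" and value and value.startswith("mailto:"):
                self.emails.append(value[len("mailto:"):])


def scrape_emails_to_csv(html):
    """Extract mailto addresses from an HTML string and return CSV text."""
    parser = EmailLinkParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["email"])  # header row
    for email in parser.emails:
        writer.writerow([email])
    return out.getvalue()


# Made-up page content, standing in for a downloaded web page.
SAMPLE_PAGE = """
<ul>
  <li><a href="mailto:ann@example.com">Ann</a></li>
  <li><a href="/contact">Contact page</a></li>
  <li><a href="mailto:bob@example.com">Bob</a></li>
</ul>
"""

print(scrape_emails_to_csv(SAMPLE_PAGE))
```

Note that the scraper depends on the page keeping a predictable structure, which is why some of the defenses described later in this article involve changing that structure.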

As you read this, remember that HUMANS find it economically interesting to pay other humans to build web-based robots to scrape your site, which contains valuable information. If your site’s information wasn’t valuable, nobody would be getting paid to write a scraper to scrape your site.

If you’re interested in learning more about data scraping, data scrapers, or need to protect yourself from scrapers, contact The SilverLogic, a Boca Raton-based software development company.

Web vs Screen Scraping

Although these terms are sometimes used interchangeably, web scraping and screen scraping are two separate techniques. The lines blur because screen scraping can be performed on the web and web scraping is sometimes used during migrations, but it’s easiest to view web scraping primarily as a tool for “Data Analysis, Acquisition, & Research” and screen scraping as a tool for “Integration & Migration”.

The difference between web and screen scraping is:

– Web Scraping: Extracts data or content from the web. Content scraping is a component of web scraping. Primarily used for research, analysis, comparison, strategy, and extracting specific information from one massive source or from multiple sources.

– Screen Scraping: Extracts screen and other data from an application, desktop, web, or legacy system. Primarily used for scraping ERP or CRM data to integrate into a new system, mirroring the display of a legacy system, migrating content, business process automation, etc.

Web or content scraping can be done manually by a person or automatically by a program, while screen scraping is done by a program. Screen scraping is a versatile tool for data migration because it enables accurate extraction of legacy system data and its integration into a newer, more cost-efficient and effective platform. Unfortunately, if you never agreed to have your site or software scraped, someone might be taking advantage of your data.

Can You Protect Your Site Against Scrapers?

If you’ve noticed that prompts asking you to prove you’re not a robot by picking out objects from blurrier and blurrier images are cropping up all over the web, you’ve experienced a webmaster trying to block bots from scraping their sites.

Although this might stop the hobbyist scraper, it’s not going to stop a professional developer. After all, every time you pick out the images that contain a bicycle or a traffic light, you’re helping to train robots to pass these tests. Nevertheless, adding a CAPTCHA to your site is a good idea, as it will block some scammers and bots.

If you really want to block other people from successfully scraping your site, consider doing the following:

– Regularly change the labels on your site. The robots scraping your site might be looking for a label called “email”, so consider regularly changing the position and names of labels on key parts of your website.

– Make users complete a small task before viewing important information, and rotate that task. For example, if your site lists important email addresses, reveal each address only after a click one month and require a different step the next.

– Throttle the API calls users can make. A human moving at human speed can only make so many calls to your database per day, even if they spend all day copying and pasting from your site. Block users from making more calls per hour than a person plausibly could.

– Start a subscription program! If people are willing to pay to make more calls than a casual user would need, let them pay for that access to your data.
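The throttling suggestion above can be sketched as a small sliding-window limiter. This is a minimal illustration under stated assumptions, not a production design: the class name `HourlyThrottle`, its parameters, and the client identifier `"visitor-42"` are hypothetical, and the call history is kept in memory in a single process.

```python
import time
from collections import defaultdict, deque


class HourlyThrottle:
    """Hypothetical limiter: at most `max_calls` per client per sliding window."""

    def __init__(self, max_calls, window_seconds=3600.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested without waiting
        self._calls = defaultdict(deque)  # client_id -> timestamps of allowed calls

    def allow(self, client_id):
        """Return True and record the call if the client is under its limit."""
        now = self.clock()
        calls = self._calls[client_id]
        # Forget calls that have aged out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True


# Example: a casual visitor is limited to 3 calls per hour.
throttle = HourlyThrottle(max_calls=3)
results = [throttle.allow("visitor-42") for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

The same structure could support the subscription idea: a paying subscriber would simply be given a throttle with a higher `max_calls` than a casual visitor.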

Good luck!
