How to get the most out of web scraping with this simple guide

One of the crucial responsibilities in data engineering is aggregating data to support machine learning model training. Data sources range from well-organized databases to unstructured information scattered across the Internet. Data engineers often turn to Scrapy, an open-source scraping framework written in Python, for its effectiveness in extracting data from the open web.

Python's extensive library ecosystem provides a robust foundation for tools such as Scrapy. In this article, we'll delve into a technique for making web scraping more efficient: targeting the structured data exposed by a site's REST API rather than wrestling with unstructured web pages.

Let's explore this technique and see how it can make data extraction more effective.

Comparing Vanilla Web Pages to REST APIs

Why choose to crawl REST API responses over web pages? The answer lies in the nature of web scraping, which involves extracting structured data from inherently unstructured sources. Web pages, designed for human consumption, often present raw, intricate data unsuitable for machine interpretation. However, with the rise of modern web application frameworks like React and Vue.js, many websites now employ REST APIs to transmit and receive data, making the data inherently structured and accessible for effective scraping.
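
As a quick illustration of the difference, here is a minimal sketch (the markup, field names, and values below are made up for illustration): extracting a title from raw HTML means reverse-engineering the page's markup, while a JSON API response lets you address the field by name.

import json

from parsel import Selector  # the selector library that Scrapy uses under the hood

# Unstructured: a page built for humans - we have to guess at the markup
html = '<html><body><article><h1 class="post-title">Hello</h1></article></body></html>'
title_from_html = Selector(text=html).xpath('//h1[@class="post-title"]/text()').get()

# Structured: a REST API response - the field is addressed by name
api_body = '{"title": {"rendered": "Hello"}, "date_gmt": "2019-01-01T00:00:00"}'
title_from_api = json.loads(api_body)['title']['rendered']

print(title_from_html, title_from_api)  # both values are "Hello"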

While this article focuses on scraping from APIs, it's important to note that this method may not be universally suitable for all web scraping scenarios. The effectiveness of this approach depends on the specific characteristics of the target site. Readers are encouraged to assess and choose the most fitting method based on their unique use cases.

Tutorial: Web Scraping Using REST API with Scrapy

To grasp the intricacies of web scraping from APIs, let's dive into the process by creating a scraper. In this tutorial, we'll guide you through developing a scraper to extract posts from TechCrunch, a popular online publisher, using its own REST API.

Before proceeding, ensure that Scrapy is installed and you have completed the official tutorial. The code provided in this article is written in Python 3, using Scrapy version 1.6.0. You can find the finished code for this tutorial in the GitHub repository: canyousayyes/scrapy-web-crawler-by-rest-api.
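
If you want to verify your local environment first, the commands below install Scrapy and print the installed version (any reasonably recent release should also work):

    pip install scrapy
    scrapy version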

Setting Up the Project

  1. Create a New Scrapy Project: Open your terminal and execute the following command to start a new Scrapy project.

    scrapy startproject web_scraper
  2. Navigate to the Project Folder: Move into the newly created project folder.

    cd web_scraper
  3. Generate a CrawlSpider: Generate a new CrawlSpider specifically for TechCrunch.

    scrapy genspider -t crawl techcrunch techcrunch.com

Now, your Scrapy project is set up, and you have a dedicated CrawlSpider ready for customization based on the TechCrunch website.
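
For reference, the generated spider (located at web_scraper/spiders/techcrunch.py) looks roughly like the sketch below; the exact template text and capitalization can vary slightly between Scrapy versions. This is the file we'll be editing in the following steps.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TechCrunchSpider(CrawlSpider):
    name = 'techcrunch'
    allowed_domains = ['techcrunch.com']
    start_urls = ['http://techcrunch.com/']

    # Placeholder rule generated by the template; we'll replace it shortly
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item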

Defining Item Structure

To streamline the scraping process, it's beneficial to outline the schema for items in advance. Open the items.py file within your Scrapy project and insert the following schema:

import scrapy


class WebScraperItem(scrapy.Item):
    title = scrapy.Field()
    publish_date = scrapy.Field()
    content = scrapy.Field()
    image_urls = scrapy.Field()
    links = scrapy.Field()

This schema provides a foundation for organizing the scraped data, with specific fields for title, publish date, content, image URLs, and links.
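
One optional refinement (not required by the rest of the tutorial, but it pairs nicely with the ItemLoader used later): you can attach output processors to the fields so that single-valued fields are stored as scalars and the content fragments are joined into one string. A sketch of that variant:

import scrapy
from scrapy.loader.processors import Join, TakeFirst


class WebScraperItem(scrapy.Item):
    # TakeFirst() stores a single value instead of a one-element list;
    # Join() concatenates extracted text fragments into one string
    title = scrapy.Field(output_processor=TakeFirst())
    publish_date = scrapy.Field(output_processor=TakeFirst())
    content = scrapy.Field(output_processor=Join())
    image_urls = scrapy.Field()
    links = scrapy.Field()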

Defining the Link Extractor Pattern

To effectively navigate the TechCrunch home page and identify relevant posts, establish a pattern for extracting article links. The links follow a consistent structure, and we can define a corresponding regex in the LinkExtractor within the spider's rules. Add the following code to your spider:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TechCrunchSpider(CrawlSpider):
    # ... (previously defined attributes and methods)

    rules = (
        Rule(
            LinkExtractor(
                allow=r'\d+/\d+/\d+/.+/',
                # process_value is a module-level helper defined in the next section;
                # it rewrites each matched article URL into its REST API endpoint
                process_value=process_value
            ),
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        # Implementation of item parsing logic will be covered in the next steps
        pass

This code configures the LinkExtractor with a regex pattern matching the date-based structure of TechCrunch article links, and points its process_value hook at a helper function we'll define in the next section.
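
To sanity-check the pattern: TechCrunch article URLs follow a date-based /YYYY/MM/DD/slug/ layout, so the regex should accept article links and reject section pages. The URLs below are made up for illustration:

import re

pattern = re.compile(r'\d+/\d+/\d+/.+/')

# A date-based article URL matches...
print(bool(pattern.search('https://techcrunch.com/2019/03/01/some-article-slug/')))  # True

# ...while a section page does not
print(bool(pattern.search('https://techcrunch.com/category/startups/')))  # False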

Leveraging the REST API

Now, let's delve into the core of our web scraping logic: extracting information from the article pages using TechCrunch's REST API. TechCrunch is built on WordPress, which officially supports REST API starting from version 4.7.0 (released in 2016). The WordPress REST API covers a wide range of resources related to the content management system (CMS), making our scraping task more manageable.

Next, implement the process_value callback that the LinkExtractor uses to transform a page URL into a REST API endpoint. Because rules is evaluated when the spider class is defined, the callback is written as a plain module-level function rather than a spider method:

from urllib.parse import urlparse


def process_value(value):
    """Rewrite an article page URL into the matching WordPress REST API endpoint."""
    parsed_url = urlparse(value)
    path_segments = parsed_url.path.split('/')
    # Handle trailing slashes: the slug is the last non-empty path segment
    slug = path_segments[-2] if path_segments[-1] == '' else path_segments[-1]
    return f'https://{parsed_url.netloc}/wp-json/wp/v2/posts?slug={slug}'


class TechCrunchSpider(CrawlSpider):
    # ... (previously defined attributes and methods)

    def parse_item(self, response):
        # Implementation of item parsing logic will be covered in the next steps
        pass

This code transforms a page URL into a REST API endpoint for the corresponding post.
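
You can check the transformation in isolation. For an article URL of the shape used above (made up for illustration), the helper produces the matching wp/v2 endpoint:

print(process_value('https://techcrunch.com/2019/03/01/some-article-slug/'))
# https://techcrunch.com/wp-json/wp/v2/posts?slug=some-article-slug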

Now, run the crawler using the command scrapy crawl techcrunch. This modification enables your scraper to navigate the TechCrunch site through its REST API.

Some Side Notes...

According to statistics, approximately one-third of all websites are powered by WordPress. Acquiring the skill to scrape WordPress sites can prove immensely beneficial for your projects. Consider exploring the WordPress Posts API Reference for comprehensive details.

Not all WordPress sites are configured identically, however: adjust your API endpoints based on the specific structure of the target site. Additionally, be cautious about extensive keyword searches, as they can consume significant server resources.

For WordPress blog maintainers, consider disabling the REST API functionality if not required to prevent potential misuse and enhance site security.

Parsing the REST API Response

Now, the task is to extract data from the API response, which is in JSON format. Leverage the defined item schema and use the ItemLoader for efficient data extraction. Observe the result in action:

import json

from scrapy.loader import ItemLoader
from scrapy.selector import Selector

from web_scraper.items import WebScraperItem


class TechCrunchSpider(CrawlSpider):
    # ... (previously defined attributes and methods)

    def parse_item(self, response):
        # The API returns a JSON array containing the post that matches the slug
        api_data = json.loads(response.text)[0]

        loader = ItemLoader(item=WebScraperItem())
        loader.add_value('title', api_data['title']['rendered'])
        loader.add_value('publish_date', api_data['date_gmt'])

        # content.rendered is an HTML fragment; wrap it in a Selector so we can
        # reuse XPath to pull out the paragraph text
        content_selector = Selector(text=api_data['content']['rendered'])
        loader.add_value('content', content_selector.xpath('//p//text()').getall())

        return loader.load_item()

This code extracts data from the API response according to the defined item schema. Because the rendered content is an HTML fragment rather than a full page, it is wrapped in a Selector so that an XPath expression can pull out the paragraph text. By default the ItemLoader stores the extracted fragments as a list; attach a Join() output processor to the content field if you want them concatenated into a single string.

This approach showcases the power of parsing HTML from the REST API response over parsing the actual page, as it provides cleaner and more efficient data extraction.
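
To inspect the collected items, run the spider with Scrapy's built-in feed export (the output file name below is just an example):

    scrapy crawl techcrunch -o techcrunch_posts.json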

Bonus: Inspecting REST API in Other Sites

While WordPress is a common framework, many sites are built on different platforms. Check whether the site provides a REST API for public use. For instance, Wikipedia is built on MediaWiki, which exposes a well-documented public API.
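
As a rough sketch, assuming the public Wikimedia REST endpoint for page summaries, fetching structured data about an article can be as simple as:

import requests

# Wikimedia's REST API returns a structured summary of a page as JSON
resp = requests.get('https://en.wikipedia.org/api/rest_v1/page/summary/Web_scraping')
data = resp.json()
print(data['title'])
print(data['extract'][:120])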

Alternatively, you can inspect network activities to identify the API format. Consider Medium as an example:

  1. Visit the Medium home page using a browser with developer tools (Chrome is recommended).
  2. Open the developer tools and switch to the "Network" tab, filtering to view only "XHR" type requests.
  3. Click into any post and observe the activities in the Network tab.
  4. The first request item is often the target. Click into it and preview the response.

Right-click the request, choose "Copy as cURL", and replicate it in your own code or terminal, adjusting headers as needed.
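
A minimal sketch of replaying such a request in Python, using an entirely hypothetical endpoint and headers (substitute the URL and header values copied from DevTools):

import requests

# Hypothetical endpoint and headers - replace with the values copied from DevTools
url = 'https://example.com/api/posts/123'
headers = {
    'Accept': 'application/json',
    'User-Agent': 'Mozilla/5.0',
}

resp = requests.get(url, headers=headers)
print(resp.status_code)
print(resp.json())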

It's crucial to acknowledge that not all sites offer a REST API. In such cases, fall back to traditional scraping approaches.

Conclusion

In this tutorial, we explored the effectiveness of utilizing REST APIs for web scraping using Scrapy, focusing on the TechCrunch website as our example. Leveraging the inherent structure of data provided by REST APIs can significantly streamline the scraping process.

To further improve the reliability of your web scraping projects, you can pair Scrapy with services such as ZenRows, a commercial web scraping API that handles proxy rotation and anti-bot measures on your behalf.

When choosing web scraping methods, always consider the specific requirements of your use case and evaluate the efficacy of available tools to optimize your data extraction process.

 
 
