January 2020 – Dr.-Ing. Anton Sigitov

In this tutorial, we are going to implement a small web scraper application that takes an Amazon product URL and returns the title and the price of the product. We will implement the web scraper inside a UWP application so that we can use GUI elements for input and output. The application itself is not particularly useful; the aim is to show the basics. Later, you can extend the application so that it accepts multiple URLs, saves them, and fetches the information at regular intervals. That would allow you, for instance, to track price changes of multiple products or to update a news feed.

We will start by creating a new Visual Studio Project (for this tutorial I used Visual Studio 2019).

Start the Visual Studio IDE, select the Blank App (Universal Windows) project template, and click Next.

Configure the project by naming the project and the solution and selecting a location. Then click the Create button.

In the next window, select the Target and the Minimum Windows version for the application. Click OK and wait until the project is created.

After the project has been created, right-click on the References item in the Solution Explorer and select Manage NuGet Packages…

In the NuGet window, activate the Browse tab and search for the HtmlAgilityPack and System.Net.Http packages. Install the current versions of both packages into the project. We will need these packages to request and process web sites.

Next, we are going to implement a basic GUI for our application. The GUI lets the user enter an Amazon URL and presents the results of the web scraping by showing the title of the product and its price.

Double-click on the MainPage.xaml file in the solution explorer to open it and replace its content with the following:

<Page
    x:Class="AmazonShowPrice.MainPage"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    xmlns:local="using:AmazonShowPrice"
    xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    mc:Ignorable="d"
    Background="{ThemeResource ApplicationPageBackgroundThemeBrush}">

    <Grid Margin="10">
        <StackPanel Orientation="Vertical">
            <Grid Margin="10" Height="32">
                <Grid.ColumnDefinitions>
                    <ColumnDefinition Width="100"></ColumnDefinition>
                    <ColumnDefinition Width="*"></ColumnDefinition>
                    <ColumnDefinition Width="150"></ColumnDefinition>
                </Grid.ColumnDefinitions>

                <TextBlock Margin="10,0,0,0" VerticalAlignment="Center">Amazon URL:</TextBlock>
                <TextBox x:Name="AmazonURL" Margin="10,0,0,0" Grid.Column="1"></TextBox>
                <Button x:Name="FetchPrice" Width="120" Grid.Column="2" HorizontalAlignment="Center" Height="32" Click="FetchPrice_Click">GO!</Button>
            </Grid>

            <TextBlock x:Name="Title" FontSize="28" TextAlignment="Center" TextWrapping="WrapWholeWords"></TextBlock>
            <TextBlock x:Name="Price" FontSize="204" TextAlignment="Center"></TextBlock>
        </StackPanel>
    </Grid>
</Page>

The code above creates a simple user interface. The main components of the interface are TextBox AmazonURL, Button FetchPrice, TextBlock Title, and TextBlock Price. The TextBox AmazonURL allows the user to input an Amazon URL that the user wants to process. The Button FetchPrice will execute the web scraping routine. The TextBlocks Title and Price will be used for displaying the results.

Finally, we will implement the web scraping routine itself.

Double-click on the MainPage.xaml.cs file to open it and replace its content with the following code:

using Windows.UI.Xaml;
using Windows.UI.Xaml.Controls;
using System.Net.Http;
using HtmlAgilityPack;
using System.Web;

namespace AmazonShowPrice
{
    public sealed partial class MainPage : Page
    {
        const string titleNode = "//span[@id='productTitle']";
        const string priceNodeV1 = "//span[@class='a-size-medium a-color-price offer-price a-text-normal']";
        const string priceNodeV2 = "//span[@id='priceblock_ourprice']";
        const string error = "Error";

        public MainPage()
        {
            this.InitializeComponent();
        }

        private async void FetchPrice_Click(object sender, RoutedEventArgs e)
        {
            using (HttpClient httpClient = new HttpClient())
            {
                HttpResponseMessage response = await httpClient.GetAsync(AmazonURL.Text);
                HttpContent content = response.Content;

                HtmlDocument document = new HtmlDocument();
                document.LoadHtml(await content.ReadAsStringAsync());

                // Get Title
                var title = document.DocumentNode.SelectSingleNode(titleNode);

                // Get Price (try the first page layout, then fall back to the second)
                var price = document.DocumentNode.SelectSingleNode(priceNodeV1);

                if (price == null)
                    price = document.DocumentNode.SelectSingleNode(priceNodeV2);

                // Show Title and Price
                if (title != null)
                    Title.Text = HttpUtility.HtmlDecode(title.InnerText).Trim();
                else
                    Title.Text = error;

                if (price != null)
                    Price.Text = price.InnerText;
                else
                    Price.Text = error;
            }
        }
    }
}

As you can see, we first define some string constants. The constants are XPath expressions that describe the HTML nodes containing the information we are looking for (title and price). For instance, the span element whose id equals productTitle contains the title of the product. This kind of information can easily be found by examining the source code of a web page with the built-in developer tools of any browser. Unfortunately, Amazon product pages are not consistent, so different pages might contain the required information in different HTML nodes. That is the reason why we have two constants for the price.
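If more page layouts turn up, the chain of null checks can be generalized. The following is only a minimal sketch under that assumption: the class XPathFallbackSketch and the method SelectFirstMatch are hypothetical names introduced for illustration and are not part of HtmlAgilityPack. The sketch relies only on SelectSingleNode, which the tutorial code already uses.

```csharp
using HtmlAgilityPack;

// Hypothetical helper (not part of HtmlAgilityPack or the tutorial code).
static class XPathFallbackSketch
{
    // Tries each XPath expression in order and returns the first node found,
    // or null if none of the expressions matches anything in the document.
    public static HtmlNode SelectFirstMatch(HtmlDocument document, params string[] xpaths)
    {
        foreach (string xpath in xpaths)
        {
            HtmlNode node = document.DocumentNode.SelectSingleNode(xpath);
            if (node != null)
                return node;
        }

        return null;
    }
}
```

With such a helper, the price lookup could be written as SelectFirstMatch(document, priceNodeV1, priceNodeV2), and supporting a third layout would only require one more constant.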

In addition to the string constants, we have the async method FetchPrice_Click, which is called when the user clicks the GO! button in our GUI. This method contains the scraping routine.

The routine first creates an instance of the HttpClient class. Next, we use that client and the URL provided by the user in the TextBox AmazonURL to request the content of the web page. Subsequently, we load the content into an instance of the HtmlDocument class. In doing so, the content, which is represented as a string, is parsed into an easy-to-handle hierarchical structure that allows us to access individual HTML nodes.
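The routine above assumes that the user enters a valid URL and that the request succeeds. As a hedged sketch of how the download step could be made more defensive, the hypothetical helper below (FetchSketch and TryFetchHtmlAsync are names made up for this example) validates the URL with Uri.TryCreate and returns null instead of throwing when the request fails. Treating every failure as null is a simplification chosen for illustration.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper, not part of the tutorial code.
static class FetchSketch
{
    // Returns the page source as a string, or null if the URL is not a valid
    // absolute http/https URL or the request does not succeed.
    public static async Task<string> TryFetchHtmlAsync(HttpClient client, string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return null;

        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return null;

        try
        {
            using (HttpResponseMessage response = await client.GetAsync(uri))
            {
                if (!response.IsSuccessStatusCode)
                    return null;

                return await response.Content.ReadAsStringAsync();
            }
        }
        catch (HttpRequestException)
        {
            // Network-level failure (DNS, connection refused, ...).
            return null;
        }
        catch (TaskCanceledException)
        {
            // The request timed out.
            return null;
        }
    }
}
```

In FetchPrice_Click, the result could be checked for null before calling LoadHtml, and the error text could be shown in the Title TextBlock when the download fails.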

Next, we use the method SelectSingleNode to find the HTML nodes that match the descriptions we defined in the constants. Note that if multiple elements in the structure fit the description, the first one is returned. If no element fits, the method returns null, which is why we check the results before using them.
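The behavior of SelectSingleNode can be tried out without any network access by parsing a small HTML string. The snippet below is a sketch for experimentation only; the class SelectSingleNodeSketch, the method FirstPriceText, and the markup with the class name price are invented for this example.

```csharp
using HtmlAgilityPack;

// Hypothetical demo class for experimenting with SelectSingleNode.
static class SelectSingleNodeSketch
{
    // Parses the given HTML string and returns the text of the first
    // <span class='price'> element, or null if there is no such element.
    public static string FirstPriceText(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode node = document.DocumentNode.SelectSingleNode("//span[@class='price']");
        return node != null ? node.InnerText : null;
    }
}
```

For the input <div><span class='price'>10</span><span class='price'>20</span></div>, the method returns 10, because only the first matching span is selected. For an input without a matching span, it returns null.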

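Finally, we assign the content of the found nodes to the Text properties of the Title and Price TextBlocks. The title is HTML-decoded and trimmed first. If a node was not found, the corresponding TextBlock shows the text Error instead.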
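The text of a scraped node often carries surrounding whitespace and HTML entities, and the price could be cleaned up in the same way as the title. The following sketch shows one possible way to do that. TextSketch and CleanText are hypothetical names, and the sketch uses System.Net.WebUtility.HtmlDecode instead of the HttpUtility class from the tutorial code, purely as an alternative that does not need the System.Web namespace.

```csharp
using System.Net;

// Hypothetical helper, not part of the tutorial code.
static class TextSketch
{
    // Decodes HTML entities such as &amp; and removes leading and
    // trailing whitespace. Returns an empty string for null input.
    public static string CleanText(string rawInnerText)
    {
        if (rawInnerText == null)
            return string.Empty;

        return WebUtility.HtmlDecode(rawInnerText).Trim();
    }
}
```

The price assignment could then read Price.Text = TextSketch.CleanText(price.InnerText);.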

That is all! The working application is shown in the screenshot below.

How to Implement a Web Scraper Using C# and UWP
