Web Scraping with C#

Posted on Aug 02, 2022

Web scraping is the process of extracting data published on the Internet in the form of HTML pages (on a website) and transferring it to some kind of storage, be it a text file or a database.

The advantage of the C# programming language in web scraping is that it lets you embed a browser directly into Windows Forms using the WebBrowser control. The scraped data can be saved to any output file or displayed on the screen.

Web Scraping Fundamentals in ASP.NET Using C#

For the C# development environment, you can use Visual Studio or Visual Studio Code. The choice depends on the development goals and PC capabilities.

Visual Studio is an environment for full-fledged development of desktop, mobile and server applications with pre-built templates and the ability to graphically edit the program being developed.

Visual Studio Code, by contrast, is a lightweight editor onto which you install the packages you need. It takes up much less disk space and CPU time.

If you select Visual Studio Code, you also have to install the .NET Framework and .NET Core 3.1. To make sure that all components are installed correctly, enter the following command at the command line:

dotnet --version

If everything works correctly, the command should return the version of .NET installed.

Check Core

Tools for Web Scraping

To use web scraping in C# most effectively, it is worth using additional libraries. The most widely used are PhantomJS with Selenium, Html Agility Pack with ScrapySharp, and Puppeteer Sharp.

Tired of getting blocked while scraping the web?

Try out Web Scraping API with proxy rotation, CAPTCHA bypass, and Javascript rendering.

  • 1,000 Free API Credits
  • No Credit Card Required
  • 30-Day Trial
Try now for free

Get structured data in the format you need!

We offer customized web scraping solutions that can provide any data you need, on time and with no hassle!

  • Regular, custom data delivery
  • Pay after you receive sample dataset
  • A range of output formats
Get a Quote

Getting Data with Selenium & PhantomJS in C#

In order to use PhantomJS, it must be installed. The easiest way to do this is via the NuGet Package Manager: in Visual Studio, right-click the "References" node in the project, open the NuGet Package Manager, and type "PhantomJS" in the search bar.

What is the Selenium Library

However, when people talk about PhantomJS, they usually mean PhantomJS together with Selenium. To use PhantomJS with Selenium, the following packages need to be installed additionally:

  1. Selenium.WebDriver.

  2. Selenium Support.

This can be done using NuGet. To use NuGet in Visual Studio Code, just install a NuGet package manager extension with a GUI:

NuGet with GUI

Or without the GUI:

NuGet without GUI
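Alternatively, assuming the dotnet CLI is available, the same packages can be added to a project from the command line:

```shell
dotnet add package Selenium.WebDriver
dotnet add package Selenium.Support
dotnet add package PhantomJS
```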

So, to get all titles on the page just use:

using (var driver = new PhantomJSDriver())
{
    driver.Navigate().GoToUrl("http://example.com/");
    var titles = driver.FindElements(By.ClassName("title"));
    foreach (var title in titles)
    {
        Console.WriteLine(title.Text);
    }
}

This simple code finds all elements with the class title and prints the text they contain.

Selenium offers many functions for finding the required element. An element can be located by XPath, CSS selector, or HTML tag. For example, to find an element such as an input field by its XPath and type a value into it, just use:

driver.FindElement(By.XPath(@".//input")).SendKeys("c#");

In order to click on any element, for example, the confirm button, one can use the following code:

driver.FindElement(By.XPath(@".//input[@id='searchsubmit']")).Click();
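Putting these pieces together, a complete search-and-scrape session might look like the sketch below. The searchsubmit id and the title class come from the snippets above; the h2.title a selector is an assumption about the target page's markup.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

class Scraper
{
    static void Main()
    {
        using (var driver = new PhantomJSDriver())
        {
            driver.Navigate().GoToUrl("http://example.com/");

            // Type a query into the search field and submit it
            driver.FindElement(By.XPath(@".//input")).SendKeys("c#");
            driver.FindElement(By.XPath(@".//input[@id='searchsubmit']")).Click();

            // Read the result titles, this time via a CSS selector
            foreach (var element in driver.FindElements(By.CssSelector("h2.title a")))
                Console.WriteLine(element.Text);
        }
    }
}
```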

What is the Selenium Library for

Selenium is a cross-platform library that works with most programming languages, has complete and well-written documentation, and an active community.

Using Selenium with PhantomJS is a good solution that lets you handle a wide range of scraping tasks, including scraping dynamic pages. However, it is rather resource-intensive, and PhantomJS itself is no longer actively maintained, so headless Chrome is often used in its place today.

HtmlAgilityPack for Quick Start

If the site has no protection against bots and serves all the necessary content immediately, you can go with a simpler solution: the Html Agility Pack library. This is one of the most popular scraping libraries for C#, and it is also installed via a NuGet package.

What is the Html Agility Pack

This library builds a DOM tree from HTML. The catch is that you have to download the page's code yourself. To load the page, just use the following code:

using (WebClient client = new WebClient())
{
    string html = client.DownloadString("http://example.com");
    // Do something with html then
}

After that, load the HTML into an HtmlDocument and search for elements, for example, with XPath:

HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
HtmlNodeCollection links = document.DocumentNode.SelectNodes(".//h2/a");
foreach (HtmlNode link in links)
    Console.WriteLine("{0} - {1}", link.InnerText, link.GetAttributeValue("href", ""));
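Since scraping usually ends with saving the data to storage, a complete Html Agility Pack sketch might download a page and write every link to a semicolon-separated text file. The output filename and separator here are arbitrary choices; note that SelectNodes returns null when nothing matches, so the result is checked before iterating.

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

class LinkSaver
{
    static void Main()
    {
        string html;
        using (WebClient client = new WebClient())
            html = client.DownloadString("http://example.com");

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty list) if no node matches
        HtmlNodeCollection links = document.DocumentNode.SelectNodes(".//a[@href]");
        if (links == null)
            return;

        using (StreamWriter writer = new StreamWriter("links.csv"))
        {
            foreach (HtmlNode link in links)
                writer.WriteLine("{0};{1}",
                    link.InnerText.Trim(),
                    link.GetAttributeValue("href", ""));
        }
    }
}
```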

What is the Html Agility Pack for

The Html Agility Pack is an easier option to start with and is well suited for beginners. It also has its own website with usage examples. It is simpler than Selenium and lacks some of its features, but it works well for projects that are not too complex.

Combined with the WebBrowser control, Html Agility Pack lets you embed a browser in a Windows Forms application and create a complete desktop scraper.


C# Web Scraping with ScrapySharp

To include a package in Visual Studio, right-click on the "References" tab in the project and type "ScrapySharp" in the search bar.

To add this package in Visual Studio Code write in command line:

dotnet add package ScrapySharp

What is the ScrapySharp Library 

ScrapySharp is an open-source web scraping library for the C# programming language, available as a NuGet package. It is an Html Agility Pack extension that adds CSS selector support and a ScrapingBrowser class that simulates a real browser, handling cookies, referrers, and forms.

To get HTML document using ScrapySharp one can use the following code:

static ScrapingBrowser browser = new ScrapingBrowser();

static HtmlNode GetHtml(string url)
{
    WebPage webPage = browser.NavigateToPage(new Uri(url));
    return webPage.Html;
}
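ScrapySharp's CSS selector support lives in the ScrapySharp.Extensions namespace, whose CssSelect extension method can be chained onto any HtmlNode. A small sketch, assuming the target page lists its headings as h2 a links:

```csharp
using System;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

class CssExample
{
    static ScrapingBrowser browser = new ScrapingBrowser();

    static void Main()
    {
        WebPage page = browser.NavigateToPage(new Uri("http://example.com/"));

        // CssSelect takes a CSS selector instead of an XPath expression
        foreach (HtmlNode node in page.Html.CssSelect("h2 a"))
            Console.WriteLine(node.InnerText);
    }
}
```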

What is the ScrapySharp Library for

ScrapySharp is not as resource-intensive as Selenium while still simulating enough browser behavior for many everyday tasks. For more complex tasks, especially pages that rely heavily on JavaScript, Selenium is the better choice.

Puppeteer for Headless Scraping

Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium and to interact with the DevTools Protocol. It also has a wrapper for use in C#, Puppeteer Sharp, which is available as a NuGet package.

What is the Puppeteer Library 

Puppeteer provides the ability to work with headless browsers and integrates into most applications.

Puppeteer has well-written documentation and usage examples on the official website. For example, the simplest application is:

using var browserFetcher = new BrowserFetcher();
await browserFetcher.DownloadAsync(BrowserFetcher.DefaultChromiumRevision);
var browser = await Puppeteer.LaunchAsync(new LaunchOptions
{
    Headless = true
});
var page = await browser.NewPageAsync();
await page.GoToAsync("http://www.google.com");
// outputFile is the path the screenshot will be saved to
var outputFile = "screenshot.png";
await page.ScreenshotAsync(outputFile);

What is the Puppeteer Library for

Running Puppeteer in headless mode reduces the CPU time and RAM consumed.
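Beyond screenshots, the rendered page itself can be read back once navigation completes. The sketch below uses Puppeteer Sharp's GetContentAsync and EvaluateExpressionAsync to pull the full HTML and a single JavaScript value out of the page:

```csharp
using System;
using System.Threading.Tasks;
using PuppeteerSharp;

class Program
{
    static async Task Main()
    {
        using var browserFetcher = new BrowserFetcher();
        await browserFetcher.DownloadAsync(BrowserFetcher.DefaultChromiumRevision);

        var browser = await Puppeteer.LaunchAsync(new LaunchOptions { Headless = true });
        var page = await browser.NewPageAsync();
        await page.GoToAsync("http://example.com/");

        // Full HTML of the page after JavaScript has run
        string html = await page.GetContentAsync();

        // Run a JavaScript expression in the page and bring the result back to C#
        string title = await page.EvaluateExpressionAsync<string>("document.title");
        Console.WriteLine(title);

        await browser.CloseAsync();
    }
}
```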

C# Proxy for Web Scraping

Some sites have checks and traps to detect bots that will prevent the scraper from collecting a lot of data. However, there are also workarounds.

For example, you can use headless browsers to mimic the actions of a real user, increase the delay between iterations, or use a proxy.
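Increasing the delay between iterations can be as simple as the sketch below; the interval bounds are arbitrary, and a randomized pause looks more human than a fixed one:

```csharp
using System;
using System.Threading;

class PoliteScraper
{
    static readonly Random rng = new Random();

    // Sleep for a random interval so requests do not arrive at a fixed, bot-like rate
    static void PoliteDelay(int minMs, int maxMs)
    {
        Thread.Sleep(rng.Next(minMs, maxMs));
    }

    static void Main()
    {
        foreach (var url in new[] { "http://example.com/1", "http://example.com/2" })
        {
            // ... download and parse url here ...
            PoliteDelay(1000, 3000); // pause 1-3 seconds between pages
        }
    }
}
```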

An example of using a proxy (here http://IP:Port stands for a real proxy address):

public static void ProxyConnect()
{
    WebProxy proxy = new WebProxy();
    proxy.Address = new Uri("http://IP:Port");
    HttpWebRequest req = (HttpWebRequest)WebRequest.Create("https://example.com/");
    req.Proxy = proxy;
}

A proxy server is a must-have for any C# web scraper. There are many options, such as SmartProxy services, Luminati Network, and Blazing SEO. Free proxies are not always suitable for such purposes: they are often slow and unreliable. You can also build your own proxy network on a server, for example with Scrapoxy, an open-source API.

If you use the same proxy for too long, you risk getting the IP banned or blacklisted. To avoid blocking, you can use rotating residential proxies. By choosing a specific location to scrape from and constantly changing the IP address, the scraper becomes virtually immune to IP blocking.
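A minimal round-robin rotation over a list of proxies might look like the sketch below. The proxy addresses are placeholders; a real rotating residential proxy service usually handles the rotation server-side behind a single endpoint.

```csharp
using System.Collections.Generic;
using System.Net;

class ProxyRotator
{
    // Placeholder addresses; replace with real proxies from a provider
    static readonly List<string> proxies = new List<string>
    {
        "http://proxy1:8080",
        "http://proxy2:8080",
        "http://proxy3:8080"
    };
    static int next = 0;

    // Returns a request configured with the next proxy in round-robin order
    static HttpWebRequest CreateRequest(string url)
    {
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.Proxy = new WebProxy(proxies[next]);
        next = (next + 1) % proxies.Count;
        return req;
    }
}
```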

An alternative solution is to use our API within your scraper, which lets you collect the requested data and effectively avoid blocking without solving any CAPTCHAs.

Conclusion and Takeaways

C# is a good option when it comes to creating a desktop scraper. There are fewer libraries for it than for NodeJS or Python, but they are no worse in terms of functionality. Moreover, if a highly reliable scraper is required, C# provides more options for implementation.

The libraries covered here make it possible to build fairly complex projects that request data, parse multiple pages, and extract the results. Of course, not all libraries were listed in this article, only the most functional ones with good documentation. Besides those listed, there are other third-party libraries, not all of which have a NuGet package.

Valentina Skakun

I'm a technical writer who believes that data parsing can help in getting and analyzing data. I'll tell you what parsing is and how to use it.