Web Scraping and Searching It with Elasticsearch

Web scraping is probably the first project for a lot of developers and data scientists. It lets you practice the language you have just learned, and it generates data for later projects.

There are tons of tutorials out there on scraping with Python and other languages. However, when you are scraping a lot of pages, performance can become an issue. In this article, we will:

  • Scrape the IMDB top 250 movies
  • Use 3 different languages: Python, JavaScript and Go
  • Scrape static web pages only. Dynamic web pages typically need a browser automation tool such as Selenium, which is not covered in this article
  • Store the scraped information in Elasticsearch. (This is also a typical use case: when you find useful information on a site that provides no search function, you can build your own)
  • Build a simple front-end web page with Vue.js to demo the search

1. Web scraping using Python, JavaScript and Go

The scraping task is pretty simple. First, go to https://www.imdb.com/chart/top/ and get the list of URLs for the top 250 movies.

IMDB top 250 movies page, screen cap from IMDB.com
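To make the first step concrete, here is a minimal Python sketch that pulls movie URLs out of the chart page's HTML. It is an illustration, not the code from the linked repository. It assumes movie links on the chart page have the form /title/tt<digits>/, and the function name extract_movie_urls and the sample HTML are made up for this example. Downloading the page itself is left out so the snippet runs offline.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"

# Assumption: movie links on the chart page look like /title/tt0111161/
TITLE_LINK = re.compile(r'href="(/title/tt\d+/)')


def extract_movie_urls(chart_html):
    """Return absolute movie URLs in page order, without duplicates."""
    seen = set()
    urls = []
    for path in TITLE_LINK.findall(chart_html):
        if path not in seen:  # the same movie is often linked more than once
            seen.add(path)
            urls.append(urljoin(BASE_URL, path))
    return urls


# Hand-written sample standing in for the downloaded chart page
sample = '''
<a href="/title/tt0111161/?ref_=chttp_t_1">The Shawshank Redemption</a>
<a href="/title/tt0111161/?ref_=chttp_i_1"><img src="poster.jpg"></a>
<a href="/title/tt0068646/?ref_=chttp_t_2">The Godfather</a>
'''

for url in extract_movie_urls(sample):
    print(url)
```

A regular expression is enough here because only the link targets are needed. For anything more structured, an HTML parser is usually the safer choice.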

Second, for each movie URL, get the information you want to scrape. In this demo, we will just get the title, summary, director, country, actors, genre, date, src (poster image) and url.

The Shawshank Redemption page screen cap from IMDB.com
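One possible way to collect these fields is sketched below. It assumes the movie page embeds a schema.org JSON-LD block (a script tag of type application/ld+json) with keys such as name, description, director, actor, genre, datePublished and image. That is an assumption about the page, not something taken from the linked repository, which may well use CSS selectors instead. The helper names and the sample page are hypothetical, and country is left out because this sketch does not assume it appears in that block.

```python
import json
import re

# Assumption: the page carries its metadata in a JSON-LD script tag
LD_JSON = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def names(value):
    """Turn a person field (one dict or a list of dicts) into a list of names."""
    if value is None:
        return []
    items = value if isinstance(value, list) else [value]
    return [i["name"] for i in items if isinstance(i, dict) and "name" in i]


def parse_movie(page_html, url):
    """Build one movie record from the page HTML; raises if no metadata is found."""
    match = LD_JSON.search(page_html)
    if match is None:
        raise ValueError("no JSON-LD block found")
    data = json.loads(match.group(1))
    genre = data.get("genre", [])
    return {
        "title": data.get("name"),
        "summary": data.get("description"),
        "director": names(data.get("director")),
        "actors": names(data.get("actor")),
        "genre": genre if isinstance(genre, list) else [genre],
        "date": data.get("datePublished"),
        "src": data.get("image"),
        "url": url,
    }


# Made-up page used only to exercise the parser
sample_page = '''<html><head><script type="application/ld+json">
{"@type": "Movie", "name": "Example Movie",
 "description": "A made-up summary.",
 "director": {"@type": "Person", "name": "Jane Doe"},
 "actor": [{"@type": "Person", "name": "Actor One"},
           {"@type": "Person", "name": "Actor Two"}],
 "genre": "Drama", "datePublished": "2000-01-01",
 "image": "https://example.com/poster.jpg"}
</script></head></html>'''

movie = parse_movie(sample_page, "https://example.com/title/tt0000000/")
print(json.dumps(movie, indent=2))
```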

For the full implementation, please visit https://github.com/datalearningpr/WebScraping-ElasticSearch

Here is how the 3 languages compare:

As we can see, Go's concurrency advantage gives the best result. Although Python is the go-to language for web scraping, if performance is a concern you may want to give Go or even JavaScript a try.

To be fair, a lot of work on concurrency has been done in Python as well. For instance, there is an asynchronous HTTP client library for Python called aiohttp. Using it instead of requests is likely to improve performance when fetching many pages. However, support for asynchronous libraries in Python is not yet great.
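The idea behind the asynchronous approach can be sketched with the standard library alone. The snippet below runs many fetches concurrently while capping how many are in flight at once. The names fetch_all and fake_fetch are hypothetical, and fake_fetch only sleeps to simulate network latency. In a real scraper, the fetch argument would be a coroutine that wraps an asynchronous HTTP client such as aiohttp.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=10):
    """Call fetch(url) for every URL with at most max_concurrency running at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    """Stand-in for a real HTTP request: waits briefly, then returns fake HTML."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


urls = [f"https://example.com/{i}" for i in range(5)]
pages = asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=2))
print(len(pages))
```

The cap matters for scraping: without it, 250 requests would start at the same moment, which is unfriendly to the target site and may get the scraper blocked.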

2. Store the scraped results in Elasticsearch

The scripts already store the results in the newline-delimited JSON format that the Elasticsearch bulk API expects. You can start up an Elasticsearch instance very easily using Docker.

# get the Elasticsearch image
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.6.1
# start the Elasticsearch container
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.6.1
# check that the container is running
curl -X GET "localhost:9200/_cat/nodes?v&pretty"
# bulk index; movies.json can be generated by any of the scripts (Go, JavaScript, Python)
curl -X POST "localhost:9200/movie/_bulk" -H 'Content-Type: application/json' --data-binary @movies.json
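For readers who have not seen the bulk format before, the sketch below shows roughly what a file like movies.json contains. Each document takes two lines: an action line, then the document itself, and the body must end with a newline. The function name to_bulk_ndjson and the choice of sequential _id values are assumptions for illustration, so the scripts in the repository may build the file differently. No index name appears in the action line because the request path (movie/_bulk) already supplies it.

```python
import json


def to_bulk_ndjson(movies):
    """Serialize movie dicts into the body of an Elasticsearch bulk index request."""
    lines = []
    for doc_id, movie in enumerate(movies, start=1):
        # Action line: index the next line as a document with this _id
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        # Source line: the document itself, on a single line
        lines.append(json.dumps(movie, ensure_ascii=False))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"


# Made-up records with a subset of the scraped fields
movies = [
    {"title": "Example Movie", "genre": ["Drama"]},
    {"title": "Another Movie", "genre": ["Crime", "Drama"]},
]
print(to_bulk_ndjson(movies), end="")
```

Writing the returned string to movies.json with UTF-8 encoding would give a file that the curl command above can send with --data-binary.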

3. Visualize our search function

Now that we have scraped some data and stored it in a back end, we can make a front-end web page to visualize our work.

We will use Vue.js to create a very simple single-page web app to demo the search function. For the full implementation, please visit https://github.com/datalearningpr/WebScraping-ElasticSearch

Let us run a search for, say, “brad pitt”.

Nice, it works!!!
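Behind a search box like this, the page has to send a query to Elasticsearch. The sketch below shows one plausible shape for that request, written in Python rather than Vue.js to stay consistent with the other examples. It assumes the index is named movie (as in the bulk command) and that documents use the field names listed earlier. The multi_match query and the helper names build_query and search are illustrative choices, not necessarily what the demo page does.

```python
import json
import urllib.request

SEARCH_URL = "http://localhost:9200/movie/_search"
# Assumption: these text fields exist in the indexed documents
FIELDS = ["title", "summary", "director", "actors", "genre"]


def build_query(term, size=10):
    """Build a request body that matches the term against several fields."""
    return {
        "size": size,
        "query": {"multi_match": {"query": term, "fields": FIELDS}},
    }


def search(term):
    """Send the query to a local Elasticsearch and return the matching documents."""
    body = json.dumps(build_query(term)).encode("utf-8")
    request = urllib.request.Request(
        SEARCH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return [hit["_source"] for hit in result["hits"]["hits"]]


# Only the query body is printed here; search() needs a running instance
print(json.dumps(build_query("brad pitt"), indent=2))
```

With the container from section 2 running and the data indexed, calling search("brad pitt") should return the stored movie records whose fields match the term.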

I hope you find this article useful. Thank you all.

datalearningpr
