Web Scraping and “Elastic Search” it
Web scraping is probably the FIRST project of a lot of developers and data scientists. It lets you practice the language you have just learned, and it generates “data” for later projects.
There are tons of tutorials out there teaching how to scrape with Python and some other languages. However, when you are scraping a lot of pages, you may run into performance issues. In this article, we will:
- Scrape the IMDB top 250 movies
- Stick to static web page scraping. Dynamic web pages need a browser automation tool such as Selenium, which is not covered in this article
- Store the scraped information in Elasticsearch. (This is also a typical use case: when you find useful information on a site that provides no search function, you can create your own)
- Build a simple front-end web page with Vue.js to visualize the search demo
1. Scrape the IMDB top 250 movies
The scraping task is pretty simple. First, go to https://www.imdb.com/chart/top/ and get the URL list of the top 250 movies.
Second, for each movie URL, scrape the information you want. In this demo, we will just get the title, summary, director, country, actors, genre, date, src (poster image) and url.
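The first step can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code from the repository. The names `MovieLinkParser` and `extract_movie_urls` are hypothetical, and the sketch assumes that movie links on the chart page use the `/title/tt<digits>/` path pattern. Downloading the page is left out, so the function can be tried on any HTML string.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.imdb.com"
# Assumption: movie pages live under /title/tt<digits>/
TITLE_PATH = re.compile(r"^/title/tt\d+/")


class MovieLinkParser(HTMLParser):
    """Collects unique movie URLs from <a href="..."> tags, in page order."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = TITLE_PATH.match(href)
        # Keep only the /title/tt.../ part so tracking parameters
        # do not produce duplicate entries for the same movie.
        if match and match.group(0) not in self._seen:
            self._seen.add(match.group(0))
            self.urls.append(urljoin(BASE_URL, match.group(0)))


def extract_movie_urls(html):
    """Return the absolute movie URLs found in an HTML string."""
    parser = MovieLinkParser()
    parser.feed(html)
    return parser.urls
```

Each URL returned here would then be fetched and parsed for the fields listed above. The selectors for those fields depend on the current IMDB page layout, so they are not guessed at here.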
For detailed code implementation, please visit https://github.com/datalearningpr/WebScraping-ElasticSearch
In terms of these 3 different languages, here is the comparison:
To be fair, a lot of work on concurrency has been done in Python as well. For instance, Python has an asynchronous HTTP client library called aiohttp, and using it instead of requests should boost performance considerably, because the scraper no longer waits for each page before requesting the next one. However, support for asynchronous libraries is not yet great in Python.
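To make the concurrency point concrete, here is a small sketch of fetching many pages in parallel. Note that it uses a thread pool from the standard library rather than aiohttp, so it illustrates the general idea of overlapping network waits, not the asynchronous approach itself. The names `fetch_page` and `fetch_all`, the User-Agent value, and the worker count are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def fetch_page(url, timeout=10):
    """Download one page and return its body as text."""
    # Assumption: a browser-like User-Agent; some sites reject the default one.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_page, max_workers=8):
    """Fetch all URLs concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # While one thread waits on the network, the others keep working.
        return list(pool.map(fetch, urls))
```

Passing the `fetch` function as a parameter makes it easy to swap in a different downloader, or a stub when trying the code without network access. Keep `max_workers` modest so the target site is not flooded with requests.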
2. Store the scraped results in Elasticsearch
The scripts already store the results in the special JSON format that the Elasticsearch bulk insert expects. You can start up an Elasticsearch instance very easily using Docker.
# get the Elasticsearch image
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.6.1
# start the Elasticsearch container
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.6.1
# check the container is running ok
curl "localhost:9200"
# bulk insert the scraped movies into the "movie" index
curl -X POST "localhost:9200/movie/_bulk" -H 'Content-Type: application/json' --data-binary @movies.json
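For readers wondering what the bulk file looks like, the bulk API takes newline-delimited JSON: one action line followed by one document line per record, with a trailing newline at the end. The sketch below shows one way such a file could be produced. The function name `to_bulk_ndjson` and the sequential `_id` values are assumptions, not necessarily what the repository scripts do.

```python
import json


def to_bulk_ndjson(movies):
    """Turn a list of movie dicts into an Elasticsearch bulk request body."""
    lines = []
    for doc_id, movie in enumerate(movies, start=1):
        # Action line: index this document. The index name is omitted
        # because it is given in the URL (/movie/_bulk).
        # Assumption: documents are simply numbered 1, 2, 3, ...
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(movie, ensure_ascii=False))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"
```

Writing the returned string to `movies.json` with UTF-8 encoding would give a file suitable for the `--data-binary @movies.json` upload shown above.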
3. Visualize our search function
Now that we have scraped some data and stored it in a back end, we can make a front-end web page to visualize our “work”.
We will use Vue.js to create a very simple single-page web app to demo the search function. For the detailed code implementation, please visit https://github.com/datalearningpr/WebScraping-ElasticSearch
Let us run a search for, say, “brad pitt”.
Nice, it works!!!
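If you want to try the search without the front end, a request along the following lines could be sent straight to Elasticsearch. This is only a sketch: the front end in the repository may build its query differently, the function names are hypothetical, and the field names are assumed to match the scraped fields listed earlier. It uses the `multi_match` query, which looks for the search text in several fields at once.

```python
import json
from urllib.request import Request, urlopen

# Assumption: the indexed documents use these field names.
SEARCH_FIELDS = ["title", "summary", "director", "actors", "genre", "country"]


def build_search_body(text, fields=SEARCH_FIELDS, size=10):
    """Build the JSON body for a multi-field full-text search."""
    return {
        "size": size,
        "query": {"multi_match": {"query": text, "fields": list(fields)}},
    }


def search_movies(text, base_url="http://localhost:9200", index="movie"):
    """Run the search and return the matching movie documents."""
    body = json.dumps(build_search_body(text)).encode("utf-8")
    request = Request(
        f"{base_url}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(request, timeout=10) as response:
        hits = json.load(response)["hits"]["hits"]
    # Each hit wraps the original document under "_source".
    return [hit["_source"] for hit in hits]
```

With the Docker container from section 2 running and the data loaded, `search_movies("brad pitt")` would return the stored movie records that match the query.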
4. Conclusion
- Python web scraping is indeed easy to use, but performance issues should be considered
- Elasticsearch is a great NoSQL database for search solutions. If you find data you are interested in on a website that has no search function, web scraping + Elasticsearch can solve the problem
- Visualization with a front end is fun
- You can find all the code and scripts at: https://github.com/datalearningpr/WebScraping-ElasticSearch
I hope you find this article useful. Thank you all.