How to Scrape Metadata from a Website Using Python
In simple terms, metadata is data that describes other data.
Your website metadata consists of a page title and meta description for every page. These provide search engines like Google with important information about the content and purpose of each individual page on your website, and help them determine whether your website is relevant enough to display in search results.
The page title is used to tell search engines exactly what that page on your website is about. This is the clickable headline that is displayed in search results, and needs to concisely and accurately summarise the content found on the page.
Page titles are important as search engines use them to establish what information the website contains, so they can directly influence your website’s ranking in search results.
As well as being shown in search results, your page title will appear in the browser tab and also on external websites such as social networks when a link to your website is shared.
The meta description is displayed below the page title in search results, and is there to provide more descriptive information about the content on the website page.
Meta descriptions are not a Google ranking factor, so they will not directly affect your website’s position in search results. However, they are still a key part of SEO: when written effectively, they can encourage more people to click through to your website from search results.
As well as displaying in search results, your meta description will be shown alongside your page title when your website page is shared on an external website such as Facebook or Twitter.
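To make the two fields concrete, here is a minimal sketch of where the page title and meta description live in a page's HTML and how they can be read with Python's standard library alone. The class name HeadMetadataParser, the function extract_head_metadata and the sample HTML are all made up for illustration; this is not how metadata_parser works internally.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and <meta> name/property -> content pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Standard tags use name="...", Open Graph tags use property="..."
            key = attrs.get("name") or attrs.get("property")
            content = attrs.get("content")
            if key and content is not None:
                self.meta[key.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text can arrive in several chunks, so accumulate it
        if self._in_title:
            self.title += data


def extract_head_metadata(html):
    """Return the page title and a dict of meta tags found in `html`."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}


# Hypothetical sample page, written by hand for this example
SAMPLE_HTML = """
<html><head>
<title>Paris Hotels | Example Travel</title>
<meta name="description" content="Compare hotel prices in Paris.">
<meta property="og:title" content="Paris Hotels">
</head><body><p>Hello</p></body></html>
"""

result = extract_head_metadata(SAMPLE_HTML)
print(result["title"])
print(result["meta"].get("description"))
```

The og:title entry is the kind of tag social networks typically read when a link is shared, which is why it is collected alongside the standard description tag.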
To scrape metadata we will use a package called metadata_parser. This package is designed specifically for this task: it fetches a page and parses the metadata tags that the page exposes. You can read more here: https://pypi.org/project/metadata-parser/0.4.13/.
We will scrape the metadata from one page on IMDb and one page on TripAdvisor:
https://www.imdb.com/list/ls053501318/
https://www.tripadvisor.in/Hotels-g187147-Paris_Ile_de_France-Hotels.html
See complete code below:
#Install metadata_parser (run from the command line)
pip install metadata_parser
#Get metadata:
import metadata_parser
url = 'https://www.imdb.com/list/ls053501318/'
#Fetch the page and parse its metadata
page = metadata_parser.MetadataParser(url)
print(page.metadata)
Output:
Get metadata for TripAdvisor:
page = metadata_parser.MetadataParser('https://www.tripadvisor.in/Hotels-g187147-Paris_Ile_de_France-Hotels.html')
print(page.metadata)
Output:
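Once the metadata has been printed, you will usually want a single field, such as the title, rather than the whole dictionary. The sketch below assumes that page.metadata is a dictionary grouped by source, with keys such as 'og', 'twitter', 'meta', 'page' and 'dc'; check this against your own printed output, as the layout may differ between library versions. The helper first_value and the sample dictionary are hypothetical and written by hand, so the example runs without any network access.

```python
def first_value(metadata, field, groups=("og", "twitter", "meta", "page")):
    """Return the first non-empty value for `field`, checking groups in order."""
    for group in groups:
        value = metadata.get(group, {}).get(field)
        if value:
            # A value may be a list when a tag appears more than once
            return value[0] if isinstance(value, list) else value
    return None


# Hand-written stand-in for page.metadata (assumed shape, not real output)
sample_metadata = {
    "og": {"title": "Top 100 Movies"},
    "meta": {"description": "A list of films.", "title": "Top 100 Movies - Example"},
    "page": {"title": "Top 100 Movies - Example Site"},
    "twitter": {},
    "dc": {},
}

print(first_value(sample_metadata, "title"))
print(first_value(sample_metadata, "description"))
print(first_value(sample_metadata, "image"))
```

Checking the groups in a fixed order means a missing Open Graph tag falls back to the plain meta tag or page title instead of raising a KeyError. With a real page you would pass page.metadata in place of sample_metadata.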