Scrape Naver News Results with Python

Naver (Korean: 네이버) is a search engine and internet portal based in South Korea. The site was founded in June 1999 by former Samsung employees and is South Korea’s first internet portal with its own search engine. Today, it operates under the Naver Corporation.

According to internet research service comScore, Naver handled 2 billion search queries in August 2007, which was 70 percent of all search queries in Korea. Naver is South Korea’s most popular search engine and the world’s 6th largest, after Google, Baidu, Bing, Yahoo, and Yandex. It is also the homepage of 25 million internet users in Korea.

Naver also offers a news service, and many apps today use Naver news data as their dataset. In this article, we will examine how to scrape Naver news with the Python programming language. Let’s get started.

Scraping Naver news with the Python programming language

To scrape the news shared on Naver news with Python, we must first create a Python file. Then we need to install the necessary libraries by running the following commands from the command line.

 pip install requests 
 pip install lxml 
 pip install beautifulsoup4 

 

  • requests: An HTTP library that lets your Python project send HTTP requests and read the responses, so it can fetch source data from the web.
  • lxml: A fast, high-performance HTML and XML parsing library for Python. It works well when you aim to scrape large datasets. The combination of requests and lxml is very common in web scraping.
  • beautifulsoup4: BeautifulSoup is a Python library for extracting data from HTML and XML files. In this article it uses lxml as its parser.
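
To make the division of labor concrete, here is a minimal sketch of the parsing step on its own, with no network request. The HTML string, the class names (item, title, thumb), and the parse_items helper are hypothetical and are not Naver's real markup. The sketch also shows that select_one returns None when an element is missing, which is worth guarding against, and it falls back to Python's built-in parser if lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical HTML standing in for a fetched page; not real Naver markup.
SAMPLE_HTML = """
<ul class="list">
  <li class="item"><a class="title" href="https://example.com/a">First story</a>
      <img class="thumb" src="https://example.com/a.jpg"></li>
  <li class="item"><a class="title" href="https://example.com/b">Second story</a></li>
</ul>
"""

def parse_items(html):
    """Return one dict per .item element, tolerating a missing thumbnail."""
    try:
        soup = BeautifulSoup(html, "lxml")  # needs the lxml package
    except FeatureNotFound:
        soup = BeautifulSoup(html, "html.parser")  # standard-library fallback

    items = []
    for item in soup.select(".item"):
        title = item.select_one(".title")
        thumb = item.select_one(".thumb")  # None when the element is absent
        items.append({
            "title": title.text,
            "link": title["href"],
            "thumbnail": thumb["src"] if thumb is not None else None,
        })
    return items

if __name__ == "__main__":
    for row in parse_items(SAMPLE_HTML):
        print(row)
```

In a real scraper, the HTML string would come from requests.get(...).text instead of a constant.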

 

Then we paste the following code block into the Python file we created.

import json

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with the request.
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/100.0.4896.127 Safari/537.36"
}

# "query" is the search term; "where": "news" selects the news results.
params = {
    "query": "minecraft",
    "where": "news",
}

def extract_naver_news_from_url():
    html = requests.get("https://search.naver.com/search.naver", params=params, headers=headers).text
    # The "lxml" parser requires the lxml package installed above.
    soup = BeautifulSoup(html, "lxml")

    naver_news_arr = []

    # Each ".bx" element inside ".list_news" is one news result.
    for news_result in soup.select(".list_news .bx"):
        naver_news_title = news_result.select_one(".news_tit").text
        naver_news_link = news_result.select_one(".news_tit")["href"]
        naver_news_thumbnail = news_result.select_one(".dsc_thumb img")["src"]
        naver_news_snippet = news_result.select_one(".news_dsc").text
        naver_news_press_name = news_result.select_one(".info.press").text
        naver_news_date = news_result.select_one("span.info").text

        naver_news_arr.append({
            "title": naver_news_title,
            "link": naver_news_link,
            "thumbnail": naver_news_thumbnail,
            "snippet": naver_news_snippet,
            "press_name": naver_news_press_name,
            "news_date": naver_news_date
        })

    # ensure_ascii=False keeps the Korean text readable in the output.
    print(json.dumps(naver_news_arr, indent=2, ensure_ascii=False))

extract_naver_news_from_url()

 
When this code is run, it prints output like the following. The exact results depend on when the search is made.

 [
  {
    "title": "경산시,'압독국 미래를 만나 영원불멸을 꿈꾸다' 운영",
    "link": "http://www.breaknews.com/890115",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5297%2F2022%2F04%2F25%2F570674.jpg&type=ff264_180&expire=2&refresh=true",
    "snippet": "  'Imagining Abdok-guk Minecraft', 'Apnyang Cultural Exploration Team Meets Abdok and Imdang Relics'... 'Imagining Abdok Country in Minecraft' is a metaverse environment program where students can directly feel... ",
    "press_name": "브레이크뉴스",
    "news_date": "6일 전"
  },
  {
    "title": "인천크래프트 애니메이션 영상 공개",
    "link": "http://www.joongdo.co.kr/web/view.php?key=20220502010000305",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5340%2F2022%2F05%2F02%2F776813.jpg&type=ofullfill264_180_gray&expire=2&refresh=true",
    "snippet": "  인천크래프트는 세계적인 게임 '마인크래프트'(Minecraft)를 활용해 인천을 가상세계로 구현한 메타버스(가상공간)다. 이번에 공개되는'인천크래프트: 시간여행자의 도시'에선 마인크래프트로 구현한 인천의 역사와... ",
    "press_name": "중도일보",
    "news_date": "4시간 전"
  },
  {
    "title": "라코스테(LACOSTE) X 마인크래프트(Minecraft) 콜라보레이션 공개",
    "link": "http://www.thefirstmedia.net/news/articleView.html?idxno=90272",
    "thumbnail": "https://search.pstatic.net/common/?src=https%3A%2F%2Fimgnews.pstatic.net%2Fimage%2Forigin%2F5560%2F2022%2F03%2F16%2F47413.jpg&type=ofullfill264_180_gray&expire=2&refresh=true",
    "snippet": "  마인크래프트(MINECRAFT)와의 콜라보레이션이 공개됐다. 업체 측은 \"패션과 게임 업계를 대표하는 두... 이번 협업을 통해 탄생한 콜라보레이션 컬렉션뿐만 아니라 가상의 세계인 마인크래프트(MINECRAFT)에서... ",
    "press_name": "더퍼스트",
    "news_date": "2022.03.16."
  },
  [...]
]
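
Note that the news_date values in this output come in mixed forms: relative ones such as "6일 전" (6 days ago) and "4시간 전" (4 hours ago), and absolute ones such as "2022.03.16.". If you need to sort or filter by date, it may help to normalize them. The following is a rough sketch under stated assumptions: the normalize_news_date helper is hypothetical, and the "분" (minutes) and "주" (weeks) units do not appear in the sample output and are only assumed to follow the same pattern.

```python
import re
from datetime import datetime, timedelta

# Korean time units mapped to timedelta keyword arguments.
# Only "시간" (hours) and "일" (days) appear in the sample output;
# "분" (minutes) and "주" (weeks) are assumptions.
_UNITS = {"분": "minutes", "시간": "hours", "일": "days", "주": "weeks"}
_RELATIVE = re.compile(r"^(\d+)\s*(분|시간|일|주)\s*전$")

def normalize_news_date(text, now=None):
    """Convert a scraped news_date string to a datetime, or None if unrecognized."""
    now = now or datetime.now()
    text = text.strip()

    # Relative form, e.g. "6일 전": subtract the amount from the reference time.
    match = _RELATIVE.match(text)
    if match:
        amount, unit = int(match.group(1)), match.group(2)
        return now - timedelta(**{_UNITS[unit]: amount})

    # Absolute form, e.g. "2022.03.16." (note the trailing dot).
    try:
        return datetime.strptime(text, "%Y.%m.%d.")
    except ValueError:
        return None

if __name__ == "__main__":
    reference = datetime(2022, 5, 2, 12, 0)  # fixed reference time for the demo
    for raw in ["6일 전", "4시간 전", "2022.03.16."]:
        print(raw, "->", normalize_news_date(raw, now=reference))
```

Passing an explicit now makes the conversion reproducible. Relative values are only as precise as the moment the page was scraped, so it is worth storing the scrape time alongside the data.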

 
Conclusion

Web scraping is straightforward with Python’s libraries. With a short script, we requested the Naver news search results, parsed them, and produced clean JSON output. You can use the same approach to collect news data for other search queries.