Google has come a long way from being just a search engine. Over the years, the company has developed an impressive array of tools, and while some are highly specialized, there are a few tools worth knowing about, no matter what you use the web for. Google Images, aka Google Image Search, is just one of these tools.
Google Images is a web-based product from Google for searching images online. While it performs the same basic querying and results-returning functions as Google’s flagship search engine, it’s better understood as a specialized branch of it.
Google Search crawls and indexes text-based web pages, while Google Images returns image media based on the keywords you enter, so its process looks a little different under the hood. The main factor in determining which images fill your results page is how closely your search terms match the image filenames. Since filenames alone are often not enough, Google Images also relies on text-based contextual information on the same page as an image.
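To make the filename-matching idea concrete, here is a toy scorer that counts how many query terms appear in an image's filename. This is purely illustrative (the function name and logic are my own, and Google's real ranking is far more sophisticated):

```python
def filename_match_score(query, filename):
    """Toy relevance score: fraction of query terms found in the filename.
    Purely illustrative -- not Google's actual ranking algorithm."""
    terms = query.lower().split()
    name = filename.lower()
    if not terms:
        return 0.0
    return sum(term in name for term in terms) / len(terms)

print(filename_match_score("black cat", "black-cat-sleeping.jpg"))  # 1.0
print(filename_match_score("black cat", "IMG_2041.jpg"))            # 0.0
```

A descriptive filename like "black-cat-sleeping.jpg" scores well for the query, which is why the surrounding page text matters as a second signal when filenames are generic.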
Google Images has become a major source of new content today. In particular, image datasets for machine learning algorithms and image processing pipelines are often built from Google Images. In this article, we will discuss how to scrape Google Images for a target keyword with Python and how to download the scraped images to a local computer, particularly for use in machine learning algorithms and image processing pipelines.
Scraping Google images with Python programming language
First, we create a Python file, which we can call “web-scraping-main.py”. Then we run the command below to install the necessary libraries.
pip install selenium requests pillow
After installing the necessary libraries, we paste the following code into the “web-scraping-main.py” file.
import os
from ImageScraper import ImageScraper
from patch import webdriver_executable

if __name__ == "__main__":
    webdriver_path = os.path.normpath(os.path.join(os.getcwd(), 'webdriver', webdriver_executable()))
    image_path = os.path.normpath(os.path.join(os.getcwd(), 'photos'))
    search_keys = ['cat', 't-shirt']
    image_count = 2
    headless = False
    min_re = (0, 0)
    max_re = (9999, 9999)
    for search_key in search_keys:
        image_scraper = ImageScraper(webdriver_path, image_path, search_key,
                                     image_count, headless, min_re, max_re)
        image_urls = image_scraper.find_image_urls()
        image_scraper.save_images(image_urls)
        del image_scraper
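With these settings, the script saves each keyword's images into its own subfolder under photos/ (e.g. photos/cat and photos/t-shirt). A minimal sketch of that path logic, with a helper name of my own that is not part of the scraper itself:

```python
import os

def build_image_dir(base_dir, search_key):
    # Mirror the scraper's layout: one subfolder per search keyword.
    return os.path.normpath(os.path.join(base_dir, search_key))

print(build_image_dir("photos", "cat"))      # photos/cat (photos\cat on Windows)
print(build_image_dir("photos", "t-shirt"))
```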
Now let’s create the “ImageScraper.py” file (the module imported above), which is the business layer of the application where we carry out the actual scraping operations, and paste the following code.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import time
import urllib.request
import os
import requests
import io
from PIL import Image
import patch

class ImageScraper():
    def __init__(self, webdriver_path, image_path, search_key="cat",
                 number_of_images=1, headless=False, min_re=(0, 0), max_re=(1920, 1080)):
        image_path = os.path.join(image_path, search_key)
        # Retry until the webdriver starts, patching the chromedriver binary on failure.
        while True:
            try:
                options = Options()
                if headless:
                    options.add_argument('--headless')
                driver = webdriver.Chrome(webdriver_path, chrome_options=options)
                driver.set_window_size(1400, 1050)
                driver.get("https://www.google.com")
                break
            except Exception:
                try:
                    driver
                except NameError:
                    is_patched = patch.download_lastest_chromedriver()
                else:
                    is_patched = patch.download_lastest_chromedriver(driver.capabilities['version'])
        self.driver = driver
        self.search_key = search_key
        self.number_of_images = number_of_images
        self.webdriver_path = webdriver_path
        self.image_path = image_path
        self.url = ("https://www.google.com/search?q=%s&source=lnms&tbm=isch&sa=X"
                    "&ved=2ahUKEwie44_AnqLpAhUhBWMBHUFGD90Q_AUoAXoECBUQAw"
                    "&biw=1920&bih=947" % search_key)
        self.headless = headless
        self.min_re = min_re
        self.max_re = max_re

    def find_image_urls(self):
        image_urls = []
        count = 0
        missed_count = 0
        self.driver.get(self.url)
        time.sleep(3)
        indx = 1
        while self.number_of_images > count:
            try:
                # Click the next thumbnail to open its full-size preview.
                imgurl = self.driver.find_element_by_xpath('//*[@id="islrg"]/div/div[%s]/a/div/img' % str(indx))
                imgurl.click()
                missed_count = 0
            except Exception:
                missed_count += 1
                if missed_count > 10:
                    print("[INFO] No more photos.")
                    break
            try:
                time.sleep(1)
                class_names = ["n3VNCb"]
                images = [self.driver.find_elements_by_class_name(class_name)
                          for class_name in class_names
                          if len(self.driver.find_elements_by_class_name(class_name)) != 0][0]
                for image in images:
                    src_link = image.get_attribute("src")
                    # Keep only direct http(s) links, skipping Google's encrypted thumbnails.
                    if ("http" in src_link) and ("encrypted" not in src_link):
                        print("[INFO] %d. %s" % (count, src_link))
                        image_urls.append(src_link)
                        count += 1
                        break
            except Exception:
                try:
                    # Scroll down and click the "Show more results" button when it appears.
                    if count % 3 == 0:
                        self.driver.execute_script("window.scrollTo(0, " + str(indx * 60) + ");")
                    element = self.driver.find_element_by_class_name("mye4qd")
                    element.click()
                    time.sleep(3)
                except Exception:
                    time.sleep(1)
            indx += 1
        self.driver.quit()
        return image_urls

    def save_images(self, image_urls):
        for indx, image_url in enumerate(image_urls):
            try:
                search_string = ''.join(e for e in self.search_key if e.isalnum())
                image = requests.get(image_url, timeout=5)
                if image.status_code == 200:
                    with Image.open(io.BytesIO(image.content)) as image_from_web:
                        try:
                            filename = "%s%s.%s" % (search_string, str(indx), image_from_web.format.lower())
                            image_path = os.path.join(self.image_path, filename)
                            image_from_web.save(image_path)
                        except OSError:
                            # Some formats cannot be saved directly; convert to RGB first.
                            rgb_im = image_from_web.convert('RGB')
                            rgb_im.save(image_path)
                        image_resolution = image_from_web.size
                        if image_resolution is not None:
                            # Delete images that fall outside the requested resolution range.
                            if image_resolution < self.min_re or image_resolution > self.max_re:
                                image_from_web.close()
                                os.remove(image_path)
                        image_from_web.close()
            except Exception as e:
                pass
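The resolution check in save_images compares (width, height) tuples directly, which Python evaluates lexicographically. A standalone sketch of that filtering step with an explicit per-dimension check (the function name is my own, not part of the scraper) might look like:

```python
def within_resolution(size, min_re=(0, 0), max_re=(9999, 9999)):
    """Return True if an image's (width, height) falls inside the allowed range."""
    width, height = size
    return (min_re[0] <= width <= max_re[0]) and (min_re[1] <= height <= max_re[1])

print(within_resolution((800, 600)))                      # True
print(within_resolution((800, 600), min_re=(1024, 768)))  # False
```

Checking each dimension separately avoids surprises such as (100, 5000) comparing as "smaller" than (200, 10) under tuple ordering.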
By running the command below, you will see that the photos from Google Images are downloaded to the local file path according to the settings we specified in the “web-scraping-main.py” file.
python web-scraping-main.py
Web scraping is one of the most practical ways to build datasets automatically. As the example application shows, we obtained image data from Google Images in a very short time by web scraping with the Python programming language. You can scrape Google Images this way to suit your own needs.