Scrape and download Google Images with Python

Google has come a long way from being just a search engine. Over the years, the company has developed an impressive array of tools, and while some are highly specialized, there are a few tools worth knowing about, no matter what you use the web for. Google Images, aka Google Image Search, is just one of these tools.

Google Images is a web-based product from Google for searching images online. While it performs the same basic querying and result-retrieval functions as Google’s flagship search engine, it’s better understood as a specialized branch.

Google Search crawls text-based content and returns web pages, while Google Images returns image media matching the entered keywords, so its process looks a little different under the hood. The main factor determining which images fill your results page is how closely the search terms match the image filenames. Since this alone is often not enough, Google Images also relies on text-based contextual information on the same page as an image.

Google Images has become a major source of new content. In particular, image datasets for machine learning and image processing pipelines are often built from Google Images. This article shows how to scrape Google Images for a target keyword with Python and how to download the scraped images to a local computer, especially for use in machine learning and image processing workflows.

Scraping Google Images with the Python programming language

First we create a Python file; we can call it “web-scraping-main.py”. Then we run the command below to install the necessary libraries.

 pip install selenium requests pillow
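If you want to confirm the installs succeeded before moving on, you can query the installed versions with the standard library. This is a small sketch; `installed_versions` is a helper name introduced here, not part of the scraper:

```python
import importlib.metadata  # Python 3.8+

def installed_versions(packages=("selenium", "requests", "pillow")):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = importlib.metadata.version(pkg)
        except importlib.metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions

print(installed_versions())
```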

 

After installing the necessary libraries, we paste the following code into the “web-scraping-main.py” file.

 
import os
from google_scraper import ImageScraper
from patch import webdriver_executable

if __name__ == "__main__":
    webdriver_path = os.path.normpath(os.path.join(os.getcwd(), 'webdriver', webdriver_executable()))
    image_path = os.path.normpath(os.path.join(os.getcwd(), 'photos'))

    # Keywords to search for, and how many images to fetch per keyword.
    search_keys = ['cat', 't-shirt']
    number_of_images = 2
    headless = False
    # Minimum and maximum resolutions to keep; images outside are deleted.
    min_re = (0, 0)
    max_re = (9999, 9999)

    for search_key in search_keys:
        image_scraper = ImageScraper(webdriver_path, image_path, search_key,
                                     number_of_images, headless, min_re, max_re)
        image_urls = image_scraper.find_image_urls()
        image_scraper.save_images(image_urls)
        del image_scraper

 
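The min_re and max_re settings above define the smallest and largest resolutions that will be kept after download. Their effect can be sketched as a small standalone check; `within_bounds` is an illustrative helper, not part of the scraper itself:

```python
def within_bounds(size, min_re=(0, 0), max_re=(9999, 9999)):
    """Return True if an image's (width, height) falls inside the limits."""
    width, height = size
    return (min_re[0] <= width <= max_re[0]
            and min_re[1] <= height <= max_re[1])

# With the defaults above, virtually every image is kept.
print(within_bounds((800, 600)))                    # True
print(within_bounds((50, 50), min_re=(100, 100)))   # False
```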

Now let’s create the “google_scraper.py” file (the name uses an underscore so it can be imported as a Python module), which is the business layer of the application where the actual scraping operations are carried out, and paste the following code.

 

 
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException       

import time
import urllib.request
import os
import requests
import io
from PIL import Image

import patch 

class ImageScraper():
    def __init__(self, webdriver_path, image_path, search_key="cat",
                 number_of_images=1, headless=False, min_re=(0, 0), max_re=(1920, 1080)):
        image_path = os.path.join(image_path, search_key)
        # Create the per-keyword download folder if it does not exist yet.
        if not os.path.exists(image_path):
            os.makedirs(image_path)
        while True:
            try:
                options = Options()
                if headless:
                    options.add_argument('--headless')
                driver = webdriver.Chrome(webdriver_path, chrome_options=options)
                driver.set_window_size(1400, 1050)
                driver.get("https://www.google.com")
                break
            except Exception:
                # If launching fails, try to download a matching chromedriver and retry.
                try:
                    driver
                except NameError:
                    is_patched = patch.download_lastest_chromedriver()
                else:
                    is_patched = patch.download_lastest_chromedriver(driver.capabilities['version'])
                if not is_patched:
                    exit("[ERR] Please update the chromedriver in the webdriver folder manually.")

        self.driver = driver
        self.search_key = search_key
        self.number_of_images = number_of_images
        self.webdriver_path = webdriver_path
        self.image_path = image_path
        self.url = "https://www.google.com/search?q=%s&source=lnms&tbm=isch&sa=X&ved=2ahUKEwie44_AnqLpAhUhBWMBHUFGD90Q_AUoAXoECBUQAw&biw=1920&bih=947" % (search_key)
        self.headless = headless
        self.min_re = min_re
        self.max_re = max_re
        
    def find_image_urls(self):
        image_urls = []
        count = 0
        missed_count = 0
        self.driver.get(self.url)
        time.sleep(3)
        indx = 1
        while self.number_of_images > count:
            try:
                # Click the thumbnail at position indx in the results grid.
                imgurl = self.driver.find_element_by_xpath('//*[@id="islrg"]/div[1]/div[%s]/a[1]/div[1]/img' % (str(indx)))
                imgurl.click()
                missed_count = 0
            except Exception:
                missed_count = missed_count + 1
                if missed_count > 10:
                    print("[INFO] No more photos.")
                    break

            try:
                # Read the full-resolution source URL from the opened preview pane.
                time.sleep(1)
                class_names = ["n3VNCb"]
                images = [self.driver.find_elements_by_class_name(class_name)
                          for class_name in class_names
                          if len(self.driver.find_elements_by_class_name(class_name)) != 0][0]
                for image in images:
                    src_link = image.get_attribute("src")
                    if ("http" in src_link) and ("encrypted" not in src_link):
                        print("[INFO] %d. %s" % (count, src_link))
                        image_urls.append(src_link)
                        count += 1
                        break
            except Exception:
                pass

            try:
                # Scroll down and click "Show more results" when it appears.
                if count % 3 == 0:
                    self.driver.execute_script("window.scrollTo(0, %d);" % (indx * 60))
                element = self.driver.find_element_by_class_name("mye4qd")
                element.click()
                time.sleep(3)
            except Exception:
                time.sleep(1)
            indx += 1

        self.driver.quit()
        return image_urls

    def save_images(self, image_urls):
        for indx, image_url in enumerate(image_urls):
            try:
                # Keep only alphanumeric characters for the file name prefix.
                search_string = ''.join(e for e in self.search_key if e.isalnum())
                image = requests.get(image_url, timeout=5)
                if image.status_code == 200:
                    with Image.open(io.BytesIO(image.content)) as image_from_web:
                        try:
                            filename = "%s%s.%s" % (search_string, str(indx), image_from_web.format.lower())
                            image_path = os.path.join(self.image_path, filename)
                            image_from_web.save(image_path)
                        except OSError:
                            # Some modes (e.g. RGBA saved as JPEG) need conversion first.
                            rgb_im = image_from_web.convert('RGB')
                            rgb_im.save(image_path)
                        # Delete images outside the configured resolution bounds.
                        image_resolution = image_from_web.size
                        if image_resolution is not None:
                            if (image_resolution[0] < self.min_re[0] or
                                    image_resolution[1] < self.min_re[1] or
                                    image_resolution[0] > self.max_re[0] or
                                    image_resolution[1] > self.max_re[1]):
                                image_from_web.close()
                                os.remove(image_path)

                        image_from_web.close()
            except Exception:
                pass

 
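One caveat worth noting: the constructor interpolates search_key into the URL as-is, so multi-word keys such as "t-shirt dress" should be URL-encoded first. A minimal sketch with the standard library; `build_search_url` is an illustrative helper, and the URL here is trimmed to the essential parameters:

```python
from urllib.parse import quote_plus

def build_search_url(search_key):
    """Build a Google Images search URL with the key safely encoded."""
    return "https://www.google.com/search?q=%s&tbm=isch" % quote_plus(search_key)

print(build_search_url("t-shirt dress"))
# https://www.google.com/search?q=t-shirt+dress&tbm=isch
```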

By running “python web-scraping-main.py”, you will see that the photos from Google Images are downloaded to the local file path according to the settings we specified in the “web-scraping-main.py” file.
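After a run finishes, you can sanity-check the download folder, since each search key gets its own subfolder under "photos". A quick sketch; `count_downloads` is an illustrative helper, not part of the scraper:

```python
import os

def count_downloads(root):
    """Count the files saved in each per-keyword subfolder of the photos directory."""
    counts = {}
    for sub in sorted(os.listdir(root)):
        path = os.path.join(root, sub)
        if os.path.isdir(path):
            counts[sub] = len(os.listdir(path))
    return counts
```

For the settings above you would expect something like {'cat': 2, 't-shirt': 2}, minus any images rejected by the resolution filter.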

Conclusion

Web scraping is one of the most practical ways to build datasets automatically. As the example application shows, we obtained image data from Google Images in a very short time by web scraping with the Python programming language. You can scrape Google Images this way according to your own needs.

Resources

https://github.com/ohyicong/Google-Image-Scraper