Coffee Space


Scraper

What?

I decided to write a quick dataset creation tool, the idea being to train a neural network on a wide range of data. Essentially, you give it the number of images to source per query and several queries of interest, then let it do its thing. I originally wanted to use DuckDuckGo, but that search engine requires JavaScript to view images, so nope. I used Yahoo Images many years ago for a similar project, and yet again it proves valuable for scraping.
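
As a rough illustration of the query step, here is a minimal sketch of turning a search phrase into a Yahoo Images URL. The base URL is the one used by the script; the helper name `build_search_url` and the use of `quote_plus` are my own assumptions for illustration, not part of the original tool.

```python
try:
    from urllib.parse import quote_plus  # Python 3
except ImportError:
    from urllib import quote_plus  # Python 2

# Base URL taken from the script in this post
YAHOO_IMAGES = "https://images.search.yahoo.com/search/images?p="

def build_search_url(query):
    # Hypothetical helper: quote_plus turns spaces into "+" and
    # percent-encodes other characters that are unsafe in a query string
    return YAHOO_IMAGES + quote_plus(query)

print(build_search_url("black cat"))
# https://images.search.yahoo.com/search/images?p=black+cat
```

Encoding the whole query this way also covers characters such as `&`, which would otherwise cut the search term short.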

The code itself is very simple and inefficient; it's literally just a dirty hack to scrape for images. One caveat is that urllib doesn't seem to want to play with HTTPS, and a number of web servers simply block plain HTTP requests, so some downloads fail. Additionally, a number of images arrive corrupted somehow and my image reader refuses to display them. It's probably urllib doing something strange, because they all display fine in several browsers before downloading.
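
I don't know the actual cause of the corrupted files, but one cheap check is to look at the first few bytes of each download. JPEG files start with the bytes `FF D8 FF` and PNG files with an 8-byte signature, so anything else (for example an HTML error page saved under a `.jpg` name) can be flagged. The sketch below assumes a hypothetical helper called `looks_like_image`; it is not part of the original script, and it won't catch files that are merely truncated.

```python
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_image(path):
    """Return "jpeg", "png", or None based on the file's leading bytes."""
    # Hypothetical helper: only inspects the signature, not the full file
    with open(path, "rb") as handle:
        head = handle.read(8)
    if head.startswith(JPEG_MAGIC):
        return "jpeg"
    if head.startswith(PNG_MAGIC):
        return "png"
    return None
```

Running this over the downloaded files after a scrape would at least separate "not an image at all" from "image that my viewer dislikes".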

In the future it should:

It's basic, it works, don't sue me.

Code

The following is the source; use at your own peril:

# scrape.py by B[]
#
# Created Sep 2019. Last updated Sep 2019.
#
# Search Yahoo images for a handful of images using keywords, then download
# them and apply appropriate names.
#
# Warning: Use at your own risk. If downloading lots of data, consider not
# spamming web servers, DNS servers, Yahoo search, etc.

# Note: this uses the Python 2 urllib API (urllib.urlopen, urllib.URLopener)
import sys
import urllib

# Check that we have arguments to loop
if len(sys.argv) <= 1 :
  print("scrape.py <NUM> <SEARCH> [SEARCH] ..")
  print("")
  print("  NUM")
  print("    Number of images to find for each search")
  print("")
  print("  SEARCH")
  print("    Search query to download images for")
  sys.exit(0)
# Loop over command line arguments
num = -1
for i in range(len(sys.argv)) :
  # Skip the script name, however the script was invoked
  if i > 0 :
    # First valid argument is the number of images per search
    if num < 0 :
      num = int(sys.argv[i])
    else :
      # str.replace() returns a new string, so build the URL from its result
      u = "https://images.search.yahoo.com/search/images?p=" + sys.argv[i].replace(" ", "+")
      print("url -> " + u)
      f = urllib.urlopen(u)
      s = f.read().split("\"")
      count = 0
      skip = False
      # Loop over possible images
      for x in range(len(s)) :
        # Check for image type
        img = ""
        if ".jpg" in s[x] :
          img = ".jpg"
        if ".jpeg" in s[x] :
          img = ".jpeg"
        if ".png" in s[x] :
          img = ".png"
        # Make sure we have a valid image
        if img != "" and count < num and "\\/" in s[x] :
          # Avoid double images
          if not skip :
            s[x] = s[x].replace("\\", "")
            print("Downloading -> " + s[x])
            t = urllib.URLopener()
            # Attempt to download image and display result
            try :
              t.retrieve(s[x], sys.argv[i] + "_" + str(count) + img)
              count += 1
              skip = True
              print("  Success")
            except Exception as e :
              # Report why the download failed; Ctrl+C still stops the script
              print("  Failed (" + str(e) + ")")
          else :
            skip = False
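
The filtering part of the loop above can also be pulled out into a function that is easier to test without touching the network. This is only a sketch under my own assumptions: the name `extract_image_urls` is hypothetical, and it swaps the script's alternating `skip` flag for a set of already-seen URLs, which is a different way of avoiding double images. The `example.com` addresses in the usage lines are made up.

```python
def extract_image_urls(page, limit):
    """Collect up to `limit` distinct (url, extension) pairs from page text."""
    urls = []
    seen = set()
    for token in page.split('"'):
        if len(urls) >= limit:
            break
        # Like the script, only accept tokens with JSON-escaped slashes ("\/")
        if "\\/" not in token:
            continue
        # Same extension test as the script; a later match overrides an earlier one
        ext = ""
        for candidate in (".jpg", ".jpeg", ".png"):
            if candidate in token:
                ext = candidate
        if not ext:
            continue
        # Strip the escaping backslashes to get a usable URL
        url = token.replace("\\", "")
        if url not in seen:
            seen.add(url)
            urls.append((url, ext))
    return urls

# Made-up page fragment with one duplicated image URL
sample = '{"a":"http:\\/\\/example.com\\/cat.jpg","b":"http:\\/\\/example.com\\/cat.jpg","c":"http:\\/\\/example.com\\/dog.png"}'
print(extract_image_urls(sample, 5))
```

Each returned pair carries the extension so the caller can build the output file name the same way the script does.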