Showing results for tags 'recursive download'.
-
Version 1.0.0
996 downloads
This zip file includes example Wget commands for downloading files from the PDS Geosciences Node. The first example demonstrates downloading a PDS data set from the PDS Geosciences Node archive. The second example demonstrates using Wget to download an Orbital Data Explorer (ODE) cart request.
Tagged with: wget, recursive download (and 2 more)
-
Below is a Python 3.6 sample script for downloading files from the PDS Geosciences Node. It includes a configuration for downloading data files from a PDS Geosciences Node archive and one for downloading files from an Orbital Data Explorer (ODE) cart request. The script also supports multiple levels of subdirectories. Set the variables in the user section for your environment before running it. The example PDS data set and ODE cart request both exist, so they can be used for test runs of the script. Python 3.6 is required for the script to function. This script is also available in the downloads section of the forum.

# PDSGeosciencesNode_FileDownload.py
# Dan Scholes 2/19/18
# Python 3.6 compatible version
# Example of downloading data files using
# links from HTTP PDS Geosciences Node Data Archive
# or Orbital Data Explorer (ODE) Cart location
# Note: One drawback of this script is that it downloads one file at a time, rather than multiple streams.
# Additional Note: In the future, changes to the PDS Geosciences Node website and Orbital Data Explorer website may cause this example to no longer function.
# Disclaimer: This sample code is provided "as is", without warranty of any kind, express or implied. In no event shall the author be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the sample code or the use or other dealings with the sample code.
# Python download website: https://www.python.org/downloads/

import time
import urllib.error
import urllib.request
from pathlib import Path

# Variables for user to populate----------
saveFilesToThisDirectory = 'c:/temp/data/'  # local destination path to save files

# next two lines are for downloading from the PDS Geosciences Node archive
url = "http://pds-geosciences.wustl.edu/mro/mro-m-rss-5-sdp-v1/mrors_1xxx/"  # enter the directory you would like to download
relativeLinkPathBase = "http://pds-geosciences.wustl.edu"  # this is the default location for the relative paths on the website (just leave this value)

# next two lines are for downloading an ODE cart request
#url = "http://ode.rsl.wustl.edu/cartdownload/data/sample/"  # enter the directory you would like to download
#relativeLinkPathBase = "http://ode.rsl.wustl.edu/"  # this is the default location for the relative paths on the ode cart website (just leave this value)

recursiveVal = True  # True/False whether to download files in subdirectories of the specified location in the url variable
verboseMessages = False  # True/False whether to display verbose messages during the script processing
# End of variables for user to populate----------

relativeLinkPathBase = relativeLinkPathBase.rstrip('/')
maxDownloadAttempts = 3
filesToDownloadList = []


def get_pageLinks(inUrl, inRecursive):
    # catalog the links on one directory page; files are added to filesToDownloadList
    if verboseMessages:
        print("Cataloging Directory: ", inUrl)
    # directory to process
    myURLReader = urllib.request.urlopen(inUrl.rstrip('/'))
    myResults = myURLReader.read().decode('utf-8').replace("<a href=", "<A HREF=").replace("</a>", "</A>")
    myURLReader.close()
    data = myResults.split("</A>")
    tag = "<A HREF=\""
    endtag = "\">"
    for item in data:
        if "<A HREF" in item:
            try:
                ind = item.index(tag)
                item = item[ind + len(tag):]
                end = item.index(endtag)
            except ValueError:
                # str.index raises ValueError when the tag is not in the expected form
                pass
            else:
                # The link is found
                itemToDownload = item[:end]
                if "." in itemToDownload:
                    # the link is to a file
                    if relativeLinkPathBase not in itemToDownload:
                        # the path is relative, so we add the base url
                        itemToDownload = relativeLinkPathBase + itemToDownload
                    filesToDownloadList.append(itemToDownload)
                else:
                    # it's a directory, so let's go into it if recursive is chosen
                    if inRecursive:
                        if itemToDownload not in inUrl:
                            # we make sure it isn't a link to parent directory
                            if relativeLinkPathBase not in itemToDownload:
                                itemToDownload = relativeLinkPathBase + itemToDownload
                            # the directory is a subdirectory, so we will follow it
                            if verboseMessages:
                                print("subdirectory to process ", itemToDownload)
                            get_pageLinks(itemToDownload, inRecursive)


def download_files():
    # download the files that were identified
    # this refers to the global list of files to download
    localSuccessfulDownloads = 0
    print("==Downloads starting ==============")
    for link in filesToDownloadList:
        downloadAttempts = 0
        fileDownloaded = False
        if verboseMessages:
            print("downloading file: ", link)
        local_link = saveFilesToThisDirectory + link.replace(relativeLinkPathBase, "")
        # make sure the local directory structure has been created
        Path(local_link).parent.mkdir(parents=True, exist_ok=True)
        while not fileDownloaded and downloadAttempts < maxDownloadAttempts:
            try:
                urllib.request.urlretrieve(link, local_link)
                localSuccessfulDownloads += 1
                fileDownloaded = True
            except urllib.error.URLError as e:
                downloadAttempts += 1
                # we will retry the download the number of times allowed by maxDownloadAttempts variable
                if verboseMessages:
                    print("downloadError: ", e.reason)
                    print("downloadErrorFile: ", link, " attempt:", downloadAttempts)
                if downloadAttempts < maxDownloadAttempts:
                    time.sleep(15)  # wait 15 seconds before the next attempt
                else:
                    print("Could not successfully download: ", link, " after ", downloadAttempts, " download attempts")
    print("==Downloads complete ==============")
    print("SuccessfulDownloads: ", localSuccessfulDownloads, " out of ", len(filesToDownloadList))


print('==Process is starting ===================')
# get the file links
get_pageLinks(url, recursiveVal)
print("==Collected ", len(filesToDownloadList), " file links ======")
# now download the files
download_files()
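The script collects links by splitting the page text on anchor tags. As a possible alternative, the same step can be sketched with the standard library's html.parser and urllib.parse.urljoin. This is only an illustration, not part of the original script: the names LinkCollector and classify_links, the "example" directory, and the sample listing are hypothetical, and the sketch assumes that subdirectory links end with "/" (the original script instead treats any link containing "." as a file).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, so <A HREF=...> also matches
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(page_url, html_text):
    """Split the links of a directory listing into (files, subdirectories).

    Assumption for this sketch: subdirectory links end with "/" and
    everything else is a file. Links that do not sit below page_url
    (for example the parent directory link) are ignored.
    """
    base = page_url if page_url.endswith("/") else page_url + "/"
    parser = LinkCollector()
    parser.feed(html_text)
    files, subdirs = [], []
    for href in parser.hrefs:
        absolute = urljoin(base, href)  # resolves relative and root-relative links
        if not absolute.startswith(base) or absolute == base:
            continue
        (subdirs if absolute.endswith("/") else files).append(absolute)
    return files, subdirs


# Hypothetical listing, made up for this example
sample_listing = (
    '<A HREF="/mro/">[To Parent Directory]</A> '
    '<A HREF="/mro/example/data/">data</A> '
    '<a href="/mro/example/aareadme.txt">aareadme.txt</a>'
)
files, subdirs = classify_links(
    "http://pds-geosciences.wustl.edu/mro/example/", sample_listing
)
print(files)
print(subdirs)
```

With this approach the parent directory link is dropped because it does not start with the page URL, so no separate check against the current URL is needed. Whether the trailing "/" assumption holds for a given server's listings should be checked before relying on it.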
Tagged with: python 3.6, recursive download (and 2 more)
-
Below are example Wget commands for downloading files from the PDS Geosciences Node. The first example demonstrates downloading a PDS data set from the PDS Geosciences Node archive. The second example demonstrates using Wget to download an Orbital Data Explorer (ODE) cart request.

Dan Scholes 2/20/18
Example of downloading data files using links from HTTP PDS Geosciences Node Data Archive or Orbital Data Explorer (ODE) Cart location

Note: In the future, changes to the PDS Geosciences Node website and Orbital Data Explorer website may cause this example to no longer function.

Disclaimer: This sample code is provided "as is", without warranty of any kind, express or implied. In no event shall the author be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the sample code or the use or other dealings with the sample code.

Wget website: https://www.gnu.org/software/wget/

Example of downloading a PDS Geosciences Node archive subdirectory
wget -rkpN -P c:\temp\data -nH --cut-dirs 2 --level=15 --no-parent --reject "index.html*" -e robots=off http://pds-geosciences.wustl.edu/mro/mro-m-crism-4-typespec-v1/mrocr_8001/

Example of downloading ODE Cart Request
wget -rkpN -P c:\temp\data -nH --cut-dirs 2 --level=15 --no-parent --reject "index.html*" -e robots=off http://ode.rsl.wustl.edu/cartdownload/data/sample

-r means recursively download files.
-k means convert links. Links in the downloaded pages are rewritten to point to the local copies instead of the remote site, so the pages can be viewed locally.
-p means get all webpage resources, so wget will obtain the images and JavaScript files needed to make a page display properly.
-N turns on timestamping, so remote files that are not newer than the local copies are skipped.
-P sets the local destination directory for the downloaded files.
-e executes a command as if it were part of the .wgetrc file; it must be present for robots=off to work. robots=off means ignore the robots file.
-c allows the command to pick up where it left off if the connection is dropped and the command is re-run. It is not included in the example commands above.
--no-parent keeps the command from downloading all the files in the directories above the requested level.
--reject "index.html*" keeps wget from downloading every directory's default index.html.
-nH disables the generation of host-prefixed directories. In the example above, a directory ode.rsl.wustl.edu will not be created locally.
--cut-dirs 2 ignores the given number of leading directory components. In this example, the first 2 directory levels are omitted from the path that is created locally for the downloaded files. Example: for http://ode.rsl.wustl.edu/cartdownload/data/sample, the first directory in the destination directory will be "sample".
--level=15 (general form --level=depth) sets how many levels to search recursively. The default is 5, but we need to go farther with the ODE cart and the PDS Geosciences archive.
----------
-nd or --no-directories puts all the requested files in one directory. We are not using this option, but a user may prefer it.
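To make the combined effect of -P, -nH and --cut-dirs concrete, the path mapping can be sketched in a few lines of Python. This is an approximation of the behavior described above, not wget's own code; the function name wget_local_path and the file name readme.txt are hypothetical, and the sketch assumes the URL names a file and ignores query strings and any characters wget escapes in file names.

```python
from urllib.parse import urlparse


def wget_local_path(url, prefix, cut_dirs=0, no_host=False):
    """Approximate where a recursive wget run would save `url` under `prefix`.

    Illustration only: mimics the described effect of -P (prefix),
    -nH (no_host) and --cut-dirs (cut_dirs). Assumes `url` ends in a file name.
    """
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    directories, filename = parts[:-1], parts[-1]
    directories = directories[cut_dirs:]      # --cut-dirs drops leading directory levels
    if not no_host:
        directories.insert(0, parsed.netloc)  # without -nH the host name comes first
    return "/".join([prefix.rstrip("/")] + directories + [filename])


# Hypothetical file inside the sample ODE cart directory
sample_url = "http://ode.rsl.wustl.edu/cartdownload/data/sample/readme.txt"

# Options as in the example commands: -P c:/temp/data -nH --cut-dirs 2
print(wget_local_path(sample_url, "c:/temp/data", cut_dirs=2, no_host=True))

# Same URL without -nH and --cut-dirs
print(wget_local_path(sample_url, "c:/temp/data"))
```

The first call places the file under c:/temp/data/sample/, matching the statement that "sample" is the first directory in the destination. The second call shows the longer host-prefixed path that results when neither option is given.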
Tagged with: wget, recursive download (and 2 more)