Tracking URL Redirects Using OpenRefine
When reconciling a list of Sirsi records against a publisher's title list using the URL as a matchpoint, one list may use a redirecting URL (such as a DOI) that takes the user to the URL used on the other list. The following process will find the ending point of the redirecting URL so that the two lists can be accurately compared.

In the database, copy the redirecting URLs from the appropriate table into a text file. Then, in OpenRefine, create a project by importing the text file. From the dropdown menu on the "URL" column, select "Edit column" > "Add column based on this column...". Select "Jython" from the "Language" dropdown menu and paste in the following Jython code:

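import httplib
# Host name of the publisher site, without the "http://" prefix
conn = httplib.HTTPConnection("www.mhebooklibrary.com")
# Skip the first 29 characters ("http://www.mhebooklibrary.com") to keep only the path
url = value[29:]
# A HEAD request returns the response headers without the page body
conn.request("HEAD", url)
res = conn.getresponse()
# The Location header holds the address the URL redirects to
return res.getheader('location')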

Two changes will be needed for this to work with a new publisher title list:

1. In the line conn = httplib.HTTPConnection("www.mhebooklibrary.com"), replace "www.mhebooklibrary.com" with the host name used by the publisher (without the "http://" prefix).

2. In the line url = value[29:], replace the "29" with the number of characters that must be skipped in order to get past the URL host name. For example, in the URL "http://www.mhebooklibrary.com/product/schaums-outline-intermediate-algebra-second-edition", 29 characters must be skipped to identify the unique string: "/product/schaums-outline-intermediate-algebra-second-edition".
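
Both values can also be read off a sample URL instead of being counted by hand. The following is a minimal sketch in standalone Python, meant to be run outside OpenRefine. The helper name host_and_offset is hypothetical and not part of the original procedure, and the sketch assumes every URL in the list starts with a scheme such as "http://".

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2 / Jython


def host_and_offset(sample_url):
    """Return (host, offset) for one sample URL from the list.

    host   is the string to pass to httplib.HTTPConnection(...)
    offset is the N to use in value[N:]
    """
    parts = urlsplit(sample_url)
    host = parts.netloc
    # Everything before the path is "<scheme>://<host>"
    offset = len(parts.scheme) + len("://") + len(host)
    return host, offset


sample = "http://www.mhebooklibrary.com/product/schaums-outline-intermediate-algebra-second-edition"
host, offset = host_and_offset(sample)
print("%s %d" % (host, offset))  # www.mhebooklibrary.com 29
print(sample[offset:])           # /product/schaums-outline-intermediate-algebra-second-edition
```

If the second printed line does not begin with "/", the count is off and the request in the Jython code will fail.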

Another example, for "link.springer.com", from http://link.springer.com/openurl?genre=book&isbn=978-981-287-645-4:
import httplib
conn = httplib.HTTPConnection("link.springer.com")
url = value[24:]
conn.request("HEAD", url)
res = conn.getresponse()
return res.getheader('location')

where 24 characters must be skipped ("http://link.springer.com") to identify the unique string:

"/openurl?genre=book&isbn=978-981-287-645-4"

Another example, for "dx.doi.org", from http://dx.doi.org/10.1079/9780851988160.0000:
import httplib
conn = httplib.HTTPConnection("dx.doi.org")
url = value[17:]
conn.request("HEAD", url)
res = conn.getresponse()
return res.getheader('location')

where 17 characters must be skipped to identify the unique string:

"/10.1079/9780851988160.0000"

If the code is correct, a preview of the values for the new column will load after a few seconds. If it is incorrect, an "Error:null" message will appear in the column. Try removing the "www." from the host name. Also recheck the character count in the url = value[...] line by removing

  • conn.request("HEAD", url)
  • res = conn.getresponse()
  • return res.getheader('location')

from the text box and replacing them with

  • return url

The preview should then show only the part of each URL that follows the host name, beginning with "/".

After the preview loads, click "OK" to add the column with the new URLs.

Some URLs, such as Springer, may redirect more than once. Below is a template for a URL that redirects twice:

import httplib
conn = httplib.HTTPConnection("link.springer.com")
url = value[24:]
conn.request("HEAD", url)
res = conn.getresponse()
first = res.getheader('location')
conn2 = httplib.HTTPConnection("link.springer.com")
url = first[24:]
conn2.request("HEAD", url)
res2 = conn2.getresponse()
return res2.getheader('location')
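
When the number of redirects is not known in advance, the same idea can be written as a loop. The following is a minimal sketch in standalone Python, not a drop-in OpenRefine expression. The names head_location and final_url, the 10-second timeout, and the limit of 5 hops are all assumptions made for illustration. Unlike the template above, it takes the host from each URL rather than from a fixed character count, and it resolves relative Location headers.

```python
try:
    import http.client as httplib                   # Python 3
    from urllib.parse import urlsplit, urljoin
except ImportError:
    import httplib                                  # Python 2 / Jython
    from urlparse import urlsplit, urljoin


def head_location(url):
    """Send one HEAD request and return the Location header, or None."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = httplib.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = httplib.HTTPConnection(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        return conn.getresponse().getheader("location")
    finally:
        conn.close()


def final_url(url, head=head_location, max_hops=5):
    """Follow Location headers until none is returned or max_hops is reached."""
    for _ in range(max_hops):
        location = head(url)
        if not location:
            break
        # A Location header may be relative; resolve it against the current URL
        url = urljoin(url, location)
    return url


# Offline illustration with made-up addresses: a dict stands in for the network.
hops = {"http://a.example/1": "http://b.example/2",
        "http://b.example/2": "/3"}
print(final_url("http://a.example/1", head=hops.get))  # http://b.example/3
```

The max_hops limit keeps a URL that redirects to itself from looping forever.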

The script does not always work for Elsevier titles on the ScienceDirect platform. This script requires more memory, and for large lists of URLs it may be necessary to break up the list into smaller files.

Attachment: ElsevierRedirectOpenRefineScript.txt
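
Splitting a large list into smaller files, as mentioned above, can be done by hand or with a few lines of standalone Python. The following is a minimal sketch; the function names, the "urls_part" file prefix, and the batch size are hypothetical and should be adjusted to whatever OpenRefine can handle on the machine in use.

```python
def split_into_batches(urls, batch_size):
    """Return the URLs as consecutive batches of at most batch_size items."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


def write_batches(urls, batch_size, prefix="urls_part"):
    """Write each batch to its own text file, one URL per line."""
    names = []
    batches = split_into_batches(urls, batch_size)
    for number, batch in enumerate(batches, start=1):
        name = "%s_%03d.txt" % (prefix, number)
        with open(name, "w") as handle:
            handle.write("\n".join(batch) + "\n")
        names.append(name)
    return names


# Five placeholder URLs in batches of two give three batches.
urls = ["http://dx.doi.org/example-%d" % n for n in range(5)]
print([len(batch) for batch in split_into_batches(urls, 2)])  # [2, 2, 1]
```

Each resulting file can then be imported into OpenRefine as its own project.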

After the destination URLs are found, export the project to an Excel file. Copy the new URLs into the Sirsi table in the Ebook Reconciliation Phase 2 database.
