Author: Umesh Sunnapu
Welcome to the second episode of the Dell Business Intelligence Project Using USPTO Data. This episode answers the following questions:
The USPTO project uses Dell Boomi, the Dell Quickstart Data Warehouse Appliance, and Dell Toad products to analyze publicly available data. For more information about the goal and scope of the project, as well as a breakdown of the episodes, see episode 1.
Download the USPTO Patent Grant Full Text data by navigating to http://www.google.com/googlebooks/uspto-patents-grants-text.html and selecting the desired files. Each week, the patents are compiled into a single XML file and uploaded to the Google website. This process produces a total of 52 files for a given year. The Google website hosts patent grants from 1976 to 2012. However, it is important to note that the USPTO data is formatted differently for the two periods: 1976 to 2001 and 2001 to present.
From the Google USPTO site:
Patent Grant Full Text (2001 to present):
Contains the full text including tables, sequence data and "in-line" mathematical expressions of each patent grant issued weekly (Tuesdays) from January. The file is a concatenation of the Standard Generalized Markup Language (SGML) in accordance with the U.S. Patent Grant Version 2.4 Document Type Definition (DTD) and eXtensible Markup Language (XML) in accordance with the U.S. Patent Grant Version 2.5; 4.0 International Common Element (ICE); 4.1 ICE; 4.2 ICE Document Type Definitions (DTDs). Sequence data XML text in accordance with the ICE SEQLST V1.2 DTD (us-sequence-listing-2004-03-09.dtd) is concatenated next to the containing grant SGML or XML text. References to the following external files are present but the external files are not present:
Patent Grant Full Text (1976 to 2001):
Contains the full text of each patent grant issued weekly (Tuesdays) from January 1976 to December 2001. The file format is ASCII text (a.k.a. Patent Grant Green Book). Included are tables and "in-line" mathematical equations, where appropriate, appearing as text data. Chemical structures are not present, but their location is indicated by a structure call-out. Includes patent number, series code and application number, type of patent, filing date, title, issue date, inventor information, assignee name at time of issue, foreign priority information, related US patent documents, classification information, US and foreign references, attorney, agent or firm/legal representative, Patent Cooperation Treaty (PCT) information, abstract, specification, and claims. Approximately 4,000 patent grants per week. Refer to the following link for additional Patent Grant Data/APS documentation:
A document type definition (DTD) is a set of markup declarations that define a document type. For the Google USPTO data, the document type is XML. The DTD file declares the elements and references that can appear in the Google USPTO XML file.
The DTD files for the 2001 to present data are available on the USPTO sample data site.
Figure 1. USPTO website for XML resources
If the system cannot locate an imported XML file, the cause is commonly a misplaced XML file. This error occurs when you attempt to open an XML file from a folder that does not contain the DTD files and all of their subdirectories. In our example, we placed the XML file, US07861317-20110104.XML, in the following location:
<zip_download_loc>\I20110104 Sample\DTDS\DTDS\PTO-ICE-GRANT-2007\DTDS
Figure 2. When XML files are misplaced, the system cannot locate the object specified.
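Before opening a file, it can help to confirm that the DTD it references is actually in the same folder. The following is a minimal Python sketch, not part of the original project. It assumes that the XML file names its DTD with a relative SYSTEM identifier in the DOCTYPE line. The function names and the example.dtd name in the comment are hypothetical.

```python
import re
from pathlib import Path

# Matches a DOCTYPE with a SYSTEM identifier, for example:
#   <!DOCTYPE some-root SYSTEM "example.dtd" [ ]>
# ("example.dtd" is a hypothetical name used only for illustration.)
DOCTYPE_SYSTEM = re.compile(r'<!DOCTYPE\s+[^\s>]+\s+SYSTEM\s+["\']([^"\']+)["\']')


def find_system_dtd(xml_text):
    """Return the SYSTEM identifier named in the DOCTYPE, or None if absent."""
    match = DOCTYPE_SYSTEM.search(xml_text)
    return match.group(1) if match else None


def dtd_is_alongside(xml_path):
    """Return True if the DTD referenced by the XML file is in the same folder.

    Assumes the SYSTEM identifier is a path relative to the XML file.
    """
    xml_path = Path(xml_path)
    # The DOCTYPE precedes the root element, so the start of the file is enough.
    with xml_path.open(encoding="utf-8", errors="replace") as handle:
        head = handle.read(4096)
    dtd_name = find_system_dtd(head)
    if dtd_name is None:
        return False
    return (xml_path.parent / dtd_name).is_file()
```

If dtd_is_alongside returns False for a file, moving that file into the folder that holds the DTD files is the likely fix.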
Another common error with the data on the Google Bulk Downloads page arises because Microsoft Excel can read only one XML document at a time. For example, download and extract the file named ipq120103.zip. The XML file inside this zip file contains multiple concatenated XML documents. We know this because an XML header appears before each patent. The XML header reads as follows:
<?xml version="1.0" encoding="UTF-8"?>
If you try to open the XML document directly from the zip file, you see the following error: “File cannot be opened because: Invalid xml declaration.”
Figure 3. File error when opening the XML document directly from inside a zipped directory.
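One quick way to confirm that an extracted file is a concatenation of several XML documents is to count the XML declarations in it. This is a minimal Python sketch. The function name is hypothetical, and it assumes the file is UTF-8 text, as the header indicates.

```python
def count_xml_declarations(path):
    """Count how many XML declarations ("<?xml ...?>") appear in a text file."""
    count = 0
    # Read line by line so that large weekly files do not need to fit in memory.
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # The trailing space avoids matching other processing
            # instructions such as "<?xml-stylesheet ...?>".
            count += line.count("<?xml ")
    return count
```

A result greater than 1 means the file holds more than one XML document and cannot be opened as a single document.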
To open a file properly in Microsoft Excel, split the downloaded XML file so that each resulting file contains only one <?xml version="1.0" encoding="UTF-8"?> declaration.
<?xml version="1.0" encoding="UTF-8"?>...Patent Data...</us-patent-grant>
3. Save the new document as test.xml.
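For more than a handful of patents, the split can be scripted instead of done by hand. The following Python sketch is one possible approach, not the method used in the project. It assumes that every XML declaration begins at the start of a line. The function names and the patent_00001.xml output naming are hypothetical.

```python
from pathlib import Path


def iter_documents(lines):
    """Yield each embedded XML document as one string, starting at its declaration."""
    current = []
    for line in lines:
        # A new declaration marks the start of the next document.
        if line.startswith("<?xml ") and current:
            document = "".join(current)
            if document.strip():
                yield document
            current = []
        current.append(line)
    document = "".join(current)
    if document.strip():
        yield document


def split_file(source, out_dir):
    """Write each embedded document to out_dir and return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(source, encoding="utf-8") as handle:
        for index, document in enumerate(iter_documents(handle), start=1):
            target = out_dir / f"patent_{index:05d}.xml"
            target.write_text(document, encoding="utf-8")
            written.append(target)
    return written
```

Each output file still references its DTD, so the output folder should be one that also contains the DTD files.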