Author: Robert Pound
Links: Enterprise Solutions, Business Intelligence and Appliance Solutions
This episode uses the process created in Episode 8 to extract data from the US patent office website and store it in a SQL database. We will modify the process in Episode 8 to collect one year’s worth of US Patent Office data, and then execute the SQL stored procedure discussed in Episode 9 to remove duplicate data.
The USPTO project uses Dell Boomi, Dell Quickstart, Data Warehouse Appliance, and Toad products to analyze publicly-available data. For more information about the goal and scope of the project, as well as a breakdown of the episodes, go to Episode 1.
This final Dell Boomi episode combines all processes into a single process that will extract and translate one year’s worth of US Patent Office data into meaningful SQL data that is ready to be analyzed. This episode contains the following sections:
There are two solutions for collecting a year’s worth of data in Boomi.
The first solution involves manually creating a list of files to download so that the previously-created process can execute on each file. Because of the simplicity and short time to implement, we chose this option for the project.
The second solution involves using Groovy programming to create a custom script in the data process shape to automatically parse the US Patent Office main page for all current files for a given year and then feed that into the HTTP connector. This would involve a long development cycle; therefore, this will be a future topic.
The current solution involves creating a flat file that contains all files for a given year and saving it to the Quickstart Data Warehouse Appliance server. We are creating a disk connect similar to those created in previous episodes.
1. Highlight the desired file names at the website http://www.google.com/googlebooks/uspto-patents-grants-text.html. The example in Figure 1 shows all of 2013 files.
Figure 1. Example files
2. Copy the highlighted file names.
3. Open Notepad.
4. Paste the contents into Notepad.
5. Format the Notepad contents to one file per line as shown in Figure 2.
Figure 2. Sample Notepad contents
6. Select File -> Save As. For example, save as c:\uspto\2013_list.txt.
Create a new Connector Operation:
1. In the Create Component window, select Connector Operation.
2. Name the component 2013_list.
3. Select the folder in which the previous project was created.
4. Select type disk.
5. Click Create.
Modify the Connector Operation:
1. Select Get.
2. File Filter: 2013_list.txt
3. File Matching Type: *
4. Click Save, and then click Close.
To create the data process shape, the process from the previous episode must first be opened.
1. Drag and drop a data process shape to the process.
2. Label: Split year.
3. Add Step.
a. Process Type: Split Document
b. Profile Type: Flat file
c. Split Option: Split by line
4. Click OK.
A more complete solution to pulling one year or multiple years of data is by programmatically parsing the bulk data page for the US Patent Office and collecting all file names for a given year. This would involve using Groovy, which is the language used in the custom scripting option of data process shapes. You would use Groovy to pull in data from the HTML page and then output the list of files in a line-by-line format. This would eliminate the dependence on the local disk connector as well as make the process more flexible and reduce the amount of manual intervention.
Since there was only one addition, Split Year, from the steps above, the process should look like Figure 3 below.
Figure 3. Modifying the previous project
To finalize the process, we will first add the multiple file content created at the start of this episode:
1. Configure start shape.
a. Select No Data.
b. Click OK.
2. Drag and drop the connector shape.
a. Label: get 2013 list
b. Connector: Disk
c. Action: Get
d. Connection: local USPTO
e. Operation: 2013_list
f. Click OK.
3. Drag and drop connector shape
a. Label: http multi
b. Connector: HTTP Client
d. Connection: New
i. Label: 2013 bulk
ii. URL: http://storage.googleapis.com/patents/grant_full_text/2013/
iii. Click Save, and then click Close.
e. Operation: New
i. Label: 2013 in
ii. Under Resource Path, create new variable:
2. Is replacement variable?: checked
iii. Click Save and then Close.
f. Parameters tab
i. Select new (+)
1. Input: USPTO_file
2. Type: Profile Element
3. Profile Type: Flat File
4. Profile: Create New (+)
a. Label: List Profile
b. New Element
i. Data Name: Name
c. Click Save, and then click Close.
5. Element: Name
ii. Click OK.
g. Click OK.
4. Connect new shapes to previous content in the following order:
a. start shape -> get 2013 list -> Split year -> http multi
Figure 4. Connecting the new shapes
5. Click Save.
Using the stored procedure, the SQL connector, and Program Command shape discussed in the previous episode, we will add the final section to the process to create a complete Dell Boomi process for collecting and storing data into SQL.
1. Create shape:
a. Drag and drop Program Command
i. Label: clean and insert
ii. Type: Stored Procedure
iii. Connection: SQL connection
iv. Procedure Name: dbo.del_dup
2. Connect shape just before stop.
3. The final process should look like Figure 5.
Figure 5. Final process