Part 4 – Power Query API – Accumulating Data Indeed!

This is the fourth part of the Indeed! Series and for me at least the most exciting part of the series! Executing parameterized queries to an API is great; but as someone who works in Business Intelligence I know that accumulating data and the insights garnered from evaluating data over time are invaluable. In Part 4 we will be merging the two concepts presented on Chris Webb’s Blog and Faisal Mohamood’s (or Devin Knight’s Blog if you prefer). All the workbooks for this series can be downloaded here.


First we will get started with some cleanup, then move on to some data manipulation and finally add some code to accumulate our data.

Cleanup

In Part 1 of the series you may recall that we cleaned up our data by renaming columns. This was just to help get you familiar with some of the capabilities in Power Query and make it little less frightening. In the last post we created a separate query and never cleaned up the column names. We don’t need to rename columns twice as it’s an extra step and wasted CPU cycles. It would be nice though if instead of “Custom.Company” in our final worksheet we had just “Company” as our column name.

In our “Query” worksheet open up the advanced editor and remove “RenamedColumns”. You will also have to remove the comma on the end of “SplitColumnDelimiter” and change the inner “In” clause. Your query will end up looking similar to this

**(I shortened up the longer lines for presentation purposes and replaced them with …, you can open up the final file for the full code).

let

GetJobs = (zipcode as text,salary as text,searchcriteria as text,publisher as text)=>
let
Source = Xml.Tables(Web.Contents(“http://api.indeed.com/ads/apisearch?publisher=”&publisher&”…
#”Expand results” = Table.ExpandTableColumn(Source, “results”, {“result”}, {“results.result”}),
#”Expand results.result” = Table.ExpandTableColumn(#”Expand results”, “results.result”, {“city”…
RemovedColumns = Table.RemoveColumns(#”Expand results.result”,{“query”, “location”, “dupefilter”,…
SplitColumnDelimiter = Table.SplitColumn(RemovedColumns,”results.result.formattedLocation”,Splitter.SplitTextByDelimiter(“,”),2)
in
SplitColumnDelimiter
in
GetJobs

If you click on “Refresh All” from the data tab your “Parameterized Query Results” will no longer function. Our “ParametermizedQuery” is expecting specific columns names from our “GetJobs” function; we just broke that connection by removing the rename lines in “GetJobs”. So to get everything linked up again open up the advanced query editor for our “ParameterizedQuery” function and remove the lines #Expand Custom” and “DuplicatesRemoved” add a comma to “InsertedCustom” and change the in clause.

let
Source = Excel.CurrentWorkbook(){[Name=”Table3″]}[Content],
InsertedCustom = Table.AddColumn(Source, “Custom”, each GetJobs([ZipCode],[Salary],[SearchCriteria],[PublisherKey]))
in
InsertedCustom

This is the same pattern we followed in the “GetJobs” query and one that we will likely use again in the future to adjust queries in the advanced editor. You can also eliminate this code by clicking on the “X” for a particular function under “Steps”. After all this you’ll need to re-do the steps from Part 3 in this series; expand the custom query results, rename the columns and de-duplicate the data on the Job Key. The end results should look similar to the screen shot below.

This was fairly repetitious and if you don’t want to go through it I understand (but you should just to get some more experience with Power Query!); you can download the cleaned version of the file (Cleaned – Part 4 – Power Query API- Indeed! – Cleaned.zip) here.

Data Manipulation

When we start accumulating data we will probably be looking at information across time. This is so we can answer questions like “What is the average number of jobs per day in Santa Monica?” or “Is the job market trending up or down this month?”. It’s one of the standard dimensions found in most data modeling and one that you will likely have in the majority of your models as well.

In order to accumulate data in from Power Query we need to load it into Power Pivot. Power Pivot is a column store index of data and a modeling tool directly in Excel; if this is over your head that’s ok. A good way to think of how it stores the information is by column and only unique values are effectively stored; unlike a traditional database that stores information as a row. As a good practice you almost always want to break your dates into two columns; one for date and one for time. This will limit the size of your workbook by condensing the date to a maximum of 366 days per year (leap years) and 86,400 unique time values (if the lowest grain is seconds). There are other space saving tips that could be used here; like splitting up the job key into separate columns but I’ll leave that for another time.

All that being said; it’s just good practice so we are going to make two columns based on our Posting_Date, one for Date and one for time. The first step is to open up the advanced query editor again for our “ParameterizedQuery” and change the Posting_Date to an actual Date/Time data type.

Next we need to use “Duplicate Column” on the “Posting_Date” column. This will create a new column on the end of our query called “Copy of Posting Date”. I moved the column next to “Posting Date” and renamed to “Posting_Time”. The end result is two columns with identical information that should look something like this.

Next we will transform both the “Posting_Date” and “Posting_Time” columns to reflect Date and Time seperatly. Right click on the respective columns, choose transform and the proper type.

If you click done at this point your sheet should reflect the changes and have a Date and Time column.

Accumulating Data

The first step in accumulating data is to click the “Load to data model” feature.

Because we enabled “Power Pivot” in Part 1 we can actually go check out the model we just loaded using the Manage feature. This is not necessary for accumulating data; you can accumulate directly in Excel without using Power Pivot. However I consider this a bad practice. Power Pivot will keep things compact in terms of size and there is no inherent row limitation as there is in an excel workbook. What happens to my query when we hit the one million row limit in Excel? I don’t really want to find out. It’s just one of those things to keep in mind when grabbing data and making a model.

Step two is to go over to the “Design” tab under “Table Tools” and take a look at the table name of the “Parameterized Query Results” sheet.

In our case it’s named Table4; we can change the name by going to the “Formulas” tab and clicking on the name manager. I would not do this; but it’s good to know about some of these features for the future.

Select Edit..

** You can change the name to something less arbitrary. This is not entirely necessary; but it does help others identify what is going on in Excel and Power Query if they are looking at your code.

Please note that if you change “Table1” or “Table3” table name at this point you will break the code because Power Query is looking for “Table3” and “Table1”. It is better to make these changes as you go from step to step in your query so that you don’t have unintended consequences later on. I’ll be leaving these names alone for now in this Excel and covering productionalizing Power Query code later on.

The next step is to open up the advanced editor for our “ParameterizedQuery” and add some custom code. The first thing to add is a new column; this is the date when the data was refreshed. It’s a nice to have; but is also a convenient way to see if your accumulation is working. Add a custom column to the end of your Power Query and in the pop up box after the =’s sign add this function; DateTimeZone.LocalNow() (you can also use DateTimeZone.UtcNow() ). The LocalNow function will get the current date time anytime this query is refreshed from Excel. Follow the same steps we performed earlier to make a Load Date and Load Time column.

The second to last thing we need to do is add some custom code to the end of our Power Query. This code pulls in Table4’s data (“Parameterized Query Results”) and combine’s the currently executing data with earlier contents.

GetCurrentData = Excel.CurrentWorkbook(){[Name=”Table4″]}[Content],
CombineWithExistingData = Table.Combine({GetCurrentData ,TransformedColumn1 })
in
CombineWithExistingData

If you click “Done” in Power Query and check out the data you will be able to see the distinction between those rows before we added our “Load_Date” and “Load_Time” columns.

Blank rows were jobs that were present in our table before our changes and the new rows all have a date and time associated with them. Pretty cool right?

We have one final step to perform; we introduced a bug by adding the combine functionality at the end of our query. Can you guess what it is? Hit refresh on the data again and I’ll bet you’ll notice the problem after scrolling around.

How did we get so many rows? After just three data refresh’s?

We are combining the data after the “Remove Duplicates” we previously setup. So we will get only unique values from our query when we refresh; but we will append those onto our table even if they already exist. This is not the behavior we originally intended; I want all the jobs to be unique even as we accumulate them. So head back in to the advanced query editor and “Remove Duplicates” for “JobKey”.

This will place the “Remove Duplicates” after the combines of existing and new contents and we will not store any duplicate jobs.

We have a parameterized accumulating query at this point; every time we refresh the data we get new unique jobs loaded to our model! Once again; it’s nice but … I would rather have this run automatically and then I can just open it up whenever I want to poke around for jobs. Also; since indeed is limiting us to 25 result I would want to run it very frequently in order to get the maximum number of results and miss as few jobs as possible. So next time we will cover how to schedule a refresh of our workbook automatically and finally focus in on some visualizations using Power View and Power Maps!

Continue to Part 5….

2 thoughts on “Part 4 – Power Query API – Accumulating Data Indeed!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s