Dataverse (formerly Common Data Service - CDS) is Microsoft’s unified data storage for all business applications marketed under the Power Apps umbrella. As many companies are using Microsoft’s technology stack, Dataverse is an important data source for analysts and data scientists. The goal of this package is to make it dead simple to get your data from (and to) your Dataverse Web API instance and leverage its rich metadata to automate data analysis tasks. The package offers a suite of pipeable, simple to use functions shielding the user from the unmemorizable syntax of OData requests and ensuring correct conversion of data types (looking at you picklists!).
To install simply run:
devtools::install_github("jarekkupisz/cdsR")
As you may notice the package is not on CRAN, there are no runnable examples and no tests folder. The reason is that for the package to work you need to initialize a connection to the Dataverse’s Web API. This is not currently possible outside your organization’s Microsoft Power App’s tenant and without creating an application user with unique and not sherable credentials.
Despite this, the package is fully featured, documented, and is ready for production-grade tasks. I am not affiliated with Microsoft in any way and this is not an official API by any means! However, as I use it daily in my work, I thought that other #rstats souls working with MS stack might find it very useful.
Why cdsR
? As the package was created by me months before MS decided to
rebrand Common Data Service, I just stayed with the old name. What is
more, Dataverse just has no sensible abbreviation which would work
nicely with the package’s naming convention.
If you want to get your data and use other functions in the package, first you need to authenticate. Simply run:
library(cdsR)
initialize_cds_token(
cds_instance_name = "prod", #choose any name you like for a Dataverse instance
cds_instance_url = "https://yourorg.crm3.dynamics.com",
tenant_id = "your tenant guid",
client_id = "your user guid",
client_secret = "your user pass",
web_api_version = "9.1" #choose any you like, 9.1 is the default for now
)
This function will create a cds_token
S3 object in the options()
of
your R session. You can set as many tokens as you like, as long as you
give each a unique cds_instance_name
. I typically create one for
“prod”, “testing” and “dev” instances. Get all initialized tokens with
getOption("cdsR.tokens")
.
They also automatically refresh! You don’t need to keep calling
initialize_cds_token()
. Just do it once at the start of your work or
in .Rprofile
. Before each function call the token checks how much time
passed before it got its credentials, and if too much, it will get fresh
ones!
Ready to get some data? If you know your entity’s logical plural name then:
download_cds("contacts")
Wait, that’s it? The short answer is yes! If you want more features though follow along.
First, how come download_cds()
knows where to connect? If there is
only one cdsR token set it will use it automatically. That’s just so
convenient in interactive sessions where you toy with your data. But
what if you have scripts that need to run on multiple Dataverse
instances or you want to write a readable, production-ready code? That’s
where the cds_instance_name
argument comes in. A cds_token
object
from options()
with the specified name will be selected:
download_cds("contacts", cds_instance_name = "prod")
The pattern below is something I use all the time if I want to download the same entity from multiple instances:
c("prod", "testing") %>% purrr::map(~ download_cds("contacts", .x))
Okay then but this query will simply pull all possible columns and
records but typically that’s not what we want. We want either a subset
of columns, apply a specific filter or just to download a few rows to
play with the data. Hopefully, you want one thing at a time, because
writing your own OData HTTP requests is one of those moments that make
you hate your job. Fortunately, other arguments of download_cds()
come
to the rescue:
download_cds(
"contacts",
select = c("fullname", "pa_salesscore", "_owninguser_value"),
filter = cds_filter_fun("LastXMonths", "createdon", PropertyValue = 3),
order_by = "pa_salesscore desc, fullname",
top = 100
)
Woah a complex OData query using only understandable R expressions! There’s a couple of things to unpack, so let’s dive in:
- Each of the presented arguments represents respective OData system
operators:
$select
,$filter
,$order_by
and$top
. You can pass any character vector that would be a valid part of an OData operand. If such vector had a longer length than 1, it would be collapsed to an OData expression (see aboveselect
). In the case offilter
the expressions will be pasted withand
similarly to dplyr’sfilter
.top
of course requires a single integer. - The toughest part of these requests are filtering
functions.
cdsR
provides a helpercds_filter_fun()
that takes typically 2 strings – a Web API filter function name and a field on which you wish to subset. For certain conditions you’ll need more than 2, then pass named arguments as the…
. - The function will monitor its progress after each page of the results is downloaded (every 5k rows). Unfortunately, there is no quick and reliable way of getting the count of records for bigger entities, that’s why a percentage of completion estimate is off the table.
So the result that you get from download_cds()
loos like a tibble
but when you investigate closely you can see some pleasant surprises.
This all comes down to the fact that the package strips off the
weirdness of some of the OData data types and converts everything into
human and R readable format:
- All enum type fields such as picklists, state, status fields are converted to character vectors representing option set labels.
- Lookup fields get special treatment. Each of them is represented as
two columns: one with an original name containing
GUID
s character vectors and one with a display attribute of the target entity suffixed with.display
. If you want to know which user is a record’s owner look for anowninguser
column to get their GUID andowninguser.display
for their name. - Multiselect picklists are represented as a list of character vectors. Each entry in a vector represents a choice made in a field for a row.
- All date and datetime fields are converted to POSIX vector’s using
lubridate
magic. Remember that the Dataverse works only in UTC! - Boolean flags are converted to logical vectors.
- Nulls are converted to
NA
s.
Apart from sensible type conversions that allow you to easily use the
data, there is more utility hiding in data frames returned by
download_cds()
. Actually, I lied to you as technically what you get is
a cds_df
object, which is a tibble with some addons. In the attributes
of cds_df
s’ you’ll find 3 entries that contain the whole Dataverse
metadata
of the entity you just downloaded!
There is a helper function get_cds_md()
which allows you to specify
which definition you wish to extract with which_md
argument:
”entity”
the default, represents a tibble (an actual one this time!) with MD about the entity itself”attributes”
gets you a tibble with MD containing all information about entity’s fields”picklists”
returns a list of tibbles for each enum/picklist type field in thecds_df
holding pair of keys and labels for each option
Let’s give this rich metadata a spin. Let’s assume you forgot what is the display name for the entity you are working with:
contacts <- download_cds("contacts")
contacts %>% get_cds_md() %>% dplyr::select(DisplayName)
This time you want to select all picklists for your model to run on:
contacts %>%
select(
get_cds_md(., "attributes") %>%
filter(AttributeType == "Picklist") %>%
pull(LogicalName)
)
As long as the info you need hides somewhere in the Dataverse’s metadata you can use it in your code.
Manipulating data in your session is fun for our analytical minds, but the most potent way we can contribute to our organizations is to return the results of our numerical work to those who will use it. As all Power Apps work on Dataverse, updating records in your instance will benefit all of your business applications at once! I hate the word, but this is a powerful concept from Microsoft.
You’ve just trained a customer churn prediction algorithm and want to
help your sales reps by sorting their contact list by a propensity to go
away. update_cds_records
is your buddy this time. We’ll use contacts
entity we downloaded earlier:
contacts %>%
inner_join(churn_algo_results, by = "contactid") %>%
select(contactid, ap_churnrisk = prob_to_churn) %>%
update_cds_records(
cds_instance_name = "prod",
fail_safe_entity_name = "contact",
force_operation = TRUE
)
This call will even give you an ETA for the end of the update job and warn about any errors no the way. This function though works a bit differently:
cds_instance_name
has to be always specified this time. Because we are launching potentially a destructive operation (and anyone who tried to recover data from Dataverse gets MS hell at its “finest”) the package needs to make sure you know what you are doing.fail_safe_entity_name
serves basically as another defence to save you from dire consequences. This function is easy to call at the end of the pipe and I got used to providing the first argument too quickly. This argument forces you to explicitly type which Dataverse table you want to update and the function will confirm withcds_df
’s metadata that indeed you are good to go.force_operation
is yet another fail-safe. By default, it isFALSE
and it uses Dataverse Web API optimistic concurrency control to prevent you from updating records that might have been changed before you downloaded them. If you don’t wish to overwrite any changes you’ll need to include the@odata.etag
column which is included in everydownload_cds()
result.
Ok then, simple enough for small requests like the one above. What about
state
column? Picklists, multipicklists, dates? They all need to be
parsed to integer keys or specially formatted strings for the Web API to
accept a POST call, right? Well update_cds_recrods()
does everything
for you! Simply use the human-readable labels that were returned to you
with download_cds()
and pass the mutated and filtered cds_df
you are
working with to update_cds_records()
! There are a few quirks though:
- Lookups don’t work that nicely. This is something I will try to
improve, but the only way to change the value in the lookup is to
create a column named
lookup_relationship_name@odata.bind
and have a character vector with the following template for values:target_entity_plural_name(GUID _of_the_target_record)
. An example would be
contacts %>%
mutate(
`ownerid@odata.bind`="systemusers(2zcaa784-0dd5-dr11-81e4-123c2981575g)"
)
- When launching the
update_cds_records()
a provided data frame has to be acds_df
with all the metadata in itsattributes()
, otherwise, the function errors.dplyr
’s verbs, unfortunately, get rid of attributes. That’s whycdsR
registers S3 generic methods formutate
,*_join
,select
and all else that was possible to register to preserve attributes. In my daily work I rarely stumble onto this problem, but if you do then you can useis.cds_df()
to troubleshoot if you are still dealing with acds_df
. - If you wish to work with integer keys instead of character labels
for enum fields for any reason (I have not found a good one yet, so
please let me know when you do!) you can leverage
parse_cds_enums(cds_df, to = "key")
Cleaning time! Duplicates and legacy records are a common plague in any CRM-like application. As a data scientist, you’ve just trained an ML model that recognizes low utility entries. Just pipe away:
contacts %>%
filter(contactid %in% cleaning_algo_results$junk_contacts_guids) %>%
delete_cds_records("prod", "contact")
All the fail-safe features are the same as in the
update_cds_recrods()
. For example, the above code would not remove any
rows that were changed before you used download_cds()
to get your
contacts. Please refer to the function’s documentation or previous
section for the details.
This seems like the least useful operation, however, if you worked with
any CRM-type system in the past you know that some kind of data
migration/integration awaits you. That’s why I decided to include a
simple function create_cds_recrods()
. It allows you to create new
records from a fresh data frame. This function does not include any of
the fancy type conversions of update_cds_records()
. I suggest
leveraging get_cds_md()
on a cds_df
representing a Dataverse table
you wish to produce records in.
new_contacts <- tribble(
~firstname, ~lastname, ~department,
"Andy", "Doe", 1,
"John", "Smith", 3
)
create_cds_records(
new_contacts,
cds_instance_name = "prod",
target_entity_set_name = "contacts"
)