Automate governance in Microsoft Teams: Implementation, v1

Part 3 of the series already, so it's finally time to go into some technical details. I'll start with the first attempt at the application, in which we made some bad decisions and where the technology failed us. In production only, of course...

Technologies & Components

ASP.NET Core Web Application with React.js

A Teams app is in essence just a website that is loaded inside the Microsoft Teams application (web, desktop and now even mobile). We ended up going for an ASP.NET Core Web Application with React.js front end, due to various reasons:

The template is provided by Visual Studio
We had no previous experience with Node.js, Visual Studio Code or other front-end only tools
We had limited experience with React.js from previous SharePoint Framework (SPFx) projects
The SDK to let the website interact with Microsoft Teams is only available for JavaScript
We didn't know about the Yeoman Generator for Microsoft Teams, and the ability to integrate SharePoint Framework web parts into Microsoft Teams wasn't available yet

It provided us with a way to serve up the application integrated within Microsoft Teams AND as a standalone website. This was particularly important because there was no Tenant Apps Catalog yet, and no official word on whether or when it would come. Counting on it was an educated guess, so we developed with side-loading and launched with just a website.

Authentication was another key point that was giving us trouble. Most of the samples for tab authentication talk about using the MSAL.js library, but there were two major issues (then, not sure about now):

MSAL didn't support a "silent" authentication flow. Users are already logged in to the Microsoft Teams application and we didn't want them to log in again specifically for our application.
MSAL only allowed giving consent for Microsoft Graph, but we had other components (like an Azure Function) to which we wanted access from the application.

We had no choice but to use ADAL.js to cover both use cases. Luckily, there was one sample covering this, but only for Node.js applications. It was a challenge to transform this into a strongly typed, TypeScript-based React.js application.

The Microsoft Teams Client SDK is the last important library used in the website. It helps with authentication, but also passes context information from the Microsoft Teams application to your application. It had only just become available when we started, and a lot seems to have changed since. We mostly used it to get the theme information so that our application looks good in dark and high-contrast modes.
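As an illustration of the theme handling described above, here is a minimal sketch, not the code from the project. It assumes the three theme names the Teams client reports ("default", "dark" and "contrast"); the style sheet file names and the `applyStyleSheet` helper mentioned in the comments are hypothetical.

```typescript
// Theme names reported by the Teams client in its context.
type TeamsTheme = "default" | "dark" | "contrast";

// Hypothetical style sheet names, one per theme.
const themeStyleSheets: Record<TeamsTheme, string> = {
  default: "teams-default.css",
  dark: "teams-dark.css",
  contrast: "teams-contrast.css",
};

// Map whatever string the SDK reports to a style sheet,
// falling back to the default theme for unknown or missing values.
function styleSheetForTheme(theme: string | undefined): string {
  switch (theme) {
    case "dark":
      return themeStyleSheets.dark;
    case "contrast":
      return themeStyleSheets.contrast;
    default:
      return themeStyleSheets.default;
  }
}

// Inside a tab this would be wired up roughly like this (not executed here,
// applyStyleSheet is a hypothetical helper that swaps the <link> element):
//   microsoftTeams.initialize();
//   microsoftTeams.getContext(ctx => applyStyleSheet(styleSheetForTheme(ctx.theme)));
//   microsoftTeams.registerOnThemeChangeHandler(t => applyStyleSheet(styleSheetForTheme(t)));
```

Keeping the mapping in a pure function means it can be tested outside of Teams, which also helps for the standalone website scenario.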

The UI uses the components from the Office UI Fabric React toolkit, with custom CSS overriding the styles to make them look native to Microsoft Teams (actually three style sheets, one for each theme). In the meantime, the App Studio for Microsoft Teams and the associated App Studio Controls got released. They are probably a better choice now.

Our application would post data to an Azure Function and to Microsoft Graph, and it would use the same Azure Function to get data from Microsoft Graph. To improve performance, we cached the data it received in the localStorage of the browser.
Good idea, bad execution. Who knew that localStorage has a size limit? Who would have guessed we'd hit the size limit on day 3 of being in production?
We did learn about the size limit, but too late to change it before go-live. It was deemed an acceptable risk: we wouldn't hit the limit before we put out a bug fix release. We were wrong.
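To make the failure mode concrete: browsers throw an exception (typically a QuotaExceededError) from `setItem` once the storage for the origin is full. The following is a minimal sketch of a cache wrapper that survives this, written for illustration and not taken from the project. The `SafeCache` class, the `KeyValueStore` interface and the expiry handling are all assumptions.

```typescript
// Minimal subset of the Web Storage API that the cache needs.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

interface CacheEntry<T> {
  expires: number; // epoch milliseconds
  value: T;
}

class SafeCache {
  private store: KeyValueStore;
  private now: () => number;

  constructor(store: KeyValueStore, now: () => number = Date.now) {
    this.store = store;
    this.now = now;
  }

  // Returns true when stored, false when the store rejected the write.
  set<T>(key: string, value: T, ttlMs: number): boolean {
    const entry: CacheEntry<T> = { expires: this.now() + ttlMs, value };
    try {
      this.store.setItem(key, JSON.stringify(entry));
      return true;
    } catch (err) {
      // Storage is full (or unavailable): drop any older copy so readers
      // fall back to the network instead of seeing outdated data.
      this.store.removeItem(key);
      return false;
    }
  }

  // Returns the cached value, or undefined when missing, expired or corrupt.
  get<T>(key: string): T | undefined {
    const raw = this.store.getItem(key);
    if (raw === null) return undefined;
    try {
      const entry = JSON.parse(raw) as CacheEntry<T>;
      if (entry.expires <= this.now()) {
        this.store.removeItem(key);
        return undefined;
      }
      return entry.value;
    } catch (err) {
      this.store.removeItem(key);
      return undefined;
    }
  }
}
```

In the browser this would be constructed as `new SafeCache(window.localStorage)`. A failed write then only costs an extra network call instead of breaking the page.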

Microsoft Flow / Azure Logic App

Initially, we started with Microsoft Flow for the orchestration of the request process. A request for a new Teams team would come in and a Microsoft Flow would orchestrate the provisioning. Very early in the development, we switched to Azure Logic Apps instead.

The biggest reason for switching was the portability of the "workflow". While it was possible with Microsoft Flow, we found the process of exporting and deploying into multiple environments much easier with Logic Apps. It allowed us to store the Logic Apps template in source control and make adjustments through Visual Studio.

The workflow itself was rather simple:

Request comes in, Office 365 Group gets created
Naming conventions get applied
Group gets promoted to a Microsoft Teams team
Owners and Members get added
Specific set of channels gets added
Group expiration policy gets applied

It was so simple that when it failed, there was no way it could recover. Azure Logic Apps has resubmit functionality, but the flow was built with only the happy path in mind. That's not a good design choice when you depend on external and mostly beta APIs. We learned that the hard way.
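To show what "more than a happy path" could look like, here is a minimal sketch of a step runner that retries failing steps and can resume after a failure without repeating finished work. It is a generic illustration in TypeScript, not a Logic Apps definition and not the design used in the project. The `Step` and `RunResult` types and the step names are assumptions, and every step is assumed to be safe to run again.

```typescript
// One unit of work in the provisioning sequence.
// `run` must be safe to call again after a failure.
interface Step {
  name: string;
  run: () => Promise<void>;
}

interface RunResult {
  completed: string[];  // names of the steps that finished, in order
  failedStep?: string;  // first step that exhausted its retries, if any
}

// Runs the steps in order, skips those completed in an earlier attempt
// and retries each failing step a limited number of times.
async function runSteps(
  steps: Step[],
  alreadyCompleted: string[] = [],
  maxAttempts: number = 3
): Promise<RunResult> {
  const completed = alreadyCompleted.slice();
  for (const step of steps) {
    if (completed.indexOf(step.name) !== -1) continue; // resume: skip finished work
    let succeeded = false;
    for (let attempt = 1; attempt <= maxAttempts && !succeeded; attempt++) {
      try {
        await step.run();
        succeeded = true;
      } catch (err) {
        // A real implementation would log the error and wait before retrying.
      }
    }
    if (!succeeded) return { completed, failedStep: step.name };
    completed.push(step.name);
  }
  return { completed };
}
```

Storing `completed` with the request (for example on the list item) would let a resubmit continue from the failed step instead of creating a second Group.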

Microsoft Graph

Microsoft Graph was our "one API endpoint to rule them all", but while that's a great marketing slogan, we had our fair share of issues. Microsoft Teams wasn't an application defined by its APIs: for a long time there was nothing available. We even started development on the promise that APIs would come, beginning with just creating Office 365 Groups and adding the Teams part later when the API became available (reverse engineered from the PowerShell cmdlets, because no documentation existed yet).

These issues aside, we used Graph for everything: creating Groups, adding Teams, configuring Teams, adding Channels, listing and searching for Groups, etc.
We used the Microsoft Graph Schema Extensions to store the additional metadata on the Group object. It was a difficult choice:

Schema Extensions allow filtering on the custom data, but don't have support for multi-value properties
Open Extensions don't allow filtering on the custom data, but do have support for multi-value properties

We decided that we could work around the multi-value issue, but we needed filtering to support all the search use cases we defined. I'd choose differently now, but that's for the next post.
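The trade-off can be sketched as follows. Filtering on a schema extension property uses the form `extensionId/property eq 'value'` in the `$filter` query option. The extension id `extabc123_teamsGovernance`, the property names and the semicolon-delimited packing of multiple values are all hypothetical. The packing is only one possible workaround, not necessarily the one we used.

```typescript
// Hypothetical schema extension id.
const EXTENSION_ID = "extabc123_teamsGovernance";

// Escape single quotes the way OData expects inside string literals.
function odataString(value: string): string {
  return "'" + value.replace(/'/g, "''") + "'";
}

// Build a Groups query that filters on one single-value custom property.
function groupsFilterUrl(property: string, value: string): string {
  const filter = `${EXTENSION_ID}/${property} eq ${odataString(value)}`;
  return "https://graph.microsoft.com/v1.0/groups?$filter=" + encodeURIComponent(filter);
}

// Possible workaround for the missing multi-value support:
// pack the values into one delimited string property.
function packValues(values: string[]): string {
  return values.map(v => v.trim()).filter(v => v.length > 0).join(";");
}

function unpackValues(packed: string): string[] {
  return packed.length === 0 ? [] : packed.split(";");
}
```

Note that a packed property can only be matched as a whole with `eq`, so finding one value inside it still has to happen after the data is retrieved.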

Azure Functions

One more thing I regret doing in this version of the application, but at the time it seemed the best solution: creating an Azure Function that would act as a proxy to Microsoft Graph. So instead of going directly to Microsoft Graph, all requests (from both the Azure Logic App and the ASP.NET website) would go to our Azure Function and then get passed on to the Graph.

This way, we could do all our requests from Azure Logic Apps with a client id and client secret. All of the Microsoft Teams APIs in the beta endpoint only supported delegated permissions, but we wanted to use application permissions for everything. The Azure Function then did the call with a service account in the background.

An additional advantage was the ability to create our own endpoints that combined multiple Graph endpoints into one call.

We also used it to bypass pagination: the Azure Function would do all the paging and then return the whole dataset for Users and Groups.
Microsoft Graph doesn't support wildcard full-text search queries against the Groups endpoint, so with the whole dataset we could do the querying at the client side. It seemed like a good idea. It wasn't.
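For reference, the approach described above boils down to something like the following sketch. Microsoft Graph returns the URL of the next page in the `@odata.nextLink` property of a paged response. The `fetchPage` function is injected and hypothetical, so that HTTP and token handling stay out of the picture, and `searchByName` is an assumed helper for the client-side query.

```typescript
// Shape of a paged Microsoft Graph collection response.
interface GraphPage<T> {
  value: T[];
  "@odata.nextLink"?: string;
}

// Injected so the sketch does not depend on a specific HTTP client.
type PageFetcher<T> = (url: string) => Promise<GraphPage<T>>;

// Follows @odata.nextLink until the collection is exhausted
// and returns every item in one array.
async function fetchAllPages<T>(firstUrl: string, fetchPage: PageFetcher<T>): Promise<T[]> {
  const items: T[] = [];
  let url: string | undefined = firstUrl;
  while (url !== undefined) {
    const page: GraphPage<T> = await fetchPage(url);
    for (const item of page.value) items.push(item);
    url = page["@odata.nextLink"];
  }
  return items;
}

// Case-insensitive "contains" search over the full dataset on the client.
function searchByName<T extends { displayName: string }>(items: T[], term: string): T[] {
  const needle = term.toLowerCase();
  return items.filter(item => item.displayName.toLowerCase().indexOf(needle) !== -1);
}
```

The code is short, which is what made it tempting. The cost is hidden in the data: every caller pays for every page, and the response grows with the number of Users and Groups in the tenant.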

The Azure Function was slow, so slow that at times we questioned whether the application was still online. Cold start of an Azure Functions v1 application is very slow. Combine this with the enormous datasets we had to pass over the wire (because of no pagination) and it became a disaster.

SharePoint

While Azure Logic Apps was responsible for the orchestration, the trigger came from SharePoint. All requests were stored in a SharePoint list and SharePoint also stored some additional configurations:

list of countries
list of business units and departments
types of Teams with specific settings

Most issues up until now were our own doing, but in this case the technology failed us. The SharePoint trigger in Logic Apps should start on a new entry in the SharePoint list, but we saw more and more weird issues:

trigger happening more than once for the same list item, within seconds of each other
trigger happening once but missing all context information, it didn't pass in the title, id or any other field of the list item in SharePoint
update actions to the SharePoint list items randomly failing with "item does not exist in the list"

Another thing that we needed to address, as if there was nothing else already on the to-do list. Lucky us.
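One way to defend against the first two symptoms is a guard in front of the workflow that rejects incomplete payloads and ignores list items it has already seen. This is a minimal sketch of the idea, not what we built. The `TriggerPayload` shape and the `TriggerGuard` class are assumptions, and the in-memory set would have to be replaced by durable state (for example a status column on the list item) to survive restarts.

```typescript
// Payload a trigger is expected to deliver.
// Fields may be missing when the trigger misfires.
interface TriggerPayload {
  id?: number;
  title?: string;
}

type Decision = "process" | "duplicate" | "incomplete";

class TriggerGuard {
  // In-memory only: a real guard needs durable storage.
  private seen: Set<number> = new Set<number>();

  decide(payload: TriggerPayload): Decision {
    // Reject triggers that arrive without the context we need.
    if (payload.id === undefined || !payload.title) return "incomplete";
    // Ignore repeated triggers for the same list item.
    if (this.seen.has(payload.id)) return "duplicate";
    this.seen.add(payload.id);
    return "process";
  }
}
```

Only a "process" decision would start the provisioning. The other two outcomes could be logged for investigation.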

Conclusion

It broke, big time. We should have done a lot of things differently from a technical point of view, but we also learned some other lessons:

a test environment is only representative if it contains the same amount of data as production
people need to be forced to test, with real test scenarios
there is nothing wrong with a soft launch with people who understand that things can break; our awesome big bang launch party turned into very long days of fixing stuff
beta APIs don't belong in production. It's not that we weren't warned, we just couldn't wait.

The application became a success eventually, with some major improvements in v2. Continue the series to see how we addressed these shortcomings.

This post is part of the series Automate governance in Microsoft Teams: