Microsoft uses unsupervised learning techniques to extract knowledge about cloud service disruptions. In a paper published on the preprint server Arxiv.org, researchers describe SoftNER, a framework that Microsoft deployed internally to collect information on 400 storage, computing and other cloud failures. They claim that it removes the need to annotate large amounts of training data while scaling to a high volume of timeouts, slow connections, and other product interruptions.
Structured information has an inherent value, especially in the areas of cloud and high-stakes web operations. Not only can it be used to create AI models tailored for tasks such as triaging, it can also save time and effort for engineers by automating processes such as performing resource checks.
The SoftNER framework extracts knowledge by analyzing unstructured text, recognizing entities in failure descriptions and classifying those entities into categories. It consists of components that identify structural patterns in the descriptions to bootstrap training data, together with label propagation and a multitask model that generalize beyond the patterns and extract entities from the descriptions.
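The bootstrapping idea can be illustrated with a minimal sketch. It assumes that one useful structural pattern is a "Name: value" line, and that a name seen in several descriptions is a plausible entity type. The regular expression, the `bootstrap_entities` function, and the `min_support` threshold are hypothetical choices made for illustration, not details taken from the paper.

```python
import re
from collections import Counter

# Assumed structural pattern: a line of the form "Some Name: some value".
KEY_VALUE_RE = re.compile(r"^\s*([A-Za-z][A-Za-z ]{1,40}?)\s*:\s*(\S.*?)\s*$")


def bootstrap_entities(descriptions, min_support=2):
    """Return (candidate entity types, weakly labeled (type, value) pairs).

    A key becomes a candidate entity type only if it appears at least
    `min_support` times across all descriptions (a hypothetical threshold).
    """
    pairs = []
    for text in descriptions:
        for line in text.splitlines():
            match = KEY_VALUE_RE.match(line)
            if match:
                key = match.group(1).strip().lower()
                pairs.append((key, match.group(2)))

    support = Counter(key for key, _ in pairs)
    entity_types = {key for key, count in support.items() if count >= min_support}
    labeled = [(key, value) for key, value in pairs if key in entity_types]
    return entity_types, labeled


# Toy descriptions, invented for this example.
descriptions = [
    "Status Code: 500\nRegion: westus\nVM failed to start",
    "Status Code: 404\nOwner: alice",
]
entity_types, labeled = bootstrap_entities(descriptions)
print(entity_types)
print(labeled)
```

Pairs gathered this way are only weak labels. In the framework described above, label propagation and the multitask model are the parts that generalize beyond what such patterns catch.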
SoftNER begins each run by cleaning noisy data. It collects incident statements, conversations, stack traces, shell scripts and summaries from sources such as Microsoft customers, feature engineers, and automated monitoring systems, and normalizes the descriptions by pruning tables with more than two columns and removing extraneous markup (such as HTML tags). The descriptions are then split into sentences, and the sentences into words.
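The cleaning steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: tables are assumed to be pipe-delimited text rows, HTML tags are removed with a crude regular expression, and sentences are split naively on end punctuation. The function names are hypothetical and do not come from the paper.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")         # crude HTML tag matcher (assumption)
SENT_RE = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary
TOKEN_RE = re.compile(r"\S+")           # whitespace tokens keep IPs and URLs whole


def drop_wide_table_rows(text, max_columns=2):
    """Drop pipe-delimited rows with more than `max_columns` cells.

    The pipe-delimited table format is an assumption made for this sketch.
    """
    kept = []
    for line in text.splitlines():
        cells = line.strip().strip("|").split("|")
        if "|" in line and len(cells) > max_columns:
            continue
        kept.append(line)
    return "\n".join(kept)


def preprocess(description):
    """Return the description as a list of sentences, each a list of tokens."""
    text = drop_wide_table_rows(description)
    text = TAG_RE.sub(" ", text)              # remove markup
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    sentences = [s for s in SENT_RE.split(text) if s]
    return [TOKEN_RE.findall(s) for s in sentences]


# Toy incident text, invented for this example.
raw = "<p>VM failed to start.</p>\nIP: 10.0.0.1 unreachable!\n| a | b | c |\n"
print(preprocess(raw))
```

Splitting tokens on whitespace rather than on punctuation is a deliberate choice here, since values such as IP addresses and URLs contain dots and slashes that a finer tokenizer would break apart.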
After identifying entities (e.g., problem types, exception messages, storage locations and status codes) and their data types (for IP addresses, URLs, subscription IDs, etc.), SoftNER propagates the entity types of the values to all incident descriptions. For example, if the IP address “1
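The two steps described here, data type identification and propagation of entity types across descriptions, can be sketched as follows. This is an assumption-laden simplification: values are single tokens, propagation is by exact match, and subscription IDs are assumed to be GUID-shaped. The names `data_type` and `propagate_labels`, the type labels, and the sample values are all hypothetical.

```python
import ipaddress
import re
import uuid
from urllib.parse import urlparse


def data_type(value):
    """Best-effort data type for a raw token; the labels are illustrative."""
    try:
        ipaddress.ip_address(value)
        return "ip_address"
    except ValueError:
        pass
    try:
        uuid.UUID(value)  # assumes subscription IDs look like GUIDs
        return "guid"
    except ValueError:
        pass
    try:
        parsed = urlparse(value)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return "url"
    except ValueError:
        pass
    if re.fullmatch(r"\d+", value):
        return "number"
    return "text"


def propagate_labels(seed_pairs, tokenized_descriptions):
    """Tag every token whose text exactly matches a seed value.

    seed_pairs: iterable of (entity_type, value) pairs from bootstrapping.
    Returns one list of (token, entity_type or "O", data_type) per description.
    """
    value_to_type = {value: entity_type for entity_type, value in seed_pairs}
    return [
        [(tok, value_to_type.get(tok, "O"), data_type(tok)) for tok in tokens]
        for tokens in tokenized_descriptions
    ]


# Hypothetical seed: this address was labeled "source ip" in one incident.
seeds = [("source ip", "203.0.113.7")]
other_incident = [["ping", "to", "203.0.113.7", "failed"]]
print(propagate_labels(seeds, other_incident))
```

Exact-match propagation is the simplest possible reading of this step. A real system would also need to handle multi-token values and conflicting labels for the same value, which this sketch ignores.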
In experiments, the researchers evaluated SoftNER’s performance by applying it over a period of two months to 41,000 incidents at Microsoft from “large online systems” with “a wide distribution of users”, with descriptions averaging 472 words each. They report that the framework managed to extract 77 valid entities per 100 descriptions with an accuracy of over 96% (averaged over 70 different entity types). They also say that SoftNER is accurate enough to perform automatic triaging at Microsoft.
The researchers say they plan to use SoftNER in the future to evaluate bug reports and improve existing incident reporting and management tools. “Incident management is an integral part of building and operating large scale cloud services,” they write. “We show that the extracted knowledge can be used to create much more accurate models for critical incident management tasks.”