Problem
The vendor provides software for managing employee engagement, including survey-based feedback collection for large organizations with hundreds or thousands of employees. It was rolling out an AI-powered feature to analyze employee comments and summarize insights on key themes and sentiments. Before the rollout, the vendor wanted assurance that the system was reliable: that its summaries accurately represented the sentiments expressed in the comments and that, as an HR tool, it did not exhibit demographic bias. As a SaaS company using a commercial generative AI platform, it lacked the internal expertise to conduct a comprehensive evaluation, so it sought an external provider that could give confidence that the system was being deployed responsibly and in awareness of relevant HR regulation.
Armilla AI’s Solution
Armilla assessed the correctness of the system's outputs, including a review of "hallucination" performance, testing the system's propensity to invent or misrepresent information, an important concern with generative AI. Data and ground-truth labels were lacking: the customer had no AI team and was not accustomed to testing AI models. Armilla therefore worked from a small, unlabeled data sample to generate proxy labels, synthesize additional data, and impute demographic labels in accordance with accepted standards, in order to support robust testing of the system across the relevant areas. We also performed a bias assessment compatible with New York City Local Law 144 on automated employment decision tools. Although this tool would not have been strictly covered by the law, we often use the law as a standard when conducting such assessments. The excerpt from the report below shows "impact ratio" calculations in accordance with the law, demonstrating variation between demographic categories low enough to be acceptable under it.
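To make the impact-ratio calculation concrete, here is a minimal sketch of the selection-rate variant used under Local Law 144: each category's selection rate is divided by the highest selection rate across categories. The function name, the group labels, and the counts below are hypothetical illustrations, not figures from the actual assessment.

```python
# Minimal sketch of a Local Law 144-style impact ratio calculation
# (selection-rate variant). All data below is hypothetical.

def impact_ratios(selections: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Map each category to its impact ratio.

    `selections` maps category -> (favorable_count, total_count).
    Selection rate = favorable / total; the impact ratio for a category
    is its selection rate divided by the highest selection rate across
    all categories, so the most favored category always scores 1.0.
    """
    rates = {cat: fav / tot for cat, (fav, tot) in selections.items()}
    best = max(rates.values())
    return {cat: rate / best for cat, rate in rates.items()}

# Hypothetical counts: (favorable outcomes, group size)
data = {"group_a": (40, 100), "group_b": (36, 100), "group_c": (38, 100)}
ratios = impact_ratios(data)
```

A common rule of thumb (the "four-fifths rule", which the law itself does not mandate as a threshold) flags ratios below 0.8 for further scrutiny; in this toy example all groups clear that bar.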
Outcome
Armilla identified weaknesses in the consistency of summary generation, traced to how the data was being provided to the model, and recommended corrective action before the final rollout of the feature. We also provided evidence of demographic fairness, giving the client's product team confidence that the model was being deployed responsibly.
“We needed absolute confidence that our AI features could process employee feedback accurately and without bias. Armilla AI exceeded our expectations with a thorough assessment that went beyond standard testing, offering clear insights into how the system performs across diverse demographics and scenarios. Thanks to their expertise, we launched with confidence, knowing we were delivering a solution that truly serves our customers and their employees.”