AIML Blog

Last modified by Julia Gilmore on 2025-01-29, 12:03

Dec 10 2025

OCUL Strengthens AI Strategy Through Collaborative Advisory Committee

The Ontario Council of University Libraries (OCUL) has formed a new AI Advisory Committee – an important step toward advancing the responsible use of AI across Ontario’s academic libraries.

Representing a range of university sizes and locations, this ad hoc committee will shape a longer-term plan for engagement with AI at the consortial level and assess the sustainability of OCUL’s current AI initiatives.

Committee Members

Following a call for nominations from the OCUL membership, nine individuals have been appointed to the AI Advisory Committee. They represent a diverse range of university sizes, locations and library roles, from technology specialists to library leadership to metadata experts.

  • Andrew Colgoni (Brock University)
  • Melanie Parlette-Stewart (University of Guelph)
  • Qing (Jason) Zou (Lakehead University)
  • Casey Hoeve (McMaster University)
  • Scott Johns (OCAD University)
  • Yoo Young Lee (University of Ottawa)
  • Amaz Taufique (University of Toronto)
  • Weijing Yuan (University of Toronto)
  • Beth Sandore Namachchivaya (University of Waterloo)

“OCUL remains steadfast in our commitment to collaboration and innovation,” explains Amy Greenberg, OCUL Executive Director. “This new committee represents an exciting opportunity to ensure AI is leveraged thoughtfully and sustainably at OCUL in support of academic research and learning.” 

The committee begins its work in early 2026 and will also serve in an advisory capacity to the OCUL AI and Machine Learning leadership team, providing guidance for current and future capacity building related to AI across OCUL member libraries.

For More Information

Questions about the OCUL AI Advisory Committee can be emailed to Amy Greenberg at amy.greenberg@ocul.on.ca.

Nov 18 2025

Enhancing Digital Accessibility through Automated Transcription: The Whisper Pilot Project

By Gabby Crowley, Client Services Librarian, Scholars Portal

Transcriptions of audio and video materials are essential for digital accessibility, supporting users who are Deaf, hard of hearing, or who prefer text-based formats. As part of the OCUL AIML Initiative, OCUL and Scholars Portal have launched a pilot project using Whisper, an automatic speech recognition system developed by OpenAI, to transcribe audio files. The pilot sets up a pipeline through Scholars Portal and provides access and technical documentation to OCUL member libraries that wish to generate their own transcripts without the overhead of installing and managing the tool.

Scholars Portal hosts a local instance of Whisper that processes the files transferred via Globus, a secure research data transfer and sharing platform. Each institution has designated folders in Globus for uploading and accessing files, along with an options.txt file that defines settings such as language (English, French, or auto-detect), model type, and processing parameters. When files are uploaded, Whisper automatically generates transcripts according to these specifications.  
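To give a sense of how such a settings file might look, here is a hypothetical sketch of an options.txt. The field names and values below are illustrative only; the actual file format used by the Scholars Portal pipeline may differ.

```
# Hypothetical options.txt sketch — field names are illustrative
language: auto        # en, fr, or auto-detect
model: medium         # Whisper model size, e.g. tiny / base / small / medium / large
output_format: srt    # transcript format, e.g. txt, srt, vtt
```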

We are testing Whisper using collections from the University of Toronto Archives and Records Management Services and the eCampusOntario Open Library. In addition to testing and documenting the process, we are evaluating the quality of the transcripts for accuracy, punctuation, style, timing, and language detection. After some internal testing, Scholars Portal staff Rachel Wang and Meghan Xu adjusted Whisper’s language detection settings to improve French-language accuracy and reduce repetition and “hallucination” errors. 

Ten staff from OCUL libraries have volunteered to do transcript evaluation, and a final report will be presented to the OCUL Council and the AIML team in early 2026.  

The goal of this work is not just to test Whisper, but to better understand the workflow needed for Scholars Portal to mount and host open-source AI workflows for member libraries to access, explore, and evaluate. We hope that insights gained from this project will inform best practices for deploying machine learning tools in a sustainable, privacy-conscious, and service-oriented way, laying the groundwork for expanded accessibility initiatives and innovative applications of AI across the OCUL community.

 

Nov 04 2025

Government Documents Metadata

By Stefania Kuczynski, Scholars Portal student, on behalf of the OCUL Gov Docs Project Team.

The ambiguities that have made cataloguing standards hard to, well, standardize, are not going anywhere in the age of AI. We are very aware of this as we work on the Gov Docs project, an OCUL project designed to enhance metadata for over 50,000 digitized government documents. 

For the last month we’ve been testing various AI models (including Qwen, GPT, Mistral, and Gemma), prompting them to give us a set of metadata elements based on the OCR’d text of a scanned document. While different models often give different results, mistakes are not always because of the models!  After all, some metadata questions are both technical and philosophical: We have many documents produced by a Ministry, but that also have a specific named author. Other times, the Ministry itself will be the only named author. For our purposes, would it make sense for both of them to be authors?  Is the first case a standard author and the second a Corporate Author? In both cases, the Ministry is also the publisher. Should these all be captured under a more general “Responsible entities”? We would like to grab as much information as possible, but is it helpful to differentiate these types of authors if we’re not able to do it consistently (because the source materials are not consistent)? Many of these issues only come up as we encounter them, and we are fortunate to have a community of government information experts to consult with. 

To generate the metadata, we feed a prompt to a chosen model, asking it to extract information from the text. Each model has its own quirks, and the prompts require editing, testing, and fine-tuning as we add more and different types of documents. An example of a prompt related to a publisher is:

For publisher, look for the name of the organization, company, government body, or individual explicitly listed as the publisher of the document. This is often found on the title page, cover, or near the end of the text, and may be preceded by phrases such as "Published by" or "Printed by" or the copyright symbol.
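The surrounding pipeline can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the fake_model function stands in for whichever model is being tested (Qwen, GPT, Mistral, or Gemma), and the field names and template wording are assumptions for the example.

```python
import json

# Illustrative prompt template, modeled on the publisher instruction above.
PROMPT_TEMPLATE = """Extract metadata from the document text below and answer
with a JSON object containing the keys "title" and "publisher".
For publisher, look for the organization, government body, or individual
explicitly listed as the publisher, often preceded by "Published by",
"Printed by", or a copyright symbol.

Document text:
{ocr_text}
"""

def build_prompt(ocr_text: str) -> str:
    """Fill the template with the OCR'd text of one document."""
    return PROMPT_TEMPLATE.format(ocr_text=ocr_text)

def parse_response(raw: str) -> dict:
    """Parse the model's reply, tolerating extra prose around the JSON object."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model response")
    return json.loads(raw[start : end + 1])

# Stand-in for a real model call, for local testing of the parsing logic.
def fake_model(prompt: str) -> str:
    return 'Sure: {"title": "Imports by Commodity", "publisher": "Statistics Canada"}'

metadata = parse_response(fake_model(build_prompt("...OCR'd text...")))
```

Parsing out just the JSON object matters in practice, because models often wrap their answer in extra conversational text.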

While it is very exciting to test these models and see all the library possibilities, there is also a significant amount of thinking, consulting, and checking that goes into making an AI tool give us a good answer. Much of the accuracy, efficiency, and optimization that AI promises still relies on humans and our effort and critical thinking.  

 

Oct 28 2025

Artificial Intelligence and Optical Character Recognition (OCR)

By Jacqueline Whyte Appleby

As part of the OCUL AIML Learning Initiative, we at Scholars Portal are working to create metadata records for undescribed or underdescribed government documents previously housed on the 5th floor of Robarts.

When we began exploring AI tools for metadata creation, we quickly realized that to get good metadata, we were going to need near-perfect OCR (optical character recognition, i.e., an accurate capture of the text on the page, rather than simply an image of the page). Our metadata project thus also became an OCR project, which might seem a bit confusing: hasn't OCR been around a long time? Is this the AI revolution we've been promised?

This also got me thinking: Is OCR itself AI? Most of the tools we refer to as ‘intelligent’ are just recognizing patterns and calculating statistical probability, things even classic OCR tools can do. Tesseract, arguably the most ubiquitous OCR software, has been around since the 1980s, when it worked fairly well on carefully typed office documents. It has been continually developed, and long ago incorporated features like page layout analysis and language detection to improve its output.

There are now hundreds of OCR tools available, many of them very sophisticated. Quite a few of them do rely on Large Language Models (LLMs) to create or check results, which improves their accuracy. Over the summer we tested a mix of tools and found that nearly all of them were good at some things, but few were good at everything. LLM-enabled OCR tools sometimes reached closer to a ‘perfect’ score, but they needed a lot of compute and significant time to generate results, more than we could give for a project that was looking to OCR 50,000 documents. If you’re doing an OCR project that’s mostly typed monographs, using a more lightweight tool like Tesseract may be totally sufficient, and get you results a lot faster. For our project, where we’ve got old, messy documents with lots of tables, finding the balance of speed and accuracy required quite a bit of testing. 
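Accuracy comparisons like the ones above are typically scored against a hand-checked ground-truth transcription. One common way to do this, sketched below with only Python's standard library, is a character error rate based on edit distance; this is offered as a general illustration, not necessarily the metric the team used.

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein edit distance between OCR output and ground truth,
    normalized by the length of the reference text (0.0 = perfect match)."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1] / max(len(reference), 1)

# Classic OCR confusion: "m" misread as "rn" raises the error rate.
rate = char_error_rate("whisky imports", "whisky irnports")
```

A score of 0.0 means the OCR output matches the reference exactly; each wrong, missing, or extra character pushes the rate up.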

For the curious, the tools we tested were Tesseract, OCR Flux, Marker, olmOCR, NanoNets, MonkeyOCR, and SmolDocling. While we initially selected MonkeyOCR to use on all documents, we soon after switched to MinerU, a tool that we continue to find gives great results, even when comparing it to the much-hyped DeepSeek OCR release last week.

 

Sep 15 2025

OCUL AIML Initiative Update - September 2025

Through a series of pilot projects, the Ontario Council of University Libraries (OCUL) continues its exploration of the responsible, ethical use of AI and machine learning in the academic library environment while building related knowledge and skills across the OCUL membership and beyond.

The team has expanded with the hiring of Kari D. Weaver, who oversees the projects and capacity building programming, and Furquan Hassan, AI Special Projects Developer, bolstering staffing expertise to support the pilot projects and OCUL’s strategic goals. With this larger team, OCUL has extended the timeline for its initial phase of AI and machine learning experimentation to May 2027.

Engagement with the broader OCUL community is critical to the consortium's programming and project success. An online survey was distributed in May 2025 to gather diverse perspectives to inform future decision-making and programming, and to understand the training needs of library workers in Ontario universities. A full report of the survey findings and recommendations is now available. Also in May, two working groups composed of representatives from OCUL member libraries began work on the Enhancing Virtual Reference project. These groups will explore the potential of an AI-assisted chatbot from the existing Ask a Librarian vendor to improve virtual reference services, focusing on user experience, privacy and security, accessibility, and service delivery. Based on the working groups' initial efforts, a new Enhancing Virtual Reference Project FAQ was published in August 2025.

In July, OCUL welcomed Joël Rivard (Head of Research Support Services, Carleton University) as a visiting researcher. His work over the coming year will focus on developing a systematic approach to analyzing artificial intelligence enhancements to existing vendor-supplied tools and supporting capacity building initiatives underway at the consortium.

OCUL continues to foster connections across organizations, and in a significant milestone, was recently awarded a grant from the Higher Education Quality Council of Ontario (HEQCO). The grant funds a project that examines how AI-powered tools might improve discoverability and access to a collection of 50,000 historical Canadian government documents. 

OCUL Executive Director Amy Greenberg recently spoke on a panel at the GenAI in Libraries Conference about collaborative approaches to AI capacity building, and Kari D. Weaver delivered three invited presentations at the University of Guelph Teaching with AI Conference on AI in educational development, AI in co-curricular learning, and AI disclosure practices, highlighting her internationally recognized work on the Artificial Intelligence Disclosure (AID) Framework. These efforts demonstrate OCUL's potential to position librarians as thought leaders and expert contributors on artificial intelligence in higher education.

For ongoing updates and detailed information about each pilot project, please visit SPOTDocs, the OCUL wiki.

Aug 19 2025

New Survey Report Shows How Library Workers Use AI

A new report shows how Ontario library professionals are using AI tools in their day-to-day work and their perspectives on these burgeoning technologies.

This past May, the Ontario Council of University Libraries (OCUL) surveyed its membership to better understand how AI tools are being used and where there are opportunities to develop AI-related professional development programming.

Next for our AI and Machine Learning team is to not only build out professional development offerings aligned with survey findings, but also host follow-up focus groups to gain ongoing, deeper insight into members’ viewpoints on AI. Information on the focus groups and registration for participation will be available in 2026.

Download the full report (PDF) to learn more about key recommendations and survey findings.

Jun 04 2025

Unlocking Government Archives: Enhancing Access with AI-Generated Metadata

By Jacqueline Whyte Appleby

The Ontario Council of University Libraries (OCUL) is a member-based consortium of academic libraries. In 2024, OCUL launched the OCUL Artificial Intelligence and Machine Learning (AIML) Program, which aims to promote responsible, ethical AIML use in the academic library environment while building related knowledge and skills across the OCUL membership and beyond. The program consists of five distinct projects. One of the projects, focused on enhancing access to government documents collections through AI-generated metadata, is part of HEQCO’s Consortium on Generative AI. Visit the OCUL website for more information on OCUL’s AIML program and projects.


It’s a well-trod maxim at this point: there’s too much information out there. Misinformation, disinformation and fake news abound — but with the explosion of online publishing, there’s an overwhelming deluge of high-quality information too! Searching for “cancer cure” on the University of Toronto Libraries site brings up over 84,000 resources, including 47,000 peer-reviewed journal articles, 3,400 book chapters and 1,500 PhD theses. How do we find the resources that are right for our particular line of questioning? How will we know when we should delve deeper into a particular article?

The answer, of course, is metadata! Metadata, data about the data, is all around us, and is essential to our sense-making process. When you look up a movie playing in theatres, you want to know the runtime to plan your evening. When you open an issue of The Economist, knowing its publication date is essential context for whatever you read next.

Libraries are hubs of metadata production, aggregation, harmonization and structuring — making our resources findable and understandable is at the heart of our work. Where metadata is poor, discoverability will be a challenge. Metadata nearly always includes the basics such as title, author and year of publication, but to be really useful, it should also include subjects covered, carefully selected keywords, funding information, author affiliations, unique identifiers (a book’s ISBN, for example) and more.

Creating metadata with this much detail is extremely time consuming. When libraries buy resources, the publisher or distributor is typically in charge of this process. But what about for free, open, digitized or newly produced content? Tools powered by generative AI present opportunities for improving this process.

“Improving Access to Digital Collections Using GenAI in Libraries” is a HEQCO-supported project being run by the Ontario Council of University Libraries (OCUL) and Scholars Portal, the digital service arm of OCUL. This project is one of five taking place as part of OCUL’s AIML program. Our project team includes librarians, developers, systems support specialists and co-op students. We aim to explore the application of generative AI and optical character recognition (OCR) technologies to improve the metadata quality and discoverability of libraries’ digital collections. The project will enrich nearly 50,000 government documents that the University of Toronto Libraries has digitized and made openly available on the Internet Archive. These documents, which include reports, briefings, budgets and inquiries, are a treasure trove of history and public policy, but as historical documents, they are often lacking metadata that would help researchers understand what they contain or why they might be useful.

Here is an example of what you might find while browsing the first page of the collection on archive.org:

https://heqco.ca/wp-content/uploads/2025/05/Blog-1-1024x660.png

Four of these resources don’t have any metadata at all, not even a title. And while Statistics Canada is the producer of four of these documents, they’ve been named three different ways, so there’s no guarantee they’d come up together in a search. We also have two works titled “Financial Flow Accounts” — are these part of a series? How are they related to each other?

Here is an example of a clearly defined series resulting from a title search:

https://heqco.ca/wp-content/uploads/2025/05/Blog-2-1024x567.png

But while Imports by Commodity is a government report produced annually, the metadata for each of these documents has the same publication date (1944), so you’d need to open each document to know what year it’s reporting on.

Digging down to the document level, we see vast amounts of historical information. Here’s an example from Imports by Country, January–December 1972:

https://heqco.ca/wp-content/uploads/2025/05/Blog-3a-1024x584.jpeg

Historical whisky imports from the US seem like something that might be of interest these days. But you’d have to have a fair amount of subject-matter expertise, or look through all these pages, to know it was here.

The promise of digitization has always been full-text searching, but the quality of these scans means this isn’t always possible. And in order to create useful metadata for these documents, we first need to understand what the text is. OCR has long been used to convert scanned images into text — but it’s not a perfect technology, and the OCR generated with the scans for this collection is a bit of a mess!

Here is that same commodities table rendered as plain text on the right:

https://heqco.ca/wp-content/uploads/2025/05/Blog-3a-1024x584.jpeg

https://heqco.ca/wp-content/uploads/2025/05/Blog-4-3-1024x993.png

While it’s possible to understand that this is a table, some of the data points are misrepresented or entirely missing.

And many are much worse than this! Here’s the Index page for the 1972 Imports by Country report in the reader, and after an initial pass through OCR:

https://heqco.ca/wp-content/uploads/2025/06/Blog-5a-1024x595.jpg

https://heqco.ca/wp-content/uploads/2025/06/Blog-6a-1024x522.jpg

After some troubleshooting, we realized the OCR is actually reading through the thin page, pulling characters from the following page.

Step one of this project is metadata, but the long-term goal is to be able to query this corpus of documents at scale — asking questions across documents and finding links between them using a vector database and retrieval-augmented generation. To do any of this successfully, we need to be sure our dataset accurately portrays what’s in the documents.
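The retrieval idea behind that long-term goal can be illustrated with a toy example: documents and a query are turned into vectors and ranked by cosine similarity, with the top matches then passed to a language model for answering. The bag-of-words vectors below are only a stand-in; a real deployment would use a learned embedding model and a proper vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector.
    A real RAG pipeline would use a learned embedding model instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Two miniature 'documents' from the kind of corpus described above.
docs = {
    "doc1": "imports by commodity whisky imports from the united states",
    "doc2": "financial flow accounts quarterly estimates",
}

query = "whisky imports"
# Retrieve the document most similar to the query.
best = max(docs, key=lambda d: cosine(embed(query), embed(docs[d])))
```

Retrieval quality depends entirely on the text going into the vectors, which is why clean OCR and accurate metadata come first.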

The essential foundational work of this project is testing different OCR tools to determine the right balance of accuracy, speed and compute requirements, because the most precise tool won't work if it can't be scaled for our document corpus. The simplest tool, Tesseract, works well for basic documents, but struggles with the complex layout of many of these reports. More sophisticated tools that make use of large language models (LLMs) are able to capture text and structure with more precision, but they're slow and resource intensive. Marker, olmOCR and SmolDocling are three of the AI-powered tools we're working with — but it's not always a simple calculation. Because government documents come in many formats, sizes and structures (for example, image-heavy pamphlets, large budget documents or Parliamentary proceedings where English and French run in two parallel columns), some tools will work better for some kinds of documents, regardless of the size of their LLM or speed.

As we move ahead with OCRing and begin extracting metadata, we look forward to sharing that work with you!

Learn more about OCUL and the OCUL AIML Program here. And check out HEQCO’s webpage for more information about the Consortium on Generative AI and the other projects involved.

Jan 29 2025

OCUL AIML Program Update – January 2025

OCUL launched its AI and Machine Learning (AIML) Program in the summer of 2024, starting with five pilot projects. The program’s pilot projects were chosen from community-developed use cases and designed to create reusable tools and techniques that can transform and adapt for future use. As the projects get underway, OCUL members will be invited to participate in the exploration and evaluation of these new technologies.

In the meantime, catch up on the work to date through our Program Update.