Summarization experiments with Hugging Face Transformers – part 1



This content originally appeared on DEV Community and was authored by Solve Computer Science

🧩 The problem

Summarization is a useful way to condense information while retaining the original message. There are several ways to do it, but, of course, AI is the trend now. If you follow my YouTube channel you know I tend to prefer self-hosted solutions. These systems guarantee more privacy and independence, but there are some trade-offs.

In this post, and in subsequent ones, I’ll show you some tests I ran for summarization purposes using the Hugging Face Transformers Python library, locally, on CPU.

🤔 Reason

My final objective is to automatically generate YouTube video summaries to put in blog posts.

Now, I can’t copy the YouTube video descriptions verbatim for SEO reasons: search engines don’t like duplicate content. The obvious solution is some kind of summarization. First of all, I always use whisper and proofread the SRT file containing the recognized audio before uploading it to YouTube. I then paste the SRT file verbatim, in chunks, into an LLM chat and ask ChatGPT to generate a summary once all the chunks are in. This last step is tedious, and you always rely on a SaaS. Some alternatives include using local AI models via Ollama, or more specialized software that can run different tasks besides text generation.

The Python Transformers library by Hugging Face does just that and, thanks to its Pipeline API, it is trivial to do this work (or it should be?!).

🤗 Hugging Face documentation

If you search the official docs for summarization, this is the first page that comes up. The TL;DR of that page is:

  • fine-tune a model
  • perform inference to generate a summary

A screenshot of the Transformers summarization page

To keep things simple, I just wanted to use a pre-trained model for inference.

🖥 Meet the pipeline API

In HF, pipelines are probably the simplest concept I’ve come across so far. If you follow this other tutorial you’ll see that you can perform most tasks, including summarization, with just three lines of code. So this is more or less what I tried first (I added newlines to avoid scrolling):

from transformers import pipeline

# 'summarizer' avoids shadowing the pipeline() function we just imported
summarizer = pipeline(task='summarization', model='facebook/bart-large-cnn')

# Text from https://en.wikipedia.org/wiki/Artificial_intelligence
# Creative Commons Attribution-ShareAlike 4.0 License
text = (
    'Artificial intelligence (AI) is the capability of '
    'computational systems to perform tasks typically associated with '
    'human intelligence, such as learning, reasoning, problem-solving, '
    'perception, and decision-making. It is a field of research in '
    'computer science that develops and studies methods and software that '
    'enable machines to perceive their environment and use learning and '
    'intelligence to take actions that maximize their chances of achieving '
    'defined goals'
)
print(summarizer(text)[0]['summary_text'])

The result is this:

Device set to use cpu

Your max_length is set to 142, but your input_length is only 79.
Since this is a summarization task, where outputs shorter than the
input are typically wanted, you might consider decreasing max_length
manually, e.g. summarizer('...', max_length=39)

Artificial intelligence is the capability of computational systems to
perform tasks typically associated with human intelligence. It is a
field of research in computer science that develops and studies
methods and software that enable machines to perceive their
environment and use learning and intelligence to take actions that
maximize their chances of achieving defined goals.

The first two messages are warnings from the Transformers library, which can be suppressed by setting a less verbose logging level.
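
For example, something like this should silence everything below error severity (a minimal sketch using the logging helpers the library ships with):

from transformers import logging, pipeline

# Show only errors: this should hide both the "Device set to use cpu"
# message and the max_length warning emitted during inference
logging.set_verbosity_error()

summarizer = pipeline(task='summarization', model='facebook/bart-large-cnn')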

As I said, I’m doing inference on CPU, so no surprises there. As for the max_length parameter, it corresponds to the maximum number of tokens in the output. Apparently the default value for this model (142) is too big for this input: it should be tweaked so that the summary still makes sense.
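
As a rough sketch of that tweak (reusing summarizer and text from the first example; the exact values here are arbitrary, chosen only for illustration):

# Cap the summary at roughly half of the 79 input tokens
result = summarizer(text, min_length=10, max_length=40)
print(result[0]['summary_text'])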

🎉 Conclusion

Can you see the potential problem here? Although the quality of the summary seems OK for this specific text, the model seems to have been trained on bigger inputs. In fact, the length of the summary is about 82% of the length of the input text.
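
If you want to measure that kind of ratio yourself, here is a hypothetical helper (compression_ratio is my own name, not something from the library) that compares token counts using the model’s own tokenizer:

from transformers import AutoTokenizer

def compression_ratio(source: str, summary: str,
                      model_name: str = 'facebook/bart-large-cnn') -> float:
    # Summary length as a fraction of source length, counted in tokens
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return len(tokenizer.encode(summary)) / len(tokenizer.encode(source))

# e.g. compression_ratio(text, result[0]['summary_text'])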

In the next posts we’ll see how to improve this kind of inference. I also tested other models to see how they measure up, so stay tuned.

