Alibaba Cloud Launches Open-Source Large Vision Language Model with Image Comprehension Capability

Alibaba Cloud, the digital technology and intelligence backbone of Alibaba Group, has launched two open-source large vision language models (LVLMs): Qwen-VL and its conversationally fine-tuned counterpart, Qwen-VL-Chat. The models can comprehend images, text and bounding boxes in prompts and support multi-round question answering in both English and Chinese.

Qwen-VL is the multimodal version of Qwen-7B, the 7-billion-parameter version of Alibaba Cloud’s large language model Tongyi Qianwen (also available as open source on ModelScope). Capable of understanding both image inputs and text prompts in English and Chinese, Qwen-VL can perform tasks such as answering open-ended questions about images and generating image captions.

Qwen-VL-Chat supports more complex interactions, such as comparing multiple image inputs and engaging in multi-round question answering. Built with alignment techniques, this AI assistant offers a range of creative capabilities, including writing poetry and stories based on input images, summarising the content of multiple pictures, and solving mathematical questions displayed in images.

Contribution to open source and inclusivity

In a bid to democratise AI technologies, Alibaba Cloud has shared the models’ code, weights, and documentation with academics, researchers, and commercial institutions worldwide. They are available via Alibaba’s AI model community ModelScope and the collaborative AI platform Hugging Face. For commercial use, companies with over 100 million monthly active users can request a licence from Alibaba Cloud.

These models, with their ability to extract meaning and information from images, have the potential to change how people interact with visual content. For instance, drawing on their image comprehension and question-answering capabilities, the models could in future help visually impaired individuals find information while shopping online.

The Qwen-VL model was pre-trained on image and text datasets. Whereas other open-source large vision language models process images at a resolution of 224 × 224, Qwen-VL can handle image input at a resolution of 448 × 448, resulting in better image recognition and comprehension.
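To put the resolution figures in perspective, the short sketch below works through the arithmetic. The pixel counts follow directly from the two resolutions quoted above; the patch size of 14 is an assumption chosen for illustration (a common value for ViT-style image encoders) and is not stated in this article.

```python
# Illustrative arithmetic only. ASSUMED_PATCH_SIZE is a hypothetical value,
# not a figure taken from the article.
ASSUMED_PATCH_SIZE = 14


def pixel_count(side: int) -> int:
    """Number of pixels in a square image with the given side length."""
    return side * side


def patch_count(side: int, patch: int = ASSUMED_PATCH_SIZE) -> int:
    """Number of non-overlapping square patches that tile the image."""
    if side % patch != 0:
        raise ValueError(f"side {side} is not divisible by patch size {patch}")
    return (side // patch) ** 2


if __name__ == "__main__":
    for side in (224, 448):
        print(f"{side} x {side}: {pixel_count(side)} pixels, "
              f"{patch_count(side)} patches of size {ASSUMED_PATCH_SIZE}")
    ratio = pixel_count(448) // pixel_count(224)
    print(f"448 x 448 carries {ratio}x the pixels of 224 x 224")
```

Doubling the side length quadruples the number of pixels, so under this assumption the encoder would also see four times as many patches per image.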

Across a range of benchmarks, Qwen-VL performed strongly on several visual language tasks, including zero-shot captioning, general visual question answering, text-oriented visual question answering and object detection.

Qwen-VL-Chat has also achieved leading results in both Chinese and English for text-image dialogue and alignment with humans, according to Alibaba Cloud’s own benchmark test. This test involved over 300 images, 800 questions, and 27 categories.

Earlier this month, Alibaba Cloud open-sourced its 7-billion-parameter LLMs, Qwen-7B and Qwen-7B-Chat, as part of its ongoing contribution to the open-source community. The two models have had over 400,000 downloads within a month of their launch.

About Post Author

Mark Baker, UK Tech News Editor

Mark Baker is the Editor of UK Tech News and is an experienced programmer, network engineer and IT Support professional.

Mark has more than ten years’ experience in helping businesses and organisations solve IT challenges.

To contact Mark, please email editor@uktechnews.co.uk

