Janus Pro: Leading Multimodal Model by DeepSeek (Full Transcript)

Explore Janus Pro, a top multimodal model offering image understanding and generation, outperforming many existing models. Discover its features and usage.

Speaker 1: DeepSeek just released another amazing model, Janus Pro: unified multimodal understanding and generation with data and model scaling. This model is free for us to use, and it is now ahead in the AI race.

This is a multimodal model, which means you can input an image and it can describe what that image is. Here you can see that Janus Pro 7B is better than LLaVA, which is another open-source multimodal model, and it performs at the top among the open-source models. Compared with Stable Diffusion, DALL-E 3, and SDXL, Janus Pro 7B tops the list.

Here are some of the examples. It not only understands images, it can also generate them. Here is the comparison between the previous version and the current version, and the latest version is far better than the previous one.

They use an autoregressive transformer. For understanding, it uses an encoder and a text tokenizer, and as output you get text. For image generation, it uses a text tokenizer, a generation encoder, and an image decoder to decode the image. It's really good to see both image understanding and image generation in one model. A previous model such as LLaVA can only understand images, not generate them, but Janus Pro 7B can generate images too.

For multimodal understanding, they used DeepSeek-VL2 and approximately 90 million samples. These include image caption datasets and data for table, chart, and document understanding. For visual generation, they incorporated 72 million samples of synthetic aesthetic data. The model converges faster when trained on synthetic data, and the output is more stable with improved aesthetic quality.

Here are some examples. "Describe the scene in detail," and it explains what the scene is about. Then landmark recognition and text recognition, and this is accurate. General knowledge: "Can you introduce the background story of this cake?" and it is able to identify that. Here you have text-to-image generation, and all of these images look stunning. These would have been chosen selectively, but I can still see a good possibility of generating high-quality images.

You can see the code for this model in the GitHub repo; I'll put the link in the description below. Here you get detailed instructions about the model and its capabilities, and you get the model weights on Hugging Face, which you can download and use directly. You can also run it locally on your computer, and you have the code here. Just copy this code and you should be able to run it using the Gradio interface or FastAPI. Do let me know in the comments below if you want me to cover any of these.

You also have an online version where you can upload an image. I'm going to upload an image. I've uploaded this image, and I'm going to ask "Who is this person?" and click Chat. I'm facing an error, maybe because of high demand, but this is released by DeepSeek AI in Hugging Face Spaces, and you should be able to run it locally. You have two different options. One is image understanding: you upload an image and ask questions about it. The other is image generation: just type in what you want to generate and click Generate Images to generate the image.

I am super excited about this. Do let me know in the comments below if you want me to test this more thoroughly. Considering you already like DeepSeek, I also created another video about the powerful reasoning model released by DeepSeek, which I used to create a RAG chatbot that you can run 100% locally on your computer. I'll put the link in here, I highly recommend that you watch it, and I will see you there.
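
The two pathways described in the transcript (one shared autoregressive transformer, with different components for understanding and for generation) can be pictured with a small toy sketch. This is only an illustration of the routing idea under stated assumptions: the names `SHARED_CORE`, `Pathway`, `PATHWAYS`, and `route` are hypothetical, the strings are labels taken from the speaker's wording, and none of it comes from the actual Janus Pro code or loads any model.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Labels only: these strings mirror the transcript's wording and are NOT
# identifiers from the Janus Pro codebase.
SHARED_CORE = "autoregressive transformer"


@dataclass(frozen=True)
class Pathway:
    components: Tuple[str, ...]  # task-specific parts named in the transcript
    produces: str                # what comes out: "text" or "image"


# Hypothetical lookup table, for illustration only.
PATHWAYS: Dict[str, Pathway] = {
    "understanding": Pathway(("encoder", "text tokenizer"), "text"),
    "generation": Pathway(
        ("text tokenizer", "generation encoder", "image decoder"), "image"
    ),
}


def route(task: str) -> dict:
    """Return the shared core plus the task-specific parts for one task."""
    if task not in PATHWAYS:
        raise ValueError(f"unknown task: {task!r}")
    pathway = PATHWAYS[task]
    return {
        "core": SHARED_CORE,
        "components": list(pathway.components),
        "produces": pathway.produces,
    }


if __name__ == "__main__":
    for task in PATHWAYS:
        print(task, route(task))
```

Both tasks return the same `core` value, which is the point the speaker makes: one model handles both jobs, and only the surrounding components and the kind of output differ.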
