Available soon H200 clusters

Takomo Release 2023.05.16

4 min read
Takomo Release 2023.05.16

We're very excited and proud for today's Takomo release. Adding several new models and opening opening Takomo access for those on the whitelist. Below you'll find an overview and explanation of all models we've added.

To get started right away, head over to takomo.ai and try our no-code AI builder. Pipeline creation does not cost anything, only the actual API calls are charged. And it gets even better, the first 10.000 API calls are on us!

Try the new models now: https://www.takomo.ai


GPT-3.5 is a language model developed by OpenAI in January 2022. It is a bridge between GPT-3 and GPT-4, and it aims to increase the speed of the model and reduce the cost of running it compared to GPT-3. The model has three variants with 1.3B, 6B, and 175B parameters, and the main feature of GPT-3.5 is to eliminate toxic output to a certain extent. Additionally, it can understand and generate natural language or code, and the most capable and cost-effective model in the GPT-3.5 family is the gpt-3.5-turbo, which is optimized for chat but works well for traditional completion tasks as well.



GPT-4 is the latest version of OpenAI's GPT (Generative Pre-trained Transformer) language model. It is a large multimodal model that can accept text and image inputs and generate text outputs. GPT-3.5, on the other hand, is the previous version of the GPT language model. Compared to GPT-3.5, GPT-4 is smarter, can handle longer prompts and conversations, and can analyze human input more logically and present a more cohesive, resolute, and human-like output due to being trained on a larger dataset of text and code. Additionally, GPT-4 is more of a data-to-text model while GPT-3.5 is a text-to-text model. This means that GPT-4 can do things that the previous version could not.



Whisper is an automatic speech recognition (ASR) system developed by OpenAI and trained on a large and diverse dataset of multilingual and multitask supervised data collected from the web. The system approaches human-level robustness and accuracy on English speech recognition, and has improved robustness to accents, background noise, and technical language. In addition to speech recognition, Whisper is a multi-task model that can also do language identification and speech translation across a number of languages.



InstructPix2Pix is an instruction-based image editing model developed by Tim Brooks, Aleksander Holynski, and Alexei A. Efros at UC Berkeley. It is a PyTorch implementation that is based on the original CompVis/stable_diffusion repository. InstructPix2Pix allows users to edit images by providing natural language instructions. The model is fine-tuned on a dataset of image editing examples that are paired with corresponding natural language instructions. Users can provide their own instructions to the model, and it will generate an edited image based on those instructions.


ControlNet Canny

ControlNet is a family of neural networks fine-tuned on Stable Diffusion that allows for more structural and artistic control over image generation. ControlNet Canny refers to a specific ControlNet model that uses the Canny preprocessor set to a very high threshold. Widening the gap between the thresholds (i.e. decreasing the low threshold and increasing the high threshold) will give more control to ControlNet as to which edges to keep. This model is one of the most popular models that generated some of the amazing images you are likely seeing on the internet.



BLIP-2 is a lightweight Querying Transformer that is pre-trained in two stages, leveraging frozen pre-trained image encoders and frozen large language models (LLMs) to bridge the gap between vision and language. The first stage of pre-training bootstraps vision-language representation learning from a frozen image encoder, while the second stage bootstraps vision-to-language generative learning from a frozen language model. Despite having significantly fewer trainable parameters than existing methods, BLIP-2 achieves state-of-the-art performance on various vision-language tasks.