Visivo - Multimodal AI Assistant

Visivo is an advanced AI-powered application that combines visual analysis with speech synthesis to provide an accessible and interactive experience. The application allows users to upload images for analysis and receive detailed descriptions with optional realistic speech audio playback and download capabilities.

Demo: Vercel

Note: Speech synthesis does not work on the deployed version; test it locally.

visivo.webm

Features

Multiple Image Upload: Users can upload up to 3 images at once.
Drag and Drop: Easy image upload via drag and drop functionality.
File Validation: Automatic checks for file type and size (max 20MB).
AI-Powered Image Analysis: Utilizes advanced AI to analyze uploaded images.
Text-to-Speech Synthesis: Converts analysis results into spoken audio.
Audio Playback Controls: Users can play, pause, and seek through the audio description.
Audio Download: Option to download the generated audio file for offline use.
User Authentication: Secure access to features through user authentication.
Multi-Format File Processing: Supports analysis of various file types, including documents, audio, video, and images, with tailored insights for each format.
Chat Interface: Chat with your files, whether documents, music, or video.

Project Workflow

User Authentication: Users sign in to access the application features.
Image Upload: Users can upload up to 3 images through drag-and-drop or file selection.
Image Validation: The system checks file types (JPEG, PNG, WEBP, HEIC, HEIF) and sizes (max 20MB).
Image Analysis: Uploaded images are sent to the AI service for analysis.
Result Display: Analysis results are displayed for each image.
Audio Synthesis: The system generates audio descriptions of the analysis results.
Audio Interaction: Users can listen to audio descriptions and control playback.
Audio Download: Users can download the generated audio files for offline use.
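
The upload and validation steps above can be sketched as a small helper. This is a minimal illustration, not the project's actual code: the function name `validateImages`, the MIME strings, and the reading of "20MB" as 20 * 1024 * 1024 bytes are all assumptions.

```javascript
// Hypothetical limits, taken from the workflow described above.
const MAX_FILES = 3;
const MAX_BYTES = 20 * 1024 * 1024; // assumes "20MB" means 20 MiB
const ALLOWED_TYPES = new Set([
  "image/jpeg",
  "image/png",
  "image/webp",
  "image/heic",
  "image/heif",
]);

// Accepts an array of objects shaped like browser File objects
// ({ name, type, size }) and returns a list of error messages.
// An empty list means every check passed.
function validateImages(files) {
  const errors = [];
  if (files.length === 0) {
    errors.push("No images selected.");
  }
  if (files.length > MAX_FILES) {
    errors.push(`Too many images: ${files.length} (max ${MAX_FILES}).`);
  }
  for (const file of files) {
    if (!ALLOWED_TYPES.has(file.type)) {
      errors.push(`${file.name}: unsupported type "${file.type}".`);
    }
    if (file.size > MAX_BYTES) {
      errors.push(`${file.name}: larger than 20MB.`);
    }
  }
  return errors;
}
```

Running the same checks on the server as well as in the browser is usually worthwhile, since client-side validation can be bypassed.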

Technologies Used

Next.js 15.0.3 with App Router
React 18.2.0
Google Gemini 1.5 Flash for image analysis
Azure Cognitive Services Speech SDK for text-to-speech
NextAuth.js for authentication
Framer Motion for animations
Tailwind CSS for styling

Prerequisites

Node.js 18.0 or later [ Download Node.js ]
Google Cloud Platform account with Gemini API access [ Create API Key ]
Azure account with Speech Services set up [ Create API Key ]
GitHub OAuth application (for authentication) [ Setup OAuth ]
Discord Developer application (for authentication) [ Setup OAuth ]

Installation Guide

Clone the repository:

git clone https://github.com/exyreams/Visivo.git
cd Visivo

Install dependencies:
```
npm install
```
Set up environment variables: Create a .env.local or .env file in the root directory with the following content:
```
GEMINI_API_KEY=your_gemini_api_key
AZURE_SPEECH_KEY=your_azure_speech_key
AZURE_SPEECH_REGION=your_azure_speech_region
GITHUB_ID=your_github_oauth_client_id
GITHUB_SECRET=your_github_oauth_client_secret
DISCORD_CLIENT_ID=your_discord_client_id
DISCORD_CLIENT_SECRET=your_discord_client_secret
NEXTAUTH_SECRET=your_nextauth_secret
```
- GEMINI_API_KEY: Create an API key in Google AI Studio.
- Azure Speech keys: Create a new Speech resource in Azure.
  - AZURE_SPEECH_KEY: Use your resource key here.
  - AZURE_SPEECH_REGION: Use the region you selected when creating the resource.
- GITHUB_ID and GITHUB_SECRET: Create a new GitHub OAuth application and use its client ID and client secret.
- DISCORD_CLIENT_ID and DISCORD_CLIENT_SECRET: Create a new application in the Discord Developer Portal, then open its OAuth2 settings and use the credentials shown there.
- NEXTAUTH_SECRET: Generate a new key by running the following command in your terminal:
```
openssl rand -base64 32
```
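
A missing or empty variable is an easy mistake to make with this many keys. The sketch below lists which of the variables named above are unset. The helper name `missingEnvVars` is hypothetical and not part of the project.

```javascript
// Variable names copied from the environment file shown above.
const REQUIRED_ENV_VARS = [
  "GEMINI_API_KEY",
  "AZURE_SPEECH_KEY",
  "AZURE_SPEECH_REGION",
  "GITHUB_ID",
  "GITHUB_SECRET",
  "DISCORD_CLIENT_ID",
  "DISCORD_CLIENT_SECRET",
  "NEXTAUTH_SECRET",
];

// Returns the names of required variables that are absent or blank
// in the given environment object (for example, process.env).
function missingEnvVars(env) {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return typeof value !== "string" || value.trim() === "";
  });
}
```

Calling `missingEnvVars(process.env)` in server-side code and logging the result is one way to surface a configuration problem before the first API call fails.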
Set up authentication providers:
- For local use:
  - Create a GitHub OAuth application and add the callback URL: http://localhost:3000/api/auth/callback/github
  - Create a Discord application and add the callback URL: http://localhost:3000/api/auth/callback/discord
- For use on your own URL:
  - Create a GitHub OAuth application and add the callback URL: http://your-custom-url/api/auth/callback/github
  - Create a Discord application and add the callback URL: http://your-custom-url/api/auth/callback/discord
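
The callback URLs above all follow one pattern: the site's base URL followed by `/api/auth/callback/` and the provider name. As a rough sketch (the helper `authCallbackUrl` is hypothetical), they can be derived like this:

```javascript
// Builds the OAuth callback URL for a provider ("github" or "discord")
// from the site's base URL. Because the path starts with "/", any path
// already present on baseUrl is replaced rather than extended.
function authCallbackUrl(baseUrl, provider) {
  return new URL(`/api/auth/callback/${provider}`, baseUrl).toString();
}

console.log(authCallbackUrl("http://localhost:3000", "github"));
// prints "http://localhost:3000/api/auth/callback/github"
```

The URL registered with GitHub or Discord generally has to match the one the application sends, so a stray space or trailing slash is a common cause of sign-in errors.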
Start the development server:
```
npm run dev
# or
npx next dev
```
Open http://localhost:3000 in your browser to view the application.

Usage

Sign in using GitHub or Discord authentication.
Upload images by dragging and dropping or clicking the upload area.
Click "Analyze Images" to process the uploaded image(s).
View the analysis results for each image.
Use the audio controls to play, pause, or seek through the audio description.
Click the download button to save the audio file to your device.

API Routes

/api/analyze: Handles image analysis using Google Gemini AI and text-to-speech conversion using Azure Speech Services.
/api/auth/[...nextauth]: Handles authentication using NextAuth.js.
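
For readers curious how an image might be handed to Gemini inside a route like /api/analyze, the sketch below converts raw image bytes into the inline-data part shape used by the `@google/generative-ai` JavaScript SDK. Whether this project uses that SDK or this exact shape is an assumption, and the helper name `toInlineImagePart` is hypothetical.

```javascript
// Wraps image bytes as a base64 inline-data part. The
// { inlineData: { data, mimeType } } shape follows the Google
// Generative AI JavaScript SDK; its use in this project is assumed.
function toInlineImagePart(bytes, mimeType) {
  return {
    inlineData: {
      data: Buffer.from(bytes).toString("base64"),
      mimeType,
    },
  };
}
```

A part built this way would typically be passed to the model together with a text prompt. This helper relies on Node's `Buffer`, so it is meant for server-side code.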

Contributing

Fork the repository
Create your feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Google Cloud Platform for Gemini AI services
Microsoft Azure for Speech Services
Next.js team for the amazing framework
All contributors and users of the application

