Visivo is an advanced AI-powered application that combines visual analysis with speech synthesis to provide an accessible and interactive experience. The application allows users to upload images for analysis and receive detailed descriptions with optional realistic speech audio playback and download capabilities.
Demo: Vercel
Speech synthesis does not work on the deployed version; test it locally.
- Multiple Image Upload: Users can upload up to 3 images at once.
- Drag and Drop: Easy image upload via drag and drop functionality.
- File Validation: Automatic checks for file type and size (max 20MB).
- AI-Powered Image Analysis: Utilizes advanced AI to analyze uploaded images.
- Text-to-Speech Synthesis: Converts analysis results into spoken audio.
- Audio Playback Controls: Users can play, pause, and seek through the audio description.
- Audio Download: Option to download the generated audio file for offline use.
- User Authentication: Secure access to features through user authentication.
- Multi-Format File Processing: Supports analysis of various file types, including documents, audio, video, and images, with tailored insights for each format.
- Chat Interface: Chat with your files, whether documents, music, or video files.
- User Authentication: Users sign in to access the application features.
- Image Upload: Users can upload up to 3 images through drag-and-drop or file selection.
- Image Validation: The system checks file types (JPEG, PNG, WEBP, HEIC, HEIF) and sizes (max 20MB).
- Image Analysis: Uploaded images are sent to the AI service for analysis.
- Result Display: Analysis results are displayed for each image.
- Audio Synthesis: The system generates audio descriptions of the analysis results.
- Audio Interaction: Users can listen to audio descriptions and control playback.
- Audio Download: Users can download the generated audio files for offline use.
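The upload and validation steps above can be sketched as a small check. This is only an illustration under stated assumptions: the function and constant names are hypothetical and not taken from the Visivo codebase, the MIME strings are the standard ones for the listed formats, and 20MB is assumed to mean 20 × 1024 × 1024 bytes.

```typescript
// Limits come from the list above; all names here are illustrative assumptions.
const MAX_IMAGES = 3;
const MAX_BYTES = 20 * 1024 * 1024; // 20MB, assuming binary megabytes
const ALLOWED_TYPES = [
  "image/jpeg",
  "image/png",
  "image/webp",
  "image/heic",
  "image/heif",
];

// Minimal shape shared with the browser's File object (name, type, size).
interface UploadCandidate {
  name: string;
  type: string;
  size: number;
}

// Returns a list of human-readable problems; an empty list means the batch passes.
function validateUploads(files: UploadCandidate[]): string[] {
  const errors: string[] = [];
  if (files.length === 0) {
    errors.push("No images selected.");
  }
  if (files.length > MAX_IMAGES) {
    errors.push(`At most ${MAX_IMAGES} images are allowed.`);
  }
  for (const file of files) {
    if (ALLOWED_TYPES.indexOf(file.type) === -1) {
      errors.push(`${file.name}: unsupported type.`);
    }
    if (file.size > MAX_BYTES) {
      errors.push(`${file.name}: larger than 20MB.`);
    }
  }
  return errors;
}
```

Because a browser `File` already has `name`, `type`, and `size`, a list of dropped files could be passed to a check like this before anything is sent for analysis.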
- Next.js 15.0.3 with App Router
- React 18.2.0
- Google Gemini 1.5 Flash for image analysis
- Azure Cognitive Services Speech SDK for text-to-speech
- NextAuth.js for authentication
- Framer Motion for animations
- Tailwind CSS for styling
- Node.js 18.0 or later [ Download Node.js ]
- Google Cloud Platform account with Gemini API access [ Create API Key ]
- Azure account with Speech Services set up [ Create API Key ]
- GitHub OAuth application (for authentication) [ Setup OAuth ]
- Discord Developer application (for authentication) [ Setup OAuth ]
- Clone the repository:
    git clone https://github.com/exyreams/Visivo.git
    cd Visivo
- Install dependencies:
    npm install
- Set up environment variables: Create a .env.local or .env file in the root directory with the following content:
    GEMINI_API_KEY=your_gemini_api_key
    AZURE_SPEECH_KEY=your_azure_speech_key
    AZURE_SPEECH_REGION=your_azure_speech_region
    GITHUB_ID=your_github_oauth_client_id
    GITHUB_SECRET=your_github_oauth_client_secret
    DISCORD_CLIENT_ID=your_discord_client_id
    DISCORD_CLIENT_SECRET=your_discord_client_secret
    NEXTAUTH_SECRET=your_nextauth_secret
  - GEMINI_API_KEY: Create an API key from AI Studio.
  - AZURE Speech Keys: Create a new application from Speech.
    - AZURE_SPEECH_KEY: Use your resource key here.
    - AZURE_SPEECH_REGION: Use the region that you selected while creating the application.
  - GITHUB: Create a new OAuth application and use its ID and secret.
  - DISCORD: Create a new application; once it is created, go to OAuth2 and use those credentials.
  - NEXTAUTH_SECRET: Generate a new key by running the following command in your terminal:
      openssl rand -base64 32
- Set up authentication providers:
  - For local use:
    - Create a GitHub OAuth application and add the callback URL: http://localhost:3000/api/auth/callback/github
    - Create a Discord application and add the callback URL: http://localhost:3000/api/auth/callback/discord
  - For use on your own URL:
    - Create a GitHub OAuth application and add the callback URL: http://your-custom-url/api/auth/callback/github
    - Create a Discord application and add the callback URL: http://your-custom-url/api/auth/callback/discord
- Start the development server:
    npm run dev
  or:
    npx next dev
- Open http://localhost:3000 in your browser to view the application.
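If sign-in or analysis fails after setup, a quick sanity check of the configuration may help. The sketch below is an assumption-laden illustration, not part of Visivo: `missingEnvVars` and `callbackUrl` are hypothetical helper names, while the variable names and the callback path are the ones listed in the setup steps.

```typescript
// Variable names are taken from the environment file described in the setup steps.
const REQUIRED_ENV_VARS = [
  "GEMINI_API_KEY",
  "AZURE_SPEECH_KEY",
  "AZURE_SPEECH_REGION",
  "GITHUB_ID",
  "GITHUB_SECRET",
  "DISCORD_CLIENT_ID",
  "DISCORD_CLIENT_SECRET",
  "NEXTAUTH_SECRET",
];

// Hypothetical helper: returns the names of variables that are unset or blank.
function missingEnvVars(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV_VARS.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

// Hypothetical helper: builds the OAuth callback URL to register with a provider.
// Trailing slashes on the base URL are dropped so the path is not doubled.
function callbackUrl(baseUrl: string, provider: "github" | "discord"): string {
  return baseUrl.replace(/\/+$/, "") + "/api/auth/callback/" + provider;
}
```

In a Node.js script, `missingEnvVars(process.env)` would list anything still unset, and `callbackUrl("http://localhost:3000", "github")` produces the local GitHub callback URL given in the setup steps.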
- Sign in using GitHub or Discord authentication.
- Upload images by dragging and dropping or clicking the upload area.
- Click "Analyze Images" to process the uploaded image(s).
- View the analysis results for each image.
- Use the audio controls to play, pause, or seek through the audio description.
- Click the download button to save the audio file to your device.
- /api/analyze: Handles image analysis using Google Gemini AI and text-to-speech conversion using Azure Speech Services.
- /api/auth/[...nextauth]: Handles authentication using NextAuth.js.
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Google Cloud Platform for Gemini AI services
- Microsoft Azure for Speech Services
- Next.js team for the amazing framework
- All contributors and users of the application