We are seeking a skilled Big Data Developer to join our startup, which is building a cutting-edge platform that leverages AI-driven political narratives to predict niche stock performance (e.g., defense stocks under Republican administrations). You will design and implement a scalable backend data infrastructure for large-scale historical and real-time data ingestion, storage, querying, and processing. The system will support user-created AI trader personas, backtesting strategies, and performance simulations, scaling to thousands of users testing 30-40+ stocks across multi-year datasets.

Key Responsibilities

* Data Ingestion and Pipelines:
  * Build ETL pipelines to ingest data from financial APIs (e.g., yfinance, Quandl), Google Trends (via pytrends), and the X/Twitter API.
  * Manage large datasets (e.g., terabytes of historical stock prices, real-time updates, and unstructured sentiment data).
  * Use orchestration tools (e.g., Apache Airflow, Luigi) to automate daily/hourly data refreshes.
  * Ensure data quality through cleaning, deduplication, and handling of missing values.
* Database Design and Management:
  * Migrate from SQLite to a scalable big data solution (e.g., AWS S3 + Athena, Google BigQuery, or Hadoop/HDFS with Hive).
  * Implement caching (e.g., Redis, in-memory pandas DataFrames) for frequent queries and backtesting.
  * Enable complex queries (e.g., "SELECT average returns for defense stocks WHERE president = 'Trump' AND date BETWEEN '2017-01-20' AND '2021-01-20'").
  * Optimize performance with indexing, partitioning (by date/stock), and efficient joins.
* Backtesting and Processing Engine:
  * Integrate Python libraries (e.g., backtrader, backtesting.py) for strategy simulations.
  * Scale computations with distributed processing (e.g., Apache Spark, Dask) to run backtests in parallel across large datasets.
  * Support AI persona queries (e.g., "Test a Republican-narrative strategy on oil stocks during the Trump presidency").
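The data-quality step in the ingestion responsibilities (cleaning, deduplication, missing-value handling) might look like the following pandas sketch. The column names (`date`, `ticker`, `close`) are illustrative assumptions, not part of the role description.

```python
import pandas as pd

def clean_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Basic data-quality pass for a daily price feed.

    Assumes columns ['date', 'ticker', 'close']; this schema is
    hypothetical, chosen only to illustrate the cleaning steps.
    """
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    # Deduplicate: keep the most recent record per (date, ticker) pair.
    out = out.drop_duplicates(subset=["date", "ticker"], keep="last")
    # Handle missing closes per ticker by carrying the last price forward.
    out = out.sort_values(["ticker", "date"])
    out["close"] = out.groupby("ticker")["close"].ffill()
    return out.reset_index(drop=True)
```

In a production pipeline, a function like this would run as one task in an Airflow DAG, downstream of the API pull and upstream of the warehouse load.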
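The "average returns for defense stocks under a given president" query above can be prototyped in pandas before committing to Athena or BigQuery. The schema here (`date`, `sector`, `president`, `daily_return`) is an assumption for illustration only.

```python
import pandas as pd

def avg_returns(df: pd.DataFrame, sector: str, president: str,
                start: str, end: str) -> float:
    """Mean daily return for one sector within a presidency window.

    Expects columns ['date', 'sector', 'president', 'daily_return'];
    the column names are hypothetical, not a committed schema.
    """
    mask = (
        (df["sector"] == sector)
        & (df["president"] == president)
        & (df["date"] >= start)
        & (df["date"] <= end)
    )
    return df.loc[mask, "daily_return"].mean()
```

The same filter translates directly to a SQL `WHERE` clause once the data lands in a partitioned warehouse table, with the date predicate pruning partitions.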
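The parallel-backtest responsibility can be sketched with Python's standard-library executor; at scale, Spark or Dask would replace the pool, and a real engine such as backtrader would replace the toy buy-and-hold stub below. All names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def backtest_one(item):
    """Toy backtest: total return of buy-and-hold over a price series.

    Stand-in for a real strategy simulation (e.g., backtrader).
    """
    ticker, prices = item
    return ticker, prices[-1] / prices[0] - 1.0

def backtest_all(price_map: dict, max_workers: int = 4) -> dict:
    """Fan one independent backtest per ticker out across workers.

    Backtests per ticker share no state, so they parallelize trivially;
    the same shape maps onto Dask delayed tasks or a Spark RDD.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(backtest_one, price_map.items()))
```

Usage: `backtest_all({"XOM": [50.0, 55.0], "LMT": [100.0, 90.0]})` returns a per-ticker mapping of total returns.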