{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Last night I had a good laugh at the hashtag [#LaGente](https://twitter.com/search?q=%23lagente&src=typd) (*the people*), which went viral while Alejandro Fantino was interviewing, once again, the ineffable presidential candidate Sergio Massa. \n", "\n", "

"I'm not making this commitment to you, but to #LaGente". There are no words for how fed up I am with this guy.

— Fenicia (@zeinicienta) March 12, 2015
\n", "\n", "\n", "

The amount of stock phrases and clichés that @SergioMassa throws around is unbearable. Cheap demagogue. #LaGente

— Javier Smaldone (@mis2centavos) March 12, 2015
\n", "\n", "\n", "

I want to count how many times +a says #LaGente

— memorex (@memorex) March 12, 2015
\n", "\n", "\n", "That reminded me of a [post by Zulko](http://zulko.github.io/blog/2014/06/21/some-more-videogreping-with-python/), whose blog is a collection of nerdily fun gems. There he shows how to automatically cut out the bits of a video that mention a word or phrase, based on the timestamps in the subtitle file, using his wonderful library [MoviePy](http://zulko.github.io/moviepy/) and a bit of Python. More or less what [videogrep](https://github.com/antiboredom/videogrep) does, but tidier. \n", "\n", "The tool [youtube-dl](http://rg3.github.io/youtube-dl/) (also great, and also written in Python) can download not only YouTube videos and their existing subtitles, but also the \"automatic subtitles\". These are generally pretty bad, but good enough to find short phrases. \n", "\n", "## All for \"la gente\": let's get to work\n", "\n", "The first thing we need is a list of videos in which Sergio Massa speaks. I ran a [search](https://www.youtube.com/results?search_query=entrevista+sergio+massa), decided to skip a few results (parodies, for example) and built a list. 
There are several ways to get this list from the first few pages of results; I went with rustic but effective web scraping:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['https://www.youtube.com/watch?v=8pP8G3fSAcY',\n", " 'https://www.youtube.com/watch?v=g6QSwxUo1aw',\n", " 'https://www.youtube.com/watch?v=_9FN6CI8fD4',\n", " 'https://www.youtube.com/watch?v=5wqwNDpkZOo',\n", " 'https://www.youtube.com/watch?v=V865E4mBiHU',\n", " 'https://www.youtube.com/watch?v=TPrGNJnMS9U',\n", " 'https://www.youtube.com/watch?v=SVTl11hG9Gs',\n", " 'https://www.youtube.com/watch?v=Df_dwb5XHQM',\n", " 'https://www.youtube.com/watch?v=sptBkyfq1VU',\n", " 'https://www.youtube.com/watch?v=tzjz1xrNu3k',\n", " 'https://www.youtube.com/watch?v=k-CGbuOo8do',\n", " 'https://www.youtube.com/watch?v=_L-B_wHsEec',\n", " 'https://www.youtube.com/watch?v=iFOABIQdo9Q',\n", " 'https://www.youtube.com/watch?v=WOlRIKGrBWY',\n", " 'https://www.youtube.com/watch?v=a-mCgN6W9ek',\n", " 'https://www.youtube.com/watch?v=x5vhchv3zAY',\n", " 'https://www.youtube.com/watch?v=bi5eK7i59w0',\n", " 'https://www.youtube.com/watch?v=VNHV3D_6o4E',\n", " 'https://www.youtube.com/watch?v=MWVZ6JDU9V8',\n", " 'https://www.youtube.com/watch?v=v-JmdgVZqVc',\n", " 'https://www.youtube.com/watch?v=FBFHpdxsyYU',\n", " 'https://www.youtube.com/watch?v=WXmTc83l1sQ',\n", " 'https://www.youtube.com/watch?v=GfNgds5vS60',\n", " 'https://www.youtube.com/watch?v=UHRa34A6rDg',\n", " 'https://www.youtube.com/watch?v=xVU-EjnuksU',\n", " 'https://www.youtube.com/watch?v=-IXymTZZM6o',\n", " 'https://www.youtube.com/watch?v=tzvwDTPyTHQ',\n", " 'https://www.youtube.com/watch?v=a19z6EVWpQ4',\n", " 'https://www.youtube.com/watch?v=rAOvF8X_nzM',\n", " 'https://www.youtube.com/watch?v=wtvl4esdMGU',\n", " 'https://www.youtube.com/watch?v=1YPHDDH1Az0',\n", " 'https://www.youtube.com/watch?v=w7TnghsrJUo',\n", " 
'https://www.youtube.com/watch?v=qBT-6HpSrwc',\n", " 'https://www.youtube.com/watch?v=JM-xblTxLGc',\n", " 'https://www.youtube.com/watch?v=kMymsVsmETY',\n", " 'https://www.youtube.com/watch?v=K1-dfiVfbOI',\n", " 'https://www.youtube.com/watch?v=VnoiHVlR-So',\n", " 'https://www.youtube.com/watch?v=hMTzJyLiXE4',\n", " 'https://www.youtube.com/watch?v=VGQPNQ1Bhkg',\n", " 'https://www.youtube.com/watch?v=0oR4z7SsY14',\n", " 'https://www.youtube.com/watch?v=Cl4r8h_Hlak',\n", " 'https://www.youtube.com/watch?v=gJFmek-YgYo',\n", " 'https://www.youtube.com/watch?v=9VQ7Ov5W_tM',\n", " 'https://www.youtube.com/watch?v=rKwKImVrYu4',\n", " 'https://www.youtube.com/watch?v=LJwj9SHC9EU',\n", " 'https://www.youtube.com/watch?v=-08OEpFThiw',\n", " 'https://www.youtube.com/watch?v=BPJBl5y2P2g',\n", " 'https://www.youtube.com/watch?v=MvkXlg9ZbL4',\n", " 'https://www.youtube.com/watch?v=7KgIa4fX_Ng',\n", " 'https://www.youtube.com/watch?v=upNLrHtzeBI',\n", " 'https://www.youtube.com/watch?v=Y-norf1BKAs',\n", " 'https://www.youtube.com/watch?v=QMvAl_fxQSA',\n", " 'https://www.youtube.com/watch?v=3os_uXUOvcM',\n", " 'https://www.youtube.com/watch?v=ZE_aChIEELo',\n", " 'https://www.youtube.com/watch?v=iKI-8ceuR-A',\n", " 'https://www.youtube.com/watch?v=CASdYLquQII',\n", " 'https://www.youtube.com/watch?v=5cvyi1CcpYs',\n", " 'https://www.youtube.com/watch?v=NVEw-YIAy5A',\n", " 'https://www.youtube.com/watch?v=yMXn04-GQTY',\n", " 'https://www.youtube.com/watch?v=RCCzZGcGg5k',\n", " 'https://www.youtube.com/watch?v=FqMFKGsXLOE',\n", " 'https://www.youtube.com/watch?v=MVOvQb8KBm0',\n", " 'https://www.youtube.com/watch?v=ENvWfMwnJ_0',\n", " 'https://www.youtube.com/watch?v=bs7xGm293Vs',\n", " 'https://www.youtube.com/watch?v=7OvrK-U-axI',\n", " 'https://www.youtube.com/watch?v=VHeWqPqs4vo',\n", " 'https://www.youtube.com/watch?v=nVOEi9FESn8',\n", " 'https://www.youtube.com/watch?v=eikTWAvFwTE',\n", " 'https://www.youtube.com/watch?v=BU2amn3QdWk',\n", " 
'https://www.youtube.com/watch?v=GiB1pOuEvqg',\n", " 'https://www.youtube.com/watch?v=GAPN17lTJ9c',\n", " 'https://www.youtube.com/watch?v=4Ja1uZbMM8E',\n", " 'https://www.youtube.com/watch?v=F1dAfCR4rc0',\n", " 'https://www.youtube.com/watch?v=334O9xh-CQY',\n", " 'https://www.youtube.com/watch?v=KgNmw3sJ0g8',\n", " 'https://www.youtube.com/watch?v=-SQSue4-PLk',\n", " 'https://www.youtube.com/watch?v=HPE4PHlYySo']" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from pyquery import PyQuery\n", "links = []\n", "# ids of the videos to leave out (parodies and the like)\n", "skip = ('M0yuFHbhYLY','TLmMh9Qvmic', 'rY4Hwvn6GlA')\n", "\n", "# walk the first four pages of search results\n", "for page in range(1, 5):\n", "    pq = PyQuery('https://www.youtube.com/results?search_query=entrevista+sergio+massa&page=%s' % page)\n", "    pq.make_links_absolute()\n", "    links.extend([pq(a).attr('href') for a in pq('a.yt-uix-tile-link') if pq(a).attr('href').split('v=')[1] not in skip])\n", "links" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next comes the slow step: downloading the videos. Apparently YouTube does not generate automatic subtitles for videos that are too long, so I capped each download at 30 MB with `--max-filesize`, a rough stand-in for a length limit. 
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "for link in links:\n", "    !youtube-dl --write-auto-sub --sub-lang es --max-filesize 30.00m {link}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the raw material available (although subtitles may not have been found for every video), we can shamelessly copy parts of Zulko's code (slightly adapted)\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import re \n", "import os\n", "import glob\n", "import random\n", "from moviepy.editor import VideoFileClip, concatenate, TextClip, CompositeVideoClip\n", "\n", "\n", "def convert_time(timestring):\n", "    \"\"\" Converts an SRT timestamp (HH:MM:SS,mmm) into seconds \"\"\"\n", "    nums = [float(t) for t in re.findall(r'\\d+', timestring)]\n", "    return 3600 * nums[0] + 60*nums[1] + nums[2] + nums[3]/1000\n", "\n", "\n", "def get_time_texts(file):\n", "    \"\"\" Parses a subtitle file into a list of ([t_start, t_end], text) \"\"\"\n", "    with open(file) as f:\n", "        lines = f.readlines()\n", "\n", "    times_texts = []\n", "    current_times , current_text = None, \"\"\n", "    for line in lines:\n", "        times = re.findall(\"[0-9]*:[0-9]*:[0-9]*,[0-9]*\", line)\n", "        if times != []:\n", "            current_times = [convert_time(t) for t in times]\n", "        elif line == '\\n':\n", "            times_texts.append((current_times, current_text))\n", "            current_times, current_text = None, \"\"\n", "        elif current_times is not None:\n", "            current_text = current_text + line.replace(\"\\n\",\" \")\n", "    return times_texts\n", "\n", "def find_word(word, times_texts, padding=.4):\n", "    \"\"\" Finds all 'exact' (t_start, t_end) for a word \"\"\"\n", "    matches = [re.search(word, text)\n", "               for (t,text) in times_texts]\n", "    # place the match inside the subtitle's time span, proportionally to its character offsets\n", "    return [(t1 + m.start()*(t2-t1)/len(text) - padding,\n", "             t1 + m.end()*(t2-t1)/len(text) + padding)\n", "            for m,((t1,t2),text) in zip(matches, times_texts)\n", "            if (m is not None)]\n", "\n", "\n", "def get_subclips(video_path, 
cuts): \n", "    video = VideoFileClip(video_path)\n", "    return [video.subclip(start, end) for (start,end) in cuts]\n", "\n", "\n", "def get_all_subclips_for(word, pattern='*.mp4', sub_ext='.es.srt', shuffle=True):\n", "    subclips = []\n", "    for mp4 in glob.glob(pattern):\n", "        sub = os.path.splitext(mp4)[0] + sub_ext\n", "        try:\n", "            times = find_word(word, get_time_texts(sub))\n", "        except IOError:\n", "            # skip the video if it has no subtitle file\n", "            continue\n", "        cuts = get_subclips(mp4, times)\n", "        subclips.extend(cuts)\n", "    if shuffle:\n", "        random.shuffle(subclips)\n", "    return subclips\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function `get_all_subclips_for` takes the phrase to search for and returns a list of segments in which, very probably, it is spoken.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "77" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gente = get_all_subclips_for('la gente')\n", "len(gente)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The catch is that, although it is very likely Sergio Massa who says \"la gente\" in his interviews, sometimes it is the interviewer, sometimes YouTube misheard when transcribing, and sometimes the cutting code misses. So the segments that are no good have to be discarded. \n", "\n", "I decided to do it visually: I stitched them all together, overlaying on each segment its index, so that I could note the bad ones and filter them out in a second pass. 
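
To see why the cutter sometimes misses, here is a minimal sketch of the proportional estimate that `find_word` relies on: it assumes the words of a subtitle line are spoken at an even pace, so character offsets map linearly onto the time span of the line. The helper name `estimate_cut` and the sample sentence and timings are made up for illustration.

```python
import re


def estimate_cut(phrase, text, t_start, t_end, padding=0.4):
    # Hypothetical helper mirroring the arithmetic inside find_word:
    # character offsets of the match are mapped linearly onto the
    # time span covered by the subtitle line.
    match = re.search(phrase, text)
    if match is None:
        return None
    seconds_per_char = (t_end - t_start) / len(text)
    return (t_start + match.start() * seconds_per_char - padding,
            t_start + match.end() * seconds_per_char + padding)


# Made-up subtitle line: 40 characters shown from 10.0 s to 14.0 s,
# that is, 0.1 s per character.
line = 'yo me comprometo ante la gente de verdad'
print(estimate_cut('la gente', line, 10.0, 14.0))
```

With these numbers the estimate comes out at roughly 11.8 s to 13.4 s. Real speech is not evenly paced, and a subtitle line can span several seconds, so the phrase can fall partly or wholly outside that window.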
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def make_preview(subclips):\n", "    subclips_ = []\n", "    for (i, clip) in enumerate(subclips):\n", "        # overlay the segment index so that bad cuts can be noted by number\n", "        txt_clip = TextClip(str(i),fontsize=70, color='white')\n", "        txt_clip = txt_clip.set_pos('center').set_duration(clip.duration)\n", "        clip = CompositeVideoClip([clip, txt_clip])\n", "        subclips_.append(clip)\n", "\n", "    final = concatenate(subclips_, method='compose')\n", "    final.write_videofile('preview.webm', codec='libvpx', fps=24) \n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[MoviePy] >>>> Building video preview.webm\n", "[MoviePy] Writing audio in previewTEMP_MPY_wvf_snd.ogg\n", "[MoviePy] Done.\n", "[MoviePy] Writing video preview.webm\n", "[MoviePy] Done.\n", "[MoviePy] >>>> Video ready: preview.webm \n", "\n" ] } ], "source": [ "make_preview(gente)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [result](http://youtu.be/LU76nVlBqdE) let me do the sifting" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ignore = [2, 3, 8, 12, 17, 19, 25, 28, 32, 36, 38, 40, 41, 44, 49, 55, 56, 61, 62, 66, 73, 74]\n", "subclips_cleaned = [i for j, i in enumerate(gente) if j not in ignore]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although I know nothing about video editing, and because I truly believe he is a disgraceful demagogue who should not preside over even a neighborhood council, I wanted to add a finishing touch, with a short phrase" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import numpy as np\n", "from moviepy.video.tools.segmenting import findObjects\n", "\n", "def arrive(screenpos,i,nletters):\n", "    v = np.array([-1,0])\n", "    d = lambda t : max(0, 3-3*t)\n", "    return lambda t: 
screenpos-400*v*d(t-0.2*i)\n", "\n", "screensize = (640,360)\n", "import codecs  # str has no .decode() on Python 3, so the rot13 text is decoded via codecs\n", "txtClip = TextClip(codecs.decode('Yn tragr ab rf obyhqn', 'rot_13'), color='white', font=\"Amiri-Bold\", kerning=5, fontsize=50)\n", "cvc = CompositeVideoClip( [txtClip.set_pos('center')],\n", "                        size=screensize)\n", "\n", "letters = findObjects(cvc) # a list of ImageClips\n", "\n", "def moveLetters(letters, funcpos):\n", "    return [ letter.set_pos(funcpos(letter.screenpos,i,len(letters)))\n", "              for i,letter in enumerate(letters)]\n", "\n", "ending = CompositeVideoClip(moveLetters(letters, arrive), size=screensize).subclip(0, 10)\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[MoviePy] >>>> Building video massa_lagente_final.webm\n", "[MoviePy] Writing audio in massa_lagente_finalTEMP_MPY_wvf_snd.ogg\n", "[MoviePy] Done.\n", "[MoviePy] Writing video massa_lagente_final.webm\n", "[MoviePy] Done.\n", "[MoviePy] >>>> Video ready: massa_lagente_final.webm \n", "\n" ] } ], "source": [ "# give it one more little shuffle\n", "random.shuffle(subclips_cleaned)\n", "subclips_cleaned.append(ending)\n", "# concatenate and write the final video, as make_preview does but without the index overlay\n", "final = concatenate(subclips_cleaned, method='compose')\n", "final.write_videofile('massa_lagente_final.webm', codec='libvpx', fps=24)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And this is the result:\n", "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.0" } }, "nbformat": 4, "nbformat_minor": 0 }