Wired AI • 107 days ago

The Internet's Best Archiving Tool Is at Risk

IMP

8/10

Key Summary

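Major news organizations such as The New York Times and USA Today, along with Reddit, are blocking the Wayback Machine, the Internet Archive's web page preservation tool, over concerns about AI data scraping. In response, advocacy groups including the Electronic Frontier Foundation (EFF) and more than 100 journalists have pushed back in an open letter, arguing that the Wayback Machine is essential to fact-checking and to preserving journalism in the public interest, and that it should not be blocked.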

Translated Text

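This month, USA Today published an excellent report revealing how US Immigration and Customs Enforcement (ICE) delayed disclosing key information about the impact of its detention policies. The authors used the Internet Archive's Wayback Machine to compile and analyze ICE detention statistics and to track how the agency had changed under the Trump administration.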

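The story is one of countless examples of how the Wayback Machine, which crawls and preserves web pages, has helped preserve information for the public good. Wayback Machine director Mark Graham says the situation is "a little ironic." USA Today Co., the publishing group formerly known as Gannett that runs its namesake paper and more than 200 additional media outlets, bars the Wayback Machine from archiving its content. "They're able to pull together their story research because the Wayback Machine exists. At the same time, they're blocking access," Graham says.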

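Other major news organizations, including The New York Times, have also recently moved to restrict the Wayback Machine from archiving their stories. According to an analysis by the AI-detection startup Originality AI, 23 major news sites currently block 'ia_archiverbot', the web crawler the Internet Archive commonly uses for the Wayback project. So does the social platform Reddit.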

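Other outlets are limiting the project in different ways. The Guardian does not block the crawler, but it excludes its content from the Internet Archive API and filters its articles out of the Wayback Machine interface, making it harder for the general public to access archived versions of its articles. USA Today Co. spokesperson Lark-Marie Anton emphasized that "this effort is not about specifically blocking the Internet Archive" but is part of the company's broader effort to block all scraping bots. Robert Hahn, the Guardian's director of business affairs and licensing, says the Guardian has been in conversation with the Internet Archive over concerns about potential misuse by AI companies of content sets crawled for preservation purposes.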

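Now, individual reporters are pushing back against this trend. This week, advocacy organizations including the Electronic Frontier Foundation and Fight for the Future rallied journalists around the Wayback Machine's cause. The coalition collected more than 100 signatures from working journalists who recognize the tool's value and delivered a letter of support to the Internet Archive. Signatories range from television host Rachel Maddow to independent reporters such as Spitfire News' Kat Tenbarge and User Mag's Taylor Lorenz.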

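"In previous generations, journalists would turn to the physical archives of a local newspaper or of a local public library to access historical reporting and follow the threads of the present back into history," the letter reads. "With many newspapers closed, and no clear path for local public libraries to preserve digital-only reporting, the work of safeguarding journalism's record increasingly falls to the Internet Archive."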

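Laura Flynn, a signatory and supervising podcast producer at The Intercept, says the Internet Archive has been an "essential tool" throughout her career, playing an important role in fact-checking and surfacing audio clips. Another signatory, Chicago Reader writer Micco Caporale, says the Wayback Machine helps when writing about older bands and cultural figures by providing access to old fan sites that would otherwise be lost to time.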

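Caporale says the tool has also been useful in their role as a union organizer. "I've also been using the Wayback Machine a ton in my union organizing work to find old job listings so we know what the company claimed to hire people for vs. what duties they actually assigned or to see how different positions have been retooled at different points," Caporale says. "These posts also help us keep track of pay fluctuations across the organization over time."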

Original Text (English)

This month, USA Today published an excellent report that revealed how US Immigration and Customs Enforcement delayed disclosing key information about the impacts of its detainment policies. The authors used the Internet Archive’s Wayback Machine to compile and analyze detention statistics from ICE and track how the agency had changed under the Trump administration.

The story is one of countless examples of how the Wayback Machine, which crawls and preserves web pages, has helped preserve information for the public good. It was also, Wayback Machine director Mark Graham says, “a little ironic.” USA Today Co., the publishing conglomerate formerly known as Gannett that runs both its namesake paper and over 200 additional media outlets, bars the Wayback Machine from archiving its work. “They're able to pull together their story research because the Wayback Machine exists. At the same time, they're blocking access,” Graham says.

A number of other major journalism organizations have also recently moved to restrict the Wayback Machine from archiving their stories, including The New York Times. According to analysis by the artificial-intelligence-detection startup Originality AI, 23 major news sites are currently blocking ia_archiverbot, the web crawler commonly used by the Internet Archive for the Wayback project. The social platform Reddit is blocking it too.

Other outlets are limiting the project in different ways: The Guardian does not block the crawler, but it excludes its content from the Internet Archive API and filters out articles from the Wayback Machine interface, which makes it harder for regular people to access archived versions of its articles. USA Today Co. spokesperson Lark-Marie Anton emphasized that “this effort is not about specifically blocking the Internet Archive” but is instead part of the company’s broader efforts to block all scraping bots.
Robert Hahn, the Guardian’s director of business affairs and licensing, says that it has been in conversation with the Archive over “concerns over potential misuse by AI companies of content sets crawled for preservation purposes.”

Now, individual reporters are pushing back on this trend. This week, advocacy organizations including the Electronic Frontier Foundation and Fight for the Future rallied journalists around the Wayback Machine’s cause. The coalition collected more than 100 signatures from working journalists who recognize the tool’s value and presented a letter of support to the Internet Archive. Signatories range from television mainstay Rachel Maddow to independent reporters like Spitfire News’ Kat Tenbarge and User Mag’s Taylor Lorenz.

“In previous generations, journalists would turn to the physical archives of a local newspaper or of a local public library to access historical reporting and follow the threads of the present back into history,” the letter reads. “With many newspapers closed, and no clear path for local public libraries to preserve digital-only reporting, the work of safeguarding journalism’s record increasingly falls to the Internet Archive.”

Laura Flynn, a signatory and supervising podcast producer at The Intercept, says that the Internet Archive has been an “essential tool” throughout her career, playing an instrumental role in fact-checking and surfacing audio clips. Another signatory, Chicago Reader writer Micco Caporale, says the Wayback Machine helps when writing about older bands and cultural figures by providing access to old fan sites that would otherwise be lost to time.

Caporale says the tool has also been useful in their role as a union organizer. “I've also been using the Wayback Machine a ton in my union organizing work to find old job listings so we know what the company claimed to hire people for vs. what duties they actually assigned or to see how different positions have been retooled at different points,” Caporale says. “These posts also help us keep track of pay fluctuations across the organization over time.”

Other publishers have justified their decision to block the Wayback Machine by pointing to concerns about how tech companies may use the Internet Archive’s data to train artificial intelligence models. New York Times spokesperson Graham James says that “the issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us.” (The Times declined to clarify whether this was something that was actually happening or rather a hypothetical concern.) Reddit has previously said that concerns about AI also led it to block the Wayback Machine crawler.

There’s an ongoing war between publishers and AI companies over the legality of AI tools training on their content without permission; many of the more than 100 AI copyright lawsuits in the United States focus on this issue. Tech companies use content from all over the internet, and because the Wayback Machine offers such an extensive trove of material, it is considered a particularly appealing data source.

The Internet Archive has been around for 30 years and has archived over a trillion web pages. The nonprofit has weathered several major legal fights since 2020. Most recently, it settled with a group of major music publishers that had been seeking damages of up to $700 million over the Archive’s Great 78s project, which archived vintage recordings. Although there’s no major financial penalty at stake right now, the growing trend of media outlets blocking the Wayback Machine still poses a serious threat to its mission.
There is no widely available public tool comparable to the Wayback Machine, and if it continues to lose access to major news sources, its preservation efforts could erode to the point where early digital records of history become much harder to access, or are even lost altogether.

Notably, the tool has been used in reporting on The New York Times: In 2016, the paper came under scrutiny for editorial changes it made to an article on US senator and then-presidential candidate Bernie Sanders of Vermont. The revisions were first tracked using the Wayback Machine. If a similar situation arose today, watchdog media reporters might struggle to track older versions of Times articles in the same way.

A kneecapped Wayback Machine isn’t just bad news for accountability journalism; it will also be a blow to the legal system, as pages archived by the tool are frequently cited as evidence in litigation across the United States.

The Internet Archive’s Mark Graham hasn’t given up hope that some of the publishers currently blocking its crawlers may eventually change course. He says that the nonprofit is “in conversation” with the Times and other outlets. But for now, Graham says, “there's no question that the general locking-down of more and more of the public web is impacting society’s ability to understand what's going on in our world.”